# Natural language processing

## Introduction to Transformers Architecture
Currently, the most popular models for natural language processing (NLP) use the Transformer architecture. Several libraries implement this architecture, but in the context of NLP the Hugging Face `transformers` library is the most commonly used.

Apart from the source code itself, this library contains a number of other elements. Among the most important of these are:

- [models](https://huggingface.co/models) - a huge and growing number of ready-made models that we can use to solve many problems in NLP (but also in speech recognition or image processing),
- [datasets](https://huggingface.co/datasets) - a very large catalogue of useful datasets that we can easily use to train our own NLP models (and other models).

## Environment preparation - How to start with Google Colab

Training NLP models requires hardware accelerators that speed up the training of neural networks. If our computer is not equipped with a GPU, we can use the Google Colab environment.

In this environment, we can choose between a GPU and a TPU accelerator. Let us check whether we have access to an environment equipped with an NVIDIA accelerator by executing the following command:


In [None]:
!nvidia-smi

If the accelerator is unavailable (the command ended with an error), we change the runtime by selecting "Runtime" -> "Change runtime type" -> GPU from the menu.
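The same check can also be made from Python, which is handy when a later cell should behave differently depending on the hardware. The sketch below is only an illustration: the helper name `nvidia_gpu_visible` is hypothetical, and it assumes that a successful run of `nvidia-smi` means a GPU is usable.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        # The tool is not on PATH, so we cannot query an NVIDIA driver.
        return False
    try:
        result = subprocess.run(
            [executable], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA GPU visible:", nvidia_gpu_visible())
```

If this prints `False` in Colab, the runtime type most likely needs to be changed as described above.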

We will then install all the necessary libraries. In addition to the `transformers` library itself, we also install `datasets`, a library for managing datasets, `evaluate`, a library that defines many metrics used in AI algorithms, and additional tools such as `sacremoses` and `sentencepiece`.

In [None]:
!pip install transformers sacremoses datasets evaluate sentencepiece

With the necessary libraries installed, we can use all the models and datasets registered in the catalogue.

A typical way of using the available models is:

- using a ready-made model that performs a specific task, e.g. [sentiment analysis](https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis) - a model of this kind does not need to be trained, it is enough to run it in order to obtain a classification result (this can be seen in the demo at the indicated link),
- using a base model that is then trained for a specific task; an example of such a model is [HerBERT base](https://huggingface.co/allegro/herbert-base-cased), which was trained as a masked language model. To use it for a specific task, we need to add a "classification head" to it and fine-tune it on our own dataset.

Both kinds of models can be loaded using a common interface, but it is best to use one of the specialised classes tailored to the task at hand. We will start by loading the BERT base model, one of the most popular models for English. We will use it to guess missing words in a text. To do this, we load it with the `AutoModelForMaskedLM` class.
Run the following code to load the model:

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

## Connecting Google Drive
The final, optional step of the preparation is to attach your own Google Drive to the Colab environment. This makes it possible to save models to an "external" drive while they are being trained. If Google Colab interrupts the training process, the files saved up to that point will not be lost, and training can be resumed from the partially trained model.


To do this, we mount the Google Drive in Colab. This requires authorisation of the Colab tool in Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Once the drive is mounted, we have access to the entire contents of Google Drive. When specifying where to save data during training, use a path that starts with `/content/gdrive` and points to a subdirectory within our drive space. The full path could be `/content/gdrive/MyDrive/output`. It is a good idea to check that data can be written to the drive before starting the training.
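A write check of this kind could look roughly as follows. This is a minimal sketch: the helper name `check_writable` and the probe file name are hypothetical, and a temporary directory stands in for the Drive path so that the code runs outside Colab as well.

```python
from pathlib import Path
import tempfile


def check_writable(output_dir: str) -> bool:
    """Create output_dir if needed, then write, read back and delete a probe file."""
    directory = Path(output_dir)
    try:
        directory.mkdir(parents=True, exist_ok=True)
        probe = directory / "write_test.txt"
        probe.write_text("ok", encoding="utf-8")
        readable = probe.read_text(encoding="utf-8") == "ok"
        probe.unlink()
        return readable
    except OSError:
        return False


# In Colab the argument would be e.g. "/content/gdrive/MyDrive/output";
# a temporary directory is used here so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    print(check_writable(str(Path(tmp) / "output")))
```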

## Text tokenization
Loading the model itself, however, is not enough to start using it. We need a mechanism for converting text (a string of characters) into a sequence of tokens that belong to a specific dictionary (vocabulary). This dictionary is determined (selected algorithmically) before the actual training of the neural network. Although it can be extended later (further training on the training data also yields representations for the added tokens), the dictionary is usually used in the form defined before the network was trained. It is therefore important to specify the correct dictionary for the tokenizer that splits the text.

The library has an `AutoTokenizer` class that accepts the model name, which allows the dictionary corresponding to the selected neural network model to be loaded automatically. Remember, however, that if you are using two models, each will most likely have a different dictionary, so each needs its own tokenizer instance.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

The tokenizer uses a fixed-size dictionary. This means, of course, that not all words occurring in a text will be included in it; words missing from the dictionary are split into smaller pieces. Furthermore, if we use the tokenizer to split text in a language other than the one for which it was created, such text will be split into a larger number of tokens.
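To make the effect of a fixed dictionary concrete, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece. It is not the library's implementation: the function `wordpiece` and the vocabulary `toy_vocab` are hypothetical, and real tokenizers also normalise text and handle punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-tokens using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real bert-base-cased one.
toy_vocab = {"brown", "fox", "ja", "##ź", "##ń", "##ie"}

print(wordpiece("brown", toy_vocab))
print(wordpiece("brownie", toy_vocab))
print(wordpiece("jaźń", toy_vocab))
print(wordpiece("dog", toy_vocab))
```

A word that is in the vocabulary stays whole, a word that is only partly covered is broken into several pieces, and a word with no matching pieces falls back to the unknown token.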

In [None]:
sentence1 = tokenizer.encode(
    "The quick brown fox jumps over the lazy dog.", return_tensors="pt"
)
print(sentence1)
print(sentence1.shape)

sentence2 = tokenizer.encode("Zażółć gęślą jaźń.", return_tensors="pt")
print(sentence2)
print(sentence2.shape)

Using the tokenizer for English to split a sentence in another language (here Polish), we see that we get a much larger number of tokens. To see how the tokenizer has split the text, we can call `convert_ids_to_tokens`:

In [None]:
print("|".join(tokenizer.convert_ids_to_tokens(list(sentence1[0]))))
print("|".join(tokenizer.convert_ids_to_tokens(list(sentence2[0]))))

We can see that for English, all the words in the sentence have been converted into single tokens. For the Polish sentence, which contains many diacritical marks, the situation is completely different: each such character has been extracted into a separate sub-token. Sub-tokens are marked by two hash signs (`##`) preceding them. These indicate that the sub-token must be glued to the preceding token to obtain the correct character string.
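The gluing rule described above can be sketched in a few lines of plain Python. The helper `merge_subtokens` is hypothetical and the token list is written by hand for illustration; it is not actual tokenizer output.

```python
def merge_subtokens(tokens):
    """Glue sub-tokens prefixed with '##' onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Drop the marker and append the piece to the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Hand-written token list for illustration, not actual tokenizer output.
tokens = ["The", "quick", "brown", "##ie", "jump", "##s", "."]
print(merge_subtokens(tokens))
```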

## Exercise 1

Use the tokenizer for `xlm-roberta-large` to tokenize the same sentences. What conclusions can be drawn by looking at how tokenisation is done using different dictionaries?

In [None]:
# your code

Besides the tokens present in the original text, the tokenization result also contains the additional special tokens [CLS] and [SEP] (or other special tokens, depending on the dictionary used). These have a special meaning and can be used to perform specific functions related to text analysis. For example, the representation of the [CLS] token is used in sentence classification tasks. The [SEP] token, on the other hand, is used to separate sentences in tasks that require two sentences as input (e.g. determining how similar the sentences are to each other).
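As an illustration of how these special tokens frame a two-sentence input, the sketch below builds the layout by hand, assuming the BERT convention `[CLS] A [SEP] B [SEP]` with segment ids that tell the two sentences apart. The helper `build_pair_input` is hypothetical; in practice the tokenizer produces this layout itself, and other models use different special tokens.

```python
def build_pair_input(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Return (tokens, segment_ids) for a BERT-style sentence pair."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segments = build_pair_input(["The", "dog", "sleeps"], ["A", "dog", "rests"])
print(tokens)
print(segments)
```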

## Language modelling

Models pretrained in the self-supervised learning (SSL) regime do not have special capabilities for solving specific natural language processing tasks, such as answering questions or classifying text (except for very large models such as GPT-3). However, they can be used to determine the probability of words in a text, and thus to test how much knowledge of the language, or general knowledge of the world, a specific model has.

In order to check how the model performs in these tasks, we can perform inference on the input data, in which some words will be replaced by special masking symbols used during the pre-training of the model.

Keep in mind that different models may use different special sequences during pretraining. For example, BERT uses the sequence `[MASK]`. We can check the form of the mask token and its identifier in [the tokenizer configuration file](https://huggingface.co/bert-base-cased/raw/main/tokenizer.json) distributed with the model.
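Because the mask token differs between models, it is safer to read it from the tokenizer than to hard-code it. The sketch below assumes the tokenizer object exposes a `mask_token` attribute, as Hugging Face tokenizers do. The helper `mask_word` is hypothetical, and a simple stand-in object replaces the real tokenizer so that the example runs without downloading a model.

```python
from types import SimpleNamespace


def mask_word(sentence, word, tokenizer):
    """Replace the first occurrence of `word` with the tokenizer's mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, tokenizer.mask_token, 1)


# Stand-in object so the sketch runs without downloading a model; in the
# notebook, the `tokenizer` loaded with AutoTokenizer would be passed instead.
stub_tokenizer = SimpleNamespace(mask_token="[MASK]")
masked = mask_word(
    "The quick brown fox jumps over the lazy dog.", "fox", stub_tokenizer
)
print(masked)
```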

As a first step, we will try to fill in the missing word in the English sentence.

In [None]:
sentence_en = tokenizer.encode(
    "The quick brown [MASK] jumps over the lazy dog.", return_tensors="pt"
)
print("|".join(tokenizer.convert_ids_to_tokens(list(sentence_en[0]))))
target = model(sentence_en)
print(target.logits[0][4])

Since the `[CLS]` token is prepended to the sentence during tokenization, the masked word is at position 4. The expression `target.logits[0][4]` yields a tensor with unnormalised scores (logits) for every token in the dictionary, computed from the model parameters. We can select the tokens with the highest scores using `torch.topk`:

In [None]:
import torch

top = torch.topk(target.logits[0][4], 5)
top

We obtained two tensors: `values`, containing the (unnormalised) components of the network's output vector, and `indices`, containing the indices of these components. From these, we can display the tokens that the model considers the most likely completions of the masked position:

In [None]:
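import matplotlib.pyplot as plt

# Map the five highest-scoring dictionary indices back to token strings.
words = tokenizer.convert_ids_to_tokens(top.indices)

# Plot the raw (unnormalised) logits of these tokens.
plt.bar(words, top.values.detach().numpy())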

As expected, the most likely replacement for the missing word is `dog`. The second token, `##ie`, may be a little surprising, but when it is glued to the preceding text we get the sentence 'The quick brownie jumps over the lazy dog', which also seems sensible.
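The values plotted above are unnormalised logits, not probabilities. A softmax turns such scores into a probability distribution. The sketch below shows the computation in plain Python on made-up numbers: the helpers `softmax` and `top_k` and the logits are hypothetical, and the resulting probabilities are relative to these four candidates only, not to the whole dictionary.

```python
import math


def softmax(logits):
    """Convert unnormalised scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, labels, k):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits and candidate tokens for illustration only.
labels = ["dog", "##ie", "cat", "horse"]
logits = [9.1, 8.4, 7.9, 6.0]

probabilities = softmax(logits)
for label, p in top_k(probabilities, labels, 3):
    print(f"{label}: {p:.3f}")
```

In the notebook, the corresponding step would be to apply `torch.softmax(target.logits[0][4], dim=-1)` over the full dictionary before calling `torch.topk`.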

## Exercise 2

Using the XLM-RoBERTa model (`xlm-roberta-large`, as in Exercise 1), propose sentences with one missing word, verifying the ability of this model to:

- capture meaning in its semantic context,
- account for long-distance relationships in a text,
- represent knowledge about the world.

For each problem, come up with 3 test sentences and display the prediction for the 5 most likely words.

Please try to come up with examples in which the masked item appears in different positions within the sentence.

You can use the `plot_words` function below to help you display the results. Also, verify which masking token this model uses, and remember to load the XLM-RoBERTa model (`xlm-roberta-large`) together with its own tokenizer.

Evaluate the model's capabilities for the tasks indicated.

In [None]:
def plot_words(sentence, word_model, word_tokenizer, mask="[MASK]"):
    sentence = word_tokenizer.encode(sentence, return_tensors="pt")
    tokens = word_tokenizer.convert_ids_to_tokens(list(sentence[0]))
    print("|".join(tokens))
    target = word_model(sentence)
    top = torch.topk(target.logits[0][tokens.index(mask)], 5)
    words = word_tokenizer.convert_ids_to_tokens(top.indices)
    plt.xticks(rotation=45)
    plt.bar(words, top.values.detach().numpy())
    plt.show()


# your code