# Transformer-based Natural Language Processing
## Introduction to 🤗 Transformers
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/texttechnologylab/WiSe22-M-PNLR-PR-TbNLP/blob/master/transformers.ipynb)

### Installing necessary packages (e.g. if on Colab)

In [None]:
%pip install torch datasets tokenizers transformers

### Premise

This notebook will guide you through the process of fine-tuning a transformer model using the 🤗 Transformers library.

First, we need to select a task and a suitable dataset. Here, we will use the [Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) task as an example. A suitable dataset is the RTE subset of SuperGLUE, which is listed in the [TensorFlow Datasets catalog](https://www.tensorflow.org/datasets/catalog/super_glue#super_gluerte). While that entry is a TensorFlow dataset, we are lucky, as it is mirrored on the 🤗 Datasets Hub.

Thus, we can load the Recognizing Textual Entailment subset of the SuperGLUE dataset as follows:

In [1]:
from datasets import load_dataset

rte_dataset = load_dataset("super_glue", "rte")
print(rte_dataset)

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'idx', 'label'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'idx', 'label'],
        num_rows: 277
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'idx', 'label'],
        num_rows: 3000
    })
})


As we can see above, the dataset is already split into train, validation (development) and test splits.
Each row contains four fields, but we only need to focus on the premise, the hypothesis and the label.

The textual entailment task requires us to recognize, given two text fragments, whether the meaning of one text is entailed by (*can be inferred from*) the other text.

In this example, we will use a BERT-family model. With BERT, we formulate the entailment task as a simple classification task by concatenating the premise and hypothesis and training our classifier on the first token (the `[CLS]` token) of the input string:

```
"[CLS] This is the premise, i.e. a text that means something. [SEP] This is the hypothesis, i.e. what we may be able to infer [SEP]"
```
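As a minimal sketch of this input layout, the snippet below builds the string by hand. The helper `format_pair` and the constants `CLS` and `SEP` are hypothetical names used only for illustration; in practice a BERT tokenizer typically adds `[CLS]` and the trailing `[SEP]` itself when special tokens are enabled.

```python
# Illustrative only: shows the layout of a premise/hypothesis pair for BERT.
CLS, SEP = "[CLS]", "[SEP]"


def format_pair(premise: str, hypothesis: str) -> str:
    """Join premise and hypothesis into a single BERT-style input string."""
    return f"{CLS} {premise} {SEP} {hypothesis} {SEP}"


example = format_pair(
    "A text that means something.",
    "What we may be able to infer.",
)
print(example)
```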

But let's first take a look at the dataset.

In [2]:
print(rte_dataset['train'][:2])
print(rte_dataset['validation'][:2])
print(rte_dataset['test'][:2])

{'premise': ['No Weapons of Mass Destruction Found in Iraq Yet.', 'A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.'], 'hypothesis': ['Weapons of Mass Destruction Found in Iraq.', 'Pope Benedict XVI is the new leader of the Roman Catholic Church.'], 'idx': [0, 1], 'label': [1, 0]}
{'premise': ['Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.', 'Yet, we now are discovering that antibiotics are losing their effectiveness against illness. Disease-causing bacteria are mutating faster than we can come up with new antibiotics to fight the new variations.'], 'hypothesis': ['Christopher Reeve had an accident.', 'Bacteria is winning the war against antibiotics.'], 'idx': [0, 1], 'label': [1, 0]}
{'premise': ["Mangla was summoned after Madhumita's sister Nidhi Shukla, who w

As we see above, the `test` split contains **unlabeled** samples, but we can ignore that for now.
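To read the integer `label` values printed above, a small sketch like the following may help. It assumes the label order `0` = entailment and `1` = not entailment, which matches the two training samples shown but is worth checking against `rte_dataset['train'].features['label'].names`. `ID2LABEL` and `describe` are hypothetical names.

```python
# Assumed label order; verify it against the dataset's own label names.
ID2LABEL = {0: "entailment", 1: "not_entailment"}


def describe(premise: str, hypothesis: str, label: int) -> str:
    """Render one sample as a readable line; unknown ids count as unlabeled."""
    relation = ID2LABEL.get(label, "unlabeled")
    return f"{relation}: {hypothesis!r} given {premise!r}"


print(describe(
    "No Weapons of Mass Destruction Found in Iraq Yet.",
    "Weapons of Mass Destruction Found in Iraq.",
    1,
))
```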

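Let's construct the sentences as we outlined above. We only insert the `[SEP]` between premise and hypothesis here; the leading `[CLS]` and the final `[SEP]` are added later by the tokenizer.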

In [3]:
prepared_dataset = rte_dataset.map(
    lambda sample: {'input': f"{sample['premise']} [SEP] {sample['hypothesis']}"},
    remove_columns=['premise', 'hypothesis', 'idx']
)
print(prepared_dataset['train'][:2])

{'label': [1, 0], 'input': ['No Weapons of Mass Destruction Found in Iraq Yet. [SEP] Weapons of Mass Destruction Found in Iraq.', 'A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI. [SEP] Pope Benedict XVI is the new leader of the Roman Catholic Church.']}


### Loading Pre-Trained Models

Now we need to load a pre-trained BERT model. Rather than the bare `AutoModel`, you should use one of the task-specific `AutoModelFor...` classes described in the [AutoClass tutorial](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial).

#### Load and instantiate a model for the textual entailment task

In [None]:
from transformers import AutoConfig, AutoTokenizer  #, AutoModelFor?

config = ...  # TODO
tokenizer = ...  # TODO
model = ...  # TODO
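One way to see what a sequence classification head produces, without downloading anything, is to instantiate a tiny, randomly initialised BERT from a config. This is only a sketch: the sizes below are arbitrary, and `tiny_config`, `tiny_model` and `dummy_input_ids` are hypothetical names. For the exercise itself you would instead load pre-trained weights with `from_pretrained` and a checkpoint of your choice.

```python
import torch
from transformers import AutoModelForSequenceClassification, BertConfig

# Tiny, untrained configuration (illustrative sizes, not a real checkpoint).
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,  # two classes: entailment / not entailment
)
tiny_model = AutoModelForSequenceClassification.from_config(tiny_config)

# A dummy batch of 3 sequences with 10 token ids each.
dummy_input_ids = torch.randint(0, tiny_config.vocab_size, (3, 10))
with torch.no_grad():
    logits = tiny_model(input_ids=dummy_input_ids).logits

print(logits.shape)  # one row per sequence, one column per label
```

The model returns one logit per label for every input sequence, which is what both the `Trainer` and a manual loop consume.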

Now we could use the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for easy training. You can follow the tutorial from [the official documentation](https://huggingface.co/docs/transformers/quicktour#trainer-a-pytorch-optimized-training-loop).

#### Write the training procedure

In [None]:
from transformers import Trainer

trainer = Trainer(...)

# TODO

trainer.train()

# TODO: evaluation
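For the evaluation step, `Trainer` accepts a `compute_metrics` callable that receives the predictions and the gold labels. The following is a minimal sketch of such a function computing plain accuracy with NumPy; the toy logits and labels are made up for illustration, and accuracy is only one possible metric choice.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per sample
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with made-up values: three of four predictions are correct.
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Such a function could be passed as `Trainer(..., compute_metrics=compute_metrics)` so that `trainer.evaluate()` reports the metric alongside the loss.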

### Custom Training

While using the `Trainer` class is very convenient, a regular training loop is often easier to adapt if you have to run custom procedures during training.

We can re-use code from the datasets notebook.

In [None]:
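def encode_pt(sample: dict):
    # Called once per sample (the map below is not batched),
    # so sample['input'] is a single string.
    return tokenizer(
        sample['input'],
        add_special_tokens=True,  # adds [CLS] at the start and [SEP] at the end
        return_token_type_ids=False,
        return_attention_mask=False,
        padding=False,  # padding is deferred to the collate function
        truncation=True,
    )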


pt_dataset = prepared_dataset.map(encode_pt)
print(pt_dataset['train'][:2])

However, in a manual training loop, we will want to make use of PyTorch's DataLoaders, which require some extra care to collate batches with samples of different lengths.

#### Implement `custom_collate`:
- Pad and stack the `input_ids` in a tensor.
- Stack the labels in a tensor of type `long`.

In [None]:
import torch
from torch.nn.utils.rnn import pad_sequence


def custom_collate(batch: list[dict]) -> tuple[torch.Tensor, torch.Tensor]:
    input_ids = ...
    label = ...
    return input_ids, label
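One possible shape for such a collate function is sketched below, using a toy batch instead of the real dataset. It assumes a padding id of `0` (in the notebook, `tokenizer.pad_token_id` would be the safer choice); `PAD_TOKEN_ID`, `collate_sketch` and `toy_batch` are hypothetical names. Since the encoding step above does not return an attention mask, the sketch also shows how one could be derived from the padded ids.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_TOKEN_ID = 0  # assumption; prefer tokenizer.pad_token_id in the notebook


def collate_sketch(batch: list) -> tuple:
    """Pad variable-length input_ids to a rectangle and stack the labels."""
    input_ids = pad_sequence(
        [torch.tensor(sample["input_ids"], dtype=torch.long) for sample in batch],
        batch_first=True,
        padding_value=PAD_TOKEN_ID,
    )
    labels = torch.tensor([sample["label"] for sample in batch], dtype=torch.long)
    return input_ids, labels


# Toy samples of different lengths (token ids are made up).
toy_batch = [
    {"input_ids": [101, 2023, 102], "label": 1},
    {"input_ids": [101, 2023, 2003, 1037, 102], "label": 0},
]
input_ids, labels = collate_sketch(toy_batch)

# Mask that is 1 for real tokens and 0 for padding positions.
attention_mask = (input_ids != PAD_TOKEN_ID).long()
print(input_ids.shape, labels, attention_mask)
```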

#### Write the training and evaluation loops

In [None]:
from tqdm.notebook import tqdm, trange
from torch.utils.data import DataLoader
from torch.optim import *  # TODO
from torch.nn import *  # TODO

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

criterion = ...  # TODO
optimizer = ...  # TODO
num_epochs = ...  # TODO
batch_size = ...  # TODO

train_dataloader = DataLoader(pt_dataset['train'], batch_size=batch_size, shuffle=True, collate_fn=custom_collate)
dev_dataloader = DataLoader(pt_dataset['validation'], batch_size=batch_size, shuffle=False, collate_fn=custom_collate)

model.to(device)
for epoch in trange(num_epochs, position=0):
    model.train()
    for batch in tqdm(train_dataloader, position=1, leave=False):
        ...  # TODO

    model.eval()
    for batch in tqdm(dev_dataloader, position=1, leave=False):
        ...  # TODO
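The skeleton above could be filled in along the following lines. This is a sketch under stated assumptions, not the reference solution: it trains a tiny, randomly initialised BERT on made-up token ids so that it runs without downloads, and it picks `AdamW` and `CrossEntropyLoss` as one reasonable combination. All `toy_*` names, the padding id and the hyperparameters are hypothetical.

```python
import torch
from torch.nn import CrossEntropyLoss
from torch.nn.utils.rnn import pad_sequence
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, BertConfig

torch.manual_seed(0)
PAD_ID = 0  # assumed padding id for the toy vocabulary


def toy_collate(batch):
    # Pad the token ids of each sample and stack the labels.
    ids = pad_sequence(
        [torch.tensor(s["input_ids"]) for s in batch],
        batch_first=True,
        padding_value=PAD_ID,
    )
    labels = torch.tensor([s["label"] for s in batch], dtype=torch.long)
    return ids, labels


# 16 made-up samples of varying length with alternating labels.
toy_data = [
    {"input_ids": [1] + [5 + (i + j) % 20 for j in range(3 + i % 4)] + [2],
     "label": i % 2}
    for i in range(16)
]
toy_model = AutoModelForSequenceClassification.from_config(
    BertConfig(vocab_size=32, hidden_size=16, num_hidden_layers=1,
               num_attention_heads=2, intermediate_size=32, num_labels=2)
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
criterion = CrossEntropyLoss()
optimizer = AdamW(toy_model.parameters(), lr=1e-3)
train_loader = DataLoader(toy_data, batch_size=4, shuffle=True, collate_fn=toy_collate)
eval_loader = DataLoader(toy_data, batch_size=4, shuffle=False, collate_fn=toy_collate)

toy_model.to(device)
for epoch in range(2):
    toy_model.train()
    for input_ids, labels in train_loader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        attention_mask = (input_ids != PAD_ID).long()
        optimizer.zero_grad()
        logits = toy_model(input_ids=input_ids, attention_mask=attention_mask).logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

    toy_model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for input_ids, labels in eval_loader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            attention_mask = (input_ids != PAD_ID).long()
            logits = toy_model(input_ids=input_ids, attention_mask=attention_mask).logits
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.size(0)
    accuracy = correct / total
    print(f"epoch {epoch}: last train loss {loss.item():.4f}, accuracy {accuracy:.2f}")
```

With the real data, the toy pieces would be replaced by the notebook's `model`, `train_dataloader` and `dev_dataloader`, and the progress bars from `tqdm` can wrap the loaders as in the skeleton.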