# Transformer-based Natural Language Processing
## Introduction to 🤗 Transformers
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/texttechnologylab/WiSe22-M-PNLR-PR-TbNLP/blob/master/transformers.ipynb)

### Installing necessary packages (e.g. if on Colab)

In [20]:
# Huggingface evaluate for evaluation metrics
# torchmetrics for no-hassle confusion matrices

%pip install --upgrade datasets tokenizers transformers evaluate torchmetrics

<snip>

### Premise

This notebook will guide you through the process of finetuning a transformer model using the 🤗 Transformers library.

First, we need to select a task and a suitable dataset. Here, we will use the [Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) task as an example. One suitable dataset is the RTE subset of the [SuperGLUE repository](https://www.tensorflow.org/datasets/catalog/super_glue#super_gluerte); in this notebook, however, we use the much larger SNLI corpus, which poses the same kind of task with three labels and is available on the 🤗 Datasets hub.

We can load the SNLI dataset as follows:

In [2]:
import datasets
datasets.utils.logging.set_verbosity("error")

from datasets import load_dataset


snli_dataset = load_dataset("snli")
print(snli_dataset)

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 550152
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
})


As we can see above, the dataset is already divided into train, validation and test splits.
Each row contains three fields: the premise, the hypothesis and the label.

The textual entailment task requires us to recognize, given two text fragments, whether the meaning of one text is entailed (*can be inferred*) from the other text.

In this example, we will use a BERT-family model. With BERT, we formulate the entailment task as a simple classification task by concatenating the premise and hypothesis and training our classifier on the first token (the `[CLS]` token) of the input string:

```
"[CLS] This is the premise, i.e. a text that means something. [SEP] This is the hypothesis, i.e. what we may be able to infer [SEP]"
```
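
As a small illustration of this layout, the sketch below arranges two token lists in the order shown above and marks the position the classifier reads. The helper name `bert_pair_layout` is hypothetical, and whitespace splitting is only a stand-in for real subword tokenization:

```python
def bert_pair_layout(premise_tokens: list[str], hypothesis_tokens: list[str]) -> list[str]:
    """Arrange two token lists the way a BERT-style model expects a sentence pair."""
    return ["[CLS]", *premise_tokens, "[SEP]", *hypothesis_tokens, "[SEP]"]


# Whitespace splitting stands in for real subword tokenization (an assumption
# made for readability; BERT tokenizers use WordPiece).
layout = bert_pair_layout("A man sleeps .".split(), "A person rests .".split())
print(layout)

# The classifier is trained on the representation of the first token, the [CLS] slot.
cls_position = layout.index("[CLS]")
print(cls_position)
```

In practice the tokenizer inserts the leading `[CLS]` and the trailing `[SEP]` itself, so only the separator between premise and hypothesis has to be supplied by hand.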

But let's first take a look at the dataset.

In [3]:
print(snli_dataset['train'][:2])
print(snli_dataset['validation'][:2])
print(snli_dataset['test'][:2])

{'premise': ['A person on a horse jumps over a broken down airplane.', 'A person on a horse jumps over a broken down airplane.'], 'hypothesis': ['A person is training his horse for a competition.', 'A person is at a diner, ordering an omelette.'], 'label': [1, 2]}
{'premise': ['Two women are embracing while holding to go packages.', 'Two women are embracing while holding to go packages.'], 'hypothesis': ['The sisters are hugging goodbye while holding to go packages after just eating lunch.', 'Two woman are holding packages.'], 'label': [1, 0]}
{'premise': ['This church choir sings to the masses as they sing joyous songs from the book at a church.', 'This church choir sings to the masses as they sing joyous songs from the book at a church.'], 'hypothesis': ['The church has cracks in the ceiling.', 'The church is filled with song.'], 'label': [1, 0]}


Note that SNLI contains some **unlabeled** samples (marked with the label `-1`), which we filter out in the next step.

Let's construct the sentences as we outlined above.

In [4]:
prepared_dataset = snli_dataset.filter(lambda sample: sample['label'] >= 0)
prepared_dataset = prepared_dataset.map(
    lambda sample: {'text': f"{sample['premise']} [SEP] {sample['hypothesis']}"},
    remove_columns=['premise', 'hypothesis']
)
print(prepared_dataset['train'][:2])

{'label': [1, 2], 'text': ['A person on a horse jumps over a broken down airplane. [SEP] A person is training his horse for a competition.', 'A person on a horse jumps over a broken down airplane. [SEP] A person is at a diner, ordering an omelette.']}
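
The integer labels are easier to read with names attached. The sketch below assumes the label order of the SNLI dataset on the 🤗 hub (0 = entailment, 1 = neutral, 2 = contradiction, with `-1` marking samples without a gold label). `LABEL_NAMES` and `describe` are illustrative names and not part of the notebook:

```python
# Assumed label order of the SNLI dataset on the hub.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe(sample: dict) -> str:
    """Prefix the sample text with a human-readable label name."""
    label = LABEL_NAMES.get(sample["label"], "no gold label")
    return f"{label}: {sample['text']}"


samples = [
    {"label": 1, "text": "A person on a horse jumps over a broken down airplane. [SEP] A person is training his horse for a competition."},
    {"label": 2, "text": "A person on a horse jumps over a broken down airplane. [SEP] A person is at a diner, ordering an omelette."},
]
for sample in samples:
    print(describe(sample))
```

With the real dataset, the same names should be available from `snli_dataset['train'].features['label'].names`.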


### Loading Pre-Trained Models

Now we need to load a pre-trained BERT model. You should use a subclass of [AutoModel](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial).

#### Load and instantiate a model for the textual entailment task

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

config = AutoConfig.from_pretrained('distilbert-base-uncased')
config.num_labels = 3
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classi

Now we could use the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for easy training. You can follow the tutorial from [the official documentation](https://huggingface.co/docs/transformers/quicktour#trainer-a-pytorch-optimized-training-loop).

#### Write the training procedure

In [6]:
from transformers import DataCollatorWithPadding

def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

dataset = prepared_dataset.map(tokenize_dataset, batched=True, remove_columns=["text"])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [55]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/ssd2/team/stoeckel/distilbert-snli/",
    save_total_limit=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    logging_steps=5000,
    log_level="error"
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [56]:
from evaluate import load as load_metric
precision = load_metric("precision")
recall = load_metric("recall")
f1 = load_metric("f1")

def compute_metrics(eval_pred, logits=True):
    predictions, labels = eval_pred
    if logits:
        predictions = predictions.argmax(axis=-1)
    return precision.compute(predictions=predictions, references=labels, average="weighted") \
            | recall.compute(predictions=predictions, references=labels, average="weighted") \
            | f1.compute(predictions=predictions, references=labels, average="weighted")
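
To make `average="weighted"` concrete, here is a plain-Python sketch of support-weighted precision, recall and F1. It is not the implementation used by the `evaluate` metrics, only an illustration of the averaging scheme; the function name `weighted_prf` is hypothetical:

```python
from collections import Counter


def weighted_prf(predictions, references):
    """Support-weighted precision, recall and F1 over the classes in `references`."""
    support = Counter(references)
    total = len(references)
    precision = recall = f1 = 0.0
    for cls, n_true in support.items():
        # True positives and number of predictions for this class
        tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
        n_pred = sum(p == cls for p in predictions)
        p_cls = tp / n_pred if n_pred else 0.0
        r_cls = tp / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        # Each class contributes in proportion to its share of the true labels
        weight = n_true / total
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return {"precision": precision, "recall": recall, "f1": f1}


scores = weighted_prf(predictions=[0, 1, 1, 2], references=[0, 1, 2, 2])
print(scores)
```

One consequence of this weighting is that the weighted recall equals plain accuracy, because each class's recall is multiplied by its share of the samples.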

In [59]:
from transformers import Trainer

trainer = Trainer(
    model_init=lambda: AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config),
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print(trainer.evaluate(dataset['validation'], metric_key_prefix='dev_pre'))
trainer.train()
print(trainer.evaluate(dataset['validation'], metric_key_prefix='dev_post'))

{'dev_pre_loss': 1.0994470119476318, 'dev_pre_precision': 0.33251962492686143, 'dev_pre_recall': 0.3351960983539931, 'dev_pre_f1': 0.17729867787420958, 'dev_pre_runtime': 6.6451, 'dev_pre_samples_per_second': 1481.092, 'dev_pre_steps_per_second': 92.7}


Step,Training Loss
5000,0.5774
10000,0.4709
15000,0.432
20000,0.4093
25000,0.3941
30000,0.3855
35000,0.3694
40000,0.3097
45000,0.3149
50000,0.3101


{'dev_post_loss': 0.3272220194339752, 'dev_post_precision': 0.8997085763358726, 'dev_post_recall': 0.8996138996138996, 'dev_post_f1': 0.8996580795507888, 'dev_post_runtime': 6.1061, 'dev_post_samples_per_second': 1611.818, 'dev_post_steps_per_second': 100.882, 'epoch': 3.0}


In [90]:
import pandas as pd

metrics = trainer.evaluate(dataset['test'], metric_key_prefix='test')
print(pd.DataFrame.from_dict({metric.title(): value for metric, value in metrics.items()}, columns=["Test"], orient='index')[:4])

                    Test
Test_Loss       0.341835
Test_Precision  0.895842
Test_Recall     0.895765
Test_F1         0.895799


### Training only the Classifier Head

We can also freeze the weights of the transformer model and instead only train the weights of the classifier head on top.

The classifier head consists of a projection layer (`model.pre_classifier`) and the classifier layer itself (`model.classifier`).

Freezing layers in PyTorch can be done by setting the `requires_grad` attribute of their parameter tensors to `False`.

In [61]:
def only_classifier_init():
    _model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config)
    
    # Freeze all parameters
    for p in _model.parameters():
        p.requires_grad = False

    # Unfreeze only projection and classifier layers
    _model.pre_classifier.requires_grad_()
    _model.classifier.requires_grad_()
    return _model
    

oc_trainer = Trainer(
    model_init=only_classifier_init,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print(oc_trainer.evaluate(dataset['validation'], metric_key_prefix='dev_pre'))
oc_trainer.train()
print(oc_trainer.evaluate(dataset['validation'], metric_key_prefix='dev_post'))

{'dev_pre_loss': 1.0994470119476318, 'dev_pre_precision': 0.33251962492686143, 'dev_pre_recall': 0.3351960983539931, 'dev_pre_f1': 0.17729867787420958, 'dev_pre_runtime': 4.9685, 'dev_pre_samples_per_second': 1980.875, 'dev_pre_steps_per_second': 123.981}


Step,Training Loss
5000,1.0458
10000,1.0158
15000,0.9992
20000,0.9928
25000,0.9867
30000,0.9826
35000,0.9787
40000,0.973
45000,0.9715
50000,0.9674


{'dev_post_loss': 0.9226549863815308, 'dev_post_precision': 0.5693611488832085, 'dev_post_recall': 0.5689900426742532, 'dev_post_f1': 0.5683723041672778, 'dev_post_runtime': 6.1349, 'dev_post_samples_per_second': 1604.267, 'dev_post_steps_per_second': 100.409, 'epoch': 3.0}


In [91]:
import pandas as pd

metrics = oc_trainer.evaluate(dataset['test'], metric_key_prefix='test')
print(pd.DataFrame.from_dict({metric.title(): value for metric, value in metrics.items()}, columns=["Test"], orient='index')[:4])

                    Test
Test_Loss       0.920187
Test_Precision  0.571025
Test_Recall     0.570033
Test_F1         0.569503


### Custom Training
While the `Trainer` class is very convenient, a regular training loop can be easier to work with if you have to run custom procedures during training.

We can re-use code from the datasets notebook.

In [10]:
def tokenize_custom(batch: dict):
    return tokenizer(
        batch['text'],
        add_special_tokens=True,
        return_token_type_ids=False,
        return_attention_mask=False,
        padding=False,
        truncation=True,
    )

pt_dataset = prepared_dataset.map(tokenize_custom, batched=True)
print(pt_dataset['train'][:2])

{'label': [1, 2], 'text': ['A person on a horse jumps over a broken down airplane. [SEP] A person is training his horse for a competition.', 'A person on a horse jumps over a broken down airplane. [SEP] A person is at a diner, ordering an omelette.'], 'input_ids': [[101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102, 1037, 2711, 2003, 2731, 2010, 3586, 2005, 1037, 2971, 1012, 102], [101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102, 1037, 2711, 2003, 2012, 1037, 15736, 1010, 13063, 2019, 18168, 12260, 4674, 1012, 102]]}


In [75]:
from typing import Tuple
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def custom_collate(batch: list[dict]) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Custom collation method that extracts the 'input_ids' and 'label' fields from the input dictionary,
    pads the input_ids to equal length and stacks the input_ids and labels into tensors.
    """
    input_ids = [torch.tensor(sample['input_ids']) for sample in batch]
    input_ids = pad_sequence(input_ids, batch_first=True)
    label = torch.tensor([sample['label'] for sample in batch]).long()
    return input_ids, label
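
To make the padding step concrete, here is a plain-Python sketch of what padding a batch to equal length produces, without requiring PyTorch. It also builds an attention mask (1 for real tokens, 0 for padding), which the collate function above does not return. `pad_batch` and its `pad_id` parameter are hypothetical names, and `pad_id=0` assumes the tokenizer's padding id is 0 (worth checking via `tokenizer.pad_token_id`):

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0):
    """Right-pad token id lists to the longest length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded, mask = pad_batch([[101, 1037, 2711, 102], [101, 102]])
print(padded)
print(mask)
```

Passing such a mask to the model as `attention_mask` keeps it from attending to padding positions. The training loop in this notebook omits it, so this is an optional refinement rather than a description of the code above.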

In [76]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cuda


In [None]:
from torch.nn import CrossEntropyLoss


def train(model, dataloader, postfix_loss=True, postfix_freq=100):
    """
    Run a training epoch for the model on the given dataloader. Each batch is moved to the model's device,
    so make sure the model is already on the correct device when you pass it to this method.
    
    Note that this function uses the global `optimizer`, which must be defined before it is called.
    
    If postfix_loss is True, the dataloader is expected to be a tqdm wrapper around the actual iterable.
    The average loss of the `postfix_freq` last batches is then post-fixed to the tqdm progress bar.
    
    Returns the finetuned model.
    """
    model.train()
    criterion = CrossEntropyLoss()
    loss_acc = []
    for idx, (x, y) in enumerate(dataloader):
        x_hat = model.forward(x.to(model.device))
        
        # Reset gradients to zero
        optimizer.zero_grad()
        
        loss = criterion(x_hat.logits, y.long().to(model.device))
        
        loss.backward()
        
        optimizer.step()
        
        if postfix_loss:
            loss_acc.append(float(loss.item()))
            if idx % postfix_freq == 0:
                loss = sum(loss_acc) / len(loss_acc)
                dataloader.set_postfix_str(f"loss: {loss:0.3f}")
                loss_acc = []
    return model

@torch.no_grad()  # no gradients are needed for prediction, which saves memory
def predict(model, dataloader) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Run prediction for the data in the given dataloader.
    
    Returns the predictions and targets as integer tensors.
    """
    preds = []
    targets = []
    model.eval()
    for (x, y) in dataloader:
        x_hat = model.forward(x.to(model.device))
        pred = x_hat.logits.argmax(1)

        preds.extend(pred.detach().cpu().tolist())
        targets.extend(y.detach().cpu().tolist())

    preds = torch.tensor(preds)
    targets = torch.tensor(targets)
    return preds, targets


In [87]:
import pandas as pd
from tqdm.notebook import tqdm, trange
from torch.optim import AdamW


model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config)

# Use this lambda expression to filter out any parameters that do not have `requires_grad` set
# and thus should not be changed (e.g. if we froze them on purpose)
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5)

# Create one validation dataloader, no shuffling necessary
dev_dataloader = DataLoader(pt_dataset['validation'], batch_size=64, shuffle=False, collate_fn=custom_collate)

num_epochs = 3
model.to(device)
for epoch in trange(num_epochs, position=0):
    # The training dataloader is re-initialized each epoch
    # resulting in a different random sequence due to shuffle=True
    # Remember that the batch_size affects the training!
    train_dataloader = DataLoader(pt_dataset['train'], batch_size=16, shuffle=True, collate_fn=custom_collate)
    model = train(model, tqdm(train_dataloader, position=1, leave=False))

    preds, trues = predict(model, tqdm(dev_dataloader, position=1, leave=False))
    metrics = compute_metrics((preds, trues), logits=False)
    print(pd.DataFrame.from_dict({metric.title(): value for metric, value in metrics.items()}, columns=[f"Epoch {epoch+1:d}"], orient='index'))


           Epoch 0 
Precision  0.881906
Recall     0.881731
F1         0.881722


           Epoch 1 
Precision  0.891297
Recall     0.890876
F1         0.891040


           Epoch 2 
Precision  0.893199
Recall     0.892400
F1         0.892654


In [88]:
from torchmetrics import ConfusionMatrix
import pandas as pd


test_dataloader = DataLoader(pt_dataset['test'], batch_size=64, shuffle=False, collate_fn=custom_collate)

preds, trues = predict(model, tqdm(test_dataloader, position=1, leave=False))
metrics = compute_metrics((preds, trues), logits=False)

print(pd.DataFrame.from_dict({metric.title(): value for metric, value in metrics.items()}, columns=["Test"], orient='index'))
print()

# Recent torchmetrics releases require the `task` argument
cm = ConfusionMatrix(task="multiclass", num_classes=3)(preds, trues).numpy()
labels = ["Entailment", "Unrelated", "Contradiction"]
df = pd.DataFrame(cm, columns=labels, index=labels)
print(df)

               Test
Precision  0.889144
Recall     0.888742
F1         0.888862

               Entailment  Unrelated  Contradiction
Entailment           2988        298             82
Unrelated             234       2774            211
Contradiction          60        208           2969
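
The confusion matrix can be read further with a few lines of plain Python. The sketch below copies the counts from the table above and assumes the torchmetrics convention that rows are true labels and columns are predictions. The names `class_names` and `counts` are illustrative:

```python
class_names = ["Entailment", "Unrelated", "Contradiction"]
# Counts copied from the printed confusion matrix (rows: true, columns: predicted)
counts = [
    [2988, 298, 82],
    [234, 2774, 211],
    [60, 208, 2969],
]

n_classes = len(counts)
total = sum(sum(row) for row in counts)
correct = sum(counts[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: correct predictions divided by the row sum (all true members of the class)
per_class_recall = {
    class_names[i]: counts[i][i] / sum(counts[i]) for i in range(n_classes)
}
# Precision: correct predictions divided by the column sum (all predictions of the class)
per_class_precision = {
    class_names[j]: counts[j][j] / sum(row[j] for row in counts) for j in range(n_classes)
}

print(f"accuracy: {accuracy:.4f}")
for name in class_names:
    print(f"{name}: precision={per_class_precision[name]:.3f} recall={per_class_recall[name]:.3f}")
```

Under the stated assumption, the accuracy agrees with the weighted recall printed above, and the "Unrelated" class has the lowest recall of the three.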


## Summary

### Comparison
#### Huggingface Trainer 
- smaller boilerplate 👍
    - slightly larger overhead (`+11` min training time) 👎
- acceleration trivial 👍
- very "opaque" 👎
 
#### Custom Training
- larger boilerplate 👎
    - slightly less overhead (`-11` min training time) 👍
- acceleration non-trivial 👎
- fully transparent 👍

### Takeaway
#### For Trainer
- Prefer for pre-defined tasks

#### For Custom
- Prefer for custom tasks
- Reduce boilerplate with projects from the PyTorch Ecosystem:
    - [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/stable/)
    - [Ignite](https://pytorch-ignite.ai/)
- Accelerate training with projects from the PyTorch Ecosystem:
    - [accelerate](https://huggingface.co/docs/accelerate/index)
    - [Ray](https://docs.ray.io/en/latest)
    
#### For Transformers + Classifier Head Models
- Finetuning the whole model takes about **four times longer** than just finetuning the classification head
    - `74` min for fine-tuning
    - `18` min for classifier head only
- Finetuning the whole model yields about **two times larger improvement** on the SNLI dataset
    - compared with the `33.33%` of random guessing (gains in percentage points):
        - `+56.64%` for fine-tuning (`89.97%`)
        - `+23.62%` for classifier head only (`56.95%`)
- See [`Adapters`](https://adapterhub.ml/) for a trade-off between the two extremes
    - Paper: [AdapterHub: A Framework for Adapting Transformers (EMNLP 2020)](https://aclanthology.org/2020.emnlp-demos.7/)
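
The figures in the list above follow from simple arithmetic, reproduced here as a sanity check. All numbers are copied from the bullets; the variable names are illustrative:

```python
baseline = 33.33   # random guessing over three classes, in percent
full_f1 = 89.97    # fine-tuning the whole model
head_f1 = 56.95    # classifier head only

full_gain = full_f1 - baseline
head_gain = head_f1 - baseline

full_minutes, head_minutes = 74, 18

print(f"gain, full model: {full_gain:.2f} points")
print(f"gain, head only:  {head_gain:.2f} points")
print(f"gain ratio:       {full_gain / head_gain:.1f}x")
print(f"time ratio:       {full_minutes / head_minutes:.1f}x")
```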