# Hugging Face NLP Course - Part 2


## Chapter 3 : Finetuning a pretrained model

### Outline
- [3.1 - Process the data](#1)
  - [3.1.1 - Load a dataset from the Hub](#1.1)
  - [3.1.2 - Preprocess a dataset](#1.2)
- [3.2 - Finetuning models with `Trainer` API](#2)
- [3.3 - A full training without `Trainer` API](#3)
- [3.4 - Supercharge training loop with HuggingFace Accelerate](#4)


### Objectives :
- Prepare a large dataset from the Hub
- Use the high-level `Trainer` API to fine-tune a model
- Use a custom training loop
- Leverage the Hugging Face Accelerate library to easily run that custom training loop on any distributed setup




We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset, one of the 10 datasets in the GLUE benchmark, an academic benchmark that measures the performance of ML models across 10 different text classification tasks.

The dataset consists of 5801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases.

In [2]:
!pip install transformers[torch]

In [3]:
!pip install datasets

In [4]:
!pip install evaluate


In [5]:
!pip install accelerate -U



<a name="1"></a>
#### 3.1 Process the data
<a name="1.1"></a>
##### 3.1.1 Load a dataset from the Hub

In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The dataset is split into training, validation and test sets.

Let's take a look at a sample from the training set, which contains two sentences, a label (0 or 1) and an index.

In [7]:
train_dataset = raw_datasets['train']
train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [8]:
validation_dataset = raw_datasets["validation"]
test_dataset = raw_datasets["test"]

<a name="1.2"></a>
##### 3.1.2 Preprocess a dataset

As we discussed in part 1, we need to convert the text data to a numerical format by encoding it: tokenizing the text, then associating each token with a vector representation, which we call a word embedding.

We could first tokenize all the first sentences and all the second sentences separately.

In [9]:
from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenized_sentences_1 = tokenizer(train_dataset["sentence1"])
tokenized_sentences_2 = tokenizer(train_dataset["sentence2"])

To process the whole dataset efficiently, we define a function that tokenizes the two sentences of each pair together.

In [10]:
def tokenization(dataset):
  return tokenizer(dataset["sentence1"], dataset["sentence2"], truncation = True)
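
To make the effect of passing two sentences at once more concrete, here is a simplified sketch of the BERT-style pair layout: `[CLS] sentence1 [SEP] sentence2 [SEP]`, with `token_type_ids` marking the segment each token belongs to. The `encode_pair` helper and the whitespace splitting are assumptions made for illustration; the real tokenizer uses WordPiece and returns vocabulary ids rather than strings.

```python
def encode_pair(sentence1, sentence2):
    """Toy illustration of a BERT-style sentence-pair layout (not the real tokenizer)."""
    tokens_1 = sentence1.lower().split()
    tokens_2 = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_1 + ["[SEP]"] + tokens_2 + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_1) + 2) + [1] * (len(tokens_2) + 1)
    # Every real token is attended to; padding (added later) would get 0.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = encode_pair("This is first", "This is second one")
print(pair["tokens"])
print(pair["token_type_ids"])
```

The three keys mirror the `input_ids`, `token_type_ids` and `attention_mask` columns that the real tokenizer adds to the dataset.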

Then we apply the function to the whole dataset, with `batched = True` so that it processes multiple elements of the dataset at once rather than each element separately.

In [11]:
tokenized_datasets = raw_datasets.map(tokenization, batched = True)
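
As a rough mental model of batched mapping, the sketch below applies a function to slices of rows, passing each slice as a single dictionary of lists. `map_batched` and `count_words` are hypothetical helpers written for this illustration, not part of the `datasets` library, and the real implementation differs.

```python
def map_batched(rows, fn, batch_size=1000):
    """Toy version of a batched map: fn receives a dict of lists per slice of rows."""
    processed = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # With batching, the function sees one dict of lists, not one dict per row.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            extra = {key: values[i] for key, values in new_columns.items()}
            processed.append({**row, **extra})
    return processed


def count_words(batch):
    # Returns one new column with one value per row in the batch.
    return {"n_words": [len(sentence.split()) for sentence in batch["sentence1"]]}


rows = [{"sentence1": "a b c"}, {"sentence1": "d e"}, {"sentence1": "f"}]
result = map_batched(rows, count_words, batch_size=2)
print(result)
```

This is why the `tokenization` function above can index `dataset["sentence1"]` and receive a list of sentences: the tokenizer then handles the whole list in one call.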


Next, we need to pad all examples to the length of the longest element in each batch when we batch elements together, a technique called **Dynamic Padding**.

Before that, let's talk about the collate function. When you build a `DataLoader`, you can provide a *collate function* that puts samples together inside a batch.

In our case, we need a collate function that applies the correct amount of padding to the dataset items we want to batch together. Hugging Face provides one: `DataCollatorWithPadding`.


In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
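
The idea behind dynamic padding can be sketched in plain Python. `pad_batch` below is a hypothetical helper, not the library's implementation: it pads each sequence only up to the longest one in the current batch and builds the matching attention mask. The token ids are made up, and a padding id of 0 is assumed.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 11, 12, 102]},
    {"input_ids": [101, 21, 102]},
    {"input_ids": [101, 31, 32, 33, 34, 102]},
]
batch = pad_batch(features)
print([len(ids) for ids in batch["input_ids"]])
print(batch["attention_mask"][1])
```

Because the target length depends on the batch, a batch of short sentences stays short, which avoids padding everything to one fixed global length.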

<a name="2"></a>
#### 3.2 Finetuning models with `Trainer` API

Hugging Face provides a `Trainer` class to help fine-tune any pretrained model on a specific dataset.

We need a GPU-enabled environment to run `Trainer.train()`, as it runs very slowly on a CPU.

Then, we need to follow several steps:

- Define the hyperparameters that `Trainer` will use for training and evaluation with the `TrainingArguments` class.
- Define the model.
- Define a `compute_metrics()` function for evaluation.
- Define a `Trainer` by passing it all the objects constructed so far.

In [13]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
import numpy as np
import evaluate

# We choose model configuration by default, specify a directory name where the model will be saved, then activate eval strategy to report metrics at the end of each epoch.
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

In [14]:
# define the model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We get a warning when we instantiate the model because BERT was not pretrained on classifying pairs of sentences.

Therefore, the head of the pretrained model has been discarded:
 > Some weights of BertForSequenceClassification were not initialized

A new head suitable for sequence classification has been added instead:
 > ...are newly initialized

In [15]:
def compute_metrics(eval_preds):
  # load evaluation metrics
  metric = evaluate.load("glue", "mrpc")
  # prediction : logits, labels : labels (expected values)
  logits, labels = eval_preds
  #  take the index with the maximum value on the 2nd axis
  predictions = np.argmax(logits, axis = -1)
  return metric.compute(predictions = predictions, references = labels)
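
To show what happens between the logits and the reported scores, here is a dependency-free sketch that turns toy logits into predictions and computes accuracy and F1 for the positive class. The helper names and the example values are assumptions for illustration; in the notebook the scores come from the `evaluate` metric, not from this code.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1 (positive class = 1) from two lists of labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    accuracy = sum(1 for p, r in pairs if p == r) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return {"accuracy": accuracy, "f1": f1}


toy_logits = [[0.2, 1.5], [1.0, -0.3], [0.1, 0.4], [2.0, 0.5]]
toy_labels = [1, 0, 0, 1]
toy_predictions = [argmax(row) for row in toy_logits]
scores = accuracy_and_f1(toy_predictions, toy_labels)
print(toy_predictions, scores)
```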

In [16]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator, # DataCollatorWithPadding that we defined before
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

Start fine-tuning. The training loss is reported every 500 steps, so the first epoch (459 steps) shows `No log`.

In [17]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


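| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | No log | 0.404686 | 0.821078 | 0.871252 |
| 2 | 0.532100 | 0.568905 | 0.845588 | 0.892675 |
| 3 | 0.293400 | 0.743857 | 0.840686 | 0.888124 |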


TrainOutput(global_step=1377, training_loss=0.3328030266966089, metrics={'train_runtime': 246.822, 'train_samples_per_second': 44.583, 'train_steps_per_second': 5.579, 'total_flos': 405540469624800.0, 'train_loss': 0.3328030266966089, 'epoch': 3.0})

In [18]:
## Want to see exactly how the evaluation phase works? Uncomment the code below and comment out the `compute_metrics` argument in the trainer

# predictions = trainer.predict(tokenized_datasets["validation"])
# predictions.predictions.shape, predictions.label_ids.shape
# preds = np.argmax(predictions.predictions, axis = -1)  # argmax over the class axis, one prediction per example
# metric = evaluate.load("glue", "mrpc")
# metric.compute(predictions = preds, references = predictions.label_ids)

<a name="3"></a>

#### 3.3 A full training without `Trainer` API.

How do we achieve the same results without the `Trainer` API? Let's walk through the whole fine-tuning process.

Whereas the `Trainer` API automates the training and evaluation process, here we need to define the required objects ourselves in PyTorch (or another framework such as TensorFlow). We reuse all the code blocks from section 3.1.

- Postprocess tokenized dataset

In [19]:
# remove columns the model does not need
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1","sentence2","idx"])
# rename column label to labels
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# set data in PyTorch format
tokenized_datasets.set_format("torch")
# quick check column names
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

- define the dataloader

In [20]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn = data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn = data_collator
)

In [21]:
# get a quick check of an example
for batch in train_dataloader:
  break
{k:v.shape for k,v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 75]),
 'token_type_ids': torch.Size([8, 75]),
 'attention_mask': torch.Size([8, 75])}

- define the model

In [22]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
# get a quick check of an example above
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7815, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
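
The loss above is a cross-entropy between the logits and the labels (note the `NllLossBackward0` in the output). As a sanity check, an untrained two-class head that gives both classes roughly equal scores should produce a loss near ln 2 ≈ 0.693, which is close to the value printed. The sketch below computes the cross-entropy for a single example in plain Python; `cross_entropy` is an illustrative helper, not the PyTorch function.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: log-sum-exp of the logits minus the true-class logit."""
    largest = max(logits)
    # Subtracting the largest logit keeps the exponentials numerically stable.
    log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum_exp - logits[label]


uniform_loss = cross_entropy([0.0, 0.0], label=1)
confident_loss = cross_entropy([-2.0, 3.0], label=1)
print(uniform_loss, confident_loss)
```

A loss well below ln 2 means the model puts more weight on the correct class than a coin flip would.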


- choose an optimizer (`AdamW` is the default in the `Trainer` API)

In [24]:
# AdamW from PyTorch; the AdamW class in transformers is deprecated
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)



- set learning rate scheduler : a linear decay from the max value (5e-5) to 0.

We need to know the number of training steps to take, which is **epochs** * **training batches** (length of training dataloader). As `Trainer` uses 3 epochs by default, we will set the same number.

In [25]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps
)

print(num_training_steps)

1377
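
The step count and the shape of the schedule can be checked by hand. The sketch below reproduces the arithmetic (3668 training examples, batch size 8, 3 epochs) and a linear decay to zero without warmup; `linear_decay_lr` is a hypothetical helper that mirrors the idea, not the scheduler returned by `get_scheduler`.

```python
import math


def linear_decay_lr(step, base_lr, num_training_steps):
    """Learning rate after `step` optimizer updates, with linear decay and no warmup."""
    remaining = max(0, num_training_steps - step)
    return base_lr * (remaining / num_training_steps)


num_examples, batch_size, num_epochs = 3668, 8, 3
# The last batch of each epoch is smaller, so we round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch

print(steps_per_epoch, num_training_steps)
for step in (0, 459, 918, 1377):
    print(step, linear_decay_lr(step, 5e-5, num_training_steps))
```

The learning rate starts at 5e-5, falls to two thirds of that after the first epoch, and reaches 0 at the final step.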


- define a device we will put the model / batches on

In [26]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

- train the model, with a progress bar over the number of training steps using `tqdm`

In [28]:
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()

for epoch in range(num_epochs):
  for batch in train_dataloader:
    batch ={k : v.to(device) for k,v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

- evaluation loop

In [29]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval() # switch to evaluation mode (e.g. disables dropout); this does not turn off gradients

for batch in eval_dataloader:
  batch = {k : v.to(device) for k, v in batch.items()}
  with torch.no_grad(): # disable gradient computation, which is not needed for evaluation
    outputs = model(**batch)

  logits = outputs.logits
  predictions = torch.argmax(logits, dim=-1)
  # accumulate batches as we go over prediction loop with add_batch
  metric.add_batch(predictions=predictions, references = batch["labels"])

metric.compute()

{'accuracy': 0.8676470588235294, 'f1': 0.9059233449477352}

<a name="4"></a>
#### 3.4 Supercharge training loop with [HuggingFace Accelerate](https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt#:~:text=But%20using%20the-,%F0%9F%A4%97%20Accelerate,-library%2C%20with%20just)

With the Accelerate library, we can enable distributed training on multiple GPUs/TPUs.

In [30]:
from accelerate import Accelerator
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import AutoModelForSequenceClassification, get_scheduler

# instantiate an object that will look at env and initialize the proper distributed setup.
accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# send dataloaders, model and optimizer to accelerator.prepare
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

