In [1]:
%run supportvectors-common.ipynb






## <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="60px">  Fine tuning the Transformers

Sometimes a pretrained model does not exactly fit our task, and its inference performance leaves room for improvement. In that case, if additional task-specific training data is available, we can fine-tune the model. This is the second part of the transfer learning journey: take a pretrained model, and fine-tune it by training it further on a task-specific dataset for a few epochs.

It is worth noting that pre-training is often self-supervised; in other words, labeled data is not necessary. For example, we can pre-train a masked language model simply by taking a vast corpus of documents and decomposing them into sentences. Each sentence then becomes a datum for pre-training as an `<input, label>` pair: we randomly mask a few words, and treat the masked words as the labels.
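
To make the idea of extracting `<input, label>` pairs from unlabeled text concrete, here is a minimal sketch. It is a simplification and not the actual BERT pre-training procedure: it splits on whitespace rather than using a real tokenizer, and the helper name `make_mlm_pair`, the `mask_prob` parameter, and the sample sentence are all hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string standing in for a masked word


def make_mlm_pair(sentence, mask_prob=0.15, seed=None):
    """Turn one unlabeled sentence into an (input, label) pair by masking words.

    Simplified sketch: splits on whitespace and always masks at least one word.
    """
    rng = random.Random(seed)
    words = sentence.split()
    # Choose positions to mask; force one if the random draw selects none.
    positions = [i for i in range(len(words)) if rng.random() < mask_prob]
    if not positions:
        positions = [rng.randrange(len(words))]
    masked = [MASK_TOKEN if i in positions else w for i, w in enumerate(words)]
    labels = {i: words[i] for i in positions}  # position -> original word
    return " ".join(masked), labels


masked_input, labels = make_mlm_pair(
    "the quick brown fox jumps over the lazy dog", seed=0
)
print(masked_input)  # the sentence with some words replaced by [MASK]
print(labels)        # the hidden words the model would have to predict
```

No human annotation is involved: the labels come from the text itself, which is why such pre-training can scale to very large corpora.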

Therefore, pre-training is done with:

* vast quantities of data, and consequently a large number of training steps over the mini-batches
* using unlabeled data most of the time (i.e., as self-supervised learning), by cleverly extracting `<input, label>` pairs as training data instances from the data

On the other hand, fine-tuning a model involves:

* labeled data for the specific task
* datasets that **generally need not be big -- indeed, small datasets mostly suffice**
* a much shorter training cycle, with a few epochs over the task-specific data
* **far fewer hardware resources**
* because of the above, far less (fine-tuning) training time
* and consequently a **far smaller carbon footprint or environmental impact**!


Therefore, wherever feasible, we should resort to transfer learning, with fine-tuning of pretrained models as our preferred approach.

The core `Huggingface` libraries make it rather easy to fine-tune pretrained models. For this, we will use the following libraries:

* `datasets`: the library that makes it easy to load and use a vast number of publicly available datasets, many of which belong to well-known benchmarks. For example, in this lab we will use the `mrpc` dataset from the `GLUE` benchmark for NLP tasks.

* `transformers`: the main library of models, tokenizers, etc.

* `evaluate`: an excellent library to evaluate the inference performance of trained models, and to compare performance across models.

#### Installation

Quite likely, by now you have installed each of the libraries below on your workstation. Otherwise, uncomment the cell below and run it.

In [2]:
#
# Uncomment this only if needed.
#
#!pip install datasets transformers evaluate --upgrade

### Load the `mrpc` dataset
This dataset contains sentence pairs; the label specifies whether the second sentence is a paraphrase of the first.

In [3]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")

Found cached dataset glue (/home/asif/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



Let us explore it a bit.

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

#### Observations

The `raw_datasets` object comprises three datasets (`train`, `validation`, `test`), stored in a dictionary-like `DatasetDict` object. This is rather convenient, since the data has already been split into train, validation, and test subsets.

We also note that there are only `3668 + 408 + 1725 = 5801` instances of data. This is a rather small dataset, and only `3668` of those instances are available for fine-tuning a pre-trained model.

Also note that the input $\mathbf{X}$ is derived from a pair of sentences `<sentence1, sentence2>`. The output is, of course, marked as the `label`.


### Load pretrained models

Let us now load the pretrained tokenizer and transformer models. This is nearly identical to what we did in the previous labs.

There is however a subtlety when using the tokenizer. So far, we have been passing only one sentence to the tokenizer. But here we have pairs of sentences in this dataset. Fortunately, the tokenizers for `BERT` do accept pairs of sentences.

Recall that in the original exploration of the `BERT` research paper, we saw how the model was pre-trained using sentence pairs as inputs.
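
As a rough illustration of what "accepting a pair" means, the sketch below lays out two sentences in the `[CLS] A [SEP] B [SEP]` arrangement that `BERT` uses, along with segment ids that tell the two sentences apart. The function `encode_pair` is hypothetical and works on whitespace-split words; the real tokenizer produces WordPiece sub-tokens and integer ids instead.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out two token lists the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], sentence A and its [SEP];
    #              1 for sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = encode_pair(
    ["he", "left", "early"], ["he", "departed", "soon"]
)
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:>10}  {seg}")
```

When the tokenizer is called with two text arguments, its output carries the analogous information in the `input_ids` and `token_type_ids` fields.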

#### Data Collator

Observe the new animal we introduce to our zoo in the code below, the `DataCollatorWithPadding`.

Data collators are responsible for forming (mini-) batches of the datasets. They preserve the form of the original data, but may perform some post-processing as needed.

For example, in our task, the sentences may be of variable length. So the `DataCollatorWithPadding` will pad each of the tokenized sequences in a mini-batch so that its length matches that of the longest sequence present in that mini-batch.

<img src="images/data-prep-for-fine-tuning.jpeg" />
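
The padding step can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are illustrative, and a padding id of `0` is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in a mini-batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Three tokenized sequences of different lengths (illustrative ids).
batch = [[101, 2002, 102], [101, 2002, 2187, 2220, 102], [101, 102]]
input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Because the target length is decided per mini-batch rather than for the whole dataset, batches of short sentences stay short, which saves computation.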

In [5]:
from transformers import AutoTokenizer, DataCollatorWithPadding
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Loading cached processed dataset at /home/asif/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-8fa0ec02105bbfdd.arrow
Loading cached processed dataset at /home/asif/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-32460ce84408f1c7.arrow



#### Load the pretrained model checkpoint

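Next we load the pre-trained checkpoint into a model class that adds a sequence classification head. As the warning below the cell indicates, some of the checkpoint's pre-training weights are not used, and some weights of the new model are not initialized from the checkpoint; this is why fine-tuning is needed.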

> The classification task here: inferring whether the `sentence2` is a **paraphrasing** of the `sentence1` for a given input.

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [7]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

[codecarbon INFO @ 10:04:43] [setup] RAM Tracking...
[codecarbon INFO @ 10:04:43] [setup] GPU Tracking...
[codecarbon INFO @ 10:04:43] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 10:04:43] [setup] CPU Tracking...
[codecarbon INFO @ 10:04:43] Tracking Intel CPU via RAPL interface
[codecarbon INFO @ 10:04:44] >>> Tracker's metadata:
[codecarbon INFO @ 10:04:44]   Platform system: Linux-5.19.0-38-generic-x86_64-with-glibc2.35
[codecarbon INFO @ 10:04:44]   Python version: 3.10.9
[codecarbon INFO @ 10:04:44]   Available RAM : 125.570 GB
[codecarbon INFO @ 10:04:44]   CPU count: 32
[codecarbon INFO @ 10:04:44]   CPU model: 13th Gen Intel(R) Core(TM) i9-13900K
[codecarbon INFO @ 10:04:44]   GPU count: 1
[codecarbon INFO @ 10:04:44]   GPU model: 1 x NVIDIA GeForce RTX 4090
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded e

Step,Training Loss
500,0.4895
1000,0.2482


[codecarbon INFO @ 10:05:02] Energy consumed for RAM : 0.000196 kWh. RAM Power : 47.08856391906738 W
[codecarbon INFO @ 10:05:02] Energy consumed for all GPUs : 0.001474 kWh. All GPUs Power : 353.70300000000003 W
[codecarbon INFO @ 10:05:02] Energy consumed for all CPUs : 0.000166 kWh. All CPUs Power : 39.877596251666446 W
[codecarbon INFO @ 10:05:02] 0.001836 kWh of electricity used since the begining.
[codecarbon INFO @ 10:05:17] Energy consumed for RAM : 0.000392 kWh. RAM Power : 47.08856391906738 W
[codecarbon INFO @ 10:05:17] Energy consumed for all GPUs : 0.002964 kWh. All GPUs Power : 357.88200000000006 W
[codecarbon INFO @ 10:05:17] Energy consumed for all CPUs : 0.000332 kWh. All CPUs Power : 39.84626478680841 W
[codecarbon INFO @ 10:05:17] 0.003689 kWh of electricity used since the begining.
[codecarbon INFO @ 10:05:21] Energy consumed for RAM : 0.000441 kWh. RAM Power : 47.08856391906738 W
[codecarbon INFO @ 10:05:21] Energy consumed for all GPUs : 0.003329 kWh. All GPUs Pow

TrainOutput(global_step=1377, training_loss=0.2946628717382316, metrics={'train_runtime': 33.7076, 'train_samples_per_second': 326.455, 'train_steps_per_second': 40.851, 'total_flos': 406183858377360.0, 'train_loss': 0.2946628717382316, 'epoch': 3.0})
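
The `global_step=1377` in the output above can be checked with a little arithmetic. The sketch below assumes that the `Trainer` ran with a per-device batch size of 8 on a single GPU for 3 epochs; neither value is set explicitly in `TrainingArguments("test-trainer")`, so both are assumptions, though the `'epoch': 3.0` in the output is consistent with them.

```python
import math

num_train_examples = 3668  # rows in the MRPC train split shown earlier
batch_size = 8             # assumed: default per-device batch size, one GPU
num_epochs = 3             # assumed: default; matches 'epoch': 3.0 above

# The last mini-batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 459 1377
```

Under these assumptions the total matches the reported `global_step`, and it illustrates how short a fine-tuning run is: fewer than fourteen hundred optimizer steps, completed in about half a minute on this hardware.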

### `Evaluate` library to compute model metrics

The `evaluate` library proves very helpful here in computing the model performance metrics. Each benchmark dataset has its own relevant metrics for the task the model is trained for; for `mrpc`, these are `accuracy` and the `F1` score. We can load a `metric` object by benchmark and dataset name, in much the same way that we used `load_dataset()` to load the data.

The `compute()` method takes as arguments what you would expect: the predictions and the ground-truth labels.

In [8]:
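import evaluate
import numpy as np


def compute_metrics(eval_preds):
    # Load the metric associated with the GLUE `mrpc` dataset (accuracy and F1)
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)  # most probable class per example
    return metric.compute(predictions=predictions, references=labels)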
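
To see what these two metrics measure, here is a hand-rolled sketch in plain Python. It is not the `evaluate` implementation: the helper `accuracy_and_f1`, the toy logits, and the toy labels are hypothetical, and class `1` (paraphrase) is assumed to be the positive class for the F1 score.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and binary F1 from raw logits and true labels."""
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Four toy examples: one logit per class (not-paraphrase, paraphrase).
toy_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [1.5, 0.3]]
toy_labels = [1, 0, 0, 1]
print(accuracy_and_f1(toy_logits, toy_labels))
```

Accuracy counts all correct predictions equally, while F1 balances precision and recall on the positive class, which is informative when the two classes are not equally frequent.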

### The `Trainer` 

The trainer finally takes all the parts one would expect:

* some specific training arguments, such as the name of the training run, an evaluation-strategy, etc.
* `model` -- the model to train or fine tune
* the training and validation datasets
* the tokenizer
* the relevant data collator
* the function to compute the metrics


In [9]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### Fine-tune the model

In order to fine-tune the pre-trained model, let the `trainer` now fire off the training epochs.

In [10]:
import evaluate
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.346711,0.838235,0.882979
2,0.481100,0.593656,0.85049,0.895369
3,0.259200,0.705571,0.852941,0.894366


[codecarbon INFO @ 10:05:41] Energy consumed for RAM : 0.000196 kWh. RAM Power : 47.08856391906738 W
[codecarbon INFO @ 10:05:41] Energy consumed for all GPUs : 0.001443 kWh. All GPUs Power : 346.274 W
[codecarbon INFO @ 10:05:41] Energy consumed for all CPUs : 0.000178 kWh. All CPUs Power : 42.66688583978885 W
[codecarbon INFO @ 10:05:41] 0.001817 kWh of electricity used since the begining.
[codecarbon INFO @ 10:05:56] Energy consumed for RAM : 0.000392 kWh. RAM Power : 47.08856391906738 W
[codecarbon INFO @ 10:05:56] Energy consumed for all GPUs : 0.002894 kWh. All GPUs Power : 348.398 W
[codecarbon INFO @ 10:05:56] Energy consumed for all CPUs : 0.000347 kWh. All CPUs Power : 40.59553536814495 W
[codecarbon INFO @ 10:05:56] 0.003634 kWh of electricity used since the begining.
[codecarbon INFO @ 10:06:02] Energy consumed for RAM : 0.000472 kWh. RAM Power : 47.08856391906738 W
[codecarbon INFO @ 10:06:02] Energy consumed for all GPUs : 0.003341 kWh. All GPUs Power : 262.50100000000003

TrainOutput(global_step=1377, training_loss=0.29535167807673923, metrics={'train_runtime': 36.1305, 'train_samples_per_second': 304.563, 'train_steps_per_second': 38.112, 'total_flos': 406183858377360.0, 'train_loss': 0.29535167807673923, 'epoch': 3.0})
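
The energy figures in the `codecarbon` log lines above can be reproduced from the reported power draw. The sketch below assumes a 15-second sampling interval, which is inferred from the log timestamps (`10:05:41` and `10:05:56`) rather than stated anywhere in the notebook; the helper `energy_kwh` is hypothetical.

```python
def energy_kwh(power_watts, seconds):
    """Energy in kWh drawn at a constant power over a time interval."""
    return power_watts * seconds / 3_600_000  # 1 kWh = 3.6 million joules


interval_s = 15  # assumed sampling interval, inferred from the log timestamps

# Power values copied from the first set of log lines of this run.
gpu = energy_kwh(346.274, interval_s)
ram = energy_kwh(47.08856391906738, interval_s)
cpu = energy_kwh(42.66688583978885, interval_s)

print(round(gpu, 6), round(ram, 6), round(cpu, 6))  # 0.001443 0.000196 0.000178
print(round(gpu + ram + cpu, 6))                    # 0.001817
```

These values agree with the first logged sample, and they put the cost of fine-tuning in perspective: the whole run consumes only a few thousandths of a kilowatt-hour.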