In [1]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only the participants in SupportVectors workshops are allowed to study the notebooks for educational purposes currently, but is prohibited from copying or using it for any other purposes without written permission.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="60px">  Fine tuning the Transformers

Sometimes, a pretrained model may not exactly fit our task, and so the inference performance will leave scope for improvement. At such a time, if additional task-specific training data is available, we can fine tune the models. This is the second part of the transfer learning journey: take a pretrained model, and fine tune it by further training it on a task-specific dataset for a few epochs.

It is worth noting that pre-training is oftentimes self-supervised; in other words, labeled data is not necessary. For example, we can pre-train masked language models simply by taking a vast corpus of documents and decomposing them into sentences. Each sentence then becomes a datum for pre-training as an `<input, label>` pair, simply by randomly masking a few words and treating the masked words as labels.
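To make the idea of extracting `<input, label>` pairs from unlabeled text concrete, here is a minimal sketch in plain Python. The function name `make_mlm_pair`, the `mask_prob` parameter, and the example sentence are hypothetical choices for illustration; the actual BERT procedure works on subword tokens and is more elaborate.

```python
import random

MASK = "[MASK]"  # placeholder token; the name mirrors BERT's convention


def make_mlm_pair(sentence, mask_prob=0.15, rng=None):
    """Turn one unlabeled sentence into an (input, label) pair by masking words."""
    if rng is None:
        rng = random.Random()
    masked, labels = [], {}
    for position, word in enumerate(sentence.split()):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[position] = word  # the hidden word becomes the label
        else:
            masked.append(word)
    return masked, labels


# A seeded generator keeps the illustration reproducible.
masked, labels = make_mlm_pair(
    "the quick brown fox jumps over the lazy dog",
    mask_prob=0.3,
    rng=random.Random(0),
)
print(masked)
print(labels)
```

No human annotation is involved: the labels are simply the words that were hidden from the input.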

Therefore, pre-training is done with:

* vast quantities of data, and consequently a large number of training steps over the mini-batches
* using unlabeled data most of the time (i.e., as self-supervised learning), by cleverly extracting `<input, label>` pairs as training data instances from the data

On the other hand, the fine-tuning of the models involves:

* labeled data for the specific task
* datasets that **generally need not be big; indeed, small datasets mostly suffice**
* a much shorter training cycle, with a few epochs over the task-specific data
* **far fewer hardware resources**
* because of the above two, far less (fine-tuning) training time
* and, consequently, a **far smaller carbon footprint or environmental impact**!


Therefore, wherever feasible, we should prefer transfer learning, that is, fine-tuning a pretrained model, over training a model from scratch.

The `Huggingface` core libraries make it rather easy to fine-tune the pretrained models. For this, we will use the following libraries:

* `datasets`: the library that makes it easy to load and use a vast number of publicly available datasets, many of which belong to well-known benchmarks. For example, in this lab we will use the `mrpc` dataset from the `GLUE` benchmark for NLP tasks.

* `transformers` the main library of models and tokenizers, etc.

* `evaluate` an excellent library to evaluate the inference performance of the trained models, and to compare performances across models.

#### Installation

Quite likely, by now you have installed each of the libraries below on your workstation. Otherwise, uncomment the cell below and run it.

In [2]:
#
# Uncomment this only if needed.
#
#!pip install datasets transformers evaluate --upgrade

### Load the `mrpc` dataset
This dataset contains sentence pairs; the label specifies whether the second sentence is a paraphrase of the first.

In [3]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")

Let us explore it a bit.

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

#### Observations

The `raw_datasets` object comprises three datasets (`train`, `validation`, `test`), stored in a dictionary-like `DatasetDict` object. This is rather convenient, since the data has already been split into the train-validate-test subsets.

We also note that there are only `3668 + 408 + 1725 = 5801` instances of data. This is a rather small dataset, with only `3668` instances available for fine-tuning a pre-trained model.

Also note that the input $\mathbf{X}$ is derived from a pair of sentences `<sentence1, sentence2>`. The output is, of course, marked as the `label`.


### Load pretrained models

Let us now load the pretrained tokenizer and transformer models. This is nearly identical to what we did in the previous labs.

There is however a subtlety when using the tokenizer. So far, we have been passing only one sentence to the tokenizer. But here we have pairs of sentences in this dataset. Fortunately, the tokenizers for `BERT` do accept pairs of sentences.

Recall that in our original exploration of the `BERT` research paper, we saw that the model was pre-trained on inputs consisting of sentence pairs.
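As a rough picture of what a sentence-pair encoding looks like, the toy sketch below lays out the two sentences the way `BERT` expects them: `[CLS] sentence1 [SEP] sentence2 [SEP]`, with a segment id of `0` for the first part and `1` for the second. The function `encode_pair` and the example sentences are hypothetical, and whitespace splitting stands in for the real WordPiece tokenizer, which additionally returns integer `input_ids` and an `attention_mask`.

```python
def encode_pair(sentence1, sentence2):
    """Toy illustration of BERT's sentence-pair layout (whitespace tokens, not WordPiece)."""
    first = ["[CLS]"] + sentence1.lower().split() + ["[SEP]"]
    second = sentence2.lower().split() + ["[SEP]"]
    tokens = first + second
    # Segment ids: 0 for the first sentence (with [CLS] and its [SEP]), 1 for the second.
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = encode_pair("He ate the apple", "The apple was eaten")
for token, segment in zip(tokens, token_type_ids):
    print(f"{segment}  {token}")
```

The segment ids are what allow the model to tell the two sentences apart within a single input sequence.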

#### Data Collator

Observe a new animal we introduced to our zoo here, the `DataCollatorWithPadding`.

Data collators are responsible for forming (mini-) batches of the datasets. They preserve the form of the original data, but may perform some post-processing as needed.

For example, in our task, the sentences may be of variable length. So the `DataCollatorWithPadding` will pad each of the tokenized sequences in a mini-batch so that its length equals that of the longest sequence present in that mini-batch.
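The padding behavior described above can be sketched in a few lines of plain Python. This is only an illustration of per-batch (dynamic) padding, not the library's implementation; the helper `pad_batch`, the token ids, and the choice of `0` as the padding id are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one mini-batch to the length of its longest member."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two tokenized sequences of different lengths (illustrative ids).
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # both rows now have length 5
print(batch["attention_mask"])
```

Because the target length is decided per mini-batch, a batch of short sentences is not padded out to the longest sentence in the whole dataset.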

<img src="images/data-prep-for-fine-tuning.jpeg" />

In [5]:
from transformers import AutoTokenizer, DataCollatorWithPadding
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


#### Load the pretrained model checkpoint

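Next we load the pre-trained checkpoint into a model that carries a sequence-classification head. Only the `BERT` body comes from the checkpoint; the classification head is newly initialized, which is why the loader warns that the model should be trained on a downstream task.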

> The classification task here: inferring whether the `sentence2` is a **paraphrasing** of the `sentence1` for a given input.

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5287        |
| 1000 | 0.3293        |


TrainOutput(global_step=1377, training_loss=0.369309646285091, metrics={'train_runtime': 978.0259, 'train_samples_per_second': 11.251, 'train_steps_per_second': 1.408, 'total_flos': 405258858573360.0, 'train_loss': 0.369309646285091, 'epoch': 3.0})
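The numbers in this `TrainOutput` can be accounted for with a little arithmetic. The sketch below assumes the run used a per-device batch size of 8 on a single device (we did not set it explicitly, so this is an assumption about the defaults), together with the 3 epochs reported in the output.

```python
import math

train_rows = 3668         # size of the MRPC train split shown earlier
batch_size = 8            # assumed: default per-device batch size, single device
num_epochs = 3            # matches 'epoch': 3.0 in the output

steps_per_epoch = math.ceil(train_rows / batch_size)
global_step = steps_per_epoch * num_epochs
print(steps_per_epoch, global_step)  # 459 1377

train_runtime = 978.0259  # seconds, from the output
print(round(train_rows * num_epochs / train_runtime, 3))  # 11.251 samples per second
print(round(global_step / train_runtime, 3))              # 1.408 steps per second
```

Under these assumptions the computed values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output.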

### `Evaluate` library to compute model metrics

The `evaluate` library proves very helpful here in computing the model performance metrics. Each benchmark dataset has metrics relevant to the task the model is trained for; in this case, they are `accuracy` and the `F1` score. We can instantiate a `metric` object for a dataset with `evaluate.load()`, in much the same way that we used `load_dataset()` to load the data.

The `compute()` method takes as arguments what you would expect: the predictions and the ground-truth labels.

In [8]:
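import evaluate
import numpy as np

def compute_metrics(eval_preds):
    # Load the accuracy and F1 metrics associated with GLUE/MRPC.
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)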
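To see what `compute_metrics` is doing, the following dependency-free sketch reproduces the two quantities by hand: it takes the argmax over each row of logits and then computes accuracy and the F1 score of the positive (paraphrase) class. The helper names and the logits are hypothetical, and this is meant as an illustration of the definitions rather than a replacement for the `evaluate` metric.

```python
def argmax(row):
    """Index of the largest value in a list (what np.argmax does along the last axis)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Hypothetical logits for four sentence pairs, with columns [not paraphrase, paraphrase].
logits = [[0.2, 1.3], [1.1, -0.4], [-0.5, 0.9], [0.3, 0.8]]
labels = [1, 0, 0, 1]
print(accuracy_and_f1(logits, labels))  # {'accuracy': 0.75, 'f1': 0.8}
```

Here three of the four predictions are correct, and the single false positive lowers the precision, giving an F1 of 0.8.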

### The `Trainer` 

The trainer finally takes all the parts one would expect:

* some specific training arguments, such as the name of the training run, an evaluation-strategy, etc.
* `model` -- the model to train or fine tune
* the training and validation datasets
* the tokenizer
* the relevant data collator
* the function to compute the metrics


In [9]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Fine-tune the model

In order to fine-tune the pre-trained model, let the `trainer` now fire off the training epochs.

In [10]:
import evaluate  # requires the `evaluate` package; see the Installation cell above
trainer.train()

ModuleNotFoundError: No module named 'evaluate'