**Hugging Face APIs for fine-tuning NLP models (Part 2 - Final Finetuning)**

Now that we know about the MRPC dataset, tokenizers, and models, let's finally fine-tune the "bert-base-uncased" model for the task described in the previous notebook.

*Note: enable the GPU in Colab for faster training.*

In [1]:
'''
First, let's install transformers and datasets
'''
!pip install transformers
!pip install datasets





---


Now we FINALLY move to fine-tuning. The Transformers library from Hugging Face provides the Trainer class to fine-tune pretrained models.

**Jargon Alert!!**

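*DataCollator:* A data collator assembles individual examples from our dataset into batches. The DataCollatorWithPadding we use here also pads every example in a batch to the length of the longest one in that batch. More information [here](https://huggingface.co/transformers/main_classes/data_collator.html)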
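
To make the collator's job a bit more concrete, here is a minimal sketch of padding a batch to its longest sequence. It is only an illustration in plain Python, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are illustrative, and `pad_id=0` is an assumption (it happens to be the padding id of "bert-base-uncased").

```python
# Minimal sketch of dynamic padding. `pad_batch` is a hypothetical helper,
# not part of the transformers API.
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized examples of different lengths (illustrative ids)
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to one global maximum length keeps short batches small, which is why we did not pad inside the tokenize function.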


---



In [2]:
'''
STEP 1: MAKE THE DATASET AND DATA COLLATOR
Summary of the previous notebook (data preprocessing to build our dataset in the correct form)
'''
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

#dataset
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"

#tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

#required dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

#data collator used by Trainer to make batches
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



In [3]:
'''
STEP 2: SET UP TRAININGARGUMENTS
Create a TrainingArguments object containing all the hyperparameters. The only required argument is the
directory where the trained model will be saved; we leave the rest at their defaults.
'''

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [4]:
'''
STEP 3: DEFINE THE MODEL

You get a warning after instantiating this pretrained model.
This is because BERT has not been pretrained on classifying pairs of sentences, so the head
of the pretrained model has been discarded and a new head suitable for sequence classification
has been added instead. The warnings indicate that some weights were not used (the ones corresponding
to the dropped pretraining head) and that some others were randomly initialized (the ones for the new
head). The warning concludes by encouraging you to train the model, which is exactly what we are going to do now.
'''
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [6]:
'''
STEP 4: MAKE THE TRAINER

Note that when you pass the tokenizer as we did here, 
the default data_collator used by the Trainer will be a DataCollatorWithPadding as defined previously, 
so you can skip the line data_collator=data_collator in this call.
'''
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator = data_collator,
    tokenizer=tokenizer,
)


In [7]:
'''
STEP 5: TRAIN!
Note that this should ideally be done on a GPU for faster training (enable the GPU in Colab).
'''
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


| Step | Training Loss |
|------|---------------|
| 500  | 0.6076        |
| 1000 | 0.5001        |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1377, training_loss=0.5058202251802776, metrics={'train_runtime': 519.9313, 'train_samples_per_second': 21.164, 'train_steps_per_second': 2.648, 'total_flos': 405470580750720.0, 'train_loss': 0.5058202251802776, 'epoch': 3.0})
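
The "Total optimization steps = 1377" line in the log above follows directly from the dataset size, the batch size, and the number of epochs. Here is a small sketch of that arithmetic. The numbers are copied from the log; the 500-step checkpoint interval is an assumption that matches the checkpoint names we see (checkpoint-500 and checkpoint-1000).

```python
import math

# Values copied from the training log above
num_examples = 3668
batch_size = 8
num_epochs = 3

# The last, smaller batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Assumption: a checkpoint is written every 500 steps, which matches the log
save_every = 500
checkpoint_steps = list(range(save_every, total_steps + 1, save_every))

print(steps_per_epoch, total_steps, checkpoint_steps)  # 459 1377 [500, 1000]
```

This also explains why only two training-loss values were reported: with 1377 steps in total, only steps 500 and 1000 fall on the 500-step interval.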


---


Hey wait! How do we know whether the model performed well? The cell above only trained the model: we set no evaluation strategy and no metric, so the validation set was never actually used!

So let's add them!


---


In [8]:
'''
We can load the metrics associated with the MRPC dataset as easily as
we loaded the dataset, this time with the load_metric API.
(In recent releases of datasets, load_metric has been removed in favor of the separate evaluate library.)
Wrapping everything up, we get our compute_metrics function, which takes eval_preds (the evaluation results from our model)
'''

from datasets import load_metric
import numpy as np
def compute_metrics(eval_preds):
  metric = load_metric("glue", "mrpc")
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)
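
For MRPC, the GLUE metric reports accuracy and F1. To show roughly what happens to the logits inside compute_metrics, here is a plain-Python sketch on made-up values. `accuracy_and_f1` is a hypothetical helper written for illustration, not the library's implementation, and it assumes label 1 is the positive class.

```python
# Hypothetical helper: mirrors argmax + accuracy/F1 on toy data
def accuracy_and_f1(logits, labels):
    # argmax over each row of logits gives the predicted class index
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Made-up logits for four sentence pairs and their true labels
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]
scores = accuracy_and_f1(toy_logits, toy_labels)
print(scores)  # {'accuracy': 0.5, 'f1': 0.5}
```

F1 is worth reporting next to accuracy because it looks only at the positive (paraphrase) class, so it is less easily flattered by a model that mostly predicts one label.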

In [9]:
#defining new Trainer with our compute_metrics function and evaluation dataset
from transformers import Trainer

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "

In [10]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


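| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.378523        | 0.823529 | 0.875    |
| 2     | 0.503200      | 0.613996        | 0.830882 | 0.888169 |
| 3     | 0.258300      | 0.734155        | 0.860294 | 0.903226 |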


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8



Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set  don't hav

TrainOutput(global_step=1377, training_loss=0.3026201307730304, metrics={'train_runtime': 534.1297, 'train_samples_per_second': 20.602, 'train_steps_per_second': 2.578, 'total_flos': 405470580750720.0, 'train_loss': 0.3026201307730304, 'epoch': 3.0})



---


Now time for inference using a custom pair of sentences!


---



In [17]:
import torch

inputs = tokenizer("the man who drank my coffee is not alive", "The dead person drank my coffee", return_tensors="pt")

# move the inputs to the same device as the model (the GPU, if one was used for training)
inputs = inputs.to(model.device)

outputs = model(**inputs)

# apply softmax to turn the logits into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

# predicted class index for the sentence pair
results = torch.argmax(predictions, dim=-1)
print(results)
# mapping from class indices to label names
print(model.config.id2label)


tensor([[8.6778e-04, 9.9913e-01]], device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([1], device='cuda:0')
{0: 'LABEL_0', 1: 'LABEL_1'}
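
The generic names LABEL_0 and LABEL_1 appear because we never gave the model any label names. As a rough sketch of what the softmax and argmax steps do, and how an index could be mapped to a readable name, here is a plain-Python version. The logits are made up, and the label order (`not_equivalent`, `equivalent`) is assumed to follow the MRPC label feature in GLUE.

```python
import math

def softmax(logits):
    # subtract the maximum first for numerical stability
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label order of the MRPC "label" feature
label_names = ["not_equivalent", "equivalent"]

toy_logits = [-3.5, 3.5]  # made-up logits for one sentence pair
probs = softmax(toy_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(label_names[predicted], probs)
```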




---


So the predicted label is 1, which for MRPC means the two sentences are paraphrases. Let's try another example, shall we?


---



In [19]:
import torch

inputs = tokenizer("the man who drank my coffee is not alive", "The dead person ate my food", return_tensors="pt")

# move the inputs to the same device as the model (the GPU, if one was used for training)
inputs = inputs.to(model.device)

outputs = model(**inputs)

# apply softmax to turn the logits into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

# predicted class index for the sentence pair
results = torch.argmax(predictions, dim=-1)
print(results)
# mapping from class indices to label names
print(model.config.id2label)

tensor([[0.9969, 0.0031]], device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([0], device='cuda:0')
{0: 'LABEL_0', 1: 'LABEL_1'}




---
Looking good! So this is how we train a model for NLP tasks using Hugging Face!


---


