<a href="https://colab.research.google.com/github/subratamondal1/transformers-for-natural-language-processing/blob/main/02_Transfer_Learning_and_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer Learning

Transfer learning is a machine learning ***technique*** that **reuses a pre-trained model on a new problem** that is different but related to the original one. It allows the machine to exploit the knowledge gained from a previous task to improve generalization or speed up the training process for the new task.

Transfer learning can be applied to any model, but it is especially useful for transformers, which are large and require a lot of data and computational power to train from scratch. By starting from a model that was pre-trained on a large and diverse dataset (for example, ImageNet for vision models or Wikipedia for language models), we can reuse the features and patterns the model has already learned and adapt them to a new task that has less data or fewer resources.

For example, we can take a transformer pre-trained on natural language and fine-tune it for sentiment analysis, text summarization, or question answering. This usually gives better performance at a lower cost than training a new model from scratch.

* Transfer learning is what makes transformers powerful and practical for many real-world applications.

* Transfer learning was first used in the field of **Computer Vision**.

Even though a CNN is pre-trained as an image classifier, the features it has learned can be reused for other tasks, such as:

- **Face Recognition**: CNNs can be used to identify and verify the faces of people in images or videos. For example, CNNs can be used to `unlock smartphones, tag friends on social media, or detect criminals in surveillance footage`.

- **Medical Image Computing**: CNNs can be used to analyze and diagnose medical images, such as X-rays, MRI scans, or CT scans. For example, CNNs can be used to `detect tumors, fractures, or diseases in various organs`.

- **Text Generation**: Image features extracted by a CNN can be passed to a language model to generate text about an image. For example, `a CNN encoder paired with a text decoder can be used to caption images`.


> **Note** that Transfer Learning can be used for any task - we just need the right number of **outputs, activations, loss function**, etc.
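
As a rough illustration of the note above, the sketch below pairs a few common task types with the output size, final activation, and loss that are conventionally used for them. The function name `head_for_task` and the task labels are hypothetical, chosen only for this example; they are not part of any library.

```python
def head_for_task(task: str, num_classes: int = 2) -> dict:
    """Return a conventional output-layer setup for a task type (illustrative only)."""
    if task == "binary_classification":
        # One logit squashed to a probability, trained with binary cross-entropy.
        return {"outputs": 1, "activation": "sigmoid", "loss": "binary_cross_entropy"}
    if task == "multiclass_classification":
        # One logit per class, normalized with softmax, trained with cross-entropy.
        return {"outputs": num_classes, "activation": "softmax", "loss": "cross_entropy"}
    if task == "regression":
        # A single unbounded value, trained with mean squared error.
        return {"outputs": 1, "activation": "identity", "loss": "mean_squared_error"}
    raise ValueError(f"unknown task type: {task}")


print(head_for_task("multiclass_classification", num_classes=3))
```

The pre-trained body of the model stays the same in all three cases; only this final piece changes with the task.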

# Transfer Learning vs Fine-Tuning

Transfer learning and fine-tuning are two related but distinct techniques in machine learning. They both involve **reusing a pre-trained model** on a new task/dataset, but they differ in how they update the model parameters.

**Transfer learning** is when a model developed for one task is reused on a second task that is different from but related to the original one. The pre-trained layers are typically frozen so that they keep the general features they have learned, and only the newly added layers are trained.
* For example, we can use a pre-trained model for image classification to extract features from images and use them as inputs for another model that performs a different task, such as face recognition or medical image analysis.

**Fine-tuning** goes a step further than transfer learning: you replace the model's output layer to fit the new task and then continue training the pre-trained weights on the new data, usually with a smaller learning rate and often with a different training objective.

* For example, we can take a pre-trained language model and fine-tune it for a specific task, such as text classification, question answering, or text generation.

The **main difference** between transfer learning and fine-tuning is that transfer learning freezes the pre-trained model and only trains the new layers, while fine-tuning unfreezes the pre-trained model and trains it along with the new layers. Transfer learning is usually faster and less prone to overfitting, while fine-tuning can achieve better performance and adaptability. The choice between transfer learning and fine-tuning depends on the similarity between the original task and the new task, the size and quality of the new dataset, and the computational resources available.
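
The distinction can be sketched with a toy example. This is plain Python with made-up numbers, not a real training loop: `sgd_step`, the parameter names, and the gradient values are all hypothetical. The only point is that frozen parameters are skipped during the update.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one gradient-descent step, skipping parameters that are frozen."""
    return {
        name: (value - lr * grads[name]) if trainable[name] else value
        for name, value in params.items()
    }


# A toy "model": one pre-trained weight and one newly added head weight.
params = {"backbone.weight": 1.0, "head.weight": 0.0}
grads = {"backbone.weight": 0.5, "head.weight": 2.0}  # made-up gradients

# Transfer learning (as defined above): the backbone is frozen.
frozen = sgd_step(params, grads, {"backbone.weight": False, "head.weight": True})

# Fine-tuning: every parameter is trainable.
tuned = sgd_step(params, grads, {"backbone.weight": True, "head.weight": True})

print(frozen)  # backbone unchanged, head updated
print(tuned)   # both updated
```

In PyTorch, the same effect is usually obtained by setting `requires_grad = False` on the parameters that should stay fixed.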

# How to Fine-Tune

* Hugging Face **Datasets**
* Hugging Face **Trainer** and **TrainingArguments**
* **Computing metrics** during evaluation
* **Saving** and using the trained model

## Fine-Tuning Sentiment Analysis with the GLUE Benchmark

**The GLUE Benchmark**

* Researchers use benchmark datasets to test their models on a common task for fair comparison (e.g. MNIST and ImageNet for computer vision).

* Similarly, GLUE is a benchmark for NLP that consists of multiple datasets/tasks.

* The SST2 task is for Sentiment Analysis

In [137]:
! pip install transformers datasets -qq

In [138]:
from datasets import load_dataset # use datasets library
import numpy as np # use numpy library

In [139]:
raw_datasets = load_dataset(
    path = "glue",
    name = "sst2"
)

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [140]:
raw_datasets["train"].to_pandas()

| | sentence | label | idx |
|---|---|---|---|
| 0 | hide new secretions from the parental units | 0 | 0 |
| 1 | contains no wit , only labored gags | 0 | 1 |
| 2 | that loves its characters and communicates som... | 1 | 2 |
| 3 | remains utterly satisfied to remain the same t... | 0 | 3 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 | 4 |
| ... | ... | ... | ... |
| 67344 | a delightful comedy | 1 | 67344 |
| 67345 | anguish , anger and frustration | 0 | 67345 |
| 67346 | at achieving the modest , crowd-pleasing goals... | 1 | 67346 |
| 67347 | a patient viewer | 1 | 67347 |


In [141]:
type(raw_datasets["train"])

datasets.arrow_dataset.Dataset

In [142]:
raw_datasets["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

* **LABEL_0** is **negative**
* **LABEL_1** is **positive**

In [143]:
raw_datasets["train"].data

MemoryMappedTable
sentence: string
label: int64
idx: int32
----
sentence: [["hide new secretions from the parental units ","contains no wit , only labored gags ","that loves its characters and communicates something rather beautiful about human nature ","remains utterly satisfied to remain the same throughout ","on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ",...,"you wish you were at home watching that movie instead of in the theater watching this one ","'s no point in extracting the bare bones of byatt 's plot for purposes of bland hollywood romance ","underdeveloped ","the jokes are flat ","a heartening tale of small victories "],["suspense , intriguing characters and bizarre bank robberies , ","a gritty police thriller with all the dysfunctional family dynamics one could wish for ","with a wonderful ensemble cast of characters that bring the routine day to day struggles of the working class to life ","nonetheless appreciates the art and reveals a music sc

In [144]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path = checkpoint,
)

tokenizer(raw_datasets["train"][0:2]["sentence"])

{'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [145]:
def tokenize_fn(batch):
  return tokenizer(batch["sentence"], truncation = True)

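Why only truncation and no padding? Because padding is applied later, per batch: when the `Trainer` is given a tokenizer, it pads each batch to the length of its longest sequence at training time.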
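
The padding step can be pictured with a plain-Python sketch. This is not the library's own collator; `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the `pad_token_id` in the model config printed later in this notebook.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The two tokenized sentences shown earlier (lengths 10 and 11).
batch = [
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
    [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102],
]
padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])  # [11, 11]
print(padded["attention_mask"][0])                # last position is 0
```

Padding per batch rather than to one global length keeps short batches short, which saves computation on a dataset like SST-2 where many sentences are only a few tokens long.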

In [146]:
tokenized_datasets = raw_datasets.map(tokenize_fn, batched = True)
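
With `batched = True`, the mapped function receives a dictionary of columns (each a list of values) rather than a single row, and the columns it returns are added to the dataset. The toy below imitates that behavior in plain Python, under the assumption that a whitespace split can stand in for the real tokenizer; `fake_tokenize_fn` and `map_batched` are hypothetical names and not part of the `datasets` library.

```python
def fake_tokenize_fn(batch):
    """Stand-in for tokenize_fn: splits on whitespace instead of using a real tokenizer."""
    return {"tokens": [sentence.split() for sentence in batch["sentence"]]}


def map_batched(columns, fn, batch_size=2):
    """Toy version of a batched map: call fn on slices of the columns, then merge."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; returned columns are added alongside them.
    return {**columns, **new_columns}


columns = {
    "sentence": ["a delightful comedy", "anguish , anger and frustration", "a patient viewer"],
    "label": [1, 0, 1],
}
result = map_batched(columns, fake_tokenize_fn)
print(result["tokens"])
```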

In [147]:
tokenized_datasets["train"].to_pandas()

| | sentence | label | idx | input_ids | attention_mask |
|---|---|---|---|---|---|
| 0 | hide new secretions from the parental units | 0 | 0 | [101, 5342, 2047, 3595, 8496, 2013, 1996, 1864... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] |
| 1 | contains no wit , only labored gags | 0 | 1 | [101, 3397, 2053, 15966, 1010, 2069, 4450, 209... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] |
| 2 | that loves its characters and communicates som... | 1 | 2 | [101, 2008, 7459, 2049, 3494, 1998, 10639, 201... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] |
| 3 | remains utterly satisfied to remain the same t... | 0 | 3 | [101, 3464, 12580, 8510, 2000, 3961, 1996, 216... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 | 4 | [101, 2006, 1996, 5409, 7195, 1011, 1997, 1011... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| ... | ... | ... | ... | ... | ... |
| 67344 | a delightful comedy | 1 | 67344 | [101, 1037, 26380, 4038, 102] | [1, 1, 1, 1, 1] |
| 67345 | anguish , anger and frustration | 0 | 67345 | [101, 21782, 1010, 4963, 1998, 9135, 102] | [1, 1, 1, 1, 1, 1, 1] |
| 67346 | at achieving the modest , crowd-pleasing goals... | 1 | 67346 | [101, 2012, 10910, 1996, 10754, 1010, 4306, 10... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] |
| 67347 | a patient viewer | 1 | 67347 | [101, 1037, 5776, 13972, 102] | [1, 1, 1, 1, 1] |


In [148]:
! pip install accelerate -Uqq
!pip install transformers[torch] -qq

In [149]:
from transformers import TrainingArguments

In [150]:
training_args = TrainingArguments(
    output_dir = "my_trainer",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    num_train_epochs = 1
)

In [151]:
from transformers import AutoModelForSequenceClassification

In [152]:
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path = checkpoint,
    num_labels = 2
)

model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [153]:
type(model)

transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification

In [154]:
! pip install torchinfo -qq

In [155]:
from torchinfo import summary
summary(
    model = model,
    input_size = (16, 512),
    dtypes = ["torch.IntTensor"],
    device = "cpu"
  )

Layer (type:depth-idx)                                  Output Shape              Param #
DistilBertForSequenceClassification                     [16, 2]                   --
├─DistilBertModel: 1-1                                  [16, 512, 768]            --
│    └─Embeddings: 2-1                                  [16, 512, 768]            --
│    │    └─Embedding: 3-1                              [16, 512, 768]            23,440,896
│    │    └─Embedding: 3-2                              [1, 512, 768]             393,216
│    │    └─LayerNorm: 3-3                              [16, 512, 768]            1,536
│    │    └─Dropout: 3-4                                [16, 512, 768]            --
│    └─Transformer: 2-2                                 [16, 512, 768]            --
│    │    └─ModuleList: 3-5                             --                        42,527,232
├─Linear: 1-2                                           [16, 768]                 590,592
├─Dropout: 1-3                 
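
The parameter counts in this summary can be reproduced by hand. The sketch below uses the sizes reported for this model (`vocab_size`, `dim`, `hidden_dim`, `n_layers`, and the 512 positions). The layout of the part of each transformer block that is cut off in the printout above (two feed-forward linear layers and a second layer norm) is an assumption based on the standard DistilBERT architecture.

```python
vocab_size, dim, max_positions = 30522, 768, 512
hidden_dim, n_layers = 3072, 6


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


layer_norm = 2 * dim  # one scale and one shift value per feature

word_embeddings = vocab_size * dim
position_embeddings = max_positions * dim

# One transformer block: four attention projections, two layer norms, two FFN layers.
block = (
    4 * linear_params(dim, dim)
    + layer_norm
    + linear_params(dim, hidden_dim)
    + linear_params(hidden_dim, dim)
    + layer_norm
)

print(word_embeddings)          # 23440896
print(position_embeddings)      # 393216
print(n_layers * block)         # 42527232
print(linear_params(dim, dim))  # 590592 (the 768 -> 768 Linear layer after the encoder)
```

The totals match the `Param #` column, which suggests the assumed block layout is the right one for this checkpoint.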

In [156]:
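# Snapshot every parameter before training so it can be compared afterwards.
# .copy() guards against the NumPy array sharing memory with a CPU tensor.
params_before = []

for name, p in model.named_parameters():
  params_before.append(p.detach().cpu().numpy().copy())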

In [157]:
from transformers import Trainer
from datasets import load_metric # deprecated in newer `datasets` releases in favor of the `evaluate` library

In [158]:
metric = load_metric(
    "glue",
    "sst2"
)

In [159]:
# random inputs
metric.compute(
    predictions = [0,1,0], # predicted label
    references = [0,0,1] # target label
)

{'accuracy': 0.3333333333333333}
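
For SST-2 the metric is plain accuracy, so the value above can be checked by hand. A minimal sketch, where `accuracy` is a hypothetical helper written for this example:

```python
def accuracy(predictions, references):
    """Fraction of positions where the predicted label equals the target label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


print(accuracy([0, 1, 0], [0, 0, 1]))  # 0.3333333333333333
```

Only the first position matches, so one of three predictions is correct.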

In [160]:
import torch

In [161]:
# Wrap the metric in a function, because the Trainer passes (logits, labels) rather than predicted class ids
def compute_metrics(logits_and_labels:tuple):
  logits, labels = logits_and_labels
  predictions = np.argmax(logits, axis = -1)
  return metric.compute(
      predictions = predictions,
      references = labels
  )
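
To see what the `argmax` step does, here is a small standalone example with made-up logits. The numbers are invented for illustration and do not come from the model.

```python
import numpy as np

# Made-up logits for three examples and two classes (negative, positive).
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, -0.2],
])
labels = np.array([0, 1, 0])

# The index of the largest logit in each row is the predicted class id.
predictions = np.argmax(logits, axis=-1)
print(predictions)                     # [0 1 1]
print((predictions == labels).mean())  # 2 of 3 correct
```

No softmax is needed before the `argmax`: softmax preserves the ordering of the logits, so the largest logit is also the largest probability.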

In [162]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
)

In [163]:
# let's train the model on the data
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2157,0.333749,0.909404


TrainOutput(global_step=8419, training_loss=0.2644536002173052, metrics={'train_runtime': 438.4499, 'train_samples_per_second': 153.607, 'train_steps_per_second': 19.202, 'total_flos': 518596929468840.0, 'train_loss': 0.2644536002173052, 'epoch': 1.0})
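
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The batch size of 8 is an assumption: it is the `TrainingArguments` default per device, and the cell above does not override it, so the sketch assumes a single device.

```python
import math

num_train_rows = 67349    # size of the SST-2 training split
batch_size = 8            # assumed: the TrainingArguments default on one device
num_epochs = 1
train_runtime = 438.4499  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_train_rows / batch_size)
global_step = steps_per_epoch * num_epochs

print(global_step)                               # 8419
print(round(num_train_rows / train_runtime, 3))  # compare with train_samples_per_second
print(round(global_step / train_runtime, 3))     # compare with train_steps_per_second
```

The step count matches `global_step=8419`, which is consistent with the assumed batch size.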

In [164]:
trainer.save_model("my_saved_model")

In [165]:
!ls

airline-sentiment-tweets  data.csv  my_saved_model  my_trainer	sample_data


In [166]:
!ls my_saved_model

config.json	   special_tokens_map.json  tokenizer.json     vocab.txt
model.safetensors  tokenizer_config.json    training_args.bin


`my_saved_model` is the folder that holds our trained model. It contains everything the Hugging Face `pipeline` needs to load it: the weights, the config, and the tokenizer files.

In [167]:
from transformers import pipeline

In [168]:
my_trained_model = pipeline(
    task = "text-classification",
    model = "my_saved_model"
)

In [169]:
result = my_trained_model("This movie sucks, don't know why did I agree to watch it though.")
result

[{'label': 'LABEL_0', 'score': 0.9934436082839966}]

In [170]:
!cat my_saved_model/config.json

{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "vocab_size": 30522
}


In [171]:
import json
config_path = "my_saved_model/config.json"

with open(file = config_path) as f:
  j = json.load(f)

j["id2label"] = {0:"negative", 1:"positive"}

with open(file = config_path, mode="w") as f:
  json.dump(j, f, indent = 2)
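
The saved config has no `id2label` entry, which is why the pipeline reported the generic name `LABEL_0`. Adding the mapping gives readable labels the next time the model is loaded from that folder. The standalone sketch below shows what the JSON round trip does to the mapping. It works on an in-memory dictionary rather than the real config file, and the `label2id` entry is an addition for illustration that the cell above does not write.

```python
import json

config = {"model_type": "distilbert"}
config["id2label"] = {0: "negative", 1: "positive"}
# The reverse mapping; adding it here is an assumption, the original cell sets only id2label.
config["label2id"] = {name: idx for idx, name in config["id2label"].items()}

# JSON object keys are always strings, so the integer ids are written as "0" and "1".
restored = json.loads(json.dumps(config, indent=2))
print(restored["id2label"])   # {'0': 'negative', '1': 'positive'}
print(restored["label2id"])   # {'negative': 0, 'positive': 1}
```

A pipeline created before the file was edited keeps the old labels, so it would need to be created again to pick up the new names.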

In [172]:
params_after = []

for name, p in model.named_parameters():
  params_after.append(p.detach().cpu().numpy())

In [173]:
for p1, p2 in zip(params_before,params_after):
  print(np.sum(np.abs(p1-p2)))

13343.587
89.95245
1.8103788
1.1018606
1299.748
1.622119
1287.1423
0.0033159768
1188.93
1.0522709
1121.815
0.8296045
1.747123
0.83485544
4934.3213
5.7018785
4514.0356
0.7062712
1.592768
0.6393697
1278.6615
1.3608233
1267.9501
0.002905826
1098.2262
0.8267909
1047.7579
0.7257391
1.6331531
0.7305273
4884.7554
5.308346
4449.6885
0.7054193
1.5153456
0.7100934
1276.245
1.5411466
1289.016
0.0025998633
1107.9624
0.7786826
1093.9341
0.71520805
1.5335522
0.7910348
4938.441
5.6657248
4356.6494
0.6755383
1.4680973
0.69340146
1287.8444
1.4287503
1303.4701
0.0028647142
1163.6178
0.7275519
1119.0985
0.7275117
1.4476287
0.71741545
4780.397
5.3437295
4108.367
0.76159513
1.3230817
0.78689617
1193.1625
1.5182226
1180.3444
0.0021471973
961.22736
0.7428751
996.9521
0.87817174
1.362813
0.9176675
4289.3755
4.830091
3276.3987
0.6887346
1.305912
0.60231626
1120.3427
1.3709259
1136.3058
0.0011294237
907.46313
0.6721538
900.2316
0.80879134
1.228035
1.0860406
3444.561
4.3018227
3008.7441
0.87695414
1.2556564
0.69

Every difference shown is non-zero, which confirms that fine-tuning changed the model's parameters, including those of the pre-trained layers. This is exactly what we wanted.
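
A before/after comparison like this only works if the "before" values are stored independently of the live parameters. The NumPy sketch below shows the pitfall with a stand-in array rather than real model weights.

```python
import numpy as np

weights = np.zeros(3, dtype=np.float32)  # stand-in for one parameter tensor

view = weights             # no copy: tracks every later in-place change
snapshot = weights.copy()  # independent copy of the values right now

weights += 0.5             # stand-in for an in-place training update

print(np.sum(np.abs(view - weights)))      # 0.0, the "before" values were overwritten
print(np.sum(np.abs(snapshot - weights)))  # 1.5, the change is visible
```

This matters for the real model because calling `.numpy()` on a CPU tensor returns an array that shares memory with the tensor. When training runs on a GPU the parameters are moved to another device first, so the saved arrays stay untouched. On a CPU-only run, an explicit `.copy()` is the safer choice.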

## Summary


1. **Load the important libraries**
```python
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
```

2. **Load the dataset**
```python
dataset = load_dataset(
    path = "glue",
    name = "sst2"
)
```