
In [1]:
%%capture
!pip install datasets transformers transformers[torch] accelerate

### Processing the data

In [None]:
# The MRPC dataset consists of 5,801 pairs of sentences, with a label indicating whether they are paraphrases (i.e., whether both sentences mean the same thing).
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")

In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
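As a quick sanity check, the three split sizes printed above can be added up and compared with the 5,801 pairs mentioned in the loading cell. This is a minimal sketch in plain Python; the numbers are copied by hand from the output, not read from the dataset object.

```python
# Split sizes copied from the DatasetDict printed above.
split_sizes = {"train": 3668, "validation": 408, "test": 1725}

total_pairs = sum(split_sizes.values())
print(total_pairs)  # 5801

# Share of each split in the whole dataset.
for name, size in split_sizes.items():
    print(f"{name}: {100 * size / total_pairs:.1f}%")
```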

In [4]:
# To access train data
raw_datasets['train'][0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [5]:
# To see the dataset features
raw_datasets['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [8]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer("this is sentence one","this is sentence two")

{'input_ids': [101, 2023, 2003, 6251, 2028, 102, 2023, 2003, 6251, 2048, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
tokenizer.decode(tokenizer("this is sentence one","this is sentence two")['input_ids'])

'[CLS] this is sentence one [SEP] this is sentence two [SEP]'
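The layout of a sentence pair can be reproduced by hand to see where the `token_type_ids` come from. The sketch below works on already-split words rather than real WordPiece tokens, and `encode_pair` is a hypothetical helper written for illustration, not part of the tokenizer API.

```python
def encode_pair(tokens_a, tokens_b):
    """Join two token lists in the [CLS] A [SEP] B [SEP] layout shown above."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = encode_pair(
    ["this", "is", "sentence", "one"], ["this", "is", "sentence", "two"]
)
print(" ".join(tokens))
print(token_type_ids)
print(attention_mask)
```

The printed `token_type_ids` and `attention_mask` match the tokenizer output from the earlier cell for the same two sentences.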

In [10]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
data = raw_datasets.map(tokenize_function,batched=True)


In [11]:
data['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [12]:
from pprint import pprint
print(data['train'][0]['input_ids'])
print(data['train'][0]['sentence1'])
print(data['train'][0]['sentence2'])
print(len(data['train'][0]['input_ids']))

[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]
Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
50


In [13]:
print(len(data['train'][0]['input_ids']),len(data['train'][2]['input_ids']))
# The input_ids have a different length for each example. A batch needs sequences of the same length, so padding is required.

50 47


In [14]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
sample = data['train'][:3]
sample = {k: v for k, v in sample.items() if k not in ["idx", "sentence1", "sentence2"]}
data_padding = data_collator(sample)
print(len(data_padding['input_ids'][0]),len(data_padding['input_ids'][2]))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


59 59
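What the collator does to a batch can be sketched in a few lines of plain Python: every sequence is padded to the length of the longest one in that batch. `pad_batch` is a hypothetical helper, and the middle length of 59 is an assumption inferred from the padded output above (the first and third examples are 50 and 47 tokens long).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length in this batch (dynamic padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Dummy token ids; only the lengths matter for this illustration.
batch = pad_batch([[7] * 50, [7] * 59, [7] * 47])
print([len(ids) for ids in batch["input_ids"]])  # [59, 59, 59]
print(sum(batch["attention_mask"][2]))  # 47 real tokens in the third example
```

Padding per batch rather than to a global maximum keeps short batches short, which is why later batches in this notebook have other widths.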


### Fine-tuning a model with the Trainer API


In [15]:
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification,AutoTokenizer,DataCollatorWithPadding
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

raw_datasets = load_dataset("glue", "mrpc")

def tokenize_function(ex):
    return tokenizer(ex['sentence1'],ex['sentence2'],truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function,batched=True)


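The first step before we can define our Trainer is to define a TrainingArguments object that will contain all the hyperparameters the Trainer will use for training and evaluation. Here only the output directory ('train-aeg') is given, so every other hyperparameter keeps its default value.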

In [16]:
from transformers import Trainer,TrainingArguments,AutoModelForSequenceClassification
checkpoint = "bert-base-uncased"
train_args = TrainingArguments('train-aeg')
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    train_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
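The Trainer above reports only the loss. MRPC is normally scored with accuracy and F1, and the arithmetic behind those two numbers can be sketched in plain Python. `compute_metrics_sketch` is a hypothetical helper working on lists; a function passed to the Trainer would receive NumPy arrays instead, so treat this as an illustration of the calculation only.

```python
def compute_metrics_sketch(logits, labels):
    """Accuracy and F1 for the positive class (label 1) from raw logits."""
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four sentence pairs, for illustration only.
metrics = compute_metrics_sketch(
    [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [-0.2, 0.9]], [1, 0, 0, 1]
)
print(metrics)
```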

### A full training using **PyTorch**


In [1]:
import torch
from torch.utils.data import DataLoader

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")


In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_fun(x):
    return tokenizer(x['sentence1'],x['sentence2'],truncation=True)
data = raw_datasets.map(tokenize_fun)


In [7]:
data['train'][0].keys()

dict_keys(['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])

In [8]:
data = data.remove_columns(column_names=['sentence1', 'sentence2','idx'])

In [9]:
data['train'].features

{'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [10]:
torch_data = data.with_format("torch")

In [11]:
torch_data['train'][0]

{'label': tensor(1),
 'input_ids': tensor([  101,  2572,  3217,  5831,  5496,  2010,  2567,  1010,  3183,  2002,
          2170,  1000,  1996,  7409,  1000,  1010,  1997,  9969,  4487, 23809,
          3436,  2010,  3350,  1012,   102,  7727,  2000,  2032,  2004,  2069,
          1000,  1996,  7409,  1000,  1010,  2572,  3217,  5831,  5496,  2010,
          2567,  1997,  9969,  4487, 23809,  3436,  2010,  3350,  1012,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1])}

In [12]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer)
# Shuffle only the training data; the validation order does not need to change between runs.
train_dataloader = DataLoader(torch_data['train'], batch_size=16, shuffle=True, collate_fn=data_collator)
val_dataloader = DataLoader(torch_data['validation'], batch_size=16, shuffle=False, collate_fn=data_collator)


In [13]:
for i in train_dataloader:
    break

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [14]:
i.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

In [15]:
i['token_type_ids'].size()

torch.Size([16, 79])

In [17]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [20]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [22]:
out = model(**i.to(device))
out.loss

tensor(0.6602, device='cuda:0', grad_fn=<NllLossBackward0>)
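The loss of 0.6602 is close to ln 2 ≈ 0.693, which is what cross-entropy gives when a two-class classifier assigns equal probability to both labels. That is consistent with the warning above that the classifier weights are newly initialized. A minimal sketch of the calculation, using a hand-written `cross_entropy` helper rather than the PyTorch one:

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Equal logits mean a 50/50 guess, whatever the true label is.
uninformed_loss = cross_entropy([0.0, 0.0], 1)
print(uninformed_loss)  # about 0.693, i.e. ln 2

# A confident, correct prediction drives the loss toward zero.
confident_loss = cross_entropy([-3.0, 3.0], 1)
print(confident_loss)
```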

In [23]:
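# transformers.AdamW is deprecated; the PyTorch implementation is the recommended replacement.
from torch.optim import AdamW
from transformers import get_scheduler
optimizer = AdamW(model.parameters(),lr = 1e-05)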
epoch = 3
num_training_steps = len(train_dataloader) * epoch
lr_schedular = get_scheduler('linear',
                             optimizer = optimizer,
                             num_warmup_steps = 0,
                             num_training_steps = num_training_steps)
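The numbers behind this cell can be worked out by hand. With 3,668 training pairs and a batch size of 16 there are 230 batches per epoch, so 690 steps over 3 epochs, and a linear schedule with no warmup lowers the learning rate in a straight line from 1e-5 to 0 over those steps. `linear_lr` below is a hypothetical re-implementation for illustration, not the library function.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer steps under linear warmup then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


steps_per_epoch = math.ceil(3668 / 16)  # the last batch is smaller than 16
total_steps = steps_per_epoch * 3
print(steps_per_epoch, total_steps)  # 230 690

for step in (0, 345, 690):
    print(step, linear_lr(step, 1e-5, 0, total_steps))
```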



In [24]:
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))

model.train()
for ep in range(epoch):
    total_loss = 0
    for batch in train_dataloader:
        batch = batch.to(device)  # move every tensor in the batch to the model's device
        out = model(**batch)
        loss = out.loss
        total_loss += loss.item()
        # When you call loss.backward(), all it does is compute gradient of loss w.r.t all the parameters in loss that have requires_grad = True
        # and store them in parameter.grad attribute for every parameter.
        loss.backward()
        # optimizer.step() updates all the parameters based on parameter.grad
        optimizer.step()
        # adjusting the learning rate during the training process
        lr_schedular.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    print(f"At {ep} epoch, training loss - {total_loss/len(train_dataloader)}")


  0%|          | 0/690 [00:00<?, ?it/s]

At 0 epoch, training loss - 0.5904001648011414
At 1 epoch, training loss - 0.4474819456105647
At 2 epoch, training loss - 0.3314490659405356
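The order backward, step, zero_grad in the loop matters because gradients accumulate until they are cleared. The toy below mimics that behaviour with a single parameter and a squared-error loss in plain Python; `Param`, `backward`, `step` and `zero_grad` are hypothetical stand-ins for the PyTorch calls, written only to show the mechanics.

```python
class Param:
    """A single trainable number with a gradient slot, like a one-element tensor."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(w, x, y):
    # d/dw of (w*x - y)**2, added to the existing grad (accumulation).
    w.grad += 2 * (w.value * x - y) * x


def step(w, lr):
    # Plain gradient descent update using whatever is stored in grad.
    w.value -= lr * w.grad


def zero_grad(w):
    w.grad = 0.0


# Two backward calls without zero_grad add their gradients together.
p = Param(1.0)
backward(p, 1.0, 2.0)
backward(p, 1.0, 2.0)
print(p.grad)  # -4.0

# The usual loop: backward, step, then clear the gradient.
w = Param(0.0)
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 20:
    backward(w, x, y)
    step(w, lr=0.05)
    zero_grad(w)
print(round(w.value, 3))  # close to 2.0, the slope of the data
```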


In [None]:
# training time - 02:29

### A full training with **Accelerate**


In [2]:
from accelerate import Accelerator
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding


In [None]:
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_fun(x):
    return tokenizer(x['sentence1'],x['sentence2'],truncation=True)
data = raw_datasets.map(tokenize_fun)
data = data.remove_columns(column_names=['sentence1', 'sentence2','idx'])
torch_data = data.with_format("torch")

data_collator = DataCollatorWithPadding(tokenizer)
# Shuffle only the training data; the validation order does not need to change between runs.
train_dataloader = DataLoader(torch_data['train'], batch_size=16, shuffle=True, collate_fn=data_collator)
val_dataloader = DataLoader(torch_data['validation'], batch_size=16, shuffle=False, collate_fn=data_collator)


In [4]:
from transformers import AutoModelForSequenceClassification
model_acc = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
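# transformers.AdamW is deprecated; the PyTorch implementation is the recommended replacement.
from torch.optim import AdamW
from transformers import get_scheduler
optimizer = AdamW(model_acc.parameters(),lr = 1e-05)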
epoch = 3
num_training_steps = len(train_dataloader) * epoch




In [6]:
accelerator = Accelerator()
model_acc,optimizer,train_dataloader,val_dataloader = accelerator.prepare(model_acc,optimizer,train_dataloader,val_dataloader)

In [7]:
lr_schedular = get_scheduler('linear',
                             optimizer = optimizer,
                             num_warmup_steps = 0,
                             num_training_steps = num_training_steps)
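One thing to watch in this setup: `num_training_steps` was computed before `accelerator.prepare`. On a single device that makes no difference, but with several processes the prepared dataloader hands each process only a share of the batches, so the step count per process is smaller. The sketch below shows the arithmetic, assuming batches are divided evenly across processes (my reading of Accelerate's default behaviour, not something this notebook verifies); `steps_per_process` is a hypothetical helper.

```python
import math


def steps_per_process(num_examples, batch_size, num_processes):
    """Batches one process sees per epoch if batches are shared out evenly."""
    total_batches = math.ceil(num_examples / batch_size)
    return math.ceil(total_batches / num_processes)


for n in (1, 2, 4):
    per_epoch = steps_per_process(3668, 16, n)
    print(n, per_epoch, per_epoch * 3)
```

Under that assumption, recomputing the step count from the prepared dataloader keeps the linear schedule ending at zero on multi-process runs.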

In [8]:
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))

model_acc.train()
for ep in range(epoch):
    total_loss = 0
    for batch in train_dataloader:
        out = model_acc(**batch)
        loss = out.loss
        total_loss += loss.item()
        # accelerator.backward(loss) replaces loss.backward(): it computes the gradient of the loss w.r.t. every parameter that has requires_grad = True
        # and stores it in that parameter's .grad attribute.
        accelerator.backward(loss)
        # optimizer.step() updates all the parameters based on parameter.grad
        optimizer.step()
        # adjusting the learning rate during the training process
        lr_schedular.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    print(f"At {ep} epoch, training loss - {total_loss/len(train_dataloader)}")


  0%|          | 0/690 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


At 0 epoch, training loss - 0.5774198976547822
At 1 epoch, training loss - 0.4262043435288512
At 2 epoch, training loss - 0.313537280553061
