Fine-tuning a language model
In this notebook, we'll see how to fine-tune one of the 🤗 Transformers models on a language modeling task. We cover one type of language modeling task:

Causal language modeling: the model has to predict the next token in the sentence (so the labels are the same as the inputs shifted to the right). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when trying to predict token i+1 in the sentence.
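
As a rough illustration of the two ideas above, the sketch below builds next-token targets and a causal mask in plain Python. The token ids and the helper names `next_token_pairs` and `causal_mask` are made up for this example; the real mask is applied inside the model.

```python
def next_token_pairs(token_ids):
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    return token_ids[:-1], token_ids[1:]


def causal_mask(n):
    """n x n lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = [10, 11, 12, 13]  # hypothetical token ids
inputs, targets = next_token_pairs(tokens)
mask = causal_mask(len(inputs))

print(inputs, targets)
for row in mask:
    print(row)
```

In this notebook the shift is not done by hand: the labels are passed as a copy of the input_ids and the model shifts them internally.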


In [2]:
!pip install transformers
import torch
!pip install datasets
from datasets import load_dataset
device = "cuda" if torch.cuda.is_available() else "cpu"
import pandas as pd
import numpy as np

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a547687423

In [3]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...



Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20. Subsequent calls will reuse this data.


In [4]:
TrainData = pd.DataFrame(datasets['train'])
TestData = pd.DataFrame(datasets['test']['text'])
TestData.columns = ["text"]

In [8]:
print(TrainData.shape,TestData.shape,sep="\n\n")

(36718, 1)

(4358, 1)


In [6]:
model_name = "gpt2"

In [7]:
from transformers import GPT2LMHeadModel,GPT2TokenizerFast

In [9]:
Model = GPT2LMHeadModel.from_pretrained(model_name,force_download=True)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)





In [None]:
def tokenize_function(texts):
  # Tokenize each line on its own (no padding, no truncation) and collect the results.
  input_idss = []
  attention_maskss = []
  for data in texts:
    encodings = tokenizer(data)

    input_idss.append(encodings['input_ids'])
    attention_maskss.append(encodings['attention_mask'])

  return input_idss,attention_maskss

In [None]:
input_idsTrain,attention_maskTrain = tokenize_function(TrainData['text'])
input_idsTest,attention_maskTest = tokenize_function(TestData['text'])

In [None]:
# We have not used padding here, so we can concatenate the input_ids of all sentences
# into one single list, then split that list into blocks of 128 tokens each.
# The labels are the same as the input_ids (the Transformers API does the right shift automatically).

In [None]:
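from itertools import chain

def Concatenate_fun(input_ids,attention_mask):
  # Flatten the per-line lists into one long list each. itertools.chain avoids building
  # a ragged NumPy array, which newer NumPy versions reject with a ValueError.
  input_ids = list(chain.from_iterable(input_ids))
  attention_mask = list(chain.from_iterable(attention_mask))
  return input_ids,attention_mask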

In [None]:
input_idsTrain,attention_maskTrain = Concatenate_fun(input_idsTrain,attention_maskTrain)
input_idsTest,attention_maskTest = Concatenate_fun(input_idsTest,attention_maskTest)

  


In [None]:
print(len(input_idsTrain),len(input_idsTest))
print(len(attention_maskTrain),len(attention_maskTest))

2391884 283287
2391884 283287


In [None]:
def Grouping(lis,block_size):
  remainder = len(lis)%block_size
  idx = len(lis)-remainder
  lis_cut = lis[:idx] # chop-off excess data
  lis_new = [lis_cut[i:i+block_size] for i in range(0,len(lis_cut),block_size)]
  return lis_new
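
To make the concatenate-then-group step concrete, here is a small self-contained sketch on made-up token lists. It mirrors what `Concatenate_fun` and `Grouping` do, but the helper name `group_into_blocks`, the toy data, and the block size of 4 are assumptions chosen for illustration.

```python
from itertools import chain


def group_into_blocks(token_ids, block_size):
    """Split a flat token list into full blocks, dropping the incomplete tail."""
    usable = len(token_ids) - len(token_ids) % block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# Hypothetical tokenized lines of different lengths (one is empty, as a blank line would be).
lines = [[1, 2, 3], [], [4, 5], [6, 7, 8, 9, 10]]

flat = list(chain.from_iterable(lines))         # one long stream of tokens
blocks = group_into_blocks(flat, block_size=4)  # fixed-size training examples

print(flat)
print(blocks)
```

Note that blocks cross line boundaries and that the last two tokens are discarded because they do not fill a whole block.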

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128

In [None]:
labels_train = Grouping(input_idsTrain,block_size)           # labels are a copy of the input_ids
input_idsTrain_grouped = Grouping(input_idsTrain,block_size) # group input_ids into fixed-size blocks

labels_test = Grouping(input_idsTest,block_size)
input_idsTest_grouped = Grouping(input_idsTest,block_size)

attention_maskTrain_grouped = Grouping(attention_maskTrain,block_size)
attention_maskTest_grouped = Grouping(attention_maskTest,block_size)
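
A quick sanity check on the sizes, using only the token counts printed earlier and a block size of 128: the number of full blocks should equal the "Num examples" values that the Trainer reports later in this notebook.

```python
block_size = 128
train_tokens, test_tokens = 2391884, 283287  # lengths printed earlier in the notebook

# divmod gives (number of full blocks, number of leftover tokens that get dropped)
train_blocks, train_dropped = divmod(train_tokens, block_size)
test_blocks, test_dropped = divmod(test_tokens, block_size)

print(train_blocks, train_dropped)  # 18686 76
print(test_blocks, test_dropped)    # 2213 23
```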

In [None]:
trainDict = {
    "input_ids":torch.tensor(input_idsTrain_grouped,dtype=torch.long),
    "attention_mask":torch.tensor(attention_maskTrain_grouped,dtype=torch.long),
    "labels":torch.tensor(labels_train,dtype=torch.long)
    }

testDict = {
    "input_ids":torch.tensor(input_idsTest_grouped,dtype=torch.long),
    "attention_mask":torch.tensor(attention_maskTest_grouped,dtype=torch.long),
    "labels":torch.tensor(labels_test,dtype=torch.long)
    }

In [None]:
class CustomDataset(torch.utils.data.Dataset):
  # Map-style dataset over the pre-built tensors; each item is one block of tokens.
  def __init__(self,dictionary):
    self.id = dictionary['input_ids']
    self.mask = dictionary['attention_mask']
    self.label = dictionary['labels']

  def __len__(self):
    return len(self.label)

  def __getitem__(self,index):
    # Return the keyword arguments that the model's forward pass expects.
    return {
        "input_ids":self.id[index],
        "attention_mask":self.mask[index],
        "labels":self.label[index]
        }

In [None]:
Train_iter = CustomDataset(trainDict)
Test_iter = CustomDataset(testDict)

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
training_args = TrainingArguments(
    "test-clm",
    do_train = True,
    do_eval=True,
    do_predict=True,
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3.0,
    logging_dir = "Logs/",
    logging_strategy = "epoch",
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 2,
    save_strategy = "epoch"
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model=Model,
    args=training_args,
    train_dataset=Train_iter,
    eval_dataset=Test_iter
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 18686
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 14016


Epoch  Training Loss  Validation Loss
1      3.5357         3.387756
2      3.3251         3.380015
3      3.2452         3.379946


***** Running Evaluation *****
  Num examples = 2213
  Batch size = 2
Saving model checkpoint to test-clm/checkpoint-4672
Configuration saved in test-clm/checkpoint-4672/config.json
Model weights saved in test-clm/checkpoint-4672/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2213
  Batch size = 2
Saving model checkpoint to test-clm/checkpoint-9344
Configuration saved in test-clm/checkpoint-9344/config.json
Model weights saved in test-clm/checkpoint-9344/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2213
  Batch size = 2
Saving model checkpoint to test-clm/checkpoint-14016
Configuration saved in test-clm/checkpoint-14016/config.json
Model weights saved in test-clm/checkpoint-14016/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=14016, training_loss=3.3686721314033963, metrics={'train_runtime': 2830.8878, 'train_samples_per_second': 19.802, 'train_steps_per_second': 4.951, 'total_flos': 5357450309271552.0, 'train_loss': 3.3686721314033963, 'epoch': 3.0})
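
The numbers in the log above can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (all consistent with the log). It also converts the final validation loss into a perplexity, treating that loss as the mean per-token cross-entropy in nats; perplexity is not reported by the run itself.

```python
import math

num_blocks, batch_size, epochs = 18686, 4, 3  # values taken from the log and TrainingArguments

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_blocks / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 4672 14016

final_val_loss = 3.379946             # validation loss after epoch 3
perplexity = math.exp(final_val_loss)
print(round(perplexity, 2))
```

The step counts line up with the checkpoint names (4672, 9344, 14016), and the perplexity comes out a little above 29.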

In [None]:
# tokenizer.save_pretrained("Tokenizer/")

tokenizer config file saved in Tokenizer/tokenizer_config.json
Special tokens file saved in Tokenizer/special_tokens_map.json


('Tokenizer/tokenizer_config.json',
 'Tokenizer/special_tokens_map.json',
 'Tokenizer/vocab.json',
 'Tokenizer/merges.txt',
 'Tokenizer/added_tokens.json',
 'Tokenizer/tokenizer.json')

In [None]:
# !mv /content/Tokenizer /content/drive/MyDrive/NLP/LanguageModelling/CasualModelling/

In [None]:
# %cd /content/drive/MyDrive/NLP/LanguageModelling/CasualModelling/

/content/drive/MyDrive/NLP/LanguageModelling/CasualModelling


Prediction

In [1]:
%cd /content/drive/MyDrive/NLP/LanguageModelling/CasualModelling/

/content/drive/MyDrive/NLP/LanguageModelling/CasualModelling


In [18]:
from transformers import GPT2LMHeadModel
Model = GPT2LMHeadModel.from_pretrained("test-clm/checkpoint-14016")

In [19]:
Model.eval()
Model.to("cpu")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )


In [2]:
from transformers import GPT2TokenizerFast

In [13]:
!ls

CLM.ipynb  CLMquantizedModel.pt  GPT2LM.py  main.py  templates	Tokenizer


In [14]:
token = GPT2TokenizerFast.from_pretrained("Tokenizer/")

In [15]:
sequence = "I am Teacher and"

input_ids = token.encode(sequence, return_tensors="pt",add_special_tokens=True)

In [17]:
# For generation, only the input_ids are required; the attention_mask is not needed here.

In [37]:
outputs = Model.generate(input_ids, max_length=80, do_sample=True, top_p=0.95, top_k=60)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
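
The `generate` call above samples with `top_k=60` and `top_p=0.95`. As a rough picture of what those two settings do to the next-token distribution, here is a NumPy sketch on a made-up 4-token vocabulary. The function `top_k_top_p_filter` is written for this example only and is not the code that 🤗 Transformers runs internally; edge cases such as ties may be handled differently there.

```python
import numpy as np


def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return a copy of `logits` where filtered-out tokens are set to -inf."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if top_k > 0:
        k = min(top_k, logits.size)
        kth_largest = np.sort(logits)[-k]
        logits[logits < kth_largest] = -np.inf   # keep only the k best tokens
    if top_p < 1.0:
        order = np.argsort(-logits)              # token indices, most likely first
        probs = np.exp(logits[order] - logits[order[0]])
        probs /= probs.sum()
        mass_before = np.cumsum(probs) - probs   # probability mass ranked ahead of each token
        # Drop a token once the tokens ahead of it already cover top_p of the mass.
        logits[order[mass_before >= top_p]] = -np.inf
    return logits


# Hypothetical vocabulary of 4 tokens with probabilities 0.15, 0.5, 0.05, 0.3.
toy_logits = np.log([0.15, 0.5, 0.05, 0.3])
filtered = top_k_top_p_filter(toy_logits, top_k=3, top_p=0.8)
kept = np.isfinite(filtered).tolist()

# Sample the next token from the renormalized surviving distribution.
probs = np.exp(filtered - filtered.max())
probs /= probs.sum()
next_token = int(np.random.default_rng(0).choice(len(probs), p=probs))
print(kept, next_token)
```

In the toy case, top-k removes the least likely token, top-p then removes the third most likely one, and sampling picks between the two tokens that remain.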


In [38]:
a = token.decode(outputs[0],skip_special_tokens=True,clean_up_tokenization_spaces=True)

In [39]:
print(a[:a.rindex(".")])

I am Teacher and I hope that you are okay and that you are okay and I 'll help you get back to work. I hope that you are fine and that you are fine. " 
 Simone described the trip as a " long one "
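
One caveat about the trimming step: `a.rindex(".")` raises a `ValueError` when the generated text contains no period at all. A slightly more forgiving variant, sketched below with a hypothetical helper name, falls back to the untrimmed text in that case.

```python
def trim_to_last_period(text):
    """Cut `text` just before its last period; return it unchanged if there is none."""
    cut = text.rfind(".")  # rfind returns -1 instead of raising ValueError
    return text[:cut] if cut != -1 else text


print(trim_to_last_period("First sentence. Second sentence. Unfinished third"))
print(trim_to_last_period("no period here"))
```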


Model Quantization...

In [9]:
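# Dynamic int8 quantization of the Linear layers. Note that GPT-2's attention and MLP
# projections are Conv1D modules (see the model printout above), so {torch.nn.Linear}
# does not cover them.
# quantized_model = torch.quantization.quantize_dynamic(
#     Model, {torch.nn.Linear}, dtype=torch.qint8
# )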

In [10]:
# outputs = quantized_model.generate(input_ids, max_length=80, do_sample=True, top_p=0.95, top_k=60)
# a = token.decode(outputs[0],skip_special_tokens=True,clean_up_tokenization_spaces=True)
# a = a[:a.rindex(".")]
# a

In [40]:
print(a)

I am Teacher and I hope that you are okay and that you are okay and I 'll help you get back to work. I hope that you are fine and that you are fine. " 
 Simone described the trip as a " long one ". She noted " the whole thing, what they did, the whole thing, and what I had to do just to make a statement to each


In [11]:
# torch.save(quantized_model,"quantizedModel.pt")

In [41]:
model = torch.load("quantizedModel.pt")

In [44]:
sequence = "I love my wife and"

input_ids = token.encode(sequence, return_tensors="pt",add_special_tokens=True)

In [45]:
outputs = model.generate(input_ids, max_length=80, do_sample=True, top_p=0.95, top_k=60)
a = token.decode(outputs[0],skip_special_tokens=True,clean_up_tokenization_spaces=True)
a = a[:a.rindex(".")]
print(a)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I love my wife and baby. So when an Indian wife left the family, a third family member in that case, and an Indian son had been born, the income for that father was considered extreme — it needed to be about 3 million dollars a year. There was a great need for security because it was an illegal act. I have always used my son's father's name
