# NLP Final Project
We are going to fine-tune an Indic GPT-2 model from the [Hugging Face model hub](https://huggingface.co/models). As fine-tuning data we use the [Thirukkural Dataset](https://github.com/tk120404/thirukkural/blob/master/thirukkural.json), which consists of 1330 quotes of wisdom in the Tamil language. The dataset was collected by crawling _____.

The idea is to fine-tune our GPT-2 on the quotes in the kural so that it can generate more quotes in this language.



## **What are we going to do:**

- load the dataset
- prepare the dataset and build a ``TextDataset``
- load the pre-trained GPT-2 model and tokenizer
- initialize ``Trainer`` with ``TrainingArguments``
- train and save the model
- test the model

In [1]:
!pip install rouge
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!nvidia-smi

Fri May  5 04:18:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Load the dataset from JSON

As mentioned in the introduction, we use the [Thirukkural Dataset](https://github.com/tk120404/thirukkural/blob/master/thirukkural.json) from GitHub. It consists of 1330 quotes with English translations, crawled from ????.


In [4]:
# Upload thirukkural.json to the Colab environment
from google.colab import files
uploaded = files.upload()
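
Before preprocessing, it helps to know the shape of the file. The sketch below parses a hand-written, two-record sample that mimics the layout the later cells rely on: a top-level `kural` list whose entries carry `Line1` and `Line2`. The sample string and the `couplet` helper are assumptions for illustration; the real file may contain further fields.

```python
import json

# Hand-written sample in the assumed layout of thirukkural.json:
# a top-level "kural" list whose records have "Line1" and "Line2".
sample = """
{
  "kural": [
    {"Line1": "அகர முதல எழுத்தெல்லாம் ஆதி", "Line2": "பகவன் முதற்றே உலகு."},
    {"Line1": "first line of another couplet", "Line2": "second line of another couplet"}
  ]
}
"""

def couplet(record):
    """Join both lines of one record into a single string (hypothetical helper)."""
    return (record["Line1"] + " " + record["Line2"]).strip()

sample_data = json.loads(sample)
records = sample_data["kural"]
couplets = [couplet(r) for r in records]

print(len(records))   # number of records in the sample
print(couplets[0])    # both lines of the first record, joined by a space
```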

# Prepare the dataset and build a ``TextDataset``

The next step is to extract the kurals from all quotes and build a `TextDataset`. `TextDataset` is an implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library.

First, we split `thirukkural.json` into a `train` and a `test` section, extract `Line1` and `Line2` from each kural, and write them to `tamil_train_dataset.txt` and `tamil_test_dataset.txt`.

In [5]:
import re
import json
from sklearn.model_selection import train_test_split


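# Explicit UTF-8 so the Tamil text is read the same way on every platform
with open('thirukkural.json', encoding='utf-8') as f:
    data = json.load(f)

def build_text_files(data_json, dest_path):
    """Write every kural (Line1 + Line2) to dest_path as one whitespace-normalized text."""
    text = ''
    for kural in data_json:
        summary = (kural['Line1'] + " " + kural['Line2']).strip()
        summary = re.sub(r"\s", " ", summary)  # turn newlines and tabs into spaces
        text += summary + "  "
    # The context manager closes (and flushes) the file when the block ends
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(text)

train, test = train_test_split(data["kural"], test_size=0.15)

build_text_files(train, 'tamil_train_dataset.txt')
build_text_files(test, 'tamil_test_dataset.txt')

print("Train dataset length: " + str(len(train)))
print("Test dataset length: " + str(len(test)))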


Train dataset length: 1130
Test dataset length: 200
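
The 1130/200 split can be checked by hand. As far as we know, scikit-learn rounds a fractional test share up, so 15% of 1330 (199.5) becomes 200 test quotes and the rest go to training. The stdlib-only sketch below mimics that behaviour; `shuffle_split` and its `seed` parameter are hypothetical and not part of scikit-learn.

```python
import math
import random

def shuffle_split(items, test_size, seed=0):
    """Shuffle a copy of items and split off ceil(test_size * n) test elements."""
    n_test = math.ceil(test_size * len(items))
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

quotes = list(range(1330))            # stand-ins for the 1330 kurals
train_part, test_part = shuffle_split(quotes, test_size=0.15)
print(len(train_part), len(test_part))  # 1130 200
```

In `train_test_split` itself, passing a fixed `random_state` plays the role of `seed` here and makes the split reproducible across runs.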


The next step is to download the tokenizer. We use the tokenizer of the `indic-gpt` model on [Hugging Face](https://huggingface.co/aashay96/indic-gpt).

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aashay96/indic-gpt")

train_path = 'tamil_train_dataset.txt'
test_path = 'tamil_test_dataset.txt'

In [7]:
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    # Each dataset cuts its tokenized file into blocks of 128 token ids
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128)

    # mlm=False selects causal language modeling, the objective GPT-2 is trained with
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
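
`block_size=128` means the tokenized file is cut into consecutive windows of token ids, each of which becomes one training example. A minimal sketch of that chunking follows, assuming (as we understand `TextDataset`) that a trailing block shorter than `block_size` is dropped rather than padded. `chunk_token_ids` is a hypothetical helper, and the real class additionally reserves room for special tokens.

```python
def chunk_token_ids(token_ids, block_size):
    """Cut a flat list of token ids into full blocks, dropping the incomplete tail."""
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]

token_ids = list(range(300))          # stand-in for the ids of a tokenized text file
blocks = chunk_token_ids(token_ids, block_size=128)
print(len(blocks))                    # 2 full blocks; the last 44 ids are dropped
print(len(blocks[0]), blocks[1][0])   # 128 128
```

Because the kurals are concatenated into one stream before tokenization, a block may start or end in the middle of a couplet.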



# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. Before we can instantiate our `Trainer`, we need to download our GPT-2 model and create [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which expose all the points of customization during training. In the `TrainingArguments` we define the hyperparameters used in the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [8]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for GPT-2 style models
model = AutoModelForCausalLM.from_pretrained("aashay96/indic-gpt")


training_args = TrainingArguments(
    output_dir="./gpt2-indic",  # the output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=4,  # number of training epochs
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps=400,  # update steps between two evaluations (only used with a step-based evaluation strategy)
    save_steps=500,  # save a checkpoint every 500 update steps
    warmup_steps=200,  # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,  # only return the loss during evaluation
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



# Train and save the model

To train the model we can simply run `Trainer.train()`.

In [9]:
trainer.train()





TrainOutput(global_step=56, training_loss=2.5140271868024553, metrics={'train_runtime': 59.7157, 'train_samples_per_second': 28.066, 'train_steps_per_second': 0.938, 'total_flos': 109481361408000.0, 'train_loss': 2.5140271868024553, 'epoch': 4.0})
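
The numbers in `TrainOutput` can be related back to the `TrainingArguments`. With 4 epochs and 56 update steps, each epoch has 14 steps, so at a batch size of 32 (one device, no gradient accumulation) the training set holds between 417 and 448 blocks. The sketch below also evaluates the warmup phase of a linear schedule, assuming the `Trainer` defaults of a `5e-5` peak learning rate and a linear scheduler, neither of which is set explicitly in this notebook. `warmup_lr` is a hypothetical helper that models only the warmup phase.

```python
import math

def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during linear warmup: rises from 0 to base_lr over warmup_steps."""
    return base_lr * min(1.0, step / warmup_steps)

global_step, epochs, batch_size = 56, 4, 32
steps_per_epoch = global_step // epochs

# ceil(n_blocks / batch_size) == steps_per_epoch bounds the number of training blocks
min_blocks = (steps_per_epoch - 1) * batch_size + 1
max_blocks = steps_per_epoch * batch_size
assert math.ceil(min_blocks / batch_size) == steps_per_epoch
assert math.ceil(max_blocks / batch_size) == steps_per_epoch

# Assumed default peak learning rate of 5e-5 (not set in the notebook)
final_lr = warmup_lr(global_step, base_lr=5e-5, warmup_steps=200)
print(steps_per_epoch, min_blocks, max_blocks)  # 14 417 448
print(final_lr)
```

Under these assumptions training ends at step 56, well inside the 200-step warmup, so the learning rate never exceeds roughly 28% of its peak. A smaller `warmup_steps` may be worth trying for a dataset this small.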

We save the model by calling `save_model()`. The trained model is stored in the `output_dir` from our `TrainingArguments`.

In [10]:
trainer.save_model()

# Test the model

To test the model we use `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, `text-generation` among them.

In [37]:
from transformers import pipeline

tamil_quote = pipeline('text-generation',model='./gpt2-indic', tokenizer='aashay96/indic-gpt')

# Assessing the test output

In [28]:
def post_process(output_sequences):
    """Decode generated token ids and tidy up whitespace and the final period."""
    predictions = []
    generated_sequences = []

    # decode prediction
    for generated_sequence in output_sequences:
        generated_sequence = generated_sequence.tolist()
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=True)
        generated_sequences.append(text.strip())

    for g in generated_sequences:
        res = g.replace('\n\n\n', '\n').replace('\n\n', '\n').replace(".", " ")
        # collapse runs of whitespace and keep the result
        res = re.sub(r"\s\s+", " ", res).strip()
        # end each prediction with a period (also safe for an empty string)
        if not res.endswith("."):
            res = res + "."
        predictions.append(res)

    return predictions

In [29]:
num_sequences = 1
min_length = 40 #@param {type:"integer"}
max_length = 50 #@param {type:"integer"}
temperature = 1 #@param {type:"slider", min:0, max:3, step:0.01}
top_p = 0.95 #@param {type:"slider", min:0, max:1, step:0.01}
top_k = 50 #@param {type:"integer"}
repetition_penalty = 1.0 #@param {type:"number"}

def generate_text(start):
    encoded_prompt = tokenizer(start, add_special_tokens=False, return_tensors="pt").input_ids
    encoded_prompt = encoded_prompt.to(trainer.model.device)
    # prediction
    output_sequences = trainer.model.generate(
        input_ids=encoded_prompt,
        max_length=max_length,
        min_length=min_length,
        temperature=float(temperature),
        top_p=float(top_p),
        top_k=int(top_k),
        do_sample=True,
        repetition_penalty=repetition_penalty,
        num_return_sequences=num_sequences,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Post-processing
    predictions = post_process(output_sequences)
    return predictions[0]
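
The `temperature`, `top_k` and `top_p` settings shape the distribution from which each next token is sampled. The toy sketch below applies the three steps to a five-token vocabulary in plain Python. It illustrates the general technique and is not the transformers implementation; `filtered_distribution` and the logits are made up for the example.

```python
import math

def filtered_distribution(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens, most likely first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix whose renormalized mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving probabilities so they sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
dist = filtered_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.95)
print(sorted(dist))                        # [0, 1, 2]: only the top 3 tokens survive
```

Lowering `top_p` or `top_k` narrows the candidate set and makes the output more conservative, while a `temperature` above 1 flattens the distribution before filtering.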

In [49]:
reference = 'அருள்வெஃகி ஆற்றின்கண் நின்றான் பொருள்வெஃகிப் பொல்லாத சூழக் கெடும்.'
pred_hypothesis = tamil_quote('அருள்வெஃகி ஆற்றின்கண்')
hypothesis = pred_hypothesis[0]['generated_text']
print(hypothesis)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


அருள்வெஃகி ஆற்றின்கண் நோன்சில் சென்னை தாள்,  வீழ்த்தி யானை நோ


In [50]:
from rouge import Rouge 
rouge = Rouge()

In [51]:
rouge_scores = rouge.get_scores(hypothesis, reference)
print(rouge_scores)

[{'rouge-1': {'r': 0.2857142857142857, 'p': 0.25, 'f': 0.266666661688889}, 'rouge-2': {'r': 0.16666666666666666, 'p': 0.14285714285714285, 'f': 0.1538461488757398}, 'rouge-l': {'r': 0.2857142857142857, 'p': 0.25, 'f': 0.266666661688889}}]
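
The ROUGE-1 and ROUGE-2 values above can be reproduced by hand from whitespace-separated tokens. The sketch below recomputes them for the same reference and generated text. It assumes the `rouge` package compares sets of n-grams after splitting on whitespace, which agrees with the printed numbers here; `ngram_set` and `overlap_scores` are hypothetical helper names.

```python
def ngram_set(text, n):
    """Set of n-grams over whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_scores(hypothesis, reference, n):
    """Recall, precision and F1 of the n-gram overlap between two texts."""
    hyp, ref = ngram_set(hypothesis, n), ngram_set(reference, n)
    overlap = len(hyp & ref)
    r, p = overlap / len(ref), overlap / len(hyp)
    f = 2 * p * r / (p + r) if overlap else 0.0
    return {"r": r, "p": p, "f": f}

prompt = "அருள்வெஃகி ஆற்றின்கண்"
reference_text = prompt + " நின்றான் பொருள்வெஃகிப் பொல்லாத சூழக் கெடும்."
generated_text = prompt + " நோன்சில் சென்னை தாள்,  வீழ்த்தி யானை நோ"

rouge_1 = overlap_scores(generated_text, reference_text, 1)
rouge_2 = overlap_scores(generated_text, reference_text, 2)
print(rouge_1)  # r = 2/7, p = 2/8
print(rouge_2)  # r = 1/6, p = 1/7
```

Both shared unigrams and the single shared bigram come from the prompt itself, so these scores mostly reflect the prompt rather than the generated continuation.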


In [52]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

def bertscore(generated_text, reference_text, bert_model, tokenizer):
    """Simplified embedding-similarity score (not the standard greedy-matching BERTScore)."""
    # Tokenize generated and reference texts
    generated_tokens = tokenizer.encode(generated_text, add_special_tokens=False)
    reference_tokens = tokenizer.encode(reference_text, add_special_tokens=False)

    # Convert token IDs to tensors with a batch dimension of 1
    generated_ids = torch.tensor(generated_tokens).unsqueeze(0)
    reference_ids = torch.tensor(reference_tokens).unsqueeze(0)

    # Generate contextual embeddings for each token using the BERT model
    generated_embeddings = bert_model(generated_ids)[0][0].detach().numpy()
    reference_embeddings = bert_model(reference_ids)[0][0].detach().numpy()

    # Calculate cosine similarity between each pair of embeddings
    similarities = cosine_similarity(generated_embeddings, reference_embeddings)

    # Weighted average over the first row of the matrix only: the similarities between
    # the first generated token and the leading reference tokens, weighted by how often
    # each distinct reference token id occurs
    weights = [reference_tokens.count(token_id) for token_id in set(reference_tokens)]
    score = sum(similarities[0, i] * weight for i, weight in enumerate(weights)) / sum(weights)

    return score


In [53]:
bert_model = BertModel.from_pretrained('l3cube-pune/tamil-bert')
# Separate name so the GPT-2 tokenizer used by generate_text and post_process is not overwritten
bert_tokenizer = BertTokenizer.from_pretrained('l3cube-pune/tamil-bert')

Some weights of the model checkpoint at l3cube-pune/tamil-bert were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [54]:
score = bertscore(hypothesis, reference, bert_model, bert_tokenizer)
print(score)


0.9223641364470773
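
For comparison, BERTScore as it is usually defined matches every token greedily to its most similar counterpart: recall averages, over the reference tokens, the best cosine similarity to any generated token, and precision does the same in the other direction. A minimal sketch on made-up 2-dimensional embeddings follows, without the optional idf weighting. `greedy_bertscore` and the toy vectors are assumptions for illustration; real inputs would be the per-token BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bertscore(generated, reference):
    """Precision, recall and F1 from greedy token matching on lists of embeddings."""
    precision = sum(max(cosine(g, r) for r in reference) for g in generated) / len(generated)
    recall = sum(max(cosine(r, g) for g in generated) for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "token embeddings": two generated tokens, three reference tokens
generated_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
precision, recall, f1 = greedy_bertscore(generated_vecs, reference_vecs)
print(precision, recall, f1)
```

Here every generated token has a perfectly aligned reference token, so precision is 1, while the third reference token has no close match and pulls recall below 1.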
