# NLP Final Project
We are going to fine-tune an Indic GPT-2 model from the [Hugging Face model hub](https://huggingface.co/models). As fine-tuning data we use the [Thirukkural Dataset](https://github.com/tk120404/thirukkural/blob/master/thirukkural.json), which consists of 1330 quotes of wisdom (kurals) in the Tamil language. The dataset is collected after crawling _____.

The idea is to fine-tune GPT-2 on the kurals so that the model can generate new quotes in this language.



## **What are we going to do:**

- load the dataset
- prepare the dataset and build a ``TextDataset``
- load the pre-trained GPT-2 model and tokenizer
- initialize ``Trainer`` with ``TrainingArguments``
- train and save the model
- test the model

In [1]:
!pip install rouge
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!nvidia-smi

Thu May  4 07:55:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Load the dataset from JSON

As mentioned in the introduction, we use the [Thirukkural Dataset](https://github.com/tk120404/thirukkural/blob/master/thirukkural.json) from GitHub. The dataset consists of 1330 quotes with English translations, crawled from ????.


In [3]:
# Upload files to your Colab environment
from google.colab import files
uploaded = files.upload()

Saving thirukkural.json to thirukkural.json


# Prepare the dataset and build a ``TextDataset``

The next step is to extract the kurals from all quotes and build a `TextDataset`. The `TextDataset` is a custom implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library.

First, we split `thirukkural.json` into a `train` and a `test` section, extract `Line1` and `Line2` from each kural, and write them to `tamil_train_dataset.txt` and `tamil_test_dataset.txt`.

In [53]:
import re
import json
from sklearn.model_selection import train_test_split


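# Read the JSON file; UTF-8 is set explicitly because the text is Tamil
with open('thirukkural.json', encoding='utf-8') as f:
    data = json.load(f)

def build_text_files(data_json, dest_path):
    """Write both lines of every kural into a single text file."""
    data = ''
    for texts in data_json:
        # Join the two lines of the couplet and normalise all whitespace
        summary = str(texts['Line1'] + " " + texts['Line2']).strip()
        summary = re.sub(r"\s", " ", summary)
        data += summary + "  "
    # The context manager closes the file, so everything is flushed to disk
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(data)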

train, test = train_test_split(data["kural"],test_size=0.15) 

build_text_files(train,'tamil_train_dataset.txt')
build_text_files(test,'tamil_test_dataset.txt')

print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))


Train dataset length: 1130
Test dataset length: 200
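
The sizes above follow from how a fractional `test_size` is turned into a number of samples. As far as we know, scikit-learn rounds the test share up and assigns the remainder to the training set. The short sketch below reproduces the numbers under that assumption, without calling scikit-learn itself.

```python
import math

n_samples = 1330   # number of kurals in the dataset
test_size = 0.15   # fraction passed to train_test_split above

# Assumption: the test share is rounded up, the remainder is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Note that the split is random on every run because no `random_state` is set, so the two text files differ between runs.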


In [7]:
# with open('tamil_train_dataset.txt', 'w') as f:
#     for text in kurals_train['kural'].tolist():
#         f.write(text)

In [8]:
# with open('tamil_test_dataset.txt', 'w') as f:
#     for text in kurals_test['kural'].tolist():
#         f.write(text)

The next step is to download the tokenizer. We use the tokenizer of the `indic-gpt` model on [Hugging Face](https://huggingface.co/aashay96/indic-gpt).

In [54]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aashay96/indic-gpt")

train_path = 'tamil_train_dataset.txt'
test_path = 'tamil_test_dataset.txt'

In [55]:
from transformers import TextDataset,DataCollatorForLanguageModeling

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)
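
The `block_size=128` argument controls how each text file becomes training examples. To our understanding, `TextDataset` tokenizes the whole file as one long sequence and cuts it into consecutive blocks of `block_size` tokens, dropping a shorter remainder at the end. The sketch below illustrates this idea on dummy ids. `chunk_into_blocks` and `dummy_ids` are hypothetical names, and special tokens are ignored.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of equal size.

    A trailing remainder shorter than block_size is dropped.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

# Hypothetical input: 300 dummy token ids instead of real tokenizer output
dummy_ids = list(range(300))
blocks = chunk_into_blocks(dummy_ids, block_size=128)

print(len(blocks))     # number of examples built from the stream
print(len(blocks[0]))  # tokens per example
```

Because the file is treated as one stream, a block can start or end in the middle of a kural.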



# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. Before we can instantiate our `Trainer`, we need to download our GPT-2 model and create [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which give access to all points of customization during training. In the `TrainingArguments` we define the hyperparameters used in the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [60]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# Load the pre-trained causal language model (GPT-2 architecture)
model = AutoModelForCausalLM.from_pretrained("aashay96/indic-gpt")


training_args = TrainingArguments(
    output_dir="./gpt2-indic",       # output directory for checkpoints and the final model
    overwrite_output_dir=True,       # overwrite the content of the output directory
    num_train_epochs=4,              # number of training epochs
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    eval_steps=400,                  # update steps between two evaluations (only used with a step-based evaluation strategy)
    save_steps=500,                  # save a checkpoint every 500 update steps
    warmup_steps=200,                # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,       # only return the loss when evaluating
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



# Train and save the model

To train the model, we simply call `trainer.train()`.

In [61]:
trainer.train()





TrainOutput(global_step=52, training_loss=2.564396491417518, metrics={'train_runtime': 59.0191, 'train_samples_per_second': 26.229, 'train_steps_per_second': 0.881, 'total_flos': 101120016384000.0, 'train_loss': 2.564396491417518, 'epoch': 4.0})
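
The run finishes after 52 update steps, which is fewer than the `warmup_steps=200`, `eval_steps=400` and `save_steps=500` configured above. The sketch below shows what this means for the learning rate, assuming the default `learning_rate` of `5e-5` and a linear schedule with warmup. `linear_warmup_lr` is a hypothetical helper written for illustration, not a function of the transformers library.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: the rate grows linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: the rate falls linearly from base_lr to 0
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr = 5e-5       # assumption: default learning_rate of TrainingArguments
warmup_steps = 200   # from the TrainingArguments above
total_steps = 52     # global_step reported by trainer.train()

final_lr = linear_warmup_lr(total_steps, base_lr, warmup_steps, total_steps)
print(final_lr)
```

Under these assumptions the learning rate is still rising when training stops, at roughly a quarter of its peak value. A smaller `warmup_steps` value or more epochs may be worth trying.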

We save the model by calling `trainer.save_model()`. The trained model is stored in the `output_dir` defined in our `TrainingArguments`.

In [62]:
trainer.save_model()

# Test the model

To test the model, we use a `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, including `text-generation`.

In [63]:
from transformers import pipeline

tamil_quote = pipeline('text-generation',model='./gpt2-indic', tokenizer='aashay96/indic-gpt')

In [64]:
tamil_quote('அருள்வெஃகி ஆற்றின்கண்')

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


[{'generated_text': 'அருள்வெஃகி ஆற்றின்கண் பார்வம் இன்னொரு கொண்டும் தீர்க்கேற்றவாதல'}]

In [65]:
from rouge import Rouge 
rouge = Rouge()

In [66]:
hypothesis = 'அருள்வெஃகி ஆற்றின்கண் பார்வம் இன்னொரு கொண்டும் தீர்க்கேற்றவாதல'
reference = 'அருள்வெஃகி ஆற்றின்கண் நின்றான் பொருள்வெஃகிப் பொல்லாத சூழக் கெடும்.'
scores = rouge.get_scores(hypothesis, reference)


In [67]:
print(scores)

[{'rouge-1': {'r': 0.2857142857142857, 'p': 0.3333333333333333, 'f': 0.3076923027218935}, 'rouge-2': {'r': 0.16666666666666666, 'p': 0.2, 'f': 0.18181817685950424}, 'rouge-l': {'r': 0.2857142857142857, 'p': 0.3333333333333333, 'f': 0.3076923027218935}}]
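
To see where the recall (`r`) and precision (`p`) values above come from, the following sketch recomputes the n-gram overlap by hand with whitespace tokenization. It is a simplified illustration and not the implementation of the `rouge` package, which may differ in details such as how repeated n-grams are counted. `ngrams` and `overlap_scores` are hypothetical helper names.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_scores(hypothesis, reference, n):
    """Recall and precision of the n-gram overlap of two whitespace-tokenized strings."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((hyp & ref).values())      # n-grams that occur in both texts
    recall = overlap / sum(ref.values())     # share of the reference that was found
    precision = overlap / sum(hyp.values())  # share of the hypothesis that is correct
    return recall, precision

hypothesis = 'அருள்வெஃகி ஆற்றின்கண் பார்வம் இன்னொரு கொண்டும் தீர்க்கேற்றவாதல'
reference = 'அருள்வெஃகி ஆற்றின்கண் நின்றான் பொருள்வெஃகிப் பொல்லாத சூழக் கெடும்.'

r1, p1 = overlap_scores(hypothesis, reference, 1)
r2, p2 = overlap_scores(hypothesis, reference, 2)
print(r1, p1)
print(r2, p2)
```

The only overlap is the two-word prompt itself: two of seven reference words and one of six reference bigrams. The scores therefore say little about the quality of the generated continuation.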


In [69]:
from nltk.translate.bleu_score import sentence_bleu
reference = [
    'அருள்வெஃகி ஆற்றின்கண் நின்றான் பொருள்வெஃகிப் பொல்லாத சூழக் கெடும்.'.split()
]
candidate = 'அருள்வெஃகி ஆற்றின்கண் பார்வம் இன்னொரு கொண்டும் தீர்க்கேற்றவாதல'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate )))


BLEU score -> 6.416038883891965e-155
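
The BLEU score above is practically zero. BLEU combines the precisions of 1-grams to 4-grams in a geometric mean, so a single order without any match pulls the whole score to zero. The sketch below computes these precisions by hand for the same sentence pair. It is a simplified illustration with one reference and no smoothing, not NLTK's implementation, and `modified_precision` is a hypothetical helper name.

```python
import math
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference, as (matches, total)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum((cand & ref).values())
    return matches, sum(cand.values())

reference = 'அருள்வெஃகி ஆற்றின்கண் நின்றான் பொருள்வெஃகிப் பொல்லாத சூழக் கெடும்.'.split()
candidate = 'அருள்வெஃகி ஆற்றின்கண் பார்வம் இன்னொரு கொண்டும் தீர்க்கேற்றவாதல'.split()

# Precisions for n = 1..4 as (matching n-grams, candidate n-grams)
precisions = [modified_precision(candidate, reference, n) for n in range(1, 5)]
print(precisions)

# Brevity penalty: applies only if the candidate is shorter than the reference
if len(candidate) > len(reference):
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - len(reference) / len(candidate))

# Geometric mean of the four precisions; one zero makes the product zero
product = math.prod(m / t for m, t in precisions)
bleu = brevity_penalty * product ** 0.25
print(bleu)
```

No 3-gram or 4-gram of the candidate occurs in the reference, so the unsmoothed score is zero. NLTK prints a tiny positive number instead, presumably because it substitutes a very small value for the zero counts. For short sentences like these, the `SmoothingFunction` class in `nltk.translate.bleu_score` may give more informative values.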


In [70]:
tamil_quote('தம்பொருள் என்பதம்')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


'தம்பொருள் என்பதம் நாள்.கிடந்த பான்கண் வேலும் இன்றைய பாளுளும் மனுன்'