<a href="https://colab.research.google.com/github/qum-ran/ProgrammingAssignment2/blob/master/fine_tune_ai_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune Large Language Models for Hittite Glossing


In this notebook, we fine-tune FLAN-T5, an instruction-tuned variant of the T5 large language model (LLM), for Hittite morphological glossing, focusing on its adaptability to low-resource ancient languages. The notebook walks through data preparation and full fine-tuning of the pre-trained model, then evaluates the result qualitatively on sample words and quantitatively with the ROUGE metric.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

Now install the required packages for the LLM and datasets.


In [1]:
%pip install --upgrade pip
%pip install \
    torch==1.13.1 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    transformers==4.27.2 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 \
    sentencepiece \
    openai \
    pandas \
    numpy \
    matplotlib \
    tqdm


Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
   1.8/1.8 MB 61.8 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.3.1
Collecting torch==1.13.1
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate==0.4.0
  Downloading evaluate-0.4.0-py3-none-any.whl.metadata (9.4 kB)
Collecting transformers==4.27.2
  Downloading transformers-4.27.2-py3-none-any.whl.metadata (106 kB)
Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting loralib==0.1.

Import the necessary components.




In [2]:
# Core Libraries
import torch  # PyTorch for model training
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from transformers import Trainer, TrainingArguments
from datasets import Dataset  # metrics are loaded through the `evaluate` package imported below

# Utility Libraries
import pandas as pd
import numpy as np
import os
import random
import time
import evaluate


<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

In [3]:
data = pd.read_csv("hittite_ds.csv", index_col = 0)
data.columns = ['txtid','lnr','cth','word', 'text', 'gloss','trans']
data.head()

|  | txtid | lnr | cth | word | text | gloss | trans |
|---|---|---|---|---|---|---|---|
| 0 | IBoT 1.30+ | Vs. 1 | 821 | LUGALuš | ⸢LUGAL⸣-uš | FNL(u).NOM.SG.C | König |
| 1 | IBoT 1.30+ | Vs. 1 | 821 | kuapi | ku-wa-pí | CNJ | sobald als |
| 2 | IBoT 1.30+ | Vs. 1 | 821 | DINGIRaš | DINGIR{MEŠ}-aš | D/L.PL | Gottheit |
| 3 | IBoT 1.30+ | Vs. 1 | 821 | aruaizi | a-ru-wa-a-ez-zi | 3SG.PRS | sich verneigen |
| 4 | IBoT 1.30+ | Vs. 1 | 821 | GUDU₁₂ | {LÚ}GUDU₁₂ | NOM.SG(UNM) | Gesalbter |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 170496 entries, 0 to 170495
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   txtid   170496 non-null  object
 1   lnr     170496 non-null  object
 2   cth     170496 non-null  int64 
 3   word    170496 non-null  object
 4   text    170496 non-null  object
 5   gloss   170469 non-null  object
 6   trans   170496 non-null  object
dtypes: int64(1), object(6)
memory usage: 10.4+ MB
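
The `data.info()` output lists 170469 non-null `gloss` values out of 170496 rows, so 27 words have no gloss. A minimal sketch of a predicate for excluding such rows is given below. `has_gloss` is a hypothetical helper that is not part of the original notebook, and the toy rows are placeholders rather than real data.

```python
import math


def has_gloss(example):
    """Return True when the example carries a usable gloss string."""
    gloss = example.get("gloss")
    if gloss is None:
        return False
    # pandas stores missing values in object columns as float NaN
    if isinstance(gloss, float) and math.isnan(gloss):
        return False
    return str(gloss).strip() != ""


# Toy rows (placeholders) covering the three cases handled above
rows = [
    {"word": "word_a", "gloss": "CNJ"},
    {"word": "word_b", "gloss": None},
    {"word": "word_c", "gloss": float("nan")},
]
kept = [row for row in rows if has_gloss(row)]
print(len(kept))

# Number of rows without a gloss according to data.info()
missing = 170496 - 170469
print(missing)
```

A predicate like this could be passed to `Dataset.filter`, or the same effect could be obtained in pandas with `data.dropna(subset=['gloss'])` before building the Hugging Face dataset.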


In [5]:
hf_dataset = Dataset.from_pandas(data[['word','gloss']])
hf_dataset = hf_dataset.remove_columns("__index_level_0__") if "__index_level_0__" in hf_dataset.column_names else hf_dataset

hf_dataset

Dataset({
    features: ['word', 'gloss'],
    num_rows: 170496
})

In [6]:
# Split into train, validation, and test sets
splits = hf_dataset.train_test_split(test_size=0.2, seed=43)  # 80% train, 20% test
train_dataset = splits["train"]
test_dataset = splits["test"]

# Further split test set into validation and test
val_test_splits = test_dataset.train_test_split(test_size=0.5, seed=43)  # 50/50 split
val_dataset = val_test_splits["train"]
test_dataset = val_test_splits["test"]
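
The two-stage split above yields roughly 80% training, 10% validation, and 10% test data. The following sketch works out the expected row counts by hand. It assumes the held-out share is rounded up to a whole row, which reproduces the counts reported later in this notebook; `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(170496, 0.2)
# Second split: held-out rows divided evenly into validation and test
n_val, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)  # 136396 17050 17050
```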

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. We use the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` loads the weights in the bfloat16 data type, which halves their memory footprint compared with the default float32.

In [31]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
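
To give a feel for what bfloat16 precision means, the sketch below emulates bfloat16 rounding in plain Python by keeping only the upper 16 bits of a float32 value. `to_bfloat16` is a hypothetical helper written for illustration, not a PyTorch function. It shows that a change of 1e-5 to a value near 1.0 is too small to be represented. This may matter when weights stored in bfloat16 are updated with a small learning rate, although whether it affects any particular training run is an assumption that would have to be checked.

```python
import struct


def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# A change of 1e-5 disappears: the result is still exactly 1.0
print(to_bfloat16(1.0 + 1e-5))  # 1.0

# The next representable bfloat16 value above 1.0 is 1 + 2**-7
print(to_bfloat16(1.0 + 2 ** -7))  # 1.0078125
```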



In [None]:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# original_model = original_model.to(device)


In [None]:
#original_model.config

In [None]:
# test_input = "Provide the gloss for the word: LUGALuš"
# tokens = tokenizer(test_input, return_tensors="pt")
# print(tokens)

The following function counts the model's parameters and reports how many of them are trainable. You do not need to study its details at this stage.

In [9]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}, all model parameters: {all_model_params}, percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print_number_of_trainable_model_parameters(original_model)

'trainable model parameters: 247577856, all model parameters: 247577856, percentage of trainable model parameters: 100.00%'
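
The parameter count translates directly into the memory needed for the weights. The sketch below is a back-of-the-envelope calculation under the assumption that only the weights are counted; gradients, optimizer state, and activations need additional memory during training. `weight_memory_mib` is a hypothetical helper.

```python
N_PARAMS = 247_577_856  # parameter count reported above

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(n_params, dtype):
    """Memory taken by the weights alone, in MiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2 ** 20


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_mib(N_PARAMS, dtype):.1f} MiB")
```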

<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. The outputs below show that the base model does not produce morphological glosses: it either echoes part of the input word (for example `QA` or `ekuzi`) or returns an unrelated English word. This is the baseline that fine-tuning is meant to improve.

In [10]:
random_indices = random.sample(range(len(test_dataset)), 5)
#device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for idx in random_indices:
    hittite_word = test_dataset[idx]['word']
    expected_gloss = test_dataset[idx]['gloss']
    prompt = f"Provide the morphological gloss for the following Hittite word:\n\n{hittite_word}\n\nGloss:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = tokenizer.decode(
        original_model.generate(inputs["input_ids"], max_new_tokens=50)[0],
        skip_special_tokens=True,
    )
    print(f"Input: {hittite_word}")
    print(f"Expected Gloss: {expected_gloss}")
    print(f"Generated Gloss: {output}\n")


Input: peḫutezi
Expected Gloss: 3SG.PRS
Generated Gloss: eloquent

Input: QA-TAM-MA
Expected Gloss: ADV
Generated Gloss: QA

Input: ekuzi
Expected Gloss: 3SG.PRS
Generated Gloss: ekuzi

Input: 1-ENpat
Expected Gloss: QUANcar += pat
Generated Gloss: enpat

Input: walḫannianzi
Expected Gloss: 3PL.PRS.IMPF
Generated Gloss: walannianzi



<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dataset

**Training prompt (Hittite word or phrase):**
Prepend the instruction `Provide the morphological gloss for the following Hittite word:` to the input word or phrase and append `Gloss:`.

Example:


```
Provide the morphological gloss for the following Hittite word: LUGALuš. Gloss:
```



**Training response (gloss):**
The target is the expected morphological annotation for the word.

Example:



```
FNL(u).NOM.SG.C
```




In [11]:
def preprocess_function(example):
    # Ensure input and target are strings
    return {
        "input_text": f"Provide the morphological gloss for the following Hittite word: {str(example['word'])}. Gloss:",
        "target_text": str(example["gloss"])
    }

In [12]:
train_dataset_preprocessed = train_dataset.map(preprocess_function)
val_dataset_preprocessed = val_dataset.map(preprocess_function)
test_dataset_preprocessed = test_dataset.map(preprocess_function)
train_dataset_preprocessed[378]

Map:   0%|          | 0/136396 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

{'word': 'AZU',
 'gloss': 'NOM.SG(UNM)',
 'input_text': 'Provide the morphological gloss for the following Hittite word: AZU. Gloss:',
 'target_text': 'NOM.SG(UNM)'}
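
The training prompt above is a single line, while the inference cells in this notebook build their prompts separately and place the word on its own line. One way to keep both identical is to route them through a shared helper. The sketch below assumes this is desired; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names that do not appear in the original notebook.

```python
PROMPT_TEMPLATE = (
    "Provide the morphological gloss for the following Hittite word: {word}. Gloss:"
)


def build_prompt(word):
    """Build the prompt string used for both training and inference."""
    return PROMPT_TEMPLATE.format(word=str(word))


def preprocess_function(example):
    """Same mapping as above, but with the prompt coming from build_prompt."""
    return {
        "input_text": build_prompt(example["word"]),
        "target_text": str(example["gloss"]),
    }


example = {"word": "AZU", "gloss": "NOM.SG(UNM)"}
print(preprocess_function(example)["input_text"])
```

At inference time the same `build_prompt(hittite_word)` call would then replace the hand-written f-string.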

In [13]:
# Tokenize datasets
def tokenize_function(example):
    model_inputs = tokenizer(
        example["input_text"], max_length=512, padding="max_length", truncation=True
    )
    labels = tokenizer(
        example["target_text"], max_length=128, padding="max_length", truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
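
With `padding="max_length"`, every label sequence is filled up to 128 positions with the pad token, and those positions take part in the loss unless they are masked. Hugging Face sequence-to-sequence models skip label positions set to `-100`, so a common adjustment is to replace pad ids in the labels with that value. The sketch below shows the replacement on plain Python lists. `mask_padding` and `mask_padding_batch` are hypothetical helpers, and the token ids in the example are made up for illustration.

```python
PAD_TOKEN_ID = 0      # pad id of the T5 tokenizer (tokenizer.pad_token_id)
IGNORE_INDEX = -100   # label value ignored by the cross-entropy loss


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in one label sequence with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


def mask_padding_batch(batch_label_ids, pad_token_id=PAD_TOKEN_ID):
    """Apply mask_padding to every sequence in a batch (list of lists)."""
    return [mask_padding(ids, pad_token_id) for ids in batch_label_ids]


# Made-up ids: two content tokens, the end-of-sequence id 1, then padding
labels = [[4321, 87, 1, 0, 0, 0]]
print(mask_padding_batch(labels))  # [[4321, 87, 1, -100, -100, -100]]
```

Because the datasets are mapped with `batched=True`, `labels["input_ids"]` is a list of lists, so inside `tokenize_function` the assignment could become `model_inputs["labels"] = mask_padding_batch(labels["input_ids"])`.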

In [14]:
train_dataset_tokenized = train_dataset_preprocessed.map(tokenize_function, batched=True)
val_dataset_tokenized = val_dataset_preprocessed.map(tokenize_function, batched=True)
test_dataset_tokenized = test_dataset_preprocessed.map(tokenize_function, batched=True)

# Set dataset format for PyTorch
train_dataset_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
val_dataset_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Map:   0%|          | 0/136396 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

Check the shapes of all three parts of the dataset:

In [15]:
print("Shapes of the datasets:")
print(f"Training: {train_dataset_tokenized.shape}")
print(f"Validation: {val_dataset_tokenized.shape}")
print(f"Test: {test_dataset_tokenized.shape}")

test_dataset_tokenized

Shapes of the datasets:
Training: (136396, 7)
Validation: (17050, 7)
Test: (17050, 7)


Dataset({
    features: ['word', 'gloss', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 17050
})

The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass it the tokenized training and validation datasets together with the original model. The remaining hyperparameters were chosen experimentally and are not discussed in detail here.

In [18]:
output_dir = f'./glossing-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,          # Directory to save model/checkpoints
    learning_rate=1e-5,             # Small learning rate for fine-tuning
    num_train_epochs=3,             # 3 full passes through the dataset
    weight_decay=0.01,              # Regularization to avoid overfitting
    logging_steps=50,               # Log the training loss every 50 steps
    evaluation_strategy="epoch",    # Evaluate after each epoch
    report_to="none"
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=train_dataset_tokenized,
    eval_dataset=val_dataset_tokenized
)

Start the training process:



In [19]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 27.9775 | 25.251848 |
| 2 | 27.9075 | 25.248636 |
| 3 | 27.835 | 25.252552 |


TrainOutput(global_step=51150, training_loss=27.9436119257087, metrics={'train_runtime': 9928.6683, 'train_samples_per_second': 41.213, 'train_steps_per_second': 5.152, 'total_flos': 2.8019449153349222e+17, 'train_loss': 27.9436119257087, 'epoch': 3.0})
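
The `global_step` value in the training output can be checked by hand. The sketch below assumes the `TrainingArguments` default of 8 examples per device, a single device, and no gradient accumulation, none of which are set explicitly in the cell above.

```python
import math

n_train = 136396   # rows in the training split
batch_size = 8     # TrainingArguments default per_device_train_batch_size (assumed)
num_epochs = 3

# The last batch of each epoch is smaller, so the division is rounded up
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 17050
print(total_steps)      # 51150
```

The result agrees with `global_step=51150` reported above.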



Save the fine-tuned model and its tokenizer, then reload them from disk as an `AutoModelForSeq2SeqLM` instance named `instructed_model`:

In [20]:
# Save the fine-tuned model
trainer.save_model(output_dir)

# Save the tokenizer
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to {output_dir}")


Model and tokenizer saved to ./glossing-training-1733262418


In [21]:
instructed_model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
instructed_tokenizer = AutoTokenizer.from_pretrained(output_dir)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative check that asks "Is my model behaving the way it is supposed to?" is a good starting point. In the examples below, the fine-tuned model does not produce valid glosses: almost every output is a short English phrase such as "a slick", regardless of the input word. This matches the training log above, where training and validation loss stay essentially flat over the three epochs, and indicates that fine-tuning did not succeed in this run. Note also that the prompt used here places the word on its own line, whereas the training prompt in section 2.1 is a single line.

In [22]:
random_indices = random.sample(range(len(test_dataset)), 5)
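
Since `random.sample` draws new indices on every call, the base and fine-tuned models are shown different words unless the sample is fixed. A minimal sketch of a reproducible sample follows. `sample_indices` is a hypothetical helper, and the seed value is arbitrary.

```python
import random


def sample_indices(n_items, k=5, seed=43):
    """Draw k distinct indices reproducibly, without touching the global random state."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k)


# Two calls with the same seed return the same indices
first = sample_indices(17050)
second = sample_indices(17050)
print(first == second)  # True
```

Calling `sample_indices(len(test_dataset))` in both the zero-shot cell and the cell below would make the two qualitative comparisons use the same words.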

In [23]:
for idx in random_indices:
    hittite_word = test_dataset[idx]['word']
    expected_gloss = test_dataset[idx]['gloss']
    prompt = f"Provide the morphological gloss for the following Hittite word:\n\n{hittite_word}\n\nGloss:"
    inputs = instructed_tokenizer(prompt, return_tensors="pt")
    output = instructed_tokenizer.decode(
        instructed_model.generate(inputs["input_ids"], max_new_tokens=50)[0],
        skip_special_tokens=True,
    )
    print(f"Input: {hittite_word}")
    print(f"Expected Gloss: {expected_gloss}")
    print(f"Generated Gloss: {output}\n")

Input: ekuzi
Expected Gloss: 3SG.PRS
Generated Gloss: a slick

Input: mezulla
Expected Gloss: DN.D/L.SG(UNM)
Generated Gloss: a slick

Input: QA-TAM-MApat
Expected Gloss: ADV
Generated Gloss: a syllable

Input: pai
Expected Gloss: 3SG.PRS
Generated Gloss: slick

Input: DINGIRnana
Expected Gloss: FNL(a).GEN.PL.C
Generated Gloss: a slick



<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) quantifies the n-gram overlap between model outputs and human-written references. It was designed for summarization, so for glossing it is only a rough proxy, but it still gives a first quantitative comparison of the original and fine-tuned models against the reference glosses.

In [24]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

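Generate glosses with both models for a sample of the test dataset (only 100 words, test-set rows 1100 to 1199, to save time), and save the results. Note that `Trainer` updates the model it is given in place, so the base checkpoint has to be reloaded with the loading cell in section 1.2 before this comparison for `original_model` to represent the pre-trained weights.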

In [35]:
# Move models to the appropriate device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

hittite_words = test_dataset[1100:1200]['word']  # Hittite words/phrases
human_baseline_glosses = test_dataset[1100:1200]['gloss']  # Expected glosses (human-provided)

original_model = original_model.to(device)
instructed_model = instructed_model.to(device)

# Initialize lists to store model outputs
original_model_glosses = []
instructed_model_glosses = []

# Iterate through the selected examples
for hittite_word in hittite_words:
    # Create the input prompt
    prompt = f"""
    Provide the morphological gloss for the following Hittite word:

    {hittite_word}

    Gloss:
    """
    # Tokenize the input prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)  # Move input_ids to the same device as the model
    input_ids_instructed = instructed_tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Generate gloss using the pre-trained (original) model
    original_model_outputs = original_model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=50)
    )
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_glosses.append(original_model_text_output)

    # Generate gloss using the fine-tuned (instructed) model
    instructed_model_outputs = instructed_model.generate(
        input_ids=input_ids_instructed,
        generation_config=GenerationConfig(max_new_tokens=50)
    )
    instructed_model_text_output = instructed_tokenizer.decode(instructed_model_outputs[0], skip_special_tokens=True)
    instructed_model_glosses.append(instructed_model_text_output)

# Combine results into a DataFrame for analysis
zipped_glosses = list(zip(human_baseline_glosses, original_model_glosses, instructed_model_glosses))
df = pd.DataFrame(zipped_glosses, columns=['human_baseline_glosses', 'original_model_glosses', 'instruct_model_glosses'])

# Display the DataFrame
df


|  | human_baseline_glosses | original_model_glosses | instruct_model_glosses |
|---|---|---|---|
| 0 | 3PL.PRS | rhyming | slick |
| 1 | 3SG.PRS | tiazi | a slick |
| 2 | { a → NOM.SG(UNM)} { b → ACC.SG(UNM)} { c → NO... | GU4.MA | a slick |
| 3 | QUANcar | morphological | slick |
| 4 | 3PL.PRS | tetraploid | a slick |
| ... | ... | ... | ... |
| 95 | ACC.SG.C | amorphous | a slick |
| 96 | GEN.SG(UNM) | synapse | a smear |
| 97 | 3PL.PRS | akuanzi | a slick |
| 98 | 3PL.PRS.IMPF | alzianzi | a slender slender slender |
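
Because a gloss is either right or wrong, a direct check on lists like these is exact-match accuracy, optionally complemented by a per-tag score. The sketch below is one possible formulation and not part of the original evaluation. `exact_match_accuracy` and `tag_accuracy` are hypothetical helpers, the per-tag score assumes that tags are separated by `.`, and the example predictions are made up.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference gloss."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def tag_accuracy(predictions, references):
    """Fraction of reference tags (split on '.') matched at the same position."""
    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        pred_tags = pred.strip().split(".")
        ref_tags = ref.strip().split(".")
        total += len(ref_tags)
        correct += sum(a == b for a, b in zip(pred_tags, ref_tags))
    return correct / total if total else 0.0


# Made-up predictions against three reference glosses
references = ["3PL.PRS", "3SG.PRS", "ACC.SG.C"]
predictions = ["3PL.PRS", "a slick", "ACC.PL.C"]

print(exact_match_accuracy(predictions, references))
print(tag_accuracy(predictions, references))
```

With the lists built above, the calls would be `exact_match_accuracy(instructed_model_glosses, human_baseline_glosses)` and the same for `original_model_glosses`.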


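Evaluate both models by computing ROUGE scores. All scores are close to zero, and the fine-tuned model scores slightly lower than the original (a rouge1 of about 0.0004 versus 0.0017), so this run shows no improvement from fine-tuning.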

In [39]:
original_model_results = rouge.compute(
    predictions=original_model_glosses,
    references=human_baseline_glosses,
    use_aggregator=True,
    use_stemmer=True,
)

# Compute ROUGE scores for the instructed (fine-tuned) model
instructed_model_results = rouge.compute(
    predictions=instructed_model_glosses,
    references=human_baseline_glosses,
    use_aggregator=True,
    use_stemmer=True,
)

# Display results
print("ORIGINAL MODEL:")
print(original_model_results)

print("\nINSTRUCTED MODEL:")
print(instructed_model_results)

ORIGINAL MODEL:
{'rouge1': 0.0016666666666666668, 'rouge2': 0.0, 'rougeL': 0.0016666666666666668, 'rougeLsum': 0.0016666666666666668}

INSTRUCTED MODEL:
{'rouge1': 0.0003773584905660377, 'rouge2': 0.0, 'rougeL': 0.0003773584905660377, 'rougeLsum': 0.0003773584905660377}


#### What Each Metric Represents


**rouge1 (Unigram Overlap):**

Measures the overlap of individual words (unigrams) between the predictions and the references.
A score of 0.001666 for the original model and 0.000377 for the instructed model means there is very little overlap between the predicted and ground truth glosses.

**rouge2 (Bigram Overlap):**

Measures the overlap of two-word sequences (bigrams) between predictions and references.
A score of 0.0 means there is no overlap of bigrams between the generated and expected glosses.

**rougeL (Longest Common Subsequence):**

Captures the longest sequence of tokens that appears in both the predictions and the references, preserving the order.
Scores here are the same as rouge1, indicating minimal commonality in token sequences.

**rougeLsum:**

Similar to rougeL, often used for summarization tasks. In our case, it matches rougeL.
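
To make the unigram overlap behind rouge1 concrete, the sketch below computes a simplified ROUGE-1 F1 score for a single prediction and reference. It is an approximation written for illustration, not the `rouge_score` implementation: it lowercases the text, treats every non-alphanumeric character as a separator, and applies no stemming. `tokenize` and `rouge1_f1` are hypothetical helpers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("a slick", "3SG.PRS"))   # 0.0, no shared tokens
print(rouge1_f1("3PL.PRS", "3SG.PRS"))   # 0.5, one of two tokens shared
print(rouge1_f1("3SG.PRS", "3SG.PRS"))   # 1.0, identical glosses
```

Under this tokenization a gloss such as `3SG.PRS` becomes the two tokens `3sg` and `prs`, so a partially correct gloss receives partial credit, while outputs like "a slick" score zero.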