# Fine-Tuning for Summarization

**Model:** [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) (encoder-decoder): [base version](https://huggingface.co/google/flan-t5-base) (~0.25B params)

**Task:** Dialogue Summarization

**Dataset:** [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum)

# Table of Contents

- [1 - Set up & Load](#1)
  - [1.1 - Set up Kernel & Required Dependencies](#1.1)
  - [1.2 - Load Dataset & LLM](#1.2)
  - [1.3 - Baseline: Model with Zero Shot Inference](#1.3)
- [2 - Perform Parameter Efficient Fine-Tuning (PEFT)](#2)
  - [2.1 - Preprocess the Dataset](#2.1)
  - [2.2 - Setup the LoRA model for Fine-Tuning](#2.2)
  - [2.3 - Train PEFT Adapter](#2.3)
  - [2.4 - Evaluate the Model Qualitatively (Human Evaluation)](#2.4)
  - [2.5 - Evaluate the Model Quantitatively (ROUGE Metric)](#2.5)

<a name='1'></a>
## 1 - Set up & Load

<a name='1.1'></a>
### 1.1 - Set up Kernel & Required Dependencies

**Note:** restart the kernel after installation to use the updated packages.

In [4]:
%pip install \
    transformers \
    datasets \
    evaluate \
    rouge_score \
    loralib \
    peft --quiet

Note: you may need to restart the kernel to use updated packages.


In [5]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset & LLM

The dataset contains more than 10,000 dialogues, each with a manually labeled summary and topic.

In [6]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained **FLAN-T5 (base version)** and its tokenizer directly from HuggingFace.

Setting `dtype=torch.bfloat16` loads the model weights in 16-bit bfloat16 precision, which roughly halves their memory footprint compared with 32-bit floats.

In [8]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pull out the **number** of **model parameters** & find out **how many** of them are **trainable**. 

In [9]:
def print_number_of_trainable_model_parameters(model):
    all_model_params = model.num_parameters()
    trainable_model_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
    
    percentage_trainable = 100 * trainable_model_params / all_model_params if all_model_params > 0 else 0
    
    return (f"Trainable model parameters: {trainable_model_params}\n"
            f"All model parameters: {all_model_params}\n"
            f"Percentage of trainable model parameters: {percentage_trainable:.2f}%")

In [10]:
print(print_number_of_trainable_model_parameters(original_model))

Trainable model parameters: 247577856
All model parameters: 247577856
Percentage of trainable model parameters: 100.00%
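
To make the effect of the data type concrete, here is a rough sketch that estimates the memory needed for the weights alone, using the parameter count printed above. The helper name `weight_memory_gib` is hypothetical, and the estimate deliberately ignores activations, gradients, and optimizer state.

```python
# Bytes per parameter for common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

num_params = 247_577_856  # parameter count printed above

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Under these assumptions the weights take a little under 1 GiB in float32 and about half of that in bfloat16.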


<a name='1.3'></a>
### 1.3 - Baseline: Model with Zero Shot Inference

**Observation:** compared with the human baseline summary, the model struggles to summarize the dialogue. It does, however, pull out some important information from the text, which suggests that the model can be fine-tuned for the task at hand.

In [11]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-' * 99  # separator line printed after each label

print(f'Dialogue:\n{dialogue}\n')
print(f'Generated Summary:{dash_line}\n{output}\n')
print(f'Human Summary:{dash_line}\n{summary}\n')

Dialogue:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Generated Summary:---------------------------------------------------------------------------------------------------
#Person1#: I'm thinking of upgrading my computer.

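Human Summary:---------------------------------------------------------------------------------------------------
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.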

<a name='2'></a>
## 2 - Perform Parameter Efficient Fine-Tuning (PEFT)

PEFT is an approach to **fine-tuning** that is **much more efficient** than full fine-tuning, with comparable evaluation results. Here it is used for **instruction fine-tuning**.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In practice, PEFT usually refers to LoRA.

LoRA allows the user to fine-tune a model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the **original LLM** remains **unchanged** and a **newly trained “LoRA adapter”** is produced. This adapter is much smaller than the original LLM: **less than 10%** of its size (**MBs vs GBs**).

At inference time, the LoRA adapter needs to be reunited and **combined** with its original LLM to serve the inference request. Many LoRA adapters can **re-use** the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

<a name='2.1'></a>
### 2.1 - Preprocess the Dataset

**Convert** the dialog-summary (**prompt-response**) pairs into **explicit instructions** for the LLM.

**Wrap** each dialogue in an **instruction**: prepend `Summarize the following conversation.` to the dialogue and append `Summary: ` after it. The summary itself serves as the response (label).

Then **tokenize** the prompt-response **dataset** and **keep** the resulting `input_ids` (one ID per token).

In [12]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    # prompt
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    # response
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    # example: the 'prompt-response' pair (both in form of the 'input_ids')
    return example

# The dataset contains 3 different splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# remove the extra columns which are not needed for fine-tuning
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Check the shapes of all three parts of the dataset:

In [13]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})


The output dataset is ready for fine-tuning.
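
One refinement worth considering: `tokenize_function` pads the labels to the maximum length, so the training loss is also computed on padding tokens. A common convention with Hugging Face seq2seq models is to replace padding IDs in the labels with `-100`, a value the loss ignores. The sketch below shows the idea on plain Python lists. `mask_padding` is a hypothetical helper, and the pad ID of `0` is an assumption (in practice, read it from `tokenizer.pad_token_id`). The results in this notebook were produced without this step.

```python
PAD_TOKEN_ID = 0      # assumption: T5 tokenizers use id 0 for <pad>
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]

# Toy batch of already-padded label ids (values are made up for illustration).
labels = [[363, 19, 1, 0, 0], [27, 43, 3, 9, 1]]
print(mask_padding(labels))  # [[363, 19, 1, -100, -100], [27, 43, 3, 9, 1]]
```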

<a name='2.2'></a>
### 2.2 - Setup the LoRA model for Fine-Tuning

In [47]:
from peft import LoraConfig, get_peft_model, TaskType

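Set up the PEFT/LoRA model for fine-tuning by adding a new adapter (extra layers/parameters). With PEFT/LoRA, the underlying **LLM** is **frozen** and only the **adapter** is **trained**.

The rank (`r`) **hyper-parameter** defines the **rank/dimension** of the **adapter** matrices to be trained. Typical values are small powers of two such as 4, 8, 16, or 32; a higher rank gives the adapter more capacity at the cost of more trainable parameters.

`lora_alpha` is another **hyper-parameter** that defines the **scaling factor** for the LoRA **updates**: the adapter output is multiplied by `lora_alpha / r` before it is added to the output of the frozen layer.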
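
As a minimal sketch of how `r` and `lora_alpha` interact, the toy code below computes `W x + (lora_alpha / r) * B (A x)` with plain Python lists. The function names and all matrix values are illustrative assumptions, not the PEFT implementation. It also shows why a zero-initialized `B` (the usual LoRA starting point) leaves the frozen model's output unchanged before training.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x) for a single input vector."""
    scaling = lora_alpha / r
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank adapter path
    return [b + scaling * u for b, u in zip(base, update)]

# Toy sizes: 3 inputs, 3 outputs, rank 2 (all values are illustrative).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight, 3 x 3
A = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]                    # trainable, r x in = 2 x 3
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]             # trainable, out x r = 3 x 2
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]          # pretend values after training
x = [2.0, 4.0, 6.0]

print(lora_forward(W, A, B_zero, x, r=2, lora_alpha=4))     # [2.0, 4.0, 6.0]
print(lora_forward(W, A, B_trained, x, r=2, lora_alpha=4))  # [10.0, 12.0, 6.0]
```

With `r=32` and `lora_alpha=64`, as configured in this notebook, the scaling factor is likewise 2.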

`target_modules` defines **which linear layers** inside the model are **wrapped** with LoRA adapters. Adding LoRA to **only some matrices** significantly **reduces trainable parameters**.

In a Transformer, each attention layer contains several projection matrices:
- q → query projection
- k → key projection
- v → value projection
- o → output projection

Empirically, adapting just **q** and **v** often gives almost the same performance as adapting all four (`q`, `k`, `v`, `o`), while using less memory and training faster for *seq2seq models*.

Increasing `lora_dropout` applies stronger regularization to the LoRA layers, which helps against overfitting but can slow down learning.

`bias` defines which bias parameters should be trained: `none`, `lora_only`, or `all`.

Set `task_type` according to the architecture of the original model.

In [70]:
lora_config = LoraConfig(
    r=32,                            # Rank of the LoRA matrices (controls adaptation capacity)
    lora_alpha=64,                   # Scaling factor for the LoRA updates 
    target_modules=["q", "v"],       # Which model modules to inject LoRA into
    lora_dropout=0.05,               # Dropout applied to LoRA layers during training
    bias="none",                     # Whether to train bias parameters
    task_type=TaskType.SEQ_2_SEQ_LM  # Type of model/task
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [71]:
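from peft import PeftModel

# If an earlier run already wrapped the model, remove the adapter to get the base model back
if isinstance(original_model, PeftModel):
    original_model = original_model.unload()  # Remove the PEFT adapter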

In [72]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

Trainable model parameters: 3538944
All model parameters: 251116800
Percentage of trainable model parameters: 1.41%
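
The trainable-parameter count above can be reproduced by hand. The sketch below assumes the architecture values of `google/flan-t5-base` as published in its model config (hidden size 768, 12 encoder layers, 12 decoder layers); the function names are hypothetical. Each adapted projection gets two small matrices, `A` (`r x d_in`) and `B` (`d_out x r`).

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed architecture values for google/flan-t5-base (from its public config).
D_MODEL = 768           # hidden size; the q/k/v/o projections are 768 x 768
ENCODER_LAYERS = 12     # one self-attention block each
DECODER_LAYERS = 12     # one self-attention and one cross-attention block each

ATTENTION_BLOCKS = ENCODER_LAYERS + 2 * DECODER_LAYERS  # 36

def total_lora_params(target_modules, r):
    """Total LoRA parameters when adapting the given projections in every attention block."""
    per_matrix = lora_param_count(D_MODEL, D_MODEL, r)
    return ATTENTION_BLOCKS * len(target_modules) * per_matrix

print(total_lora_params(["q", "v"], r=32))            # 3538944
print(total_lora_params(["q", "k", "v", "o"], r=32))  # 7077888
```

The first value matches the count reported above, and adapting all four projections would double it.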


<a name='2.3'></a>
### 2.3 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

`learning_rate` in LoRA should be **higher** than in full fine-tuning because:
- in **full fine-tuning**, the weights of the **original model** are updated, so drastic changes should be avoided.
- in **LoRA**, the **adapter** weights are newly initialized rather than pre-trained, so they need to learn faster.

In [73]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# A minimal setting to see the output.
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,   # On out-of-memory errors, retry with a smaller batch size
    learning_rate=5e-4,          # Higher learning rate than full fine-tuning
    num_train_epochs=10,         # Has no effect here: a positive max_steps takes precedence
    weight_decay=0.01,
    logging_steps=10,            # Log the training loss every 10 steps
    max_steps=100,               # Stop after 100 optimizer steps
    label_names=['labels']
)


# A fuller configuration for a complete run (evaluation also requires passing eval_dataset to Trainer):
# peft_training_args = TrainingArguments(
#     output_dir=output_dir,
#     auto_find_batch_size=True,         # On out-of-memory errors, retry with a smaller batch size
#     learning_rate=5e-4,                # Higher learning rate than full fine-tuning
#     per_device_train_batch_size=8,     # Adjust based on GPU memory
#     per_device_eval_batch_size=8,      # Usually same as train batch size
#     num_train_epochs=3,
#     logging_steps=50,                  # Log less frequently than every step
#     eval_strategy="steps",             # Evaluate periodically (evaluation_strategy in older transformers releases)
#     eval_steps=200,                    # Adjust based on dataset size
#     save_strategy="steps",             # Save checkpoints periodically
#     save_steps=500,
#     save_total_limit=2,                # Keep only the last 2 checkpoints
#     max_steps=-1,                      # Let num_train_epochs determine the number of steps
#     weight_decay=0.01,                 # Optional, small regularization
#     label_names=['labels'],
#     report_to="none"                   # Disable WandB/other logging if not used
# )
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
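
Because a positive `max_steps` overrides `num_train_epochs`, it helps to check how much of the training set the minimal configuration actually covers. The sketch below is a back-of-the-envelope estimate under stated assumptions: a per-device batch size of 8 (the `TrainingArguments` default), a single device, and no gradient accumulation. The helper name `epochs_covered` is hypothetical.

```python
def epochs_covered(max_steps, per_device_batch_size, num_examples,
                   num_devices=1, gradient_accumulation_steps=1):
    """Fraction of the training set seen after `max_steps` optimizer steps."""
    examples_per_step = per_device_batch_size * num_devices * gradient_accumulation_steps
    return max_steps * examples_per_step / num_examples

# Assumptions: batch size 8 per device, one device, no gradient accumulation.
# If auto_find_batch_size has to shrink the batch, the fraction is even smaller.
fraction = epochs_covered(max_steps=100, per_device_batch_size=8, num_examples=12460)
print(f"{fraction:.3f} epochs")
```

Under these assumptions the run sees only about 6% of the 12,460 training examples, which is worth keeping in mind when reading the evaluation results.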

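Everything is now ready to train the PEFT adapter and save it.

**Note:** save the **tokenizer** alongside the adapter so that the checkpoint directory is self-contained. Fine-tuning itself does not modify the tokenizer, but the saved copy matters whenever tokens have been added.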

In [74]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss |
|------|---------------|
| 10 | 37.05 |
| 20 | 15.2969 |
| 30 | 5.05 |
| 40 | 4.1031 |
| 50 | 3.3312 |
| 60 | 2.5297 |
| 70 | 1.8477 |
| 80 | 1.3711 |
| 90 | 1.1484 |
| 100 | 1.0598 |


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')



Prepare the **PEFT model** for inference by loading the trained **adapter** on top of a fresh copy of the **original model**.

Set `is_trainable=False` when the PEFT model is used only for inference.

**Note:** set `is_trainable=True` when preparing the model for further training.

In [75]:
from peft import PeftModel, PeftConfig

In [76]:
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_name, dtype=torch.bfloat16) # model_name = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name) 

In [77]:
peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       peft_model_path,  # adapter directory saved after training
                                       dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` due to `is_trainable=False` setting:

In [78]:
print(print_number_of_trainable_model_parameters(peft_model))

Trainable model parameters: 0
All model parameters: 251116800
Percentage of trainable model parameters: 0.00%


<a name='2.4'></a>
### 2.4 - Evaluate the Model Qualitatively (Human Evaluation)

In [79]:
# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

original_model = original_model.to(device)
peft_model = peft_model.to(device)

index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(f'Original Model:{dash_line}\n{original_model_text_output}\n')
print(f'PEFT Model:{dash_line}\n{peft_model_text_output}\n')
print(f'Human:{dash_line}\n{human_baseline_summary}\n')

Original Model:---------------------------------------------------------------------------------------------------
You'd probably need a more power, but you might also want a more powerful hard disk, more memory and a faster modem.

PEFT Model:---------------------------------------------------------------------------------------------------
Upgrade your system.

Human:---------------------------------------------------------------------------------------------------
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.



<a name='2.5'></a>
### 2.5 - Evaluate the Model Quantitatively (ROUGE Metric)

In [80]:
rouge = evaluate.load('rouge')

Generate summaries for a sample of the test dataset (only 10 dialogues, to save time) and store the results.

In [83]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

device = "cuda" if torch.cuda.is_available() else "cpu"

original_model = original_model.to(device)
peft_model = peft_model.to(device)

original_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])

In [84]:
df

| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Poster1#: ms. dabbas. #Person1## needs to take... | This memo is to be distributed to all employee... |
| 1 | In order to prevent employees from wasting tim... | Your #Person1# needs to take a dict. #Person2#... | This memo is to be distributed to all employee... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Your memo is being distributed to all employee... | This memo is to be distributed to all employee... |
| 3 | #Person2# arrives late because of traffic jam.... | @Person1#: I got stuck in traffic again. #Pers... | Take the subway to work. |
| 4 | #Person2# decides to follow #Person1#'s sugges... | You're finally here! | Take the subway to work. |
| 5 | #Person2# complains to #Person1# about the tra... | s2#: I feel bad about how much his car is addi... | Take the subway to work. |
| 6 | #Person1# tells Kate that Masha and Hero get d... | @Person1#: #Person2#: Kate, you never believe ... | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | @Person1#: I don't really know, but I don't re... | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | You are kidding, #Person1#: #Person2#: Well, I... | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | @Person1#: I'm so happy you remember, please c... | Happy birthday, Brian. |


Evaluate the models by computing ROUGE metrics.

In [85]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True, # return the average over all examples (not one score per example)
    use_stemmer=True, # stem words before matching, so inflected forms count as matches
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
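
To show what ROUGE-1 measures, here is a deliberately simplified sketch of its F1 score: clipped unigram overlap between prediction and reference. It is an illustration only, not the `evaluate` implementation used above. It lowercases and splits on whitespace and skips stemming, so its numbers will not match the library's exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# 4 of 6 unigrams match in each direction, so precision = recall = F1 = 2/3.
print(round(rouge1_f1("the cat lay on a mat", "the cat sat on the mat"), 4))  # 0.6667
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.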

In [88]:
original_model_rouge = [x.item() for x in original_model_results.values()]
peft_model_rouge = [x.item() for x in peft_model_results.values()]

zipped_rouge = list(zip(original_model_rouge, peft_model_rouge))
df_rouge = pd.DataFrame(zipped_rouge, columns=['original_model', 'peft_model'])

On this 10-example sample, the PEFT model generates **better** summaries (**higher ROUGE** scores) than the original model.

In [90]:
df_rouge

| | original_model | peft_model |
|---|---|---|
| 0 | 0.145997 | 0.257233 |
| 1 | 0.003704 | 0.095022 |
| 2 | 0.126032 | 0.204499 |
| 3 | 0.127749 | 0.209052 |


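PEFT typically yields somewhat lower scores than full fine-tuning, but its much lower compute and storage costs generally outweigh the slightly lower performance metrics. Full fine-tuning was not run in this notebook, so the comparison here is with the original model only.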

Calculate the improvement of PEFT over the original model:

In [91]:
print("Absolute improvement (in percentage points) of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute improvement (in percentage points) of PEFT MODEL over ORIGINAL MODEL
rouge1: 11.12%
rouge2: 9.13%
rougeL: 7.85%
rougeLsum: 8.13%
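
The figures above are absolute differences in ROUGE score, expressed in percentage points. For context, the sketch below also computes the relative gain, using the scores copied from the `df_rouge` table (its rows 0 to 3 correspond to `rouge1`, `rouge2`, `rougeL`, and `rougeLsum`, in that order). The helper name `improvements` is hypothetical. With only 10 test examples, both kinds of figures should be read as rough indications.

```python
# ROUGE scores copied from the df_rouge table: (original model, PEFT model).
scores = {
    "rouge1":    (0.145997, 0.257233),
    "rouge2":    (0.003704, 0.095022),
    "rougeL":    (0.126032, 0.204499),
    "rougeLsum": (0.127749, 0.209052),
}

def improvements(original, peft):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (peft - original) * 100
    relative = (peft - original) / original * 100
    return absolute, relative

for name, (original, peft) in scores.items():
    absolute, relative = improvements(original, peft)
    print(f"{name}: +{absolute:.2f} points absolute, +{relative:.0f}% relative")
```

The relative gain for `rouge2` comes out very large only because the original model's score is close to zero, which is one reason to report the absolute difference as well.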
