# Fine-Tuning for Summarization

**Model:** [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) (encoder-decoder): [base version](https://huggingface.co/google/flan-t5-base) (~0.25B params)

**Task:** Dialogue Summarization

**Dataset:** [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum)

# Table of Contents

- [ 1 - Set Up & Load](#1)
  - [ 1.1 - Set Up Kernel & Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset & LLM](#1.2)
  - [ 1.3 - Baseline: Model with Zero Shot Inference](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (ROUGE Metric)](#2.4)

<a name='1'></a>
## 1 - Set Up & Load

<a name='1.1'></a>
### 1.1 - Set Up Kernel & Required Dependencies

**Note:** restart the kernel after installation to use the updated packages.

In [None]:
%pip install \
    transformers \
    datasets \
    evaluate \
    rouge_score \
    loralib \
    peft --quiet

In [3]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset & LLM

The dataset contains 14,460 dialogues (12,460 train, 500 validation, 1,500 test), each paired with a manually labeled summary and topic.

In [4]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained **FLAN-T5 (base version)** and its tokenizer directly from Hugging Face.

Setting `dtype=torch.bfloat16` loads the weights in bfloat16, a 16-bit floating-point format that roughly halves memory use compared with the default 32-bit floats.

In [6]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pull out the **number** of **model parameters** & find out **how many** of them are **trainable**. 

In [8]:
def print_number_of_trainable_model_parameters(model):
    all_model_params = model.num_parameters()
    trainable_model_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
    
    percentage_trainable = 100 * trainable_model_params / all_model_params if all_model_params > 0 else 0
    
    return (f"Trainable model parameters: {trainable_model_params}\n"
            f"All model parameters: {all_model_params}\n"
            f"Percentage of trainable model parameters: {percentage_trainable:.2f}%")

In [9]:
print(print_number_of_trainable_model_parameters(original_model))

Trainable model parameters: 247577856
All model parameters: 247577856
Percentage of trainable model parameters: 100.00%
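
As a rough illustration of why the weight data type matters, the sketch below estimates the memory needed for the weights alone from the parameter count printed above. It ignores activations, gradients, and optimizer state, and the helper name `weight_memory_mb` is hypothetical.

```python
# Parameter count printed above for google/flan-t5-base.
NUM_PARAMS = 247_577_856

# Bytes needed to store one weight in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_mb(NUM_PARAMS, dtype):.0f} MB")
# float32: ~990 MB
# bfloat16: ~495 MB
```

Full fine-tuning needs considerably more than this, since gradients and optimizer state are kept for every trainable parameter.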


<a name='1.3'></a>
### 1.3 - Baseline: Model with Zero Shot Inference

**Observation:** the zero-shot output is much weaker than the human-written baseline summary, but it does pick up some relevant information from the dialogue, which suggests the model is a reasonable candidate for fine-tuning on this task.

In [10]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))

print(f'Dialogue:\n{dialogue}\n')
print(f'Generated Summary:{dash_line}\n{output}\n')
print(f'Human Summary:{dash_line}\n{summary}\n')

Dialogue:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Generated Summary:---------------------------------------------------------------------------------------------------
#Person1#: I'm thinking of upgrading my computer.

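Human Summary:---------------------------------------------------------------------------------------------------
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.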

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dataset

**Convert** the dialog-summary (**prompt-response**) pairs into **explicit instructions** for the LLM.

**Prepend** the instruction `Summarize the following conversation.` to each dialogue and **append** `Summary: ` after it. The summary itself is used unchanged as the label.

Then **preprocess** the prompt-response **dataset** into **tokens** and **pull out** their `input_ids` (1 per token).

In [12]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    # prompt
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    # response
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    # example: the 'prompt-response' pair (both in form of the 'input_ids')
    return example

# The dataset contains 3 different splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# remove the extra columns which are not needed for fine-tuning
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Check the shapes of all three parts of the dataset:

In [13]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})


The output dataset is ready for fine-tuning.
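
One refinement that is often applied in seq2seq fine-tuning, although it is not used in this notebook, is to replace padding token IDs in `labels` with `-100` so that the cross-entropy loss ignores the padded positions. The sketch below shows the idea on plain Python lists. It assumes a pad token ID of 0 (the value of `tokenizer.pad_token_id` for T5 tokenizers), the helper name `mask_padding_in_labels` is hypothetical, and the token IDs are made up for illustration.

```python
PAD_TOKEN_ID = 0      # assumption: pad ID of the T5 tokenizer (tokenizer.pad_token_id)
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def mask_padding_in_labels(batch_labels, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of the label batch with every pad ID replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in labels]
        for labels in batch_labels
    ]


# Two made-up label sequences, padded to length 5.
padded_labels = [[363, 19, 1, 0, 0], [71, 1, 0, 0, 0]]
print(mask_padding_in_labels(padded_labels))
# [[363, 19, 1, -100, -100], [71, 1, -100, -100, -100]]
```

Without this masking, the loss is also computed on the padded positions, so it rewards the model for predicting padding.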

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

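Use the built-in Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class, passing it the **preprocessed dataset** and the **original model**. `Trainer` updates the model object it is given in place, so reload a fresh copy from `model_name` if an untouched zero-shot baseline is needed later.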

**Note:**

- Small dataset → fewer epochs (3–5) are enough. Overfitting is likely otherwise.

- Batch size 16 × gradient accumulation 2 → effective batch 32. Adjust if GPU memory is low.

- Learning rate 5e-5 → the `TrainingArguments` default and a reasonable starting point for this model size.

- Evaluation every 200 steps → enough to track progress without slowing down training.

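- Mixed precision (fp16, or bf16 on hardware that supports it) → reduces memory usage significantly. T5-family models are often reported to be numerically unstable in fp16, so bf16 is usually the safer choice.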

- Save & load best model → ensures best validation model is retained.

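- Since the dataset is very small, adding early stopping can help. With `early_stopping_patience=2`, training stops once the validation loss has failed to improve for 2 consecutive evaluations. The callback is passed to `Trainer` through its `callbacks` argument and relies on `load_best_model_at_end=True` and `metric_for_best_model` being set: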

```
from transformers import EarlyStoppingCallback

callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
```
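
The arithmetic behind the notes above can be sketched as follows. The numbers come from the full configuration described in this section. A single GPU is an assumption, and `Trainer`'s exact step count may differ by one depending on how it handles the last partial batch.

```python
import math

TRAIN_EXAMPLES = 12_460   # size of the train split
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2
NUM_DEVICES = 1           # assumption: single-GPU training
EPOCHS = 5
EVAL_STEPS = 200
PATIENCE = 2              # early_stopping_patience

# Number of examples that contribute to one optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# Approximate optimizer steps per epoch and for the whole run.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
# How many evaluations fit into the run, and how long early stopping waits.
evaluations = total_steps // EVAL_STEPS
patience_window = PATIENCE * EVAL_STEPS

print(effective_batch)   # 32
print(steps_per_epoch)   # 390
print(total_steps)       # 1950
print(evaluations)       # 9
print(patience_window)   # 400
```

With these settings the model is evaluated roughly twice per epoch, and early stopping triggers after about one epoch without improvement in validation loss.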



In [20]:
# relative path with a Unix-timestamp suffix, so each run gets its own directory
output_dir = f'./flan_t5_finetuned-{str(int(time.time()))}' # where checkpoints are saved

# TrainingArguments: to create the input arguments of 'Trainer'
"""
training_args = TrainingArguments(
    output_dir=output_dir,

    # Training parameters
    per_device_train_batch_size=16,    # Adjust based on GPU memory
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,     # Accumulate gradients if batch size is small (effective batch size = batch_size * grad_accum_steps)
    learning_rate=5e-5,                # Starting LR for T5 fine-tuning
    weight_decay=0.01,                 # Standard weight decay
    num_train_epochs=5,                # Small dataset: 3-5 epochs may be sufficient
    warmup_steps=100,                  # Small warmup to help stabilize training at the very beginning: the number of steps during which lr increases from 0 to learning_rate
    load_best_model_at_end=True,       # To automatically load the checkpoint corresponding to the best validation metric
    metric_for_best_model="loss",      # The metric for determining the “best” checkpoint

    
    # Evaluation & logging
    eval_strategy="steps",             # Evaluate during training (this argument was named `evaluation_strategy` in older transformers releases)
    eval_steps=200,                    # Evaluate every 200 steps - NOTE: adjust based on dataset size
    save_strategy="steps",             # Save checkpoints
    save_steps=200,                    # Save every 200 steps
    logging_steps=50,                  # Log metrics every 50 steps
    
    # Mixed precision (faster and less memory usage)
    fp16=True,                         

    # Optional
    save_total_limit=3,                # keep only last 3 checkpoints
    report_to="none",                  # Disable WandB/other reporting if not needed (or set the value as "wandb" if using Weights & Biases)
)
"""

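# Minimal configuration for a quick demonstration run.
# max_steps=100 overrides num_train_epochs, so training stops after 100 optimizer steps.
# Batch size is left at the default (per_device_train_batch_size=8).
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=5e-5,
    warmup_steps=100,                  # equal to max_steps here, so the whole run is spent in warmup
    num_train_epochs=5,                # has no effect while max_steps is set
    weight_decay=0.01,
    logging_steps=10,
    max_steps=100
)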

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)
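
In the minimal configuration, `warmup_steps` equals `max_steps`, so the learning rate is still ramping up when training stops. The sketch below illustrates this under the assumption that the default linear schedule is used. The function `linear_warmup_lr` is a hypothetical helper, not part of `transformers`, and it only models the warmup phase.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=100):
    """Learning rate applied at optimizer step `step` (0-indexed) during linear warmup."""
    if step >= warmup_steps:
        return base_lr  # warmup finished; the decay phase is not modeled in this sketch
    return base_lr * step / warmup_steps


for step in (0, 50, 99):
    print(f"step {step:>2}: lr = {linear_warmup_lr(step):.2e}")
# step  0: lr = 0.00e+00
# step 50: lr = 2.50e-05
# step 99: lr = 4.95e-05
```

Under this assumption, none of the 100 steps runs at the full learning rate of 5e-5, which is one reason the loss falls slowly at first.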

Start training process...



In [21]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 18.775 |
| 20 | 18.85 |
| 30 | 18.9875 |
| 40 | 18.375 |
| 50 | 17.6 |
| 60 | 16.8562 |
| 70 | 14.975 |
| 80 | 12.2562 |
| 90 | 8.9719 |
| 100 | 6.3375 |


TrainOutput(global_step=100, training_loss=15.1984375, metrics={'train_runtime': 33.116, 'train_samples_per_second': 24.157, 'train_steps_per_second': 3.02, 'total_flos': 547805881958400.0, 'train_loss': 15.1984375, 'epoch': 0.06418485237483953})
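
The `epoch` value in the output above shows how little of the training set this demonstration run covers. The following sketch reproduces that figure, assuming the default `per_device_train_batch_size` of 8 on a single device (the minimal configuration does not set a batch size).

```python
import math

TRAIN_EXAMPLES = 12_460   # size of the train split
BATCH_SIZE = 8            # assumption: default per_device_train_batch_size, one device
STEPS_RUN = 100           # max_steps in the minimal configuration

batches_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
epochs_done = STEPS_RUN / batches_per_epoch
examples_seen = STEPS_RUN * BATCH_SIZE

print(batches_per_epoch)       # 1558
print(f"{epochs_done:.4f}")    # 0.0642
print(examples_seen)           # 800
```

The run therefore sees only 800 of the 12,460 training examples, about 6% of one epoch, so the resulting checkpoint is a lightly fine-tuned model rather than a fully trained one.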



Load the fine-tuned (instruct) model from the checkpoint saved at step 100 as a new `AutoModelForSeq2SeqLM` instance:

In [33]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan_t5_finetuned-1764860153/checkpoint-100", dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

Asking yourself "Is my model behaving the way it is supposed to?" is usually a good starting point for evaluating a fine-tuned model.

In [54]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

device = "cuda" if torch.cuda.is_available() else "cpu"

original_model = original_model.to(device)
instruct_model = instruct_model.to(device)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
# original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(f'Original Model:{dash_line}\n{original_model_text_output}\n')
print(f'Fine-Tuned Model:{dash_line}\n{instruct_model_text_output}\n')
print(f'Human:{dash_line}\n{human_baseline_summary}\n')


Original Model:---------------------------------------------------------------------------------------------------
You could consider adding a painting program to your software.

Fine-Tuned Model:---------------------------------------------------------------------------------------------------
Upgrade your computer.

Human:---------------------------------------------------------------------------------------------------
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.



<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) quantifies the quality of model-generated summaries by measuring their n-gram overlap with a reference ("baseline") summary, which is usually written by a human. While not perfect, it indicates how much fine-tuning has improved summarization overall.
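
To make the idea of overlap concrete, here is a deliberately simplified ROUGE-1 F1 computation. It is only a sketch: it uses lowercase whitespace tokenization and no stemming, so its scores will not match those of the `rouge_score` package exactly. The function name and the example sentences are illustrative.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # share of predicted words found in the reference
    recall = overlap / len(ref_tokens)       # share of reference words found in the prediction
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("masha and hero are getting divorced", "masha and hero get divorced")
print(round(score, 3))
# 0.727
```

Four of the six predicted words appear in the five-word reference, giving a precision of 4/6 and a recall of 4/5. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence.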

In [55]:
rouge = evaluate.load('rouge')

Generate summaries for a small sample of the test set (only the first 10 dialogues, to save time) and store the results.

In [56]:
device = "cuda" if torch.cuda.is_available() else "cpu"

original_model = original_model.to(device)
instruct_model = instruct_model.to(device)

dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])

In [57]:
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | The following memos are for the employees of t... | Memo to be distributed to all employees by thi... |
| 1 | In order to prevent employees from wasting tim... | The memo is a memo to employees and is intende... | Memo to be distributed to all employees by thi... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Employees and others are asked to take a dicta... | Memo to be distributed to all employees by thi... |
| 3 | #Person2# arrives late because of traffic jam.... | You're finally here. What took so long? | Take public transport to work. |
| 4 | #Person2# decides to follow #Person1#'s sugges... | I'm going to try to get to work by bike. | Take public transport to work. |
| 5 | #Person2# complains to #Person1# about the tra... | The conversation is about a new job. | Take public transport to work. |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are divorced. | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are in a divorce. | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | Happy birthday Brian! | Brian's birthday is today. |


Evaluate both models by computing ROUGE metrics.

In [84]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True, # returns scores averaged over all samples (not one score per sample)
    use_stemmer=True, # stem words before matching, so inflected forms count as overlap
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

In [85]:
original_model_rouge = [x.item() for x in original_model_results.values()]
instruct_model_rouge = [x.item() for x in instruct_model_results.values()]

zipped_rouge = list(zip(original_model_rouge, instruct_model_rouge))
df_rouge = pd.DataFrame(zipped_rouge, columns=['original_model', 'instruct_model'])

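The instruct (fine-tuned) model obtains **higher ROUGE** scores than the original model on this sample. Rows 0 to 3 of the table correspond to `rouge1`, `rouge2`, `rougeL`, and `rougeLsum`, in the order returned by `rouge.compute`.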

In [86]:
df_rouge

| | original_model | instruct_model |
|---|---|---|
| 0 | 0.272481 | 0.302396 |
| 1 | 0.106857 | 0.154022 |
| 2 | 0.217989 | 0.249346 |
| 3 | 0.220144 | 0.252724 |


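The fine-tuned model scores higher on every ROUGE metric, although with only 10 test samples and 100 training steps the differences should be read as indicative rather than conclusive: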

In [89]:
print("Absolute improvement of INSTRUCT MODEL over ORIGINAL MODEL (percentage points)")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute improvement of INSTRUCT MODEL over ORIGINAL MODEL (percentage points)
rouge1: 2.99%
rouge2: 4.72%
rougeL: 3.14%
rougeLsum: 3.26%
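
The figures above are absolute differences between the two models' scores. As a complementary view, the sketch below recomputes them from the score table and adds the relative improvement over the original model. The scores are copied from the table above, and the dictionary names are illustrative.

```python
# ROUGE scores copied from the comparison table above.
original = {"rouge1": 0.272481, "rouge2": 0.106857, "rougeL": 0.217989, "rougeLsum": 0.220144}
fine_tuned = {"rouge1": 0.302396, "rouge2": 0.154022, "rougeL": 0.249346, "rougeLsum": 0.252724}

relative_gain = {}
for key in original:
    absolute = fine_tuned[key] - original[key]    # difference in ROUGE score
    relative_gain[key] = absolute / original[key]  # gain as a fraction of the original score
    print(f"{key}: +{absolute * 100:.2f} points absolute, +{relative_gain[key] * 100:.1f}% relative")
# rouge1: +2.99 points absolute, +11.0% relative
# rouge2: +4.72 points absolute, +44.1% relative
# rougeL: +3.14 points absolute, +14.4% relative
# rougeLsum: +3.26 points absolute, +14.8% relative
```

The relative view shows that the largest gain is in `rouge2`, where the original model's score was lowest.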
