## Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

<a name='1'></a>
### 1 - Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
#### 1.1 - Load Dependencies

In [1]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
#### 1.2 - Load Dataset and LLM

We are going to use the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics. 

In [2]:
huggingface_dataset = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `dtype=torch.bfloat16` loads the weights in bfloat16, which halves their memory footprint compared with float32; the code below falls back to float32 when no GPU is available.

In [3]:
model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with appropriate device mapping
# If CUDA is not available, force CPU usage
if torch.cuda.is_available():
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto", dtype=torch.bfloat16)
else:
    # Use float32 on CPU since bfloat16 may not be supported
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="cpu", dtype=torch.float32)


In [4]:
# determine the number of trainable parameters in the model
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
print_trainable_parameters(model)

trainable params: 247577856 || all params: 247577856 || trainable%: 100.0
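
As a rough check on what the choice of `dtype` means for memory, the parameter count printed above can be converted into a weight footprint. This is a back-of-the-envelope sketch: it assumes 2 bytes per bfloat16 value and 4 bytes per float32 value, and it counts weights only (no activations, gradients, or optimizer state). The helper name `weight_memory_gib` is illustrative and not part of the notebook.

```python
# Rough weight-only memory estimate for the parameter count printed above.
NUM_PARAMS = 247_577_856  # "all params" reported by print_trainable_parameters

# Assumed storage size per value for each dtype.
BYTES_PER_VALUE = {"bfloat16": 2, "float32": 4}


def weight_memory_gib(num_params: int, dtype_name: str) -> float:
    """Return the size of the weights alone, in GiB, for the given dtype."""
    return num_params * BYTES_PER_VALUE[dtype_name] / 1024**3


for name in BYTES_PER_VALUE:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, name):.2f} GiB")
```

Training needs considerably more than this, since gradients and optimizer state are stored in addition to the weights.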


<a name='1.3'></a>
#### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference: the prompt contains only the instruction and the dialogue, with no worked examples.

In [5]:
# test the model with zero-shot learning
index = 100

dialogue = dataset["test"][index]["dialogue"]
summary = dataset["test"][index]["summary"]

prompt = f"""summarize the following conversation: 
{dialogue}
summary: """

# Use CUDA if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=100)[0], 
    skip_special_tokens=True)

dashline = "-" * 100
print(dashline)
print(f"Dialogue: {dialogue}\n")
print(f"Summary: {summary}\n")
print(dashline)
print(f"Generated Summary: {outputs}\n")


Using device: cuda
----------------------------------------------------------------------------------------------------
Dialogue: #Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

Summary: #Person1# and Mike have a disagreement on h
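
Because the same instruction template is reused in later cells, one option is to build it in a single helper so that training and evaluation prompts stay character-for-character identical. The function name `build_prompt` is hypothetical and not part of the notebook; the template text is copied from the zero-shot cell above.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the summarization instruction used in this notebook."""
    # Explicit "\n" escapes avoid picking up indentation when the call sits
    # inside a loop or function body, unlike an indented triple-quoted string.
    return f"summarize the following conversation: \n{dialogue}\nsummary: "


# Toy dialogue for illustration only (not taken from the dataset).
example = build_prompt("#Person1#: Hello.\n#Person2#: Hi there.")
print(example)
```

Keeping the template in one place makes it harder for whitespace differences to creep in between the prompt seen during fine-tuning and the one used at inference.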

<a name='2'></a>
### 2 - Full Fine-Tuning

<a name='2.1'></a>
#### 2.1 - Preprocess the Dialog-Summary Dataset

In [6]:
# define a tokenization function to preprocess the dataset
def tokenize_function(examples):
    inputs = [
        f"summarize the following conversation: \n{dialogue}\nsummary: " for dialogue in examples["dialogue"]
    ]
    examples["input_ids"] = tokenizer(inputs, padding="max_length", truncation=True, return_tensors="pt").input_ids
    examples["labels"] = tokenizer(
        examples["summary"], padding="max_length", truncation=True, return_tensors="pt"
    ).input_ids

    return examples
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["dialogue", "summary", "topic", "id"])
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names
tokenized_datasets["train"].features


{'input_ids': List(Value('int32')), 'labels': List(Value('int64'))}
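
The tokenization function above pads the labels to the maximum length, so padding positions are scored by the loss like real tokens. A common practice for Hugging Face seq2seq training is to replace padding ids in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper, and the pad id of `0` is an assumption that would normally be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label row with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in row]
        for row in label_ids
    ]


# Toy label rows padded with 0 (the pad id is an assumption here;
# in the notebook it would come from tokenizer.pad_token_id).
toy_labels = [[363, 19, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Whether this changes the results reported in this notebook has not been measured here; it is offered as an option to try.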

In [7]:
# subsample the dataset for quicker training and lower memory use (keep every 50th example)
tokenized_datasets = tokenized_datasets.filter(lambda example, idx: idx % 50 == 0, with_indices=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 250
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 10
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 30
    })
})
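
The row counts above follow directly from the filter condition `idx % 50 == 0`. The short sketch below reproduces them from the original split sizes; `rows_kept` is an illustrative helper, not a `datasets` API.

```python
def rows_kept(num_rows: int, stride: int) -> int:
    """Number of indices in [0, num_rows) that satisfy idx % stride == 0."""
    return len(range(0, num_rows, stride))


# Split sizes as printed when the dataset was loaded.
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

for split, n in split_sizes.items():
    kept = rows_kept(n, stride=50)
    print(f"{split}: {kept} of {n} rows ({kept / n:.1%})")
```

Lowering the stride keeps proportionally more data at the cost of longer training.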

<a name='2.2'></a>
#### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

We will utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)).

In [8]:
# define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1,  # for quicker testing; remove or increase for actual training
    per_device_train_batch_size=1,  # Reduce batch size to save memory
    gradient_accumulation_steps=4,  # Accumulate gradients to simulate larger batch
    gradient_checkpointing=True,  # Enable gradient checkpointing to save memory
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),  # Use mixed precision if available
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),  # Use bfloat16 if supported
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)


The model is already on multiple devices. Skipping the move to device specified in `args`.


In [9]:
torch.cuda.empty_cache()

In [10]:
# train the model

trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
1,47.2116


TrainOutput(global_step=1, training_loss=47.21155548095703, metrics={'train_runtime': 3.5975, 'train_samples_per_second': 1.112, 'train_steps_per_second': 0.278, 'total_flos': 2739029409792.0, 'train_loss': 47.21155548095703, 'epoch': 0.016})
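
To put the single training step in perspective, the sketch below works out how many examples one optimizer step consumes under the arguments above and what share of the 250-row training subset that represents. It assumes a single device; the function names are illustrative.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Examples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def fraction_of_epoch(optimizer_steps: int, eff_batch: int, num_examples: int) -> float:
    """Share of the training split consumed after the given number of steps."""
    return optimizer_steps * eff_batch / num_examples


# Values taken from the TrainingArguments and the subsampled train split.
eff = effective_batch_size(per_device_batch=1, grad_accum_steps=4)
print(eff, fraction_of_epoch(optimizer_steps=1, eff_batch=eff, num_examples=250))
```

With `max_steps=1` the model sees 4 of the 250 examples, which is consistent with the `'epoch': 0.016` value in the output above.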

In [11]:
# save the model
output_dir = "./finetuned-flan-t5-base"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model and tokenizer saved to {output_dir}")

Model and tokenizer saved to ./finetuned-flan-t5-base


In [12]:
# Load the fine-tuned model and tokenizer
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(output_dir, device_map="auto", dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32)
finetuned_tokenizer = AutoTokenizer.from_pretrained(output_dir)

<a name='2.3'></a>
#### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)


In [13]:
# Evaluate the fine-tuned model on a few test examples
for index in [10, 50, 100]:
    dialogue = dataset["test"][index]["dialogue"]
    baseline_summary = dataset["test"][index]["summary"]

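    # Build the prompt with explicit newlines so the loop indentation does not leak
    # into the text and the template matches the one used for preprocessing
    prompt = f"summarize the following conversation: \n{dialogue}\nsummary: "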

    inputs = finetuned_tokenizer(prompt, return_tensors="pt").to(device)

    original_model_output = model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))
    original_summary = tokenizer.decode(original_model_output[0], skip_special_tokens=True)

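    # Use the same generation settings as for the original model so the outputs are comparable
    finetuned_model_output = finetuned_model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))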
    finetuned_summary = finetuned_tokenizer.decode(finetuned_model_output[0], skip_special_tokens=True)

    print(dashline)
    print(f"Dialogue:\n{dialogue}")
    print(dashline)
    print(f"Baseline Summary:\n{baseline_summary}")
    print(dashline)
    print(f"Original Summary:\n{original_summary}")
    print(dashline)
    print(f"Fine-tuned Summary:\n{finetuned_summary}")




Caching is incompatible with gradient checkpointing in T5Block. Setting `past_key_values=None`.


----------------------------------------------------------------------------------------------------
Dialogue:
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
----------------------------------------------------------------------------------------------------
Baseline Summary:
#Person1# attends Brian's birthday party. Brian thinks #Person1# looks great and charming.
-----------

<a name='2.4'></a>
#### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) quantifies summary quality by measuring the n-gram and longest-common-subsequence overlap between a generated summary and a "baseline" (reference) summary, which is usually written by a human. While not perfect, it does indicate the overall increase in summarization effectiveness accomplished by fine-tuning.
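
To make the overlap idea concrete, here is a simplified ROUGE-1 F1 computation in plain Python. It is a sketch under stated assumptions (lowercasing and whitespace tokenization, no stemming), so its numbers will not necessarily match those returned by the `evaluate` library. The function name `rouge1_f1` and the two example sentences are illustrative.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences for illustration only.
score = rouge1_f1(
    "the doctor takes a blood test",
    "the doctor will take a blood test",
)
print(f"ROUGE-1 F1: {score:.3f}")
```

Here 5 of the 6 predicted tokens appear in the 7-token reference, giving precision 5/6, recall 5/7, and an F1 of 10/13.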

In [15]:
# Select 10 random samples from the test set for evaluation
np.random.seed(42)
random_indices = np.random.choice(len(dataset["test"]), size=10, replace=False)
samples = [dataset["test"][i] for i in random_indices]
# Prepare references and predictions for ROUGE evaluation
references = [sample["summary"] for sample in samples]
original_predictions = []
finetuned_predictions = []
for sample in samples:
    dialogue = sample["dialogue"]
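    # Build the prompt with explicit newlines so the loop indentation does not leak
    # into the text and the template matches the one used for preprocessing
    prompt = f"summarize the following conversation: \n{dialogue}\nsummary: "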
    inputs = finetuned_tokenizer(prompt, return_tensors="pt").to(device)
    finetuned_model_output = finetuned_model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))
    finetuned_summary = finetuned_tokenizer.decode(finetuned_model_output[0], skip_special_tokens=True)
    finetuned_predictions.append(finetuned_summary)

    original_model_output = model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))
    original_summary = tokenizer.decode(original_model_output[0], skip_special_tokens=True)
    original_predictions.append(original_summary)

# Zip the references and predictions together for evaluation and generate a dataframe
results = list(zip(references, original_predictions, finetuned_predictions))
df = pd.DataFrame(results, columns=["Reference", "Original Prediction", "Fine-tuned Prediction"])
df


`generation_config` default values have been modified to match model-specific defaults: {'pad_token_id': 0, 'eos_token_id': 1, 'decoder_start_token_id': 0}. If this is not desired, please set these values explicitly.


Unnamed: 0,Reference,Original Prediction,Fine-tuned Prediction
0,#Person1# wants to know the result of a interv...,# # # # # # # # # # # # # # # # # # # # # # # ...,Interviewers ask for the results of the interv...
1,#Person1# and #Person2# are appreciating lante...,People None,The Lantern Festival is taking place in Beijing.
2,#Person1# apologises to #Person2# after the qu...,Person.,"Person1: Honey, I'm sorry I didn't call you."
3,#Person1# asks for #Person2#'s help to print u...,Person mj,#Person1#: I need to edit my paper.
4,#Person2#'s attachment exceeds the e-mail capa...,The___,"#Person1#: I'm sorry, I can't send out this e-..."
5,#Person1# and #Person2# are talking about natu...,People,The earthquake in Wenchuan in China is a natur...
6,#Person1# feels bored at home and asks Jim go ...,# # # # # # # # # # # #,Jim and Mary are going to the gym. They are go...
7,#Person1# complains to Tony that Christmas has...,# # # # # # # # # # # # # # # # # # # # # # # ...,"#Person1: Hi, Tony."
8,#Person2# draws #Person1#'s blood to check whi...,The The............sssssssssssssssssssssssssss,The doctor will take a blood test.
9,#Person1# helps #Person2# pick a gift for #Per...,It.,#Person1#: I'm looking for a nice gift for my ...


In [20]:
# compute ROUGE scores for the fine-tuned and original models on the same samples
rouge = evaluate.load("rouge")
rouge.add_batch(predictions=df["Fine-tuned Prediction"].tolist(), references=df["Reference"].tolist())
rouge_finetuned_scores = rouge.compute()
print("Fine-tuned Model ROUGE Scores:", rouge_finetuned_scores)

rouge_original = evaluate.load("rouge")
rouge_original.add_batch(predictions=df["Original Prediction"].tolist(), references=df["Reference"].tolist())
rouge_original_scores = rouge_original.compute()
print("Original Model ROUGE Scores:", rouge_original_scores)


Fine-tuned Model ROUGE Scores: {'rouge1': np.float64(0.22936643822127695), 'rouge2': np.float64(0.05641113351684296), 'rougeL': np.float64(0.1894295167198393), 'rougeLsum': np.float64(0.18803225806451612)}
Original Model ROUGE Scores: {'rouge1': np.float64(0.015384615384615385), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.015384615384615385), 'rougeLsum': np.float64(0.015384615384615385)}


In [24]:
# compute the relative change in ROUGE scores, expressed as a fraction of the
# original score (multiply by 100 for a percentage)
def compute_rouge_difference(rouge_finetuned, rouge_original):
    difference = {}
    for key in rouge_finetuned.keys():
        if rouge_original.get(key, 0) != 0:
            difference[key] = ((rouge_finetuned[key] - rouge_original[key]) / rouge_original[key])
        else:
            # the relative change is undefined for a zero baseline; 0 is only a placeholder
            difference[key] = 0
    return difference
difference = compute_rouge_difference(rouge_finetuned_scores, rouge_original_scores)
print("Difference in ROUGE Scores (Fine-tuned - Original):", difference)


Difference in ROUGE Scores (Fine-tuned - Original): {'rouge1': np.float64(13.908818484383001), 'rouge2': 0, 'rougeL': np.float64(11.312918586789554), 'rougeLsum': np.float64(11.222096774193547)}


<a name='3'></a>
### 3 - Parameter Efficient Fine-Tuning (PEFT)

**Parameter Efficient Fine-Tuning (PEFT)** is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results. PEFT includes **Low-Rank Adaptation (LoRA)**. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs). At inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. 

<a name='3.1'></a>
#### 3.1 - Set Up the PEFT/LoRA Model for Fine-Tuning

In [25]:
# Set up a PEFT/LoRA model for fine-tuning
from peft import get_peft_model, LoraConfig, TaskType

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],  # Targeting query and value projection layers
    bias="none" # No bias adaptation
)
# Create the PEFT/LoRA model (this will modify the original model in place and return it as well)
lora_model = get_peft_model(model, lora_config)
print_trainable_parameters(lora_model)

trainable params: 1769472 || all params: 249347328 || trainable%: 0.7096414524241463
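
The trainable-parameter count above can be reproduced by hand. For each targeted weight matrix of shape `d_out x d_in`, LoRA adds two small matrices with `r * d_in + d_out * r` parameters in total. The sketch below assumes the FLAN-T5-base shapes (hidden size 768, attention inner size 768, 12 encoder layers with self-attention, 12 decoder layers with self-attention and cross-attention); the helper name is illustrative.

```python
def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * d_in + d_out * r


# Assumed FLAN-T5-base shapes (see the lead-in above).
d_model = 768
attention_blocks = 12 + 12 * 2  # encoder self-attn + decoder self-attn and cross-attn
targeted_per_block = 2          # the "q" and "v" projections
rank = 16                       # r in the LoraConfig above

total = attention_blocks * targeted_per_block * lora_params_per_module(d_model, d_model, rank)
print(total)
```

Under these assumptions the result is 1,769,472, matching both the `trainable params` value and the difference between the two `all params` counts printed in this notebook (249,347,328 and 247,577,856).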


<a name='3.2'></a>
#### 3.2 - Train PEFT Adapter

In [28]:
peft_training_args = TrainingArguments(
    output_dir="./peft-results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    logging_dir="./peft-results/logs",
    logging_steps=1,
    max_steps=1,  # for quicker testing; remove or increase for actual training
    gradient_checkpointing=True,  # Enable gradient checkpointing to save memory
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),  # Use mixed precision if available
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
)

trainer = Trainer(
    model=lora_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

The model is already on multiple devices. Skipping the move to device specified in `args`.


In [29]:
# Clear GPU memory before training
torch.cuda.empty_cache()

In [30]:
# Train the PEFT/LoRA model
trainer.train()

Step,Training Loss
1,47.4037


TrainOutput(global_step=1, training_loss=47.403656005859375, metrics={'train_runtime': 1.4453, 'train_samples_per_second': 2.768, 'train_steps_per_second': 0.692, 'total_flos': 2760772681728.0, 'train_loss': 47.403656005859375, 'epoch': 0.016})

In [31]:
# Save the trained PEFT/LoRA adapter locally
lora_model.save_pretrained("./peft-results")

In [36]:
from peft import PeftModel, PeftConfig
# Load the PEFT/LoRA model for inference
peft_base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# is_trainable=False loads the adapter frozen, for inference only
peft_model = PeftModel.from_pretrained(peft_base_model, "./peft-results", is_trainable=False, dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32)

print_trainable_parameters(peft_model)

trainable params: 0 || all params: 249347328 || trainable%: 0.0


<a name='3.3'></a>
#### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

In [38]:
# Evaluate the model qualitatively (human evaluation)
index = 100
dialogue = dataset["test"][index]["dialogue"]
base_summary = dataset["test"][index]["summary"]
prompt = f"""summarize the following conversation:
{dialogue}
summary: """
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Move models to the correct device if needed
peft_model = peft_model.to(device)
finetuned_model = finetuned_model.to(device)
model = model.to(device)

peft_outputs = tokenizer.decode(
    peft_model.generate(**inputs, max_new_tokens=200)[0], 
    skip_special_tokens=True)

original_outputs = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=200)[0],
    skip_special_tokens=True)

finetuned_outputs = tokenizer.decode(
    finetuned_model.generate(**inputs, max_new_tokens=200)[0],
    skip_special_tokens=True)

print(dashline)
print(f"Dialogue:\n{dialogue}\n")
print(dashline)
print(f"Base Summary:\n{base_summary}\n")
print(dashline)
print(f"PEFT Model Output:\n{peft_outputs}\n")
print(dashline)
print(f"Fine-tuned Model Output:\n{finetuned_outputs}\n")
print(dashline)
print(f"Original Model Output:\n{original_outputs}\n")



----------------------------------------------------------------------------------------------------
Dialogue:
#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

-----------------------------------------------------------------------

<a name='3.4'></a>
#### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

In [39]:
# Compare the original, fully fine-tuned, and PEFT models on the same samples

# Select 10 random samples from the test set for evaluation
np.random.seed(42)
random_indices = np.random.choice(len(dataset["test"]), size=10, replace=False)
samples = [dataset["test"][i] for i in random_indices]
# Prepare references and predictions for ROUGE evaluation
references = [sample["summary"] for sample in samples]
original_predictions = []
finetuned_predictions = []
peft_predictions = []
for sample in samples:
    dialogue = sample["dialogue"]
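    # Build the prompt with explicit newlines so the loop indentation does not leak
    # into the text and the template matches the one used for preprocessing
    prompt = f"summarize the following conversation: \n{dialogue}\nsummary: "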
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    # PEFT model predictions
    peft_model_output = peft_model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))
    peft_summary = tokenizer.decode(peft_model_output[0], skip_special_tokens=True)
    peft_predictions.append(peft_summary)
    
    # Fine-tuned model predictions
    finetuned_model_output = finetuned_model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))
    finetuned_summary = finetuned_tokenizer.decode(finetuned_model_output[0], skip_special_tokens=True)
    finetuned_predictions.append(finetuned_summary)

    # Original model predictions
    original_model_output = model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=200))
    original_summary = tokenizer.decode(original_model_output[0], skip_special_tokens=True)
    original_predictions.append(original_summary)

# Zip the references and predictions together for evaluation and generate a dataframe
results = list(zip(references, original_predictions, finetuned_predictions, peft_predictions))
df = pd.DataFrame(results, columns=["Reference", "Original Prediction", "Fine-tuned Prediction", "PEFT Prediction"])
df



Unnamed: 0,Reference,Original Prediction,Fine-tuned Prediction,PEFT Prediction
0,#Person1# wants to know the result of a interv...,You ...................................,Interviewers ask for the results of the interv...,Interviewers ask for the results of the interv...
1,#Person1# and #Person2# are appreciating lante...,People,The Lantern Festival is taking place in Beijing.,The Lantern Festival is taking place in Beijing.
2,#Person1# apologises to #Person2# after the qu...,Person ......,"Person1: Honey, I'm sorry I didn't call you.","Person1: Honey, I'm sorry I didn't call you."
3,#Person1# asks for #Person2#'s help to print u...,# #,#Person1#: I need to edit my paper.,#Person1#: I need to edit my paper.
4,#Person2#'s attachment exceeds the e-mail capa...,# # # # # # # # # # # # # # # # # # # # # # # ...,"#Person1#: I'm sorry, I can't send out this e-...","#Person1#: I'm sorry, I can't send out this e-..."
5,#Person1# and #Person2# are talking about natu...,# # # # # # # # # # # # # # # # # # # # # # # ...,The earthquake in Wenchuan in China is a natur...,The earthquake in Wenchuan in China is a natur...
6,#Person1# feels bored at home and asks Jim go ...,Person pzzi,Jim and Mary are going to the gym. They are go...,Jim and Mary are going to the gym. They are go...
7,#Person1# complains to Tony that Christmas has...,A A..............................................,"#Person1: Hi, Tony.",The toy department at the shopping center is b...
8,#Person2# draws #Person1#'s blood to check whi...,Theo,The doctor will take a blood test.,The doctor will take a blood test.
9,#Person1# helps #Person2# pick a gift for #Per...,# # # # # # # # # # # # # # # # # # # # # # # ...,#Person1#: I'm looking for a nice gift for my ...,#Person1#: I'm looking for a nice gift for my ...


In [40]:
# Compute ROUGE scores for all three models using the dataframe and calculate percentage differences
rouge = evaluate.load("rouge")
original_rouge = rouge.compute(predictions=df["Original Prediction"].tolist(), references=df["Reference"].tolist())
finetuned_rouge = rouge.compute(predictions=df["Fine-tuned Prediction"].tolist(), references=df["Reference"].tolist())
peft_rouge = rouge.compute(predictions=df["PEFT Prediction"].tolist(), references=df["Reference"].tolist())

print("Original Model ROUGE Scores:", original_rouge)
print("Fine-tuned Model ROUGE Scores:", finetuned_rouge)
print("PEFT Model ROUGE Scores:", peft_rouge)

# Calculate percentage differences
def compute_percentage_difference(scores1, scores2, baseline_scores):
    """Compute percentage difference between scores1 and scores2 relative to baseline"""
    difference = {}
    for key in scores1.keys():
        if baseline_scores.get(key, 0) != 0:
            diff1 = ((scores1[key] - baseline_scores[key]) / baseline_scores[key]) * 100
            diff2 = ((scores2[key] - baseline_scores[key]) / baseline_scores[key]) * 100
            difference[f"{key}_finetuned_vs_original"] = diff1
            difference[f"{key}_peft_vs_original"] = diff2
        else:
            # Percentage change is undefined for a zero baseline; 0 is only a placeholder
            difference[f"{key}_finetuned_vs_original"] = 0
            difference[f"{key}_peft_vs_original"] = 0
    return difference

percentage_differences = compute_percentage_difference(finetuned_rouge, peft_rouge, original_rouge)
print("\nPercentage differences (relative to original model):")
for key, value in percentage_differences.items():
    print(f"{key}: {value:.2f}%")


Original Model ROUGE Scores: {'rouge1': np.float64(0.0), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.0), 'rougeLsum': np.float64(0.0)}
Fine-tuned Model ROUGE Scores: {'rouge1': np.float64(0.22936643822127695), 'rouge2': np.float64(0.05641113351684296), 'rougeL': np.float64(0.1894295167198393), 'rougeLsum': np.float64(0.18803225806451612)}
PEFT Model ROUGE Scores: {'rouge1': np.float64(0.19810933868998387), 'rouge2': np.float64(0.05641113351684296), 'rougeL': np.float64(0.1591181220213478), 'rougeLsum': np.float64(0.15888412304541336)}

Percentage differences (relative to original model):
rouge1_finetuned_vs_original: 0.00%
rouge1_peft_vs_original: 0.00%
rouge2_finetuned_vs_original: 0.00%
rouge2_peft_vs_original: 0.00%
rougeL_finetuned_vs_original: 0.00%
rougeL_peft_vs_original: 0.00%
rougeLsum_finetuned_vs_original: 0.00%
rougeLsum_peft_vs_original: 0.00%
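
All percentages above print as `0.00%` because the original model's ROUGE scores are zero, which makes a relative change undefined rather than zero. One way to make that visible is to report the absolute difference alongside a relative change that becomes NaN for a zero baseline. The sketch below is one possible approach; the function names are hypothetical, and the example scores are the `rouge1` values printed above, rounded to four decimals.

```python
import math


def relative_change_pct(new: float, baseline: float) -> float:
    """Percentage change of `new` relative to `baseline`; NaN if baseline is zero."""
    if baseline == 0:
        return math.nan
    return (new - baseline) / baseline * 100.0


def compare_scores(candidate: dict, baseline: dict) -> dict:
    """Absolute and relative differences for every metric in `candidate`."""
    return {
        key: {
            "absolute": candidate[key] - baseline[key],
            "relative_pct": relative_change_pct(candidate[key], baseline[key]),
        }
        for key in candidate
    }


# rouge1 values rounded from the outputs above.
original = {"rouge1": 0.0}
finetuned = {"rouge1": 0.2294}
peft = {"rouge1": 0.1981}

print("fine-tuned vs original:", compare_scores(finetuned, original))
print("PEFT vs fine-tuned:", compare_scores(peft, finetuned))
```

Comparing the PEFT model against the fully fine-tuned model, as in the last line, avoids the zero baseline altogether.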


### 4 - GPU-Optimized Training Improvements

This section collects GPU-oriented settings (TF32 matrix multiplication, batch-size probing, and use of more training data) aimed at faster training and better model performance on your local GPU.

#### 4.1 - GPU Memory Optimization and Analysis

In [41]:
# GPU Memory Analysis and Optimization Setup
import torch

def print_gpu_memory():
    """Print current GPU memory usage"""
    if torch.cuda.is_available():
        print(f"GPU Device: {torch.cuda.get_device_name(0)}")
        print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
        print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
        print(f"Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
        print(f"Free: {(torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_reserved()) / 1024**3:.2f} GB")
    else:
        print("No GPU available")

# Enable optimizations for faster training on modern GPUs
if torch.cuda.is_available():
    # Enable TF32 for faster training on Ampere and newer GPUs (RTX 30xx, A100, etc.)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    print("✓ TF32 enabled for faster training")

# Clear GPU cache
torch.cuda.empty_cache()
print_gpu_memory()

✓ TF32 enabled for faster training
GPU Device: NVIDIA GeForce RTX 4050 Laptop GPU
Total Memory: 6.00 GB
Allocated: 1.47 GB
Reserved: 1.49 GB
Free: 4.51 GB


In [43]:
# Find optimal batch size for your GPU
def find_max_batch_size(test_model, test_dataset, max_test=16):
    """
    Try batch sizes in increasing powers of two (up to max_test) and return
    the largest one that trains without an out-of-memory (OOM) error.
    """
    print("Finding optimal batch size for your GPU...")
    max_working_batch = 1
    
    for batch_size in [1, 2, 4, 8, 16, 32]:
        if batch_size > max_test:
            break
        
        try:
            print(f"Testing batch_size={batch_size}...", end=" ")
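            torch.cuda.empty_cache()
            # Reset the peak counter so the reported memory reflects this batch size only
            torch.cuda.reset_peak_memory_stats()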
            
            test_args = TrainingArguments(
                output_dir="./temp",
                per_device_train_batch_size=batch_size,
                max_steps=2,
                logging_steps=1,
                gradient_checkpointing=True,
                fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
                bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
            )
            
            test_trainer = Trainer(
                model=test_model,
                args=test_args,
                train_dataset=test_dataset.select(range(min(50, len(test_dataset)))),
            )
            
            test_trainer.train()
            print(f"✓ Works! (Memory used: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB)")
            max_working_batch = batch_size
            torch.cuda.empty_cache()
            
        except RuntimeError as e:
            if "out of memory" in str(e):
                print(f"✗ OOM - Maximum batch size is {max_working_batch}")
                torch.cuda.empty_cache()
                break
            else:
                raise
    
    print(f"\n✓ Optimal batch size: {max_working_batch}")
    print(f"  Recommended: batch_size={max_working_batch}, gradient_accumulation=8")
    print(f"  Effective batch size: {max_working_batch * 8}")
    return max_working_batch


optimal_batch_size = find_max_batch_size(model, tokenized_datasets["train"])
print(f"\nOptimal batch size determined: {optimal_batch_size}")

The model is already on multiple devices. Skipping the move to device specified in `args`.


Finding optimal batch size for your GPU...
Testing batch_size=1... 

Step,Training Loss
1,45.8489
2,52.4094


The model is already on multiple devices. Skipping the move to device specified in `args`.


✓ Works! (Memory used: 2.47 GB)
Testing batch_size=2... 

Step,Training Loss
1,48.9578
2,54.8807




✓ Works! (Memory used: 2.47 GB)
Testing batch_size=4... 

Step,Training Loss
1,47.4978
2,50.1537




✓ Works! (Memory used: 2.47 GB)
Testing batch_size=8... 

Step,Training Loss
1,47.5208
2,50.8496




✓ Works! (Memory used: 3.32 GB)
Testing batch_size=16... 

Step,Training Loss
1,48.6906
2,49.4411


✓ Works! (Memory used: 5.15 GB)

✓ Optimal batch size: 16
  Recommended: batch_size=16, gradient_accumulation=8
  Effective batch size: 128

Optimal batch size determined: 16
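
The probe reports the largest per-device batch size that fit in memory. The effective batch size seen by the optimizer is that value multiplied by the gradient accumulation steps (and by the number of devices). As a rough sketch, the hypothetical helper `accumulation_steps_for` below (not part of the notebook or of `transformers`) picks the accumulation steps needed to reach a target effective batch size:

```python
import math


def accumulation_steps_for(target_effective: int, per_device_batch: int, num_devices: int = 1) -> int:
    """Smallest number of accumulation steps that reaches the target effective batch size."""
    if target_effective <= 0 or per_device_batch <= 0 or num_devices <= 0:
        raise ValueError("all arguments must be positive")
    return max(1, math.ceil(target_effective / (per_device_batch * num_devices)))


# Values taken from the probe output: batch_size=16 fit on this GPU.
steps = accumulation_steps_for(target_effective=128, per_device_batch=16)
print(steps, 16 * steps)  # 8 accumulation steps -> effective batch size 128

# A smaller per-device batch needs more accumulation for the same effective size.
print(accumulation_steps_for(target_effective=128, per_device_batch=2))  # 64
```

Working backwards from a target effective batch size makes it easier to keep that size constant when the per-device batch has to shrink on a smaller GPU.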


### 4.2 - Improved Data Preprocessing

**Current Issue**: The tokenized dataset keeps only 2% of the available data (every 50th sample).

**Solution**: Train on a larger share of the data, or all of it, so the model sees enough examples to learn from.

In [44]:
# Analyze current dataset usage
print("=" * 80)
print("DATASET ANALYSIS")
print("=" * 80)

# Original dataset sizes
print(f"\nOriginal dataset sizes:")
print(f"  Train: {len(dataset['train']):,} samples")
print(f"  Validation: {len(dataset['validation']):,} samples")
print(f"  Test: {len(dataset['test']):,} samples")

# Current tokenized dataset (with filtering)
print(f"\nCurrent tokenized dataset (filtered every 50th):")
print(f"  Train: {len(tokenized_datasets['train']):,} samples ({len(tokenized_datasets['train'])/len(dataset['train'])*100:.1f}%)")
print(f"  Validation: {len(tokenized_datasets['validation']):,} samples ({len(tokenized_datasets['validation'])/len(dataset['validation'])*100:.1f}%)")
print(f"  Test: {len(tokenized_datasets['test']):,} samples ({len(tokenized_datasets['test'])/len(dataset['test'])*100:.1f}%)")

# Analyze dialogue and summary lengths (whitespace-separated words, not tokens)
length_sample = dataset["train"].select(range(min(1000, len(dataset["train"]))))
dialogue_lengths = [len(ex["dialogue"].split()) for ex in length_sample]
summary_lengths = [len(ex["summary"].split()) for ex in length_sample]

print(f"\nData statistics (sample of 1000):")
print(f"  Dialogue length - Mean: {np.mean(dialogue_lengths):.1f}, Median: {np.median(dialogue_lengths):.1f}, Std: {np.std(dialogue_lengths):.1f}")
print(f"  Summary length - Mean: {np.mean(summary_lengths):.1f}, Median: {np.median(summary_lengths):.1f}, Std: {np.std(summary_lengths):.1f}")
print(f"  Compression ratio: {np.mean(dialogue_lengths) / np.mean(summary_lengths):.2f}x")

print("\n" + "=" * 80)

DATASET ANALYSIS

Original dataset sizes:
  Train: 12,460 samples
  Validation: 500 samples
  Test: 1,500 samples

Current tokenized dataset (filtered every 50th):
  Train: 250 samples (2.0%)
  Validation: 10 samples (2.0%)
  Test: 30 samples (2.0%)

Data statistics (sample of 1000):
  Dialogue length - Mean: 130.1, Median: 118.0, Std: 67.0
  Summary length - Mean: 22.4, Median: 21.0, Std: 10.3
  Compression ratio: 5.81x
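
These statistics count whitespace-separated words, and tokenized lengths will generally be larger. When choosing a truncation length, a high percentile of the length distribution is usually more informative than the mean, because the distribution has a long tail (the standard deviation is about half the mean). A minimal sketch using only the standard library follows; `length_percentile` and the sample lengths are hypothetical and only illustrate the idea:

```python
import statistics


def length_percentile(lengths, pct):
    """Return the pct-th percentile (1 to 99) of a sequence of lengths."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return cut_points[pct - 1]


# Made-up word counts for illustration; in the notebook you would pass
# dialogue_lengths or summary_lengths instead.
example_lengths = [48, 75, 90, 102, 118, 131, 150, 176, 210, 340]

for pct in (50, 90, 95):
    print(f"p{pct}: {length_percentile(example_lengths, pct):.1f} words")
```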



In [45]:
# Create improved tokenized datasets with different data usage options

print("Creating improved tokenized datasets...")
print("\nChoose one of the following options:\n")

# Option 1: Use ALL data (recommended for best performance)
print("Option 1: USE ALL DATA (100%)")
print("  - Best performance, longer training time (~3-8 hours)")
print("  - Recommended if you have time for overnight training")
tokenized_datasets_full = dataset.map(tokenize_function, batched=True)
tokenized_datasets_full = tokenized_datasets_full.remove_columns(["dialogue", "summary", "topic", "id"])
tokenized_datasets_full.set_format("torch")
print(f"  Train samples: {len(tokenized_datasets_full['train']):,}")

# Option 2: Use 25% of data (good balance)
print("\nOption 2: USE 25% OF DATA")
print("  - Good balance of performance and speed (~1-2 hours)")
print("  - Recommended for initial improvements")
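# Note: filter() on a DatasetDict applies to every split, so validation and test are reduced to 25% as well
tokenized_datasets_25pct = tokenized_datasets_full.filter(lambda example, idx: idx % 4 == 0, with_indices=True)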
print(f"  Train samples: {len(tokenized_datasets_25pct['train']):,}")

# Option 3: Use 10% of data (faster training)
print("\nOption 3: USE 10% OF DATA")
print("  - Faster training (~30-60 minutes)")
print("  - Good for testing configurations")
tokenized_datasets_10pct = tokenized_datasets_full.filter(lambda example, idx: idx % 10 == 0, with_indices=True)
print(f"  Train samples: {len(tokenized_datasets_10pct['train']):,}")

print("\n" + "=" * 80)
print("RECOMMENDATION: Start with Option 2 (25%) to see quick improvements,")
print("then train with Option 1 (100%) overnight for best results.")
print("=" * 80)

# Set the dataset you want to use (change this variable to switch)
# Options: tokenized_datasets_full, tokenized_datasets_25pct, tokenized_datasets_10pct
improved_tokenized_datasets = tokenized_datasets_25pct  # Change this as needed

print(f"\n✓ Using dataset with {len(improved_tokenized_datasets['train']):,} training samples")

Creating improved tokenized datasets...

Choose one of the following options:

Option 1: USE ALL DATA (100%)
  - Best performance, longer training time (~3-8 hours)
  - Recommended if you have time for overnight training
  Train samples: 12,460

Option 2: USE 25% OF DATA
  - Good balance of performance and speed (~1-2 hours)
  - Recommended for initial improvements



  Train samples: 3,115

Option 3: USE 10% OF DATA
  - Faster training (~30-60 minutes)
  - Good for testing configurations



  Train samples: 1,246

RECOMMENDATION: Start with Option 2 (25%) to see quick improvements,
then train with Option 1 (100%) overnight for best results.

✓ Using dataset with 3,115 training samples
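
The subsampling keeps every index `i` with `i % k == 0`, so the number of rows kept is `ceil(n / k)`. Because `filter` is called on the whole `DatasetDict`, the same rule applies to the validation and test splits. The sketch below (a hypothetical helper, plain Python, no `datasets` dependency) reproduces the split sizes you should expect:

```python
def kept_by_modulo(num_rows: int, every: int) -> int:
    """Number of indices i in range(num_rows) with i % every == 0."""
    return (num_rows + every - 1) // every


# Split sizes from the dataset printout at the top of the notebook.
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

for every in (4, 10):
    kept = {name: kept_by_modulo(n, every) for name, n in split_sizes.items()}
    print(f"every {every}th: {kept}")
# every 4th: {'train': 3115, 'validation': 125, 'test': 375}
# every 10th: {'train': 1246, 'validation': 50, 'test': 150}
```

If you want to evaluate on the full validation and test sets while training on a subset, one option is to filter only the `train` split and reuse the other two splits from `tokenized_datasets_full`.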


### 4.3 - Optimized Training Configuration for Full Fine-tuning

**Current Issues**:
- `max_steps=1` (essentially no training!)
- Only 1 epoch
- Small batch size

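**Solutions**: Remove the step cap and train for several epochs, use gradient accumulation to reach a larger effective batch size, and enable mixed precision and gradient checkpointing to fit within GPU memory.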

In [48]:
# IMPROVED Training Configuration for Full Fine-tuning
from transformers import EarlyStoppingCallback

# Clear GPU memory before training
torch.cuda.empty_cache()

improved_training_args = TrainingArguments(
    output_dir="./results-improved",
    
    # LEARNING RATE & OPTIMIZATION
    learning_rate=2e-5,              # Conservative choice; T5 is often fine-tuned with 1e-4 to 3e-4
    lr_scheduler_type="cosine",      # Cosine annealing for better convergence
    warmup_ratio=0.1,                # Warmup for 10% of training
    weight_decay=0.01,               # Weight decay (decoupled, as in AdamW)
    max_grad_norm=1.0,               # Gradient clipping
    
    # TRAINING DURATION - CRITICAL FIX!
    num_train_epochs=3,              # 3 epochs instead of 1
    max_steps=-1,                    # REMOVE THE LIMIT! (was max_steps=1)
    
    # BATCH SIZE & ACCUMULATION
    per_device_train_batch_size=2,   # Adjust based on your GPU (2-8)
    per_device_eval_batch_size=4,    # Can be higher for evaluation
    gradient_accumulation_steps=8,   # Effective batch size = 2 * 8 = 16
    
    # MEMORY OPTIMIZATION
    gradient_checkpointing=True,     # Save memory at cost of speed
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    
    # LOGGING & EVALUATION
    logging_dir="./results-improved/logs",
    logging_steps=200,               # Log every 200 steps
    eval_strategy="steps",           # Evaluate during training
    eval_steps=200,                  # Evaluate every 200 steps
    save_strategy="steps",           # Save checkpoints
    save_steps=200,                  # Save every 200 steps
    save_total_limit=3,              # Keep at most 3 checkpoints on disk (the best one is always retained)
    load_best_model_at_end=True,     # Load best checkpoint at end
    metric_for_best_model="eval_loss", # Use eval loss to determine best model
    
    # PERFORMANCE
    report_to="none",                # Disable external logging (tensorboard requires installation)
    dataloader_num_workers=2,        # Parallel data loading
    
    # REPRODUCIBILITY
    seed=42,
)

print("=" * 80)
print("IMPROVED TRAINING CONFIGURATION")
print("=" * 80)
print(f"Epochs: {improved_training_args.num_train_epochs}")
print(f"Max steps: {improved_training_args.max_steps} (-1 = use epochs)")
print(f"Learning rate: {improved_training_args.learning_rate}")
print(f"Per-device batch size: {improved_training_args.per_device_train_batch_size}")
print(f"Gradient accumulation: {improved_training_args.gradient_accumulation_steps}")
print(f"Effective batch size: {improved_training_args.per_device_train_batch_size * improved_training_args.gradient_accumulation_steps}")
print(f"Precision: {'BF16' if improved_training_args.bf16 else 'FP16' if improved_training_args.fp16 else 'FP32'}")
print(f"Total training steps: ~{len(improved_tokenized_datasets['train']) // (improved_training_args.per_device_train_batch_size * improved_training_args.gradient_accumulation_steps) * improved_training_args.num_train_epochs}")
print("=" * 80)

# Create trainer with early stopping
improved_trainer = Trainer(
    model=model,
    args=improved_training_args,
    train_dataset=improved_tokenized_datasets["train"],
    eval_dataset=improved_tokenized_datasets["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Stop if no improvement for 3 evals
)

print("\n✓ Improved trainer created!")
print("\nTo train: improved_trainer.train()")
print(f"Expected training time: ~1-3 hours (depending on GPU and data size)")

The model is already on multiple devices. Skipping the move to device specified in `args`.


IMPROVED TRAINING CONFIGURATION
Epochs: 3
Max steps: -1 (-1 = use epochs)
Learning rate: 2e-05
Per-device batch size: 2
Gradient accumulation: 8
Effective batch size: 16
Precision: BF16
Total training steps: ~582

✓ Improved trainer created!

To train: improved_trainer.train()
Expected training time: ~1-3 hours (depending on GPU and data size)
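
The step count printed above is an estimate, and it also determines how many warmup steps and evaluations the run will have. The sketch below is a back-of-the-envelope calculation, not the `Trainer`'s own logic: `estimate_schedule` is a hypothetical helper, and the handling of a trailing partial accumulation window is an assumption that may differ between `transformers` versions, so both roundings are shown.

```python
import math


def estimate_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      eval_steps, warmup_ratio):
    """Rough optimizer-step, warmup and evaluation counts for a single-GPU run."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    # Assumption: a trailing partial accumulation window is either dropped (low)
    # or counted as a full optimizer step (high), depending on the library version.
    low = (batches_per_epoch // grad_accum) * epochs
    high = math.ceil(batches_per_epoch / grad_accum) * epochs
    return [
        {
            "total_steps": total,
            "warmup_steps": math.ceil(total * warmup_ratio),
            "evaluations": total // eval_steps,
        }
        for total in (low, high)
    ]


# Values from the configuration above: 3,115 samples, batch size 2, accumulation 8.
for estimate in estimate_schedule(3115, 2, 8, epochs=3, eval_steps=200, warmup_ratio=0.1):
    print(estimate)
```

Under either rounding this gives only two evaluations (at steps 200 and 400), so `EarlyStoppingCallback(early_stopping_patience=3)` would appear to have too few evaluations to trigger in this run. A smaller `eval_steps` (with a matching `save_steps`) would be needed for early stopping to have an effect.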



In [49]:
# Train the improved model (comment out the train() call to skip this step)

print("Starting improved training...")
print_gpu_memory()
improved_trainer.train()
print("\n✓ Training complete!")

Starting improved training...
GPU Device: NVIDIA GeForce RTX 4050 Laptop GPU
Total Memory: 6.00 GB
Allocated: 1.47 GB
Reserved: 1.49 GB
Free: 4.51 GB


Step,Training Loss,Validation Loss
200,39.1414,27.624212
400,15.0022,6.134187


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].



✓ Training complete!
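
The run above used `lr_scheduler_type="cosine"` with `warmup_ratio=0.1`. To see what that means for the learning rate at each logged step, here is a minimal sketch of the usual linear-warmup-then-cosine-decay shape. It is an illustration written for this notebook, not the library's implementation; `cosine_with_warmup` is a hypothetical name, and the step counts (582 total, 59 warmup) are the estimates for this configuration.

```python
import math


def cosine_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps, warmup_steps, base_lr = 582, 59, 2e-5

for step in (0, 30, 59, 200, 400, 582):
    lr = cosine_with_warmup(step, total_steps, warmup_steps, base_lr)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The learning rate starts at zero, reaches `base_lr` at the end of warmup, and decays smoothly to zero at the final step, so the two evaluations in this run happen at noticeably different learning rates.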
