# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

# Table of Contents

- [ 1 - Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Required Dependencies

Now install the required packages for the LLM and datasets.



In [None]:
!pip -q install datasets transformers evaluate rouge_score peft loralib accelerate sentencepiece




Import the necessary components. Some of them are new for this week; they will be discussed later in the notebook.

In [None]:
# Imports
import os
import random
import numpy as np
import torch

from datasets import load_dataset

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

import evaluate  # for ROUGE and other metrics

# PEFT / LoRA
from peft import LoraConfig, get_peft_model, PeftModel, TaskType


<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"

from datasets import load_dataset
dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-small) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type (numeric precision) in which the model weights are loaded, which roughly halves their memory footprint compared with 32-bit floats.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import copy

model_name = "google/flan-t5-small"

original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)
train_model = copy.deepcopy(original_model)

tokenizer = AutoTokenizer.from_pretrained(model_name)


`torch_dtype` is deprecated! Use `dtype` instead!


It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for p in model.parameters():
        num = p.numel()
        all_model_params += num
        if p.requires_grad:
            trainable_model_params += num
    pct = (trainable_model_params / all_model_params) * 100 if all_model_params else 0.0
    print(f"Total parameters:     {all_model_params:,}")
    print(f"Trainable parameters: {trainable_model_params:,} ({pct:.2f}%)")
    return {
        "total_params": all_model_params,
        "trainable_params": trainable_model_params,
        "trainable_pct": pct,
    }

print(print_number_of_trainable_model_parameters(original_model))

Total parameters:     76,961,152
Trainable parameters: 76,961,152 (100.00%)
{'total_params': 76961152, 'trainable_params': 76961152, 'trainable_pct': 100.0}
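
To make the effect of the data type concrete, here is a rough back-of-the-envelope sketch of the memory taken by the weights alone. It assumes 2 bytes per parameter for `bfloat16` and 4 bytes for `float32`, and it ignores activations, gradients, and optimizer state. `weights_footprint_mib` is a hypothetical helper written for this illustration, not part of any library.

```python
# Rough estimate of weight memory only (no activations, gradients, or optimizer state).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}  # assumed storage size per parameter


def weights_footprint_mib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the model weights in MiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


total_params = 76_961_152  # parameter count reported above for flan-t5-small

for dtype in ("float32", "bfloat16"):
    print(f"{dtype:>8}: {weights_footprint_mib(total_params, dtype):.1f} MiB")
```

Under these assumptions the `bfloat16` weights need about half the memory of the `float32` weights.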


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [None]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Summarize the following dialogue:\n{dialogue}\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt").to(original_model.device)

output = tokenizer.decode(
    original_model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=4,
        early_stopping=True
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Summarize the following dialogue:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
Summary:
---------------------------------------------------------------------------

* The model captures a small detail from the conversation (mention of the CD-ROM drive) but fails to provide a full summary of the dialogue’s main idea.
* Compared to the human summary, it lacks abstraction and misses key points about system upgrades and the instructional nature of the conversation.


<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the start of the dialog and append `Summary:` to its end; the human-written summary is the expected response:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [None]:
def tokenize_function(example):
    # Build instruction-style prompt and target
    prompt = (
        "Summarize the following conversation.\n\n"
        f"{example['dialogue']}\n\n"
        "Summary:"
    )
    target = example["summary"]

    # Tokenize source (dialogue prompt) and target (summary)
    # Use truncation; padding will be handled later by a data collator
    model_inputs = tokenizer(
        prompt,
        max_length=512,
        truncation=True,
    )
    labels = tokenizer(
        text_target=target,
        max_length=128,
        truncation=True,
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# The dataset contains 3 different splits: train, validation, test.
# With batched=False, tokenize_function is applied to one example at a time across all splits.
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=False,
    remove_columns=dataset["train"].column_names,
)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

To save some time in the lab, you will subsample the dataset, keeping every 100th example of each split:

In [None]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

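Check the shapes of all three parts of the dataset. The output below shows the full split sizes, i.e. it was captured from a run without the subsampling step: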

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 3)
Validation: (500, 3)
Test: (1500, 3)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
})


The output dataset is ready for fine-tuning.
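
The tokenized examples still have different lengths, because padding is deferred to a data collator at batch time (see the comment in `tokenize_function`). The following is a simplified pure-Python sketch of what such a collator does with `label_pad_token_id=-100` and `pad_to_multiple_of=8`. The `collate` helper and the toy token ids are made up for illustration; the real `DataCollatorForSeq2Seq` returns tensors and, when given the model, also prepares decoder inputs.

```python
PAD_TOKEN_ID = 0      # pad token id of the T5 tokenizer
LABEL_PAD_ID = -100   # label positions with this id are ignored by the loss


def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_to(seq, length, pad_value):
    """Right-pad a list with pad_value up to the requested length."""
    return seq + [pad_value] * (length - len(seq))


def collate(batch, pad_to_multiple_of=8):
    """Pad inputs and labels of one batch to a common, rounded-up length."""
    in_len = round_up(max(len(ex["input_ids"]) for ex in batch), pad_to_multiple_of)
    lab_len = round_up(max(len(ex["labels"]) for ex in batch), pad_to_multiple_of)
    return {
        "input_ids": [pad_to(ex["input_ids"], in_len, PAD_TOKEN_ID) for ex in batch],
        "attention_mask": [pad_to([1] * len(ex["input_ids"]), in_len, 0) for ex in batch],
        "labels": [pad_to(ex["labels"], lab_len, LABEL_PAD_ID) for ex in batch],
    }


# Toy token ids, invented for this example
batch = [
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [41, 42, 43, 44, 45, 46, 47, 48, 49, 1], "labels": [51, 52, 53, 1]},
]
padded = collate(batch)
print(len(padded["input_ids"][0]), padded["labels"][0])
```

Padding the labels with `-100` rather than the pad token id keeps the padded positions out of the cross-entropy loss.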

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Seq2SeqTrainer` class, a subclass of `Trainer` (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset together with a copy of the original model (`train_model`). Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [None]:
import time, os, torch, transformers
from packaging import version
from transformers import (
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
tfv = version.parse(transformers.__version__)
use_eval_key = "eval_strategy" if tfv >= version.parse("4.46.0") else "evaluation_strategy"

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

# Build kwargs with the right evaluation key for this transformers version
args_kwargs = {
    "output_dir": output_dir,
    use_eval_key: "epoch",          # 'evaluation_strategy' or 'eval_strategy' depending on version
    "save_strategy": "epoch",
    "logging_steps": 25,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 10,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "bf16": use_bf16,
    "fp16": torch.cuda.is_available() and not use_bf16,  # fp16 mixed precision needs a GPU
    "report_to": ["wandb"],         # requires the wandb package; use "none" to disable logging
    "predict_with_generate": True,  # works with Seq2SeqTrainingArguments
    "generation_max_length": 128,
}

training_args = Seq2SeqTrainingArguments(**args_kwargs)

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=train_model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,
)

trainer = Seq2SeqTrainer(
    model=train_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)




Start training process...



Because the training arguments set `report_to` to `wandb`, `trainer.train()` uses the Weights & Biases (wandb) library to track and visualize the training process. To proceed, you'll need to sign up for a wandb account (for example, with your Gmail address) and then enter your unique API key to authenticate and enable logging of the training progress.

In [None]:
import wandb

# Authenticate W&B (you'll be prompted for your API key the first time)
wandb.login()

# Optional: name the run and set a project
wandb.init(project="dialogsum-flan-t5", name="full-ft-flan-t5-small-bf16-seq2seq", reinit=True)

# Train
train_result = trainer.train()

# Save checkpoint locally
trainer.save_model(output_dir)          # saves model + adapter (if any)
tokenizer.save_pretrained(output_dir)   # save tokenizer for completeness

# (Optional) also save to Google Drive path for persistence
drive_ckpt_dir = "/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint"
os.makedirs(drive_ckpt_dir, exist_ok=True)
trainer.save_model(drive_ckpt_dir)
tokenizer.save_pretrained(drive_ckpt_dir)




Metric,History
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇███
train/grad_norm,█▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▂▂▁▂▁▁▁▁▁▂▁
train/learning_rate,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████████
train/loss,████▇▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁

Metric,Value
eval/loss,1.33097
eval/runtime,2.3148
eval/samples_per_second,216.002
eval/steps_per_second,27.216
train/epoch,2.31065
train/global_step,1800.0
train/grad_norm,1.77344
train/learning_rate,0.0001
train/loss,1.4793


Epoch,Training Loss,Validation Loss
1,1.4849,1.304828
2,1.4759,1.289082
3,1.4448,1.283798
4,1.4548,1.276436
5,1.4439,1.275889
6,1.4404,1.274485
7,1.4709,1.27263
8,1.4375,1.273003
9,1.4213,1.272748
10,1.3964,1.272845
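
The step counts in the logs above can be related to the batch settings. The sketch below assumes a single device, the unfiltered training split of 12,460 examples, a per-device batch size of 8, and 2 gradient-accumulation steps. `optimizer_steps_per_epoch` is a hypothetical helper, and the exact rounding of the last partial accumulation step may differ between `transformers` versions.

```python
import math


def optimizer_steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps):
    """Number of optimizer updates per epoch on one device (approximate rounding)."""
    batches = math.ceil(num_examples / per_device_batch_size)
    return max(batches // grad_accum_steps, 1)


steps_per_epoch = optimizer_steps_per_epoch(12_460, 8, 2)
total_steps = steps_per_epoch * 10  # num_train_epochs = 10

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
# Fractional epoch reached after 1800 optimizer steps
print("epoch at step 1800:", round(1800 / steps_per_epoch, 5))
```

With these assumptions an epoch takes 779 optimizer steps, which is consistent with the logged `train/epoch` of about 2.31 at `train/global_step` 1800.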


('/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint/tokenizer_config.json',
 '/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint/special_tokens_map.json',
 '/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint/spiece.model',
 '/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint/added_tokens.json',
 '/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint/tokenizer.json')



Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [None]:
# Mount Google Drive so that the MountDrive/flan-diaglogue-summary-checkpoint folder is accessible.
# Uncomment the two lines below when running in Google Colab.

# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# Load the tokenizer and the fine-tuned model from the saved checkpoint
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Folder that holds the checkpoint files (config.json, weights, tokenizer)
model_path = "/content/drive/MyDrive/MountDrive/flan-diaglogue-summary-checkpoint"

# Use the default T5 tokenizer
tokenizer = T5Tokenizer.from_pretrained(model_path)

# device_map="auto" lets the library place the weights on the available device(s),
# which addresses the multi-GPU loading issue
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Move model to GPU if available (optional, as device_map="auto" should handle it)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
instruct_model.to(device)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model produces output grounded in the dialogue, whereas the original model failed to understand what was being asked of it.

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
# Build the inputs explicitly for each model to avoid tensor/device mismatches.
# Put models in eval mode
original_model.eval()
instruct_model.eval()

# Build inputs on each model's actual device
orig_device = next(original_model.parameters()).device
inst_device = next(instruct_model.parameters()).device

orig_inputs = tokenizer(prompt, return_tensors="pt").to(orig_device)
inst_inputs = tokenizer(prompt, return_tensors="pt").to(inst_device)

# Generate
with torch.no_grad():
    original_model_outputs = original_model.generate(
        **orig_inputs,
        max_new_tokens=128,
        num_beams=4,
        early_stopping=True,
    )
    instruct_model_outputs = instruct_model.generate(
        **inst_inputs,
        max_new_tokens=128,
        num_beams=4,
        early_stopping=True,
    )

original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Is there anything else I can help you with?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person2# considers adding a painting program to #Person2#'s software. #Person1# suggests adding a painting program to #Person2#'s software. #Person2# suggests adding a CD-ROM drive to #Person2#'s software.


* Relevance: improved from a generic, off-task utterance (“Is there anything else…”) to dialogue-specific content (mentions CD-ROM and speaker turns), showing the fine-tuned model is now attending to the actual conversation.
* Coherence: shifts from a single vague sentence to a snippet that reflects turn-taking, but the output remains fragmentary and not yet a fluent, unified summary.
* Coverage/abstraction: still misses the main instructional theme (software and hardware upgrades) captured by the human reference; it extracts a salient detail but doesn’t abstract or generalize to the overarching gist.


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
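
To give an intuition for what ROUGE measures, here is a minimal sketch of ROUGE-1 computed as clipped unigram overlap between a prediction and a reference. It assumes plain lowercasing and whitespace tokenization, so its numbers will not match the `rouge` metric loaded through `evaluate` exactly, which uses its own tokenizer and also reports ROUGE-2 and ROUGE-L. The `rouge1` function is a hypothetical helper for illustration.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1: precision, recall, and F1 over overlapping unigrams."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection clips each word's count to the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy example every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5), which is why a short, accurate summary can still score well below 1.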

In [None]:
rouge = evaluate.load("rouge")

Downloading builder script: 0.00B [00:00, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

# Ensure eval mode
original_model.eval()
instruct_model.eval()

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    # Build inputs on each model's own device to avoid device mismatch
    orig_inputs = tokenizer(prompt, return_tensors="pt").to(next(original_model.parameters()).device)
    inst_inputs = tokenizer(prompt, return_tensors="pt").to(next(instruct_model.parameters()).device)

    with torch.no_grad():
        original_model_outputs = original_model.generate(
            **orig_inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    with torch.no_grad():
        instruct_model_outputs = instruct_model.generate(
            **inst_inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

import pandas as pd
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df


Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,"Ms. Dawson, Attached is a draft memo to all em...",Ms. Dawson asks Ms. Dawson to take a dictation...
1,In order to prevent employees from wasting tim...,"Ms. Dawson, Attached is a draft memo to all em...",Ms. Dawson asks Ms. Dawson to take a dictation...
2,Ms. Dawson takes a dictation for #Person1# abo...,"Ms. Dawson, Attached is a draft memo to all em...",Ms. Dawson asks Ms. Dawson to take a dictation...
3,#Person2# arrives late because of traffic jam....,Talk to a friend.,#Person2# got stuck in traffic again because t...
4,#Person2# decides to follow #Person1#'s sugges...,Talk to a friend.,#Person2# got stuck in traffic again because t...
5,#Person2# complains to #Person1# about the tra...,Talk to a friend.,#Person2# got stuck in traffic again because t...
6,#Person1# tells Kate that Masha and Hero get d...,"Kate, you're right. They are getting divorced.",Kate tells #Person1# that Masha and Hero are g...
7,#Person1# tells Kate that Masha and Hero are g...,"Kate, you're right. They are getting divorced.",Kate tells #Person1# that Masha and Hero are g...
8,#Person1# and Kate talk about the divorce betw...,"Kate, you're right. They are getting divorced.",Kate tells #Person1# that Masha and Hero are g...
9,#Person1# and Brian are at the birthday party ...,"Brian, thank you for inviting me to the party.",Brian invites #Person1# to have a dance with #...


Evaluate the models computing ROUGE metrics. Notice the improvement in the results!

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)


ORIGINAL MODEL:
{'rouge1': np.float64(0.19539205828561151), 'rouge2': np.float64(0.058424390424390425), 'rougeL': np.float64(0.18498835301683875), 'rougeLsum': np.float64(0.1850460696787533)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.3540038759521339), 'rouge2': np.float64(0.11536574829817323), 'rougeL': np.float64(0.26206880377275144), 'rougeLsum': np.float64(0.2626508394503257)}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [13]:
# --- 2.4: Compute ROUGE for Original vs Instruct from the CSV ---
import pandas as pd

# Load the pre-populated model outputs
results = pd.read_csv("data/dialogue-summary-training-results.csv")

# These columns should exist in dialogue-summary-training-results.csv
required_cols = ["human_baseline_summaries", "original_model_summaries", "instruct_model_summaries"]
missing = [c for c in required_cols if c not in results.columns]
if missing:
    raise KeyError(f"Missing required columns for 2.4: {missing}")

human_baseline_summaries = results["human_baseline_summaries"].astype(str).tolist()
original_model_summaries = results["original_model_summaries"].astype(str).tolist()
instruct_model_summaries = results["instruct_model_summaries"].astype(str).tolist()

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
)
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)


ORIGINAL MODEL:
{'rouge1': np.float64(0.22165605484675166), 'rouge2': np.float64(0.07072656558348324), 'rougeL': np.float64(0.19244657776840873), 'rougeLsum': np.float64(0.19235273036456957)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.4043243894173554), 'rouge2': np.float64(0.17053687348320012), 'rougeL': np.float64(0.3266152074751759), 'rougeLsum': np.float64(0.3266155064476724)}


The results show substantial improvement in all ROUGE metrics:

In [14]:

print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")
improvement = (
    np.array(list(instruct_model_results.values()))
    - np.array(list(original_model_results.values()))
)
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f"{key}: {value*100:.2f}%")


Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.27%
rouge2: 9.98%
rougeL: 13.42%
rougeLsum: 13.43%


* Full fine-tuning substantially outperforms zero-shot: rouge1 ↑ ~18.3 pp, rouge2 ↑ ~10.0 pp, rougeL/Lsum ↑ ~13.4 pp, showing stronger content coverage and structure alignment.
* The large rouge2 gain indicates improved multi-token phrasing/coherence rather than just unigram overlap.
* Absolute scores (rouge1 ≈ 0.40, rougeL ≈ 0.33) are solid for a small model, with room to grow via more data/epochs, consistent decoding constraints, and longer source lengths.
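
The gains listed above are differences in percentage points, which is not the same as a relative improvement. The sketch below recomputes both from the ROUGE scores printed earlier in this section; the `gains` helper is hypothetical and only meant to make the distinction explicit.

```python
# ROUGE scores copied from the CSV-based evaluation output above
original = {
    "rouge1": 0.22165605484675166,
    "rouge2": 0.07072656558348324,
    "rougeL": 0.19244657776840873,
}
instruct = {
    "rouge1": 0.4043243894173554,
    "rouge2": 0.17053687348320012,
    "rougeL": 0.3266152074751759,
}


def gains(before: float, after: float) -> tuple:
    """Return (gain in percentage points, relative gain in percent)."""
    points = (after - before) * 100
    relative = (after - before) / before * 100
    return points, relative


for key in original:
    points, relative = gains(original[key], instruct[key])
    print(f"{key}: +{points:.2f} pp, +{relative:.1f}% relative")
```

Reporting both views is useful because a gain of roughly 10 percentage points on a low baseline such as ROUGE-2 corresponds to a much larger relative change than the same gain on ROUGE-1 would.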

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). When someone says PEFT, they typically mean LoRA. At a very high level, LoRA allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged and a newly trained “LoRA adapter” emerges. This LoRA adapter is much smaller than the original LLM, on the order of a single-digit % of the original LLM size (MBs vs GBs).

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
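
As a toy illustration of how an adapter is combined with the frozen weights, the sketch below applies the standard LoRA update `W + (lora_alpha / r) * B @ A` to a tiny 2x3 matrix in pure Python. All numbers are invented, `matvec` and `matmul` are hypothetical helpers, and real adapters are applied per target layer by the PEFT library rather than by hand.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrix P by matrix Q (both lists of rows)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, shape (d_out=2, d_in=3)
A = [[1.0, 2.0, 3.0]]                    # trainable, shape (r=1, d_in=3)
B = [[2.0], [-1.0]]                      # trainable, shape (d_out=2, r=1)
lora_alpha, r = 2, 1
scale = lora_alpha / r                   # scaling applied to the low-rank update

x = [1.0, 1.0, 1.0]

# Option 1: keep the adapter separate and add its output to the frozen layer's output
base = matvec(W, x)
update = matvec(B, matvec(A, x))
y_adapter = [b + scale * u for b, u in zip(base, update)]

# Option 2: merge the adapter into the weight once, then use a single matrix
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

print(y_adapter, y_merged)
```

Both options give the same output, which is why one frozen base model can serve many tasks by swapping small `A` and `B` matrices.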

<a name='3.1'></a>
### 3.1 - Set up the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"],  # common for T5 attention
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [None]:
import copy
peft_model = get_peft_model(copy.deepcopy(original_model), lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

Total parameters:     77,305,216
Trainable parameters: 344,064 (0.45%)
{'total_params': 77305216, 'trainable_params': 344064, 'trainable_pct': 0.445072166928555}


Note that the share of trainable parameters drops from 100% to 0.45%.
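
The trainable-parameter count can be reproduced by hand. The sketch below assumes the FLAN-T5 small architecture: 8 encoder and 8 decoder blocks (taken from the model configuration, not shown in this notebook), with `q` and `v` projections of shape 512 to 384 as in the model printout earlier. Each targeted projection gets a matrix `A` of shape `r x 512` and a matrix `B` of shape `384 x r`.

```python
d_model = 512        # input features of the q/v projections
d_attn = 384         # output features of the q/v projections
r = 8                # LoRA rank from lora_config

encoder_blocks = 8   # assumption: num_layers in the flan-t5-small config
decoder_blocks = 8   # assumption: num_decoder_layers in the flan-t5-small config

# Parameters added per targeted Linear layer: A (r x d_model) plus B (d_attn x r)
params_per_module = r * d_model + d_attn * r

# Encoder blocks have self-attention (q, v); decoder blocks add cross-attention (q, v)
targeted_modules = encoder_blocks * 2 + decoder_blocks * 2 * 2

lora_params = params_per_module * targeted_modules
base_params = 76_961_152
total_params = base_params + lora_params

print(f"LoRA parameters:  {lora_params:,}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable share:  {100 * lora_params / total_params:.2f}%")
```

Under these assumptions the arithmetic gives 344,064 adapter parameters out of 77,305,216 in total, matching the counts printed above.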

<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [None]:

from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=20
)

data_collator_peft = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=peft_model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator_peft,
)



In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch, time

output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    # use a fixed, reasonably large batch to reduce steps/epoch
    per_device_train_batch_size=16,    # try 16–32 on T5-small
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch 32
    num_train_epochs=10,
    learning_rate=1e-3,
    weight_decay=0.01,
    warmup_ratio=0.1,

    # A100 speedups
    bf16=use_bf16,                     # critical for speed on A100
    fp16=torch.cuda.is_available() and not use_bf16,  # fall back to fp16 only on GPUs without bf16
    dataloader_pin_memory=True,

    # keep overhead low
    logging_steps=100,
    save_steps=0,                      # don’t save mid-epoch
    # (leave eval during training off by not setting evaluation_strategy and/or not calling trainer.evaluate())
)

data_collator_peft = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=peft_model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,              # tensor core friendly
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator_peft,
)
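
For reference, the batch settings in the cell above combine as sketched below. This is a rough illustration: `num_devices=1` and the example count of 1,000 are assumptions chosen for the arithmetic, the helper names are hypothetical, and the `Trainer`'s exact step rounding may differ slightly.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def approx_optimizer_steps(num_examples, per_device_batch, grad_accum_steps,
                           epochs, num_devices=1):
    """Approximate total optimizer updates, rounding up within each epoch."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / batch) * epochs

# Batch values from the TrainingArguments above; 1_000 examples is a hypothetical dataset size
print(effective_batch_size(16, 2))
print(approx_optimizer_steps(1_000, 16, 2, epochs=10))
```

A larger effective batch means fewer optimizer updates per epoch, which is why the cell fixes the batch size rather than relying on `auto_find_batch_size`.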




Now everything is ready to train the PEFT adapter and save the model.



In [None]:
peft_trainer.train()

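peft_model_path="./peft-dialogue-summary-checkpoint-local"

# Save the trained adapter weights along with the tokenizer
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)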

| Step | Training Loss |
|---|---|
| 100 | 1.4483 |
| 200 | 1.409 |
| 300 | 1.4042 |
| 400 | 1.4168 |
| 500 | 1.4152 |
| 600 | 1.3917 |
| 700 | 1.3997 |
| 800 | 1.3863 |
| 900 | 1.3916 |
| 1000 | 1.3786 |


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')



The training above was performed on a subset of the data. To work with a fully trained PEFT model, load a saved PEFT checkpoint instead, for example one stored on Google Drive.

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Update the PEFT model path
peft_model_path = '/content/peft-dialogue-summary-training-1762896987/checkpoint-7800'

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    peft_model_path,
    is_trainable=False
)

# Move the entire peft_model to the device
peft_model = peft_model.to(device)


The number of trainable parameters will be `0` because of the `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

Total parameters:     77,305,216
Trainable parameters: 0 (0.00%)
{'total_params': 77305216, 'trainable_params': 0, 'trainable_pct': 0.0}


In [8]:
# Install missing dependencies for ROUGE
!pip -q install rouge_score evaluate




In [10]:
# Compute 10-sample ROUGE for the pretrained PEFT model from Drive (self-contained)

import os, json, torch
from datasets import load_dataset
import evaluate

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# --- Load dataset ---
dataset = load_dataset("knkarthick/dialogsum")

# --- Ensure tokenizer + PEFT model are available; if not, load from Drive ---
peft_dir = "/content/drive/MyDrive/peft-dialogue-summary-checkpoint-from-s3"

def load_peft_from_drive(peft_path):
    # Read adapter config to detect base model (fallback to flan-t5-base for width 768)
    cfg_path = os.path.join(peft_path, "adapter_config.json")
    with open(cfg_path, "r") as f:
        adapter_cfg = json.load(f)
    base_name = adapter_cfg.get("base_model_name_or_path", "google/flan-t5-base")
    tok = AutoTokenizer.from_pretrained(base_name)
    use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    base = AutoModelForSeq2SeqLM.from_pretrained(
        base_name,
        torch_dtype=torch.bfloat16 if use_bf16 else None,
    )
    peft = PeftModel.from_pretrained(base, peft_path, is_trainable=False)
    return tok, peft

try:
    tokenizer
    peft_loaded_model
except NameError:
    tokenizer, peft_loaded_model = load_peft_from_drive(peft_dir)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
peft_loaded_model = peft_loaded_model.to(device).eval()

# --- ROUGE on first 10 test samples ---
rouge = evaluate.load("rouge")
preds, refs = [], []

for i in range(10):
    dialogue = dataset["test"][i]["dialogue"]
    ref = dataset["test"][i]["summary"]
    prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt").to(next(peft_loaded_model.parameters()).device)
    with torch.no_grad():
        out = peft_loaded_model.generate(
            **inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    preds.append(tokenizer.decode(out[0], skip_special_tokens=True))
    refs.append(ref)

peft_rouge = rouge.compute(predictions=preds, references=refs)
print("Pretrained PEFT model (10-sample) ROUGE:")
print(peft_rouge)


Pretrained PEFT model (10-sample) ROUGE:
{'rouge1': np.float64(0.31680536488045763), 'rouge2': np.float64(0.086154358695533), 'rougeL': np.float64(0.24079210604520665), 'rougeLsum': np.float64(0.24079247697704376)}


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Run inference on the same example as in sections [1.3](#1.3) and [2.3](#2.3) with the original, fully fine-tuned, and PEFT models.

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"

# Put the models in evaluation mode
original_model.eval()
instruct_model.eval()
peft_model.eval()

# Build inputs on each model's device to avoid device mismatches
orig_inputs = tokenizer(prompt, return_tensors="pt").to(next(original_model.parameters()).device)
inst_inputs = tokenizer(prompt, return_tensors="pt").to(next(instruct_model.parameters()).device)
peft_inputs = tokenizer(prompt, return_tensors="pt").to(next(peft_model.parameters()).device)

with torch.no_grad():
    original_model_outputs = original_model.generate(
        **orig_inputs,
        max_new_tokens=128,
        num_beams=4,
        early_stopping=True,
    )
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

with torch.no_grad():
    instruct_model_outputs = instruct_model.generate(
        **inst_inputs,
        max_new_tokens=128,
        num_beams=4,
        early_stopping=True,
    )
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

with torch.no_grad():
    peft_model_outputs = peft_model.generate(
        **peft_inputs,
        max_new_tokens=128,
        num_beams=4,
        early_stopping=True,
    )
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: \n{peft_model_text_output}')


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Is there anything else I can help you with?
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person2# considers adding a painting program to #Person2#'s software. #Person1# suggests adding a painting program to #Person2#'s software. #Person2# suggests adding a CD-ROM drive to #Person2#'s software.
---------------------------------------------------------------------------------------------------
PEFT MODEL: 
#Person1# suggests adding a painting program to #Person2#'s software. #Person2# thinks it would be a bonus. #Person1# suggests adding a CD-ROM drive and adding a CD-ROM drive.


* Relevance and coverage: both trained models stay on topic and extract salient details (painting program, CD-ROM drive), whereas the zero-shot output is off-task. Neither trained model captures the overarching theme (“teaches how to upgrade software and hardware”) as cleanly as the human summary.
* Coherence and fluency: the fully fine-tuned model is more fluent than zero-shot but is redundant and confuses roles: it attributes the painting-program idea to both speakers and describes a CD-ROM drive as an addition to “software”. The PEFT model is comparably readable but repeats a phrase (“adding a CD-ROM drive”) and lacks a single, unified summary sentence.
* Faithfulness and abstraction: both trained models are mostly faithful to the dialogue events but over-focus on specific add-ons and under-abstract the instructional aspect. Full fine-tuning edges out PEFT slightly on variety, while PEFT is close with minor repetition artifacts. This suggests trying decoding constraints (`no_repeat_ngram_size`, minimum/maximum token limits) and possibly more training or prompt refinement.
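
The repetition noted in the bullets above can be checked mechanically. The sketch below counts repeated word n-grams in the PEFT summary; the helper name `repeated_ngrams` is hypothetical, and it assumes simple lowercasing and whitespace tokenization rather than the model's tokenizer.

```python
from collections import Counter

def repeated_ngrams(text, n=3):
    """Return the word n-grams (as tuples) that occur more than once in text."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    return {gram: count for gram, count in counts.items() if count > 1}

# PEFT model output copied from the comparison above
peft_summary = (
    "#Person1# suggests adding a painting program to #Person2#'s software. "
    "#Person2# thinks it would be a bonus. "
    "#Person1# suggests adding a CD-ROM drive and adding a CD-ROM drive."
)

repeats = repeated_ngrams(peft_summary, n=3)
for gram, count in sorted(repeats.items()):
    print(" ".join(gram), count)
```

A decoding option such as `no_repeat_ngram_size=3` in `generate` blocks repeats of this kind at generation time, at the cost of occasionally suppressing legitimate repetitions.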


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

# Ensure eval mode
original_model.eval()
instruct_model.eval()
peft_model.eval()

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    # Build per-model inputs to avoid device mismatch
    orig_inputs = tokenizer(prompt, return_tensors="pt").to(next(original_model.parameters()).device)
    inst_inputs = tokenizer(prompt, return_tensors="pt").to(next(instruct_model.parameters()).device)
    peft_inputs = tokenizer(prompt, return_tensors="pt").to(next(peft_model.parameters()).device)

    with torch.no_grad():
        original_model_outputs = original_model.generate(
            **orig_inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    with torch.no_grad():
        instruct_model_outputs = instruct_model.generate(
            **inst_inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    with torch.no_grad():
        peft_model_outputs = peft_model.generate(
            **peft_inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

import pandas as pd
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df


| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Thank you, sir. | Ms. Dawson asks Ms. Dawson to take a dictation... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | Thank you, sir. | Ms. Dawson asks Ms. Dawson to take a dictation... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Thank you, sir. | Ms. Dawson asks Ms. Dawson to take a dictation... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | Talk to a friend. | #Person2# got stuck in traffic again because t... | #Person2# got stuck in traffic again because t... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | Talk to a friend. | #Person2# got stuck in traffic again because t... | #Person2# got stuck in traffic again because t... |
| 5 | #Person2# complains to #Person1# about the tra... | Talk to a friend. | #Person2# got stuck in traffic again because t... | #Person2# got stuck in traffic again because t... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Kate, you never believe what's happened. | Kate tells #Person1# that Masha and Hero are g... | Kate tells #Person1# Masha and Hero are gettin... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Kate, you never believe what's happened. | Kate tells #Person1# that Masha and Hero are g... | Kate tells #Person1# Masha and Hero are gettin... |
| 8 | #Person1# and Kate talk about the divorce betw... | Kate, you never believe what's happened. | Kate tells #Person1# that Masha and Hero are g... | Kate tells #Person1# Masha and Hero are gettin... |
| 9 | #Person1# and Brian are at the birthday party ... | Brian, thank you for inviting me to the party. | Brian invites #Person1# to have a dance with #... | #Person1# invites Brian to celebrate his birth... |


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)


ORIGINAL MODEL:
{'rouge1': np.float64(0.06935286935286936), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.06984348180000355), 'rougeLsum': np.float64(0.07148962148962149)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.35452568854973643), 'rouge2': np.float64(0.1144133659226495), 'rougeL': np.float64(0.2641151822349985), 'rougeLsum': np.float64(0.2639443107406383)}
PEFT MODEL:
{'rouge1': np.float64(0.3681963969164479), 'rouge2': np.float64(0.1201019958232719), 'rougeL': np.float64(0.28318007623291863), 'rougeLsum': np.float64(0.2844835147857303)}


Notice that on this 10-sample subset the PEFT model scores on par with the fully fine-tuned model (ROUGE-1 of about 0.37 versus 0.35), even though its training process was much lighter!
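
To make the scores above easier to interpret, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It assumes lowercasing and whitespace tokenization and omits the normalization that the `rouge` metric from `evaluate` can apply, so its numbers will not match the library exactly. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over overlapping unigrams (clipped counts)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy example: every predicted word appears in the reference,
# but the prediction covers only half of the reference words
score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(f"{score:.3f}")
```

Short, off-task outputs such as "Thank you, sir." share almost no unigrams with the reference summaries, which is why the original model scores so low on this sample.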

You already computed the ROUGE scores on the full dataset after loading the results from the `data/dialogue-summary-training-results.csv` file. Now load the values for the PEFT model and compare its performance with the other models.

In [17]:
# Require the PEFT column as well for this section
required_cols_34 = ["human_baseline_summaries", "original_model_summaries", "instruct_model_summaries", "peft_model_summaries"]
missing_34 = [c for c in required_cols_34 if c not in results.columns]

if missing_34:
    raise KeyError(
        "Missing required columns for 3.4: "
        + ", ".join(missing_34)
        + "\nMake sure your CSV includes 'peft_model_summaries'."
    )

human_baseline_summaries = results["human_baseline_summaries"].astype(str).tolist()
original_model_summaries = results["original_model_summaries"].astype(str).tolist()
instruct_model_summaries = results["instruct_model_summaries"].astype(str).tolist()
peft_model_summaries     = results["peft_model_summaries"].astype(str).tolist()

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
)
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries,
)
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
)

print("ORIGINAL MODEL:")
print(original_model_results)
print("INSTRUCT MODEL:")
print(instruct_model_results)
print("PEFT MODEL:")
print(peft_model_results)


ORIGINAL MODEL:
{'rouge1': np.float64(0.22165605484675166), 'rouge2': np.float64(0.07072656558348324), 'rougeL': np.float64(0.19244657776840873), 'rougeLsum': np.float64(0.19235273036456957)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.4043243894173554), 'rouge2': np.float64(0.17053687348320012), 'rougeL': np.float64(0.3266152074751759), 'rougeLsum': np.float64(0.3266155064476724)}
PEFT MODEL:
{'rouge1': np.float64(0.3911477954191801), 'rouge2': np.float64(0.15464524150189646), 'rougeL': np.float64(0.3136648257448643), 'rougeLsum': np.float64(0.3136251720757611)}


On the full dataset, PEFT improves on the original model slightly less than full fine-tuning does, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [19]:
print("\nAbsolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
imp_peft_vs_orig = (
    np.array(list(peft_model_results.values()))
    - np.array(list(original_model_results.values()))
)
for key, value in zip(peft_model_results.keys(), imp_peft_vs_orig):
    print(f"{key}: {value*100:.2f}%")


Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 16.95%
rouge2: 8.39%
rougeL: 12.12%
rougeLsum: 12.13%


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [20]:
print("\nAbsolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")
imp_peft_vs_instr = (
    np.array(list(peft_model_results.values()))
    - np.array(list(instruct_model_results.values()))
)
for key, value in zip(peft_model_results.keys(), imp_peft_vs_instr):
    print(f"{key}: {value*100:.2f}%")


Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.32%
rouge2: -1.59%
rougeL: -1.30%
rougeLsum: -1.30%


* PEFT beats zero-shot convincingly: rouge1 ↑ ~17.0 pp, rouge2 ↑ ~8.4 pp, rougeL/Lsum ↑ ~12.1 pp, so the adapters learn the summarization task effectively.
* Versus full fine-tuning, PEFT is slightly behind (≈1.3–1.6 pp across metrics), a small gap that can vary with seeds and decoding settings.
* Given PEFT’s tiny trainable footprint and small adapter checkpoints, its near-parity offers a strong efficiency–performance trade-off for rapid iteration and multi-tenant deployment.
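
The figures printed by the cells above are absolute differences in ROUGE score, that is, percentage points (pp), not relative gains. The sketch below shows both views for ROUGE-1, using the full-dataset values copied from the output; the variable names are illustrative.

```python
# ROUGE-1 values copied from the full-dataset output above
original_rouge1 = 0.22165605484675166
peft_rouge1 = 0.3911477954191801

# Absolute difference in percentage points (what the cells above print)
abs_pp = (peft_rouge1 - original_rouge1) * 100

# Relative improvement: the difference as a fraction of the original score
rel_pct = (peft_rouge1 - original_rouge1) / original_rouge1 * 100

print(f"absolute: {abs_pp:.2f} pp, relative: {rel_pct:.1f}%")
```

Keeping the two apart matters when comparing against other reports, since a gain of a few percentage points on a low baseline corresponds to a much larger relative change.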


* Quality takeaways: both trained models improve n-gram overlap (R1/R2) and longest-common-subsequence alignment (RL/Lsum) versus zero-shot. PEFT’s slight lead on the 10-sample subset does not carry over to the full results, where it trails full fine-tuning by about 1.3–1.6 pp, so LoRA captured most of the task-specific gain with far fewer trainable weights.
* Compute/efficiency trade-off: PEFT trained ~0.45% of the weights (adapters only), yielding tiny checkpoints and lower memory needs. Wall-clock time per step remains similar (the full forward pass still happens), but training memory and storage are far lower, and adapters are stackable/reusable.
* When to prefer each: choose PEFT for rapid iteration, multi-tenant adapters, limited VRAM, or deployment with shared base models; choose full fine-tuning if you control the whole model and need maximum headroom on very large domain shifts, accepting higher storage and maintenance costs.
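
As a rough illustration of the storage point above, the sketch below estimates checkpoint sizes from the parameter counts printed earlier. It assumes 4 bytes per parameter (fp32) and ignores file-format overhead, so actual checkpoint sizes will differ; the helper `size_mib` is hypothetical.

```python
BYTES_PER_PARAM_FP32 = 4  # assumption: weights stored as 32-bit floats

total_params = 77_305_216   # PEFT model total (base + adapters), from the earlier output
adapter_params = 344_064    # trainable LoRA parameters, from the earlier output
base_params = total_params - adapter_params

def size_mib(num_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Approximate raw weight size in MiB, excluding any file overhead."""
    return num_params * bytes_per_param / 2**20

adapter_mib = size_mib(adapter_params)
base_mib = size_mib(base_params)
print(f"adapter: {adapter_mib:.2f} MiB, base model: {base_mib:.2f} MiB")
```

Under these assumptions, each additional task costs only an adapter-sized file on top of one shared copy of the base model.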
