## Parameter-Efficient Fine-Tuning (PEFT) for FLAN-T5

### **Overview**
In this notebook, we explore **PEFT (Parameter-Efficient Fine-Tuning)** methods to fine-tune **FLAN-T5** with significantly fewer trainable parameters.

### **Why PEFT?**
- **Full fine-tuning is computationally expensive** (high memory, long training time).
- PEFT methods like **LoRA (Low-Rank Adaptation)** freeze the pretrained weights and train only a small set of added parameters.
- This results in **faster training, lower memory usage, and efficient deployment**.

### **What We Will Do:**
1. Implement **LoRA** for efficient fine-tuning.
2. Compare PEFT performance against:
   - **Full Fine-Tuning** ([`FLAN-T5_Full_FineTuning.ipynb`](./FLAN-T5_Full_FineTuning.ipynb))
   - **Zero/Multi-Shot Inference** ([`FLAN-T5_Zero_MultiShot_Inference.ipynb`](./FLAN-T5_Zero_MultiShot_Inference.ipynb))
3. Evaluate training time, model performance, and resource savings.

This approach enables fine-tuning on **low-resource hardware** while retaining much of the quality gain of full fine-tuning.


In [None]:
# python- 3.8.10 
# !pip install --upgrade pip
# !pip install transformers
# !pip install datasets --quiet
# !pip install torchdata
# !pip install torch
# !pip install streamlit
# !pip install openai
# !pip install langchain
# !pip install unstructured
# !pip install sentence-transformers
# !pip install chromadb
# !pip install evaluate==0.4.0
# !pip install rouge_score==0.1.2
# !pip install loralib==0.1.1
# !pip install peft==0.3.0

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import torch
import evaluate
import time
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          GenerationConfig, TrainingArguments, Trainer)


In [33]:
DEVICE="cuda" if torch.cuda.is_available() else "cpu"
torch_device = torch.device(DEVICE)

## Load Dataset and LLM

In [34]:
hugging_face_dataset_name = "knkarthick/dialogsum"

In [35]:
dataset = load_dataset(hugging_face_dataset_name)

In [36]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [37]:
def number_of_trainable_model_parameters(model):
    """Return a text summary of trainable vs. total parameter counts for `model`."""
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        # Only parameters with requires_grad=True are updated during training
        if param.requires_grad:
            trainable_model_params += param.numel()
    result = f"trainable model parameters: {trainable_model_params}\n"
    result += f"all model parameters: {all_model_params}\n"
    result += f"Percentage of model params: {(trainable_model_params/all_model_params)*100}"
    return result

## Test the Model with Zero Shot Inferencing

### Preprocess the Dialog-Summary dataset

Convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM: prepend `Summarize the following conversation.` to each dialogue and append `Summary: ` so that the model is asked to continue the prompt with a summary. The reference summary is tokenized separately and used as the label.
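
As a small illustration of the prompt template described above, the sketch below builds the instruction prompt for one dialogue without involving the tokenizer. The helper name `build_prompt` and the two-line example dialogue are hypothetical and only serve to show the string layout.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the summarization instruction template."""
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    return start_prompt + dialogue + end_prompt


# Made-up dialogue in the DialogSum speaker-tag style, for illustration only
example_dialogue = "#Person1#: Are you free tonight?\n#Person2#: Yes, let's have dinner."
prompt = build_prompt(example_dialogue)
print(prompt)
```

The model input is this full string, while the label is the plain summary text with no template around it.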

In [40]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

In [41]:
# The dataset contains 3 different splits: train, validation, test
# tokenize_function processes the data across all splits in batches

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map: 100%|██████████| 12460/12460 [00:07<00:00, 1675.51 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 1713.19 examples/s]
Map: 100%|██████████| 1500/1500 [00:00<00:00, 1795.14 examples/s]


To save some time, we will subsample the dataset:

In [42]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter: 100%|██████████| 12460/12460 [00:07<00:00, 1566.54 examples/s]
Filter: 100%|██████████| 500/500 [00:00<00:00, 1348.14 examples/s]
Filter: 100%|██████████| 1500/1500 [00:00<00:00, 1512.20 examples/s]
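
The filter above keeps only the examples whose index is a multiple of 100. As a quick sanity check, the expected row counts can be worked out with plain arithmetic from the split sizes shown in the progress bars; `kept_rows` is a hypothetical helper, not part of the `datasets` library.

```python
# Split sizes as reported by the Map/Filter progress bars
split_sizes = {"train": 12460, "validation": 500, "test": 1500}


def kept_rows(num_rows: int, step: int = 100) -> int:
    """Count the indices in range(num_rows) that satisfy index % step == 0."""
    return sum(1 for index in range(num_rows) if index % step == 0)


subsampled = {name: kept_rows(size) for name, size in split_sizes.items()}
print(subsampled)  # {'train': 125, 'validation': 5, 'test': 15}
```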


In [43]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})


### Fine-Tune the model with the Preprocessed Dataset

Next, fine-tune the model with the built-in Hugging Face `Trainer` class.

In [44]:
output_dir = f"./dialogue-summary-training-{str(int(time.time()))}"

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # overrides num_train_epochs: training stops after one optimizer step
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)
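
To put `max_steps=1` in perspective, the sketch below estimates how many optimizer steps a full epoch over the subsampled training split would take. It assumes the `Trainer` default of 8 examples per device, a single device, and no gradient accumulation; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples: int, per_device_batch_size: int = 8,
                    num_devices: int = 1, grad_accum_steps: int = 1) -> int:
    """Optimizer steps needed to see every training example once."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


train_rows = 125  # size of the subsampled training split
full_epoch_steps = steps_per_epoch(train_rows)
print(full_epoch_steps)  # 16
```

Under these assumptions a single step covers 8 of the 125 training examples, so a run capped at one step is a smoke test rather than a full epoch.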

In [None]:
trainer.train()  

In [46]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained('full/').to(torch_device)
original_model = original_model.to(torch_device)

In [48]:
rouge = evaluate.load('rouge')

Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 6.27MB/s]


In [49]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
# Iterate over the list directly instead of shadowing it with the loop variable
for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
    """
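    # Move the inputs to the same device as the models to avoid a device mismatch on GPU
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(torch_device)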
    
    original_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_text_output = tokenizer.decode(original_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_text_output)

    instruct_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_text_output = tokenizer.decode(instruct_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human', 'original', 'instruct'])

In [50]:
df

|   | human | original | instruct |
|---|-------|----------|----------|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Employees are required to use instant messagin... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | This memo will be sent to all employees by thi... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Employees are required to use the Office of In... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | People are talking about the traffic in this c... | #Person2# got stuck in traffic again. #Person1... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | #Person1: I'm finally here! | #Person2# got stuck in traffic again. #Person1... |
| 5 | #Person2# complains to #Person1# about the tra... | #Person1: I'm sorry to hear that you're stuck ... | #Person2# got stuck in traffic again. #Person1... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are divorced. | #Person1# tells Kate Masha and Hero are gettin... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are divorced. | #Person1# tells Kate Masha and Hero are gettin... |
| 8 | #Person1# and Kate talk about the divorce betw... | #Person1: Masha and Hero are getting a divorce. | #Person1# tells Kate Masha and Hero are gettin... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Brian, thank you for coming to the ... | Brian's birthday is coming. Brian dances with ... |


In [51]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True
)

In [52]:
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)
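
To make the ROUGE-1 numbers easier to interpret, here is a deliberately simplified sketch of the unigram-overlap F1 score behind that metric. It splits on whitespace and applies no stemming, so it will not reproduce the values returned by `rouge.compute`; the function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the dog")
print(score)  # 0.5
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.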


# Parameter-Efficient Fine-Tuning with LoRA


## **Example: LoRA vs. Full Fine-Tuning**
Suppose we have a model layer with a weight matrix **W** of size **(1000 × 1000)**.  
- In **full fine-tuning**, we update all **1,000,000 parameters** in **W**.  
- With **LoRA (r=16)**, we replace updates to **W** with two much smaller matrices:  
  - **A (1000 × 16)**
  - **B (16 × 1000)**  
- Instead of updating **1,000,000 parameters**, LoRA only trains **32,000 parameters** → **96.8% fewer parameters!**  
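
The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative only.

```python
d_in, d_out, r = 1000, 1000, 16

full_params = d_in * d_out          # every entry of W is trainable
lora_params = d_in * r + r * d_out  # A is (d_in x r), B is (r x d_out)
reduction = 1 - lora_params / full_params

print(full_params, lora_params, f"{reduction:.1%}")  # 1000000 32000 96.8%
```

The saving grows with the layer size: the LoRA parameter count scales linearly with the layer width, while the full matrix scales quadratically.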

---

In [53]:
from peft import LoraConfig, get_peft_model, TaskType, PeftModel, PeftConfig

## Set Up the PEFT/LoRA Model for Fine-Tuning
## LoRA Configuration Parameters Explained

### 1. `r=32` → Rank of the LoRA Matrices  
- Controls the size of the low-rank adaptation matrices (`A` and `B`).  
- **Higher `r`** → More trainable parameters, better learning but higher memory usage.  
- **Lower `r`** → Fewer parameters, faster training but may reduce model adaptability.  
- **Impact:** A trade-off between efficiency and model performance.  

### 2. `lora_alpha=32` → Scaling Factor  
- `lora_alpha` scales the LoRA update (`ΔW = α/r * A * B`).  
- Ensures the LoRA matrices don't overly distort the original frozen model.  
- **Higher `lora_alpha`** → Increases impact of LoRA updates.  
- **Lower `lora_alpha`** → Reduces adaptation strength.  
- **Impact:** Adjusts the balance between adaptation flexibility and stability.  
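
The scaling rule `ΔW = α/r * A * B` can be sketched on a toy example in plain Python. This follows the notation used in this notebook (`A` tall, `B` wide); libraries may order or name the two factors differently, and the helpers `matmul` and `lora_delta` are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(A, B, lora_alpha, r):
    """Scaled low-rank update: (lora_alpha / r) * A @ B."""
    scale = lora_alpha / r
    return [[scale * value for value in row] for row in matmul(A, B)]


# Toy rank-1 update for a 3x3 weight matrix
A = [[1.0], [2.0], [3.0]]   # shape (3, 1)
B = [[0.5, 0.0, -1.0]]      # shape (1, 3)
delta = lora_delta(A, B, lora_alpha=2, r=1)
print(delta)  # [[1.0, 0.0, -2.0], [2.0, 0.0, -4.0], [3.0, 0.0, -6.0]]

# With the values used in this notebook, the scale factor is exactly 1
notebook_scale = 32 / 32
```

Because the update is multiplied by `lora_alpha / r`, doubling `r` while keeping `lora_alpha` fixed halves the weight given to the product `A * B`.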

### 3. `target_modules=['q', 'v']` → Which Layers to Apply LoRA To  
- LoRA is applied only to the **attention mechanism** of the transformer.  
- `"q"` → Query matrix (controls how words attend to each other).  
- `"v"` → Value matrix (controls what information gets passed in attention).  
- **Impact:**  
  - Applying LoRA only to `"q"` and `"v"` reduces compute cost while maintaining performance.  
  - Adding `"k"` (Key) and `"o"` (Output) would fine-tune more components but use more memory.  

### 4. `lora_dropout=0.05` → Dropout for Regularization  
- Applies **dropout to the inputs of the LoRA branch** during training to reduce overfitting.  
- **Higher dropout (`>0.1`)** → Reduces overfitting but may slow learning.  
- **Lower dropout (`<0.05`)** → Retains more updates but risks overfitting.  
- **Impact:** Helps prevent the LoRA parameters from overfitting on small datasets.  

### 5. `bias='none'` → Whether to Fine-Tune Bias Terms  
- `"none"` → Biases remain frozen (**default, saves memory**).  
- `"all"` → Fine-tunes all biases (**increases trainable parameters**).  
- `"lora_only"` → Fine-tunes only biases within LoRA layers.  
- **Impact:** Keeping biases frozen reduces computational cost while maintaining efficiency.  

### 6. `task_type=TaskType.SEQ_2_SEQ_LM` → Task Type for Fine-Tuning  
- Specifies that we are fine-tuning a **sequence-to-sequence language model** (e.g., FLAN-T5).  
- Common task types include:  
  - `"CAUSAL_LM"` → For **autoregressive models** like GPT.  
  - `"SEQ_2_SEQ_LM"` → For **encoder-decoder models** like T5 that generate output from input.  
- **Impact:** Ensures LoRA is correctly applied based on the model architecture.  


In [54]:
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=['q','v'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.SEQ_2_SEQ_LM
)

In [55]:
peft_model = get_peft_model(original_model, lora_config)
print(number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
Percentage of model params: 1.4092820552029972




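With **LoRA fine-tuning**, only about **3.5M** of the model's **251M** parameters are trainable, roughly **1.41%** of the total. The 251M figure includes the added LoRA weights; the frozen base model accounts for the remainder.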
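
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the usual FLAN-T5-base dimensions (hidden size 768, 12 heads of size 64, 12 encoder and 12 decoder blocks); these values are not shown in this notebook and should be checked against the model config.

```python
# Assumed FLAN-T5-base dimensions (not printed anywhere in this notebook)
d_model = 768
inner_dim = 12 * 64          # num_heads * d_kv
encoder_blocks = 12
decoder_blocks = 12

r = 32                       # LoRA rank from lora_config
targets_per_attention = 2    # 'q' and 'v'

# Each decoder block has self-attention and cross-attention; each encoder block has one
attention_layers = encoder_blocks + 2 * decoder_blocks
# One adapted projection adds two factors: (d_model x r) and (r x inner_dim)
params_per_projection = r * (d_model + inner_dim)

lora_params = attention_layers * targets_per_attention * params_per_projection
share = lora_params / 251116800 * 100

print(lora_params)        # 3538944
print(f"{share:.2f}%")    # 1.41%
```

Under these assumptions the count matches the 3,538,944 trainable parameters reported by `number_of_trainable_model_parameters`.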

In [56]:
output_dir = f"./peft-dialogue-summary-training-{str(int(time.time()))}"

training_args = TrainingArguments(
    auto_find_batch_size=True,
    output_dir=output_dir,
    learning_rate=1e-3,
    num_train_epochs=100,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # overrides num_train_epochs: training stops after one optimizer step
)

peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [None]:
peft_trainer.train()

In [58]:
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    output_dir
).to(torch_device)
original_model = original_model.to(torch_device)


### PEFT model Training Time:
- **Device:** Mac M2 (`mps`)
- **Epochs:** 1
- **Total Time:** ~7 hours

## Qualitative Evaluation  

We assess the model's inference by comparing the model generated summary against the human-written summary, checking for fluency, accuracy, conciseness, and relevance.


In [59]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
"""

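# Move the inputs to the same device as the models to avoid a device mismatch on GPU
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(torch_device)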

original_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_text_output = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

peft_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_text_output = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

dash_line = "-".join("" for x in range(100))
print(dash_line)
print(f"Input Prompt:\n{prompt}")
print(dash_line)
print(f"Baseline Human Summary:\n{human_baseline_summary}\n")
print(dash_line)
print(f"Original Model Generation - Zero Shot: \n{original_text_output}")
print(dash_line)
print(f"PEFT Model Generation - Zero Shot: \n{peft_text_output}")

---------------------------------------------------------------------------------------------------
Input Prompt:

Summarize the following conversation

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

--------------------------------------------------------------------

In [60]:
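dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

# original_model_summaries is reused from the earlier evaluation cell (same 10 test
# dialogues); resetting it to [] here would make zip() below produce an empty table.
peft_model_summaries = []
for dialogue in dialogues: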
    prompt = f"""
Summarize the following conversation

{dialogue}

Summary:
    """
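    # Move the inputs to the same device as the model to avoid a device mismatch on GPU
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(torch_device)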
    
    peft_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_text_output = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human', 'original', 'peft'])

In [61]:
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True
)

print(f"Original Model: \n{original_model_results}")
print(f"Instruct Model: \n{instruct_model_results}")
print(f"Peft Model: \n{peft_model_results}")

Original Model: 
{'rouge1': 0.261052062988671, 'rouge2': 0.08531489481944488, 'rougeL': 0.224821552384684, 'rougeLsum': 0.22788611265447228}
Instruct Model: 
{'rouge1': 0.38857220563277894, 'rouge2': 0.13135692283806472, 'rougeL': 0.28167162470172985, 'rougeLsum': 0.28344342480768214}
Peft Model: 
{'rouge1': 0.33176278482581173, 'rouge2': 0.08811333505050914, 'rougeL': 0.2509677309788697, 'rougeLsum': 0.25262149176905513}


## Comparing Models Using ROUGE Scores  

We compare the **zero-shot (original), fully fine-tuned (instruct), and PEFT fine-tuned (LoRA) models** using ROUGE scores.  

### **ROUGE Score Comparison:**
| Model            | ROUGE-1 ↑  | ROUGE-2 ↑  | ROUGE-L ↑  | ROUGE-Lsum ↑ |
|-----------------|------------|------------|------------|--------------|
| **Original Model**  | **0.261**  | **0.085**  | **0.225**  | **0.228**  |
| **Fully Fine-Tuned (Instruct)** | **0.389**  | **0.131**  | **0.282**  | **0.283**  |
| **PEFT (LoRA) Model** | **0.332**  | **0.088**  | **0.251**  | **0.253**  |

### **Observations:**
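- **The fully fine-tuned (instruct) model performs best**: ROUGE-1 rises from 0.261 to 0.389, a gain of about **12.8 percentage points** over the original model.  
- **The PEFT (LoRA) model improves ROUGE-1 by about 7.1 percentage points** (0.261 → 0.332) while training only about 1.4% of the parameters; ROUGE-2 is almost unchanged (0.085 → 0.088).  
- **LoRA recovers a little over half of full fine-tuning's ROUGE-1 gain** at a much lower computational cost. These scores come from only 10 test dialogues, so the differences are indicative rather than conclusive.
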
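
The ROUGE-1 differences can be expressed both as absolute percentage points and as relative gains. The sketch below computes both from the scores printed earlier; the dictionary layout is illustrative only.

```python
# ROUGE-1 scores copied from the evaluation output above
rouge1 = {
    "original": 0.261052062988671,
    "instruct": 0.38857220563277894,
    "peft": 0.33176278482581173,
}

baseline = rouge1["original"]
for name in ("instruct", "peft"):
    gain = rouge1[name] - baseline
    print(f"{name}: +{gain * 100:.1f} points absolute, +{gain / baseline:.1%} relative")

# Share of the full fine-tuning gain that the LoRA model recovers
recovered = (rouge1["peft"] - baseline) / (rouge1["instruct"] - baseline)
print(f"share of the full fine-tuning gain recovered by LoRA: {recovered:.0%}")
```

Reporting the gain in percentage points avoids confusing an absolute difference in ROUGE with a relative improvement over the baseline.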
## Training Limitations and Future Improvements  
Due to computational constraints, we fine-tuned the model for **only one epoch**.  
The model's performance can be **further improved** by training for **more epochs**, allowing it to learn better representations and refine summaries.

