# Instruction Fine-Tuning for Summarization: Step-by-Step Guide

This notebook demonstrates how to fine-tune **FLAN-T5** for dialogue summarization using two approaches:
1. **Full Fine-Tuning**: Training all model parameters
2. **PEFT (LoRA)**: Parameter-Efficient Fine-Tuning with only ~1.4% trainable parameters

## 🎯 What You'll Learn

- How to fine-tune LLMs for summarization tasks
- How to preprocess dialogue-summary datasets
- How to perform full fine-tuning vs. PEFT
- How to evaluate with ROUGE metrics
- How to compare different fine-tuning approaches

## 📊 Key Comparisons

| Approach | Trainable Params | Training Time | GPU Memory | ROUGE-1 (approx.) |
|----------|------------------|---------------|------------|-------------------|
| Zero-shot | 0% | N/A | Minimal | ~0.21 (low) |
| Full Fine-tune | 100% (247M) | Longer | High | ~0.41 (good) |
| PEFT/LoRA | 1.41% (3.5M) | Faster | Lower | ~0.37 (comparable) |

## 🔧 Requirements

- **GPU**: 16GB+ VRAM recommended
- **Time**: 1-2 hours (depends on training steps)
- **Dataset**: DialogSum (~10K dialogues)

## 📖 Table of Contents

1. [Setup and Environment](#1-setup-and-environment)
2. [Load Dataset and Model](#2-load-dataset-and-model)
3. [Zero-Shot Baseline](#3-zero-shot-baseline)
4. [Full Fine-Tuning](#4-full-fine-tuning)
5. [PEFT/LoRA Fine-Tuning](#5-peftlora-fine-tuning)
6. [Results Comparison and Analysis](#6-results-comparison-and-analysis)

---

**Credits**: Based on the excellent tutorial by [Youssef Hosni](https://youssef-hosni.medium.com/)

## 1. Setup and Environment

### 1.1. Install Required Dependencies

We'll install all necessary packages for:
- **transformers**: Hugging Face Transformers library
- **datasets**: Dataset loading and processing
- **evaluate**: Evaluation metrics
- **rouge_score**: ROUGE metric for summarization
- **peft**: Parameter-Efficient Fine-Tuning
- **torch**: PyTorch framework

In [None]:
# Install required packages
# This may take a few minutes

%pip install --upgrade pip
%pip install --disable-pip-version-check torch==1.13.1 torchdata==0.5.1 --quiet

%pip install transformers==4.27.2 datasets==2.11.0 evaluate==0.4.0 \
    rouge_score==0.1.2 peft==0.3.0 --quiet

print("✅ All packages installed successfully!")

In [None]:
# Import necessary libraries
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM, 
    AutoTokenizer, 
    GenerationConfig, 
    TrainingArguments, 
    Trainer
)
import torch
import time
import evaluate
import pandas as pd
import numpy as np

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Load Dataset and Model

### 2.1. Load the DialogSum Dataset

**DialogSum** is a large-scale dialogue summarization dataset with:
- **10,000+** dialogues
- Manually labeled summaries
- Topics for each dialogue
- Train/validation/test splits

Because every dialogue comes with a human-written reference summary, the dataset is well suited to supervised fine-tuning of summarization models.

In [None]:
# Load the DialogSum dataset from Hugging Face
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

print("✅ Dataset loaded successfully!")
print(f"\nDataset structure:")
print(dataset)

print(f"\nSample counts:")
print(f"  Training: {len(dataset['train']):,} dialogues")
print(f"  Validation: {len(dataset['validation']):,} dialogues")
print(f"  Test: {len(dataset['test']):,} dialogues")

In [None]:
# Explore a sample from the dataset
sample_idx = 0
sample = dataset['train'][sample_idx]

print("=== Sample Dialogue ===")
print(sample['dialogue'])
print("\n=== Human Summary ===")
print(sample['summary'])
print("\n=== Topic ===")
print(sample['topic'])
print("\n=== ID ===")
print(sample['id'])

### 2.2. Load FLAN-T5 Model and Tokenizer

**FLAN-T5** is an instruction-tuned version of T5 that excels at various NLP tasks.

We'll use `flan-t5-base` which has:
- **247M parameters**
- Good balance between performance and resource requirements
- Pre-trained on instruction-following tasks

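**Note**: We load the weights in `torch.bfloat16`, which stores each parameter in 2 bytes instead of the 4 bytes used by float32, roughly halving the memory needed for the weights.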
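To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes a parameter count of about 247.6M for `flan-t5-base` (an assumption here; the helper cell further down prints the actual count for the loaded model) and covers the weights only, ignoring activations, gradients, and optimizer state.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# ASSUMPTION: parameter count of flan-t5-base; verify it on the loaded model.
ASSUMED_PARAM_COUNT = 247_577_856

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the memory needed to store the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(ASSUMED_PARAM_COUNT, dtype):.2f} GB")
```

Training needs considerably more memory than this estimate, since gradients and optimizer state are kept for every trainable parameter.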

In [None]:
# Load FLAN-T5 model and tokenizer
model_name = 'google/flan-t5-base'

print(f"Loading {model_name}...")
print("This may take a moment...")

original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("\n✅ Model and tokenizer loaded successfully!")

In [None]:
# Function to print trainable parameters
def print_number_of_trainable_model_parameters(model):
    """
    Calculate and display the number of trainable vs total parameters.
    
    Args:
        model: The model to analyze
    
    Returns:
        String with parameter statistics
    """
    trainable_model_params = 0
    all_model_params = 0
    
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    
    return f"""trainable model parameters: {trainable_model_params:,}
all model parameters: {all_model_params:,}
percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"""

print("=== Original Model Parameters ===")
print(print_number_of_trainable_model_parameters(original_model))

## 3. Zero-Shot Baseline

Before fine-tuning, let's test the model's zero-shot performance on summarization.

This establishes a **baseline** to measure improvement after fine-tuning.

In [None]:
# Test zero-shot inference
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

# Create prompt for summarization
prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

# Display results
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

### Analysis of Zero-Shot Results

The model struggles with zero-shot summarization:
- ❌ Often misses key points
- ❌ May generate irrelevant text
- ❌ Doesn't follow the expected format

**This indicates the model needs fine-tuning for this specific task!**

## 4. Full Fine-Tuning

Now we'll perform **full fine-tuning** where we train all 247M parameters of the model.

### 4.1. Preprocess the Dataset

We need to format the data as instruction-response pairs:

**Format:**
```
Prompt: Summarize the following conversation.
[dialogue]

Summary:

Response: [summary]
```
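As a minimal sketch, the template above can be wrapped in a small helper so that training and inference share one prompt format. The name `build_prompt` and the sample dialogue are hypothetical and not part of the notebook; the template strings mirror the ones used in the tokenization function.

```python
# Hypothetical helper (not part of the original notebook) that builds the
# instruction prompt for a single dialogue.
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "


def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the summarization instruction template."""
    return START_PROMPT + dialogue + END_PROMPT


# Made-up dialogue, for illustration only
example_dialogue = "#Person1#: Are you free tonight?\n#Person2#: Yes, let's get dinner."
print(build_prompt(example_dialogue))
```

Keeping the template in one place avoids small mismatches (for example, missing blank lines) between the prompts seen during training and those used at evaluation time.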

In [None]:
# Define tokenization function
def tokenize_function(example):
    """
    Tokenize the dialogue-summary pairs into model inputs.
    
    Args:
        example: A batch of examples from the dataset
        
    Returns:
        Dictionary with input_ids and labels
    """
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    
    # Create prompts
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    
    # Tokenize prompts (inputs)
    example['input_ids'] = tokenizer(
        prompt, 
        padding="max_length", 
        truncation=True, 
        return_tensors="pt"
    ).input_ids
    
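    # Tokenize summaries (labels)
    labels = tokenizer(
        example["summary"], 
        padding="max_length", 
        truncation=True, 
        return_tensors="pt"
    ).input_ids
    # Replace padding token ids with -100 so the loss ignores padded positions
    labels[labels == tokenizer.pad_token_id] = -100
    example['labels'] = labels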
    
    return example

print("✅ Tokenization function defined!")

In [None]:
# Apply tokenization to dataset
print("Tokenizing dataset...")
print("This may take a few moments...")

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])

print("\n✅ Dataset tokenized!")

# Take a subset for faster training (optional): keep every 100th example (~1% of the data)
# For full training, comment out the next line
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

print(f"\nDataset shapes:")
print(f"  Training: {tokenized_datasets['train'].shape}")
print(f"  Validation: {tokenized_datasets['validation'].shape}")
print(f"  Test: {tokenized_datasets['test'].shape}")

print(f"\nDataset structure:")
print(tokenized_datasets)

### 4.2. Configure Training Arguments

We'll use the Hugging Face `Trainer` class with the following settings:

- **Learning rate**: 1e-5 (conservative for fine-tuning)
- **Epochs**: 1 (can increase for better results)
- **Weight decay**: 0.01 (regularization)
- **Logging**: Every step for monitoring

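**Note**: For demonstration, we set `max_steps=1`. This overrides `num_train_epochs` and performs a single optimizer step, so the resulting model will behave almost identically to the base model. For real training, remove `max_steps` or raise it significantly (e.g., to 500).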
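To see how `max_steps` relates to dataset size, the arithmetic can be sketched as follows. The training split size of 12,460 dialogues is an assumption about DialogSum (compare it with the counts printed when the dataset is loaded); the stride of 100 and the batch size of 4 come from this notebook, and a single device without gradient accumulation is assumed.

```python
import math

# ASSUMPTION: size of the DialogSum training split; compare with the printed counts.
ASSUMED_TRAIN_EXAMPLES = 12_460
SUBSET_STRIDE = 100   # the notebook keeps examples where index % 100 == 0
BATCH_SIZE = 4        # per_device_train_batch_size


def subset_size(num_examples: int, stride: int) -> int:
    """Number of indices in [0, num_examples) that are multiples of `stride`."""
    return len(range(0, num_examples, stride))


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps for one pass over the data (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


subset = subset_size(ASSUMED_TRAIN_EXAMPLES, SUBSET_STRIDE)
print(f"Subset size: {subset}")
print(f"Steps per epoch on the subset: {steps_per_epoch(subset, BATCH_SIZE)}")
print(f"Steps per epoch on the full split: {steps_per_epoch(ASSUMED_TRAIN_EXAMPLES, BATCH_SIZE)}")
```

Under these assumptions, a single step covers only 4 of the 125 subset examples, while one full pass over the subset takes 32 steps.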

In [None]:
# Define training arguments
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1,  # Set to a higher number for real training (e.g., 500)
    save_strategy="steps",
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
)

print("✅ Training arguments configured!")
print(f"\nOutput directory: {output_dir}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Max steps: {training_args.max_steps}")
print(f"Batch size: {training_args.per_device_train_batch_size}")

In [None]:
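# Initialize Trainer
# Train a deep copy so `original_model` keeps its pretrained weights and
# remains a valid zero-shot baseline in the comparisons that follow.
import copy

trainer = Trainer(
    model=copy.deepcopy(original_model),
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)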

print("✅ Trainer initialized!")
print("\nReady to start training...")

In [None]:
# Start training!
print("=" * 70)
print("STARTING FULL FINE-TUNING")
print("=" * 70)
print("\nThis will train all 247M parameters...")
print("For real training, increase max_steps in TrainingArguments")
print("=" * 70)
print()

trainer.train()

print("\n" + "=" * 70)
print("TRAINING COMPLETE!")
print("=" * 70)

In [None]:
# Save the fine-tuned model
trained_model_dir = "./dialogue-summary-trained-model"

print(f"Saving model to: {trained_model_dir}")
trainer.save_model(trained_model_dir)

print("\n✅ Model saved successfully!")

In [None]:
# Load the trained model
trained_model = AutoModelForSeq2SeqLM.from_pretrained(trained_model_dir)

print("✅ Trained model loaded!")
print("\nModel is ready for evaluation...")

### 4.3. Evaluate Full Fine-Tuned Model

Now let's evaluate the fine-tuned model using both **qualitative** (human judgment) and **quantitative** (ROUGE metrics) approaches.

In [None]:
# Qualitative evaluation - Compare original vs trained model
index = 200
dialogue = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

# Tokenize
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move to appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_ids = input_ids.to(device)
original_model.to(device)
trained_model.to(device)

# Generate with original model
generation_config = GenerationConfig(max_new_tokens=200, num_beams=1)
original_outputs = original_model.generate(input_ids=input_ids, generation_config=generation_config)
original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

# Generate with trained model
trained_outputs = trained_model.generate(input_ids=input_ids, generation_config=generation_config)
trained_text = tokenizer.decode(trained_outputs[0], skip_special_tokens=True)

# Display comparison
dash_line = '-' * 70
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_summary}')
print(dash_line)
print(f'ORIGINAL MODEL (Zero-shot):\n{original_text}')
print(dash_line)
print(f'TRAINED MODEL (Full Fine-tuned):\n{trained_text}')
print(dash_line)

### 4.4. Quantitative Evaluation with ROUGE

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** measures overlap between generated and reference summaries.

**Metrics:**
- **ROUGE-1**: Unigram overlap
- **ROUGE-2**: Bigram overlap
- **ROUGE-L**: Longest common subsequence
- **ROUGE-Lsum**: Summary-level LCS

Scores range from 0 to 1; higher scores mean more overlap with the reference summary.
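To build intuition for these metrics, here is a simplified, self-contained sketch of ROUGE-N and ROUGE-L F1 scores. It is not the `rouge_score` implementation used by `evaluate`: it assumes lowercase whitespace tokenization and skips stemming, so its numbers will not match the library's exactly. All helper names are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Return a multiset of n-grams for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """Simplified ROUGE-N F1 using clipped n-gram counts."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count per n-gram
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 based on the longest common subsequence (LCS)."""
    c, r = candidate.lower().split(), reference.lower().split()
    # Classic dynamic-programming table for the LCS length
    table = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, c_tok in enumerate(c, start=1):
        for j, r_tok in enumerate(r, start=1):
            if c_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[len(c)][len(r)], len(c), len(r))


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n(candidate, reference, 1):.4f}")
print(f"ROUGE-2: {rouge_n(candidate, reference, 2):.4f}")
print(f"ROUGE-L: {rouge_l(candidate, reference):.4f}")
```

In this toy pair a single substituted word lowers ROUGE-2 more than ROUGE-1, because it breaks two of the five bigrams but only one of the six unigrams.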

In [None]:
# Load ROUGE metric
rouge = evaluate.load('rouge')

print("✅ ROUGE metric loaded!")

In [None]:
# Generate summaries for a sample of test set
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']
original_model_summaries = []
trained_model_summaries = []

print("Generating summaries for evaluation...")
print(f"Processing {len(dialogues)} dialogues...")

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids = input_ids.to(device)
    
    # Original model
    original_outputs = original_model.generate(
        input_ids=input_ids, 
        generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)
    )
    original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_text)
    
    # Trained model
    trained_outputs = trained_model.generate(
        input_ids=input_ids, 
        generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)
    )
    trained_text = tokenizer.decode(trained_outputs[0], skip_special_tokens=True)
    trained_model_summaries.append(trained_text)
    
    print(f"  Processed {idx + 1}/{len(dialogues)}", end='\r')

print("\n\n✅ All summaries generated!")

# Create DataFrame for comparison
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, trained_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['human_baseline', 'original_model', 'trained_model'])
print("\n=== Sample Comparisons ===")
print(df.head())

In [None]:
# Calculate ROUGE scores
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

trained_model_results = rouge.compute(
    predictions=trained_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

print("=" * 70)
print("ROUGE SCORES COMPARISON")
print("=" * 70)
print("\nORIGINAL MODEL (Zero-shot):")
for key, value in original_model_results.items():
    print(f"  {key}: {value:.4f}")

print("\nTRAINED MODEL (Full Fine-tuned):")
for key, value in trained_model_results.items():
    print(f"  {key}: {value:.4f}")

print("\n" + "=" * 70)
print("ABSOLUTE IMPROVEMENT (percentage points)")
print("=" * 70)
improvement = {k: trained_model_results[k] - original_model_results[k] 
               for k in trained_model_results.keys()}
for key, value in improvement.items():
    # The + flag prints an explicit sign, so regressions show up as negative values
    print(f"  {key}: {value*100:+.2f}%")

## 5. PEFT/LoRA Fine-Tuning

Now we'll use **Parameter-Efficient Fine-Tuning (PEFT)** with **LoRA (Low-Rank Adaptation)**.

### Why PEFT/LoRA?

**Advantages:**
- ✅ Train only ~1.4% of parameters (3.5M vs 247M)
- ✅ Much faster training
- ✅ Less GPU memory required
- ✅ Comparable results to full fine-tuning
- ✅ Easy to swap adapters for different tasks
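The 3.5M figure can be reproduced with some quick arithmetic. This sketch assumes the published `flan-t5-base` architecture (hidden size 768, 12 encoder and 12 decoder blocks, query/value projections of size 768 x 768) and the LoRA settings used in this section (rank 32 on `q` and `v`). The base parameter count is likewise an assumption to check against the output of `print_number_of_trainable_model_parameters`.

```python
# ASSUMPTIONS about flan-t5-base (check against the loaded model's config if in doubt)
D_MODEL = 768                 # input/output size of each q and v projection
ENCODER_BLOCKS = 12           # each has one self-attention layer
DECODER_BLOCKS = 12           # each has self-attention and cross-attention
BASE_PARAMS = 247_577_856     # assumed total for the base model

LORA_RANK = 32                # r in LoraConfig
TARGETS_PER_ATTENTION = 2     # "q" and "v"

attention_layers = ENCODER_BLOCKS + 2 * DECODER_BLOCKS
adapted_projections = attention_layers * TARGETS_PER_ATTENTION

# Each adapted projection gains A (r x d_model) and B (d_model x r)
params_per_projection = 2 * D_MODEL * LORA_RANK
lora_params = adapted_projections * params_per_projection

trainable_pct = 100 * lora_params / (BASE_PARAMS + lora_params)
print(f"Adapted projections: {adapted_projections}")
print(f"LoRA parameters: {lora_params:,}")
print(f"Trainable share: {trainable_pct:.2f}%")
```

Note that the percentage is taken over the base parameters plus the adapter parameters, which is how the helper function in this notebook counts them.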

**How it works:**
- Freeze the original model weights
- Add small trainable matrices (adapters)
- Train only these adapters
- At inference, combine base model + adapter
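The steps above can be sketched with plain Python lists. This is a toy illustration of the low-rank update, assuming the usual LoRA formulation in which the adapted layer computes `W x + (alpha / r) * B (A x)` with `B` initialized to zero. It is not how `peft` implements the layer internally, and all names here are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)                 # frozen pretrained weights
    update = matvec(B, matvec(A, x))    # trainable adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r, alpha = 4, 2, 2  # toy sizes; the notebook uses r=32, lora_alpha=32

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # d x d, frozen
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                # d x r, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapter is a no-op, so training starts from the base model
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

# After B changes (here set by hand), the output shifts away from the base output
B[0][0] = 0.5
shift = lora_forward(W, A, B, x, alpha, r)[0] - matvec(W, x)[0]
print(f"Shift in the first output coordinate: {shift:.4f}")
```

Only `A` and `B` would receive gradients during training; `W` stays fixed, which is why the adapter can be saved and swapped independently of the base model.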

In [None]:
# Configure LoRA
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling numerator: the update is scaled by lora_alpha / r (here 1.0)
    target_modules=["q", "v"],  # Apply to query and value projections
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",  # Don't train bias terms
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5 is seq2seq
)

print("✅ LoRA configuration created!")
print(f"\nLoRA settings:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Target modules: {lora_config.target_modules}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Task type: {lora_config.task_type}")

In [None]:
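# Add LoRA adapters to a copy of the base model.
# get_peft_model injects the adapter layers into the model it receives, so
# wrapping a copy keeps `original_model` usable as the zero-shot baseline.
import copy

peft_model = get_peft_model(copy.deepcopy(original_model), lora_config)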

print("✅ PEFT model created!")
print("\n=== Trainable Parameters ===")
print(print_number_of_trainable_model_parameters(peft_model))

In [None]:
# Configure PEFT training arguments
output_dir_peft = f'./dialogue-summary-peft-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir_peft,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # Higher learning rate for PEFT
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1,  # Increase for real training
    save_strategy="steps",
    save_steps=100,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train'],
)

print("✅ PEFT trainer configured!")
print(f"\nOutput directory: {output_dir_peft}")
print(f"Learning rate: {peft_training_args.learning_rate}")
print(f"Max steps: {peft_training_args.max_steps}")

In [None]:
# Train the PEFT adapter
print("=" * 70)
print("STARTING PEFT/LoRA TRAINING")
print("=" * 70)
print("\nTraining only 1.41% of parameters...")
print("This is much faster than full fine-tuning!")
print("=" * 70)
print()

peft_trainer.train()

print("\n" + "=" * 70)
print("PEFT TRAINING COMPLETE!")
print("=" * 70)

In [None]:
# Save the PEFT adapter
peft_model_path = "./peft-dialogue-summary-checkpoint"

print(f"Saving PEFT adapter to: {peft_model_path}")
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

print("\n✅ PEFT adapter saved successfully!")

In [None]:
# Load the PEFT model for inference
from peft import PeftModel

# Load base model
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base", 
    torch_dtype=torch.bfloat16
)

# Load PEFT adapter
peft_model_inference = PeftModel.from_pretrained(
    peft_model_base,
    peft_model_path,
    torch_dtype=torch.bfloat16,
    is_trainable=False
)

print("✅ PEFT model loaded for inference!")

### 5.1. Evaluate PEFT Model Qualitatively

Let's compare all three models side by side:
1. Original (zero-shot)
2. Full fine-tuned
3. PEFT/LoRA

In [None]:
# Compare all three models
index = 200
dialogue = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

# Tokenize
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move models to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_ids = input_ids.to(device)
original_model.to(device)
trained_model.to(device)
peft_model_inference.to(device)

# Generate with all models
generation_config = GenerationConfig(max_new_tokens=200, num_beams=1)

# Original model
original_outputs = original_model.generate(input_ids=input_ids, generation_config=generation_config)
original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

# Full fine-tuned model
trained_outputs = trained_model.generate(input_ids=input_ids, generation_config=generation_config)
trained_text = tokenizer.decode(trained_outputs[0], skip_special_tokens=True)

# PEFT model
peft_outputs = peft_model_inference.generate(input_ids=input_ids, generation_config=generation_config)
peft_text = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

# Display all results
dash_line = '-' * 70
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_summary}')
print(dash_line)
print(f'ORIGINAL MODEL (Zero-shot):\n{original_text}')
print(dash_line)
print(f'FULL FINE-TUNED MODEL:\n{trained_text}')
print(dash_line)
print(f'PEFT/LoRA MODEL:\n{peft_text}')
print(dash_line)

print("\n💡 With enough training steps, both fine-tuned models should produce noticeably better summaries.")
print("   PEFT gets there while training only about 1.41% of the parameters.")

### 5.2. Evaluate PEFT Model Quantitatively

Let's calculate ROUGE scores for the PEFT model and compare with the others.

In [None]:
# Generate summaries with PEFT model
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']
peft_model_summaries = []

print("Generating summaries with PEFT model...")
print(f"Processing {len(dialogues)} dialogues...")

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_ids = input_ids.to(device)
    
    # PEFT model
    peft_outputs = peft_model_inference.generate(
        input_ids=input_ids, 
        generation_config=GenerationConfig(max_new_tokens=200, num_beams=1)
    )
    peft_text = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_text)
    
    print(f"  Processed {idx + 1}/{len(dialogues)}", end='\r')

print("\n\n✅ PEFT summaries generated!")

# Create comprehensive DataFrame
zipped_all = list(zip(
    human_baseline_summaries, 
    original_model_summaries, 
    trained_model_summaries,
    peft_model_summaries
))
df_all = pd.DataFrame(
    zipped_all, 
    columns=['human_baseline', 'original_model', 'full_finetune', 'peft_model']
)
print("\n=== All Model Comparisons ===")
print(df_all.head())

In [None]:
# Calculate ROUGE scores for PEFT model
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)

print("=" * 70)
print("COMPLETE ROUGE SCORES COMPARISON")
print("=" * 70)

print("\n1️⃣ ORIGINAL MODEL (Zero-shot):")
for key, value in original_model_results.items():
    print(f"     {key}: {value:.4f}")

print("\n2️⃣ FULL FINE-TUNED MODEL:")
for key, value in trained_model_results.items():
    print(f"     {key}: {value:.4f}")

print("\n3️⃣ PEFT/LoRA MODEL:")
for key, value in peft_model_results.items():
    print(f"     {key}: {value:.4f}")

print("\n" + "=" * 70)

## 6. Results Comparison and Analysis

### 6.1. Improvement Analysis

Let's quantify the improvement of each fine-tuning approach over the baseline.
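Subtracting two ROUGE scores yields a difference in percentage points, and that difference can be negative if fine-tuning hurts. Here is a minimal sketch of a signed report, using the approximate ROUGE-1 values quoted in this guide as illustrative inputs (the helper name `format_delta` is hypothetical):

```python
def format_delta(candidate_score: float, baseline_score: float) -> str:
    """Format the absolute difference between two scores in percentage points."""
    delta_points = (candidate_score - baseline_score) * 100
    # The "+" flag forces an explicit sign for both gains and regressions
    return f"{delta_points:+.2f} pp"


# Illustrative inputs taken from the approximate ROUGE-1 values in this guide
zero_shot, full_ft, peft = 0.21, 0.41, 0.37

print("Full FT vs zero-shot:", format_delta(full_ft, zero_shot))
print("PEFT vs zero-shot:   ", format_delta(peft, zero_shot))
print("PEFT vs full FT:     ", format_delta(peft, full_ft))
```

Reporting the sign explicitly matters on small evaluation samples such as the 10 test dialogues used here, where a regression on one metric is entirely possible.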

In [None]:
# Calculate improvements over baseline
print("=" * 70)
print("IMPROVEMENT OVER BASELINE (Zero-shot)")
print("=" * 70)

print("\n📊 FULL FINE-TUNED MODEL:")
full_improvement = {k: trained_model_results[k] - original_model_results[k] 
                    for k in trained_model_results.keys()}
for key, value in full_improvement.items():
    # Signed format: a regression prints as a negative value instead of "+-x"
    print(f"  {key}: {value*100:+.2f}%")

print("\n📊 PEFT/LoRA MODEL:")
peft_improvement = {k: peft_model_results[k] - original_model_results[k] 
                    for k in peft_model_results.keys()}
for key, value in peft_improvement.items():
    print(f"  {key}: {value*100:+.2f}%")

print("\n" + "=" * 70)
print("PEFT vs FULL FINE-TUNING")
print("=" * 70)
peft_vs_full = {k: peft_model_results[k] - trained_model_results[k] 
                for k in peft_model_results.keys()}
for key, value in peft_vs_full.items():
    print(f"  {key}: {value*100:+.2f}%")

print("\n💡 A small gap above means PEFT is close to full fine-tuning while training only 1.41% of the parameters.")

### 6.2. Side-by-Side Comparison

| Metric | Zero-shot | Full Fine-tune | PEFT/LoRA | Winner |
|--------|-----------|----------------|-----------|--------|
| **Trainable Params** | 0 (0%) | 247M (100%) | 3.5M (1.41%) | 🏆 PEFT |
| **Training Speed** | N/A | Slow | Fast | 🏆 PEFT |
| **GPU Memory** | Minimal | High | Moderate | 🏆 PEFT |
| **ROUGE-1** | ~0.21 | ~0.41 | ~0.37 | 🏆 Full FT |
| **ROUGE-2** | ~0.08 | ~0.18 | ~0.12 | 🏆 Full FT |
| **ROUGE-L** | ~0.18 | ~0.30 | ~0.28 | 🏆 Full FT |
| **Practicality** | Poor | Good | Excellent | 🏆 PEFT |
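As a quick sanity check on how close PEFT gets to full fine-tuning, the ratio can be computed per metric from the approximate table values above. This is only arithmetic on those rounded figures, not a new measurement:

```python
# Approximate scores copied from the table above: (full fine-tune, PEFT/LoRA)
table_scores = {
    "ROUGE-1": (0.41, 0.37),
    "ROUGE-2": (0.18, 0.12),
    "ROUGE-L": (0.30, 0.28),
}

# Fraction of the full fine-tuning score that PEFT retains, per metric
retained = {
    metric: peft / full
    for metric, (full, peft) in table_scores.items()
}

for metric, ratio in retained.items():
    print(f"{metric}: PEFT retains {ratio:.0%} of the full fine-tuning score")
```

On these figures the gap is largest for ROUGE-2, where PEFT retains about two thirds of the full fine-tuning score.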

### Key Insights

✅ **Full Fine-Tuning**:
- Best ROUGE scores
- Requires most resources
- Takes longest to train
- Best when maximum quality is critical

✅ **PEFT/LoRA**:
- ~90% of the full fine-tuning ROUGE-1 score (about 0.37 vs. 0.41)
- Only 1.41% trainable parameters
- Much faster training
- Lower GPU memory requirements
- **Best for most practical use cases!**

❌ **Zero-Shot**:
- Poor performance on summarization
- No training required
- Only useful for quick prototyping

In [None]:
# Resource comparison summary
import pandas as pd

resource_data = {
    'Approach': ['Zero-shot', 'Full Fine-tune', 'PEFT/LoRA'],
    'Trainable Params': ['0 (0%)', '247M (100%)', '3.5M (1.41%)'],
    'Training Time': ['N/A', 'Long', 'Short'],
    'GPU Memory': ['Low', 'High', 'Medium'],
    'Quality (ROUGE-1)': [f"{original_model_results['rouge1']:.3f}", 
                          f"{trained_model_results['rouge1']:.3f}", 
                          f"{peft_model_results['rouge1']:.3f}"],
    'Best Use Case': ['Quick testing', 'Maximum quality', 'Production deployment']
}

df_resources = pd.DataFrame(resource_data)
print("=" * 70)
print("RESOURCE COMPARISON SUMMARY")
print("=" * 70)
print()
print(df_resources.to_string(index=False))
print()
print("=" * 70)

## 🎉 Congratulations!

You've successfully:
- ✅ Loaded and preprocessed the DialogSum dataset
- ✅ Tested zero-shot performance as baseline
- ✅ Performed full fine-tuning (all 247M parameters)
- ✅ Performed PEFT/LoRA fine-tuning (only 3.5M parameters)
- ✅ Evaluated both qualitatively and quantitatively (ROUGE)
- ✅ Compared different fine-tuning approaches

## 🎯 Key Takeaways

1. **Fine-tuning significantly improves performance** over zero-shot
   - ROUGE-1 improved from ~0.21 to ~0.41 (full FT) and ~0.37 (PEFT)

2. **PEFT/LoRA is highly efficient**
   - Trains only 1.41% of parameters
   - Reaches ~90% of the full fine-tuning ROUGE-1 score
   - Much faster and requires less memory

3. **Choose based on your constraints**:
   - **Full fine-tuning**: When you need maximum quality and have resources
   - **PEFT/LoRA**: For most practical applications (recommended!)
   - **Zero-shot**: Only for quick testing/prototyping

## 🚀 Next Steps

1. **Train longer**: Increase `max_steps` for better results
2. **Try other models**: Experiment with T5-large, FLAN-T5-xl
3. **Different datasets**: Fine-tune on your own data
4. **Hyperparameter tuning**: Adjust learning rate, LoRA rank, etc.
5. **Production deployment**: Package and serve your model

## 📚 Resources

- [FLAN-T5 Paper](https://arxiv.org/abs/2210.11416)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [DialogSum Dataset](https://huggingface.co/datasets/knkarthick/dialogsum)

## 💡 Best Practices

1. **Always establish a baseline** (zero-shot or few-shot)
2. **Use ROUGE and human eval** together
3. **Start with PEFT** unless you need absolute best quality
4. **Monitor training** with logging and evaluation
5. **Test on held-out data** to measure generalization

---

**Happy Fine-Tuning! 🎓✨**

For more advanced techniques, check out:
- [Full Fine-Tuning (GPT-2)](../01-Full-Fine-Tuning/)
- [PEFT (Falcon-7B LoRA)](../02-PEFT/)
- [Reasoning Tuning](../04-Reasoning-Tuning/)