# Text Summarization with DialogSum

**Module 02 | Notebook 3 of 3**

In this notebook, we'll fine-tune a sequence-to-sequence model for dialogue summarization.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Work with encoder-decoder models (T5, BART)
2. Prepare data for seq2seq tasks
3. Use ROUGE metrics for evaluation
4. Generate summaries from dialogue

---

In [1]:
%%capture
!pip install transformers datasets accelerate evaluate rouge-score nltk

In [2]:
import torch
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import nltk
import warnings
warnings.filterwarnings('ignore')

# Download NLTK data for sentence tokenization
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---

## Understanding Sequence-to-Sequence Models

### Why T5 (Encoder-Decoder)?
Unlike BERT (Encoder-only, good for classification) or GPT (Decoder-only, good for generation), **T5 is an Encoder-Decoder model**.

1.  **Encoder**: Reads the entire input text (dialogue) and understands the context.
2.  **Decoder**: Generates the output text (summary) token by token, attending to the encoder's representation of the input.

This makes it well suited to tasks where the input and output are different texts (Summarization, Translation).

### How Summarization Works

```
Input (Dialogue):
┌─────────────────────────────────────────────┐
│ Person A: Hi, how are you?                  │
│ Person B: I'm great! Just got promoted.     │
│ Person A: Congratulations! That's amazing!  │
│ Person B: Thanks! Let's celebrate tonight.  │
└─────────────────────────────────────────────┘
                    │
                    ▼
            ┌───────────────┐
            │    ENCODER    │  (Understands input)
            └───────────────┘
                    │
                    ▼
            ┌───────────────┐
            │    DECODER    │  (Generates output)
            └───────────────┘
                    │
                    ▼
Output (Summary):
┌─────────────────────────────────────────────┐
│ Person B got promoted and they plan to      │
│ celebrate tonight.                          │
└─────────────────────────────────────────────┘
```

### Popular Summarization Models

| Model | Parameters | Best For |
|-------|------------|----------|
| T5-small | 60M | Quick experiments |
| T5-base | 220M | Good balance |
| BART-base | 140M | News articles |
| FLAN-T5 | Various | Instruction-tuned |

---

## Load the DialogSum Dataset

In [3]:
# Load DialogSum dataset
dataset = load_dataset("knkarthick/dialogsum")

print("Dataset structure:")
print(dataset)
print(f"\nTrain examples: {len(dataset['train']):,}")
print(f"Validation examples: {len(dataset['validation']):,}")
print(f"Test examples: {len(dataset['test']):,}")

In [4]:
# Explore the data
print("Sample dialogue and summary:")
print("=" * 60)
example = dataset['train'][0]

print("DIALOGUE:")
print(example['dialogue'])
print("\nSUMMARY:")
print(example['summary'])
print("\nTOPIC:")
print(example.get('topic', 'N/A'))

In [5]:
# Analyze lengths
dialogue_lengths = [len(d.split()) for d in dataset['train']['dialogue']]
summary_lengths = [len(s.split()) for s in dataset['train']['summary']]

print("Length Statistics (words):")
print(f"\nDialogues:")
print(f"  Mean: {np.mean(dialogue_lengths):.0f}")
print(f"  Max: {max(dialogue_lengths)}")

print(f"\nSummaries:")
print(f"  Mean: {np.mean(summary_lengths):.0f}")
print(f"  Max: {max(summary_lengths)}")

# Compression ratio
ratios = [s/d for s, d in zip(summary_lengths, dialogue_lengths) if d > 0]
print(f"\nCompression ratio: {np.mean(ratios):.1%}")

---

## Data Preparation for Seq2Seq

In [6]:
# Load model and tokenizer
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# T5 uses a prefix for summarization
prefix = "summarize: "

# Tokenization parameters
max_input_length = 512
max_target_length = 128

In [None]:
def preprocess_function(examples):
    """Tokenize inputs and targets for seq2seq training."""
    # Add the T5 task prefix to every dialogue
    inputs = [prefix + doc for doc in examples['dialogue']]

    # Tokenize inputs. No padding here: DataCollatorForSeq2Seq pads each
    # batch dynamically at training time.
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True
    )

    # Tokenize targets (summaries). Leaving them unpadded lets the collator
    # pad labels with -100, so padding positions are ignored by the loss.
    labels = tokenizer(
        text_target=examples['summary'],
        max_length=max_target_length,
        truncation=True
    )

    model_inputs['labels'] = labels['input_ids']

    return model_inputs
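
Padding matters for the labels in particular. `DataCollatorForSeq2Seq`, which this notebook uses in the training section, pads label sequences with `-100` by default (its `label_pad_token_id`), and the loss skips those positions. The pure-Python sketch below illustrates the idea on made-up token IDs. `pad_labels`, `count_loss_positions`, and `IGNORE_INDEX` are hypothetical names chosen for illustration, not library functions.

```python
IGNORE_INDEX = -100  # value the loss skips (assumed to match the collator default)

def pad_labels(batch, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [ignore_index] * (longest - len(seq)) for seq in batch]

def count_loss_positions(padded, ignore_index=IGNORE_INDEX):
    """Count the label positions that would contribute to the loss."""
    return sum(tok != ignore_index for row in padded for tok in row)

# Made-up token IDs for three summaries of different lengths
toy_labels = [[71, 1712, 1], [27, 43, 3, 9, 1], [8, 1]]
padded = pad_labels(toy_labels)

for row in padded:
    print(row)
print("Positions used by the loss:", count_loss_positions(padded))
```

Every row ends up the same length, but only the real tokens count toward the loss. If labels were padded with the tokenizer's pad token instead, the model would also be trained to predict padding.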

In [8]:
# Use smaller subsets for faster training
train_size = 1000
val_size = 200

train_data = dataset['train'].shuffle(seed=42).select(range(train_size))
val_data = dataset['validation'].shuffle(seed=42).select(range(val_size))

# Tokenize
train_tokenized = train_data.map(preprocess_function, batched=True)
val_tokenized = val_data.map(preprocess_function, batched=True)

print(f"Training samples: {len(train_tokenized)}")
print(f"Validation samples: {len(val_tokenized)}")

---

## Understanding ROUGE Metrics

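ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures summary quality by counting how many n-grams (word sequences) the generated summary shares with a reference summary. The worked example here shows recall; the `evaluate` implementation loaded in the next cell reports the F-measure, which also accounts for precision.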

### Visual Example
**Reference**: "The **cat** is **on** the **mat**."
**Generated**: "The **cat** is **on**."

1.  **ROUGE-1 (Unigrams)**: Matches single words.
    *   Matches: "The", "cat", "is", "on" (4 words).
    *   Score: 4/6 (Recall)

2.  **ROUGE-2 (Bigrams)**: Matches pairs.
    *   Matches: "The cat", "cat is", "is on" (3 pairs).
    *   Score: 3/5 (Recall)

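3.  **ROUGE-L**: Longest Common Subsequence (Structure).
    *   Finds the longest sequence of words that appears in both texts in the same order (gaps allowed).
    *   Here that is "The", "cat", "is", "on" (4 words).
    *   Score: 4/6 (Recall)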
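
The example can be checked by hand with a few lines of pure Python. This is a simplified sketch, not the `rouge_score` implementation: it lowercases, keeps only alphanumeric tokens, and skips stemming. The helper names (`tokenize`, `rouge_n`, `rouge_l`, `prf`) are assumptions made for illustration.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase and keep alphanumeric word tokens only (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf(overlap, n_generated, n_reference):
    """Precision, recall, and F1 from an overlap count."""
    precision = overlap / n_generated if n_generated else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rouge_n(reference, generated, n):
    ref = Counter(ngrams(tokenize(reference), n))
    gen = Counter(ngrams(tokenize(generated), n))
    overlap = sum((ref & gen).values())  # clipped n-gram matches
    return prf(overlap, sum(gen.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(reference, generated):
    ref, gen = tokenize(reference), tokenize(generated)
    return prf(lcs_length(ref, gen), len(gen), len(ref))

reference = "The cat is on the mat."
generated = "The cat is on."

r1 = rouge_n(reference, generated, 1)
r2 = rouge_n(reference, generated, 2)
rl = rouge_l(reference, generated)

for name, (p, r, f) in [("ROUGE-1", r1), ("ROUGE-2", r2), ("ROUGE-L", rl)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The recall values match the hand calculation (4/6, 3/5, 4/6). Precision is 1.0 in every case because everything in the short generated sentence also appears in the reference, which is why F1 comes out higher than recall.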

In [None]:
# Load ROUGE metric
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    """
    Compute ROUGE metrics for summarization evaluation.
    
    Note on Token Handling:
    -----------------------
    When using Seq2SeqTrainer with predict_with_generate=True, the predictions 
    and labels arrays come as numpy int64 values. The HuggingFace Fast Tokenizer 
    (Rust backend) can throw an OverflowError when decoding these values directly.
    
    This happens because:
    1. Padding tokens are marked as -100 (a convention in HuggingFace for ignore_index)
    2. Numpy int64 values may not convert cleanly to Rust's integer types
    3. Token IDs outside the valid vocabulary range cause decoding issues
    
    The fix below explicitly:
    - Converts each token to a Python int using int(tok)
    - Validates tokens are within [0, vocab_size) range
    - Replaces invalid tokens (including -100) with pad_token_id
    """
    predictions, labels = eval_pred

    # Get pad token id and vocab size for validation
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    vocab_size = tokenizer.vocab_size

    def clean(rows):
        """Convert tokens to Python ints; replace invalid IDs (including -100) with pad_id."""
        return [
            [int(tok) if 0 <= int(tok) < vocab_size else pad_id for tok in row]
            for row in rows
        ]

    # Decode predictions and labels after cleaning
    decoded_preds = tokenizer.batch_decode(clean(predictions), skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(clean(labels), skip_special_tokens=True)
    
    # Strip whitespace
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]
    
    # Compute ROUGE scores
    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )
    
    return {
        'rouge1': result['rouge1'],
        'rouge2': result['rouge2'],
        'rougeL': result['rougeL']
    }

---

## Model Training

> ⚠️ **Memory Warning**: T5 training uses ~6-8 GB GPU memory. If you get an **OOM (Out of Memory) Error**:
> - Reduce `per_device_train_batch_size` from 8 → 4 → 2
> - Reduce `train_size` to 500 in the data preparation step above
> - T5-small is already the lightest option; avoid T5-base on free Colab

In [14]:
# Load model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")

In [15]:
# Seq2Seq training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./summarization_model",
    num_train_epochs=3,              # 📚 Full passes through training data
    per_device_train_batch_size=8,   # 💾 Reduce to 4 or 2 if OOM (T5 is memory-hungry)
    per_device_eval_batch_size=8,
    learning_rate=3e-5,              # 🎯 Low end of a common T5 fine-tuning range (3e-5 to 1e-4)
    weight_decay=0.01,               # 🛡️ Helps prevent overfitting
    warmup_ratio=0.1,                # 🔥 Gradual LR ramp-up (10% of training steps)
    
    # Generation during evaluation
    predict_with_generate=True,      # 🔮 Actually generate text during eval (slower but needed for ROUGE)
    generation_max_length=max_target_length,
    
    # Evaluation
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rougeL",
    
    # Logging
    logging_steps=50,
    
    # Performance
    fp16=torch.cuda.is_available(),
    report_to="none"
)

# Data collator for seq2seq
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

# Create trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
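
To make `warmup_ratio=0.1` concrete, the sketch below computes the learning rate at a few steps. It assumes the Trainer's default linear schedule (linear ramp-up during warmup, then linear decay to zero), a single device, no gradient accumulation, and the 1,000-example training subset from earlier. `linear_schedule_lr` is a hypothetical helper written for illustration, and rounding the warmup step count up is also an assumption.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

train_examples, batch_size, epochs = 1000, 8, 3
steps_per_epoch = math.ceil(train_examples / batch_size)  # 125 batches per epoch
total_steps = steps_per_epoch * epochs                    # 375 optimizer steps
warmup_steps = math.ceil(0.1 * total_steps)               # 38 (rounding up is an assumption)

for step in (0, 19, 38, 200, 375):
    lr = linear_schedule_lr(step, 3e-5, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The rate starts at zero, reaches the configured `3e-5` once warmup ends, and falls back to zero by the final step. Changing `train_size`, the batch size, or the epoch count changes the number of warmup steps accordingly.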

In [16]:
# Train
print("Starting training...")
print("=" * 50)
trainer.train()
print("\nTraining complete!")

---

## Evaluation

In [17]:
# Evaluate
eval_results = trainer.evaluate()

print("Evaluation Results:")
print("=" * 40)
print(f"ROUGE-1: {eval_results['eval_rouge1']:.2%}")
print(f"ROUGE-2: {eval_results['eval_rouge2']:.2%}")
print(f"ROUGE-L: {eval_results['eval_rougeL']:.2%}")

> 📊 **Are these ROUGE scores good?**
> - ROUGE-1 ~30-40% is typical for abstractive summarization
> - ROUGE-L ~25-35% means decent structural similarity
> - Our DialogSum scores: ROUGE-1 ~40%, ROUGE-L ~35% ✓
> 
> **Important**: ROUGE isn't perfect! A summary can be great but score low if it uses different words than the reference. Human evaluation is often needed for final quality assessment.
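
A small experiment shows the limitation. The sketch below scores two candidate summaries against the same reference using a simplified unigram F1 (lowercased word tokens, no stemming). `unigram_f1` is a hypothetical stand-in for ROUGE-1, and the sentences are made up for illustration.

```python
from collections import Counter
import re

def unigram_f1(reference, generated):
    """ROUGE-1-style F1 on lowercase word tokens (no stemming; a simplification)."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    gen = Counter(re.findall(r"[a-z0-9]+", generated.lower()))
    overlap = sum((ref & gen).values())  # clipped word matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "Person B got promoted and they plan to celebrate tonight."
near_copy = "Person B got promoted and they celebrate tonight."
paraphrase = "B received a promotion, and the two will go out this evening."

near_copy_score = unigram_f1(reference, near_copy)
paraphrase_score = unigram_f1(reference, paraphrase)

print(f"Near copy:  {near_copy_score:.2f}")
print(f"Paraphrase: {paraphrase_score:.2f}")
```

Both candidates convey the same facts, yet the paraphrase scores far lower because it shares only two words with the reference. This is the gap that human evaluation is meant to cover.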

> 🔧 **Troubleshooting:**
> - **Summaries are too short or repetitive?** → Increase `min_length` for length; for repetition, try `num_beams=4` or set `no_repeat_ngram_size`
> - **Summaries don't make sense?** → Check training data quality or add more examples
> - **ROUGE dropping after epoch 2?** → Overfitting! Reduce epochs to 2 or add more training data

---

## Generate Summaries

In [18]:
from transformers import pipeline

# Create summarization pipeline
summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

In [19]:
# Test on examples from the test set
test_examples = dataset['test'].select(range(3))

print("Generated Summaries:")
print("=" * 60)

for i, example in enumerate(test_examples):
    dialogue = example['dialogue']
    reference = example['summary']
    
    # Generate summary
    generated = summarizer(
        prefix + dialogue,
        max_length=128,
        min_length=20,
        do_sample=False
    )[0]['summary_text']
    
    print(f"\n--- Example {i+1} ---")
    print(f"\nDIALOGUE (truncated):")
    print(dialogue[:300] + "...")
    print(f"\nREFERENCE SUMMARY:")
    print(reference)
    print(f"\nGENERATED SUMMARY:")
    print(generated)
    print("-" * 60)

In [20]:
# Test on custom dialogue
custom_dialogue = """
#Person1#: Hey Sarah, have you finished the project report?
#Person2#: Almost done! Just need to add the conclusion section.
#Person1#: Great! The client meeting is tomorrow at 2 PM.
#Person2#: I'll have it ready by noon so you can review it.
#Person1#: Perfect. Also, can you include the budget projections?
#Person2#: Already added those. I also updated the timeline chart.
#Person1#: You're a lifesaver! Thanks so much.
#Person2#: No problem! See you tomorrow.
"""

summary = summarizer(
    prefix + custom_dialogue,
    max_length=100,
    min_length=20,
    do_sample=False
)[0]['summary_text']

print("Custom Dialogue Summarization:")
print("=" * 50)
print("\nINPUT:")
print(custom_dialogue)
print("\nSUMMARY:")
print(summary)

---

## Generation Parameters

You can control how summaries are generated:

| Parameter | Effect |
|-----------|--------|
| `num_beams` | Example: 4. The model keeps the 4 best paths at each step. Better quality, but slower. |
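| `temperature` | Randomness; only takes effect when `do_sample=True`. Low (0.1) = near-deterministic and safe. High (0.9) = more varied but riskier. |
| `do_sample` | True = samples from the predicted distribution, so repeated runs can differ. False = deterministic output (greedy or beam search). |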
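
To see what `temperature` does numerically, the sketch below divides a set of made-up token scores by the temperature before applying softmax. `softmax_with_temperature` and `toy_logits` are hypothetical names for illustration. Real generation applies the same scaling to the model's logits at every step.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At a low temperature nearly all probability mass sits on the top-scoring token, so sampling behaves almost like greedy decoding. As the temperature rises the alternatives become more likely to be picked.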

In [21]:
test_dialogue = dataset['test'][0]['dialogue']

# Different generation strategies
print("Generation Strategies Comparison:")
print("=" * 50)

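# Greedy decoding: num_beams=1 is set explicitly because the pipeline may
# inherit beam-search defaults from the model's config
greedy = summarizer(prefix + test_dialogue, max_length=100, num_beams=1, do_sample=False)[0]['summary_text']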
print(f"\nGreedy: {greedy}")

# Beam search
beam = summarizer(prefix + test_dialogue, max_length=100, num_beams=4, do_sample=False)[0]['summary_text']
print(f"\nBeam Search (4): {beam}")

# Sampling
sampled = summarizer(prefix + test_dialogue, max_length=100, do_sample=True, top_k=50, temperature=0.7)[0]['summary_text']
print(f"\nSampling (temp=0.7): {sampled}")
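
Why can beam search beat greedy decoding? Greedy commits to the single most likely token at each step, which can lock it out of a better overall sequence. The toy sketch below uses a hand-written probability table in place of a real model. `TOY_PROBS` and `beam_search` are hypothetical and only illustrate the search procedure, with `num_beams=1` standing in for greedy decoding.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far
TOY_PROBS = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("C",): {"x": 0.4, "y": 0.3, "z": 0.3},
}

def beam_search(probs, num_beams, steps):
    """Keep the num_beams highest-scoring prefixes at each step; return the best one."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in probs[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_score = beam_search(TOY_PROBS, num_beams=1, steps=2)
beam_seq, beam_score = beam_search(TOY_PROBS, num_beams=2, steps=2)

print(f"Greedy: {greedy_seq}, probability {math.exp(greedy_score):.2f}")
print(f"Beam 2: {beam_seq}, probability {math.exp(beam_score):.2f}")
```

Greedy picks "A" first because it is the most likely first token, then can do no better than a total probability of 0.20. With two beams, "B" survives the first step and leads to a sequence with probability 0.36. The cost is that every step expands several prefixes instead of one.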

---

## 🎯 Student Challenge

### Challenge: Fine-tune on News Articles (CNN/DailyMail)

We trained on dialogue. Now, try adapting the code for news summarization using the famous CNN/DailyMail dataset.

In [None]:
# TODO: Student Solution

# 1. Load the dataset (It's large, so we just take a tiny slice for the challenge)
# cnn_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:100]")

# 2. Inspect the columns - Note: They are different from DialogSum!
# print(cnn_dataset.column_names) 
# Expected: ['article', 'highlights', 'id']

# 3. Create a NEW preprocess function
# def preprocess_news(examples):
#     inputs = [prefix + doc for doc in examples['article']] # Use 'article' column
#     ...
#     labels = tokenizer(examples['highlights'], ...) # Use 'highlights' column
#     return model_inputs

# 4. Apply mapping and Train


---

## Key Takeaways

1. **Seq2Seq models** have separate encoder and decoder components
2. **T5 uses prefixes** ("summarize: ", "translate English to German: ") to specify tasks
3. **ROUGE metrics** measure n-gram overlap between generated and reference summaries
4. **Generation parameters** control output quality and diversity
5. **Beam search** usually produces better results than greedy decoding

---

## Next Steps

Continue to Module 03: **Model Optimization**
- `03_Model_Optimization/01_intro_to_optimization.ipynb`