# Project 3: Fine-Tuning FLAN-T5 for Summarization & Measuring Forgetting

**Authors:** Shaunak Kapur & Pranav Krishnan

This notebook implements the Project 3 proposal: fine-tuning a small language model (`google/flan-t5-small`) on the Amazon Fine Food Reviews dataset to generate product review summaries. It also evaluates "forgetting" by checking the model's performance on a set of general knowledge questions before and after fine-tuning.


## 1. Setup and Installation

Install the required libraries: `transformers`, `datasets`, `evaluate`, `rouge_score`, `accelerate`, `sentencepiece`. `scikit-learn`, used later for the data split, is assumed to be preinstalled, as it is on Colab.


In [None]:
!pip install -q transformers datasets evaluate rouge_score accelerate sentencepiece


In [None]:
import torch
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
import evaluate

# The code below was generated by AI; see [2].
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")


## 2. Load and Preprocess Data

We use the Amazon Fine Food Reviews dataset from Hugging Face. The dataset will be automatically downloaded using `load_dataset`.

We will:
1. Download the dataset from Hugging Face.
2. Convert it to a pandas DataFrame.
3. Keep the `Summary` and `Text` columns and drop rows with missing values.
4. Drop reviews longer than 512 characters to save memory and time.
5. Sample up to 20,000 rows to keep training time reasonable.
6. Split into train (80%), validation (10%), and test (10%) sets.
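
The 80/10/10 split is done in two steps: first 20% of the rows are held out, then that held-out part is halved. The sketch below only does the arithmetic, to show the row counts to expect. `split_sizes` is a hypothetical helper, and its rounding (round the held-out part up) is an assumption that may differ from scikit-learn by a row on sizes that do not divide evenly.

```python
import math

def split_sizes(n_rows, holdout_frac=0.2, test_share=0.5):
    """Approximate row counts for a two-step train/validation/test split."""
    # Step 1: hold out a fraction of all rows (rounded up).
    n_holdout = math.ceil(n_rows * holdout_frac)
    n_train = n_rows - n_holdout
    # Step 2: divide the held-out rows between test and validation.
    n_test = math.ceil(n_holdout * test_share)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(20_000)
print(f"Train: {n_train}, Val: {n_val}, Test: {n_test}")
```

With the full 20,000-row sample this gives 16,000 training rows and 2,000 each for validation and test.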


In [None]:
# Load dataset from Hugging Face
# The code below was generated by AI; see [2].
print("Downloading dataset from Hugging Face...")
ds = load_dataset("jhan21/amazon-food-reviews-dataset")

# Convert to pandas DataFrame (the dataset has a 'train' split)
df = ds["train"].to_pandas()

# Keep relevant columns and drop NaNs
df = df[["Summary", "Text"]].dropna()

# Filter out very long reviews to save memory/time
df = df[df["Text"].str.len() <= 512]

# Sample data for faster training (adjust as needed)
SAMPLE_SIZE = 20000
if len(df) > SAMPLE_SIZE:
    df = df.sample(SAMPLE_SIZE, random_state=42)

print(f"Dataset size: {len(df)}")
df.head()


In [None]:
from sklearn.model_selection import train_test_split

# The code below was generated by AI; see [2].
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds = Dataset.from_pandas(val_df.reset_index(drop=True))
test_ds = Dataset.from_pandas(test_df.reset_index(drop=True))

print(f"Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")


## 3. Model and Tokenizer Setup

We use `google/flan-t5-small`. We load two copies:
1. `base_model`: Keeps original weights to measure baseline performance and forgetting.
2. `model`: Will be fine-tuned.


In [None]:
MODEL_NAME = "google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Model to be fine-tuned
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

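# Base model for comparison; it is never passed to the trainer, so its weights stay unchanged
base_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
base_model.to(device)
base_model.eval()  # inference mode (disables dropout)
print("Models loaded.")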


## 4. Tokenization

We prepend the prefix "Summarize this review: " to each review, then truncate inputs to 256 tokens and target summaries to 32 tokens.


In [None]:
MAX_INPUT_LENGTH = 256
MAX_TARGET_LENGTH = 32
PREFIX = "Summarize this review: "

def preprocess_function(examples):
    inputs = [PREFIX + doc for doc in examples["Text"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    labels = tokenizer(text_target=examples["Summary"], max_length=MAX_TARGET_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_ds.map(preprocess_function, batched=True)
tokenized_val = val_ds.map(preprocess_function, batched=True)
tokenized_test = test_ds.map(preprocess_function, batched=True)
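
`preprocess_function` leaves the label sequences unpadded. Padding happens later, per batch, in `DataCollatorForSeq2Seq`, which by default fills label positions with `-100` so the loss ignores them. The sketch below imitates that behavior on plain Python lists. The token IDs are made up, and `pad_labels` and `restore_pad_ids` are hypothetical helpers, not library functions.

```python
LABEL_PAD_ID = -100  # positions with this value are skipped by the loss

def pad_labels(label_batch, pad_value=LABEL_PAD_ID):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in label_batch)
    return [ids + [pad_value] * (max_len - len(ids)) for ids in label_batch]

def restore_pad_ids(padded_batch, pad_token_id):
    """Swap -100 back to a real pad token ID so the batch can be decoded."""
    return [
        [pad_token_id if tok == LABEL_PAD_ID else tok for tok in ids]
        for ids in padded_batch
    ]

batch = [[5, 9, 1], [7, 1]]            # made-up token IDs; 1 stands in for </s>
padded = pad_labels(batch)
decodable = restore_pad_ids(padded, pad_token_id=0)  # T5 tokenizers use 0 as pad ID
print(padded)
print(decodable)
```

The second step is why the metric code has to replace `-100` before calling `batch_decode`: `-100` is not a valid token ID.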


## 5. Forgetting Analysis (Before Training)

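We define a small set of general knowledge questions to test the "forgetting" hypothesis and check how well the base model answers them. An answer counts as correct when the expected string appears, case-insensitively, as a substring of the model's output.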


In [None]:
qa_pairs = [
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "7"),
    ("What gas do plants absorb?", "carbon dioxide"),
    ("What is the largest planet in our solar system?", "Jupiter"),
    ("What is H2O?", "water"),
    ("Who wrote Romeo and Juliet?", "Shakespeare"),
    ("What color is the sky on a clear day?", "blue"),
    ("What is 2 + 2?", "4")
]

def evaluate_forgetting(model_obj, tokenizer_obj, questions, device):
    model_obj.eval()
    correct = 0
    results = []
    
    print("--- Forgetting Analysis ---")
    for q, ans in questions:
        input_ids = tokenizer_obj("Answer the question: " + q, return_tensors="pt").input_ids.to(device)
        
        with torch.no_grad():
            outputs = model_obj.generate(input_ids, max_length=20)
        
        pred = tokenizer_obj.decode(outputs[0], skip_special_tokens=True)
        is_correct = ans.lower() in pred.lower()
        if is_correct:
            correct += 1
            
        results.append({"Question": q, "Expected": ans, "Predicted": pred, "Correct": is_correct})
        print(f"Q: {q} | Pred: {pred} | Expected: {ans}")
    
    accuracy = correct / len(questions)
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy, results

print("Evaluating Base Model on QA set...")
base_qa_acc, base_qa_results = evaluate_forgetting(base_model, tokenizer, qa_pairs, device)
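
The substring check in `evaluate_forgetting` is lenient in one direction and strict in the other: `"4"` matches an output of `"14"`, while `"7"` does not match `"seven"`. A stricter matcher could look like the sketch below. It is not part of the original evaluation, and the `ALIASES` table is a hypothetical example that would need extending for other questions.

```python
import re

# Hypothetical alias table: alternative spellings accepted for an expected answer.
ALIASES = {
    "7": {"7", "seven"},
    "4": {"4", "four"},
    "carbon dioxide": {"carbon dioxide", "co2"},
}

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())

def is_match(expected, predicted):
    """True if the expected answer (or an alias) appears as whole words."""
    padded_pred = f" {normalize(predicted)} "
    candidates = ALIASES.get(expected.lower(), {expected})
    return any(f" {normalize(c)} " in padded_pred for c in candidates)

print(is_match("4", "14"))               # whole-word check rejects this
print(is_match("7", "Seven days."))      # alias accepts the spelled-out number
```

Swapping this in for `ans.lower() in pred.lower()` would change the reported accuracies, so both models should be scored with the same rule.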


## 6. Fine-Tuning

We use `Seq2SeqTrainer` to fine-tune the model.


In [None]:
rouge = evaluate.load("rouge")

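def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Some models return a tuple; the generated token IDs come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    # -100 marks padded positions; swap it for the pad token so decoding works
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Scale ROUGE scores to percentages before adding gen_len, which is a token count
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = float(np.mean(prediction_lens))
    
    return result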

args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-summarizer",
    eval_strategy="epoch",  # named `evaluation_strategy` in transformers releases before 4.41
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=(device == "cuda"),  # T5 models can overflow in fp16; if the loss turns NaN, try bf16=True on a GPU that supports it
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,  # newer transformers releases call this argument `processing_class`
    compute_metrics=compute_metrics,
)

# The code below was generated by AI; see [2].
trainer.train()
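
To get a feel for how long `trainer.train()` will run, the number of optimizer steps can be estimated from the settings above. This is a rough sketch assuming a single device and no gradient accumulation. `total_training_steps` is a hypothetical helper, and 16,000 is the training-set size implied by the 80% split of 20,000 rows.

```python
import math

def total_training_steps(n_train, batch_size, epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

steps = total_training_steps(n_train=16_000, batch_size=8, epochs=2)
print(f"Total steps: {steps}")
print(f"Log entries at logging_steps=100: {steps // 100}")
```

Under these assumptions the run takes 4,000 steps, so the training loss is logged about 40 times.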


## 7. Evaluation: Summarization Quality

Compute ROUGE scores for the fine-tuned model on the test set, then compare base and fine-tuned summaries on a few qualitative examples.
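
For intuition about what the ROUGE numbers mean, ROUGE-1 is the F1 score over unigrams shared by the prediction and the reference. The sketch below is a simplified version, assuming whitespace tokenization and no stemming, so its values will not match the `rouge_score` package exactly. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap with clipped counts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("great dog food", "great food")
print(f"ROUGE-1 F1: {score:.2f}")
```

Here two of the three predicted words appear in the reference (precision 2/3) and both reference words are covered (recall 1), giving an F1 of 0.8. Review summaries are often only a few words long, so a single word can move the score considerably.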


In [None]:
print("Evaluating on Test Set...")
test_results = trainer.evaluate(tokenized_test)
print(test_results)


In [None]:
# Qualitative Comparison
def generate_summary(model_obj, text, device):
    model_obj.eval()
    inputs = tokenizer(PREFIX + text, return_tensors="pt", max_length=MAX_INPUT_LENGTH, truncation=True).to(device)
    # No gradients are needed for generation; **inputs also passes the attention mask
    with torch.no_grad():
        outputs = model_obj.generate(**inputs, max_length=MAX_TARGET_LENGTH, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

sample_indices = [0, 5, 10, 15, 20]
print("--- Qualitative Results ---\n")

for idx in sample_indices:
    example = test_ds[idx]
    text = example["Text"]
    ref_summary = example["Summary"]
    
    base_summary = generate_summary(base_model, text, device)
    ft_summary = generate_summary(model, text, device)
    
    print(f"Review: {text[:200]}...")
    print(f"Reference: {ref_summary}")
    print(f"Base Model: {base_summary}")
    print(f"Fine-Tuned: {ft_summary}")
    print("-" * 80)


## 8. Forgetting Analysis (After Training)

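Rerun the same QA set on the fine-tuned model and compare its accuracy with the base model's; a drop suggests forgotten general knowledge. With only eight questions, each answer shifts accuracy by 12.5 percentage points, so treat small changes with caution.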


In [None]:
print("Evaluating Fine-Tuned Model on QA set...")
ft_qa_acc, ft_qa_results = evaluate_forgetting(model, tokenizer, qa_pairs, device)

print(f"\nBase Model QA Accuracy: {base_qa_acc:.2%}")
print(f"Fine-Tuned Model QA Accuracy: {ft_qa_acc:.2%}")

diff = ft_qa_acc - base_qa_acc
# The difference between two percentages is reported in percentage points
print(f"Change in Accuracy: {diff * 100:+.2f} percentage points")
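
The overall accuracy change hides which questions were affected. Since `evaluate_forgetting` returns per-question records, the two result lists can be compared directly. The sketch below assumes records with the keys used above (`Question`, `Predicted`, `Correct`). `find_regressions` is a hypothetical helper, and the sample records are made up for illustration, not actual model outputs.

```python
def find_regressions(before, after):
    """Questions answered correctly before fine-tuning but incorrectly after."""
    after_by_question = {rec["Question"]: rec for rec in after}
    regressions = []
    for rec in before:
        later = after_by_question.get(rec["Question"])
        if later is not None and rec["Correct"] and not later["Correct"]:
            regressions.append({
                "Question": rec["Question"],
                "Before": rec["Predicted"],
                "After": later["Predicted"],
            })
    return regressions

# Made-up records for illustration only
before = [
    {"Question": "What is H2O?", "Predicted": "water", "Correct": True},
    {"Question": "What is 2 + 2?", "Predicted": "4", "Correct": True},
]
after = [
    {"Question": "What is H2O?", "Predicted": "Great product", "Correct": False},
    {"Question": "What is 2 + 2?", "Predicted": "4", "Correct": True},
]
for item in find_regressions(before, after):
    print(item)
```

In the notebook, the call would be `find_regressions(base_qa_results, ft_qa_results)`.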


## 9. Save Model

Save the fine-tuned model and tokenizer to a local directory so they can be downloaded.


In [None]:
trainer.save_model("./finetuned_summarizer_final")
tokenizer.save_pretrained("./finetuned_summarizer_final")

print("Model saved to ./finetuned_summarizer_final")
# To download from Colab:
# from google.colab import files
# !zip -r model.zip ./finetuned_summarizer_final
# files.download('model.zip')

