# Exercise: Fine-Tune and Evaluate a Q&A Model

---

## Instructions

Complete all **TODO** sections marked in the code. They appear in three places:
- `exercise_utils_starter.py`: the tokenization function
- `exercise_utils_starter.py`: the metrics functions
- This notebook: the training configuration

Follow the hints provided in comments!

---

In [None]:
# Import helper utilities (contains TODOs)
import exercise_utils_starter as utils

---
# Fine-Tune DistilBERT

## Load Dataset

In [None]:
print("📥 Loading SQuAD 2.0...")
dataset = load_dataset("squad_v2")

print("\n🔄 Creating subsets...")
train_dataset = create_squad_subset(dataset['train'], n_samples=1000, seed=SEED)
test_dataset = create_squad_subset(dataset['validation'], n_samples=200, seed=SEED)

print(f"\n✅ Ready: {len(train_dataset)} train, {len(test_dataset)} test")

example = train_dataset[0]
print(f"\nExample Q: {example['question']}")
print(f"Answer: {example['answers']['text'][0]}")

## Load Model

In [None]:
# Load the tokenizer and a fresh question-answering model for fine-tuning.
# MODEL_NAME is assumed to be defined during setup; it is the same checkpoint
# that the baseline evaluation loads later in this notebook.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
print(f"✓ Loaded tokenizer and model: {MODEL_NAME}")

## Tokenize Dataset

**Note:** This step uses `prepare_train_features()`, which contains **TODOs**. Make sure you've completed those first!
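
To make the core idea behind the tokenization TODOs concrete, here is a minimal sketch of mapping an answer's character span onto token indices. It assumes a toy whitespace tokenizer; `whitespace_offsets` and `char_span_to_token_span` are hypothetical names used only for illustration and are not the functions in `exercise_utils_starter.py`, which work with the real tokenizer's offset mapping.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    Returns None when the span is not fully covered by the given tokens,
    for example when the answer was cut off by truncation.
    """
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start <= start_char < tok_end:
            start_token = i
        if tok_start < end_char <= tok_end:
            end_token = i
    if start_token is None or end_token is None:
        return None
    return (start_token, end_token)


context = "DistilBERT was distilled from BERT by Hugging Face"
answer = "Hugging Face"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span)  # indices of the first and last answer tokens
```

With a real tokenizer, windows that do not contain the answer are conventionally labeled with the index of the `[CLS]` token instead of `None`.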

In [None]:
print("🔧 Tokenizing...")

tokenized_train = train_dataset.map(
    lambda x: prepare_train_features(x, tokenizer),
    batched=True,
    remove_columns=train_dataset.column_names,
)

tokenized_test = test_dataset.map(
    lambda x: prepare_validation_features(x, tokenizer),
    batched=True,
    remove_columns=test_dataset.column_names,
)

print(f"✅ Done: {len(tokenized_train)} train, {len(tokenized_test)} test samples")

# Verify tokenization
print("\n🔍 Verifying...")
valid_answers = sum(1 for s in tokenized_train if s['start_positions'] > 0)
print(f"   Valid answers: {valid_answers}/{len(tokenized_train)}")

if valid_answers == 0:
    print("\n⚠️  WARNING: No valid answers found!")
    print("   Check TODOs in exercise_utils_starter.py")
elif valid_answers < len(tokenized_train) * 0.8:
    print(f"\n⚠️  Only {valid_answers/len(tokenized_train)*100:.0f}% valid")
    print("   Expected: >80%. Review tokenization logic.")
else:
    print(f"   ✅ Good! {valid_answers/len(tokenized_train)*100:.0f}% have valid positions")

## Configure Training

### TODO: Set the learning rate
**Hint:** Use a small learning rate appropriate for fine-tuning
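
As a rough illustration of how the step count and the learning rate interact, the sketch below computes the number of optimizer steps and a linearly decaying learning rate. It assumes a single device, no gradient accumulation, no warmup, and the linear schedule that `Trainer` uses by default; the figure of 1,000 tokenized features is hypothetical, and `steps_per_run` and `linear_decay_lr` are illustrative helpers, not part of any library.

```python
import math


def steps_per_run(num_features, batch_size, epochs):
    """Optimizer steps when the last, partial batch of each epoch is kept."""
    return math.ceil(num_features / batch_size) * epochs


def linear_decay_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total = steps_per_run(num_features=1000, batch_size=16, epochs=3)
floor_estimate = 1000 // 16 * 3  # floor-based estimate, slightly lower

print(f"Exact steps: {total}, floor estimate: {floor_estimate}")
for step in (0, total // 2, total):
    print(f"step {step:3d}: lr = {linear_decay_lr(step, total, 2e-5):.2e}")
```

The floor-based estimate is close enough for choosing a logging interval, which is all the notebook uses it for.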

In [None]:
total_steps = len(tokenized_train) // 16 * 3
logging_steps = max(10, total_steps // 20)

# TODO: Set learning_rate to 2e-5
learning_rate = None  # TODO

training_args = TrainingArguments(
    output_dir="./finetuned-distilbert-qa",
    learning_rate=learning_rate,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="no",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=logging_steps,
    logging_first_step=True,
    seed=SEED,
    save_total_limit=2,
    report_to="none",
)

print(f"⚙️  LR={training_args.learning_rate}, Batch=16, Epochs=3")
print(f"   Logging every {logging_steps} steps")

## Fine-Tune

In [None]:
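trainer = Trainer(  # `model` is the QA model from the Load Model section
    model=model, args=training_args, train_dataset=tokenized_train,
    eval_dataset=tokenized_test, tokenizer=tokenizer, data_collator=default_data_collator,
)
print("🚀 Fine-tuning...\n")
train_result = trainer.train()
print(f"\n✅ Done in {train_result.metrics['train_runtime']:.1f}s")
print(f"   Final loss: {train_result.metrics.get('train_loss', 'N/A')}")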

## Save Model

In [None]:
output_dir = "./finetuned-distilbert-qa"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ Saved to {output_dir}/")

---
# Evaluation

**Note:** This section uses `compute_exact_match()` and `compute_f1_score()`, which contain **TODOs**. Complete them before running the cells below.
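
For orientation, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction and reference pair. The helper names (`normalize`, `exact_match`, `token_f1`) are hypothetical, and the signatures in `exercise_utils_starter.py` may differ; for instance, the notebook passes dictionaries keyed by example id, and with several reference answers the usual practice is to take the best score across them.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match (a correct "no answer" prediction).
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Eiffel Tower in Paris", "the Eiffel Tower"), 3))
```

Exact match is all-or-nothing, while F1 gives partial credit when the prediction overlaps the reference but includes extra or missing tokens.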

## Baseline
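
The evaluation cells turn the model's start and end logits into answer text via `postprocess_qa_predictions()`, whose implementation is not shown here. The sketch below illustrates one common way to pick a span from the logits, assuming the helper scores each candidate by the sum of its start and end logits; `best_span` is a hypothetical name, and a real implementation would also map token indices back to the context text and handle the no-answer case.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when end >= start and it covers at most
    max_answer_len tokens. The score is start_logit + end_logit.
    """
    best = None
    for start, start_logit in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 0.1, 0.5, 4.0, 0.2]

print(best_span(start_logits, end_logits))
print(best_span(start_logits, end_logits, max_answer_len=1))
```

The exhaustive loop keeps the sketch short; production code usually restricts the search to the top few start and end candidates.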

In [None]:
print("🔍 Baseline evaluation...")

baseline_model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
baseline_trainer = Trainer(model=baseline_model, tokenizer=tokenizer)

baseline_preds = baseline_trainer.predict(tokenized_test)
baseline_answers = postprocess_qa_predictions(
    test_dataset, tokenized_test,
    (baseline_preds.predictions[0], baseline_preds.predictions[1])
)

references = {ex['id']: ex['answers']['text'] for ex in test_dataset}

# These metric helpers contain TODOs (see exercise_utils_starter.py)
baseline_em = compute_exact_match(baseline_answers, references)
baseline_f1 = compute_f1_score(baseline_answers, references)

print(f"\nBaseline: EM={baseline_em:.1f}%, F1={baseline_f1:.1f}%")

## Fine-Tuned

In [None]:
print("🔍 Fine-tuned evaluation...")

finetuned_preds = trainer.predict(tokenized_test)
finetuned_answers = postprocess_qa_predictions(
    test_dataset, tokenized_test,
    (finetuned_preds.predictions[0], finetuned_preds.predictions[1])
)

finetuned_em = compute_exact_match(finetuned_answers, references)
finetuned_f1 = compute_f1_score(finetuned_answers, references)

print(f"\nFine-tuned: EM={finetuned_em:.1f}%, F1={finetuned_f1:.1f}%")

## Comparison

In [None]:
baseline_metrics = {'EM': baseline_em, 'F1': baseline_f1}
finetuned_metrics = {'EM': finetuned_em, 'F1': finetuned_f1}

print_metrics_summary(baseline_metrics, finetuned_metrics)

df = pd.DataFrame({
    'Baseline': baseline_metrics,
    'Fine-Tuned': finetuned_metrics
}).T
df['EM Δ'] = df['EM'] - baseline_em
df['F1 Δ'] = df['F1'] - baseline_f1

display(df)

## Visualization

In [None]:
plot_metrics_comparison({
    'Baseline': baseline_metrics,
    'Fine-Tuned': finetuned_metrics
})
plt.show()

print(f"\n💡 F1 change after fine-tuning: {finetuned_f1 - baseline_f1:+.1f} points")

---
# Learning Rate Experiment

## Train with Higher LR

### TODO: Set the higher learning rate
**Hint:** Use a higher learning rate for comparison
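
Before running the experiment, it can help to see how step size affects plain gradient descent. The sketch below is only a toy 1-D problem, minimizing f(w) = w², and not the QA model; `gradient_descent` and the learning rates used are hypothetical and chosen purely to show slow convergence, fast convergence, and divergence.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2 * w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


for lr in (0.05, 0.4, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

A larger step reaches the minimum in fewer updates up to a point, after which each update overshoots and the iterate moves away from the minimum.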

In [None]:
print("🔬 Training with higher LR...\n")

model_high_lr = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)

total_steps_high = len(tokenized_train) // 16 * 2
logging_steps_high = max(10, total_steps_high // 15)

# TODO: Set higher_learning_rate to 5e-5
higher_learning_rate = None  # TODO

training_args_high = TrainingArguments(
    output_dir="./finetuned-high-lr",
    learning_rate=higher_learning_rate,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    eval_strategy="no",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=logging_steps_high,
    logging_first_step=True,
    seed=SEED,
    save_total_limit=2,
    report_to="none",
)

trainer_high = Trainer(
    model=model_high_lr,
    args=training_args_high,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result_high = trainer_high.train()
print(f"\n✅ Done in {train_result_high.metrics['train_runtime']:.1f}s")

## Evaluate

In [None]:
preds_high = trainer_high.predict(tokenized_test)
answers_high = postprocess_qa_predictions(
    test_dataset, tokenized_test,
    (preds_high.predictions[0], preds_high.predictions[1])
)

em_high = compute_exact_match(answers_high, references)
f1_high = compute_f1_score(answers_high, references)

print(f"High LR: EM={em_high:.1f}%, F1={f1_high:.1f}%")

## Comparison

In [None]:
comparison = pd.DataFrame({
    'LR': ['2e-5', '5e-5'],
    'Epochs': [3, 2],
    'F1 (%)': [finetuned_f1, f1_high],
    'EM (%)': [finetuned_em, em_high],
    'Time (s)': [
        train_result.metrics['train_runtime'],
        train_result_high.metrics['train_runtime']
    ]
})

display(comparison)

winner = '2e-5' if finetuned_f1 > f1_high else '5e-5'
print(f"\n🏆 Winner: LR={winner}")

## Visualization

In [None]:
plot_learning_rate_comparison([
    {'lr': '2e-5', 'epochs': 3, 'em': finetuned_em, 'f1': finetuned_f1},
    {'lr': '5e-5', 'epochs': 2, 'em': em_high, 'f1': f1_high}
])
plt.show()

---
# Reflection Questions

### TODO: Answer these questions based on your results

**Q1: Which learning rate performed better?**

*Your answer here:*

---

**Q2: Why might a lower learning rate be preferred for fine-tuning? (Hint: catastrophic forgetting)**

*Your answer here:*

---

**Q3: What trade-off do you notice between learning rate and training time? (Note: the two runs also differ in the number of epochs, 3 vs. 2.)**

*Your answer here:*

---