# 03 — Baseline Fine‑Tuning (BART‑base on CNN/DailyMail)

**Project:** Reducing Hallucinations via Verifier‑Reranking  
**Dataset:** `ccdv/cnn_dailymail` (config `3.0.0`)  
**Model:** `facebook/bart-base`

This notebook fine‑tunes a BART‑base summarizer and establishes the baseline metrics we will later compare to reranking.
Each step is structured as **What/Why → Code (commented) → How to read results**. 


### Reference Context
This notebook reproduces the BART fine-tuning recipe from Lewis et al. (2020) on the CNN/DailyMail summarization benchmark (Hermann et al., 2015; See et al., 2017). The model minimizes the conditional log-loss
\[
\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x),
\]
using teacher forcing, where \(x\) denotes the article and \(y_t\) the reference highlight tokens. Metrics reported later (ROUGE, BERTScore) follow the evaluation protocol from these works.
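
As a small illustration of the loss above, the sketch below sums the negative log-probabilities assigned to each reference token. The probabilities are made-up values rather than model outputs, and `sequence_nll` is a hypothetical helper name. Hugging Face seq2seq models report the mean over target tokens rather than the sum, so both are shown.

```python
import math

def sequence_nll(gold_token_probs):
    """Return -sum_t log p(y_t | y_<t, x) for one reference summary."""
    return -sum(math.log(p) for p in gold_token_probs)

# Hypothetical probabilities p(y_t | y_<t, x) for a 3-token reference.
gold_token_probs = [0.5, 0.25, 0.8]

total_nll = sequence_nll(gold_token_probs)    # summed loss, as in the formula
mean_nll = total_nll / len(gold_token_probs)  # per-token average
print(f"sum = {total_nll:.4f}, mean = {mean_nll:.4f}")
```

Because the three probabilities multiply to 0.1, the summed loss equals log 10.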



## What / Why
**What:** Install exact library versions and import dependencies.  
**Why:** Reproducibility requires **pinned versions** so runs on Colab or locally behave the same. We'll also set the global random seed to make results comparable.


In [None]:

# Install pinned versions for reproducibility
import sys, subprocess

def pip_install(requirements):
    # Install packages with specific versions
    cmd = [sys.executable, "-m", "pip", "install", "-q"] + requirements
    print("Installing:", " ".join(requirements))
    subprocess.run(cmd, check=True)

REQ = [
    "transformers==4.41.2",
    "datasets==2.19.1",
    "evaluate==0.4.2",
    "rouge-score==0.1.2",
    "bert-score==0.3.13",
    "accelerate==0.30.1",
    "sentencepiece==0.1.99",
    "sacrebleu==2.4.0",
]
pip_install(REQ)

# Imports
import os, math, random, json
from pathlib import Path
import numpy as np
import torch

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          set_seed)
import evaluate

# Seed everything for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
set_seed(SEED)

print("Python:", sys.version)
print("Torch:", torch.__version__, "| CUDA:", torch.version.cuda if torch.cuda.is_available() else "CPU")
print("Device:", "GPU" if torch.cuda.is_available() else "CPU")


**How to read results:**  
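If the cell installed packages without errors and printed versions and device info, you're ready to continue. The *seed* is set, so repeated runs with the same data and hyper‑parameters are comparable. Some GPU kernels are nondeterministic, so small run-to-run differences can remain even with a fixed seed.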
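
To illustrate what seeding provides, here is a minimal standard-library sketch: two generators built from the same seed produce the same draws. `sample_indices` is a hypothetical helper and is not part of this notebook's pipeline.

```python
import random

def sample_indices(seed, n_items=100, k=5):
    """Draw k distinct indices from range(n_items) using a private, seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k)

run_a = sample_indices(42)
run_b = sample_indices(42)  # same seed, so the same indices are drawn
print(run_a == run_b)
```

Using a private `random.Random` instance keeps the demonstration from disturbing the global seed set above.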


## What / Why
**What:** Define one **central config** (dataset/model IDs, max lengths, training hyper‑parameters, decoding params).  
**Why:** Keeping *all knobs in one dictionary* avoids buried magic numbers and lets graders reproduce settings precisely. We'll also save this to disk as `configs/baseline.json`.


In [None]:

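from datetime import datetime, timezone

cfg = {
    "project": "verifier_reranking_cnndm",
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    "timestamp_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),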
    "seed": 42,
    "dataset": {"hf_id": "ccdv/cnn_dailymail", "config": "3.0.0", "split_train": "train", "split_val": "validation", "split_test": "test"},
    "text_fields": {"source": "article", "summary": "highlights"},
    "model": {"hf_id": "facebook/bart-base"},
    "tokenization": {"max_source_len": 512, "max_target_len": 128, "truncation": True},
    "training": {
        "output_dir": "checkpoints/bart-base-cnndm",
        "num_train_epochs": 3,
        "per_device_train_batch_size": 8,
        "per_device_eval_batch_size": 8,
        "gradient_accumulation_steps": 2,
        "learning_rate": 5e-5,
        "weight_decay": 0.01,
        "warmup_ratio": 0.03,
        "lr_scheduler_type": "linear",
        "logging_steps": 100,
        "evaluation_strategy": "steps",
        "eval_steps": 1000,
        "save_strategy": "steps",
        "save_steps": 1000,
        "predict_with_generate": True,
        "generation_max_length": 128,
        "generation_num_beams": 4,
        "report_to": "none"
    },
    "decoding": {"num_beams": 4, "length_penalty": 1.0},
    "notes": "Baseline BART-base on CNN/DailyMail v3.0.0 with beam search decoding."
}

Path("configs").mkdir(parents=True, exist_ok=True)
with open("configs/baseline.json", "w") as f:
    json.dump(cfg, f, indent=2)

print(json.dumps(cfg, indent=2))


**How to read results:**  
You should see the JSON config printed. The `output_dir` is where checkpoints and logs will be written. You can change `num_train_epochs` for longer training (e.g., 5) if you have more time/GPU.
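
The training values in the config imply a step budget that is worth working out before launching. The sketch below is a rough estimate under two assumptions that are not stated in the config: a single GPU, and a train split of 287,113 examples. The step counting approximates how the Trainer rounds, so treat the numbers as estimates.

```python
import math

# Values mirrored from cfg["training"].
training = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "warmup_ratio": 0.03,
    "eval_steps": 1000,
}
N_TRAIN = 287_113  # assumed size of the CNN/DailyMail 3.0.0 train split
N_DEVICES = 1      # assumed single GPU

per_step = training["per_device_train_batch_size"] * N_DEVICES
effective_batch = per_step * training["gradient_accumulation_steps"]
batches_per_epoch = math.ceil(N_TRAIN / per_step)
steps_per_epoch = batches_per_epoch // training["gradient_accumulation_steps"]
total_steps = steps_per_epoch * training["num_train_epochs"]
warmup_steps = math.ceil(total_steps * training["warmup_ratio"])
n_evals = total_steps // training["eval_steps"]

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} (warmup: {warmup_steps})")
print(f"periodic evaluations: {n_evals}")
```

Under these assumptions the run takes roughly 54k optimizer steps with about 50 intermediate evaluations, each of which generates summaries for the whole validation split.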


## What / Why
**What:** Load the CNN/DailyMail dataset and show **three redacted examples**.  
**Why:** Sanity‑check the field names before tokenizing. Digits and URLs are lightly redacted in the printed samples only, so the notebook output does not display them; the data used for training is unchanged. We keep only `article` and `highlights` (the reference summary).


In [None]:

raw = load_dataset(cfg["dataset"]["hf_id"], cfg["dataset"]["config"])
print(raw)

import re
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
DIGIT_RE = re.compile(r"\d")

def redact(text: str) -> str:
    t = URL_RE.sub("[URL]", text)
    t = DIGIT_RE.sub("#", t)
    return t

def show_samples(ds, n=3):
    for i in range(n):
        ex = ds[i]
        src = redact(ex[cfg["text_fields"]["source"]])[:600].strip()
        ref = redact(ex[cfg["text_fields"]["summary"]])[:300].strip()
        print(f"--- Example {i} ---")
        print("SOURCE:\n", src, "\n")
        print("REFERENCE SUMMARY:\n", ref, "\n")

print("\nValidation samples (redacted):")
show_samples(raw[cfg["dataset"]["split_val"]], n=3)


**How to read results:**  
You should see three validation examples with URLs replaced and digits masked. This verifies field names: `article` → source, `highlights` → reference summary.
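
For a concrete view of what the redaction does, the sketch below applies the same two patterns to a made-up sentence (the phone number and URL are invented for illustration).

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
DIGIT_RE = re.compile(r"\d")

def redact(text: str) -> str:
    """Replace URLs with [URL], then mask every remaining digit with '#'."""
    return DIGIT_RE.sub("#", URL_RE.sub("[URL]", text))

sample = "Call 555-0100 or visit https://example.com/page2 by May 3."
redacted = redact(sample)
print(redacted)  # Call ###-#### or visit [URL] by May #.
```

Since the URL pattern matches up to the next whitespace, punctuation attached to the end of a URL is swallowed into `[URL]` as well.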


## What / Why
**What:** Tokenize and prepare dataset for sequence‑to‑sequence training.  
**Why:** Inputs must fit the configured length limits, and the tokenized summaries must be stored under `labels` so the Trainer can compute the loss.


In [None]:

tok = AutoTokenizer.from_pretrained(cfg["model"]["hf_id"], use_fast=True)
model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model"]["hf_id"])

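def preprocess(batch):
    """Tokenize articles as inputs and highlights as labels (truncated, unpadded)."""
    sources = batch[cfg["text_fields"]["source"]]
    targets = batch[cfg["text_fields"]["summary"]]
    # No padding here: DataCollatorForSeq2Seq pads each batch dynamically and fills
    # label padding with -100, so padded positions are excluded from the loss.
    # padding="max_length" would leave pad_token_id in the labels and train on it.
    model_inputs = tok(
        sources,
        max_length=cfg["tokenization"]["max_source_len"],
        truncation=cfg["tokenization"]["truncation"],
    )
    # text_target replaces the deprecated tok.as_target_tokenizer() context manager
    labels = tok(
        text_target=targets,
        max_length=cfg["tokenization"]["max_target_len"],
        truncation=cfg["tokenization"]["truncation"],
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs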

def prepare_split(split_name):
    """Tokenize one split and drop all original columns (text and ids)."""
    ds = raw[split_name]
    return ds.map(preprocess, batched=True, remove_columns=ds.column_names)

proc_train = prepare_split(cfg["dataset"]["split_train"])
proc_val = prepare_split(cfg["dataset"]["split_val"])
proc_test = prepare_split(cfg["dataset"]["split_test"])

print(proc_train, proc_val, proc_test, sep="\n")


**How to read results:**  
The printed dataset objects should now list only the token-id columns (`input_ids`, `attention_mask`, `labels`). That means they are ready for the Trainer.
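
Label padding matters for the loss: positions set to -100 are skipped when the loss is computed. The plain-Python sketch below illustrates padding variable-length label sequences with -100 up to a multiple of 8. It mimics the idea behind `DataCollatorForSeq2Seq` rather than reproducing its implementation; `pad_labels` and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def pad_labels(batch, multiple_of=8, pad_value=IGNORE_INDEX):
    """Right-pad each label list to the batch max length, rounded up to a multiple."""
    longest = max(len(seq) for seq in batch)
    target_len = -(-longest // multiple_of) * multiple_of  # ceiling to the multiple
    return [seq + [pad_value] * (target_len - len(seq)) for seq in batch]

labels = [[0, 713, 16, 2], [0, 42, 2]]  # hypothetical token ids
padded = pad_labels(labels)
for row in padded:
    print(row)
```

Both rows come out with length 8, and only the original token positions would contribute to the loss.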


## What / Why
**What:** Define collator and ROUGE metric function for evaluation.  
**Why:** The **data collator** pads each batch to a multiple of 8 tokens, which helps tensor-core throughput under mixed precision, and pads labels with -100 so those positions are ignored by the loss. ROUGE gives a baseline string‑overlap view of summary quality; we'll compute BERTScore only once after training to save time.


In [None]:

collator = DataCollatorForSeq2Seq(tokenizer=tok, model=model, pad_to_multiple_of=8)

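rouge = evaluate.load("rouge")

def compute_rouge(eval_preds):
    """Decode generated ids and labels, then score them with ROUGE."""
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # -100 marks padding (in labels, and in predictions gathered across batches).
    # It is not a valid token id, so swap in the pad id before decoding.
    preds = np.where(preds != -100, preds, tok.pad_token_id)
    labels = np.where(labels != -100, labels, tok.pad_token_id)
    preds_text = tok.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    refs_text = tok.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    result = rouge.compute(predictions=preds_text, references=refs_text, use_stemmer=True)
    return {k: round(v, 4) for k, v in result.items()}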


**How to read results:**  
Nothing is printed here; this cell only defines the collator and the metric function. ROUGE scores appear during and after evaluation.
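
To make the "string-overlap" idea concrete, here is a deliberately simplified ROUGE-1 F1 sketch. It assumes lowercased whitespace tokenization with no stemming, so its scores will not match the `rouge` metric loaded above; `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a single reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on a mat")
print(f"{score:.4f}")
```

Four of the six unigrams on each side match (the second "the" is clipped), so precision, recall, and F1 are all 4/6.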


## What / Why
**What:** Configure `Seq2SeqTrainer` and launch fine‑tuning.  
**Why:** Trainer handles batching, gradient accumulation, mixed precision, checkpointing, and evaluation for us, with the training arguments taken from our saved config.


In [None]:

train_args = Seq2SeqTrainingArguments(
    output_dir=cfg["training"]["output_dir"],
    num_train_epochs=cfg["training"]["num_train_epochs"],
    per_device_train_batch_size=cfg["training"]["per_device_train_batch_size"],
    per_device_eval_batch_size=cfg["training"]["per_device_eval_batch_size"],
    gradient_accumulation_steps=cfg["training"]["gradient_accumulation_steps"],
    learning_rate=cfg["training"]["learning_rate"],
    weight_decay=cfg["training"]["weight_decay"],
    warmup_ratio=cfg["training"]["warmup_ratio"],
    lr_scheduler_type=cfg["training"]["lr_scheduler_type"],
    logging_steps=cfg["training"]["logging_steps"],
    evaluation_strategy=cfg["training"]["evaluation_strategy"],
    eval_steps=cfg["training"]["eval_steps"],
    save_strategy=cfg["training"]["save_strategy"],
    save_steps=cfg["training"]["save_steps"],
    predict_with_generate=cfg["training"]["predict_with_generate"],
    generation_max_length=cfg["training"]["generation_max_length"],
    generation_num_beams=cfg["training"]["generation_num_beams"],
    report_to=cfg["training"]["report_to"],
    fp16=torch.cuda.is_available(),
    dataloader_num_workers=2,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=train_args,
    train_dataset=proc_train,
    eval_dataset=proc_val,
    tokenizer=tok,
    data_collator=collator,
    compute_metrics=compute_rouge,
)

train_output = trainer.train()
print(train_output)


**How to read results:**  
Watch the logs: training loss is logged every 100 steps, and ROUGE on the validation split is reported every 1,000 steps. Checkpoints are written under `checkpoints/bart-base-cnndm` as `checkpoint-<step>` subdirectories. For a quick smoke test, set `num_train_epochs` to 1 in `cfg` first, then increase it later.
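
After training, it is often handy to locate the most recent checkpoint directory. The sketch below is one way to do that with the standard library, assuming the `checkpoint-<step>` naming the Trainer uses; `latest_checkpoint` is a hypothetical helper, and the demo runs on a throwaway directory with made-up step numbers.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demo on a temporary directory; real runs would pass cfg["training"]["output_dir"].
with tempfile.TemporaryDirectory() as tmp:
    for step in (1000, 2000, 10000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    found = latest_checkpoint(tmp).name
print(found)
```

Steps are compared as integers, so `checkpoint-10000` ranks above `checkpoint-2000` even though it sorts earlier as a string. `transformers.trainer_utils.get_last_checkpoint` offers similar functionality.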


## What / Why
**What:** Evaluate on validation and test with **beam search decoding** and compute **ROUGE + BERTScore**.  
**Why:** These are our baseline quality metrics to compare against reranking later. We report ROUGE‑1/2/L and BERTScore (F1).


In [None]:

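import numpy as np

def summarize_split(dataset, split_name: str, out_jsonl: str):
    # Beam-search settings come from cfg["decoding"]; the length cap is stored
    # under cfg["training"], so read it from there.
    preds = trainer.predict(
        dataset,
        max_length=cfg["training"].get("generation_max_length", 128),
        num_beams=cfg["decoding"]["num_beams"],
        length_penalty=cfg["decoding"]["length_penalty"],
    )
    def decode(seqs):
        return [tok.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in seqs]
    # -100 marks padding and is not a valid token id; replace it before decoding.
    pred_ids = np.where(preds.predictions != -100, preds.predictions, tok.pad_token_id)
    pred_texts = decode(pred_ids)
    labels = np.where(preds.label_ids != -100, preds.label_ids, tok.pad_token_id)
    ref_texts = decode(labels)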

    rouge_scores = rouge.compute(predictions=pred_texts, references=ref_texts, use_stemmer=True)
    rouge_scores = {k: round(v, 4) for k, v in rouge_scores.items()}
    print(f"[{split_name}] ROUGE:", rouge_scores)

    bertscore = evaluate.load("bertscore")
    bs = bertscore.compute(predictions=pred_texts, references=ref_texts, lang="en")
    bert_f1 = float(np.mean(bs["f1"]))
    print(f"[{split_name}] BERTScore F1: {bert_f1:.4f}")

    Path("outputs").mkdir(exist_ok=True, parents=True)
    out_path = Path("outputs") / out_jsonl
    with out_path.open("w", encoding="utf-8") as f:
        for hyp, ref in zip(pred_texts, ref_texts):
            f.write(json.dumps({"prediction": hyp, "reference": ref}) + "\n")
    print(f"Saved {len(pred_texts)} predictions to {out_path}")

    return {"rouge": rouge_scores, "bertscore_f1": bert_f1}

val_metrics = summarize_split(proc_val, "val", "baseline_val_predictions.jsonl")
test_metrics = summarize_split(proc_test.select(range(2000)), "test_subset", "baseline_test_subset_predictions.jsonl")
print("VAL:", val_metrics)
print("TEST (subset):", test_metrics)


**How to read results:**  
You'll see ROUGE and BERTScore for the validation split and for a 2,000-example test *subset* (evaluating the full test split takes longer; the subset saves time). These numbers are your **baseline**. The predictions are saved as JSONL files in `outputs/` for downstream use.
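
For downstream steps, the saved files can be read back one JSON record per line. The sketch below assumes the `{"prediction": ..., "reference": ...}` format written by `summarize_split`; `load_predictions` is a hypothetical helper, and the demo writes made-up records to a temporary file rather than touching `outputs/`.

```python
import json
import tempfile
from pathlib import Path

def load_predictions(path):
    """Read one {"prediction": ..., "reference": ...} record per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with made-up records in the same format as the saved predictions.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_predictions.jsonl"
    rows = [
        {"prediction": "A short summary.", "reference": "A reference summary."},
        {"prediction": "Another one.", "reference": "Another reference."},
    ]
    demo_path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    records = load_predictions(demo_path)

print(len(records), records[0]["prediction"])
```

In the notebook, the same helper could be pointed at `outputs/baseline_val_predictions.jsonl`.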
