## Background & Rationale

### Why BART?

We use **BART-base** (Lewis et al., 2020) as the baseline model for several reasons:

1. **Architecture Fit**: BART is an encoder-decoder transformer pre-trained with a denoising objective, which makes it well suited to sequence-to-sequence tasks like summarization.

2. **Established Baseline**: BART achieves strong performance on CNN/DailyMail (Lewis et al., 2020 report ROUGE-L ~40 for BART-large). Using BART-base provides a fair baseline within the GPU compute constraints.

3. **Reproducibility**: The `facebook/bart-base` checkpoint is publicly available via Hugging Face, ensuring reproducibility.

4. **Compatibility**: BART's autoregressive decoder can produce multiple candidate summaries per article via beam search, which is essential for the reranking approach.

**Alternatives Considered**: PEGASUS (Zhang et al., 2020) achieves higher ROUGE on CNN/DailyMail but requires more compute. T5 (Raffel et al., 2019) is another option, but BART's denoising pre-training is more aligned with summarization.
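Point 4 can be illustrated with a toy beam search that returns several ranked candidates. This is a sketch over a hand-written next-token table (`TRANSITIONS`, invented for illustration), not BART's decoder, and it omits the length normalization that real implementations usually apply. With Hugging Face models, multiple candidates are typically requested through `generate(...)` with the `num_beams` and `num_return_sequences` arguments.

```python
import math

# Hypothetical toy model: P(next token | last token). Not BART.
TRANSITIONS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, "</s>": 0.4},
    "b": {"a": 0.3, "b": 0.1, "</s>": 0.6},
}

def beam_search(num_beams, num_return, max_len=4):
    """Return up to num_return finished sequences, ranked by log-probability."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-prob, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        expansions = []
        for score, tokens in beams:
            for token, prob in TRANSITIONS[tokens[-1]].items():
                expansions.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams best expansions.
        expansions.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, tokens in expansions[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))  # complete hypothesis
            else:
                beams.append((score, tokens))     # still being extended
        if not beams:
            break
    finished.sort(key=lambda item: item[0], reverse=True)
    return finished[:num_return]

candidates = beam_search(num_beams=3, num_return=3)
for score, tokens in candidates:
    print(f"{math.exp(score):.3f}  {' '.join(tokens)}")
```

Each returned candidate carries its own score, which is what a downstream reranker would replace or combine with verifier scores.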

### Training Objective

We fine-tune BART using the standard **autoregressive cross-entropy loss**:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, x)$$

Where:
- $x$ = source article (encoder input)
- $y$ = target summary (decoder output)
- $\theta$ = model parameters
- $T$ = summary length

This objective teaches the model to predict each summary token given the article and all previous summary tokens.
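As a rough illustration of the objective above, the following sketch computes the summed negative log-likelihood for a toy three-token target. The probability table and the function name `sequence_nll` are made up for illustration. In practice the Hugging Face model computes this loss internally from its logits, and by default it averages over non-padding target tokens rather than summing.

```python
import math

def sequence_nll(step_probs, target_ids):
    """Return -sum_t log P(y_t | y_<t, x) for one target sequence.

    step_probs[t] is the model's distribution over the vocabulary at step t,
    already conditioned on the source x and the previous target tokens.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))

# Toy example (hypothetical numbers): 3 decoding steps, vocabulary of 4 tokens.
step_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
target_ids = [0, 1, 3]  # index of the correct token at each step

loss_sum = sequence_nll(step_probs, target_ids)  # the summed form of the equation
loss_mean = loss_sum / len(target_ids)           # per-token average
print(f"sum = {loss_sum:.4f}, mean = {loss_mean:.4f}")
```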


In [None]:
import os
import json
import random
import pathlib
import numpy as np
import torch
from datasets import load_dataset, load_from_disk
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed
)

!pip install evaluate
import evaluate
import nltk
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Load Central Config
BASE_DIR = "/content/drive/MyDrive/w266_project_final"
CONFIGS_DIR = os.path.join(BASE_DIR, "configs")
CONFIG_PATH = os.path.join(CONFIGS_DIR, "baseline.json")

with open(CONFIG_PATH, 'r') as f:
    cfg = json.load(f)
print(f"Loaded config from: {CONFIG_PATH}")

# Set Seed (Guardrail #5)
SEED = cfg['seed']
set_seed(SEED)
print(f"Global random seed set to {SEED}.")

# Define Artifact Paths
DATA_DIR = os.path.join(BASE_DIR, "data")
MODELS_DIR = os.path.join(BASE_DIR, "models")
RESULTS_DIR = os.path.join(BASE_DIR, "results")

HF_CACHE_DIR = os.path.join(DATA_DIR, "hf_cache")
SNAPSHOT_DIR = os.path.join(DATA_DIR, "snapshots/cnndm_tok_bart_base")
CHECKPOINT_DIR = os.path.join(BASE_DIR, cfg['train']['output_dir'])

for d in [HF_CACHE_DIR, SNAPSHOT_DIR, CHECKPOINT_DIR, RESULTS_DIR]:
    pathlib.Path(d).mkdir(parents=True, exist_ok=True)
os.environ["HF_HOME"] = HF_CACHE_DIR

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loaded config from: /content/drive/MyDrive/w266_project_final/configs/baseline.json
Global random seed set to 42.
Using device: cuda
GPU Name: NVIDIA A100-SXM4-40GB


# 2. Data Preprocessing and Baseline Training (Guardrail #4)

**What:** Fine-tuning BART using a standard cross-entropy loss.

**Why:** This creates "Baseline B" (Guardrail #2).

This follows the BART paper (Lewis et al., 2020).

We tokenize the inputs using the BART tokenizer with truncation settings derived from EDA:
* **Source:** 1024 tokens (captures the full article for most inputs).
* **Target:** 128 tokens (sufficient for abstractive summaries).

We also create a random subset (Train=20k, Val=2k) to ensure the project remains computationally feasible on Colab GPUs.


## Data Subset Justification

### Why 20,000 Training Examples?

The full CNN/DailyMail training set contains 287,113 examples. We use a **random subset of 20,000** for the following reasons:

1. **Compute Constraints**: Fine-tuning on the full dataset would require an estimated 15+ hours on a Colab GPU. The 20K subset trains for 3 epochs in about 38 minutes on an A100 (`train_runtime` ≈ 2,277 s).

2. **Sufficient Signal**: Prior work on transfer learning shows that models fine-tuned from pre-trained checkpoints can reach competitive performance with relatively small training sets (Howard & Ruder, 2018).

3. **Reproducibility**: Using a fixed random seed (42) ensures that the subset is reproducible across runs.

4. **Focus on Reranking**: The project's goal is to evaluate a reranking method, not to achieve state-of-the-art generation. A competitive baseline is sufficient.

**Potential Limitation**: Training on 20K examples may result in lower ROUGE than training on the full dataset. However, the test-set ROUGE-L (28.09) is within the expected range for BART-base, suggesting the baseline is valid.
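The reproducibility argument can be checked with a small standard-library sketch: drawing indices with the same seed always yields the same subset. This uses Python's `random` module as a stand-in for `Dataset.shuffle(seed=...)`, so the indices it picks will differ from the ones the notebook selects; `sample_subset` is a hypothetical helper.

```python
import random

FULL_TRAIN_SIZE = 287_113  # CNN/DailyMail training examples (from the text above)
SUBSET_SIZE = 20_000
SEED = 42

def sample_subset(n_total, n_subset, seed):
    """Return a reproducible list of n_subset distinct indices in [0, n_total)."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(range(n_total), n_subset)

first = sample_subset(FULL_TRAIN_SIZE, SUBSET_SIZE, SEED)
second = sample_subset(FULL_TRAIN_SIZE, SUBSET_SIZE, SEED)

print(first == second)  # same seed, same subset
print(f"{SUBSET_SIZE / FULL_TRAIN_SIZE:.1%} of the training set")
```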

### Tokenization Settings

| Parameter | Value | Justification |
|-----------|-------|---------------|
| `max_source_len` | 1024 | Covers 95%+ of articles without truncation (from EDA) |
| `max_target_len` | 128 | >2x the average summary length (~56 tokens); leaves headroom so truncation is rare |

These settings follow standard practice for CNN/DailyMail summarization (See et al., 2017; Lewis et al., 2020).
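The coverage figures in the table come from the EDA notebook, which is not shown here. A check of that kind might look like the sketch below, which takes per-example token counts and reports the share that fits under a length cap. The token counts are invented for illustration, and `coverage` is a hypothetical helper.

```python
def coverage(token_lengths, max_len):
    """Fraction of examples whose token count is <= max_len (no truncation)."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    return sum(n <= max_len for n in token_lengths) / len(token_lengths)

# Invented token counts; real values would come from tokenizing the dataset.
article_lengths = [310, 540, 790, 860, 1010, 1024, 1500, 620, 450, 980]
summary_lengths = [42, 56, 61, 70, 48, 130, 55, 39, 66, 58]

source_cov = coverage(article_lengths, 1024)
target_cov = coverage(summary_lengths, 128)
print(f"source coverage at 1024: {source_cov:.0%}")
print(f"target coverage at 128:  {target_cov:.0%}")
```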

In [None]:
# Load and Tokenize Dataset


tokenizer = AutoTokenizer.from_pretrained(cfg['model_name'], use_fast=True)

MAX_INPUT = cfg['tokenization']['max_source_len']
MAX_TARGET = cfg['tokenization']['max_target_len']
SOURCE_COL = cfg['text_fields']['source']
SUMMARY_COL = cfg['text_fields']['summary']

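def preprocess(batch):
    # Tokenize source articles (encoder inputs); padding is deferred to the data collator
    model_inputs = tokenizer(batch[SOURCE_COL], max_length=MAX_INPUT, truncation=True, padding=False)
    # Tokenize reference summaries as labels via `text_target`
    # (replaces the deprecated `as_target_tokenizer()` context manager)
    labels = tokenizer(text_target=batch[SUMMARY_COL], max_length=MAX_TARGET, truncation=True, padding=False)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs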

# Load Dataset (reuse the tokenized snapshot if one exists)
SNAPSHOT_PATH = SNAPSHOT_DIR  # same path as defined in the setup cell
if os.path.exists(SNAPSHOT_PATH) and os.path.exists(os.path.join(SNAPSHOT_PATH, "dataset_dict.json")):
    print(f"Loading tokenized snapshot from disk: {SNAPSHOT_PATH}")
    ds_tok = load_from_disk(SNAPSHOT_PATH)
else:
    print("No snapshot found. Loading from Hugging Face and tokenizing...")
    ds = load_dataset(cfg['dataset_name'], cfg['dataset_config'], cache_dir=HF_CACHE_DIR)

    print("Tokenizing...")
    ds_tok = ds.map(
        preprocess,
        batched=True,
        num_proc=os.cpu_count(),
        remove_columns=ds['train'].column_names
    )

    print(f"Saving tokenized snapshot to: {SNAPSHOT_PATH}")
    ds_tok.save_to_disk(SNAPSHOT_PATH)

ds_tok.set_format(type="torch", columns=['input_ids', 'attention_mask', 'labels'])

# Create Subsets
TRAIN_SUBSET_SIZE = cfg['train_subset_size']
EVAL_SUBSET_SIZE = cfg['val_subset_size']

full_train = ds_tok['train'].shuffle(seed=SEED)
train_ds = full_train.select(range(TRAIN_SUBSET_SIZE))

full_eval = ds_tok['validation'].shuffle(seed=SEED)
eval_ds = full_eval.select(range(EVAL_SUBSET_SIZE))

print("\n--- Tokenized Dataset (Using Subsets) ---")
print(f"Training on {len(train_ds)} examples (randomly sampled)")
print(f"Evaluating on {len(eval_ds)} examples (randomly sampled)")


Loading tokenized snapshot from disk: /content/drive/MyDrive/w266_project_final/data/snapshots/cnndm_tok_bart_base

--- Tokenized Dataset (Using Subsets) ---
Training on 20000 examples (randomly sampled)
Evaluating on 2000 examples (randomly sampled)


## 3.0: Model Fine-Tuning
We use the Hugging Face `Seq2SeqTrainer` for training.
* **Metric:** ROUGE (including ROUGE-L) is computed on the validation subset at the end of each epoch to monitor convergence.
* **Precision:** Mixed precision is enabled to speed up training: BF16 on GPUs with compute capability 8.0 or higher (such as the A100), FP16 on older GPUs (such as the T4).
* **Saving:** The best checkpoint (highest validation ROUGE-L) is reloaded at the end of training and saved.

In [None]:
# Configure and Launch Trainer (Guardrail #5)

#  Load Model and Data
model = AutoModelForSeq2SeqLM.from_pretrained(cfg['model_name']).to(device)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, pad_to_multiple_of=8)

# Define ROUGE Metric for Evaluation (Guardrail #6)
!pip install rouge_score -q
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
import evaluate
rouge = evaluate.load("rouge")


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple): preds = preds[0]

    preds_copy = np.where(preds == -100, tokenizer.pad_token_id, preds)
    decoded_preds = tokenizer.batch_decode(preds_copy, skip_special_tokens=True)


    labels_copy = np.where(labels == -100, tokenizer.pad_token_id, labels)
    decoded_labels = tokenizer.batch_decode(labels_copy, skip_special_tokens=True)


    # Adding a newline after each sentence for ROUGE-Lsum
    decoded_preds = ["\n".join(nltk.sent_tokenize(p.strip())) for p in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(l.strip())) for l in decoded_labels]

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {key: round(value * 100, 4) for key, value in result.items()}
    return result

#  Training Arguments
t_args = cfg['train']
cap_major = 0
if torch.cuda.is_available():
    cap_major = torch.cuda.get_device_capability()[0]
use_bf16 = (cap_major >= 8)
use_fp16 = (cap_major < 8) and torch.cuda.is_available()

training_args = Seq2SeqTrainingArguments(
    output_dir=CHECKPOINT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
    num_train_epochs=t_args['epochs'],
    per_device_train_batch_size=t_args['batch_size'],
    per_device_eval_batch_size=t_args['batch_size'],
    gradient_accumulation_steps=t_args['grad_accum'],
    learning_rate=t_args['lr'],
    warmup_ratio=t_args['warmup_ratio'],
    weight_decay=0.01,
    predict_with_generate=True,
    generation_max_length=cfg['tokenization']['max_target_len'],
    seed=SEED,
    data_seed=SEED,
    fp16=use_fp16,
    bf16=use_bf16,
    dataloader_num_workers=2,
    report_to="none"
)

# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer initialized. Ready for training.")

Trainer initialized. Ready for training.


## Training Configuration

### Hyperparameter Choices

| Hyperparameter | Value | Rationale |
|----------------|-------|-----------|
| Epochs | 3 | Standard for fine-tuning; validation loss plateaus by epoch 3 |
| Batch Size | 8 | Maximum that fits in Colab GPU memory |
| Gradient Accumulation | 2 | Effective batch size = 16, balancing stability and speed |
| Learning Rate | 5e-5 | Standard for fine-tuning transformers (Devlin et al., 2019) |
| Warmup Ratio | 0.1 | Gradual warmup prevents early training instability |
| Weight Decay | 0.01 | Mild regularization to prevent overfitting |
| Precision | FP16/BF16 | Mixed precision for faster training on GPU |
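To make the table concrete, the sketch below derives the effective batch size, the number of optimizer steps, and the warmup length from these values, assuming a single GPU and the 20,000-example training subset. The resulting 3,750 steps agrees with the `step` count reported in the training log.

```python
import math

train_examples = 20_000
per_device_batch_size = 8
grad_accum_steps = 2
epochs = 3
warmup_ratio = 0.1
num_gpus = 1  # assumption: single-GPU Colab runtime

# Examples consumed per optimizer update.
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_gpus))
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)

total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"warmup steps: {warmup_steps}")
```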

### Evaluation Strategy

- **Metric**: ROUGE-L (measures longest common subsequence)
- **Checkpoint Selection**: Save best model based on validation ROUGE-L
- **Generation**: `predict_with_generate=True` enables autoregressive decoding during evaluation
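For intuition about what ROUGE-L measures, here is a simplified sketch that scores one prediction against one reference using the longest common subsequence (LCS) of their tokens. It uses lowercase whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation used in this notebook; `lcs_length` and `rouge_l_f1` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1: LCS-based precision and recall over tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
print(f"ROUGE-L F1 = {score:.4f}")
```

In this example the LCS is "the cat on the mat" (5 tokens) and both texts have 6 tokens, so precision, recall, and F1 are all 5/6.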

### What I am looking for

1. **Training Loss**: Should decrease steadily across epochs
2. **Validation Loss**: Should decrease and then plateau; a sustained increase while training loss keeps falling would indicate overfitting
3. **ROUGE-L**: Primary quality metric; higher is better
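One way to turn the second check into code is to flag epochs where validation loss rises while training loss keeps falling. The sketch below applies that rule to the per-epoch losses logged for this run and to an invented overfitting case; the helper name `overfitting_epochs` and the `tol` parameter are illustrative choices, not part of the training code.

```python
def overfitting_epochs(train_loss, val_loss, tol=0.0):
    """Return 1-indexed epochs where train loss fell but val loss rose by more than tol."""
    flagged = []
    for i in range(1, len(val_loss)):
        train_improved = train_loss[i] < train_loss[i - 1]
        val_worsened = val_loss[i] > val_loss[i - 1] + tol
        if train_improved and val_worsened:
            flagged.append(i + 1)
    return flagged

# Per-epoch losses from this notebook's training log.
run_train = [2.0849, 1.8851, 1.6985]
run_val = [1.778065, 1.72917, 1.72409]

# Invented example of an overfitting run, for contrast.
toy_train = [2.0, 1.5, 1.0]
toy_val = [1.8, 1.7, 1.9]

print(overfitting_epochs(run_train, run_val))
print(overfitting_epochs(toy_train, toy_val))
```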

In [None]:
#  Run Training

print(f"Starting training on {len(train_ds)} examples...")
print(f"Checkpoints will be saved to {CHECKPOINT_DIR}")


train_result = trainer.train(resume_from_checkpoint=False)

# Save Artifacts (Guardrail #2)
trainer.save_model()
print(f"Best model saved to: {CHECKPOINT_DIR}")

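hist = trainer.state.log_history
metrics_path = os.path.join(RESULTS_DIR, "baseline_bart_20k_metrics.json")

# log_history ends with the training summary (runtime, train_loss), so pick
# the last entry that actually contains evaluation metrics.
eval_logs = [entry for entry in hist if "eval_loss" in entry]

final_metrics = {
    "model": "BART-base (Baseline B)",
    "note": f"Trained on {len(train_ds)} examples",
    "config_file": CONFIG_PATH,
    "final_eval_metrics": eval_logs[-1] if eval_logs else {},
    "train_summary": hist[-1],
    "log_history": hist
}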
with open(metrics_path, "w") as f:
    json.dump(final_metrics, f, indent=2)

print("\n--- Training Complete ---")
print(f"Final metrics saved to: {metrics_path}")
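# Note: hist[-1] is the trainer's end-of-training summary (runtime, throughput,
# mean train loss); per-epoch evaluation entries are the ones containing "eval_loss".
print("\nFinal Evaluation Metrics:")
print(json.dumps(hist[-1], indent=2))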


Starting training on 20000 examples...
Checkpoints will be saved to /content/drive/MyDrive/w266_project_final/models/bart_base_cnn_dm_20k



Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,2.0849,1.778065,42.6609,19.959,29.0147,39.654
2,1.8851,1.72917,42.4433,19.6057,28.9677,39.4869
3,1.6985,1.72409,42.6181,19.8872,29.2091,39.6181


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


Best model saved to: /content/drive/MyDrive/w266_project_final/models/bart_base_cnn_dm_20k

--- Training Complete ---
Final metrics saved to: /content/drive/MyDrive/w266_project_final/results/baseline_bart_20k_metrics.json

Final Evaluation Metrics:
{
  "train_runtime": 2277.3053,
  "train_samples_per_second": 26.347,
  "train_steps_per_second": 1.647,
  "total_flos": 3.639474999853056e+16,
  "train_loss": 1.945367948404948,
  "epoch": 3.0,
  "step": 3750
}


## Training Results Analysis

### Convergence Assessment

| Epoch | Train Loss | Val Loss | ROUGE-L |
|-------|------------|----------|---------|
| 1 | 2.085 | 1.778 | 29.01 |
| 2 | 1.885 | 1.729 | 28.97 |
| 3 | 1.699 | 1.724 | 29.21 |

**Observations:**

1. **Training Loss**: Decreases consistently (2.08 → 1.70), indicating the model is learning.

2. **Validation Loss**: Decreases then plateaus (1.78 → 1.73 → 1.72). The change in epoch 3 is a small further decrease (1.729 → 1.724), not an increase, so there is no sign of overfitting.

3. **ROUGE-L**: Stable around 29, with best performance at epoch 3 (29.21). This is the checkpoint saved.
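The checkpoint choice can be reproduced from the logged values. The sketch below picks the best epoch by validation ROUGE-L (the criterion configured via `metric_for_best_model="eval_rougeL"`) and, for comparison, by validation loss. The `epoch_logs` list is transcribed by hand from the training table above rather than read from the saved metrics file.

```python
# Per-epoch validation metrics, copied from the training log.
epoch_logs = [
    {"epoch": 1, "eval_loss": 1.778065, "eval_rougeL": 29.0147},
    {"epoch": 2, "eval_loss": 1.72917, "eval_rougeL": 28.9677},
    {"epoch": 3, "eval_loss": 1.72409, "eval_rougeL": 29.2091},
]

# Higher ROUGE-L is better; lower loss is better.
best_by_rouge = max(epoch_logs, key=lambda entry: entry["eval_rougeL"])
best_by_loss = min(epoch_logs, key=lambda entry: entry["eval_loss"])

print(f"best epoch by ROUGE-L: {best_by_rouge['epoch']}")
print(f"best epoch by val loss: {best_by_loss['epoch']}")
```

For this run both criteria select epoch 3, so the saved checkpoint would be the same under either rule.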

### Comparison to Published Results

| Model | ROUGE-L | Notes |
|-------|---------|-------|
| BART-large (Lewis et al., 2020) | ~40 | Full training set, large model |
| BART-base (this work) | 29.21 | 20K subset, base model; validation subset, epoch 3 |
| LEAD-3 (extractive baseline) | 24.91 | No training |

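Our BART-base achieves **+4.3 ROUGE-L over LEAD-3**, indicating that the fine-tuned model improves on the extractive baseline. The gap to BART-large is expected given the smaller model and training subset. Part of that gap may also reflect the metric variant: published CNN/DailyMail ROUGE-L figures are usually summary-level scores, which correspond to `rougeLsum` in the `evaluate` library (39.62 at epoch 3 in this run), whereas the 29.21 reported here is `rougeL`, which treats each summary as a single sequence.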

### Key Takeaway

The baseline model is **successfully trained** and provides a valid foundation for reranking experiments. The model achieves competitive ROUGE scores and shows little to no signs of overfitting.

---

## Next Steps

This trained model will be used in:
- **Notebook 03**: Generate K=5 candidate summaries per article
- **Notebook 04**: Score candidates with FactCC and NLI verifiers
- **Notebook 08**: Final evaluation on full test set