## Candidate Generation Strategy

### Why Generate Multiple Candidates?

The core hypothesis of the project is that beam search produces multiple candidate summaries and that **not all candidates are equally factual**. By generating K candidates and selecting the most factual one, we may reduce hallucinations without retraining the model.
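
The selection step can be sketched as below. This is a minimal illustration, not the project's reranker: `select_most_factual` and `toy_overlap_score` are hypothetical names, and the token-overlap scorer is only a placeholder for whatever factuality metric is actually used.

```python
from typing import Callable, Sequence


def select_most_factual(
    article: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest score_fn(article, candidate)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the earliest candidate on ties, i.e. the higher-ranked beam.
    return max(candidates, key=lambda cand: score_fn(article, cand))


def toy_overlap_score(article: str, candidate: str) -> float:
    """Placeholder scorer (an assumption): share of candidate tokens that appear in the article."""
    article_tokens = set(article.lower().split())
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    return sum(tok in article_tokens for tok in cand_tokens) / len(cand_tokens)
```

Because the scorer is passed in as a function, the same selection code works with any metric that maps an (article, candidate) pair to a number.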

### Why Beam Search?

We use **beam search** rather than sampling for several reasons:

1. **Deterministic**: The same input always produces the same candidates, ensuring reproducibility.
2. **High quality**: Beam search explores high-probability paths, producing fluent outputs.
3. **Diverse but related**: Candidates share common structure but differ in specific details.

### Generation Parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `num_beams` | 5 | Number of parallel hypotheses to track |
| `num_return_sequences` | 5 | Return all 5 beam hypotheses as candidates |
| `max_new_tokens` | 128 | Sufficient for CNN/DailyMail summaries (~56 tokens avg) |
| `min_new_tokens` | 30 | Ensures summaries are not trivially short |
| `do_sample` | False | Deterministic beam search, not stochastic sampling |
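
A quick sanity check on these settings can catch configuration mistakes before a long GPU run. The sketch below assumes the values live in a dict shaped like the table; the helper name `validate_gen_params` is hypothetical, and whether `do_sample` is stored in the config file at all is an assumption. The first check reflects a real constraint: with beam search, Hugging Face `generate()` rejects `num_return_sequences` larger than `num_beams`.

```python
# Assumed shape of the generation settings, mirroring the table above.
gen_params = {
    "num_beams": 5,
    "num_return_sequences": 5,
    "max_new_tokens": 128,
    "min_new_tokens": 30,
    "do_sample": False,
}


def validate_gen_params(params: dict) -> None:
    """Raise ValueError if the beam-search settings are inconsistent."""
    if params["num_return_sequences"] > params["num_beams"]:
        raise ValueError("num_return_sequences must be <= num_beams for beam search")
    if params["min_new_tokens"] > params["max_new_tokens"]:
        raise ValueError("min_new_tokens must be <= max_new_tokens")
    if params.get("do_sample", False):
        raise ValueError("do_sample must be False for deterministic beam search")


validate_gen_params(gen_params)  # passes silently for the values in the table
```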

### Why K=5?

- **K=1**: No reranking possible (just the baseline)
- **K=2-3**: Limited diversity; may not include a factual candidate
- **K=5**: Good trade-off between diversity and compute cost
- **K>5**: Diminishing returns (see K-ablation in Notebook 05)
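
One way to quantify this trade-off, sketched here under assumptions, is a best-of-k curve: for each k, average over articles the best score among the top-k beams. The function name `best_of_k_curve` and the scores are made up for illustration; the ablation in Notebook 05 may be computed differently.

```python
from statistics import mean
from typing import List, Sequence


def best_of_k_curve(scores_per_article: Sequence[Sequence[float]]) -> List[float]:
    """For k = 1..K, average over articles of the best score among the top-k beams."""
    num_candidates = min(len(scores) for scores in scores_per_article)
    return [
        mean(max(scores[:k]) for scores in scores_per_article)
        for k in range(1, num_candidates + 1)
    ]


# Made-up factuality scores for two articles, ordered by beam rank (illustration only).
toy_scores = [
    [0.50, 0.75, 0.75, 0.25, 0.50],
    [0.25, 0.25, 0.50, 0.50, 1.00],
]
curve = best_of_k_curve(toy_scores)
print(curve)
```

The curve can never decrease as k grows, so the question is how quickly it flattens. Note that it is an upper bound: it assumes the selector always picks the best-scoring candidate.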

In [None]:


import os
import json
import torch
import pandas as pd
import orjson
from google.colab import drive
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm # For a nice progress bar

print("--- 1.0: Setup ---")
# 1.1: Mount Google Drive
drive.mount('/content/drive')

# Central Config
PROJECT_ROOT = "/content/drive/MyDrive/w266_project_final"
CONFIGS_DIR = os.path.join(PROJECT_ROOT, "configs")
CONFIG_PATH = os.path.join(CONFIGS_DIR, "baseline.json")

with open(CONFIG_PATH, 'r') as f:
    cfg = json.load(f)

print(f"Loaded config from: {CONFIG_PATH}")

#  Artifact Paths
DATA_DIR = os.path.join(PROJECT_ROOT, "data")
OUTPUTS_DIR = os.path.join(PROJECT_ROOT, "outputs") # Use "outputs" dir
os.makedirs(OUTPUTS_DIR, exist_ok=True)

# This is the fine-tuned model built in Notebook 02
CHECKPOINT_DIR = os.path.join(PROJECT_ROOT, cfg['train']['output_dir'])

#  NEW artifact we will create
CANDIDATES_FILE = os.path.join(OUTPUTS_DIR, "validation_candidates_k5.jsonl")

# Setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

# Define Generation Parameters from Config
gen_params = cfg['generate']
K_CANDIDATES = gen_params['num_return_sequences']
print(f"Will generate K={K_CANDIDATES} candidates per article.")

--- 1.0: Setup ---
Mounted at /content/drive
Loaded config from: /content/drive/MyDrive/w266_project_final/configs/baseline.json
Using device: cuda
GPU Name: NVIDIA A100-SXM4-80GB
Will generate K=5 candidates per article.


In [None]:

# Loading Model and Tokenizer

print("\n--- 2.0: Loading Fine-Tuned Model & Tokenizer ---")
print(f"Loading model from: {CHECKPOINT_DIR}")

# Load the fine-tuned model and tokenizer from checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT_DIR).to(device)
model.eval() # Setting model to evaluation mode

print("Model and tokenizer loaded successfully.")


--- 2.0: Loading Fine-Tuned Model & Tokenizer ---
Loading model from: /content/drive/MyDrive/w266_project_final/models/bart_base_cnn_dm_20k
Model and tokenizer loaded successfully.


In [None]:

# Load and Prepare Dataset

print("\n--- 3.0: Loading Validation Dataset ---")
# Loading the raw validation set (not the tokenized one)
# Use the 'val_subset_size' from the config to match notebook 02
val_subset_size = cfg['val_subset_size']

raw_dataset = load_dataset(
    cfg["dataset_name"],
    cfg["dataset_config"],
    split="validation",
    cache_dir=os.path.join(DATA_DIR, "hf_cache")
)

# Applying the same shuffle and subset logic from training
val_ds = raw_dataset.shuffle(seed=cfg['seed']).select(range(val_subset_size))

print(f"Loaded and selected {len(val_ds)} examples from the validation set.")

# Get column names from config
SOURCE_COL = cfg['text_fields']['source']
SUMMARY_COL = cfg['text_fields']['summary']


--- 3.0: Loading Validation Dataset ---


Loaded and selected 2000 examples from the validation set.


In [None]:

# Generate All Candidates (The Main Event)
print(f"\n--- 4.0: Generating {K_CANDIDATES} Candidates for {len(val_ds)} Articles ---")
print(f"This is the main GPU task. Saving results to: {CANDIDATES_FILE}")

# Writing the file line-by-line to save memory and prevent data loss
with open(CANDIDATES_FILE, 'wb') as f_out:
    for example in tqdm(val_ds):
        article_text = example[SOURCE_COL]
        reference_summary = example[SUMMARY_COL]

        # Tokenize the article
        inputs = tokenizer(
            article_text,
            max_length=cfg['tokenization']['max_source_len'],
            truncation=True,
            return_tensors="pt"
        ).to(device)

        # Run model.generate()
        # This will use the parameters from config file
        with torch.no_grad():
            output_ids_batch = model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],  # pass the mask explicitly
                num_beams=gen_params['num_beams'],
                num_return_sequences=gen_params['num_return_sequences'],
                max_new_tokens=gen_params['max_new_tokens'],
                min_new_tokens=gen_params['min_new_tokens'],
                do_sample=False # Use beam search, not sampling
            )

        # Decode the results
        generated_summaries = tokenizer.batch_decode(
            output_ids_batch,
            skip_special_tokens=True
        )

        # Create the data record
        record = {
            "id": example.get('id', None),
            "article": article_text,
            "reference_summary": reference_summary,
            f"generated_candidates_k{K_CANDIDATES}": generated_summaries
        }

        f_out.write(orjson.dumps(record) + b'\n')

print("\n--- 5.0: Candidate Generation Complete! ---")
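print(f"All candidates have been saved to {CANDIDATES_FILE}")
print("You can now proceed to notebook 04 for reranking and analysis.")
print("You will not need a high-power GPU for the next steps.")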


--- 4.0: Generating 5 Candidates for 2000 Articles ---
This is the main GPU task. Saving results to: /content/drive/MyDrive/w266_project_final/outputs/validation_candidates_k5.jsonl




--- 5.0: Candidate Generation Complete! ---
All candidates have been saved to /content/drive/MyDrive/w266_project_final/outputs/validation_candidates_k5.jsonl
You can now proceed to notebook 04 for reranking and analysis.
You will not need a high-power GPU for the next steps.


In [None]:
import os
import orjson
from google.colab import drive

# 1. Mounting drive
drive.mount('/content/drive')

# 2. Define the path to new artifact
PROJECT_ROOT = "/content/drive/MyDrive/w266_project_final"
OUTPUTS_DIR = os.path.join(PROJECT_ROOT, "outputs")
CANDIDATES_FILE = os.path.join(OUTPUTS_DIR, "validation_candidates_k5.jsonl")

# 3. Open the file
print(f"Peeking inside: {CANDIDATES_FILE}\n")

try:
    with open(CANDIDATES_FILE, 'rb') as f:
        first_line = f.readline()

        if first_line:

            data = orjson.loads(first_line)

            # Print a summary
            print("--- Success! Found 1 record: ---")
            print(f"Article (truncated): {data['article'][:200]}...")
            print(f"\nReference Summary: {data['reference_summary']}")

            print(f"\nGenerated Candidates (K={len(data['generated_candidates_k5'])}):")
            for i, cand in enumerate(data['generated_candidates_k5']):
                print(f"  {i+1}: {cand}")
        else:
            print("File is empty!")

except FileNotFoundError:
    print(f"ERROR: File not found at {CANDIDATES_FILE}")
except Exception as e:
    print(f"An error occurred: {e}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Peeking inside: /content/drive/MyDrive/w266_project_final/outputs/validation_candidates_k5.jsonl

--- Success! Found 1 record: ---
Article (truncated): Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try h...

Reference Summary: Jarryd Hayne quit the NRL in October to try and get into American Football .
This week, he signed a three-year contract with the San Francisco 49ers .
The chairman of the US Association of Rugby League welcomed his arrival .

Generated Candidates (K=5):
  1: Jarryd Hayne has signed a three-year contract with the San Francisco 49ers .
The Australia international quit the NRL in October to try his luck in American football .
Hayne could play at full back or centre in rugby league and is expec

## Notebook 03 Summary & How to Interpret Results

### What Did This Notebook Do?

This notebook loaded the fine-tuned BART model (from Notebook 02) and looped through all 2,000 articles in the validation subset. For each article, it ran `model.generate()` to create K=5 candidate summaries and saved them all to `validation_candidates_k5.jsonl`.

### Breaking Down the Output

**Article (truncated)**: The original input text from the CNN/DailyMail dataset that the model was asked to summarize.

**Reference Summary**: The "ground truth" or "gold standard" summary written by a human. This is what generated summaries are compared against when computing ROUGE scores.

**Generated Candidates (K=5)**: The 5 different summaries the fine-tuned BART model produced for this article.
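
To work with every record rather than only the first, the file can be streamed line by line. The sketch below uses the standard-library `json` module instead of `orjson`; the helper name `iter_candidate_records` is hypothetical, and the field names are the ones written by the generation loop above.

```python
import json
from typing import Iterator


def iter_candidate_records(path: str, k: int = 5) -> Iterator[dict]:
    """Yield one record per non-empty line, checking that the expected fields are present."""
    cand_key = f"generated_candidates_k{k}"
    required = {"article", "reference_summary", cand_key}
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = required - record.keys()
            if missing:
                raise KeyError(f"line {line_no}: missing fields {sorted(missing)}")
            yield record
```

Because this is a generator, only one record is held in memory at a time, which matters when each record carries the full article text.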

### How to Interpret: The "Why"

This is the most important part. The 5 candidates often look very similar to each other. This is normal and expected: it is a direct consequence of beam search.

Beam search explores several "paths" to build a sentence. These 5 candidates are the 5 highest-scoring paths it found. They will often share the same beginning and differ by only a few words or a final sentence.
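
The shared-prefix behavior is easy to see in a toy version of the algorithm. The sketch below is a simplified illustration, not the Hugging Face implementation: it has no end-of-sequence handling or length penalty, and `TOY_MODEL` is a made-up table of next-token probabilities that depend only on the last token.

```python
import math
from typing import Dict, List, Tuple

# Made-up next-token probabilities (a toy stand-in for a real decoder).
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"hayne": 0.9, "the": 0.1},
    "hayne": {"signed": 0.6, "quit": 0.4},
    "the": {"player": 1.0},
    "signed": {"a": 0.7, "with": 0.3},
    "quit": {"the": 1.0},
    "player": {"signed": 1.0},
}


def beam_search(steps: int, num_beams: int) -> List[Tuple[List[str], float]]:
    """Keep the num_beams highest log-probability sequences after each step."""
    beams: List[Tuple[List[str], float]] = [(["<s>"], 0.0)]
    for _ in range(steps):
        expanded = []
        for tokens, logp in beams:
            for tok, prob in TOY_MODEL[tokens[-1]].items():
                expanded.append((tokens + [tok], logp + math.log(prob)))
        # Sorting is deterministic, so repeated runs return identical beams.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams


for tokens, logp in beam_search(steps=3, num_beams=3):
    print(" ".join(tokens[1:]), round(logp, 3))
```

In this toy run all three surviving sequences begin with "hayne" and differ only in later tokens, which mirrors how the real candidates tend to share an opening and diverge in detail.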

This is the entire point of the project. The hypothesis is that even though these 5 summaries look similar, one of them is likely more factually accurate than the others. At this stage we have 5 candidates but no "judge" to choose among them; supplying that judge is the job of the reranking step in Notebook 04.