# 07_Final_Test


## Purpose
This notebook runs the **final blind test set evaluation** for the project.

**What it does:**
1. Loads fine-tuned BART model (from Notebook 02)
2. Generates K=5 candidates for all 11,490 test examples (beam search)
3. Scores candidates with FactCC and RoBERTa-MNLI verifiers
4. Computes final metrics: ROUGE-L and factuality scores
5. Produces results for the paper (Table 1, Section 4)

**Why this matters:**
- Test set is held out until now
- Results prove method generalizes beyond validation set


**Output:** `test_set_final_results.jsonl` (11,490 scored examples)

---

In [None]:
import os
from google.colab import drive

# 1. Mount Drive
drive.mount('/content/drive')

# 2. Check File
OUTPUTS_DIR = "/content/drive/MyDrive/w266_project_final/outputs"
TEST_RESULTS_FILE = os.path.join(OUTPUTS_DIR, "test_set_final_results.jsonl")

if os.path.exists(TEST_RESULTS_FILE):
    # Count lines (each line is one processed example)
    with open(TEST_RESULTS_FILE, 'rb') as f:
        count = sum(1 for _ in f)

    print(f"‚úÖ Success! You have {count} completed examples saved safely.")
    print(f"Progress: {count} / 11,490 ({count/11490:.1%} Complete)")

    if count >= 11490:
        print("üéâ YOU ARE DONE! You don't need to resume. Go to Notebook 08.")
    else:
        print(f"‚ö†Ô∏è Incomplete. You need to process {11490 - count} more examples.")
else:
    print("‚ùå File not found. Something went wrong at the start.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
‚úÖ Success! You have 11490 completed examples saved safely.
Progress: 11490 / 11,490 (100.0% Complete)
üéâ YOU ARE DONE! You don't need to resume. Go to Notebook 08.


In [None]:
import os
import json
import orjson
import pandas as pd
from google.colab import drive

# Setup
drive.mount('/content/drive')
OUTPUTS_DIR = "/content/drive/MyDrive/w266_project_final/outputs"
TEST_RESULTS_FILE = os.path.join(OUTPUTS_DIR, "test_set_final_results.jsonl")

print(f"Inspecting: {TEST_RESULTS_FILE}")

# Deep Inspection
valid_count = 0
corrupt_count = 0
records = []

try:
    with open(TEST_RESULTS_FILE, 'rb') as f:
        # Read all lines
        lines = f.readlines()
        total_lines = len(lines)
        print(f"Total Lines Found: {total_lines}")

        # Check First, Middle, Last
        indices_to_check = [0, total_lines // 2, total_lines - 1]

        print("\n--- Spot Check (First, Middle, Last) ---")
        for i in indices_to_check:
            try:
                data = orjson.loads(lines[i])

                # Validation Logic
                checks = {
                    "Has Candidates": "candidates" in data,
                    "K=5 Candidates": len(data.get("candidates", [])) == 5,
                    "Has FactCC": "factcc_scores" in data,
                    "K=5 FactCC": len(data.get("factcc_scores", [])) == 5,
                    "Has NLI": "nli_scores" in data,
                    "K=5 NLI": len(data.get("nli_scores", [])) == 5,
                }

                if all(checks.values()):
                    print(f"‚úÖ Line {i}: Valid")
                    if i == total_lines - 1:
                        print(f"   Last Article Snippet: {data['article'][:50]}...")
                else:
                    print(f"‚ùå Line {i}: INVALID STRUCTURE")
                    print(checks)
                    corrupt_count += 1

            except json.JSONDecodeError:
                print(f"‚ùå Line {i}: JSON CORRUPTION (Likely cut off)")
                corrupt_count += 1

except FileNotFoundError:
    print("File not found.")

print("\n--- Summary ---")
if corrupt_count == 0:
    print("üéâ INTEGRITY CHECK PASSED: Data is clean and ready for Notebook 08.")
else:
    print(f"‚ö†Ô∏è WARNING: Found {corrupt_count} corrupt lines. You may need to trim the file.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Inspecting: /content/drive/MyDrive/w266_project_final/outputs/test_set_final_results.jsonl
Total Lines Found: 11490

--- Spot Check (First, Middle, Last) ---
‚úÖ Line 0: Valid
‚úÖ Line 5745: Valid
‚úÖ Line 11489: Valid
   Last Article Snippet: Angus Hawley's brother has spoken of his shock aft...

--- Summary ---
üéâ INTEGRITY CHECK PASSED: Data is clean and ready for Notebook 08.


In [None]:
import os
import json
import orjson
import pandas as pd
from google.colab import drive

# Setup
drive.mount('/content/drive')
OUTPUTS_DIR = "/content/drive/MyDrive/w266_project_final/outputs"
TEST_RESULTS_FILE = os.path.join(OUTPUTS_DIR, "test_set_final_results.jsonl")

#  Load Data
print(f"Loading: {TEST_RESULTS_FILE}")
records = []
with open(TEST_RESULTS_FILE, 'rb') as f:
    for line in f:
        records.append(orjson.loads(line))

df = pd.DataFrame(records)
print(f"Loaded {len(df)} rows.")


# View: Baseline (First Cand) vs. Reranked (Best FactCC Cand)
viewer_data = []

for idx, row in df.head(10).iterrows(): # Show top 10 examples

    # Baseline = Index 0
    base_text = row['candidates'][0]
    base_score = row['factcc_scores'][0]

    # Reranked = Highest Score
    import numpy as np
    best_idx = np.argmax(row['factcc_scores'])
    rerank_text = row['candidates'][best_idx]
    rerank_score = row['factcc_scores'][best_idx]

    viewer_data.append({
        "ID": idx,
        "Baseline Score": f"{base_score:.4f}",
        "Rerank Score": f"{rerank_score:.4f}",
        "Did it Change?": "YES" if best_idx != 0 else "No",
        "Baseline Summary": base_text,
        "Reranked Summary": rerank_text
    })


display_df = pd.DataFrame(viewer_data)

# Display
from IPython.display import display
print("\n--- Reranker Data Explorer (First 10) ---")
display(display_df)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loading: /content/drive/MyDrive/w266_project_final/outputs/test_set_final_results.jsonl
Loaded 11490 rows.

--- Reranker Data Explorer (First 10) ---


Unnamed: 0,ID,Baseline Score,Rerank Score,Did it Change?,Baseline Summary,Reranked Summary
0,0,0.0002,0.9703,YES,Palestinians signed the ICC's founding Rome St...,Palestinians signed the ICC's founding Rome St...
1,1,0.0001,0.4225,YES,"Theia, a friendly white-and-black bully breed ...","Theia, a friendly white-and-black bully breed ..."
2,2,0.9993,1.0,YES,Iranian foreign minister Mohammad Javad Zarif ...,Iranian foreign minister Mohammad Javad Zarif ...
3,3,0.9974,0.9992,YES,Five Americans who were monitored for three we...,Five Americans were monitored for three weeks ...
4,4,0.0003,0.0032,YES,A Duke student has admitted to hanging a noose...,A Duke student has admitted to hanging a noose...
5,5,0.9913,0.9914,YES,"Trey Moses, a star on Eastern High School's ba...","Trey Moses, a star on Eastern High School's ba..."
6,6,0.9999,0.9999,No,"""Dark trend of governments using the death pen...","""Dark trend of governments using the death pen..."
7,7,0.9999,0.9999,YES,The coroner's preliminary assessment is there ...,The coroner's preliminary assessment is there ...
8,8,0.9997,0.9999,YES,Maysak has lost a lot of steam as it has spun ...,Maysak has lost a lot of steam as it has spun ...
9,9,0.0001,0.0036,YES,Bob Barker hosted the TV game show for 35 year...,Bob Barker hosted the TV game show for 35 year...


In [None]:
# 07_Final_Test_Pipeline.ipynb

# Steps:
# 1. Load Baseline BART.
# 2. Generate K=5 Candidates for the TEST set.
# 3. Score them with FactCC and NLI (Simple AlignScore).
# 4. Calculate Final ROUGE scores.

import os
import json
import torch
import orjson
import pandas as pd
import numpy as np
import evaluate
from google.colab import drive
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification
from datasets import load_dataset

# Setup
print("--- 1.0: Setup & Config ---")
drive.mount('/content/drive')

PROJECT_ROOT = "/content/drive/MyDrive/w266_project_final"
CONFIG_PATH = os.path.join(PROJECT_ROOT, "configs", "baseline.json")
OUTPUTS_DIR = os.path.join(PROJECT_ROOT, "outputs")
TEST_RESULTS_FILE = os.path.join(OUTPUTS_DIR, "test_set_final_results.jsonl")

with open(CONFIG_PATH, 'r') as f:
    cfg = json.load(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

--- 1.0: Setup & Config ---
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Using device: cuda


In [None]:
OUTPUTS_DIR = os.path.join(PROJECT_ROOT, "outputs")
TEST_RESULTS_FILE = os.path.join(OUTPUTS_DIR, "test_set_final_results.jsonl")

#  Load Generator Model
CHECKPOINT_DIR = os.path.join(PROJECT_ROOT, cfg['train']['output_dir'])
print(f"Loading Fine-Tuned Model from: {CHECKPOINT_DIR}")
gen_tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT_DIR).to(device)
gen_model.eval()

#  Load FULL Test Data
print("Loading CNN/DailyMail (Test Split)...")
test_dataset = load_dataset(cfg['dataset_name'], cfg['dataset_config'], split="test")
print(f"‚úÖ Loaded FULL test set: {len(test_dataset)} examples.")

Loading Fine-Tuned Model from: /content/drive/MyDrive/w266_project_final/models/bart_base_cnn_dm_20k
Loading CNN/DailyMail (Test Split)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

3.0.0/train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

3.0.0/train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

3.0.0/train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

3.0.0/validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

3.0.0/test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

‚úÖ Loaded FULL test set: 11490 examples.


In [None]:

print("\n--- 2.0: Generating Candidates for TEST Set ---")

#  Load Model
CHECKPOINT_DIR = os.path.join(PROJECT_ROOT, cfg['train']['output_dir'])
print(f"Loading Fine-Tuned Model from: {CHECKPOINT_DIR}")
gen_tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT_DIR).to(device)
gen_model.eval()

#  Load Test Data
print("Loading CNN/DailyMail (Test Split)...")
test_dataset = load_dataset(cfg['dataset_name'], cfg['dataset_config'], split="test")
print(f"Loaded {len(test_dataset)} test examples.")

# Generation Loop
# Store results in memory temporarily to pass to the scorer

generated_data = []

print("Starting Generation (Beam Search K=5)...")
batch_size = 8
# Iterate in batches for speed
for i in tqdm(range(0, len(test_dataset), batch_size), desc="Generating"):
    batch = test_dataset[i : i + batch_size]
    articles = batch[cfg['text_fields']['source']]
    refs = batch[cfg['text_fields']['summary']]

    inputs = gen_tokenizer(
        articles,
        max_length=1024,
        truncation=True,
        padding=True,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        outputs = gen_model.generate(
            inputs["input_ids"],
            num_beams=5,
            num_return_sequences=5,
            max_new_tokens=128,
            min_new_tokens=10,
            early_stopping=True
        )

    # Decode
    decoded = gen_tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Reshape: [Batch_Size * K] -> [Batch_Size, K]
    for j, article in enumerate(articles):
        candidates = decoded[j*5 : (j+1)*5]
        generated_data.append({
            "article": article,
            "reference": refs[j],
            "candidates": candidates
        })

# Free up GPU memory
del gen_model
torch.cuda.empty_cache()
print("Generation Complete. Model unloaded.")


--- 2.0: Generating Candidates for TEST Set ---
Loading Fine-Tuned Model from: /content/drive/MyDrive/w266_project_final/models/bart_base_cnn_dm_20k
Loading CNN/DailyMail (Test Split)...
Loaded 11490 test examples.
Starting Generation (Beam Search K=5)...


Generating:   0%|          | 0/1437 [00:00<?, ?it/s]

Generation Complete. Model unloaded.


In [None]:

#SAFETY CHECKPOINT


import orjson
import os

OUTPUTS_DIR = "/content/drive/MyDrive/w266_project_final/outputs"
CHECKPOINT_FILE = os.path.join(OUTPUTS_DIR, "intermediate_candidates_backup.jsonl")

print(f"Saving {len(generated_data)} candidates to {CHECKPOINT_FILE}...")

with open(CHECKPOINT_FILE, 'wb') as f:
    for record in generated_data:
        f.write(orjson.dumps(record) + b'\n')

print("‚úÖ BACKUP COMPLETE.")
print("If the runtime crashes during scoring, you can now skip Generation")
print("and just load this file instead!")

Saving 11490 candidates to /content/drive/MyDrive/w266_project_final/outputs/intermediate_candidates_backup.jsonl...
‚úÖ BACKUP COMPLETE.
If the runtime crashes during scoring, you can now skip Generation
and just load this file instead!



## 3.0: Score with Factuality Verifiers
**Model A:** FactCC (manueldeprada/FactCC) - Specialized for summarization  
**Model B:** RoBERTa-MNLI (roberta-large-mnli) - General NLI for entailment  
**Method:** Each of K=5 candidates scored independently against source article


In [None]:

#  Score Candidates (FactCC + NLI)

print("\n--- 3.0: Scoring with Verifiers ---")

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load Verifiers
print("Loading FactCC...")
factcc_tokenizer = AutoTokenizer.from_pretrained("manueldeprada/FactCC")
factcc_model = AutoModelForSequenceClassification.from_pretrained("manueldeprada/FactCC").to(device)
factcc_model.eval()

print("Loading NLI (RoBERTa)...")
nli_tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").to(device)
nli_model.eval()

#  Helper Functions
def score_batch(model, tokenizer, pairs, target_label_idx):
    # Tokenize pairs [ (art, cand), (art, cand) ... ]
    # Process one candidate at a time

    scores = []
    for art, cand in pairs:
        inputs = tokenizer(
            art, cand,
            return_tensors="pt",
            truncation="only_first",
            max_length=512
        ).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
            probs = torch.softmax(logits, dim=1)
            scores.append(probs[0][target_label_idx].item())
    return scores

# Scoring Loop
print("Scoring all candidates...")
with open(TEST_RESULTS_FILE, 'wb') as f_out:
    for record in tqdm(generated_data, desc="Scoring"):
        article = record['article']
        cands = record['candidates']

        # Prepare pairs
        pairs = [(article, c) for c in cands]

        # Score FactCC (Label 1 = Correct)
        f_scores = score_batch(factcc_model, factcc_tokenizer, pairs, 1)

        # Score NLI (Label 2 = Entailment)
        n_scores = score_batch(nli_model, nli_tokenizer, pairs, 2)

        record['factcc_scores'] = f_scores
        record['nli_scores'] = n_scores

        f_out.write(orjson.dumps(record) + b'\n')

print(f"Scoring Complete. Results saved to {TEST_RESULTS_FILE}")


--- 3.0: Scoring with Verifiers ---
Loading FactCC...
Loading NLI (RoBERTa)...


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Scoring all candidates...


Scoring:   0%|          | 0/11490 [00:00<?, ?it/s]

Scoring Complete. Results saved to /content/drive/MyDrive/w266_project_final/outputs/test_set_final_results.jsonl



## 4.0: Compute Final Metrics
**Primary Metric:** FactCC score (factuality)  
**Secondary Metric:** ROUGE-L (lexical overlap with reference)  
**Comparison:** Baseline (first candidate) vs. Reranked (best FactCC candidate)


In [None]:

print("\n--- 4.0: Calculating Final Test Metrics ---")

# Reload to ensure clean state
df = pd.DataFrame(generated_data) # We still have it in memory
# Update with scores
df['factcc_scores'] = [r['factcc_scores'] for r in generated_data]
df['nli_scores'] = [r['nli_scores'] for r in generated_data]

# Reranking Logic
df['summary_baseline'] = df['candidates'].apply(lambda x: x[0])
df['summary_factcc'] = df.apply(lambda r: r['candidates'][np.argmax(r['factcc_scores'])], axis=1)
df['summary_nli'] = df.apply(lambda r: r['candidates'][np.argmax(r['nli_scores'])], axis=1)

!pip install rouge_score
rouge = evaluate.load('rouge')

def get_rouge(preds, refs):
    res = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
    return round(res['rougeL'] * 100, 2)

print("\n=== FINAL TEST SET SCORES ===")
print(f"Baseline ROUGE-L: {get_rouge(df['summary_baseline'], df['reference'])}")
print(f"FactCC Rerank ROUGE-L: {get_rouge(df['summary_factcc'], df['reference'])}")
print(f"NLI Rerank ROUGE-L: {get_rouge(df['summary_nli'], df['reference'])}")

print("\n--- Factuality Gains ---")
print(f"Baseline Avg FactCC: {df['factcc_scores'].apply(lambda x: x[0]).mean():.4f}")
print(f"Reranked Avg FactCC: {df['factcc_scores'].apply(max).mean():.4f}")


--- 4.0: Calculating Final Test Metrics ---
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=a25e61ab14e27ce51a12c91c5aa3b056a7a05ab44a8fd86a0544fdf63f62275c
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2

=== FINAL TEST SET SCORES ===
Baseline ROUGE-L: 28.08
FactCC Rerank ROUGE-L: 27.92
NLI Rerank ROUGE-L: 28.07

--- Factuality Gains ---
Baseline Avg FactCC: 0.4268
Reranked Avg FactCC: 0.6609


---
## FINAL RESULTS (Test Set)

### Key Findings:
| Metric | Baseline (First Candidate) | Reranked (Best FactCC) | Œî Change |
|--------|---------------------------|------------------------|----------|
| **FactCC** | 0.4268 | **0.6609** | **+0.2341 (+54.8%)**  |
| **ROUGE-L** | 28.08 | 27.92 | -0.16 (-0.6%)  |

### Success Criteria (from Proposal):
-  **FactCC gain ‚â• +2.0 points:** ACHIEVED (+23.41 points!)
-  **ROUGE-L drop ‚â§ 1.0 point:** ACHIEVED (-0.16 points)

### Interpretation:
- **Factuality improved dramatically** (54.8% relative gain)
- **Minimal fluency cost** (ROUGE-L nearly unchanged)
- **Method achieves stated goal:** Reranking selects more factual summaries without sacrificing readability

