# 05 — Verification (FactCC‑style) & Reranking

**Goal:** Score each candidate summary for **factual consistency** vs. its source article, then **rerank** and compare against the baseline.
We implement a **FactCC‑style NLI verifier** and a robust chunked scoring scheme (sentence‑wise max over source chunks → averaged).

Format: **What/Why → Code (commented) → How to read results**.


## What / Why
**What:** Install/load an **NLI verifier** checkpoint and basic utilities (sentence splitting + chunking).  
**Why:** FactCC is an NLI‑style factuality classifier. If a dedicated FactCC checkpoint isn’t available, a strong **RoBERTa‑large NLI** model works well as a drop‑in verifier for entailment vs. contradiction.


In [None]:

import sys, subprocess, json, re, math
from pathlib import Path
from typing import List, Dict, Any
import numpy as np
import torch

def pip_install(pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q"] + pkgs, check=True)

pip_install([
    "transformers==4.41.2",
    "datasets==2.19.1",
    "evaluate==0.4.2",
    "rouge_score==0.1.2",  # backend that evaluate.load("rouge") needs at scoring time
    "sentencepiece==0.1.99",
])

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from tqdm.auto import tqdm

with open("configs/baseline.json", "r") as f:
    cfg = json.load(f)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CANDIDATE_MODELS = [
    "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli",
    "roberta-large-mnli",
]
loaded = False
for m in CANDIDATE_MODELS:
    try:
        nli_tok = AutoTokenizer.from_pretrained(m, use_fast=True)
        nli_model = AutoModelForSequenceClassification.from_pretrained(m).to(device)
        nli_model.eval()
        VERIFIER_NAME = m
        loaded = True
        break
    except Exception as e:
        print("Failed to load", m, "->", e)

assert loaded, "Could not load any NLI/FactCC checkpoint."
print("Verifier:", VERIFIER_NAME)

SENT_RE = re.compile(r"(?<=[.!?])\s+")
def sent_split(text: str) -> List[str]:
    sents = [s.strip() for s in SENT_RE.split(text) if s.strip()]
    return sents if sents else [text.strip()]

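def chunk_source(text: str, max_tokens: int = 350, stride: int = 60) -> List[str]:
    # Sizes are counted in whitespace-separated words, not model subword tokens.
    # `stride` is the number of words shared by consecutive chunks (the overlap).
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk_words = words[i:i+max_tokens]
        chunks.append(" ".join(chunk_words))
        # Advance by (window - overlap); fall back to a full window if the overlap is too large.
        i += max_tokens - stride if max_tokens > stride else max_tokens
    return chunks if chunks else [text]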


**How to read results:**  
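If you see `Verifier: <model-name>`, the NLI checkpoint loaded correctly. The regex sentence splitter and word-based chunker keep dependencies light and fast. Note that `max_tokens` counts whitespace-separated words, so a 350‑word chunk plus a hypothesis can still exceed the verifier's 512 subword tokens and get truncated.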
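To see what the two helpers do on small inputs, here is a minimal standalone sketch. `split_sentences` and `chunk_words` are hypothetical names that mirror the logic of `sent_split` and `chunk_source` above so the snippet runs on its own; the example strings are made up.

```python
import re
from typing import List

# Same boundary rule as the notebook: split on whitespace that follows ., ! or ?
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text: str) -> List[str]:
    parts = [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]
    return parts if parts else [text.strip()]

def chunk_words(text: str, max_words: int, overlap: int) -> List[str]:
    # Sliding window over whitespace-separated words with `overlap` shared words.
    words = text.split()
    step = max_words - overlap if max_words > overlap else max_words
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
    return chunks if chunks else [text]

# Abbreviations such as "Dr." and "p.m." also trigger a split with this regex.
sents = split_sentences("Dr. Smith arrived. He left at 5 p.m. sharp!")
print(sents)

# 11 words, windows of 4 words, 1 word of overlap between neighbours.
chunks = chunk_words(" ".join(f"w{i}" for i in range(11)), max_words=4, overlap=1)
print(chunks)
```

The first print shows the known weakness of a regex splitter: abbreviations become their own "sentences", which then get scored like any other sentence. If that matters for your data, a proper sentence tokenizer may be worth the extra dependency.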


## What / Why
**What:** Implement the **FactCC‑style score**: for each candidate sentence, compute the **max entailment probability** over all source chunks; the document‑level factuality is the **average** over sentences.  
**Why:** Taking the max lets each summary sentence be supported by whichever part of the article backs it, and chunking keeps each premise within the verifier's 512‑token window, which reduces context truncation on long articles.
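The aggregation itself is independent of the NLI model, so it can be sketched with a toy scorer. In the snippet below, `aggregate_support` and `toy_overlap` are hypothetical names, and the word-overlap scorer is only a stand-in for the entailment probability; it is not how the verifier works.

```python
from statistics import mean
from typing import Callable, List

def aggregate_support(
    sentences: List[str],
    chunks: List[str],
    score_fn: Callable[[str, str], float],
) -> float:
    # For each sentence: best score over all chunks. Then: average over sentences.
    per_sentence = [
        max((score_fn(ch, s) for ch in chunks), default=0.0) for s in sentences
    ]
    return mean(per_sentence) if per_sentence else 0.0

def toy_overlap(premise: str, hypothesis: str) -> float:
    # Stand-in scorer: fraction of hypothesis words that appear in the premise.
    hyp = set(hypothesis.lower().split())
    return len(hyp & set(premise.lower().split())) / len(hyp) if hyp else 0.0

chunks = ["the cat sat on the mat", "the dog barked at night"]
sentences = ["the cat sat", "the dog slept"]

# Sentence 1 is fully covered by chunk 1 (1.0); sentence 2 is best matched
# by chunk 2 (2 of 3 words), so the document score is (1.0 + 2/3) / 2.
score = aggregate_support(sentences, chunks, toy_overlap)
print(round(score, 4))
```

Because the sentence-level step is a max, one unsupported sentence lowers the average but cannot be rescued by strong support for other sentences.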


In [None]:

import torch

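# Label order differs across NLI checkpoints (entailment is not always index 2),
# so resolve the index from the checkpoint's config instead of hard-coding it.
ENTAIL_IDX = next(
    int(i) for i, name in nli_model.config.id2label.items()
    if str(name).lower().startswith("entail")
)

@torch.no_grad()
def entailment_prob(premise: str, hypothesis: str) -> float:
    # Premise = source chunk, hypothesis = candidate sentence; returns P(entailment).
    inputs = nli_tok(premise, hypothesis, truncation=True, max_length=512, return_tensors="pt").to(device)
    logits = nli_model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return float(probs[ENTAIL_IDX])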

def factcc_style_score(source: str, candidate: str, chunk_tokens: int = 350) -> float:
    cand_sents = sent_split(candidate)
    src_chunks = chunk_source(source, max_tokens=chunk_tokens, stride=60)
    per_sent = []
    for s in cand_sents:
        if not s.strip():
            continue
        scores = [entailment_prob(ch, s) for ch in src_chunks]
        per_sent.append(max(scores) if scores else 0.0)
    return float(np.mean(per_sent)) if per_sent else 0.0


**How to read results:**  
This function returns a **single scalar** in `[0, 1]` per candidate; higher means “more supported by the source.” We’ll apply it to each line of the candidates file.
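The score is only meaningful if the probability being read is really the entailment class, and the class order is a property of each checkpoint. A minimal sketch of resolving it from an `id2label`-style mapping (the kind exposed as `model.config.id2label` in `transformers`) follows; `find_entailment_index` is a hypothetical helper and the two mappings are illustrative, not taken from a specific model.

```python
from typing import Dict

def find_entailment_index(id2label: Dict[int, str]) -> int:
    # Match on the label name rather than assuming a fixed position.
    for idx, name in id2label.items():
        if str(name).lower().startswith("entail"):
            return int(idx)
    raise ValueError(f"No entailment label found in {id2label}")

# Two illustrative label orders that a 3-way NLI head might use.
mapping_a = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}
mapping_b = {0: "entailment", 1: "neutral", 2: "contradiction"}

print(find_entailment_index(mapping_a))
print(find_entailment_index(mapping_b))
```

A checkpoint whose config only carries generic names such as `LABEL_0` would raise here, which is preferable to silently scoring the wrong class; in that case the mapping has to come from the model card.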


## What / Why
**What:** Load the **candidates JSONL** from Notebook 04, score each with the verifier, and save `outputs/verification_scores_val.jsonl`.  
**Why:** We want a reusable artifact: `{id, strategy, params, candidate, reference, factcc, rougeL, length, logprob}` that we can aggregate by article and compare strategies.


In [None]:

from collections import defaultdict
import evaluate as hf_evaluate

INPUT_CANDS = "outputs/candidates_val.jsonl"
OUT_SCORES = "outputs/verification_scores_val.jsonl"

_rouge = hf_evaluate.load("rouge")
def rouge_l(hyp: str, ref: str) -> float:
    r = _rouge.compute(predictions=[hyp], references=[ref])
    return float(r["rougeL"])

total = 0
Path(OUT_SCORES).parent.mkdir(parents=True, exist_ok=True)
with open(INPUT_CANDS, "r", encoding="utf-8") as fin, open(OUT_SCORES, "w", encoding="utf-8") as fout:
    for line in tqdm(fin, desc="Scoring candidates"):
        ex = json.loads(line)
        src = ex["source"]
        cand = ex["candidate"]
        ref = ex["reference"]
        fscore = factcc_style_score(src, cand, chunk_tokens=350)
        rec = {
            "id": ex["id"],
            "strategy": ex["strategy"],
            "params": ex["params"],
            "candidate": cand,
            "reference": ref,
            "factcc": fscore,
            "rougeL": rouge_l(cand, ref),
            "length": ex["length"],
            "logprob": ex["logprob"],
        }
        fout.write(json.dumps(rec) + "\n")
        total += 1

print(f"Wrote {total} records to {OUT_SCORES}")


**How to read results:**  
You’ll see a progress bar as each candidate is scored. The output JSONL now contains `factcc` and `rougeL` per candidate, which we’ll aggregate next.
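Before aggregating per article, a quick per‑strategy average is a cheap sanity check on the scores file. The sketch below works on in‑memory records with the same `strategy`, `factcc`, and `rougeL` keys as the output JSONL; `mean_by_strategy` is a hypothetical helper, `"sampling"` is a made‑up strategy name, and all numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def mean_by_strategy(records: List[dict], field: str) -> Dict[str, float]:
    # Group one numeric field by decoding strategy and average each group.
    groups = defaultdict(list)
    for rec in records:
        groups[rec["strategy"]].append(rec[field])
    return {name: mean(vals) for name, vals in groups.items()}

# Invented records; in the notebook these would be parsed from the scores JSONL.
records = [
    {"id": "a1", "strategy": "diverse_beam", "factcc": 0.80, "rougeL": 0.30},
    {"id": "a2", "strategy": "diverse_beam", "factcc": 0.60, "rougeL": 0.20},
    {"id": "a1", "strategy": "sampling", "factcc": 0.50, "rougeL": 0.25},
]

fact_means = mean_by_strategy(records, "factcc")
print(fact_means)
```

If one strategy has a much lower mean `factcc` than the others, it is worth spot‑checking a few of its candidates before trusting the reranking numbers.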


## What / Why
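**What:** Compare three selectors per article: **Baseline** (beam from Notebook 03; selected here as the highest‑log‑prob `diverse_beam` candidate, falling back to the highest‑log‑prob candidate overall), **Best‑of‑K by model log‑prob**, and **Reranked by FactCC**.  
**Why:** Our success criterion is **≥ +2.0 FactCC** over baseline with **≤ 1.0 ROUGE‑L drop**. These thresholds are presumably points on a 0–100 scale, whereas the scores computed here lie in `[0, 1]`, so the equivalent deltas are +0.02 and −0.01. We’ll compute per‑article scores for each selector and compare their averages.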


In [None]:

from statistics import mean

by_id = defaultdict(list)
with open(OUT_SCORES, "r", encoding="utf-8") as f:
    for line in f:
        ex = json.loads(line)
        by_id[ex["id"]].append(ex)

def select_baseline(cands):
    beams = [c for c in cands if c["strategy"] == "diverse_beam"]
    if beams:
        return max(beams, key=lambda x: x["logprob"])
    return max(cands, key=lambda x: x["logprob"])

def select_best_of_k(cands):
    return max(cands, key=lambda x: x["logprob"])

def select_reranked_factcc(cands):
    return max(cands, key=lambda x: x["factcc"])

rows = []
for i, cands in by_id.items():
    base = select_baseline(cands)
    bestk = select_best_of_k(cands)
    rerank = select_reranked_factcc(cands)
    rows.append({
        "id": i,
        "baseline_factcc": base["factcc"],
        "baseline_rougeL": base["rougeL"],
        "bestk_factcc": bestk["factcc"],
        "bestk_rougeL": bestk["rougeL"],
        "rerank_factcc": rerank["factcc"],
        "rerank_rougeL": rerank["rougeL"],
    })

def avg(col): return mean(r[col] for r in rows)

summary = {
    "n_articles": len(rows),
    "baseline": {"factcc": avg("baseline_factcc"), "rougeL": avg("baseline_rougeL")},
    "best_of_k": {"factcc": avg("bestk_factcc"), "rougeL": avg("bestk_rougeL")},
    "reranked": {"factcc": avg("rerank_factcc"), "rougeL": avg("rerank_rougeL")},
    "deltas_vs_baseline": {
        "best_of_k": {
            "factcc": avg("bestk_factcc") - avg("baseline_factcc"),
            "rougeL": avg("bestk_rougeL") - avg("baseline_rougeL"),
        },
        "reranked": {
            "factcc": avg("rerank_factcc") - avg("baseline_factcc"),
            "rougeL": avg("rerank_rougeL") - avg("baseline_rougeL"),
        }
    }
}

Path("outputs").mkdir(parents=True, exist_ok=True)
with open("outputs/rerank_summary_val.json", "w") as f:
    json.dump(summary, f, indent=2)

print(json.dumps(summary, indent=2))


**How to read results:**  
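Focus on `deltas_vs_baseline.reranked.factcc` (target: **≥ +2.0** points, i.e. ≥ +0.02 on the `[0, 1]` scale reported here, assuming the target is stated on a 0–100 scale) and `deltas_vs_baseline.reranked.rougeL` (tolerance: a drop of at most **1.0** point, i.e. ≥ −0.01). If ROUGE‑L drops too much, we’ll tune decoding or apply constrained decoding as a fallback in ablations later.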
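A small check can make the pass/fail decision explicit. The sketch below assumes the thresholds are expressed in points on a 0–100 scale while the summary deltas are on `[0, 1]`; that scale conversion is an assumption, `meets_criterion` is a hypothetical helper, and the two summaries use invented numbers rather than results from this notebook.

```python
def meets_criterion(
    summary: dict,
    min_factcc_gain_pts: float = 2.0,
    max_rouge_drop_pts: float = 1.0,
) -> bool:
    # Assumption: thresholds are in points (0-100); summary deltas are in [0, 1].
    deltas = summary["deltas_vs_baseline"]["reranked"]
    factcc_gain_pts = 100.0 * deltas["factcc"]
    rouge_change_pts = 100.0 * deltas["rougeL"]
    return factcc_gain_pts >= min_factcc_gain_pts and rouge_change_pts >= -max_rouge_drop_pts

# Invented deltas, shaped like the "deltas_vs_baseline" part of the summary above.
passing = {"deltas_vs_baseline": {"reranked": {"factcc": 0.031, "rougeL": -0.004}}}
failing = {"deltas_vs_baseline": {"reranked": {"factcc": 0.010, "rougeL": -0.004}}}

print(meets_criterion(passing))
print(meets_criterion(failing))
```

If the targets turn out to be defined on the raw `[0, 1]` scale instead, drop the factor of 100 and adjust the defaults accordingly.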


## (Optional) What / Why
**What:** Save **best‑per‑article** selections for human review and future stats.  
**Why:** We’ll need these for the 50‑item human audit and bootstrap tests later.
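As a preview of the bootstrap tests, here is a minimal sketch of a paired percentile bootstrap over per‑article deltas (reranked minus baseline). `paired_bootstrap_ci` is a hypothetical helper, the percentile method is one common choice rather than a method prescribed by this project, and the deltas are invented.

```python
import random
from statistics import mean
from typing import List, Tuple

def paired_bootstrap_ci(
    deltas: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    # Resample articles with replacement and record the mean delta each time.
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int(round((alpha / 2) * n_resamples)))
    hi_idx = min(n_resamples - 1, int(round((1 - alpha / 2) * n_resamples)) - 1)
    return mean(deltas), means[lo_idx], means[hi_idx]

# Invented per-article FactCC deltas on the [0, 1] scale.
deltas = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.06, 0.01]
point, lo, hi = paired_bootstrap_ci(deltas)
print(f"mean delta={point:.4f}, 95% CI=[{lo:.4f}, {hi:.4f}]")
```

Resampling whole articles keeps the baseline and reranked scores paired, which is why the selections file stores both per `id`. An interval that excludes zero suggests the gain is not just article‑to‑article noise.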


In [None]:

BEST_OUT = "outputs/reranked_val_selections.jsonl"
with open(BEST_OUT, "w", encoding="utf-8") as f:
    for i, cands in by_id.items():
        # Reuse the selectors defined in the comparison cell so both cells stay in sync.
        base = select_baseline(cands)
        rerank = select_reranked_factcc(cands)
        rec = {
            "id": i,
            "baseline_candidate": base["candidate"],
            "reranked_candidate": rerank["candidate"],
            "reference": base["reference"],
            "baseline_factcc": base["factcc"],
            "reranked_factcc": rerank["factcc"],
            "baseline_rougeL": base["rougeL"],
            "reranked_rougeL": rerank["rougeL"],
        }
        f.write(json.dumps(rec) + "\n")
print(f"Wrote selections to {BEST_OUT}")


**How to read results:**  
You now have a compact file to sample 50 items for the human audit. The JSONL stores both candidates and the reference for quick rubric‑based checks.
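Drawing the audit sample with a fixed seed keeps it reproducible across reruns. A minimal sketch, assuming article ids are strings; `sample_for_audit`, the seed value, and the ids are hypothetical.

```python
import random
from typing import Iterable, List

def sample_for_audit(ids: Iterable[str], k: int = 50, seed: int = 13) -> List[str]:
    # Sort first so the draw does not depend on file or dict ordering.
    pool = sorted(ids)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))

# Invented ids; in the notebook these would be the `id` values in the selections file.
all_ids = [f"article-{i:04d}" for i in range(300)]
audit_ids = sample_for_audit(all_ids)
print(len(audit_ids), audit_ids[:3])
```

Sampling ids rather than rows means the same 50 articles can be pulled again later if the selections file is regenerated.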
