## Factuality Verification Approach

### The Reranking Hypothesis

Given K candidate summaries, I hypothesize that a **factuality verifier** can identify which candidate is most consistent with the source article. The best candidate is then selected as the final output.
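
The selection step amounts to an argmax over verifier scores. The sketch below is illustrative rather than the notebook's own code: `select_best` is a hypothetical helper, and the scores are made-up numbers, not model outputs.

```python
def select_best(candidates, scores):
    """Return (index, candidate) for the highest verifier score.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and equal in length")
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return best_idx, candidates[best_idx]


# Illustrative values only (not real verifier outputs).
candidates = ["summary A", "summary B", "summary C"]
scores = [0.41, 0.87, 0.65]
best_idx, best_summary = select_best(candidates, scores)
print(best_idx, best_summary)
```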

### Why Two Verifiers?

I test two complementary verifiers to check whether the reranking result is robust to the choice of verifier:

1. **FactCC** (Kryscinski et al., 2020): Trained specifically for summarization factuality
2. **RoBERTa-MNLI**: General-purpose NLI model that checks logical entailment

If both verifiers prefer the same candidate, we can place more confidence in the selection, since the two models were trained on different data and objectives.
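
One way to make "agreement" concrete is to check whether both verifiers rank the same candidate first. This is a minimal sketch under that assumption: `top_index`, `verifiers_agree`, and `agreement_rate` are hypothetical helpers, the field names `factcc_scores` and `nli_scores` follow the ones written by this notebook's scoring loop, and the score values are invented.

```python
def top_index(scores):
    """Index of the highest score (earliest index wins ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def verifiers_agree(record):
    """True if FactCC and the NLI model pick the same top candidate."""
    return top_index(record["factcc_scores"]) == top_index(record["nli_scores"])


def agreement_rate(records):
    """Fraction of records on which the two verifiers agree."""
    records = list(records)
    if not records:
        return 0.0
    return sum(verifiers_agree(r) for r in records) / len(records)


# Invented scores for two articles with three candidates each.
records = [
    {"factcc_scores": [0.2, 0.9, 0.4], "nli_scores": [0.1, 0.8, 0.3]},
    {"factcc_scores": [0.7, 0.2, 0.1], "nli_scores": [0.3, 0.6, 0.1]},
]
print(agreement_rate(records))
```

Top-1 agreement is only one possible definition; a rank correlation over all K scores would be a softer alternative.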

In [None]:
# 04_Rerank_&_Score.ipynb
# Purpose:
# 1. Load the K=5 candidates file generated by notebook 03.
# 2. Load TWO standard verifiers directly from Hugging Face:
# - Model A: FactCC (Specialized for summarization)
# - Model B: RoBERTa-MNLI (General NLI / Logic)
# 3. Score all candidates.
# 4. Save the final results for analysis.

import os
import json
import torch
import orjson
from google.colab import drive
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification

print("--- 1.0: Setup ---")
drive.mount('/content/drive')

PROJECT_ROOT = "/content/drive/MyDrive/w266_project_final"
OUTPUTS_DIR = os.path.join(PROJECT_ROOT, "outputs")
CANDIDATES_FILE = os.path.join(OUTPUTS_DIR, "validation_candidates_k5.jsonl")
SCORED_FILE = os.path.join(OUTPUTS_DIR, "validation_results_with_scores.jsonl")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

--- 1.0: Setup ---
Mounted at /content/drive
Using device: cuda


In [None]:

# 2.0: Load Models (Direct from Hugging Face)

print("\n--- 2.0: Loading Verifiers ---")

# Model A: FactCC
# Specialized model for checking summary consistency
print("Loading FactCC...")
factcc_id = "manueldeprada/FactCC"
factcc_tokenizer = AutoTokenizer.from_pretrained(factcc_id)
factcc_model = AutoModelForSequenceClassification.from_pretrained(factcc_id).to(device)
factcc_model.eval()

#  Model B: RoBERTa-MNLI (The "General Logic" Verifier)
# This replaces AlignScore with a standard NLI approach.
# It checks if the Article "Entails" the Summary.
print("Loading RoBERTa-MNLI...")
nli_id = "roberta-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(nli_id)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_id).to(device)
nli_model.eval()

print("Both models loaded successfully!")


--- 2.0: Loading Verifiers ---
Loading FactCC...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Loading RoBERTa-MNLI...



Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Both models loaded successfully!


## Verifier Details

### Model A: FactCC

**Source:** `manueldeprada/FactCC` on Hugging Face (BERT-base architecture)

**How it works:**
1. Concatenate article and summary: `[CLS] Article [SEP] Summary [SEP]`
2. Pass through BERT encoder
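3. Classify as Consistent (`CORRECT`) or Inconsistent (`INCORRECT`)
4. Extract `P(Consistent)` as the factuality score

**Label order caveat:** The class index for `CORRECT` should be read from `factcc_model.config.id2label` rather than assumed. The original FactCC code lists its labels as `["CORRECT", "INCORRECT"]`, which would put the consistent class at index 0, not 1.

**Known limitation:** FactCC is trained on synthetically corrupted summaries, which may not capture the hallucination patterns of real summarization models.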

### Model B: RoBERTa-MNLI

**Source:** `roberta-large-mnli` on Hugging Face

**How it works:**
1. Format as NLI pair: `<s> Article </s></s> Summary </s>`
2. Classify relationship as Contradiction (0), Neutral (1), or Entailment (2)
3. Extract `P(Entailment)` as the supportedness score

**Rationale:** If an article "entails" a summary, all claims in the summary should be supported by the article.
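
Both verifiers turn classifier logits into a single probability in the same way: apply a softmax, then read off the entry for the label of interest. The sketch below shows that step in plain Python. The `id2label` dictionary mirrors the label order listed above for `roberta-large-mnli`, the logits are made-up numbers, and `label_probability` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probability(logits, id2label, label):
    """Probability assigned to `label`, with its index resolved by name."""
    label_to_idx = {str(name).upper(): int(idx) for idx, name in id2label.items()}
    return softmax(logits)[label_to_idx[label.upper()]]


# Label order as listed for roberta-large-mnli; logits are illustrative only.
nli_id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}
p_entail = label_probability([-1.2, 0.3, 2.5], nli_id2label, "ENTAILMENT")
print(round(p_entail, 3))
```

Resolving the index by name instead of hard-coding it guards against checkpoints whose label order differs from what the surrounding comments assume.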

In [None]:
#  Helper Functions

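def get_factcc_score(article, candidate):
    """Return P(CORRECT): the probability that the candidate is consistent with the article."""
    # FactCC format: [CLS] Article [SEP] Candidate [SEP]
    inputs = factcc_tokenizer(
        article,
        candidate,
        return_tensors="pt",
        max_length=512,
        truncation="only_first",  # truncate the article, never the candidate
        padding=True
    ).to(device)

    # Resolve the "CORRECT" class index from the model config instead of hard-coding it.
    # The original FactCC label order is ["CORRECT", "INCORRECT"], so index 1 may be the
    # wrong class. A KeyError here means the checkpoint uses different label names.
    label_to_idx = {str(name).upper(): int(idx)
                    for idx, name in factcc_model.config.id2label.items()}
    correct_idx = label_to_idx["CORRECT"]

    with torch.no_grad():
        logits = factcc_model(**inputs).logits
        probs = torch.softmax(logits, dim=1)
        return probs[0][correct_idx].item()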

def get_nli_score(article, candidate):
    """Return P(Entailment): the probability that the article entails the candidate."""
    # NLI format: <s> Article </s></s> Candidate </s>
    inputs = nli_tokenizer(
        article,
        candidate,
        return_tensors="pt",
        max_length=512,
        truncation="only_first",  # truncate the article, never the candidate
        padding=True
    ).to(device)

    # Resolve the "ENTAILMENT" class index from the model config
    # (roberta-large-mnli: 0=CONTRADICTION, 1=NEUTRAL, 2=ENTAILMENT).
    label_to_idx = {str(name).upper(): int(idx)
                    for idx, name in nli_model.config.id2label.items()}
    entail_idx = label_to_idx["ENTAILMENT"]

    with torch.no_grad():
        logits = nli_model(**inputs).logits
        probs = torch.softmax(logits, dim=1)
        return probs[0][entail_idx].item()

## Scoring Implementation

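### Summary-Level Scoring

For each candidate summary $c$ and source article $a$, each verifier assigns one score:

$$\text{score}_{\text{FactCC}}(a, c) = P(\text{Consistent} \mid a, c), \qquad \text{score}_{\text{NLI}}(a, c) = P(\text{Entailment} \mid a, c)$$

Each verifier processes the full article-summary pair in a single forward pass and returns one score per candidate; no sentence-level splitting or aggregation is performed.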

### Truncation Strategy

Both verifiers have a 512-token limit. We use `truncation="only_first"` to:
- Truncate the **article** if too long
- Keep the **summary** intact

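This means long articles lose content from the end. Since news articles typically front-load important information, this is a reasonable trade-off for CNN/DailyMail, although a summary claim supported only by the truncated tail will look unsupported to the verifier.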
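
The effect of `truncation="only_first"` can be sketched on plain token lists. This is an illustration of the budget arithmetic, not the tokenizer's actual implementation: `truncate_only_first` is a hypothetical helper, and `num_special_tokens` is 3 for the BERT-style pair format (`[CLS]`, `[SEP]`, `[SEP]`) and 4 for the RoBERTa-style one (`<s>`, `</s></s>`, `</s>`).

```python
def truncate_only_first(article_tokens, summary_tokens,
                        max_length=512, num_special_tokens=3):
    """Trim the article so that article + summary + special tokens fit."""
    budget = max_length - num_special_tokens - len(summary_tokens)
    if budget < 0:
        raise ValueError("summary alone exceeds the length limit")
    # Only the first sequence (the article) is shortened, and only from the end.
    return article_tokens[:budget], summary_tokens


# A 600-token article with a 60-token summary, BERT-style pair format.
article = [f"a{i}" for i in range(600)]
summary = [f"s{i}" for i in range(60)]
kept_article, kept_summary = truncate_only_first(article, summary)
print(len(kept_article), len(kept_summary))
```

With these numbers the article keeps its first 449 tokens (512 - 3 - 60), so the article's last 151 tokens are never seen by the verifier.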

### Computational Cost

Scoring all 5 candidates for 2,000 articles with both verifiers (20,000 forward passes, one pair at a time) takes ~7-8 minutes on a T4 GPU. This is fast enough for research purposes, but production use would need batching or a similar optimization.
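
One possible optimization, not used in this notebook, is to score the K candidates of an article in a single batch per verifier. The sketch below only covers the bookkeeping. `build_pairs`, `iter_batches`, and `count_forward_passes` are hypothetical helpers, and the resulting speedup has not been measured here.

```python
def build_pairs(article, candidates):
    """Pair one article with each of its K candidate summaries."""
    return [(article, cand) for cand in candidates]


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def count_forward_passes(n_articles, k, n_verifiers, batch_size):
    """Number of model calls if each article's K pairs are scored in batches."""
    batches_per_article = -(-k // batch_size)  # ceiling division
    return n_articles * batches_per_article * n_verifiers


print(count_forward_passes(2000, 5, 2, batch_size=1))  # 20000
print(count_forward_passes(2000, 5, 2, batch_size=5))  # 4000
```

In a batched variant, the articles and candidates of one batch would be passed to the tokenizer as two parallel lists with `padding=True`, and the target-label probability would be read from each row of the softmax output.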

In [None]:

# Main Scoring Loop

print(f"\n--- 3.0: Scoring Candidates ---")

with open(CANDIDATES_FILE, 'rb') as f_in, open(SCORED_FILE, 'wb') as f_out:
    # Read all lines up front so tqdm can display a total
    lines = f_in.readlines()

    for line in tqdm(lines, desc="Scoring"):
        if not line.strip():
            continue  # skip blank lines

        record = orjson.loads(line)
        article = record['article']
        # Dynamically find the candidates key
        cand_key = [k for k in record.keys() if 'generated_candidates' in k][0]
        candidates = record[cand_key]

        factcc_scores = []
        nli_scores = []

        for cand in candidates:
            # Score with FactCC
            f_score = get_factcc_score(article, cand)
            factcc_scores.append(f_score)

            # Score with NLI (RoBERTa)
            n_score = get_nli_score(article, cand)
            nli_scores.append(n_score)

        # Add scores to record
        record['factcc_scores'] = factcc_scores
        record['nli_scores'] = nli_scores # Replaces "alignscores"

        f_out.write(orjson.dumps(record) + b'\n')

print("\nDone! Scores saved.")
print(f"Results at: {SCORED_FILE}")


--- 3.0: Scoring Candidates ---


Scoring:   0%|          | 0/2000 [00:00<?, ?it/s]


Done! Scores saved.
Results at: /content/drive/MyDrive/w266_project_final/outputs/validation_results_with_scores.jsonl
