## Human Evaluation Design

### Motivation

Automatic metrics such as FactCC do not necessarily track human judgments of factuality. To check whether FactCC-based reranking produces summaries that readers actually find more accurate, I run a human evaluation on a stratified sample of 50 test-set examples.

### Research Questions

1. When FactCC prefers a reranked summary, do humans agree?
2. What types of errors remain after reranking?
3. Does improving FactCC actually improve perceived factuality?

In [None]:
# 06_Human_Audit_Prep.ipynb
# Purpose: Generate the 50-item audit file from the TEST SET.

import os
import json
import orjson
import pandas as pd
import numpy as np
from google.colab import drive

drive.mount('/content/drive')
PROJECT_ROOT = "/content/drive/MyDrive/w266_project_final"
OUTPUTS_DIR = os.path.join(PROJECT_ROOT, "outputs")

# Pointing to the TEST set
RESULTS_FILE = os.path.join(OUTPUTS_DIR, "test_set_final_results.jsonl")

print(f"Reading from: {RESULTS_FILE}")

records = []
with open(RESULTS_FILE, 'rb') as f:
    for line in f:
        records.append(orjson.loads(line))
df = pd.DataFrame(records)
print(f"Loaded {len(df)} rows.")

# Locate the candidates column: the first column whose name contains 'candidates' or 'generated'
cand_col = [c for c in df.columns if 'candidates' in c or 'generated' in c][0]

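# A row is a "disagreement" when FactCC's top-scored candidate is not
# candidate 0 (the baseline), i.e. reranking changed the output.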
def is_change(row):
    return np.argmax(row['factcc_scores']) != 0

df['changed'] = df.apply(is_change, axis=1)

# Sample: 30 disagreements (reranking changed the output) + 20 agreements (control)
audit_set = pd.concat([
    df[df['changed']].sample(30, random_state=42),
    df[~df['changed']].sample(20, random_state=42)
])

# CSV with Taxonomy Columns
audit_export = []
for i, row in audit_set.iterrows():
    base_idx = 0
    fact_idx = np.argmax(row['factcc_scores'])

    audit_export.append({
        "ID": i,
        "Full_Article": row['article'],
        "Baseline_Summary": row[cand_col][base_idx],
        "Reranked_Summary": row[cand_col][fact_idx],

        "Preferred_Summary": "",  # "Baseline", "Reranked", or "Tie"
        "Error_Type": "",  # "None", "OutE", "Entity", "Contradiction", or "Temporal"
        "Comments": ""
    })

csv_path = os.path.join(OUTPUTS_DIR, "human_audit_TEST_SET_FULL_TEXT.csv")
pd.DataFrame(audit_export).to_csv(csv_path, index=False)

print(f"\n✅ Audit File Created: {csv_path}")
print("Action: Download this CSV. Fill the 'Preferred_Summary' and 'Error_Type' columns.")
print("Then upload it back to Drive as 'human_audit_TEST_SET_FILLED.csv'")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Reading from: /content/drive/MyDrive/w266_project_final/outputs/test_set_final_results.jsonl
Loaded 11490 rows.

✅ Audit File Created: /content/drive/MyDrive/w266_project_final/outputs/human_audit_TEST_SET_FULL_TEXT.csv
Action: Download this CSV. Fill the 'Preferred_Summary' and 'Error_Type' columns.
Then upload it back to Drive as 'human_audit_TEST_SET_FILLED.csv'


## Stratified Sampling Strategy

The 50 examples are drawn from two strata, oversampling the cases where reranking changed the output, since those are the most informative:

| Group | N | Selection Criterion | Purpose |
|-------|---|---------------------|---------|
| **Disagreements** | 30 | `argmax(FactCC) ≠ 0` | Cases where reranking changed the output |
| **Agreements** | 20 | `argmax(FactCC) = 0` | Control cases (baseline was already best) |
| **Total** | 50 | — | — |

**Rationale:**
- Disagreements show the effect of reranking (interesting cases)
- Agreements verify baseline quality (control group)
- 50 examples balances thoroughness with time constraints

**Reproducibility:** A fixed `random_state=42` yields the same sample on regeneration, provided the input file and its row order are unchanged.
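The sampling scheme above can be sketched as a small standalone function. This is a minimal illustration, not the notebook's actual cell: the helper name `stratified_audit_sample` and the toy DataFrame are assumptions made for the example, and the only column it relies on is `factcc_scores`, with candidate 0 treated as the baseline. It also adds a size check per stratum, which the original cell does not have.

```python
import numpy as np
import pandas as pd


def stratified_audit_sample(df, n_changed=30, n_unchanged=20, seed=42):
    """Draw a fixed-size sample from each stratum (hypothetical helper)."""
    # A row is "changed" when the top FactCC score is not at index 0 (the baseline).
    changed = df["factcc_scores"].apply(lambda scores: int(np.argmax(scores)) != 0)
    strata = [(df[changed], n_changed), (df[~changed], n_unchanged)]
    # Fail early with a clear message if a stratum is too small to sample from.
    for stratum, n in strata:
        if len(stratum) < n:
            raise ValueError(f"stratum has {len(stratum)} rows, need {n}")
    return pd.concat([stratum.sample(n, random_state=seed) for stratum, n in strata])


# Toy data (made up): even rows keep the baseline, odd rows prefer candidate 1.
toy = pd.DataFrame({
    "factcc_scores": [
        [0.9, 0.1, 0.2] if i % 2 == 0 else [0.1, 0.8, 0.3] for i in range(100)
    ]
})

first = stratified_audit_sample(toy)
second = stratified_audit_sample(toy)

print(len(first))                        # total sample size
print(first.index.equals(second.index))  # same seed and data give the same rows
```

Calling the function twice on the same frame returns the same rows, which is the property the fixed seed is meant to provide.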

## Annotation Protocol

### Instructions for Annotator

For each of the 50 examples:

1. **Read the full article** carefully
2. **Read both summaries** (Baseline and Reranked)
3. **Record a preference:**
   - `Baseline` — Baseline summary is more accurate
   - `Reranked` — Reranked summary is more accurate  
   - `Tie` — Both are equally accurate (or equally inaccurate)

4. **Identify error types:**
   - `None` — No factual errors detected
   - `OutE` — "Out of thin air" hallucination (not in article)
   - `Entity` — Wrong name, number, date, or location
   - `Contradiction` — Summary contradicts the article
   - `Temporal` — Incorrect time sequence or tense

5. **Add comments** explaining the decision (especially for non-Tie judgments)
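Before analysis, the filled audit file can be checked against the label sets listed above. The sketch below is illustrative only: the helper `invalid_rows` and the inline CSV text are made up for the example, and it assumes one error label per cell. It loads the data with `keep_default_na=False` because pandas may otherwise parse the literal label `None` as a missing value.

```python
import io

import pandas as pd

# Label sets copied from the annotation instructions.
PREFERENCES = {"Baseline", "Reranked", "Tie"}
ERROR_TYPES = {"None", "OutE", "Entity", "Contradiction", "Temporal"}


def invalid_rows(filled):
    """Return the rows whose labels fall outside the allowed sets (hypothetical helper)."""
    pref_ok = filled["Preferred_Summary"].isin(PREFERENCES)
    err_ok = filled["Error_Type"].isin(ERROR_TYPES)
    return filled[~(pref_ok & err_ok)]


# Toy stand-in for the filled audit file; the values are invented for illustration.
csv_text = """ID,Preferred_Summary,Error_Type,Comments
3,Reranked,None,
7,Tie,Entity,wrong date in both
11,baseline,Temporal,
"""

# keep_default_na=False keeps "None" and empty cells as plain strings.
filled = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

bad = invalid_rows(filled)
print(bad["ID"].tolist())  # IDs of rows that need to be corrected by hand
```

Here the third row is flagged because `baseline` is miscased. Matching labels exactly, rather than normalizing case silently, keeps such typos visible to the annotator.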

### Limitations

- Single annotator (no inter-annotator agreement measured)
- Annotator is not blind to condition labels
- Time pressure may have affected judgment quality

These limitations are acknowledged in the paper's Discussion section.