### LLM-based peer review simulation setup

This notebook configures an **LLM-based reviewer + editor pipeline** for ICLR-style peer review.

We use the parsed manuscripts in `ICLR2025_papers/` and the historical **human reviews and decisions** in the CSV file to:

- Construct reviewer prompts for an LLM to read each manuscript and produce a JSON review (summary, strengths, weaknesses, questions, decision).
- Construct editor/area chair prompts for an LLM to read both **human reviews + LLM reviews** and output a final decision and rationale.

The goal is to simulate how replacing some proportion of human reviewers with LLM agents changes **paper-level decisions and written reviews**.


In [34]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
%matplotlib inline

# Path to the CSV created in 01_get_human_review.ipynb
# Ensure this file exists in your directory or adjust the path
DATA_PATH = Path("ICLR2025_human_reviews_with_decision.csv")
if not DATA_PATH.exists():
    # Fallback if the specific filename differs
    DATA_PATH = Path("ICLR2025_human_reviews.csv")

print(f"Loading data from: {DATA_PATH.resolve()}")


if not DATA_PATH.exists():
    raise FileNotFoundError(f"Could not find {DATA_PATH}. Please run 01_get_human_review.ipynb first.")

reviews_df = pd.read_csv(DATA_PATH)
print("Raw reviews shape:", reviews_df.shape)
reviews_df.head(2)


Loading data from: /Users/cathychen/Library/CloudStorage/OneDrive-Stanford/Research/MIMIR/ICLR_simulation/GABM_peer_review/ICLR2025_human_reviews_with_decision.csv
Raw reviews shape: (35351, 10)


Unnamed: 0,paper_forum,paper_id,title,decision,review_id,reviewer_id,summary,strengths,weaknesses,questions
0,zkNCWtw2fd,zkNCWtw2fd,Synergistic Approach for Simultaneous Optimiza...,Reject,UXJcl5skaD,ICLR.cc/2025/Conference/Submission14290/Review...,This paper introduces a hybrid batch training ...,1. Addresses a relevant challenge in multiling...,1. The primary contribution merely combines tw...,None.
1,zkNCWtw2fd,zkNCWtw2fd,Synergistic Approach for Simultaneous Optimiza...,Reject,bXNaEp660n,ICLR.cc/2025/Conference/Submission14290/Review...,The paper studies information retrieval tasks ...,The paper shows that two standard batching str...,1. Limited evaluations are only QA datasets (e...,"1. Related to weakness1, could this proposed m..."


In [35]:
from dotenv import load_dotenv
import os

# Load environment variables from .env in the current working directory
load_dotenv()

# Optional sanity check (does NOT print the whole key)
key = os.environ.get("OPENAI_API_KEY")
print("Has OPENAI_API_KEY:", bool(key))
if key:
    print("Key prefix:", key[:7], "...")  # Just to confirm it's being read

Has OPENAI_API_KEY: True
Key prefix: sk-proj ...


In [36]:
# Parse numeric rating from strings like "8: Strong accept" if needed (optional for this flow but good for reference)
def parse_rating(val):
    if pd.isna(val):
        return np.nan
    if isinstance(val, (int, float)):
        return float(val)
    s = str(val).strip()
    # Try splitting on ':' first, then space
    for sep in (":", " "):
        try:
            num = float(s.split(sep)[0])
            return num
        except (ValueError, IndexError):
            continue
    return np.nan

if "rating" in reviews_df.columns:
    reviews_df["rating_num"] = reviews_df["rating"].apply(parse_rating)
    print("Fraction of reviews with parsed numeric rating:", reviews_df["rating_num"].notna().mean())


# Aggregate to paper-level statistics (useful for selecting papers to simulate)

paper_stats = (
    reviews_df.groupby("paper_id")
    .agg(
        n_reviews=("review_id", "count"),
        decision=("decision", lambda x: x.dropna().iloc[0] if len(x.dropna()) else np.nan),
        title=("title", "first"),
    )
    .reset_index()
)

print("Paper-level stats shape:", paper_stats.shape)
paper_stats.head()


Paper-level stats shape: (8727, 4)


Unnamed: 0,paper_id,n_reviews,decision,title
0,00SnKBGTsz,4,Accept (Spotlight),DataEnvGym: Data Generation Agents in Teacher ...
1,00ezkB2iZf,4,Reject,MedFuzz: Exploring the Robustness of Large Lan...
2,01wMplF8TL,4,Reject,INSTRUCTION-FOLLOWING LLMS FOR TIME SERIES PRE...
3,029hDSVoXK,5,Accept (Poster),Dynamic Neural Fortresses: An Adaptive Shield ...
4,02Od16GFRW,3,Reject,Ensembles provably learn equivariance through ...


### LLM-based reviewer/editor simulation (decision + written review)

In this section we setup the **LLM-based reviewer/editor simulation**.

For each manuscript we:

1. **Load the parsed manuscript text** from `ICLR2025_papers/<paper_id>/<paper_id>.tei.xml`.
2. **Prompt an LLM reviewer** to read the manuscript and output JSON with `summary`, `strengths`, and `weaknesses`. 
3. **Prompt an LLM editor/area chair** to read the *human reviews from CSV* plus the LLM-generated review(s) and make a final decision.

In [39]:
from pathlib import Path
import xml.etree.ElementTree as ET

PAPERS_ROOT = Path("ICLR2025_papers")


def load_tei_for_paper(paper_id: str) -> Path:
    """Return the TEI XML path for a given ICLR paper_id.

    Assumes the structure ICLR2025_papers/<paper_id>/<paper_id>.tei.xml.
    """
    # Sometimes folder name is paper_id, sometimes it might be forum ID. 
    # Adjust if your folder structure differs. 
    tei_path = PAPERS_ROOT / paper_id / f"{paper_id}.tei.xml"
    
    # Fallback check: sometimes folder exists but file name might differ slightly or be in a subfolder
    if not tei_path.exists():
        # Try just the paper_id folder if the tei file is named differently
        potential_files = list((PAPERS_ROOT / paper_id).glob("*.tei.xml"))
        if potential_files:
            return potential_files[0]
            
    if not tei_path.exists():
        raise FileNotFoundError(f"TEI file not found for paper_id={paper_id}: {tei_path}")
    return tei_path

def extract_text_from_tei(tei_path: Path, max_chars: int = 30000) -> str:
    """Lightweight TEI â†’ plain text extractor (title + abstract + body paragraphs).

    This is intentionally simple; you can swap it out for a more faithful converter later.
    """

    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    try:
        tree = ET.parse(tei_path)
        root = tree.getroot()
    except Exception as e:
        return f"[Error parsing TEI XML: {e}]"

    parts: list[str] = []

    # Title
    title_el = root.find(".//tei:titleStmt/tei:title", ns)
    if title_el is not None and (title_el.text or "").strip():
        parts.append(title_el.text.strip())

    # Abstract
    for p in root.findall(".//tei:abstract//tei:p", ns):
        if p.text and p.text.strip():
            parts.append(p.text.strip())

    # Body
    for p in root.findall(".//tei:body//tei:p", ns):
        if p.text and p.text.strip():
            parts.append(p.text.strip())

    text = "\n\n".join(parts)
    
    # Truncate if too long for context window
    if max_chars is not None and len(text) > max_chars:
        text = text[:max_chars] + "\n\n[TRUNCATED FOR PROMPT LENGTH]"
    return text


def get_paper_text(paper_id: str, max_chars: int = 30000) -> str:
    tei_path = load_tei_for_paper(paper_id)
    return extract_text_from_tei(tei_path, max_chars=max_chars)

In [40]:
# Bundle human reviews for a given paper into a simple JSON-friendly structure

HUMAN_COLUMNS = ["review_id", "reviewer_id", "summary", "strengths", "weaknesses", "questions", "decision"]
# Ensure columns exist
valid_cols = [c for c in HUMAN_COLUMNS if c in reviews_df.columns]

def get_human_reviews_for_paper(paper_id: str) -> list[dict]:
    rows = reviews_df.loc[reviews_df["paper_id"] == paper_id, valid_cols]
    reviews: list[dict] = []
    for _, r in rows.iterrows():
        review_dict = {
            "source": "human",
            "review_id": r.get("review_id"),
            "reviewer_id": r.get("reviewer_id"),
            "summary": r.get("summary"),
            "strengths": r.get("strengths"),
            "weaknesses": r.get("weaknesses"),
            "questions": r.get("questions"),
            "rating": r.get("rating"),
            "decision": r.get("decision"),
        }
        reviews.append(review_dict)
    return reviews

### AutoGen Simulation Helpers
Copied and adapted from LLM_review_simulation.ipynb for use with our parsed papers.

In [41]:

import os
import json
import random
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

from openai import OpenAI

try:
    from autogen import AssistantAgent, UserProxyAgent
except ImportError:
    print("AutoGen not installed. Please run: pip install autogen-agentchat")

# Configuration
MODEL_NAME = "gpt-4o-mini"
DEFAULT_LLM_CONFIG = {
    "model": MODEL_NAME,
    "temperature": 0.7,
    "max_tokens": 1200,
    "timeout": 120,
}

DECISION_LABELS = ["oral", "spotlight", "poster", "reject"]
# Adjust decisions if your dataset uses different labels (e.g. "Accept (Oral)", "Reject")
# For ICLR, these are standard.

SENIORITY_LABELS = {
    "grad": "Graduate student reviewer",
    "junior": "Junior faculty reviewer",
    "senior": "Senior faculty reviewer",
}

In [54]:
REVIEWER_PROMPT_TEMPLATE = """You are serving as an ICLR reviewer with expertise in {expertise}.
Seniority: {seniority_label}.

Guidelines:
- Carefully read the entire manuscript text provided to you.
- Wait until you receive the message END_OF_PAPER before replying.
- Provide a fair, evidence-based review using first person plural.
- Produce exactly one JSON object as specified below; do not add commentary or prose outside JSON.

JSON schema (keys are required):
{{
  "persona": {{"expertise": string, "seniority": string}},
  "review": {{
    "summary": string,
    "strengths": [string],
    "weaknesses": [string],
    "questions": [string],
    "confidence": integer 1-5,
  }}
}}
"""

In [56]:
# # Updated Editor Prompt to handle both Human and LLM reviews
# EDITOR_PROMPT = """You are the area chair for an ICLR submission.
# You will receive the manuscript text plus a list of reviews (some human, some LLM-generated).
# Tasks:
# 1. Identify consensus and disagreements across ALL reviews.
# 2. Justify a single decision in {decisions}.
# 3. Return a strict JSON object without code fences: 
# {{
#   "decision": one of {decisions},
#   "confidence": float between 0 and 1,
#   "rationale": short paragraph,
#   "highlights": [string]
# }}
# """.format(decisions=DECISION_LABELS)

EDITOR_PROMPT = """You are serving as the Area Chair for an ICLR submission.
You will receive the manuscript text plus a list of reviews (some human-written, some LLM-generated).

Your responsibilities:

1. Analyze all reviews:
   - Identify clear points of consensus across reviewers.
   - Identify and explain disagreements, including which reviewers disagree and why.
   - Flag any reviewer misunderstandings, unsupported claims, or inconsistencies.

2. Conduct an independent Area Chair assessment:
   - Evaluate novelty, technical correctness, significance, empirical rigor, and clarity.
   - Weigh reviewer feedback appropriately, but do not rely on it blindly.
   - Distinguish between major and minor concerns.
   - Note any methodological flaws, missing experiments, ethical issues, or correctness problems that materially affect acceptance.

3. Make and justify the final decision (one of {decisions}):
   - Base the decision on reviewer consensus, resolved disagreements, and your own AC-level judgment.
   - Ensure the justification reflects ICLR standards: correctness > novelty > significance > clarity.
   - The justification must be factual, grounded in the manuscript + reviews, and internally consistent.

4. Return a strict JSON object with no code fences:
{{
  "decision": one of {decisions},
  "confidence": float between 0 and 1,
  "rationale": "A concise paragraph explaining the decision, referencing consensus, disagreements, and your independent assessment.",
  "highlights": [
    "Key factor influencing the decision",
    "Another key factor",
    "Major consensus or disagreement point",
    "Any important caveat or uncertainty"
  ]
}}
""".format(decisions=DECISION_LABELS)


In [43]:
@dataclass
class ReviewerPersona:
    name: str
    expertise: str
    seniority: str

    @property
    def seniority_label(self) -> str:
        return SENIORITY_LABELS.get(self.seniority, self.seniority)

def require_api_key() -> str:
    # Ensure you have set this environment variable!
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        print("WARNING: OPENAI_API_KEY not set. Simulation calls will fail.")
        return ""
    return api_key

def build_llm_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    config: Dict[str, Any] = {**DEFAULT_LLM_CONFIG}
    if overrides:
        config.update(overrides)
    
    # Filter config list to only include valid keys if using OpenAI
    config_list = [
        {
            "model": config.get("model", MODEL_NAME),
            "api_key": require_api_key(),
        }
    ]
    config["config_list"] = config_list
    return config

def make_reviewer_agent(persona: ReviewerPersona, overrides: Optional[Dict[str, Any]] = None) -> AssistantAgent:
    llm_config = build_llm_config(overrides)
    system_message = REVIEWER_PROMPT_TEMPLATE.format(
        expertise=persona.expertise,
        seniority_label=persona.seniority_label,
    )
    return AssistantAgent(
        name=persona.name,
        system_message=system_message,
        llm_config=llm_config,
    )
def make_editor_agent(overrides: Optional[Dict[str, Any]] = None) -> AssistantAgent:
    llm_config = build_llm_config(overrides)
    return AssistantAgent(
        name="area_chair",
        system_message=EDITOR_PROMPT,
        llm_config=llm_config,
    )

def strip_code_fences(text: str) -> str:
    stripped = text.strip()
    if stripped.startswith("```") and stripped.endswith("```"):
        stripped = stripped.strip("`")
        if "\n" in stripped:
            _, _, remainder = stripped.partition("\n")
            stripped = remainder
    return stripped.strip()

def parse_json_response(raw_text: str) -> Dict[str, Any]:
    cleaned = strip_code_fences(raw_text)
    cleaned = cleaned.replace("\u200b", "").strip()
    if not cleaned:
        raise ValueError("Empty response")
    return json.loads(cleaned)

def message_content(message: Any) -> str:
    if message is None: return ""
    if isinstance(message, str): return message
    if isinstance(message, dict): return str(message.get("content", ""))
    if hasattr(message, "content"): return str(message.content)
    return str(message)

def last_reply(agent: AssistantAgent, conversation_partner: UserProxyAgent) -> str:
    history = getattr(agent, "_oai_messages", {})
    if conversation_partner in history:
        msg = agent.last_message(conversation_partner)
    else:
        msg = agent.last_message()
    return message_content(msg)

def send_with_retry(sender, recipient, message, request_reply, silent=True, max_attempts=3):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return sender.send(recipient=recipient, message=message, request_reply=request_reply, silent=silent)
        except Exception as exc:
            last_error = exc
            time.sleep(min(4, attempt) * 2)
    if last_error: raise last_error

def collect_reviewer_output(persona: ReviewerPersona, paper_text: str, llm_overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    reviewer = make_reviewer_agent(persona, overrides=llm_overrides)
    orchestrator = UserProxyAgent(name=f"proxy_for_{persona.name}", human_input_mode="NEVER", code_execution_config=False)

    intro = "You will receive the full manuscript. Read it completely and respond only after prompted."
    send_with_retry(orchestrator, reviewer, intro, request_reply=False)
    send_with_retry(orchestrator, reviewer, f"PAPER_TEXT\n\n{paper_text}", request_reply=False)

    final_prompt = "END_OF_PAPER. Produce the JSON object now. Output valid JSON only without fences."
    
    for _ in range(3):
        send_with_retry(orchestrator, reviewer, final_prompt, request_reply=True, silent=False)
        try:
            return parse_json_response(last_reply(reviewer, orchestrator))
        except Exception:
            send_with_retry(orchestrator, reviewer, "Invalid JSON. Respond again with JSON only.", request_reply=False)
    raise RuntimeError("Unable to collect reviewer output")

def collect_editor_decision(paper_text: str, reviews: List[Dict[str, Any]], llm_overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    editor = make_editor_agent(overrides=llm_overrides)
    orchestrator = UserProxyAgent(name="proxy_for_editor", human_input_mode="NEVER", code_execution_config=False)

    send_with_retry(orchestrator, editor, "You will receive the paper text followed by reviews.", request_reply=False)
    send_with_retry(orchestrator, editor, f"PAPER_TEXT_START\n{paper_text}\nPAPER_TEXT_END", request_reply=False)
    
    # Convert reviews to string payload
    reviews_payload = json.dumps(reviews, indent=2)
    send_with_retry(orchestrator, editor, f"REVIEWS_START\n{reviews_payload}\nREVIEWS_END", request_reply=False)

    final_prompt = "Using the rubric, output the JSON decision now."
    
    for _ in range(3):
        send_with_retry(orchestrator, editor, final_prompt, request_reply=True, silent=False)
        try:
            return parse_json_response(last_reply(editor, orchestrator))
        except Exception:
            send_with_retry(orchestrator, editor, "Invalid JSON. Respond again with JSON only.", request_reply=False)
    raise RuntimeError("Unable to collect editor decision")


In [44]:
# === Wiring it all together ===

def call_llm_reviewer(paper_text: str, *, metadata: dict | None = None) -> dict:
    """Concrete implementation using AutoGen."""
    # Create a random persona for variety or use metadata to set specific expertise
    persona = ReviewerPersona(
        name="sim_reviewer",
        expertise="machine learning",  # could be inferred from paper title/content in future
        seniority=random.choice(["grad", "junior", "senior"])
    )
    
    try:
        result = collect_reviewer_output(persona, paper_text)
        # Normalize keys if needed
        result["source"] = "llm"
        return result
    except Exception as e:
        print(f"LLM Reviewer failed: {e}")
        return {"source": "llm", "error": str(e)}

def call_llm_editor(paper_text: str, *, human_reviews: list[dict], llm_reviews: list[dict], metadata: dict | None = None) -> dict:
    """Concrete implementation using AutoGen."""
    all_reviews = human_reviews + llm_reviews
    try:
        return collect_editor_decision(paper_text, all_reviews)
    except Exception as e:
        print(f"LLM Editor failed: {e}")
        return {"decision": "reject", "error": str(e)}


In [45]:
# High-level LLM simulation for a single paper

def run_llm_review_simulation_for_paper(
    paper_id: str,
    *,
    n_llm_reviewers: int = 1,
    replace_human: bool = False,
    max_chars: int = 30000,
) -> dict:
    """Run the 3-step LLM pipeline for one paper_id.

    Args:
        paper_id: The paper ID to simulate.
        n_llm_reviewers: How many LLM reviewers to include.
        replace_human: If True, effectively 'switches' n_llm_reviewers human reviews with LLM reviews.
                       (If there are fewer human reviews than n_llm_reviewers, it replaces all of them).
                       If False, adds n_llm_reviewers to the existing human panel.
    """

    try:
        paper_text = get_paper_text(paper_id, max_chars=max_chars)
    except FileNotFoundError as e:
        print(f"Skipping {paper_id}: {e}")
        return {"error": str(e)}

    human_reviews = get_human_reviews_for_paper(paper_id)

    # Implement 'switch' logic: if replace_human is True, remove k random human reviews
    if replace_human and len(human_reviews) > 0:
        # Remove up to n_llm_reviewers human reviews to make space for the agents
        n_to_remove = min(n_llm_reviewers, len(human_reviews))
        # Simple slice (could be random.sample if order mattered more)
        human_reviews_to_keep = human_reviews[n_to_remove:]
        final_human_reviews = human_reviews_to_keep
    else:
        final_human_reviews = human_reviews

    llm_reviews: list[dict] = []
    for i in range(n_llm_reviewers):
        print(f"Generating LLM review {i+1}/{n_llm_reviewers}...")
        llm_review = call_llm_reviewer(paper_text, metadata={"paper_id": paper_id})
        llm_reviews.append(llm_review)

    print("Generating Editor decision...")
    editor_decision = call_llm_editor(
        paper_text,
        human_reviews=final_human_reviews,
        llm_reviews=llm_reviews,
        metadata={"paper_id": paper_id},
    )

    return {
        "paper_id": paper_id,
        "human_reviews_used": final_human_reviews,
        "llm_reviews": llm_reviews,
        "editor_decision": editor_decision,
    }


In [46]:
def simulate_llm_composition_for_paper(
    paper_id: str,
    llm_counts: list[int],
    n_rounds: int = 3,
    replace_human: bool = False,
    max_chars: int = 30000,
) -> pd.DataFrame:
    """For one paper, run the LLM reviewer+editor pipeline with different numbers of LLM reviews.
    
    Args:
        llm_counts: list of integers, e.g. [0, 1, 2].
        replace_human: If True, we SWAP human reviewers for LLM reviewers.
                       If False, we ADD LLM reviewers to the panel.
    """

    rows: list[dict] = []

    for n_llm in llm_counts:
        for round_idx in range(1, n_rounds + 1):
            print(f"--- Simulating: {n_llm} LLM reviewers (Round {round_idx}/{n_rounds}) ---")
            sim = run_llm_review_simulation_for_paper(
                paper_id=paper_id,
                n_llm_reviewers=n_llm,
                replace_human=replace_human,
                max_chars=max_chars,
            )
            
            if "error" in sim:
                print(f"Skipping due to error: {sim['error']}")
                continue

            ed = sim.get("editor_decision", {}) or {}

            rows.append(
                {
                    "paper_id": paper_id,
                    "round": round_idx,
                    "n_llm_reviewers": n_llm,
                    "replace_human": replace_human,
                    "editor_decision": ed.get("decision"),
                    "editor_rationale": ed.get("rationale"),
                    "editor_highlights": ed.get("highlights"),
                }
            )

    return pd.DataFrame(rows)


# Validate Editor Prompt


In [57]:
# Direct single-call LLM editor (one API call per paper)

_client = None


def get_openai_client() -> OpenAI:
    """Lazily construct a global OpenAI client using the same API key helper."""
    global _client
    if _client is None:
        api_key = require_api_key()
        if not api_key:
            raise RuntimeError("OPENAI_API_KEY not set; cannot call LLM editor.")
        _client = OpenAI(api_key=api_key)
    return _client


def call_llm_editor_single_call(
    paper_text: str,
    *,
    reviews: list[dict],
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Call the editor model exactly once, with paper text and ALL reviews grouped together.

    This bypasses AutoGen's multi-message protocol and retry loop, making
    exactly one chat.completions.create call per paper.
    """
    cfg: Dict[str, Any] = {**DEFAULT_LLM_CONFIG}
    if overrides:
        cfg.update(overrides)

    model = cfg.get("model", MODEL_NAME)
    temperature = cfg.get("temperature", 0.7)
    max_tokens = cfg.get("max_tokens", 1200)

    reviews_payload = json.dumps(reviews, indent=2)

    user_content = (
        "You will receive the full paper text and then the list of reviews.\n\n"
        "PAPER_TEXT_START\n" + paper_text + "\nPAPER_TEXT_END\n\n"
        "REVIEWS_START\n" + reviews_payload + "\nREVIEWS_END\n\n"
        "Using the rubric, output the JSON decision now."
    )

    client = get_openai_client()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EDITOR_PROMPT},
            {"role": "user", "content": user_content},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )

    raw_text = resp.choices[0].message.content or ""
    return parse_json_response(raw_text)



In [58]:
# === Validation: Check Editor alignment with Human Decisions ===

from collections import Counter


def human_majority_editor_decision(human_reviews: list[dict]) -> dict:
    """Offline "editor" that uses only human review fields (no LLM).

    Strategy:
    - If per-review `decision` is available, take the majority label.
    - Otherwise, fall back to mean numeric rating (if present) and map to a coarse label.
    """
    if not human_reviews:
        return {
            "decision": "reject",
            "confidence": 0.0,
            "rationale": "No human reviews available; defaulting to reject.",
            "highlights": [],
        }

    # Collect decision strings (if present)
    decisions = [
        str(r.get("decision")).strip()
        for r in human_reviews
        if r.get("decision") not in (None, "", float("nan"))
    ]

    if decisions:
        counts = Counter(decisions)
        top_decision, top_count = counts.most_common(1)[0]
        confidence = top_count / len(decisions)
        return {
            "decision": top_decision,
            "confidence": float(confidence),
            "rationale": f"Chosen by majority of human reviews ({top_count}/{len(decisions)}).",
            "highlights": [],
        }

    # Fallback: use mean numeric rating, if available
    ratings = []
    for r in human_reviews:
        if "rating_num" in r and r["rating_num"] is not None:
            ratings.append(float(r["rating_num"]))
        elif "rating" in r and r["rating"] is not None:
            ratings.append(parse_rating(r["rating"]))

    ratings = [x for x in ratings if x == x]  # drop NaNs
    if ratings:
        avg = float(sum(ratings) / len(ratings))
        if avg >= 8:
            label = "oral"
        elif avg >= 7:
            label = "spotlight"
        elif avg >= 6:
            label = "poster"
        else:
            label = "reject"
        return {
            "decision": label,
            "confidence": 0.5,
            "rationale": f"Heuristic decision from mean rating {avg:.2f}.",
            "highlights": [],
        }

    return {
        "decision": "reject",
        "confidence": 0.0,
        "rationale": "Insufficient structured signal in human reviews; defaulting to reject.",
        "highlights": [],
    }


def validate_editor_on_human_reviews(
    n_papers: int = 5,
    seed: int = 42,
    max_chars: int = 30000,
    use_llm: bool = True,
) -> pd.DataFrame:
    """Validate either the LLM editor or an offline human-only rule.

    Args:
        n_papers: Number of papers to sample for validation.
        seed: Random seed for sampling.
        max_chars: Text limit for manuscript.
        use_llm: If True, call the LLM editor; if False, use `human_majority_editor_decision`.

    Returns:
        DataFrame with columns: [paper_id, actual_decision, llm_decision, match, rationale]
    """

    # Filter for papers that actually have a decision and reviews
    valid_papers = paper_stats[paper_stats["decision"].notna() & (paper_stats["n_reviews"] > 0)]

    if len(valid_papers) == 0:
        print("No papers found with both reviews and decisions to validate against.")
        return pd.DataFrame()

    sample = valid_papers.sample(n=min(n_papers, len(valid_papers)), random_state=seed)

    results = []
    mode = "LLM" if use_llm else "human-majority heuristic"
    print(f"Validating Editor ({mode}) on {len(sample)} papers...")

    for _, row in sample.iterrows():
        pid = row["paper_id"]
        actual_decision = row["decision"]

        try:
            paper_text = get_paper_text(pid, max_chars=max_chars)
        except FileNotFoundError:
            print(f"Skipping {pid} (PDF text not found)")
            continue

        human_reviews = get_human_reviews_for_paper(pid)

        # Either call LLM editor (single API call) or offline heuristic
        print(f"  Processing {pid} (Actual: {actual_decision})...")
        if use_llm:
            all_reviews = human_reviews  # we only use human reviews in this validation
            llm_out = call_llm_editor_single_call(
                paper_text=paper_text,
                reviews=all_reviews,
            )
        else:
            llm_out = human_majority_editor_decision(human_reviews)

        llm_decision = llm_out.get("decision", "error")
        rationale = llm_out.get("rationale", "")

        # Simple fuzzy match for decision labels (e.g. "accept (oral)" vs "oral")
        # Normalize both to lower case
        act_norm = str(actual_decision).lower()
        llm_norm = str(llm_decision).lower()

        # Define a match if the main category agrees (Accept vs Reject)
        # or if the specific label matches.
        is_match = False
        if "reject" in act_norm and "reject" in llm_norm:
            is_match = True
        elif "accept" in act_norm and "accept" in llm_norm:
            is_match = True
        elif "oral" in act_norm and "oral" in llm_norm:
            is_match = True
        elif "spotlight" in act_norm and "spotlight" in llm_norm:
            is_match = True
        elif "poster" in act_norm and "poster" in llm_norm:
            is_match = True

        results.append(
            {
                "paper_id": pid,
                "actual_decision": actual_decision,
                "llm_decision": llm_decision,
                "match": is_match,
                "llm_rationale": rationale,
            }
        )

    df = pd.DataFrame(results)

    if not df.empty:
        try:
            from sklearn.metrics import accuracy_score, f1_score

            # Helper to map decision string to binary (Accept=1, Reject=0)
            def to_binary(d_str):
                s = str(d_str).lower()
                if "reject" in s:
                    return 0
                # Assume "accept", "oral", "spotlight", "poster" are all accept
                if any(k in s for k in ["accept", "oral", "spotlight", "poster"]):
                    return 1
                return 0

            y_true = df["actual_decision"].apply(to_binary)
            y_pred = df["llm_decision"].apply(to_binary)

            acc = accuracy_score(y_true, y_pred)
            f1 = f1_score(y_true, y_pred, zero_division=0)

            print("\nValidation Complete.")
            print(f"Accuracy: {acc:.2%}")
            print(f"F1 Score: {f1:.4f}")

        except ImportError:
            print("\nValidation Complete. (sklearn not found, skipping F1/Accuracy)")
            print(f"Raw Agreement Rate: {df['match'].mean():.2%}")

    if not df.empty:
        accuracy = df["match"].mean()
        print(f"\nValidation Complete. Agreement Rate: {accuracy:.2%}")

    return df

In [59]:
val_results = validate_editor_on_human_reviews(n_papers=10)
val_results

Validating Editor (LLM) on 10 papers...
  Processing 7WUdjDhF38 (Actual: Reject)...
  Processing xkgfLXZ4e0 (Actual: Accept (Poster))...
  Processing SBCMNc3Mq3 (Actual: Accept (Oral))...
  Processing OBjF5I4PWg (Actual: Accept (Poster))...
  Processing aKcd7ImG5e (Actual: Accept (Poster))...
  Processing lBlHIQ1psv (Actual: Reject)...
  Processing CPhqrV5Ehg (Actual: Reject)...
  Processing 0Zot73kfLB (Actual: Reject)...
  Processing Bl3e8HV9xW (Actual: Accept (Poster))...
  Processing K2Tqn8R9pu (Actual: Accept (Poster))...

Validation Complete. (sklearn not found, skipping F1/Accuracy)
Raw Agreement Rate: 100.00%

Validation Complete. Agreement Rate: 100.00%


Unnamed: 0,paper_id,actual_decision,llm_decision,match,llm_rationale
0,7WUdjDhF38,Reject,reject,True,The paper presents an interesting approach to ...
1,xkgfLXZ4e0,Accept (Poster),poster,True,The paper presents a novel investigation into ...
2,SBCMNc3Mq3,Accept (Oral),oral,True,The paper presents a significant contribution ...
3,OBjF5I4PWg,Accept (Poster),poster,True,The paper addresses a novel and relevant probl...
4,aKcd7ImG5e,Accept (Poster),poster,True,The paper presents a novel approach to time se...
5,lBlHIQ1psv,Reject,reject,True,"The paper presents the ADOPD-Instruct dataset,..."
6,CPhqrV5Ehg,Reject,reject,True,The paper proposes a low-rank autoregressive r...
7,0Zot73kfLB,Reject,reject,True,"The manuscript presents a novel approach, GVFi..."
8,Bl3e8HV9xW,Accept (Poster),poster,True,"The paper introduces a novel solution concept,..."
9,K2Tqn8R9pu,Accept (Poster),poster,True,The manuscript presents a novel approach to Ga...


# Implementation

In [None]:
# Example (skeleton): how decisions change as we increase the number of LLM reviews

# Pick a paper_id that has at least one human review in the CSV
if not reviews_df.empty:
    example_paper_id = reviews_df["paper_id"].iloc[0]
    print("Example paper_id:", example_paper_id)

    # Uncomment to run (requires OPENAI_API_KEY env var)
    comp_df = simulate_llm_composition_for_paper(
        paper_id=example_paper_id,
        llm_counts=[0, 1],        # Try 0 LLMs (all human) vs 1 LLM (switched)
        replace_human=True,       # SWITCH mode
        n_rounds=1,
    )
    print("\nSimulation Results:")
    print(comp_df)
else:
    print("No reviews found to sample from.")

Example paper_id: zkNCWtw2fd
--- Simulating: 0 LLM reviewers (Round 1/1) ---
Generating Editor decision...
LLM Editor failed: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
--- Simulating: 1 LLM reviewers (Round 1/1) ---
Generating LLM review 1/1...
LLM Reviewer failed: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
Generating Editor decision...
LLM Editor failed: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

Simulation Results:
     paper_id  round  n_llm_reviewers  replace_human editor_decision  \
0  zkNCWtw2fd      1                0           True          reject   
1  zkNCWtw2fd      1                1           True          reject   

  editor_rationale editor_highlights  
0             None              None  
1             N