# Reproducing "From Answers to Hypotheses"

This notebook reproduces all figures and statistical analyses reported in:

> Victor Lavrenko. *From Answers to Hypotheses: Internal Consensus and Its Limits in Large Language Models*. 2026.

The results correspond to the tagged release:
`paper/from-answers-to-hypotheses-v1`

Repository:
https://github.com/victorlavrenko/rofa

## Setup

Install the ROFA package and load the precomputed runs.


In [None]:
# install ROFA package if not already installed
import importlib.metadata

try:
    importlib.metadata.distribution("rofa")
except importlib.metadata.PackageNotFoundError:
    from pathlib import Path

    # Prefer an editable install of the enclosing checkout; otherwise clone the repo.
    if (Path.cwd().parent.parent / "pyproject.toml").is_file():
        %pip install -e "../.."
    else:
        if not Path("rofa").is_dir():
            !git clone https://github.com/victorlavrenko/rofa
        %pip install -e "rofa"

In [None]:
# Setup
import importlib.util

if importlib.util.find_spec("rofa.papers") is None:
    print(
        "\n⚠️  Runtime restart required\n\n"
        "ROFA has just been installed, but the Python runtime has not been restarted yet.\n\n"
        "Please restart the runtime via:\n"
        "  Runtime (or ▼ after Run all) → Restart runtime (or Restart runtime and run all)\n"
        "This is expected behaviour in Google Colab."
    )
    raise SystemExit

import pandas as pd

from rofa.papers.from_answers_to_hypotheses import analysis, notebook_helpers

# Get run artifacts
run_dir_greedy, greedy_asset_url = (
    r"",
    "https://github.com/victorlavrenko/rofa/releases/download/paper%2Ffrom-answers-to-hypotheses-v1/rofa-from-answers-to-hypotheses-runs-v1-greedy.zip",
)
run_dir_k_sample, k_sample_asset_url = (
    r"",
    "https://github.com/victorlavrenko/rofa/releases/download/paper%2Ffrom-answers-to-hypotheses-v1/rofa-from-answers-to-hypotheses-runs-v1-branches10.zip",
)
run_inputs = notebook_helpers.resolve_run_inputs(
    run_dir_greedy, greedy_asset_url, run_dir_k_sample, k_sample_asset_url
)


In [None]:
# Load + validate
df_greedy, df_branches, metadata = analysis.load_paper_runs(run_inputs)
notebook_helpers.validate_required_columns(df_greedy, df_branches)
notebook_helpers.print_run_summary(df_greedy, df_branches, metadata)

## H1: Aggregation Improves Accuracy

> Greedy accuracy: 65.75%
>
> Majority accuracy: 66.75%
>
> A two-sided binomial test with null hypothesis $H_0: \pi = 0.6575$ yields a p-value of approximately 0.63, indicating no statistically significant difference between greedy and majority-vote accuracy.
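
As a rough illustration of the kind of test quoted above, the sketch below computes an exact two-sided binomial p-value using only the standard library. The helper names `binom_pmf` and `binom_two_sided_p` are hypothetical, the "sum of outcomes no more likely than the observed one" convention is one common choice and may differ from the procedure used in the paper, and the counts are placeholders rather than the paper's data.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_two_sided_p(k: int, n: int, p0: float) -> float:
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely under H0 (success rate p0) than the observed count k."""
    pmfs = [binom_pmf(i, n, p0) for i in range(n + 1)]
    # Small relative tolerance so that ties survive floating-point noise.
    cutoff = pmfs[k] * (1.0 + 1e-7)
    return min(1.0, sum(pm for pm in pmfs if pm <= cutoff))


# Placeholder counts for illustration only (not taken from the paper).
p_value = binom_two_sided_p(k=60, n=100, p0=0.5)
print(f"two-sided p-value: {p_value:.4f}")
```

To apply this to the runs loaded above, `k` would be the number of correct majority-vote answers, `n` the number of questions, and `p0` the greedy accuracy.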


In [None]:
# R1: greedy accuracy
df_greedy_accuracy = pd.DataFrame(
    {"metric": ["greedy_accuracy"], "value": [analysis.accuracy_greedy(df_greedy)]}
)
df_greedy_accuracy

In [None]:
# R2: leader accuracy
df_leader_accuracy = pd.DataFrame(
    {"metric": ["leader_accuracy"], "value": [analysis.accuracy_leader(df_branches)]}
)
df_leader_accuracy

In [None]:
# R9: majority vote does not help (greedy vs leader)
df_majority_vote = notebook_helpers.majority_vote_table(df_greedy, df_branches)
df_majority_vote

## H2: Correct Answers Appear Among Alternatives

> Observed Top-2 coverage is 80.5%, compared to a greedy accuracy of 65.75%, corresponding to an absolute improvement of 14.75 percentage points. Using a binomial model with null hypothesis $H_0: \pi = 0.6575$, this difference is highly statistically significant ($p$-value $\ll 10^{-6}$).
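
To make the Top-2 coverage metric concrete, here is a minimal sketch on invented toy data. It assumes that coverage counts a question as covered when the gold answer is among the two most frequent branch answers, with ties broken by first appearance (the same tie-break as the `_top_k` helper later in this notebook). The function name `top_k_answers` and the examples are hypothetical.

```python
from collections import Counter


def top_k_answers(preds, k=2):
    """Return the k most frequent answers; ties go to the answer seen first."""
    valid = [p for p in preds if p is not None]
    counts = Counter(valid)
    first_seen = {}
    for i, p in enumerate(valid):
        first_seen.setdefault(p, i)
    ranked = sorted(counts, key=lambda a: (-counts[a], first_seen[a]))
    return ranked[:k]


# Toy data, invented for illustration: (gold answer, sampled branch answers).
examples = [
    ("A", ["A", "A", "B", "A"]),   # gold is the leader
    ("B", ["A", "A", "B", "C"]),   # gold is the runner-up
    ("D", ["A", "B", "A", "C"]),   # gold is never sampled
    ("C", ["B", None, "C", "B"]),  # gold is the runner-up; one branch unparsed
]

top1_hits = sum(gold in top_k_answers(preds, k=1) for gold, preds in examples)
top2_hits = sum(gold in top_k_answers(preds, k=2) for gold, preds in examples)
print(f"top-1 accuracy: {top1_hits / len(examples):.2f}")
print(f"top-2 coverage: {top2_hits / len(examples):.2f}")
```

The gap between the two printed numbers is the share of questions where the correct answer was sampled as the runner-up but lost the vote.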


In [None]:
# R6: top-2 coverage
df_top2 = analysis.compute_table_top2(df_branches)
df_top2

## H3: Internal Consensus Implies Correctness

> Unanimous cases: 151
>
> Unanimous accuracy: 86.8%
>
> A one-sided binomial test of $H_0: \pi \ge 0.95$ yields a p-value below 0.01, so even strong internal consensus does not guarantee near-perfect reliability.
> Near-unanimous cases ($\text{max\_frac} \ge 0.9$) still exhibit error rates above 15%.
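
The one-sided test above can be sketched with the standard library alone. The count of 131 correct answers is inferred here, not reported: it is the only integer that rounds to 86.8% of 151. The helper name `binom_lower_tail` is hypothetical, and the exact procedure used in the paper may differ.

```python
from math import comb


def binom_lower_tail(k: int, n: int, p0: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p0): the one-sided p-value against
    H0: pi >= p0 when k correct answers are observed."""
    return sum(comb(n, i) * p0**i * (1.0 - p0) ** (n - i) for i in range(k + 1))


n_unanimous = 151  # reported above
n_correct = 131    # inferred: the only count that rounds to 86.8% of 151
p_value = binom_lower_tail(n_correct, n_unanimous, 0.95)
print(f"accuracy: {n_correct / n_unanimous:.1%}")
print(f"one-sided p-value: {p_value:.2e}")
```

A small p-value here means that an underlying accuracy of 95% or more is hard to reconcile with the observed number of unanimous-but-wrong cases.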


In [None]:
# R4: unanimous stats
unanimous_stats = analysis.unanimous_stats(df_branches)
df_unanimous = pd.DataFrame([unanimous_stats])
df_unanimous

In [None]:
# R5: near-unanimous stats
near_unanimous_stats = analysis.near_unanimous_stats(df_branches, threshold=0.9)
df_near_unanimous = pd.DataFrame([near_unanimous_stats])
df_near_unanimous

In [None]:
# R5b: operational failure-mode breakdown (top-2 + unanimity)
from collections import Counter

def _pick_col(df, candidates):
    for name in candidates:
        if name in df.columns:
            return name
    raise KeyError(f"Missing required columns. Tried: {candidates}")

gold_col = _pick_col(df_branches, ["gold", "answer", "label"])
leader_col = _pick_col(df_branches, ["leader", "majority", "leader_answer"])
branch_col = _pick_col(df_branches, ["branch_preds", "branches", "preds"])
max_frac_col = next((c for c in ["max_frac", "max_frac_exact"] if c in df_branches.columns), None)
leader_correct_col = next((c for c in ["leader_correct", "correct"] if c in df_branches.columns), None)

leader_correct = (
    df_branches[leader_correct_col].fillna(False).astype(bool)
    if leader_correct_col
    else (df_branches[leader_col] == df_branches[gold_col]).fillna(False)
)
leader_wrong = ~leader_correct

def _top_k(preds, k=2):
    # Tie-break: rank by count, then by earliest first appearance in branch_preds.
    preds_clean = [p for p in preds if p is not None]
    if not preds_clean:
        return []
    first_idx = {}
    for i, p in enumerate(preds):
        if p is None:
            continue
        if p not in first_idx:
            first_idx[p] = i
    counts = Counter(preds_clean)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], first_idx[kv[0]]))
    return [item[0] for item in ranked[:k]]

gold_series = df_branches[gold_col]
branch_series = df_branches[branch_col]
gold_in_top2 = [
    gold in _top_k(preds, k=2)
    for preds, gold in zip(branch_series, gold_series, strict=False)
]

if max_frac_col:
    unanimous_mask = df_branches[max_frac_col].fillna(0.0) == 1.0
else:
    unanimous_mask = [
        len({p for p in preds if p is not None}) == 1
        for preds in branch_series
    ]

n_total = len(df_branches)
n_errors = int(leader_wrong.sum())
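# Give the list-based mask df_branches' index so the boolean ops align row by row.
gold_in_top2_mask = pd.Series(gold_in_top2, index=df_branches.index, dtype=bool)
selection_mask = leader_wrong & gold_in_top2_mask
unsurfaced_mask = leader_wrong & ~gold_in_top2_mask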

sel_n = int(selection_mask.sum())
uns_n = int(unsurfaced_mask.sum())

def _pct(num, denom):
    return 0.0 if denom == 0 else 100.0 * num / denom

sel_total_pct = _pct(sel_n, n_total)
sel_error_pct = _pct(sel_n, n_errors)
uns_total_pct = _pct(uns_n, n_total)
uns_error_pct = _pct(uns_n, n_errors)

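# Works for both branches above: a Series keeps its index, a list receives it.
unanimous_mask = pd.Series(unanimous_mask, index=df_branches.index)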
unanim_n = int(unanimous_mask.sum())
unanim_wrong_n = int((unanimous_mask & leader_wrong).sum())
unanim_wrong_pct = _pct(unanim_wrong_n, unanim_n)
unanim_acc_pct = 100.0 - unanim_wrong_pct if unanim_n else 0.0
unanim_share_errors_pct = _pct(unanim_wrong_n, n_errors)

print("Failure mode breakdown (N=10 sampling)")
print(f"N_total: {n_total}")
print(f"N_errors: {n_errors}")
print(
    f"selection_errors: {sel_n} ("
    f"{sel_total_pct:.1f}% total, {sel_error_pct:.1f}% of errors)"
)
print(
    f"unsurfaced_errors: {uns_n} ("
    f"{uns_total_pct:.1f}% total, {uns_error_pct:.1f}% of errors)"
)
print(
    f"unanimous: {unanim_n} (acc {unanim_acc_pct:.1f}%, "
    f"wrong {unanim_wrong_n}, wrong rate {unanim_wrong_pct:.1f}%, "
    f"share of errors {unanim_share_errors_pct:.1f}%)"
)

if max_frac_col:
    near_mask = df_branches[max_frac_col].fillna(0.0) >= 0.9
    near_n = int(near_mask.sum())
    near_wrong_n = int((near_mask & leader_wrong).sum())
    near_wrong_pct = _pct(near_wrong_n, near_n)
    print(
        f"near_unanimous>=0.9: {near_n} (wrong {near_wrong_n}, "
        f"wrong rate {near_wrong_pct:.1f}%)"
    )


> Figure 1: Accuracy as a function of internal consensus (max_frac). Higher branch agreement correlates with higher accuracy, but even near-unanimous predictions exhibit a non-zero error rate.
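
The quantity plotted in Figure 1 can be illustrated on toy data. The sketch assumes that `max_frac` is the share of branches voting for the most frequent answer; that definition is an assumption here, and `leader_and_max_frac` is a hypothetical helper rather than part of the ROFA API.

```python
from collections import Counter, defaultdict


def leader_and_max_frac(preds):
    """Return the most frequent answer and its share of the branches."""
    leader, votes = Counter(preds).most_common(1)[0]
    return leader, votes / len(preds)


# Toy data, invented for illustration: (gold answer, branch answers).
examples = [
    ("A", ["A", "A", "A", "A"]),  # unanimous and correct
    ("B", ["C", "C", "C", "C"]),  # unanimous but wrong
    ("A", ["A", "A", "B", "A"]),  # max_frac = 0.75, correct
    ("D", ["A", "B", "A", "D"]),  # max_frac = 0.50, wrong
]

# Group leader correctness by consensus level.
by_level = defaultdict(list)
for gold, preds in examples:
    leader, max_frac = leader_and_max_frac(preds)
    by_level[max_frac].append(leader == gold)

for max_frac in sorted(by_level):
    outcomes = by_level[max_frac]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"max_frac={max_frac:.2f}: accuracy {accuracy:.2f} (n={len(outcomes)})")
```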


In [None]:
# Figure 1: accuracy vs internal consensus (max_frac_exact)
df_max_frac_exact = notebook_helpers.plot_accuracy_vs_consensus(
    df_branches, "figure1_max_frac_exact.png"
)
df_max_frac_exact

## H4: Selective Leader Override Is Feasible

> We analyzed selective top-2 leader override across vote-consensus regimes. Broad regimes that include high-consensus predictions require extremely high leader-override precision, while uncertainty-focused regimes are more favorable. Excluding high-consensus predictions yields a more than twofold reduction in required false-override suppression (from $\approx 2.4$ to $\approx 1.0$), a difference that is statistically significant ($p < 10^{-6}$).
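
One plausible way to read the override trade-off is sketched below. Within a regime, flipping a correct leader is a harm and flipping onto a correct runner-up is a benefit. The formulas are assumptions made for illustration and are not confirmed to match `analysis.top2_flip_analysis`; `flip_stats` is a hypothetical helper and the counts are placeholders.

```python
def flip_stats(n_top1_correct: int, n_top2_correct: int, n_total: int):
    """Summarize the effect of swapping the leader for the runner-up in a regime."""
    # Harms per benefit if every case in the regime is flipped.
    harm_to_benefit = (
        n_top1_correct / n_top2_correct if n_top2_correct else float("inf")
    )
    # Net change in overall accuracy from flipping every case in the regime.
    always_flip_delta = (n_top2_correct - n_top1_correct) / n_total
    # Best case: an oracle flips only the cases where the runner-up is correct.
    oracle_gain = n_top2_correct / n_total
    return harm_to_benefit, always_flip_delta, oracle_gain


# Placeholder counts, invented for illustration (not results from the paper).
ratio, delta, gain = flip_stats(n_top1_correct=30, n_top2_correct=12, n_total=500)
print(f"harm-to-benefit ratio: {ratio:.1f}")
print(f"always-flip accuracy change: {delta:+.3f}")
print(f"oracle accuracy gain: {gain:+.3f}")
```

Under this reading, a ratio near 1.0 means that indiscriminate flipping roughly breaks even, so an override rule needs only modest precision to help, while a ratio well above 1.0 demands that most harmful flips be suppressed.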


In [None]:
# R6b: top-2 flip subset discovery (strict mode)
matrix, rectangles, threshold_rectangles, tie_stats = analysis.top2_flip_analysis(
    df_branches, strict=True, min_support=10
)
threshold_rectangles


In [None]:
# R6c: top-2 flip playground (absolute + relative ranges)
analysis.top2_flip_playground(
    df_branches, top1_votes_min=6, top1_votes_max=10, top2_votes_min=1, top2_votes_max=7
)
analysis.top2_flip_playground_relative(
    df_branches, top1_votes_min=6, top1_votes_max=10, gap_min=0, gap_max=5
)


> Figure 2: Feasibility of selective top-2 leader override across vote-consensus regimes. The x-axis shows the maximum overall accuracy achievable by an ideal oracle that corrects every case in a regime where the runner-up (top-2) answer is the gold answer; the y-axis shows the required false-override suppression (top-1 correct vs. top-2 correct).


In [None]:
# R6d: top-2 flip subset discovery (relative gap search)
(
    matrix_gap,
    rectangles_gap,
    threshold_rectangles_gap,
    tie_stats_gap,
) = analysis.top2_flip_analysis_relative(df_branches, strict=True, min_support=10)
# A bare expression in the middle of a cell is not rendered, so display it explicitly.
display(threshold_rectangles_gap)

# Figure 2: selective top-2 flip feasibility
from rofa.analysis.plots import plot_top2_flip_feasibility

baseline_acc = float(df_leader_accuracy["value"].iloc[0])
fig, ax, plot_df = plot_top2_flip_feasibility(
    rectangles_gap,
    baseline_acc,
    total_n=len(df_branches),
    use_frontier_df=threshold_rectangles_gap,
    save_path="figure2_top2_flip_feasibility.png",
)
plot_df


In [None]:
# R6e: sensitivity analysis around relative gap selections
gap_neighbors = analysis.make_gap_neighbor_rows(
    df_branches, threshold_rectangles_gap, include_gap_min_neighbors=False
)
gap_neighbors[
    [
        "source_row",
        "variant",
        "top1_votes_min",
        "top1_votes_max",
        "gap_min",
        "gap_max",
        "total_examples_count",
        "total_top1_correct_count",
        "total_top2_correct_count",
        "harm_to_benefit_ratio",
        "always_flip_delta_accuracy",
        "delta_always_flip_delta_accuracy",
    ]
]


## Supporting Tables and Diagnostics

The tables below are supporting artifacts used in the paper exports and in exploratory diagnostics.


In [None]:
# R3: distribution of max_frac
df_max_frac = analysis.max_frac_distribution(df_branches).reset_index()
df_max_frac.columns = ["max_frac_bin", "count"]
df_max_frac

In [None]:
# R7: R/W/Other breakdown by max_frac bins
df_rw_other = analysis.rw_other_breakdown(df_branches)
df_rw_other

In [None]:
# R8: error modes (unanimous wrong)
df_unanimous_wrong = analysis.unanimous_wrong(df_branches)
df_unanimous_wrong.head()

In [None]:
# R10: subject-wise breakdown (optional)
df_subject_breakdown = notebook_helpers.subject_breakdown(df_greedy, df_branches)
df_subject_breakdown.head(20)

In [None]:
# R11: export paper tables
report_dir = notebook_helpers.export_paper_reports(
    metadata,
    df_greedy_accuracy,
    df_leader_accuracy,
    unanimous_stats,
    near_unanimous_stats,
    df_top2,
    df_max_frac,
    df_rw_other,
    df_subject_breakdown,
)
print("Saved reports to", report_dir)

## Add your own analysis below
