# ROFA — Data Generation Notebook

This notebook runs the **data generation stage** for the paper  
*From Answers to Hypotheses: Internal Consensus and Its Limits in Large Language Models*.

It executes model inference under fixed decoding settings and produces
versioned run artifacts (JSON/JSONL) that capture:
- per-question model outputs,
- alternative sampled hypotheses,
- metadata required for downstream analysis.
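
A minimal sketch of how a JSONL artifact could be read back for inspection, assuming only that each non-empty line holds one JSON object. The helper name `iter_jsonl` and the demo file contents are illustrative and not part of the ROFA package; the actual record fields are defined by the generation code and are not assumed here.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Demo on a throwaway file with placeholder records (not real run output).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    records = list(iter_jsonl(demo_path))

print(f"Read {len(records)} records")
```

Reading line by line keeps memory use flat, which may matter for runs that store many sampled hypotheses per question.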

## Purpose

The outputs of this notebook are the **inputs** to the analysis and reproduction
pipeline implemented in `20_paper_reproduce.ipynb`.
You do not need to run this notebook to reproduce the paper figures if you use
the pre-generated release artifacts.

## Usage modes

- **Reproduce paper results (recommended):**  
  Skip this notebook and download the released run artifacts from GitHub.

- **Regenerate data (optional):**  
  Run this notebook to regenerate model outputs, e.g. to:
  - test alternative decoding parameters,
  - evaluate new models,
  - extend experiments beyond the paper.
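
For the recommended mode, fetching a released archive could look roughly like the sketch below. It assumes the release asset is a `.zip` file; `fetch_and_extract` is a hypothetical helper, and no asset URL is assumed here, so pass whatever URL the release page shows.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(asset_url: str, dest_dir: Path) -> Path:
    """Download a zip archive from asset_url and unpack it into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    zip_path = dest_dir / "artifacts.zip"

    # Stream the response to disk instead of holding it in memory.
    with urllib.request.urlopen(asset_url) as response, zip_path.open("wb") as out:
        shutil.copyfileobj(response, out)

    # Only extract archives from a source you trust.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return dest_dir
```

The function is defined but not called, because the URL depends on the release you choose.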

## Notes

- Generation can be time- and compute-intensive.
- Results depend on model checkpoints, decoding parameters, and random seeds.
- This notebook is typically executed in Colab or another GPU-enabled environment.
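
Before starting a long run, it can help to confirm that a GPU is visible. The sketch below is one way to check, assuming an NVIDIA GPU with the `nvidia-smi` tool on the `PATH`; `nvidia_gpu_names` is a hypothetical helper, not part of the ROFA package, and other accelerators would need a different check.

```python
import shutil
import subprocess


def nvidia_gpu_names() -> list[str]:
    """Return GPU names reported by nvidia-smi, or an empty list if none are visible."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = nvidia_gpu_names()
print("GPUs:", gpus if gpus else "none detected; generation may be very slow")
```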

In [None]:
DATASET_NAME = "openlifescienceai/medmcqa"
DATASET_SPLIT = "validation"
SEED = 42
N = 300
SUBJECTS = 20

In [None]:
# Cell 0 — Bootstrap

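# Install the ROFA package and choose where run outputs are written.
import subprocess
import sys
from pathlib import Path

if (Path.cwd().parent.parent / "pyproject.toml").is_file():
    # Running inside a local checkout: install the repo in editable mode.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "../.."])
    OUT_BASE = "./rofa_runs"
else:
    # Otherwise assume Colab: persist outputs to Google Drive and clone the repo.
    from google.colab import drive
    drive.mount("/content/drive")
    OUT_BASE = "/content/drive/MyDrive/rofa_runs"
    if not Path("rofa").is_dir():  # skip the clone when this cell is re-run
        subprocess.check_call(["git", "clone", "https://github.com/victorlavrenko/rofa"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "rofa"])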

In [None]:
# Cell 1 — Validate environment (Drive + output + model)
from pathlib import Path

from rofa.core.model import MODEL_ID, load_model_with_fallback, load_tokenizer

out_base = Path(OUT_BASE)
out_base.mkdir(parents=True, exist_ok=True)
assert out_base.exists() and out_base.is_dir(), "Output base not available"
# Confirm the output directory is writable before starting a long run.
(out_base / "tmp_write_check.txt").write_text("ok")
(out_base / "tmp_write_check.txt").unlink()

tokenizer = load_tokenizer()
model = load_model_with_fallback()
print(f"Model ready: {MODEL_ID}")

In [None]:
# Cell 2 — Create a fixed question set (IDs)
from pathlib import Path

from rofa.core.question_set import create_question_set, save_question_set

question_set = create_question_set(
    {"dataset_name": DATASET_NAME, "dataset_split": DATASET_SPLIT},
    {
        "seed": SEED,
        "n": N,
        "subjects": SUBJECTS,
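        # Per-subject cap: an even split (N / SUBJECTS) plus 10% slack, plus one.
        # This is a float (17.5 with the defaults above).
        "max_per_subject": N / SUBJECTS * 1.1 + 1,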
    },
)

qs_dir = Path(OUT_BASE) / "question_sets"
qs_dir.mkdir(parents=True, exist_ok=True)
question_set_id = question_set.qs_id
qs_path = qs_dir / f"{question_set_id}.json"
save_question_set(question_set, str(qs_path))

print(f"Saved question set: {question_set_id} -> {qs_path}")

In [None]:
# Cell 3 — Run greedy generation (native Python call)
from datetime import datetime, timezone
from pathlib import Path

from rofa.core.runner import run_generation
from rofa.core.schemas import GenerationConfig
from rofa.papers.from_answers_to_hypotheses.methods import GreedyDecode

run_id = f"greedy_{question_set_id}_{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}"
run_dir = Path(OUT_BASE) / "runs"
run_dir.mkdir(parents=True, exist_ok=True)

config = GenerationConfig(
    method="greedy",
    model_id=MODEL_ID,
    out_dir=str(run_dir),
    run_id=run_id,
    seed=SEED,
    max_new_tokens=1024,
    n=N,
    subjects=SUBJECTS,
    dataset_name=DATASET_NAME,
    dataset_split=DATASET_SPLIT,
    question_set_path=str(qs_path),
    progress=True,
    heartbeat_every=10,
    write_full_records=False,
    tokenizer=tokenizer,
    model=model,
    method_impl=GreedyDecode(),
)

run_generation(config)

In [None]:
# Cell 4 — Run k-sample ensemble generation (branches alias)
from datetime import datetime, timezone
from pathlib import Path

from rofa.core.runner import run_generation
from rofa.core.schemas import GenerationConfig
from rofa.papers.from_answers_to_hypotheses.methods import BranchSamplingEnsemble

run_id = (
    f"k_sample_ensemble_{question_set_id}_"
    f"{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}"
)
run_dir = Path(OUT_BASE) / "runs"
run_dir.mkdir(parents=True, exist_ok=True)

# Sampling parameters, defined once so the config and the method stay in sync.
sampling = {"n_branches": 10, "temperature": 0.8, "top_p": 0.8, "top_k": 50}

config = GenerationConfig(
    method="k_sample_ensemble",
    model_id=MODEL_ID,
    out_dir=str(run_dir),
    run_id=run_id,
    seed=SEED,
    max_new_tokens=1024,
    n=N,
    subjects=SUBJECTS,
    dataset_name=DATASET_NAME,
    dataset_split=DATASET_SPLIT,
    question_set_path=str(qs_path),
    **sampling,
    progress=True,
    heartbeat_every=10,
    write_full_records=True,
    tokenizer=tokenizer,
    model=model,
    method_impl=BranchSamplingEnsemble(**sampling),
)

run_generation(config)

## Publish your run artifacts to GitHub Releases (manual)

1. Locate your run folder under `OUT_BASE/runs/<run_id>/` (in Google Drive when running in Colab, or under `./rofa_runs` when running from a local checkout).
2. Download the run folder as a `.zip`.
3. Create a new GitHub Release in your repository.
4. Upload the `.zip` as a release asset.
5. Paste the asset URL into the analysis notebook (`20_paper_reproduce.ipynb`) so it can download the artifacts.
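
Steps 1 and 2 can also be done from Python. This is a minimal sketch, assuming the run folder layout `OUT_BASE/runs/<run_id>/` described above; `zip_run_folder` is a hypothetical helper, not part of the ROFA package.

```python
import shutil
from pathlib import Path


def zip_run_folder(out_base: str, run_id: str) -> Path:
    """Create <out_base>/runs/<run_id>.zip containing the <run_id>/ folder."""
    run_path = Path(out_base) / "runs" / run_id
    if not run_path.is_dir():
        raise FileNotFoundError(f"Run folder not found: {run_path}")

    archive = shutil.make_archive(
        base_name=str(run_path),   # ".zip" is appended automatically
        format="zip",
        root_dir=run_path.parent,  # paths in the archive are relative to runs/
        base_dir=run_path.name,    # only this run's folder is included
    )
    return Path(archive)
```

In this notebook the call would be `zip_run_folder(OUT_BASE, run_id)`, using the `run_id` of the run you want to publish. Keeping the `<run_id>/` prefix inside the archive means several runs can be unpacked side by side without their files colliding.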
