🔬 Scholar Loop

Autonomous, multi-agent AI research — a PhD's workflow on a single-GPU budget.

read papers → find a gap → run real experiments → reflect → write & self-review

ScholarLoop runs the loop a PhD actually runs: it reads the literature, forms a grounded hypothesis, runs real ML experiments, scores them against a frozen ground-truth metric, learns from its failures, and drafts a peer-reviewed write-up — autonomously, with a deterministic harness that keeps the agents honest and impossible to reward-hack.

Stage	What it does
🎯 Director	reads ledger + literature trends → sets the next direction, topic & budget
🔭 Lit Scout	pulls real papers from arXiv + OpenAlex, citation-ranked → structured, cited findings
💡 Reasoner	constraints + literature + past lessons → the next experiment
🗳️ Debate Panel	three personas vote — is this worth a GPU?
🪜 Funnel	smoke → verify → full · a cheap screen kills most ideas
⚙️ Runner	runs a real torch experiment, scored by a frozen metric
🪞 Reflector	turns the outcome into a lesson in a decaying skill library
🚦 Advisor	PROCEED · REFINE · PIVOT — steers the loop
✍️ Writer + Reviewer	confirmed findings → number-grounded draft → peer review
🗄️ Ledger + Skills	the durable memory every step reads from

✨ Highlights


🧪 Real, pluggable experiments	Drives real PyTorch runs (CPU-fast, no download). Two domains ship today — digit classification (error %) and diabetes regression (RMSE) — and a new one is just a YAML profile + an engine pair, zero orchestrator changes.
🤖 8 agents, one harness	Director · Lit Scout · Reasoner · Debate · Reflector · Advisor · Writer · Reviewer — typed JSON-schema I/O, validate→retry, one shared audit trace.
🔭 Literature-grounded	The Lit Scout pulls real papers from arXiv + OpenAlex, ranks them by citation impact, and distills cited techniques — so ideas aren't blind hill-climbing.
💸 Budget-aware funnel	One idea climbs smoke → verify → full, each tier gated. Bad ideas die after one cheap run; marginal ones never burn a full run.
⚙️ Engineered loop	A parallel population funnel — propose N ideas, smoke-screen them all at once, climb only the survivors — under a self-stopping governor that halts on budget, round cap, or convergence (loop-until-dry).
🧠 Self-improving	Predicts each idea's effect, scores the prediction against reality, and distills failures into a relevance-ranked, time-decaying skill library re-injected next round.
🎯 Calibrated agents	Universal predict-then-verify — every agent's checkable claims (Reasoner deltas, Debate go/no-go) are scored against ground truth, so the loop learns which of its own agents to trust.
🛡️ Can't be reward-hacked	Two-phase frozen scoring (`train.py` can't fake the metric or see the val set) + edit allowlist + VerifiedRegistry number-grounding — proven by a bundled `cheater` engine.
✅ Honest & testable	108 tests, no API key or GPU needed — the whole loop runs against a deterministic `MockLLM`.

The LLM does only the open-ended reasoning. Everything checkable — search-space pruning, dedup, calibration, number-grounding, promotion gates — is deterministic, unit-tested code, and the metric is the only optimization target (no LLM-as-judge). Multiple adversarial review passes found and fixed real bugs across the loop's correctness and reward-hacking boundaries.

🔁 Loop engineering — the part that isn't the prompt

The leverage isn't in any single agent call — it's in the outer loop around them: how it fans out, what it spends compute on, what it remembers, and when it stops. ScholarLoop treats that loop as the product. One governed round looks like this:

  Director topic ─▶  Reasoner ✦ proposes N distinct ideas
                       │   fed: literature priors · relevance-ranked skills · each agent's track record
                       ▼
                  ╭──────────── parallel smoke screen · max_workers ────────────╮
                  │   idea₁ 4.2     idea₂ 7.3 ✗     idea₃ 4.4      …     ideaₙ   │   cheap · concurrent
                  ╰──────────────────────────────┬──────────────────────────────╯
                                 survivors only   │   (the clearly-worse die here)
                                                  ▼
                       verify · 3 seeds + significance  ─▶  full · 5 seeds        compute spent on the few
                                                  │
                                                  ▼
                       calibrate every agent's claim  ·  distill one lesson → skills
                                                  │
                                                  ▼
                       governor ▸  budget?   rounds?   converged?  ──▶  stop ◇ or next round ↺

Four pillars, each a few lines of deterministic code — and each pinned by tests so it can't silently rot:

	pillar	what it buys you	in code
🌐	Parallel population funnel	explore wide, pay narrow — propose N, smoke-screen them all at once, climb only survivors	`Orchestrator.population_step(k, max_workers)`
🛑	Self-stopping governor	run unattended — halt on a $ budget, a round cap, or loop-until-dry convergence	`Governor(max_cost, max_rounds, dry_patience)`
🎯	Universal predict-then-verify	the loop learns which of its own agents to trust — Reasoner deltas & Debate go/no-go scored vs ground truth	`CalibrationLog` → next prompt
🧭	Relevance-ranked context	a growing skill library stays useful — surface the lessons that bear on this idea, not just the heaviest	`SkillLibrary.render(query=…)`

Net effect: a loop you can actually let run — it fans out to explore, screens cheaply in parallel, concentrates compute on what survives, calibrates itself, and knows when to stop. See it live in examples/governed_campaign.py (free, deterministic) or in the two captured Opus runs above.

🚀 Quickstart

pip install -e ".[dev]"          # pyyaml + jsonschema + pytest   (".[llm]" adds the Anthropic client)

python examples/quickstart.py    # the whole loop in <1s — no GPU, no API key
python examples/campaign_demo.py # a full campaign on real torch, MockLLM-scripted
pytest -q

quickstart.py runs one idea through the funnel:

baseline to beat: 4.9% val_top1_err
  smoke  3.7644%  [kept]
  verify 3.8004%  [kept]
  full   3.7644%  [kept]    →  climbed 3 tiers, 3 kept

🎬 See it run — a real campaign

campaign_demo.py drives the whole agent chain on real torch, scripted by MockLLM so it's deterministic and free (abridged):

=== CAMPAIGN · digits-mlp (real torch) · baseline 5.0% ===

🎯 Director    scale width/depth and tune the optimizer
🔭 Lit Scout   wider/deeper layers (arXiv:1512.03385) · SGD momentum + cosine (arXiv:1608.03983)

idea 1   🗳️ run      🪜 smoke 4.67% → verify 4.96%      🚦 proceed
idea 2   🗳️ REJECT   → skipped, no GPU spent
idea 3   🪜 smoke 52.0% discarded     🚦 pivot     predicted −1.0, measured +47 → calib_err 48.3

A single run shows the system

ground its idea in literature before proposing it,
save a GPU run when the debate panel vetoes a redundant idea,
kill a bad idea cheaply at the smoke tier — no full run,
catch its own wrong prediction via predict-then-verify, then pivot.

🧪 Two live runs on Claude Opus 4.8 — real torch, real API, end to end

domain	metric	baseline	best (confirmed)	the climb	cost	self-review
digits-mlp	val error	5.0%	3.82%	population → verify → full, governed	≈ $0.45	reject 2/10
diabetes-mlp	val RMSE	56.5 (linear model)	55.24	population → verify → full, governed	≈ $0.77	reject 3/10

Both are governed population funnels — each round fans out several ideas, smoke-screens them in parallel, and climbs only the survivors, while the loop halts itself (on convergence or a round cap) and scores each agent's predictions against ground truth. An idea beats the baseline in each, every number traces to a frozen-metric measurement, and the system's own reviewer still rejects the papers as too marginal. (It's not wrong.) Each link opens the captured paper, run log, and raw ledger — every number reproducible from the jsonl.

📊 Status

Research preview — the full PhD-workflow skeleton runs end-to-end on real experiments, with the anti-reward-hacking guards in place and adversarially reviewed. It has been run live against the real Anthropic API across two domains — captured verbatim for both classification and regression, each a real Opus campaign that beats its baseline and writes itself up. Next: container sandboxing for the residual boundaries and scale.

License: MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
engines		engines
examples		examples
profiles		profiles
scholarloop		scholarloop
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔬 Scholar Loop

Autonomous, multi-agent AI research — a PhD's workflow on a single-GPU budget.

✨ Highlights

🔁 Loop engineering — the part that isn't the prompt

🚀 Quickstart

🎬 See it run — a real campaign

🧪 Two live runs on Claude Opus 4.8 — real torch, real API, end to end

📊 Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔬 Scholar Loop

Autonomous, multi-agent AI research — a PhD's workflow on a single-GPU budget.

✨ Highlights

🔁 Loop engineering — the part that isn't the prompt

🚀 Quickstart

🎬 See it run — a real campaign

🧪 Two live runs on Claude Opus 4.8 — real torch, real API, end to end

📊 Status

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages