An autonomous ML research framework that uses Claude to iteratively generate, evaluate, and improve training code. Point it at a task, set a time budget, and let it run - it proposes experiments, validates them, executes in a sandbox, and keeps only the improvements.
Inspired by Karpathy's autoresearch, rebuilt from scratch as a task-agnostic framework with multi-seed evaluation, search strategy, cost tracking, a live dashboard, and HTML reporting.

                   +------------------+
                   | Search Strategy  |
                   | (top-k, explore  |
                   |  diversify)      |
                   +--------+---------+
                            |
                   select parent + mode
                            |
                   +--------v---------+
                   |   Claude API     |
                   |  generate code   |
                   +--------+---------+
                            |
                   +--------v---------+
                   |    Validator     |
                   | syntax, imports  |
                   |  safety checks   |
                   +--------+---------+
                            |
             +--------------+--------------+
             |                             |
    +--------v---------+          +--------v---------+
    | Sandbox seed=42  |          | Sandbox seed=137 |
    |  (subprocess)    |          | (short-circuit   |
    +--------+---------+          |  if seed 1 bad)  |
             |                    +--------+---------+
             +--------------+--------------+
                            |
                     mean/std metric
                            |
                   +--------v---------+
                   |  Keep / Discard  |
                   |  Update top-k    |
                   |  Store to SQLite |
                   +------------------+

Each iteration:
- The search strategy picks a parent experiment and a mode (incremental, exploration, or diversification after stagnation)
- Claude receives the parent code, experiment history, and task-specific prompts, then proposes a new experiment
- The validator catches syntax errors, missing API calls, and dangerous patterns before execution
- The code runs in a subprocess sandbox with 2 seeds (short-circuits seed 2 if seed 1 is worse than best)
- Mean metric across seeds determines keep/discard; results go to SQLite with per-seed data, cost, and category
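
The validation step above can be sketched with Python's standard ast module. This is a minimal illustration, not the framework's actual validator.py: the function name validate_source and both name sets (required and banned calls) are hypothetical, chosen only to show the three kinds of check.

```python
import ast

# Hypothetical policy: these sets are illustrative, not the framework's real lists.
REQUIRED_CALLS = {"report_results"}
BANNED_CALLS = {"eval", "exec", "system", "rmtree"}


def validate_source(source: str) -> list[str]:
    """Return a list of problems found in generated code (empty means OK)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    called = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Covers plain calls such as foo() and attribute calls such as os.system().
            if isinstance(func, ast.Name):
                called.add(func.id)
            elif isinstance(func, ast.Attribute):
                called.add(func.attr)

    problems = [f"missing required call: {name}" for name in sorted(REQUIRED_CALLS - called)]
    problems += [f"banned call: {name}" for name in sorted(BANNED_CALLS & called)]
    return problems
```

Checking the syntax tree rather than the raw text avoids false positives from comments and string literals, at the cost of missing calls that are constructed dynamically.
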
| Task | Metric | Budget | Baseline |
|---|---|---|---|
| shakespeare | val_bpb (bits per byte) | 30s | ~2.6 bpb (3-layer transformer, ~200K params) |
| cifar10 | error_rate (1 - accuracy) | 60s | ~0.19 (simple CNN, ~200K params) |
Each task provides a fixed API surface that generated code imports from. The evaluation function is immutable - Claude can change the model and training loop, but not how the metric is computed.

    # Install
    uv pip install -e ".[all]"

Verify your GPU pipeline works before spending API credits.

    # Shakespeare - expect ~2.6 bpb on GPU, ~3.1 bpb on CPU
    uv run -m autoresearch baseline --task shakespeare
    # CIFAR-10 - expect ~0.19 error rate (81% accuracy) on GPU
    uv run -m autoresearch baseline --task cifar10

    export ANTHROPIC_API_KEY=sk-...
    # CIFAR-10: 10 experiments is enough to see keeps, discards, and strategy shifts
    uv run -m autoresearch run --task cifar10 --max-experiments 10

    uv run -m autoresearch dashboard --task cifar10
    # Open http://localhost:8501

The dashboard shows experiment results as they come in - stats bar, experiment table, and metric progression chart all update every 5 seconds.

    # Summary table in terminal
    uv run -m autoresearch status --task cifar10
    # Self-contained HTML report with charts and diffs
    uv run -m autoresearch report --task cifar10 -o cifar10_report.html
    open cifar10_report.html
    # Raw data export
    uv run -m autoresearch export --task cifar10 -o results.tsv

    uv run -m autoresearch run --task shakespeare --max-experiments 10
    uv run -m autoresearch report --task shakespeare -o shakespeare_report.html

Both tasks share the same database, dashboard, and reporting infrastructure. The --task flag switches between them everywhere.

    uv run -m autoresearch <command> [options]

    Commands:
      run          Start the autonomous research loop
      baseline     Run just the baseline model
      status       Show experiment history table
      export       Export results as TSV
      dashboard    Launch live web dashboard (localhost:8501)
      report       Generate self-contained HTML report

    Common options:
      --task               shakespeare | cifar10 (default: shakespeare)
      --train-seconds      Override time budget per experiment
      --max-experiments    Stop after N experiments
      --model              Claude model ID (default: claude-sonnet-4-6)

    src/autoresearch/
      tasks/                  # Task abstraction layer
        base.py               # TaskConfig dataclass
        shakespeare/          # Shakespeare byte-level LM
          api_surface.py      # Fixed API: get_train_loader, evaluate_bpb, report_results
          baseline.py         # Baseline transformer code (string)
          data.py             # Download + cache Shakespeare corpus
          prompt.py           # Task-specific system prompt section
        cifar10/              # CIFAR-10 classification
          api_surface.py      # Fixed API: get_train_loader, evaluate, report_results
          baseline.py         # Baseline CNN code (string)
          data.py             # torchvision CIFAR-10 download
          prompt.py           # Task-specific system prompt section
      research/
        controller.py         # Main loop: Claude -> validate -> sandbox -> keep/discard
        strategy.py           # Top-k candidates, stagnation detection, exploration scheduling
        validator.py          # Pre-execution checks (syntax, imports, safety)
        sandbox.py            # Subprocess execution with multi-seed support
        prompt.py             # System/user prompt construction
        parse.py              # JSON response extraction from Claude output
      db/
        store.py              # SQLite with WAL mode, schema migration, seed_runs table
      dashboard/
        app.py                # FastAPI with htmx polling + Chart.js
      report/
        generator.py          # Self-contained HTML with Plotly charts
      config.py               # Constants, device detection, API cost computation

Task abstraction: Each task is a TaskConfig dataclass containing the metric name, baseline code, API surface module path, and prompt section. Adding a new task means creating a new directory under tasks/ with four files. The registry auto-discovers tasks at import time.
Sandbox isolation: Generated code runs in a subprocess with only PYTHONPATH injected. The API surface modules set torch.manual_seed() from the AUTORESEARCH_SEED environment variable at import time, so seed control is transparent to generated code.
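
The environment-variable handoff can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the code in sandbox.py: run_with_seed and DEMO_SCRIPT are hypothetical names, the child copies the whole parent environment for simplicity (the framework is described as injecting only PYTHONPATH), and random.seed stands in for torch.manual_seed so the example runs without PyTorch.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_with_seed(script_path: str, seed: int, timeout_s: float = 60.0) -> dict:
    """Run a script in a child process with AUTORESEARCH_SEED set.

    The last line of the child's stdout is parsed as JSON.
    """
    env = dict(os.environ, AUTORESEARCH_SEED=str(seed))
    proc = subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip())
    return json.loads(proc.stdout.strip().splitlines()[-1])


# Stand-in for generated code: it seeds itself from the environment,
# the way an API surface module might do at import time.
DEMO_SCRIPT = (
    "import json, os, random\n"
    "random.seed(int(os.environ['AUTORESEARCH_SEED']))\n"
    "print(json.dumps({'val_metric': random.random()}))\n"
)


def demo() -> tuple[dict, dict]:
    """Run the stand-in script twice with the same seed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w") as f:
            f.write(DEMO_SCRIPT)
        return run_with_seed(path, 42), run_with_seed(path, 42)
```

Because the seed travels through the environment, the script itself never mentions it, and two runs with the same seed report the same metric.
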
Multi-seed with short-circuit: Running 2 seeds doubles execution time. The short-circuit skips seed 2 when seed 1 is strictly worse than the current best, saving time on clearly bad experiments while still collecting variance data for promising ones.
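
A minimal sketch of that short-circuit, assuming a lower-is-better metric and the two seeds shown in the diagram. The function name evaluate_seeds, the run_seed callback, and the use of population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, Optional, Sequence


def evaluate_seeds(
    run_seed: Callable[[int], float],
    best_so_far: Optional[float],
    seeds: Sequence[int] = (42, 137),
) -> dict:
    """Run the first seed, then the rest only if the first is not strictly worse.

    Lower metric values are better, so "strictly worse" means greater than best.
    """
    results = [run_seed(seeds[0])]
    if best_so_far is None or results[0] <= best_so_far:
        results += [run_seed(seed) for seed in seeds[1:]]
    return {
        "mean": mean(results),
        "std": pstdev(results),  # 0.0 when only one seed ran
        "n_seeds": len(results),
    }
```

With a current best of 0.25, a first-seed result of 0.30 would stop after one run, while a first-seed result of 0.24 would go on to the second seed.
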
Search strategy: Maintains top-3 candidates instead of just the best. Every 5th experiment forces an exploration mode (radical departure). After 5 consecutive discards, it switches to improving a non-best candidate. Normal iterations sample from top-3 with weighted probability (70/20/10).
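
These rules could be sketched as a single selection function. Several details here are assumptions, since the description does not pin them down: the function name, 1-based experiment numbering, exploration taking precedence over diversification, and exploration starting from the best candidate.

```python
import random


def choose_parent_and_mode(experiment_index, consecutive_discards, top_k, rng=random):
    """Pick (parent, mode) from a best-first list of up to 3 candidates.

    experiment_index is assumed to be 1-based.
    """
    # Every 5th experiment: forced exploration (parent choice is an assumption).
    if experiment_index % 5 == 0:
        return top_k[0], "exploration"
    # Stagnation: after 5 straight discards, build on a non-best candidate.
    if consecutive_discards >= 5 and len(top_k) > 1:
        return rng.choice(top_k[1:]), "diversification"
    # Normal case: weighted sample, 70/20/10 across the top 3.
    weights = [70, 20, 10][: len(top_k)]
    return rng.choices(top_k, weights=weights, k=1)[0], "incremental"
```

Passing a seeded random.Random instance as rng makes the selection reproducible, which is convenient for testing.
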
Database schema: The val_bpb column name is a legacy artifact from the Shakespeare-only version. The metric_value property on ExperimentRecord provides task-agnostic access. Schema migrations use idempotent ALTER TABLE ADD COLUMN wrapped in try/except.
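
The idempotent migration pattern can be sketched with the standard sqlite3 module. The helper name add_column_if_missing and the table and column names are illustrative; the actual schema in store.py is not shown in this README.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column_def: str) -> None:
    """Add a column, treating "already exists" as success."""
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column_def}")
    except sqlite3.OperationalError as exc:
        # SQLite reports an existing column as "duplicate column name: ...".
        if "duplicate column name" not in str(exc):
            raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, val_bpb REAL)")

# Running the migration twice is safe: the second pass is a no-op.
for _ in range(2):
    add_column_if_missing(conn, "experiments", "category TEXT")

columns = [row[1] for row in conn.execute("PRAGMA table_info(experiments)")]
```

Re-raising anything other than the duplicate-column error keeps genuine failures, such as a missing table, from being silently swallowed.
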
- Create src/autoresearch/tasks/yourtask/ with four files:
  - api_surface.py - fixed imports for generated code (get_device, data loaders, evaluate, report_results)
  - baseline.py - BASELINE_CODE string with a working training script
  - data.py - download/cache dataset
  - prompt.py - get_system_prompt_section() describing the API and tips for Claude
- Create __init__.py with a @register("yourtask") factory returning TaskConfig
- Add the import to tasks/__init__.py

The evaluate function must return a scalar where lower is better. The report_results function must emit JSON with a "val_metric" key.
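
One way to satisfy that contract is to print a single JSON line and read it back from the captured output. The signature of report_results below and the parse_metric helper are hypothetical; only the "val_metric" key comes from the requirement above.

```python
import json


def report_results(val_metric: float, **extra) -> str:
    """Print one JSON line containing "val_metric" and return it."""
    line = json.dumps({"val_metric": float(val_metric), **extra})
    print(line)
    return line


def parse_metric(stdout: str) -> float:
    """Return val_metric from the last JSON line in stdout that contains it."""
    for line in reversed(stdout.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "val_metric" in payload:
            return float(payload["val_metric"])
    raise ValueError("no val_metric found in output")
```

Scanning from the end lets training logs precede the result line without confusing the parser.
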
The dashboard is a FastAPI app with htmx for live updates (polls every 5s) and Chart.js for the metric progression chart. Dark theme, no build step.
- Main view: stats bar, experiment table, metric scatter plot with best-so-far line
- Detail view: per-experiment metadata, seed results, loss curve, full generated code
The report generator produces a self-contained HTML file with embedded Plotly charts:
- Metric progression (scatter + best-so-far line)
- Top-5 experiments with unified diffs between them
- Failure analysis pie chart (OOM / syntax / timeout / divergence / validation)
- Category breakdown (architectural / optimizer / hyperparameter / regularization)
- Cumulative cost chart
- Key insights grouped by category
API cost is tracked per experiment. With claude-sonnet-4-6 (default), a typical experiment costs $0.01-0.03. A 50-experiment run costs roughly $0.50-1.50. Cost per experiment varies with prompt size (experiment history grows over time).
The --model flag accepts any Claude model ID. Pricing is looked up from a table in config.py.
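
A table-driven cost lookup might be sketched as follows. The prices and the model name are placeholders, not real figures, and the function name api_cost is hypothetical; the actual table and computation live in config.py.

```python
# Placeholder prices in USD per million tokens. These are NOT real prices.
PRICING = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one API call under the placeholder table."""
    price = PRICING[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000
```

Under these placeholder prices, a call with 4,000 input tokens and 800 output tokens would come to $0.024. Since the input side grows with experiment history, later experiments in a run cost more than early ones.
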
- Python 3.10+
- PyTorch 2.1+ with CUDA, MPS, or CPU
- ANTHROPIC_API_KEY environment variable (for the research loop)
- Optional: fastapi, uvicorn, jinja2 for the dashboard
- Optional: plotly, jinja2 for report generation