autoresearch-search

An autonomous ML research framework that uses Claude to iteratively generate, evaluate, and improve training code. Point it at a task, set a time budget, and let it run - it proposes experiments, validates them, executes in a sandbox, and keeps only the improvements.

Inspired by Karpathy's autoresearch, rebuilt from scratch as a task-agnostic framework with multi-seed evaluation, search strategy, cost tracking, a live dashboard, and HTML reporting.

How it works

                    +------------------+
                    |  Search Strategy |
                    |  (top-k, explore |
                    |   diversify)     |
                    +--------+---------+
                             |
                    select parent + mode
                             |
                    +--------v---------+
                    |   Claude API     |
                    |  generate code   |
                    +--------+---------+
                             |
                    +--------v---------+
                    |   Validator      |
                    |  syntax, imports |
                    |  safety checks   |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
     +--------v---------+         +--------v---------+
     |  Sandbox seed=42 |         |  Sandbox seed=137|
     |  (subprocess)    |         |  (short-circuit  |
     +--------+---------+         |   if seed 1 bad) |
              |                   +--------+---------+
              +----------+----------+
                         |
                  mean/std metric
                         |
                +--------v---------+
                |  Keep / Discard  |
                |  Update top-k    |
                |  Store to SQLite |
                +------------------+

Each iteration:

1. The search strategy picks a parent experiment and a mode (incremental, exploration, or diversification after stagnation).
2. Claude receives the parent code, experiment history, and task-specific prompts, then proposes a new experiment.
3. The validator catches syntax errors, missing API calls, and dangerous patterns before execution.
4. The code runs in a subprocess sandbox with 2 seeds (seed 2 is skipped if seed 1 is worse than the current best).
5. The mean metric across seeds determines keep/discard; results go to SQLite with per-seed data, cost, and category.
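
The validation step can be sketched with the standard `ast` module. This is a minimal illustration, not the project's validator.py: the `validate` function, the `REQUIRED_CALLS` set, and the `FORBIDDEN_MODULES` set are all assumptions chosen for the example.

```python
import ast

# Hypothetical policy for illustration; the real validator's rules may differ.
REQUIRED_CALLS = {"report_results"}
FORBIDDEN_MODULES = {"subprocess", "socket", "shutil"}


def validate(code: str) -> list[str]:
    """Return a list of problems; an empty list means the code may run."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems: list[str] = []
    called: set[str] = set()
    for node in ast.walk(tree):
        # Collect imported module names from both import forms.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] in FORBIDDEN_MODULES:
                problems.append(f"forbidden import: {module}")

        # Record every called name, e.g. report_results(...) or api.report_results(...).
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                called.add(func.id)
            elif isinstance(func, ast.Attribute):
                called.add(func.attr)

    for name in sorted(REQUIRED_CALLS - called):
        problems.append(f"missing required call: {name}")
    return problems
```

A static check like this is cheap compared with a 30-60 second training run, which is the reason to reject malformed code before it reaches the sandbox.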

Supported tasks

| Task | Metric | Budget | Baseline |
| --- | --- | --- | --- |
| `shakespeare` | val_bpb (bits per byte) | 30s | ~2.6 bpb (3-layer transformer, ~200K params) |
| `cifar10` | error_rate (1 - accuracy) | 60s | ~0.19 (simple CNN, ~200K params) |

Each task provides a fixed API surface that generated code imports from. The evaluation function is immutable - Claude can change the model and training loop, but not how the metric is computed.

Quick start

# Install
uv pip install -e ".[all]"

Step 1: Run baselines (no API key needed)

Verify your GPU pipeline works before spending API credits.

# Shakespeare - expect ~2.6 bpb on GPU, ~3.1 bpb on CPU
uv run -m autoresearch baseline --task shakespeare

# CIFAR-10 - expect ~0.19 error rate (81% accuracy) on GPU
uv run -m autoresearch baseline --task cifar10

Step 2: Start the research loop

export ANTHROPIC_API_KEY=sk-...

# CIFAR-10: 10 experiments is enough to see keeps, discards, and strategy shifts
uv run -m autoresearch run --task cifar10 --max-experiments 10

Step 3: Watch live progress (separate terminal)

uv run -m autoresearch dashboard --task cifar10
# Open http://localhost:8501

The dashboard shows experiment results as they come in - stats bar, experiment table, and metric progression chart all update every 5 seconds.

Step 4: Review results

# Summary table in terminal
uv run -m autoresearch status --task cifar10

# Self-contained HTML report with charts and diffs
uv run -m autoresearch report --task cifar10 -o cifar10_report.html
open cifar10_report.html

# Raw data export
uv run -m autoresearch export --task cifar10 -o results.tsv

Step 5: Try the other task

uv run -m autoresearch run --task shakespeare --max-experiments 10
uv run -m autoresearch report --task shakespeare -o shakespeare_report.html

Both tasks share the same database, dashboard, and reporting infrastructure. The --task flag switches between them everywhere.

CLI reference

uv run -m autoresearch <command> [options]

Commands:
  run        Start the autonomous research loop
  baseline   Run just the baseline model
  status     Show experiment history table
  export     Export results as TSV
  dashboard  Launch live web dashboard (localhost:8501)
  report     Generate self-contained HTML report

Common options:
  --task           shakespeare | cifar10  (default: shakespeare)
  --train-seconds  Override time budget per experiment
  --max-experiments  Stop after N experiments
  --model          Claude model ID (default: claude-sonnet-4-6)

Architecture

src/autoresearch/
    tasks/                  # Task abstraction layer
        base.py             # TaskConfig dataclass
        shakespeare/        # Shakespeare byte-level LM
            api_surface.py  # Fixed API: get_train_loader, evaluate_bpb, report_results
            baseline.py     # Baseline transformer code (string)
            data.py         # Download + cache Shakespeare corpus
            prompt.py       # Task-specific system prompt section
        cifar10/            # CIFAR-10 classification
            api_surface.py  # Fixed API: get_train_loader, evaluate, report_results
            baseline.py     # Baseline CNN code (string)
            data.py         # torchvision CIFAR-10 download
            prompt.py       # Task-specific system prompt section
    research/
        controller.py       # Main loop: Claude -> validate -> sandbox -> keep/discard
        strategy.py         # Top-k candidates, stagnation detection, exploration scheduling
        validator.py        # Pre-execution checks (syntax, imports, safety)
        sandbox.py          # Subprocess execution with multi-seed support
        prompt.py           # System/user prompt construction
        parse.py            # JSON response extraction from Claude output
    db/
        store.py            # SQLite with WAL mode, schema migration, seed_runs table
    dashboard/
        app.py              # FastAPI with htmx polling + Chart.js
    report/
        generator.py        # Self-contained HTML with Plotly charts
    config.py               # Constants, device detection, API cost computation

Key design decisions

Task abstraction: Each task is a TaskConfig dataclass containing the metric name, baseline code, API surface module path, and prompt section. Adding a new task means creating a new directory under tasks/ with four files. The registry auto-discovers tasks at import time.
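
A decorator-based registry of this kind might look roughly like the sketch below. The `TaskConfig` field names, the `_REGISTRY` dict, and the `get_task` helper are assumptions inferred from the description, not the actual contents of base.py.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TaskConfig:
    # Field names are assumptions based on the prose description.
    name: str
    metric_name: str
    baseline_code: str
    api_surface_module: str
    prompt_section: str


# Maps task name -> zero-argument factory that builds the TaskConfig lazily.
_REGISTRY: dict[str, Callable[[], TaskConfig]] = {}


def register(name: str):
    """Decorator that records a TaskConfig factory under the given name."""
    def decorator(factory: Callable[[], TaskConfig]):
        _REGISTRY[name] = factory
        return factory
    return decorator


def get_task(name: str) -> TaskConfig:
    """Hypothetical lookup helper; raises ValueError for unknown tasks."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown task {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register("toy")
def _toy_task() -> TaskConfig:
    return TaskConfig(
        name="toy",
        metric_name="val_metric",
        baseline_code="print('baseline')",
        api_surface_module="autoresearch.tasks.toy.api_surface",
        prompt_section="Describe the toy task API here.",
    )
```

Because registration happens as a side effect of the decorator running, a task only becomes visible once its module has been imported.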

Sandbox isolation: Generated code runs in a subprocess with only PYTHONPATH injected. The API surface modules set torch.manual_seed() from the AUTORESEARCH_SEED environment variable at import time, so seed control is transparent to generated code.
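
The seed hand-off can be illustrated with the standard library alone. In this sketch `run_with_seed` and `CHILD` are hypothetical names, the child process uses `random.seed` as a stand-in for `torch.manual_seed` so the example runs without PyTorch, and copying the parent environment is an assumption about how the sandbox builds its environment.

```python
import os
import subprocess
import sys


def run_with_seed(code: str, seed: int, pythonpath: str = "") -> str:
    """Run code in a child interpreter with AUTORESEARCH_SEED set."""
    env = dict(os.environ)  # assumption: start from the parent environment
    env["AUTORESEARCH_SEED"] = str(seed)
    if pythonpath:
        env["PYTHONPATH"] = pythonpath
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        timeout=60,
        check=True,
    )
    return result.stdout


# Stand-in for an API surface module: it seeds the RNG at import time,
# so the "generated" code never has to mention the seed.
CHILD = """
import os, random
random.seed(int(os.environ["AUTORESEARCH_SEED"]))
print(random.random())
"""
```

Running the same child twice with the same seed should print the same number, while a different seed should print a different one.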

Multi-seed with short-circuit: Running 2 seeds doubles execution time. The short-circuit skips seed 2 when seed 1 is strictly worse than the current best, saving time on clearly bad experiments while still collecting variance data for promising ones.
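
The short-circuit rule can be sketched as follows. The `evaluate_seeds` function and its `run_seed` callback are hypothetical, and using the population standard deviation is an assumption since the text does not say which estimator is used. Lower metric values are treated as better, consistent with both supported tasks.

```python
from statistics import mean, pstdev
from typing import Callable, Optional, Sequence


def evaluate_seeds(
    run_seed: Callable[[int], float],
    best: Optional[float],
    seeds: Sequence[int] = (42, 137),
) -> tuple[float, float, list[float]]:
    """Return (mean, std, per-seed values), skipping later seeds when the
    first seed is strictly worse (higher) than the current best."""
    values: list[float] = []
    for index, seed in enumerate(seeds):
        values.append(run_seed(seed))
        if index == 0 and best is not None and values[0] > best:
            break  # clearly bad experiment: do not spend time on seed 2
    std = pstdev(values) if len(values) > 1 else 0.0
    return mean(values), std, values
```

For example, with a first-seed result of 0.5 and a current best of 0.4, only one seed runs; with a current best of 0.6, both seeds run and the mean decides keep/discard.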

Search strategy: Maintains the top-3 candidates instead of just the best. Normal (incremental) iterations sample a parent from the top-3 with weighted probability (70/20/10). Every 5th experiment forces exploration mode (a radical departure). After 5 consecutive discards, the strategy switches to diversification, which improves a non-best candidate.
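
One way to express these rules in code is sketched below. The `choose` function is hypothetical. Two details are assumptions because the text leaves them open: exploration takes precedence over diversification when both apply, and exploration starts from the best candidate. Experiment indices are assumed to start at 1.

```python
import random

TOP_K_WEIGHTS = (0.7, 0.2, 0.1)  # weights for the best, second, third candidate


def choose(
    experiment_index: int,
    consecutive_discards: int,
    top_k: list[str],
    rng: random.Random,
) -> tuple[str, str]:
    """Return (parent, mode). top_k is sorted best first."""
    if not top_k:
        raise ValueError("top_k must contain at least one candidate")
    candidates = top_k[: len(TOP_K_WEIGHTS)]

    # Every 5th experiment is a forced exploration step.
    if experiment_index % 5 == 0:
        return candidates[0], "exploration"

    # After 5 consecutive discards, work on a non-best candidate instead.
    if consecutive_discards >= 5 and len(candidates) > 1:
        return rng.choice(candidates[1:]), "diversification"

    # Otherwise sample a parent from the top-k with 70/20/10 weights.
    weights = TOP_K_WEIGHTS[: len(candidates)]
    parent = rng.choices(candidates, weights=weights, k=1)[0]
    return parent, "incremental"
```

Passing in a seeded `random.Random` keeps parent selection reproducible, which helps when comparing two runs of the loop.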

Database schema: The val_bpb column name is a legacy artifact from the Shakespeare-only version. The metric_value property on ExperimentRecord provides task-agnostic access. Schema migrations use idempotent ALTER TABLE ADD COLUMN wrapped in try/except.
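
An idempotent column migration in this style can be sketched with the standard `sqlite3` module. The `add_column_if_missing` helper, the `experiments` table, and the `cost_usd` column are illustrative names, not the project's actual schema.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column_def: str) -> None:
    """Run ALTER TABLE ADD COLUMN, ignoring the error raised when the
    column already exists so the migration is safe to repeat."""
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column_def}")
    except sqlite3.OperationalError as exc:
        # SQLite reports an existing column as "duplicate column name: ...".
        if "duplicate column name" not in str(exc):
            raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, val_bpb REAL)")
add_column_if_missing(conn, "experiments", "cost_usd REAL")
add_column_if_missing(conn, "experiments", "cost_usd REAL")  # second call is a no-op
columns = [row[1] for row in conn.execute("PRAGMA table_info(experiments)")]
```

Re-raising anything other than the duplicate-column error keeps genuine failures, such as a missing table, from being silently swallowed.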

Adding a new task

1. Create src/autoresearch/tasks/yourtask/ with four files:
   - api_surface.py - fixed imports for generated code (get_device, data loaders, evaluate, report_results)
   - baseline.py - BASELINE_CODE string with a working training script
   - data.py - download/cache dataset
   - prompt.py - get_system_prompt_section() describing the API and tips for Claude
2. Create __init__.py with a @register("yourtask") factory returning TaskConfig.
3. Add the import to tasks/__init__.py.

The evaluate function must return a scalar where lower is better. The report_results function must emit JSON with a "val_metric" key.
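
As a rough illustration of that contract, the sketch below emits one JSON line with a "val_metric" key and reads it back from captured output. The signature of `report_results` and the `parse_metric` helper are assumptions; only the "val_metric" key comes from the requirement above.

```python
import json
import math


def report_results(val_metric: float, **extra) -> str:
    """Print a single JSON line containing the metric (lower is better)."""
    value = float(val_metric)
    if not math.isfinite(value):
        raise ValueError("val_metric must be a finite scalar")
    line = json.dumps({"val_metric": value, **extra})
    print(line, flush=True)
    return line


def parse_metric(stdout: str) -> float:
    """Hypothetical reader: take the last JSON line that has a val_metric key."""
    for line in reversed(stdout.strip().splitlines()):
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip ordinary log lines
        if isinstance(payload, dict) and "val_metric" in payload:
            return float(payload["val_metric"])
    raise ValueError("no val_metric found in output")
```

Scanning from the end of the output tolerates training logs printed before the final result.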

Dashboard

The dashboard is a FastAPI app with htmx for live updates (polls every 5s) and Chart.js for the metric progression chart. Dark theme, no build step.

- Main view: stats bar, experiment table, metric scatter plot with best-so-far line
- Detail view: per-experiment metadata, seed results, loss curve, full generated code

Report

The report generator produces a self-contained HTML file with embedded Plotly charts:

- Metric progression (scatter + best-so-far line)
- Top-5 experiments with unified diffs between them
- Failure analysis pie chart (OOM / syntax / timeout / divergence / validation)
- Category breakdown (architectural / optimizer / hyperparameter / regularization)
- Cumulative cost chart
- Key insights grouped by category
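
The unified diffs between experiments could be produced with the standard `difflib` module, as in the sketch below. The `experiment_diff` helper and the file labels are hypothetical; how generator.py actually builds its diffs is not stated here.

```python
import difflib


def experiment_diff(
    old_code: str,
    new_code: str,
    old_name: str = "experiment_a.py",
    new_name: str = "experiment_b.py",
) -> str:
    """Return a unified diff between two generated training scripts."""
    lines = difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(lines)


before = "lr = 0.001\nlayers = 3\n"
after = "lr = 0.003\nlayers = 3\n"
diff_text = experiment_diff(before, after)
```

The resulting text can be embedded in the HTML report inside a preformatted block, which keeps the report self-contained.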

Cost

API cost is tracked per experiment. With claude-sonnet-4-6 (default), a typical experiment costs $0.01-0.03. A 50-experiment run costs roughly $0.50-1.50. Cost per experiment varies with prompt size (experiment history grows over time).

The --model flag accepts any Claude model ID. Pricing is looked up from a table in config.py.
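
A table-driven cost lookup might look like the sketch below. The `PRICING` table, the `example-model` ID, and the rates in it are placeholders invented for the example; they are not real prices and not the contents of config.py.

```python
# Placeholder rates in USD per million tokens. These numbers are NOT real
# prices; substitute the published rates for the model you use.
PRICING = {
    "example-model": {"input": 1.0, "output": 2.0},
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the cost of one API call from token counts."""
    if model not in PRICING:
        raise KeyError(f"no pricing entry for model {model!r}")
    rates = PRICING[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000
```

Since the experiment history is part of the prompt, `input_tokens` tends to grow over a run, which is why later experiments cost more than early ones.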

Requirements

- Python 3.10+
- PyTorch 2.1+ with CUDA, MPS, or CPU
- ANTHROPIC_API_KEY environment variable (for the research loop)
- Optional: fastapi, uvicorn, jinja2 for the dashboard
- Optional: plotly, jinja2 for report generation
