A literature-grounded idea engine for AI/CS research. Give it a one-line research direction; get back a ranked portfolio of 50–100 concrete paper ideas, each one recombined from genes distilled out of real, freshly-retrieved papers, scored for novelty/feasibility/impact, then adversarially reviewed.
"diffusion models for combinatorial optimization"
↓
brainstorm → retrieve → digest → hybridize → prioritize → critique

- brainstorm (opus): 10 variants + queries
- retrieve (sonnet): ~100 papers, dedup + rank
- digest (sonnet): "genes" (parallel analysis)
- hybridize (opus ×N): 50–100 ideas + key insights
- prioritize (opus): scored & ranked → ideas.json
- critique (opus ×N): adversarial review + per-step credit → credit_summary.json

Most research novelty is recombination: new work tends to pair otherwise-conventional prior ideas in an unusual combination (cf. literature-based discovery, Swanson's undiscovered-public-knowledge "ABC" model; atypical combinations, Uzzi et al., Science 2013). The bottleneck isn't generating a combination; it's systematically enumerating and triaging the combinatorial space of plausible ones across a literature you can't fully hold in your head.
Auto Research Idea operationalizes that:
- Retrieve the relevant slice of literature (multi-source, deduped, ranked).
- Distill each paper into a reusable gene: its core mechanism, the assumption it relies on, where it breaks.
- Recombine genes across papers into candidate ideas, each with an explicit key insight and parent genes (full provenance: you can trace every idea back to the papers it came from).
- Score & rank on novelty / feasibility / impact, dedup near-twins.
- Critique each survivor adversarially (strengths, defects, overlap with the retrieved literature) and attribute every flaw to the pipeline step that caused it, so the system itself gets a feedback signal (which step is the weak link).
The output isn't one "perfect" idea; it's a ranked, red-teamed menu of grounded directions you'd never have enumerated by hand, with the receipts to vet each.
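The distill and recombine steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Gene` class, its field names, and `cross_paper_pairs` are hypothetical, the sample data is placeholder text, and the real hybridizer is an LLM agent rather than a pairwise enumeration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: one reusable idea distilled from a paper."""
    paper: str
    mechanism: str
    assumption: str
    failure_mode: str


def cross_paper_pairs(genes: List[Gene]) -> List[Tuple[Gene, Gene]]:
    """Enumerate gene pairs whose parents come from different papers."""
    return [(a, b) for a, b in combinations(genes, 2) if a.paper != b.paper]


# Placeholder genes, invented for illustration only.
genes = [
    Gene("paper A", "diffusion denoiser", "smooth solution space", "discrete constraints"),
    Gene("paper A", "annealed sampling", "slow cooling is affordable", "tight time budgets"),
    Gene("paper B", "large-neighborhood search", "good repair heuristic", "poor neighborhoods"),
]

pairs = cross_paper_pairs(genes)
for a, b in pairs:
    # Each candidate keeps both parents, so its provenance stays traceable.
    print(f"{a.mechanism} x {b.mechanism}  <- {a.paper}, {b.paper}")
```

Even this toy version shows why triage matters: the number of cross-paper pairs grows quadratically with the number of genes.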
A ranked ideas.json (and a live web view of it). Each idea carries its lineage
and its critique:
Every run is reproducible and inspectable: all intermediate artifacts
(variants, retrieved papers, genes, raw candidates, ranking, reviews, credit
summary) are written as JSON under runs/<id>/ and rendered live by the dashboard,
where you can also add your own notes/scores without touching the generated files.
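One way to picture "notes without touching the generated files" is an overlay that is merged at read time. The sketch below is an assumption about how such an overlay could work; the annotation schema (keyed by idea title, with `note` and `score` fields) and the `apply_annotations` helper are hypothetical, not the dashboard's actual format.

```python
import copy
from typing import Any, Dict, List


def apply_annotations(ideas: List[Dict[str, Any]],
                      annotations: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a merged view; the generated ideas list is left unmodified."""
    merged = []
    for idea in ideas:
        view = copy.deepcopy(idea)
        # Assumption: annotations are keyed by idea title.
        view["annotation"] = annotations.get(idea["title"], {})
        merged.append(view)
    return merged


ideas = [{"title": "Idea 1", "scores": {"novelty": 8}}]
annotations = {"Idea 1": {"note": "check overlap with prior work", "score": 9}}

view = apply_annotations(ideas, annotations)
print(view[0]["annotation"])
```

Because the merge works on a deep copy, the generated data is never mutated; only the overlay file would ever need to be written.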
1. Install (Python 3.8 + a handful of deps):
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
cp .env.example .env # optional: see "Keys" below; works without any key
2. Start the dashboard (separate terminal):
.venv/bin/python -m auto_research_idea.dashboard # → http://localhost:8000
3. Run the pipeline by talking to Claude Code in this directory:
use the research-ideas skill on: diffusion models for combinatorial optimization
Claude Code brainstorms variants, spawns the retrieve/digest subagents, fans out
parallel hybridizers, prioritizes, runs parallel critics, and updates the dashboard
live. Final ranked ideas land in runs/<id>/ideas.json and on the page.
(Alternatively: type your idea into the dashboard's box to queue it, then tell Claude Code "run the queued research-ideas request".)
The default pipeline runs entirely on your Claude Code subscription; the
subagents digest in-context, so no ANTHROPIC_API_KEY is required. Keys only
unlock extras / raise rate limits, and all go in .env (never .env.example):
| Variable | Needed for | Without it |
|---|---|---|
| `ANTHROPIC_API_KEY` | the standalone `digest.py` CLI (API credits) | pipeline still works; CLI digest won't |
| `SEMANTIC_SCHOLAR_API_KEY` | higher Semantic Scholar rate limits | S2 source fails soft (other sources still work) |
| `GITHUB_TOKEN` | the github awesome-list source | 60 req/hr instead of 5000 |
| `CONTACT_EMAIL` | politeness with arXiv / OpenAlex | slightly stricter rate limits |

| Stage | What happens | Artifact |
|---|---|---|
| Brainstorm | seed idea → 10 distinct framings + targeted search queries | brainstorm.json |
| Retrieve | queries fan out across 8 sources → merge, dedup by title, rank (~100 papers) | papers.json |
| Digest | each paper → a structured gene: mechanism, assumption, failure mode | genes_<k>.json |
| Hybridize | N Opus agents in parallel cross-breed genes → candidates with key insight + provenance | ideas_raw_<k>.json |
| Prioritize | score (novelty/feasibility/impact), dedup near-duplicates, rank | ideas.json |
| Critique | N Opus critics adversarially review each idea + assign per-step credit/blame | reviews_<k>.json |
| Credit | aggregate the reviews → which pipeline step is the weak link | credit_summary.json |
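
The Credit stage is plain aggregation, so it is easy to picture in a few lines. The sketch below is illustrative only: the `blame` field (a list of step names per review) and the `aggregate_blame` helper are assumptions, and the real `reviews_<k>.json` schema may differ.

```python
from collections import Counter
from typing import Any, Dict, List


def aggregate_blame(reviews: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count how often each pipeline step is blamed across all reviews."""
    counts: Counter = Counter()
    for review in reviews:
        # Assumption: each review lists the steps it blames under "blame".
        counts.update(review.get("blame", []))
    return dict(counts.most_common())


# Placeholder reviews, invented for illustration only.
reviews = [
    {"verdict": "weak", "blame": ["retrieve"]},
    {"verdict": "promising-with-caveats", "blame": ["retrieve", "hybridize"]},
    {"verdict": "promising", "blame": []},
]

summary = aggregate_blame(reviews)
weakest = max(summary, key=summary.get)
print(summary, "-> weakest step:", weakest)
```

A count like this is the kind of feedback signal the pipeline description refers to: the step blamed most often is the first candidate for re-prompting.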
Because each stage is a file, you can stop, inspect, edit, and resume: for example,
hand-curate genes_<k>.json before hybridizing, or re-run prioritization with
different weights. The dashboard also writes a non-destructive annotations.json
for your own notes/score/rank overrides, so ideas.json is never overwritten.
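
Re-ranking with different weights can also be tried offline on the score fields. This is a sketch under assumptions: the `scores` shape follows the example idea entry in this README, but the linear weighting and the `rerank` helper are hypothetical and are not how the LLM prioritizer itself decides.

```python
from typing import Any, Dict, List


def rerank(ideas: List[Dict[str, Any]], weights: Dict[str, float]) -> List[Dict[str, Any]]:
    """Sort ideas by a weighted sum of their scores, best first."""
    def total(idea: Dict[str, Any]) -> float:
        return sum(w * idea["scores"][axis] for axis, w in weights.items())
    # sorted() returns a new list, so the input order is preserved.
    return sorted(ideas, key=total, reverse=True)


# Placeholder ideas, invented for illustration only.
ideas = [
    {"title": "A", "scores": {"novelty": 8, "feasibility": 6, "impact": 7}},
    {"title": "B", "scores": {"novelty": 5, "feasibility": 9, "impact": 7}},
]

novelty_first = rerank(ideas, {"novelty": 0.6, "feasibility": 0.2, "impact": 0.2})
feasibility_first = rerank(ideas, {"novelty": 0.2, "feasibility": 0.6, "impact": 0.2})
print([i["title"] for i in novelty_first], [i["title"] for i in feasibility_first])
```

The same two ideas swap places depending on the weights, which is why the ranking is best read as a triage aid.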
The interesting bit isn't just the output; it's the orchestration pattern:
Skill/subagent = brain; Python = hands. Open-ended creative + judgment work is delegated to LLM subagents; deterministic, parallel, mechanical work (retrieval, dedup, ranking, credit aggregation) is plain Python. Each side does what it's good at.
- Orchestrator: Claude Code running the `research-ideas` skill (`.claude/skills/research-ideas/`). Delegates each step to a subagent and tracks progress in `runs/<id>/status.json`.
- One subagent per step (`.claude/agents/*.md`): independently promptable and swappable, with task-based model routing. Cheap mechanical steps run on Sonnet, open-ended creative/judgment steps on Opus.

  | Step | Agent (model) | Writes | Role |
  |---|---|---|---|
  | 1 | `idea-brainstormer` (opus) | `brainstorm.json` | 10 idea variants + search queries |
  | 2 | `paper-retriever` (sonnet) | `papers.json` | drives `search` tool → ~100 deduped, ranked papers |
  | 3 | `paper-digester` (sonnet) | `genes_<k>.json` | distills each paper into a reusable gene |
  | 4 | `idea-hybridizer` (opus, ×N) | `ideas_raw_<k>.json` | recombines genes → 50–100 candidates |
  | 5 | `idea-prioritizer` (opus) | `ideas.json` | scores, dedups, ranks |
  | 6 | `idea-critic` (opus, ×N) | `reviews_<k>.json` | adversarial review + per-step credit assignment |

  Then `credit.py` (pure Python, no key) pools the reviews into `credit_summary.json`, a feedback signal for which step to improve next.
- Multi-source retrieval (`auto_research_idea/sources/`): pluggable `PaperSource` backends fanned out in parallel with backoff; a registry merges results by normalized title, enriching each paper from every source that found it. Eight sources ship today:

  | Source | Covers |
  |---|---|
  | `arxiv` | arXiv preprints |
  | `openalex` | broad cross-venue index + citation counts |
  | `semantic_scholar` | abstracts + citations (key optional) |
  | `github` | `awesome-<topic>` reading lists |
  | `venue_pages` | official accepted-paper listings (CVF, NeurIPS, …) |
  | `openreview` | ICLR / NeurIPS / etc. on OpenReview |
  | `acl_anthology` | ACL / EMNLP / NAACL (+ Findings) |
  | `ecva` | ECCV open-access proceedings |

  Sources fail soft: a dead backend returns `[]`, never crashes the run.
- Dashboard (`dashboard.py`, stdlib only): renders variants, papers, genes, ranked ideas, reviews, and the credit summary live. It never runs the pipeline, so it can't corrupt a run; the only thing it writes is your `annotations.json`.
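
The merge-by-normalized-title and fail-soft behavior described above can be sketched as follows. This is an illustration under assumptions, not the registry's code: `normalize_title`, `safe_search`, `merge_by_title`, the paper fields, and the three fake sources are all hypothetical, and the sketch runs sources sequentially where the real registry fans them out in parallel with backoff.

```python
import re
from typing import Any, Callable, Dict, List

Paper = Dict[str, Any]


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def safe_search(source: Callable[[str], List[Paper]], query: str) -> List[Paper]:
    """Fail soft: a broken backend contributes no papers instead of raising."""
    try:
        return source(query)
    except Exception:
        return []


def merge_by_title(batches: List[List[Paper]]) -> List[Paper]:
    """Merge papers sharing a normalized title, filling in missing fields."""
    merged: Dict[str, Paper] = {}
    for batch in batches:
        for paper in batch:
            entry = merged.setdefault(normalize_title(paper["title"]), {"sources": []})
            for field, value in paper.items():
                if field == "source":
                    entry["sources"].append(value)
                elif value is not None and entry.get(field) is None:
                    entry[field] = value  # first non-empty value wins
    return list(merged.values())


# Hypothetical stand-in backends, invented for illustration only.
def fake_arxiv(query: str) -> List[Paper]:
    return [{"title": "Diffusion for TSP!", "source": "arxiv", "abstract": None}]


def fake_openalex(query: str) -> List[Paper]:
    return [{"title": "diffusion for  tsp", "source": "openalex",
             "abstract": "...", "citations": 12}]


def dead_source(query: str) -> List[Paper]:
    raise TimeoutError("backend down")


batches = [safe_search(s, "diffusion TSP") for s in (fake_arxiv, fake_openalex, dead_source)]
papers = merge_by_title(batches)
print(papers)
```

The two title variants collapse into one record that lists both sources, and the dead backend simply contributes nothing.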
The whole thing coordinates through one contract: JSON files in runs/<id>/.
Orchestrator, subagents, and dashboard share nothing else, which is what makes
runs reproducible and every stage independently hackable.
config.yaml tunes the tools: which of the 8 paper sources are enabled (plus the
per-venue venue_pages / openreview / acl_anthology / ecva registries),
retrieval limits (retrieval.max_papers ≈ 100, enrich_abstracts, parse_pdf),
and the digest model/effort. Orchestration knobs (10 variants, number of parallel
hybridizers → 50–100 ideas, number of critics) live in the skill and agent files.
The retrieval and credit paths need no key β exercise them standalone:
.venv/bin/python -m auto_research_idea.search --queries "graph neural network CO" --out papers.json
.venv/bin/python -m auto_research_idea.digest --papers papers.json --out genes.json # needs ANTHROPIC_API_KEY
.venv/bin/python -m auto_research_idea.credit --run-dir runs/<id> # aggregate reviews, no key
.claude/
skills/research-ideas/SKILL.md # orchestrator
agents/ # one subagent per step
idea-brainstormer.md paper-retriever.md paper-digester.md
idea-hybridizer.md idea-prioritizer.md idea-critic.md
auto_research_idea/
search.py # tool: queries -> papers.json (search + dedup + rank)
digest.py # tool: papers.json -> genes.json (parallel analysis)
credit.py # tool: reviews_*.json -> credit_summary.json (pure aggregation)
dashboard.py # live web dashboard + non-destructive annotations (stdlib only)
runstate.py # run-dir status contract (orchestrator <-> dashboard)
llm.py models.py config.py
sources/ # arxiv, openalex, semantic_scholar, github_awesome, venue_pages,
# openreview, acl_anthology, ecva, pdf_extract, _http, registry
config.yaml requirements.txt
runs/<id>/ # per-run artifacts (created at runtime; git-ignored)
Extend it: add a paper source by subclassing PaperSource in sources/
(must never raise; return [] on failure) and listing it under sources: in
config.yaml. Re-prompt a step by editing its .claude/agents/*.md. Change the
run-dir contract in runstate.py (the dashboard reads the same shapes).
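
A new source might look roughly like the sketch below. Treat it as a shape, not a template: the `PaperSource` class here is a local stand-in (the real base class in sources/ may have a different interface), and `MyJsonSource`, its `search` signature, and the JSON endpoint it reads are hypothetical. The one part taken from this README is the contract: never raise, return [] on failure.

```python
import json
import urllib.request
from typing import Any, Dict, List


class PaperSource:
    """Local stand-in for the project's base class; the real interface may differ."""
    name = "base"

    def search(self, query: str, limit: int = 20) -> List[Dict[str, Any]]:
        raise NotImplementedError


class MyJsonSource(PaperSource):
    """Hypothetical backend that reads a JSON list of papers from a URL."""
    name = "my_json"

    def __init__(self, url: str, timeout: float = 10.0) -> None:
        self.url = url
        self.timeout = timeout

    def search(self, query: str, limit: int = 20) -> List[Dict[str, Any]]:
        try:
            with urllib.request.urlopen(self.url, timeout=self.timeout) as resp:
                items = json.load(resp)
            hits = [p for p in items if query.lower() in p.get("title", "").lower()]
            return hits[:limit]
        except Exception:
            # Contract: a failing backend returns [] and never crashes the run.
            return []


# A malformed URL exercises the failure path without touching the network.
print(MyJsonSource("not-a-valid-url").search("diffusion"))
```

Wrapping the whole body in one `try` keeps the fail-soft guarantee even for errors you did not anticipate, such as malformed JSON.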
- Idea quality is bounded by retrieval. A thin or off-target paper set yields thin ideas; tune queries / max_papers / sources for unfamiliar areas.
- Scores and reviews are LLM judgments, not ground truth. Treat novelty/feasibility/impact and the critic's verdicts as a triage signal to skim 100 ideas down to 10, not as a verdict.
- Novelty ≠ correctness. A high-scoring idea can still be subtly known or flawed; the critique + provenance are there so you can verify, fast. This is a tool for augmenting researcher ideation, not replacing the literature review.
- Python 3.8 + anthropic 0.72.0: the code deliberately avoids newer SDK features (messages.parse, int | None annotations). See CLAUDE.md for the full constraints before you refactor.
- No test framework: verify by importing the modules and running the real (keyless) tools (CLAUDE.md → Verifying changes).
Novelty is recombination at scale. This just does the bookkeeping, and red-teams the result.
Example entry from ideas.json:

    {
      "title": "Annealed Diffusion Samplers for Large-Neighborhood Search in MILP",
      "key_insight": "Treat the diffusion denoiser as a learned neighborhood proposal distribution, annealing temperature to trade exploration vs. repair...",
      "parent_genes": ["gene from paper A", "gene from paper B"],  // provenance
      "scores": { "novelty": 8, "feasibility": 6, "impact": 7 },
      "why_it_might_work": "...",
      "risks": "...",
      "review": {
        "verdict": "promising-with-caveats",  // from the critic step
        "defects": ["closest prior work is ...", "..."],
        "overlap": "partial; differs from X in ..."
      }
    }