xd-liu/AutoResearch-Idea

🧬 Auto Research Idea

A literature-grounded idea engine for AI/CS research. Give it a one-line research direction; get back a ranked portfolio of 50–100 concrete paper ideas, each one recombined from genes distilled out of real, freshly retrieved papers, scored for novelty/feasibility/impact, then adversarially reviewed.

Python 3.8 · Runtime: Claude Code · Multi-agent · Status: research preview

   "diffusion models for combinatorial optimization"
                      │
   brainstorm ─▶ retrieve ─▶ digest ─▶ hybridize ─▶ prioritize ─▶ critique
    (opus)       (sonnet)   (sonnet)   (opus ×N)     (opus)        (opus ×N)
      │            │           │          │             │             │
  10 variants   ~100 papers  "genes"   50–100 ideas  scored &     adversarial review
  + queries     dedup+rank   (parallel  + key         ranked       + per-step credit
                              analysis)   insights    → ideas.json   → credit_summary.json

💡 Why this exists

Most research novelty is recombination: new work tends to pair an unusual combination of otherwise-conventional prior ideas (cf. literature-based discovery, Swanson's undiscovered-public-knowledge "ABC" model; atypical combinations, Uzzi et al., Science 2013). The bottleneck isn't generating a combination; it's systematically enumerating and triaging the combinatorial space of plausible ones across a literature you can't fully hold in your head.

Auto Research Idea operationalizes that:

  1. Retrieve the relevant slice of literature (multi-source, deduped, ranked).
  2. Distill each paper into a reusable gene: its core mechanism, the assumption it relies on, where it breaks.
  3. Recombine genes across papers into candidate ideas, each with an explicit key insight and parent genes (full provenance: you can trace every idea back to the papers it came from).
  4. Score & rank on novelty / feasibility / impact, dedup near-twins.
  5. Critique each survivor adversarially (strengths, defects, overlap with the retrieved literature) and attribute every flaw to the pipeline step that caused it, so the system itself gets a feedback signal (which step is the weak link).

The output isn't one "perfect" idea; it's a ranked, red-teamed menu of grounded directions you'd never have enumerated by hand, with the receipts to vet each.
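To make the "combinatorial space" concrete, here is a minimal sketch of enumerating candidate gene pairings across papers. The `Gene` record and `cross_paper_pairs` helper are hypothetical names for illustration; only the three descriptive fields (mechanism, assumption, failure mode) come from the description above, and the real hybridizer is an LLM agent that picks and combines genes rather than a plain loop.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; field names mirror the description above."""
    paper_id: str
    mechanism: str
    assumption: str
    failure_mode: str


def cross_paper_pairs(genes: List[Gene]) -> Iterator[Tuple[Gene, Gene]]:
    """Yield every unordered pair of genes that come from different papers."""
    for a, b in combinations(genes, 2):
        if a.paper_id != b.paper_id:
            yield a, b


# Placeholder genes, invented purely for the example.
genes = [
    Gene("paper-A", "diffusion denoiser", "smooth score function", "discrete constraints"),
    Gene("paper-B", "large-neighborhood search", "good repair operator", "huge neighborhoods"),
    Gene("paper-B", "learned branching", "labelled solver traces", "distribution shift"),
]
pairs = list(cross_paper_pairs(genes))
print(len(pairs))  # 2: each paper-B gene pairs with the paper-A gene
```

With one gene per paper, n papers give n(n-1)/2 pairs, so about 100 papers already yield 4,950 pairwise candidates before any three-way combination. That is why triage, not generation, is the hard part.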


✨ What you get

A ranked ideas.json (and a live web view of it). Each idea carries its lineage and its critique:

{
  "title": "Annealed Diffusion Samplers for Large-Neighborhood Search in MILP",
  "key_insight": "Treat the diffusion denoiser as a learned neighborhood proposal
                  distribution, annealing temperature to trade exploration vs. repair...",
  "parent_genes": ["gene from paper A", "gene from paper B"],   // provenance
  "scores": { "novelty": 8, "feasibility": 6, "impact": 7 },
  "why_it_might_work": "...",
  "risks": "...",
  "review": { "verdict": "promising-with-caveats",             // from the critic step
              "defects": ["closest prior work is ...", "..."],
              "overlap": "partial: differs from X in ..." }
}

Every run is reproducible and inspectable: all intermediate artifacts (variants, retrieved papers, genes, raw candidates, ranking, reviews, credit summary) are written as JSON under runs/<id>/ and rendered live by the dashboard, where you can also add your own notes/scores without touching the generated files.
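Because the artifacts are plain JSON, a quick sanity check on an idea record is easy to script. The sketch below assumes only the keys visible in the example record above; the real schema may carry more fields, and `missing_fields` is a hypothetical helper, not part of the project.

```python
import json
from typing import Any, Dict, List

# Keys taken from the example record above; the real schema may carry more.
EXPECTED_KEYS = ("title", "key_insight", "parent_genes", "scores")
SCORE_AXES = ("novelty", "feasibility", "impact")


def missing_fields(idea: Dict[str, Any]) -> List[str]:
    """Return the expected fields that are absent from one idea record."""
    missing = [key for key in EXPECTED_KEYS if key not in idea]
    scores = idea.get("scores", {})
    missing += ["scores." + axis for axis in SCORE_AXES if axis not in scores]
    return missing


# A deliberately incomplete record: no key_insight, no impact score.
record = json.loads("""
{
  "title": "Annealed Diffusion Samplers for Large-Neighborhood Search in MILP",
  "parent_genes": ["gene from paper A", "gene from paper B"],
  "scores": {"novelty": 8, "feasibility": 6}
}
""")
print(missing_fields(record))  # ['key_insight', 'scores.impact']
```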


🚀 Quickstart

1. Install (Python 3.8 + a handful of deps):

python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
cp .env.example .env        # optional; see "Keys" below. Works without any key

2. Start the dashboard (separate terminal):

.venv/bin/python -m auto_research_idea.dashboard      # → http://localhost:8000

3. Run the pipeline by talking to Claude Code in this directory:

use the research-ideas skill on: diffusion models for combinatorial optimization

Claude Code brainstorms variants, spawns the retrieve/digest subagents, fans out parallel hybridizers, prioritizes, runs parallel critics, and updates the dashboard live. Final ranked ideas land in runs/<id>/ideas.json and on the page.

(Alternatively: type your idea into the dashboard's box to queue it, then tell Claude Code "run the queued research-ideas request".)

🔑 Keys (all optional)

The default pipeline runs entirely on your Claude Code subscription: the subagents digest in-context, so no ANTHROPIC_API_KEY is required. Keys only unlock extras / raise rate limits, and all go in .env (never .env.example):

| Variable | Needed for | Without it |
| --- | --- | --- |
| ANTHROPIC_API_KEY | the standalone digest.py CLI (API credits) | pipeline still works; CLI digest won't |
| SEMANTIC_SCHOLAR_API_KEY | higher Semantic Scholar rate limits | S2 source fails soft (other sources still work) |
| GITHUB_TOKEN | the github awesome-list source | 60 req/hr instead of 5000 |
| CONTACT_EMAIL | politeness with arXiv / OpenAlex | slightly stricter rate limits |
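A small sketch for checking which optional keys are visible to a process. The variable names come from the table above; `key_report` is a hypothetical helper, and the sketch reads the process environment only, so it assumes the contents of .env have already been exported or loaded by whatever mechanism the project uses.

```python
import os
from typing import Dict, Mapping

# Variable names come from the table above; the descriptions paraphrase it.
OPTIONAL_KEYS = {
    "ANTHROPIC_API_KEY": "standalone digest.py CLI",
    "SEMANTIC_SCHOLAR_API_KEY": "higher Semantic Scholar rate limits",
    "GITHUB_TOKEN": "higher GitHub rate limit",
    "CONTACT_EMAIL": "polite arXiv / OpenAlex requests",
}


def key_report(env: Mapping[str, str]) -> Dict[str, bool]:
    """Map each optional variable to whether it is set to a non-empty value."""
    return {name: bool(env.get(name, "").strip()) for name in OPTIONAL_KEYS}


if __name__ == "__main__":
    for name, is_set in key_report(os.environ).items():
        status = "set" if is_set else "missing (optional)"
        print("{:<26} {:<18} {}".format(name, status, OPTIONAL_KEYS[name]))
```

Every row is optional, so a report full of "missing" entries is a valid configuration.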

🧪 A run, end to end

| Stage | What happens | Artifact |
| --- | --- | --- |
| Brainstorm | seed idea → 10 distinct framings + targeted search queries | brainstorm.json |
| Retrieve | queries fan out across 8 sources → merge, dedup by title, rank (~100 papers) | papers.json |
| Digest | each paper → a structured gene: mechanism, assumption, failure mode | genes_<k>.json |
| Hybridize | N Opus agents in parallel cross-breed genes → candidates with key insight + provenance | ideas_raw_<k>.json |
| Prioritize | score (novelty/feasibility/impact), dedup near-duplicates, rank | ideas.json |
| Critique | N Opus critics adversarially review each idea + assign per-step credit/blame | reviews_<k>.json |
| Credit | aggregate the reviews → which pipeline step is the weak link | credit_summary.json |

Because each stage is a file, you can stop, inspect, edit, and resume: e.g. hand-curate genes_<k>.json before hybridizing, or re-run prioritization with different weights. The dashboard also writes a non-destructive annotations.json for your own notes/score/rank overrides; ideas.json is never overwritten.
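As an illustration of re-ranking with different weights, the sketch below sorts idea records by a weighted sum of their three scores. The weighted sum and the `rerank` helper are assumptions made for this example; how the prioritizer actually combines scores is not specified here. The function returns a new list and leaves its input untouched, in the same non-destructive spirit as annotations.json.

```python
from typing import Any, Dict, List


def rerank(ideas: List[Dict[str, Any]], weights: Dict[str, float]) -> List[Dict[str, Any]]:
    """Return ideas sorted by a weighted sum of their scores, best first.

    The weighted sum is an illustrative choice, not the prioritizer's formula.
    """
    def total(idea: Dict[str, Any]) -> float:
        return sum(weights[axis] * idea["scores"][axis] for axis in weights)

    return sorted(ideas, key=total, reverse=True)


# Two placeholder ideas with made-up scores.
ideas = [
    {"title": "A", "scores": {"novelty": 8, "feasibility": 6, "impact": 7}},
    {"title": "B", "scores": {"novelty": 5, "feasibility": 9, "impact": 7}},
]
novelty_first = rerank(ideas, {"novelty": 2.0, "feasibility": 1.0, "impact": 1.0})
feasible_first = rerank(ideas, {"novelty": 1.0, "feasibility": 2.0, "impact": 1.0})
print([idea["title"] for idea in novelty_first])   # ['A', 'B']  (29 vs 26)
print([idea["title"] for idea in feasible_first])  # ['B', 'A']  (30 vs 27)
```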


🏗️ System design (for the agent-systems crowd)

The interesting bit isn't just the output; it's the orchestration pattern:

Skill/subagent = brain; Python = hands. Open-ended creative + judgment work is delegated to LLM subagents; deterministic, parallel, mechanical work (retrieval, dedup, ranking, credit aggregation) is plain Python. Each side does what it's good at.

  • Orchestrator: Claude Code running the research-ideas skill (.claude/skills/research-ideas/). Delegates each step to a subagent and tracks progress in runs/<id>/status.json.

  • One subagent per step (.claude/agents/*.md): independently promptable and swappable, with task-based model routing: cheap mechanical steps run on Sonnet, open-ended creative/judgment steps on Opus.

    | Step | Agent (model) | Writes | Role |
    | --- | --- | --- | --- |
    | 1 | idea-brainstormer (opus) | brainstorm.json | 10 idea variants + search queries |
    | 2 | paper-retriever (sonnet) | papers.json | drives search tool → ~100 deduped, ranked papers |
    | 3 | paper-digester (sonnet) | genes_<k>.json | distills each paper into a reusable gene |
    | 4 | idea-hybridizer (opus, ×N) | ideas_raw_<k>.json | recombines genes → 50–100 candidates |
    | 5 | idea-prioritizer (opus) | ideas.json | scores, dedups, ranks |
    | 6 | idea-critic (opus, ×N) | reviews_<k>.json | adversarial review + per-step credit assignment |

    Then credit.py (pure Python, no key) pools the reviews into credit_summary.json, a feedback signal for which step to improve next.

  • Multi-source retrieval (auto_research_idea/sources/): pluggable PaperSource backends fanned out in parallel with backoff; a registry merges results by normalized title, enriching each paper from every source that found it. Eight sources ship today:

    | Source | Covers |
    | --- | --- |
    | arxiv | arXiv preprints |
    | openalex | broad cross-venue index + citation counts |
    | semantic_scholar | abstracts + citations (key optional) |
    | github | awesome-<topic> reading lists |
    | venue_pages | official accepted-paper listings (CVF, NeurIPS, …) |
    | openreview | ICLR / NeurIPS / etc. on OpenReview |
    | acl_anthology | ACL / EMNLP / NAACL (+ Findings) |
    | ecva | ECCV open-access proceedings |

    Sources fail soft: a dead backend returns [], never crashes the run.

  • Dashboard (dashboard.py, stdlib only): renders variants, papers, genes, ranked ideas, reviews, and the credit summary live. It never runs the pipeline, so it can't corrupt a run; the only thing it writes is your annotations.json.
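The retrieval bullet above describes two behaviours that are easy to sketch: merging results by normalized title, and failing soft when a backend dies. The code below is an illustration under assumptions, not the project's registry: `normalize_title`, `safe_search`, `merge_by_title`, the dict-shaped papers, and the three fake backends are all hypothetical, and the real normalization rules may differ.

```python
import re
from typing import Callable, Dict, List

Paper = Dict[str, object]


def normalize_title(title: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def safe_search(search: Callable[[str], List[Paper]], query: str) -> List[Paper]:
    """Fail soft: any backend error becomes an empty result list."""
    try:
        return search(query)
    except Exception:
        return []


def merge_by_title(batches: List[List[Paper]]) -> List[Paper]:
    """Merge papers sharing a normalized title, keeping the first value seen per field."""
    merged: Dict[str, Paper] = {}
    for batch in batches:
        for paper in batch:
            entry = merged.setdefault(normalize_title(str(paper["title"])), {})
            for field, value in paper.items():
                if field not in entry and value is not None:
                    entry[field] = value
    return list(merged.values())


# Three fake backends standing in for real sources.
def fake_arxiv(query: str) -> List[Paper]:
    return [{"title": "Diffusion for TSP", "abstract": "abstract text"}]


def fake_openalex(query: str) -> List[Paper]:
    return [{"title": "Diffusion for TSP.", "citations": 12}]


def fake_broken(query: str) -> List[Paper]:
    raise TimeoutError("backend down")


batches = [safe_search(source, "diffusion tsp") for source in (fake_arxiv, fake_openalex, fake_broken)]
papers = merge_by_title(batches)
print(papers)
# [{'title': 'Diffusion for TSP', 'abstract': 'abstract text', 'citations': 12}]
```

The two titles differ only by a trailing period, so they collapse into one record that carries the abstract from one source and the citation count from the other, while the failing backend contributes nothing.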

The whole thing coordinates through one contract: JSON files in runs/<id>/. Orchestrator, subagents, and dashboard share nothing else, which is what makes runs reproducible and every stage independently hackable.
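When a writer and a live reader share nothing but files, one common precaution is to write each JSON file atomically so the reader never sees a half-written document. The sketch below shows that pattern with the standard library. Whether runstate.py does this is not stated here, and the helper names and the status payload shape are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_json_atomic(path: Path, payload: Dict[str, Any]) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp_name, str(path))  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_name)
        raise


def read_json(path: Path, default: Dict[str, Any]) -> Dict[str, Any]:
    """Return the parsed file, or the default if it does not exist yet."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return default


# Demo in a throwaway directory; the payload shape is invented for the example.
run_dir = Path(tempfile.mkdtemp()) / "runs" / "demo"
write_json_atomic(run_dir / "status.json", {"step": "retrieve", "state": "running"})
print(read_json(run_dir / "status.json", {}))            # {'step': 'retrieve', 'state': 'running'}
print(read_json(run_dir / "ideas.json", {"ideas": []}))  # {'ideas': []}
```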


⚙️ Configuration

config.yaml tunes the tools: which of the 8 paper sources are enabled (plus the per-venue venue_pages / openreview / acl_anthology / ecva registries), retrieval limits (retrieval.max_papers ≈ 100, enrich_abstracts, parse_pdf), and the digest model/effort. Orchestration knobs (10 variants, number of parallel hybridizers → 50–100 ideas, number of critics) live in the skill and agent files.


🛠️ Running the Python tools directly (debugging)

The retrieval and credit paths need no key, so you can exercise them standalone (the digest CLI is the one tool here that needs ANTHROPIC_API_KEY):

.venv/bin/python -m auto_research_idea.search --queries "graph neural network CO" --out papers.json
.venv/bin/python -m auto_research_idea.digest --papers papers.json --out genes.json   # needs ANTHROPIC_API_KEY
.venv/bin/python -m auto_research_idea.credit --run-dir runs/<id>                      # aggregate reviews, no key
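For intuition about what "aggregate reviews" can mean, here is a minimal sketch that counts how often each pipeline step is blamed across a set of reviews. It is not credit.py: the `blamed_steps` field and the review shape are hypothetical, since the real reviews_<k>.json schema is not shown here.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def pool_blame(reviews: Iterable[Dict[str, Any]]) -> List[Tuple[str, int]]:
    """Count how often each pipeline step is blamed, most-blamed first."""
    counts = Counter()  # type: Counter
    for review in reviews:
        # "blamed_steps" is a hypothetical field name for this sketch.
        counts.update(review.get("blamed_steps", []))
    return counts.most_common()


# Placeholder reviews, invented for the example.
reviews = [
    {"idea": "A", "blamed_steps": ["retrieve", "hybridize"]},
    {"idea": "B", "blamed_steps": ["retrieve"]},
    {"idea": "C", "blamed_steps": []},
]
print(pool_blame(reviews))  # [('retrieve', 2), ('hybridize', 1)]
```

In this toy data the retrieve step collects the most blame, which is the kind of "weak link" signal the credit summary is meant to surface.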

📁 Project layout

.claude/
  skills/research-ideas/SKILL.md     # orchestrator
  agents/                            # one subagent per step
    idea-brainstormer.md  paper-retriever.md  paper-digester.md
    idea-hybridizer.md    idea-prioritizer.md  idea-critic.md
auto_research_idea/
  search.py        # tool: queries -> papers.json (search + dedup + rank)
  digest.py        # tool: papers.json -> genes.json (parallel analysis)
  credit.py        # tool: reviews_*.json -> credit_summary.json (pure aggregation)
  dashboard.py     # live web dashboard + non-destructive annotations (stdlib only)
  runstate.py      # run-dir status contract (orchestrator <-> dashboard)
  llm.py models.py config.py
  sources/         # arxiv, openalex, semantic_scholar, github_awesome, venue_pages,
                   #   openreview, acl_anthology, ecva, pdf_extract, _http, registry
config.yaml  requirements.txt
runs/<id>/         # per-run artifacts (created at runtime; git-ignored)

Extend it: add a paper source by subclassing PaperSource in sources/ (must never raise; return [] on failure) and listing it under sources: in config.yaml. Re-prompt a step by editing its .claude/agents/*.md. Change the run-dir contract in runstate.py (the dashboard reads the same shapes).
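A sketch of what a new source might look like under the "never raise, return [] on failure" rule. The base class below is a stand-in written for this example: the real PaperSource interface (method names, arguments, return type) is not shown in this README and may differ, so check sources/ before copying the shape. `LocalListSource` is likewise hypothetical.

```python
from typing import Dict, List


class PaperSource:
    """Stand-in for the project's base class; the real interface may differ."""
    name = "base"

    def search(self, query: str, limit: int = 20) -> List[Dict[str, str]]:
        raise NotImplementedError


class LocalListSource(PaperSource):
    """Hypothetical source that searches a fixed in-memory list of titles."""
    name = "local_list"

    def __init__(self, titles: List[str]) -> None:
        self.titles = titles

    def search(self, query: str, limit: int = 20) -> List[Dict[str, str]]:
        try:
            words = query.lower().split()
            hits = [t for t in self.titles if all(w in t.lower() for w in words)]
            return [{"title": t, "source": self.name} for t in hits[:limit]]
        except Exception:
            # Contract from the README: never raise, return [] on failure.
            return []


source = LocalListSource(["Diffusion Models for TSP", "Graph Neural Networks for SAT"])
print(source.search("diffusion tsp"))
# [{'title': 'Diffusion Models for TSP', 'source': 'local_list'}]

broken = LocalListSource(None)  # simulated failure: iterating None raises inside search
print(broken.search("anything"))  # []
```

Wrapping the whole body of search in one try/except is blunt, but it keeps the fail-soft guarantee in the source itself rather than relying on every caller to guard it.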


⚠️ Honest limitations

  • Idea quality is bounded by retrieval. A thin or off-target paper set yields thin ideas; tune queries / max_papers / sources for unfamiliar areas.
  • Scores and reviews are LLM judgments, not ground truth. Treat novelty/feasibility/impact and the critic's verdicts as a triage signal to skim 100 ideas down to 10, not as a verdict.
  • Novelty ≠ correctness. A high-scoring idea can still be subtly known or flawed; the critique + provenance are there so you can verify, fast. This is a tool for augmenting researcher ideation, not replacing the literature review.

📝 Notes for hackers

  • Python 3.8 + anthropic 0.72.0: the code deliberately avoids newer SDK and language features (messages.parse, int | None annotations). See CLAUDE.md for the full constraints before you refactor.
  • No test framework: verify by importing the modules and running the real (keyless) tools; see CLAUDE.md → Verifying changes.
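For the annotation half of that constraint, the 3.8-compatible spelling looks like the sketch below. The `top_k` function is a hypothetical example, not project code; the point is only the typing style.

```python
from typing import Dict, List, Optional


def top_k(scores: Dict[str, int], k: Optional[int] = None) -> List[str]:
    """Return idea titles sorted by score, best first; k=None returns all.

    Optional[int], Dict and List are the 3.8-compatible spellings. Writing
    `int | None` in a signature needs Python 3.10, and `dict[str, int]`
    needs 3.9, unless annotation evaluation is deferred.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if k is None else ranked[:k]


print(top_k({"A": 21, "B": 26, "C": 19}, k=2))  # ['B', 'A']
```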

Novelty is recombination at scale. This just does the bookkeeping, and red-teams the result.
