DARWIN — an AI that researches trading alphas and measurably gets better at it

The LLM proposes, the backtester disposes. A research agent that invents quant trading signals, kills the weak, breeds the strong, and accumulates memory that compounds into sharper research over time — proven against a judge it cannot fool.

Built on the sponsor stack — every integration is load-bearing

Gemini authors the alphas · Antigravity researches new ones live in an isolated cloud box · Atlas + Voyage are the memory that drives the learning lift · MiniMax M2.5 on DigitalOcean is the swappable reasoning proposer. Full breakdown ↓

Why this is different — read this first

Most "self-improving agent" demos are graded by a soft judge: an LLM scoring itself, a benchmark that can be gamed, or vibes. Ours is graded by the market.

Every signal DARWIN invents is run through a deterministic backtest — rank correlation against forward returns, dollar-neutral, net of real trading costs, on data the agent never saw. You cannot p-hack, charm, or hallucinate your way past it. So when we show the agent getting better, that improvement is measured, not asserted.
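To make the judge concrete, here is a minimal sketch of a one-day rank IC and a dollar-neutral weighting in pure Python. It illustrates the idea under stated assumptions and is not DARWIN's actual backtester: the function names (`rank_ic`, `dollar_neutral_weights`) and the unit-gross-exposure scaling are hypothetical choices.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    """Plain Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def rank_ic(signal, fwd_returns):
    """Spearman rank correlation between a signal and forward returns."""
    return pearson(ranks(signal), ranks(fwd_returns))

def dollar_neutral_weights(signal):
    """Demean the signal, then scale so the sum of |weights| is 1.

    Longs and shorts then cancel in dollar terms (weights sum to 0).
    Assumes the signal is not constant across stocks.
    """
    m = mean(signal)
    centered = [s - m for s in signal]
    gross = sum(abs(c) for c in centered)
    return [c / gross for c in centered]

# Toy cross-section of four stocks (made-up numbers, not from the run).
signal = [0.1, 0.4, -0.2, 0.3]
fwd_returns = [0.01, 0.03, -0.02, 0.02]
ic = rank_ic(signal, fwd_returns)
weights = dollar_neutral_weights(signal)
```

Because every step is a pure function of the data, the same signal always gets the same score, which is what makes the evaluator deterministic.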

That is the whole thesis: a self-improvement loop with an evaluator that can't be fooled.

What it does, in one breath

We seed the agent with 50 known cross-sectional equity alphas. In each research "generation" (one walk-forward step through 2013–2024), it:

1. Evaluates every live alpha on a trailing window (IC, IC-IR, appraisal, turnover, cost).
2. Prunes the decayed, the cost-bleeders, and the redundant: natural selection.
3. Researches new alphas: Gemini proposes formulas conditioned on a memory of every past win and failure (MongoDB Atlas + Voyage embeddings); a Gemini Antigravity managed agent spins up an isolated cloud box to browse the literature, write code, and return brand-new, cited alphas.
4. Validates each candidate in a sandboxed backtest, de-dupes against the fleet, and admits only the keepers.
5. Trades the evolved fleet and scores it on the next unseen block.
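The prune step in the list above can be sketched as a simple rule-based filter. This is an illustrative sketch only: the `AlphaStats` fields, the threshold values, and the cull labels are hypothetical and are not taken from the DARWIN code.

```python
from dataclasses import dataclass

@dataclass
class AlphaStats:
    name: str
    ic_ir: float     # trailing-window IC-IR before costs
    net_ir: float    # trailing-window IR after trading costs
    max_corr: float  # highest |correlation| with a stronger alpha in the fleet

def prune(fleet, min_ic_ir=0.5, min_net_ir=0.0, max_corr=0.9):
    """Split the fleet into survivors and culled alphas with a reason each.

    All thresholds are placeholder values chosen for illustration.
    """
    keep, culled = [], {}
    for alpha in fleet:
        if alpha.ic_ir < min_ic_ir:
            culled[alpha.name] = "decayed"
        elif alpha.net_ir < min_net_ir:
            culled[alpha.name] = "cost-bleeder"
        elif alpha.max_corr > max_corr:
            culled[alpha.name] = "redundant"
        else:
            keep.append(alpha)
    return keep, culled

# Made-up fleet, one alpha per outcome.
fleet = [
    AlphaStats("a_strong", ic_ir=1.4, net_ir=0.3, max_corr=0.2),
    AlphaStats("a_decayed", ic_ir=0.1, net_ir=0.0, max_corr=0.1),
    AlphaStats("a_costly", ic_ir=1.1, net_ir=-0.4, max_corr=0.3),
    AlphaStats("a_clone", ic_ir=1.2, net_ir=0.2, max_corr=0.97),
]
survivors, reasons = prune(fleet)
```

Recording a reason for every cull is what lets a memory of past failures be fed back to the proposer.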

Every keep / kill / add decision is made before the test data is seen.
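One way to enforce that ordering is to generate the decision window and the test block together, so that the decision window always ends before the test block begins. The sketch below assumes a rolling window of fixed length; `walk_forward_blocks` and its parameters are hypothetical names, not DARWIN's API.

```python
def walk_forward_blocks(n_days, train_len, test_len):
    """Yield (train, test) index ranges for a rolling walk-forward.

    The train range always ends strictly before the test range starts,
    and consecutive test ranges do not overlap.
    """
    start = 0
    while start + train_len + test_len <= n_days:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len

# Toy calendar: 10 days, decide on 4 trailing days, score on the next 2.
blocks = list(walk_forward_blocks(n_days=10, train_len=4, test_len=2))
for train, test in blocks:
    # Decisions (keep / kill / add) may only read days in `train`.
    assert train[-1] < test[0]
```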

What we measured — the proof

Full run: 600 US stocks, 2013–2024, 21 walk-forward generations, 777 alphas researched.

1 · The researcher learns (the core result)

Same Gemini model, same data, three setups — only the memory differs:

| Proposer | Keeper alphas discovered | Avg proposal quality (out-of-sample IR) |
|---|---|---|
| LLM + memory | 68 | +0.29 |
| LLM, memory ablated | 31 | +0.16 |
| Random formula search | 21 | −0.34 |

Memory makes the agent discover 119% more keepers (68 vs 31) and propose better ones. Random formula search proposes negative-quality junk. Turn memory off and the agent stops improving. This is continual learning, isolated by a controlled ablation: a measurement, not a claim.
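The 119% figure follows directly from the keeper counts in the table. As a quick check of the arithmetic (the dictionary keys are just labels for this example):

```python
keepers = {"memory": 68, "ablated": 31, "random": 21}

# Relative lift of the memory-conditioned proposer over the ablated one.
lift = (keepers["memory"] - keepers["ablated"]) / keepers["ablated"]
print(f"{lift:.1%}")  # 119.4%
```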

2 · The fleet adapts

Across 21 unseen blocks, the evolving book beats the frozen seed fleet on out-of-sample appraisal (0.96 vs 0.92) and IC-IR (1.48 vs 1.37), while turning over far less.
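IC-IR is quoted here without a formula. A common definition, assumed for this sketch rather than taken from the DARWIN code, is the mean of the per-period IC divided by its sample standard deviation, optionally annualized by the square root of the number of periods per year. Whether the figures above are annualized is not stated.

```python
from statistics import mean, stdev

def ic_ir(ic_series, periods_per_year=None):
    """Information ratio of an IC series: mean / sample standard deviation.

    If periods_per_year is given (for example 252 for daily ICs), the ratio
    is annualized by sqrt(periods_per_year). Needs at least two values.
    """
    ratio = mean(ic_series) / stdev(ic_series)
    if periods_per_year is not None:
        ratio *= periods_per_year ** 0.5
    return ratio

# Made-up daily ICs, not from the run.
daily_ic = [0.02, 0.04, 0.03, 0.01, 0.05]
raw = ic_ir(daily_ic)
annualized = ic_ir(daily_ic, periods_per_year=252)
```

A high IC-IR means the signal's edge is consistent from period to period, not only large on average.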

3 · We show where we lose (honestly)

Gross signal is strong (holdout IC-IR > 2). But net of realistic 10 bps costs the book is underwater — and so is the frozen baseline. On the single sealed 2024 holdout, the elite frozen seeds even had a better year. We display this; hiding it would be the exact self-deception the system is built to prevent. The contribution is the improving research loop, not a profitable fund.
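How a strong gross signal ends up underwater comes down to turnover: each unit of turnover pays the per-trade cost. The sketch below uses a simple linear cost model, which is an assumption, and made-up inputs that are not results from the run; `cost_sweep` and `breakeven_bps` are hypothetical helper names.

```python
def cost_sweep(gross_return, turnover, bps_grid=(1, 5, 10, 20, 50)):
    """Net return per period at each cost level, under linear costs.

    net = gross - turnover * cost, with cost expressed in basis points.
    """
    return {bps: gross_return - turnover * bps / 1e4 for bps in bps_grid}

def breakeven_bps(gross_return, turnover):
    """Cost level (in bps) at which the net return is exactly zero."""
    return gross_return / turnover * 1e4

# Illustrative only: 4 bps of gross return per day, 50% daily turnover.
sweep = cost_sweep(gross_return=0.0004, turnover=0.5)
limit = breakeven_bps(gross_return=0.0004, turnover=0.5)
```

In this toy case the book breaks even at about 8 bps, so it is profitable at 5 bps and loses money at 10 bps, which is the same shape of result the paragraph above reports.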

Why you can trust the numbers

A deterministic judge only matters if you can't cheat it. We engineered against the three ways quant results lie:

- Lookahead → prospective walk-forward; every decision made before the test block; a sealed 2024 holdout scored exactly once.
- P-hacking → an immutable ledger of all 777 trials; random-search and memory-ablated controls; a safe DSL sandbox (whitelisted AST: LLM-written formulas run without trust; injection attempts blocked in tests).
- Ignoring costs → net-of-cost gate at 10 bps, a full cost sweep (1→50 bps), a 1-day execution-delay test, and a liquid top-600 universe.
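The whitelisted-AST idea mentioned above can be sketched with Python's standard `ast` module: parse the formula as an expression, then reject any node type or name that is not explicitly allowed. This is a minimal sketch, not DARWIN's DSL; the allowed names (`close`, `volume`, `rank`, `delta`, `ts_mean`) are hypothetical placeholders.

```python
import ast

# Only these syntax node types may appear in a formula.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Call, ast.Name, ast.Load,
    ast.Constant, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
)
# Hypothetical data fields and operators of the formula language.
ALLOWED_NAMES = {"close", "volume", "rank", "delta", "ts_mean"}

def validate_formula(src):
    """Return the parsed tree if src uses only whitelisted syntax and names.

    Raises ValueError for anything outside the whitelist. Text that is not
    a single expression (such as an import statement) raises SyntaxError
    from ast.parse before validation starts.
    """
    tree = ast.parse(src, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            raise ValueError(f"unknown name: {node.id}")
        if isinstance(node, ast.Call) and not isinstance(node.func, ast.Name):
            raise ValueError("only direct calls to whitelisted functions")
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants")
    return tree

validate_formula("rank(delta(close, 5)) / ts_mean(volume, 20)")  # accepted

def is_rejected(src):
    """True if the formula fails validation."""
    try:
        validate_formula(src)
    except ValueError:
        return True
    return False
```

Attribute access, subscripts, lambdas, and string constants are all absent from the whitelist, so typical injection payloads such as `__import__('os').system('ls')` fail validation before anything is evaluated.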

Built on the sponsor stack

Every sponsor product below is load-bearing — pull it out and the result changes.

| Sponsor product | What it powers in DARWIN | Read more |
|---|---|---|
| Google Gemini 2.5 Flash | the memory-conditioned alpha proposer | docs ↗ |
| Gemini Antigravity managed agent (`antigravity-preview-05-2026`) | spins up an isolated Google-hosted box to browse the web + run code and return cited alphas (real DOIs, step traces, env IDs), live on stage | docs ↗ |
| MongoDB Atlas Vector Search | the agent's memory + idea-space "nearest alphas"; the substrate that drives the +119% learning lift | docs ↗ |
| Voyage AI (`voyage-3.5`, 1024-d) | embeddings for memory, de-duplication, and similarity search | docs ↗ |
| DigitalOcean Gradient Inference (`inference.do-ai.run`) | serves the swappable reasoning proposer through one OpenAI-compatible key | docs ↗ |
| MiniMax M2.5 | the reasoning ("interleaved thinking") proposer model, served on DigitalOcean | docs ↗ |

Also uses Firecrawl to gather the grounded literature corpus. Data: ~1,200 US stocks, daily OHLCV, 2010–2024 (run on a liquid top-600 universe).
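The memory and de-duplication roles in the table above can be sketched as two small helpers: one builds an Atlas `$vectorSearch` aggregation pipeline for "nearest alphas", the other flags a candidate whose embedding is too close to an existing one. The index name, field names, candidate multiplier, and similarity threshold are all hypothetical; DARWIN's actual schema is not shown in this README.

```python
import math

def nearest_alphas_pipeline(query_vector, k=5,
                            index="alpha_memory_index", path="embedding"):
    """Build an aggregation pipeline for an Atlas Vector Search query.

    `index` and `path` are placeholder names for this sketch.
    """
    return [
        {"$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": 20 * k,  # oversample, then keep the top k
            "limit": k,
        }},
        {"$project": {"_id": 0, "formula": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(candidate_vec, fleet_vecs, threshold=0.95):
    """True if the candidate is nearly parallel to any fleet embedding."""
    return any(cosine(candidate_vec, v) >= threshold for v in fleet_vecs)

# Toy 2-d vectors; real embeddings would be 1024-d.
pipeline = nearest_alphas_pipeline([0.1, 0.2], k=5)
```

With pymongo, a pipeline like this would be passed to `collection.aggregate(pipeline)` on a collection that has a vector search index defined on the embedding field.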

Run it / what to look at

./run_demo.sh     # FastAPI :8090 + Vite :5173  →  open http://localhost:5173

The dashboard replays a committed run fully offline. With the backend up you also get per-alpha drill-downs, Atlas vector search ("nearest alphas in idea-space"), a live ⚡ Propose button (Gemini authors + backtests a new alpha on the spot), and 🛰 Research live (the Antigravity agent researches a brand-new alpha in real time).

In the demo, look for:

- Section 03, "the researcher is learning": a chart of the three-tier comparison above. This is the thesis.
- Section 01, the Living Fleet: watch alphas get born, thrive, decay, and get culled across 11 years.
- Section 05, the honesty panel: the sealed holdout, cost sweep, and controls.

Honest caveats

The universe is drawn from current index membership, so it carries survivorship bias. The seed formulas were published around 2015, so data before that date is in-sample for them. The 2024 holdout is a single year. Net-of-cost profitability is the genuine frontier and we do not claim it. This is a research demo, not investment advice. We never claim to beat the market; we claim an AI that learns to search for alpha, and we prove it.

