AgentStockBenchmark: The Clean-Room Engine (GitHub: AgentStockBenchmark)

THE ULTIMATE STRESS TEST FOR AGI

This is a live, tamper-proof arena that tests whether the world's smartest AI agents can solve the ultimate stock prediction problem. We are not testing raw models in a sterile academic sandbox. We are testing the full autonomous loop (tools like Claude Code, Codex, and Gemini CLI) given clean data, a strict objective, and zero internet access. Every day, each agent is judged on one specific question: which stock in the S&P 500 will perform best tomorrow?

Most AI coding benchmarks are broken by data contamination. You never know if an AI "solved" a challenge or just memorized a GitHub repo. But no lab (not OpenAI, not Anthropic, not Google) can know during training which stock in the S&P 500 will perform best tomorrow. The future is the only uncontaminated test set.

If you find this project interesting, please consider giving it a ⭐ star and forking the repository to test your own ideas!

JUST LOOKING FOR THE LEADERBOARD?

If you are here to see which AI makes the most money in this arena, check out our companion repository: 👉 AgentStockBenchmarkResults

The Results repository hosts the live leaderboard, the beautiful cumulative PnL charts, and the daily performance digests.

THE "CLEAN ROOM" ARCHITECTURE

To ensure 100% integrity, this engine enforces a strict two-repository boundary:

This Repo (AgentStockBenchmark): The "Clean Room." It hosts the frozen agent logic, the prompts, and the orchestration engine. Once an agent generates a strategy, that strategy is merged here and receives a permanent server-side timestamp.
Results Repo (AgentStockBenchmarkResults): The "Arena." It hosts the realized market data and the public leaderboard.

The Time Invariant: An agent is only allowed to see a data snapshot truncated exactly at $t-1$ (yesterday). Its prediction for $t$ (today) must be frozen before market data for $t$ even exists.
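
The invariant can be illustrated with a minimal Python sketch. This is not the engine's code: the helper names `truncate_snapshot` and `frozen_before_data`, the row layout, and the timestamps are all hypothetical, chosen only to show the two checks the rule implies (no data from day $t$ in the snapshot, and a freeze time earlier than the moment data for $t$ exists).

```python
from datetime import date, datetime, timezone


def truncate_snapshot(rows, t):
    """Keep only rows dated strictly before t, i.e. up to and including t-1."""
    return [row for row in rows if row["date"] < t]


def frozen_before_data(frozen_at, data_available_at):
    """A prediction for t counts only if it was frozen before data for t existed."""
    return frozen_at < data_available_at


# Hypothetical price rows; the last one belongs to day t and must stay hidden.
rows = [
    {"date": date(2026, 5, 14), "ticker": "AAA", "close": 10.0},
    {"date": date(2026, 5, 15), "ticker": "AAA", "close": 10.4},
    {"date": date(2026, 5, 18), "ticker": "AAA", "close": 10.1},
]
t = date(2026, 5, 18)
snapshot = truncate_snapshot(rows, t)

# Assumed timestamps: the freeze happens hours before day-t data is published.
frozen_at = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
data_available_at = datetime(2026, 5, 18, 20, 0, tzinfo=timezone.utc)

print(len(snapshot), frozen_before_data(frozen_at, data_available_at))
```

In the real engine the freeze time comes from the server-side merge timestamp rather than a value the agent supplies, which is what makes the check meaningful.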

FOR DEVELOPERS & RESEARCHERS

This repository is an open-source engineering laboratory. We invite tech-heavy users to fork this engine and experiment with the "Autonomous Loop."

1. Fork & Extend the Ideas

The true alpha in this benchmark isn't just the model—it's the ideas. We encourage you to:

Implement New Portfolio Math: Don't like our Linear Neutral ladder? Fork the engine and implement your own risk-parity or Kelly-criterion sizing logic in stage3.
Agentic Scaffolding: Modify the research workflow in agentstockbenchmark.research to test how different "chain-of-thought" or "self-reflection" loops affect strategy quality.
Custom Universes: The engine is built for the S&P 500, but the data-ingestion pipeline is flexible. Extend it to crypto, forex, or international equities.
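
As a starting point for the portfolio-math idea above, here is a hedged Python sketch of two sizing rules. The README does not define the "Linear Neutral ladder", so `linear_neutral_weights` is only one plausible reading (rank-linear weights that sum to zero with gross exposure of 1), and `kelly_fraction` uses the common continuous-time approximation $f^* = \mu / \sigma^2$. Both function names, the ticker symbols, and the scores are hypothetical and are not taken from stage3.

```python
def linear_neutral_weights(scores):
    """Rank-linear, dollar-neutral weights: they sum to 0 and |weights| sum to 1.

    Assumption: this is one interpretation of a "linear neutral ladder",
    not the engine's actual definition.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two names to build a neutral book")
    # Order tickers from lowest to highest score.
    ordered = sorted(scores, key=scores.get)
    # Centered ranks run from -(n-1)/2 to +(n-1)/2 and sum to zero.
    centered = [i - (n - 1) / 2 for i in range(n)]
    gross = sum(abs(c) for c in centered)
    return {ticker: c / gross for ticker, c in zip(ordered, centered)}


def kelly_fraction(mean_return, variance):
    """Continuous-time Kelly approximation: f* = mu / sigma^2."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return mean_return / variance


# Hypothetical model scores for four made-up tickers.
scores = {"AAA": 0.30, "BBB": -0.10, "CCC": 0.05, "DDD": 0.20}
weights = linear_neutral_weights(scores)
print(weights)

# Hypothetical daily mean return and variance for a single strategy.
print(kelly_fraction(0.001, 0.0004))
```

Full Kelly sizing is aggressive in practice, so a fork would likely scale the fraction down (for example, half Kelly) before comparing it with the existing ladder.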

2. Prompt Engineering is Alpha

The biggest variable in performance is the scaffolding provided to the agent.

Check STRATEGY_EDITORIAL.md to see how different model lineages (OpenAI, Anthropic, Google) responded to Prompt Version 20260517.
Experiment with the prompts in prompts/. Can you force a model to better understand overfitting? Can you scaffold it to build more robust volatility-normalization?
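
For readers new to the term, volatility normalization means scaling a raw return by how much the stock usually moves, so that a 1% move in a quiet name counts for more than a 1% move in a noisy one. The sketch below is a minimal illustration under stated assumptions: the function name `vol_normalized_signal`, the 20-day window, and the return series are hypothetical and do not reflect what any model in the benchmark actually produced.

```python
from statistics import stdev


def vol_normalized_signal(returns, window=20):
    """Latest return divided by the sample std of the previous `window` returns."""
    if len(returns) < window + 1:
        raise ValueError("need at least window + 1 observations")
    # Trailing history excludes the latest return to avoid look-ahead.
    history = returns[-(window + 1):-1]
    vol = stdev(history)
    if vol == 0:
        return 0.0
    return returns[-1] / vol


# Two hypothetical stocks with the same latest return but different volatility.
quiet = [0.001, -0.001] * 10 + [0.01]
noisy = [0.02, -0.02] * 10 + [0.01]

quiet_signal = vol_normalized_signal(quiet)
noisy_signal = vol_normalized_signal(noisy)
print(quiet_signal > noisy_signal)
```

The same 1% move produces a much larger signal for the quiet series, which is the behavior a prompt would be trying to elicit from the agent.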

ENGINE DOCUMENTATION

SYSTEM.md: Deep dive into the architecture, data contracts, and the $t-1 \to t \to t+1$ failure model.
USAGE.md: Full CLI cookbook for production, backfilling, and model migration.
STRATEGY_EDITORIAL.md: A detailed quantitative analysis of the strategies produced by each model under Prompt Version 20260517.

QUICK START

# Clone the engine
git clone git@github.com:xsunsim/AgentStockBenchmark.git
cd AgentStockBenchmark
export PYTHONPATH=src

# List active prompts and strategies
python -m agentstockbenchmark stage1 list-prompts
python -m agentstockbenchmark stage1 list-strategies --prompt-id 20260517

WHAT WE ARE NOT

We are not a hedge fund. We are not a stock recommendation service. Use this project at your own risk.

We care if Codex beats Claude Code—not if AAPL beats NVDA tomorrow.

