This is a live, tamper-proof arena testing whether the world's smartest AI agents can actually solve the ultimate stock prediction problem. We are not testing raw models in a sterile academic sandbox. We are testing the full autonomous loop—tools like Claude Code, Codex, and Gemini CLI—given clean data, a strict objective, and zero internet access. Every day, they are judged on one highly specific question: which stock in the S&P 500 will have the best performance tomorrow?
Most AI coding benchmarks are broken by data contamination. You never know if an AI "solved" a challenge or just memorized a GitHub repo. But nobody—not OpenAI, not Anthropic, not Google—has a chance to know which stock in the S&P 500 will have the best performance tomorrow during its training process. The future is the only uncontaminated test set.
If you find this project interesting, please consider giving it a ⭐ Star and Forking the repository to test your own ideas!
If you are here to see which AI makes the most money in this arena, check out our companion repository: 👉 AgentStockBenchmarkResults
The Results repository hosts the live leaderboard, the beautiful cumulative PnL charts, and the daily performance digests.
To ensure 100% integrity, this engine enforces a strict two-repository boundary:
- This Repo (
AgentStockBenchmark): The "Clean Room." It hosts the frozen agent logic, the prompts, and the orchestration engine. Once an agent generates a strategy, it is merged here and receives a permanent server-side timestamp. - Results Repo (
AgentStockBenchmarkResults): The "Arena." It hosts the realized market data and the public leaderboard.
The Time Invariant: An agent is only allowed to see a data snapshot truncated exactly at
This repository is an open-source engineering laboratory. We invite tech-heavy users to fork this engine and experiment with the "Autonomous Loop."
The true alpha in this benchmark isn't just the model—it's the ideas. We encourage you to:
- Implement New Portfolio Math: Don't like our Linear Neutral ladder? Fork the engine and implement your own risk-parity or Kelly-criterion sizing logic in
stage3. - Agentic Scaffolding: Modify the research workflow in
agentstockbenchmark.researchto test how different "chain-of-thought" or "self-reflection" loops affect strategy quality. - Custom Universes: The engine is built for the S&P 500, but the data-ingestion pipeline is flexible. Extend it to crypto, forex, or international equities.
The biggest variable in performance is the scaffolding provided to the agent.
- Check STRATEGY_EDITORIAL.md to see how different model lineages (OpenAI, Anthropic, Google) responded to Prompt Version 20260517.
- Experiment with the prompts in
prompts/. Can you force a model to better understand overfitting? Can you scaffold it to build more robust volatility-normalization?
-
SYSTEM.md: Deep dive into the architecture, data contracts, and the
$t-1 \to t \to t+1$ failure model. - USAGE.md: Full CLI cookbook for production, backfilling, and model migration.
- STRATEGY_EDITORIAL.md: A detailed quantitative analysis of the strategies produced by each model under Prompt Version 20260517.
# Clone the engine
git clone git@github.com:xsunsim/AgentStockBenchmark.git
cd AgentStockBenchmark
export PYTHONPATH=src
# List active prompts and strategies
python -m agentstockbenchmark stage1 list-prompts
python -m agentstockbenchmark stage1 list-strategies --prompt-id 20260517We are not a hedge fund. We are not a stock recommendation service. Use it at your own risk.
We care if Codex beats Claude Code—not if AAPL beats NVDA tomorrow.