Open-source benchmark and audit framework for evaluating LLM trading agents under explicit execution, risk, and replayability constraints.
TradeArena turns every trading-agent decision into a traceable trajectory:
observation -> signal -> intended allocation -> risk gate -> order
-> fill/rejection -> portfolio state -> diagnostic report
It is not another "LLM trading bot." It is a framework for asking whether an LLM trading agent can be audited, reproduced, stress-tested, and constrained before anyone trusts its headline return.
TradeArena is organized as a deterministic agent loop. The runner in
src/tradearena/core/runner.py executes the
same lifecycle at every market timestamp:
observe market snapshot
-> collect analyst signals
-> convert signals into target-weight decisions
-> clip or block decisions with the risk manager
-> convert approved targets into orders
-> simulate fills under latency, liquidity, spread, slippage, and commission
-> write risk reports, execution reports, memory events, and trajectory rows
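The lifecycle above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the code in src/tradearena/core/runner.py: run_loop, the toy analyst, strategy, and gate, and the row layout are all hypothetical stand-ins, and the order and fill stages are left out for brevity.

```python
from typing import Callable, Dict, List

Weights = Dict[str, float]


def run_loop(
    snapshots: List[Dict[str, float]],
    analyst: Callable[[Dict[str, float]], Dict[str, float]],
    strategy: Callable[[Dict[str, float]], Weights],
    risk_gate: Callable[[Weights], Weights],
) -> List[dict]:
    """Apply the same observe -> signal -> target -> risk steps at every timestamp."""
    trajectory = []
    for step, snapshot in enumerate(snapshots):
        signals = analyst(snapshot)      # collect analyst signals
        intended = strategy(signals)     # intended target weights
        approved = risk_gate(intended)   # targets after clipping or blocking
        # Keep both the intent and the approved decision in the same row.
        trajectory.append(
            {"step": step, "signals": signals, "intended": intended, "approved": approved}
        )
    return trajectory


# Hypothetical deterministic components, used only for this illustration.
def momentum_analyst(snapshot: Dict[str, float]) -> Dict[str, float]:
    return {sym: (1.0 if price > 100.0 else -1.0) for sym, price in snapshot.items()}


def half_weight_strategy(signals: Dict[str, float]) -> Weights:
    return {sym: 0.5 * score for sym, score in signals.items()}


def cap_gate(weights: Weights, cap: float = 0.3) -> Weights:
    return {sym: max(-cap, min(cap, w)) for sym, w in weights.items()}


snapshots = [{"SYN": 101.0, "ALT": 99.0}, {"SYN": 98.0, "ALT": 102.0}]
rows = run_loop(snapshots, momentum_analyst, half_weight_strategy, cap_gate)
```

Because every component here is a pure function of its input, running the loop twice on the same snapshots yields identical rows, which is the property a replayable trajectory depends on.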
The default allocation logic is intentionally simple and inspectable. In
SignalWeightedStrategy, analyst signals
are grouped by symbol, confidence-weighted, and converted into target weights:
combined_score(symbol) =
sum(signal.score * max(0.01, signal.confidence)) /
sum(max(0.01, signal.confidence))
target_weight = clip(5 * combined_score, -max_short_weight, max_long_weight)
Small scores inside the deadband become HOLD. The optional
MemoryAwareSignalWeightedStrategy applies
a risk-off scale when recent memory contains drawdowns, rejected orders, or risk
violations. Classical baselines are also available, including equal
buy-and-hold and a rolling minimum-variance strategy that estimates realized
covariance from the current trajectory only.
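The confidence-weighted formula above can be sketched in a few lines. This is an illustration, not the SignalWeightedStrategy source: the Signal dataclass, the default limits, the deadband value, the choice to apply the deadband to the combined score, and the use of None for HOLD are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Signal:  # hypothetical container for one analyst signal
    symbol: str
    score: float
    confidence: float


def combined_score(signals: List[Signal]) -> float:
    """Confidence-weighted mean score; confidence is floored at 0.01."""
    weights = [max(0.01, s.confidence) for s in signals]
    return sum(s.score * w for s, w in zip(signals, weights)) / sum(weights)


def target_weights(
    signals: List[Signal],
    max_long_weight: float = 0.3,   # assumed default
    max_short_weight: float = 0.2,  # assumed default
    deadband: float = 0.02,         # assumed default
) -> Dict[str, Optional[float]]:
    """Group signals by symbol and map each group to a clipped target weight."""
    by_symbol: Dict[str, List[Signal]] = {}
    for s in signals:
        by_symbol.setdefault(s.symbol, []).append(s)

    targets: Dict[str, Optional[float]] = {}
    for symbol, group in by_symbol.items():
        score = combined_score(group)
        if abs(score) < deadband:
            targets[symbol] = None  # HOLD: score is inside the deadband
            continue
        raw = 5 * score
        targets[symbol] = min(max(raw, -max_short_weight), max_long_weight)
    return targets


example = target_weights(
    [
        Signal("SYN", 0.04, 0.9),
        Signal("SYN", 0.02, 0.3),
        Signal("ALT", 0.01, 0.5),
    ]
)
```

For SYN the combined score is (0.04 * 0.9 + 0.02 * 0.3) / (0.9 + 0.3) = 0.035, so the target weight is about 0.175. ALT scores 0.01, which falls inside the assumed deadband and becomes HOLD. The 0.01 confidence floor keeps the denominator positive even when every analyst reports zero confidence.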
Execution is split into two stages. First,
TargetWeightExecutionAgent translates
approved target weights into market orders by comparing current position value
with target portfolio value. Trades below min_trade_value are skipped to avoid
noise. Second,
RealisticOrderSimulator applies a
configurable paper-execution stress model:
- submitted orders enter a pending queue and become eligible after latency_steps;
- per-symbol fill capacity is capped by bar.volume * participation_rate;
- buys cannot exceed available cash, and sells cannot exceed holdings unless shorting is enabled;
- market orders cross half the configured bid-ask spread;
- execution price includes base slippage, spread, market impact, and intrabar volatility:
slip_rate =
spread_bps / 20000
+ base_slippage_bps / 10000
+ market_impact * (filled_quantity / volume)
+ 0.1 * ((high - low) / close)
The simulator records requested quantity, filled quantity, fill ratio, latency,
liquidity available, commission, slippage cost, partial fills, pending orders,
and rejections in an ExecutionReport. Its default settings are transparent
stress-test assumptions, not a claim of broker-grade transaction-cost
calibration.
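A worked example of the fill cap and the slip_rate equation may help. The sketch below follows the formulas as written above. The function names, the parameter values, and the choice of the bar close as the reference price are assumptions for illustration, not the RealisticOrderSimulator implementation.

```python
def fill_quantity(requested: float, bar_volume: float, participation_rate: float) -> float:
    """Cap the fill at bar.volume * participation_rate."""
    return min(requested, bar_volume * participation_rate)


def slip_rate(
    spread_bps: float,
    base_slippage_bps: float,
    market_impact: float,
    filled_quantity: float,
    volume: float,
    high: float,
    low: float,
    close: float,
) -> float:
    """Half-spread + base slippage + participation impact + intrabar range term."""
    return (
        spread_bps / 20000
        + base_slippage_bps / 10000
        + market_impact * (filled_quantity / volume)
        + 0.1 * ((high - low) / close)
    )


def execution_price(close: float, rate: float, side: str) -> float:
    # Assumption: the bar close is the reference price; buys pay up, sells receive less.
    sign = 1.0 if side == "buy" else -1.0
    return close * (1.0 + sign * rate)


# Illustrative inputs only.
filled = fill_quantity(requested=1500, bar_volume=10_000, participation_rate=0.1)
rate = slip_rate(
    spread_bps=10, base_slippage_bps=5, market_impact=0.1,
    filled_quantity=filled, volume=10_000, high=101.0, low=99.0, close=100.0,
)
buy_price = execution_price(close=100.0, rate=rate, side="buy")
fill_ratio = filled / 1500
```

With these inputs the order is partially filled (1000 of 1500 shares), and the four terms contribute 0.0005, 0.0005, 0.01, and 0.002, for a slip rate of about 0.013, or 130 bps. The impact term dominates, which shows how sensitive the result is to the market_impact coefficient and why that parameter needs calibration before any live-market claim.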
Execution realism is only meaningful when its assumptions are visible. TradeArena therefore separates the simulator equation from parameter calibration:
| Parameter | Default role | Calibration source needed |
|---|---|---|
| commission_bps | explicit fee on traded notional | broker or exchange fee schedule |
| spread_bps | full quoted spread; market orders cross half | quote/NBBO or order-book snapshots |
| base_slippage_bps | residual shortfall before spread, impact, and bar volatility | historical order/fill logs |
| participation_rate | cap on fillable bar volume | execution policy or parent-order participation target |
| latency_steps | bar-delay before an order is eligible | submission, acknowledgement, and fill timestamps |
| market_impact | coefficient on participation | regression of implementation shortfall on participation |
The tracked Yahoo Finance OHLCV files can estimate bar range, tail range, dollar volume, and participation-cap diagnostics. They cannot identify quoted spread, queue depth, fee tier, latency, or realized shortfall. For that reason, current public benchmark results should be read as execution-stress comparisons under shared assumptions. Live-market execution claims require replacing the defaults with quote/fill-calibrated parameters.
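The bar-only diagnostics mentioned above can be sketched from OHLCV rows alone. This is a hypothetical illustration, not the logic of scripts/calibrate_execution_model.py: the function name, the dictionary keys, and the nearest-rank 95th percentile used for "tail range" are assumptions.

```python
import math
from statistics import mean, median
from typing import Dict, List


def bar_diagnostics(
    bars: List[Dict[str, float]],
    participation_rate: float,
    order_notional: float,
) -> Dict[str, float]:
    """Summaries that OHLCV bars can support; none of them identify quoted spread."""
    ranges = sorted((b["high"] - b["low"]) / b["close"] for b in bars)
    dollar_volumes = [b["close"] * b["volume"] for b in bars]
    # Nearest-rank 95th percentile of the relative bar range.
    tail_index = math.ceil(0.95 * len(ranges)) - 1
    # Bars where the participation cap would bind for an order of this size.
    capped = sum(dv * participation_rate < order_notional for dv in dollar_volumes)
    return {
        "mean_range": mean(ranges),
        "tail_range": ranges[tail_index],
        "median_dollar_volume": median(dollar_volumes),
        "capped_bar_share": capped / len(bars),
    }


bars = [
    {"high": 101.0, "low": 99.0, "close": 100.0, "volume": 10_000},
    {"high": 100.5, "low": 99.5, "close": 100.0, "volume": 2_000},
    {"high": 102.0, "low": 98.0, "close": 100.0, "volume": 20_000},
    {"high": 100.5, "low": 99.5, "close": 100.0, "volume": 4_000},
]
report = bar_diagnostics(bars, participation_rate=0.1, order_notional=50_000)
```

In this toy sample a 50,000 notional order would hit the participation cap in half of the bars. That is a statement about the shared stress assumptions, not about realized fills, which is the distinction the paragraph above draws.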
Run the diagnostic:
python scripts/calibrate_execution_model.py --data-dir data/real/yahoo_intraday_1h_50
This writes docs/results/execution_calibration_intraday_1h.json and
docs/results/execution_calibration_intraday_1h.md. Full details are in
docs/execution_model.md.
Risk control is an auditable gate, not a hidden post-processing step.
MaxPositionRiskManager runs three checks:
- pre-trade approval clips per-symbol weights to max_abs_weight, blocks decisions below min_confidence, rescales gross exposure above max_gross_exposure, and reports projected turnover above max_single_step_turnover;
- in-trade monitoring checks realized participation, latency, and slippage against max_order_participation, max_latency_steps, and max_slippage_bps;
- post-trade attribution reports realized PnL, commission, slippage cost, and final exposures.
Every intervention is serialized as a RiskReport with RiskCheck and
RiskViolation records. The trajectory therefore preserves both the model's
original intent and the executable decision after risk feedback, which is the
core substrate for risk-feedback, representation-drift, and hallucination-audit
experiments.
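The pre-trade stage described above can be sketched as a pure function that returns both the intended and the approved weights. This is an illustration under assumptions, not the MaxPositionRiskManager code: GateResult, the default thresholds, the plain-string violation messages, and the choice to hold the current weight when a decision is blocked are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GateResult:  # hypothetical stand-in for a RiskReport
    intended: Dict[str, float]
    approved: Dict[str, float]
    violations: List[str] = field(default_factory=list)


def pre_trade_gate(
    targets: Dict[str, float],
    confidences: Dict[str, float],
    current: Dict[str, float],
    max_abs_weight: float = 0.6,
    min_confidence: float = 0.2,
    max_gross_exposure: float = 1.0,
    max_single_step_turnover: float = 0.5,
) -> GateResult:
    approved: Dict[str, float] = {}
    violations: List[str] = []
    for symbol, weight in targets.items():
        if confidences.get(symbol, 0.0) < min_confidence:
            # Assumption: a blocked decision keeps the current weight.
            approved[symbol] = current.get(symbol, 0.0)
            violations.append(f"{symbol}: blocked, confidence below min_confidence")
            continue
        clipped = max(-max_abs_weight, min(max_abs_weight, weight))
        if clipped != weight:
            violations.append(f"{symbol}: clipped to max_abs_weight")
        approved[symbol] = clipped

    gross = sum(abs(w) for w in approved.values())
    if gross > max_gross_exposure:
        scale = max_gross_exposure / gross
        approved = {s: w * scale for s, w in approved.items()}
        violations.append("gross exposure rescaled to max_gross_exposure")

    symbols = set(approved) | set(current)
    turnover = sum(abs(approved.get(s, 0.0) - current.get(s, 0.0)) for s in symbols)
    if turnover > max_single_step_turnover:
        # Reported only; this sketch does not shrink the trade further.
        violations.append("projected turnover above max_single_step_turnover")
    return GateResult(intended=dict(targets), approved=approved, violations=violations)


result = pre_trade_gate(
    targets={"SYN": 0.8, "ALT": 0.5, "LOW": 0.2},
    confidences={"SYN": 0.9, "ALT": 0.8, "LOW": 0.1},
    current={},
)
```

Returning intended alongside approved mirrors the point made above: the audit trail needs the model's original request as well as what the gate allowed through.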
python -m pip install tradearena-benchmark
tradearena --benchmark tradearena-core
This default command intentionally does not call an LLM. It is a no-key smoke test for the runner, trajectory schema, risk gate, execution simulator, and metric stack. It uses deterministic analysts so every new checkout can pass CI-style validation before provider keys, model routing, or billing enter the loop.
The PyPI distribution is tradearena-benchmark because tradearena is already
occupied on PyPI by an unrelated project. The import namespace and CLI remain
tradearena.
To run the full local showcase:
git clone https://github.com/weich97/TradeArena.git
cd TradeArena
python -m pip install -e ".[dev]"
python scripts/run_showcase.py
Then open:
outputs/examples/index.html
The first-run path uses deterministic agents, tracked snapshots, and local demo artifacts. It does not call DeepSeek, Poe, OpenAI, Hugging Face, AkShare, Yahoo Finance, or broker APIs unless you opt into the model or data commands below.
TradeArena supports LLM trading-agent experiments, but the repository keeps live provider calls out of the default path. Use the path that matches what you want to verify:
| Path | Calls an LLM? | Purpose |
|---|---|---|
| tradearena --benchmark tradearena-core | No | Deterministic smoke test for core mechanics |
| python examples/llm_cache_replay_demo.py | No | Redacted manifest of prior LLM experiment coverage; no raw prompts or responses |
| tradearena --benchmark llm-smoke ... | Yes, unless a matching cache row exists | Minimal live/cache-backed LLM analyst run |
| tradearena --paper-output ... | Optional | Larger paper-grade suite with cache-first LLM sections |
Minimal live LLM smoke test through Poe (the commands below use PowerShell syntax):
$env:POE_API_KEY="..."
tradearena --benchmark llm-smoke `
--analysts poe-llm `
--llm-model gpt-5.5 `
--periods 3 `
--symbols SYN,ALT `
--llm-cache outputs/examples/poe_llm_smoke_cache.jsonl
Minimal live LLM smoke test through DeepSeek:
$env:DEEPSEEK_API_KEY="..."
tradearena --benchmark llm-smoke `
--analysts deepseek-llm `
--llm-model deepseek-v4-flash `
--periods 3 `
--symbols SYN,ALT `
--llm-cache outputs/examples/deepseek_llm_smoke_cache.jsonl
These commands run one LLM analyst case and write cache entries locally. The cache is deliberately ignored by Git because raw prompts and responses can carry provider, licensing, privacy, or portfolio constraints.
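The cache-first behavior ("calls an LLM unless a matching cache row exists") can be illustrated with a small JSONL lookup. This is a sketch under assumptions: the row fields, the key derivation, and the function names are hypothetical and do not describe TradeArena's actual cache schema.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Tuple


def cache_key(model: str, prompt: str) -> str:
    """Derive a stable key from the model name and the prompt text."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


def load_cache(path: Path) -> Dict[str, str]:
    """Read key -> response pairs from a JSONL file, if it exists."""
    rows: Dict[str, str] = {}
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    row = json.loads(line)
                    rows[row["key"]] = row["response"]
    return rows


def cached_call(
    path: Path,
    model: str,
    prompt: str,
    call_provider: Callable[[str, str], str],
) -> Tuple[str, bool]:
    """Return (response, was_cache_hit); call the provider only on a miss."""
    key = cache_key(model, prompt)
    cache = load_cache(path)
    if key in cache:
        return cache[key], True
    response = call_provider(model, prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"key": key, "model": model, "response": response}) + "\n")
    return response, False
```

On a second run with the same model and prompt, the provider function is never invoked, so a replay needs no API key and produces the same analyst output.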
DeepSeek, Poe-hosted models, OpenAI-compatible chat endpoints, AkShare, Yahoo Finance, and broker-facing workflows are opt-in advanced paths. They are not part of the first-run command, and they must stay inside an explicit audit boundary:
| Surface | Default boundary | Public artifact policy |
|---|---|---|
| LLM providers | Environment-variable keys, cache-first replay, signals only | Track metrics and redacted manifests, not raw prompt/response caches |
| Yahoo Finance / AkShare | Download to normalized OHLCV CSV with source metadata | Record source, frequency, symbols, timestamp policy, and adjustment mode |
| Execution model | Stress assumptions unless calibrated with quote/fill logs | State parameter sources; do not call bar-only diagnostics broker-grade |
| Broker adapters | Paper export or human-review sandbox only | No live submission in default examples; no credentials in artifacts |
Use per-session environment variables or an OS secret manager. Do not commit
.env files, provider JSONL caches, broker tokens, account statements, or
private holdings. If a run needs to be shared, publish a redacted submission or
cache manifest instead of raw provider text.
The full checklist is in
docs/advanced_integrations_security.md.
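One way to share a run without raw provider text is to publish digests and sizes instead of the text itself. The sketch below is hypothetical: the input row fields and the manifest layout are assumptions, not the redacted submission schema used by TradeArena.

```python
import hashlib
import json
from typing import Dict, List


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redact_row(row: Dict[str, str]) -> Dict[str, object]:
    """Replace raw prompt/response text with digests and lengths."""
    return {
        "model": row["model"],
        "prompt_sha256": sha256_text(row["prompt"]),
        "response_sha256": sha256_text(row["response"]),
        "response_chars": len(row["response"]),
    }


def build_manifest(rows: List[Dict[str, str]]) -> str:
    """Serialize the redacted rows as JSON suitable for publishing."""
    return json.dumps({"rows": [redact_row(r) for r in rows]}, indent=2, sort_keys=True)


manifest = build_manifest(
    [{"model": "demo-model", "prompt": "private holdings: 100 SYN", "response": "BUY SYN"}]
)
```

The digests let a reviewer who holds the original cache confirm that it matches the published manifest, while the manifest alone reveals neither the prompt nor the response.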
No local install yet?
From a clone:
python -m pip install -e ".[dev]"
tradearena --benchmark tradearena-core
python -m tradearena.cli --benchmark tradearena-core
From GitHub without cloning first:
python -m pip install "git+https://github.com/weich97/TradeArena.git"
tradearena --benchmark tradearena-core
The v0.1 benchmark card makes one compact claim:
LLM trading-agent evaluation changes materially once intended allocations pass through auditable risk gates and explicit execution-stress constraints.
Open:
- Static page: weich97.github.io/TradeArena/benchmark-v0.1.html
- Markdown artifact: docs/results/benchmark_v0_1.md
- Community registry: docs/results/community_registry.md
Rebuild:
python scripts/build_benchmark_page.py
python scripts/build_benchmark_registry.py examples/benchmark_submissions
TradeArena supports redacted benchmark submissions. They share scenario, execution, risk, metrics, and reproducibility metadata without exposing raw provider prompts, responses, credentials, or private portfolios.
tradearena validate-submission examples/benchmark_submissions/example_redacted_submission.json
tradearena build-registry examples/benchmark_submissions --output docs/results/community_registry.md
tradearena hash-run outputs/examples/audit_walkthrough_trajectory.json
See docs/benchmark_submissions.md.
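A reproducibility hash is typically a digest over a canonical serialization of the run. The sketch below shows that general idea with sorted keys and compact separators. It is an assumption-laden illustration and makes no claim about what tradearena hash-run actually computes; canonical_hash is a hypothetical name.

```python
import hashlib
import json
from typing import Any


def canonical_hash(payload: Any) -> str:
    """SHA-256 over JSON with sorted keys and fixed separators."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"steps": [{"symbol": "SYN", "weight": 0.3}], "seed": 7}
run_b = {"seed": 7, "steps": [{"weight": 0.3, "symbol": "SYN"}]}  # same content, other key order
run_c = {"seed": 8, "steps": [{"symbol": "SYN", "weight": 0.3}]}

same = canonical_hash(run_a) == canonical_hash(run_b)
different = canonical_hash(run_a) != canonical_hash(run_c)
```

Sorting keys makes the digest independent of dictionary ordering, so two runs with identical content hash identically, while any change in a seed, weight, or fill changes the digest.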
[Demo figures: Audit lifecycle, Execution stress, Diagnostic loop]
The browser-playable launch video is here:
weich97.github.io/TradeArena/demo_video.html.
| Need | TradeArena surface |
|---|---|
| Replayable decisions | Trajectory logs with prompts, memory digests, risk reports, fills, and metrics |
| Execution stress model | Configurable fees, spread, slippage, latency, liquidity caps, partial fills, rejections, and calibration diagnostics |
| Risk-aware evaluation | Pre-trade gates, in-trade monitors, post-trade attribution, violations |
| Extensibility | Data, analyst, strategy, risk, simulator, memory, planner, evaluator plugins |
| Community benchmarks | Redacted submission schema, registry builder, reproducibility hashes |
Start with one small plugin:
python examples/custom_plugin_demo.py
python examples/extension_walkthrough_demo.py
The walkthrough swaps in a custom analyst, risk manager, and evaluator while reusing the existing runner, data provider, strategy, execution simulator, memory store, trajectory logger, and metric stack.
Useful entry points:
- Quickstart: docs/getting_started.md
- Advanced integration safety: docs/advanced_integrations_security.md
- Schemas: docs/schemas.md
- Execution model: docs/execution_model.md
- Benchmark submissions: docs/benchmark_submissions.md
- Related work: docs/related_work.md
- Retail planning sandbox: docs/retail_planning.md
- Research protocol: docs/research_protocol.md
- Security policy: SECURITY.md
- Governance: GOVERNANCE.md
Each checkout can use its own .venv, so public and private repos do not
fight over editable installs:
powershell -ExecutionPolicy Bypass -File scripts\check_local.ps1
The script installs the current checkout in editable mode, then runs compile checks, Ruff critical checks, tests, release-readiness checks, submission validation, artifact-contract validation, and JSON validation.
TradeArena does not promise profitable trading, does not provide financial
advice, and does not execute live trades by default. Public examples are
offline, paper-only, or human-review oriented. Broker and provider integrations
must follow docs/advanced_integrations_security.md,
SECURITY.md, and GOVERNANCE.md.
See CITATION.cff. If you use TradeArena in research or
software, cite the repository release you used.


