TradeArena

Open-source benchmark and audit framework for evaluating LLM trading agents under explicit execution, risk, and replayability constraints.


TradeArena turns every trading-agent decision into a traceable trajectory:

observation -> signal -> intended allocation -> risk gate -> order
  -> fill/rejection -> portfolio state -> diagnostic report

It is not another "LLM trading bot." It is a framework for asking whether an LLM trading agent can be audited, reproduced, stress-tested, and constrained before anyone trusts its headline return.

Technical Mechanics

TradeArena is organized as a deterministic agent loop. The runner in src/tradearena/core/runner.py executes the same lifecycle at every market timestamp:

observe market snapshot
  -> collect analyst signals
  -> convert signals into target-weight decisions
  -> clip or block decisions with the risk manager
  -> convert approved targets into orders
  -> simulate fills under latency, liquidity, spread, slippage, and commission
  -> write risk reports, execution reports, memory events, and trajectory rows
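The lifecycle above can be read as one loop body per timestamp. The following is a minimal sketch, not the code in src/tradearena/core/runner.py: `StepRecord`, `run_step`, and the four callables are hypothetical names chosen for illustration, and the toy components at the bottom exist only to make it runnable.

```python
from dataclasses import dataclass
from typing import Callable

Weights = dict[str, float]


@dataclass
class StepRecord:
    """One trajectory row: original intent next to what was approved and filled."""
    timestamp: int
    signals: Weights
    intended: Weights
    approved: Weights
    fills: Weights


def run_step(
    timestamp: int,
    snapshot: dict[str, float],
    analyst: Callable[[dict[str, float]], Weights],
    strategy: Callable[[Weights], Weights],
    risk_gate: Callable[[Weights], Weights],
    simulator: Callable[[Weights], Weights],
) -> StepRecord:
    signals = analyst(snapshot)      # collect analyst signals
    intended = strategy(signals)     # signals -> target weights
    approved = risk_gate(intended)   # clip or block
    fills = simulator(approved)      # approved targets -> simulated fills
    return StepRecord(timestamp, signals, intended, approved, fills)


# Toy components (assumptions, not TradeArena classes).
record = run_step(
    timestamp=0,
    snapshot={"SYN": 100.0},
    analyst=lambda snap: {symbol: 0.125 for symbol in snap},
    strategy=lambda sig: {s: 5 * v for s, v in sig.items()},
    risk_gate=lambda w: {s: max(-0.5, min(0.5, v)) for s, v in w.items()},
    simulator=lambda w: dict(w),
)
print(record.intended, record.approved)
```

Keeping `intended` and `approved` side by side in the same record is what lets a later audit see where the risk gate changed the decision.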

The default allocation logic is intentionally simple and inspectable. In SignalWeightedStrategy, analyst signals are grouped by symbol, confidence-weighted, and converted into target weights:

combined_score(symbol) =
  sum(signal.score * max(0.01, signal.confidence)) /
  sum(max(0.01, signal.confidence))

target_weight = clip(5 * combined_score, -max_short_weight, max_long_weight)

Scores that fall inside the deadband become HOLD. The optional MemoryAwareSignalWeightedStrategy applies a risk-off scaling factor when recent memory contains drawdowns, rejected orders, or risk violations. Classical baselines are also available, including equal-weight buy-and-hold and a rolling minimum-variance strategy that estimates realized covariance from the current trajectory only.
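As a rough illustration of the formulas above, the sketch below computes a confidence-weighted score and a clipped target weight for one symbol. It is not the SignalWeightedStrategy source: the `Signal` dataclass, the default limits, the `deadband` value, and the choice to apply the deadband to the combined score (rather than to the scaled weight) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signal:
    symbol: str
    score: float       # assumed to lie roughly in [-1, 1]
    confidence: float  # assumed to lie in [0, 1]


def combined_score(signals: list[Signal]) -> float:
    """Confidence-weighted mean score; confidence is floored at 0.01."""
    weights = [max(0.01, s.confidence) for s in signals]
    return sum(s.score * w for s, w in zip(signals, weights)) / sum(weights)


def target_weight(
    score: float,
    max_long_weight: float = 0.3,   # hypothetical default
    max_short_weight: float = 0.0,  # hypothetical default (long-only)
    deadband: float = 0.02,         # hypothetical default
) -> Optional[float]:
    """Return a clipped target weight, or None to signal HOLD."""
    if abs(score) < deadband:
        return None
    return max(-max_short_weight, min(max_long_weight, 5 * score))


syn_signals = [
    Signal("SYN", score=0.5, confidence=0.5),
    Signal("SYN", score=-0.1, confidence=0.0),  # floored to weight 0.01
]
score = combined_score(syn_signals)
print(score, target_weight(score))
```

The 0.01 floor keeps a zero-confidence signal from causing a division by zero while giving it almost no influence. The sketch assumes at least one signal per symbol.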

Execution is split into two stages. First, TargetWeightExecutionAgent translates approved target weights into market orders by comparing current position value with target portfolio value. Trades below min_trade_value are skipped to avoid noise. Second, RealisticOrderSimulator applies a configurable paper-execution stress model:

submitted orders enter a pending queue and become eligible after latency_steps;
per-symbol fill capacity is capped by bar.volume * participation_rate;
buys cannot exceed available cash, and sells cannot exceed holdings unless shorting is enabled;
market orders cross half the configured bid-ask spread;
execution price includes base slippage, spread, market impact, and intrabar volatility:

slip_rate =
  spread_bps / 20000
  + base_slippage_bps / 10000
  + market_impact * (filled_quantity / volume)
  + 0.1 * ((high - low) / close)
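A worked instance of the slip_rate expression may help when checking parameter choices. The function below transcribes the four terms directly; the input numbers are made up, and applying the rate as `close * (1 + slip_rate)` for a buy is an assumption about how the simulator turns the rate into a price.

```python
def slip_rate(
    spread_bps: float,
    base_slippage_bps: float,
    market_impact: float,
    filled_quantity: float,
    volume: float,
    high: float,
    low: float,
    close: float,
) -> float:
    half_spread = spread_bps / 20000          # half the quoted spread, as a fraction
    base = base_slippage_bps / 10000          # residual slippage, as a fraction
    impact = market_impact * (filled_quantity / volume)
    bar_volatility = 0.1 * ((high - low) / close)
    return half_spread + base + impact + bar_volatility


# Illustrative inputs only.
rate = slip_rate(
    spread_bps=10, base_slippage_bps=5, market_impact=0.1,
    filled_quantity=1_000, volume=100_000,
    high=101.0, low=99.0, close=100.0,
)
buy_price = 100.0 * (1 + rate)  # assumption: buys pay up, sells would use (1 - rate)
print(f"{rate * 10000:.1f} bps, buy fill near {buy_price:.2f}")
```

With these inputs the terms are 5 bps (half spread), 5 bps (base), 10 bps (impact at 1% participation), and 20 bps (bar range), so the intrabar-volatility term dominates.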

The simulator records requested quantity, filled quantity, fill ratio, latency, liquidity available, commission, slippage cost, partial fills, pending orders, and rejections in an ExecutionReport. Its default settings are transparent stress-test assumptions, not a claim of broker-grade transaction-cost calibration.
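The first execution stage and the liquidity cap can be sketched in a few lines. This is an illustration under stated assumptions, not the TargetWeightExecutionAgent or RealisticOrderSimulator code: `size_order`, `cap_fill`, and the default `min_trade_value` are hypothetical.

```python
from typing import Optional


def size_order(
    target_weight: float,
    equity: float,
    position_qty: float,
    price: float,
    min_trade_value: float = 50.0,  # hypothetical default
) -> Optional[float]:
    """Signed quantity needed to reach the target value, or None if too small."""
    delta_value = target_weight * equity - position_qty * price
    if abs(delta_value) < min_trade_value:
        return None
    return delta_value / price  # positive = buy, negative = sell


def cap_fill(order_qty: float, bar_volume: float, participation_rate: float) -> tuple[float, float]:
    """Return (filled quantity, fill ratio) under the per-bar capacity cap."""
    capacity = bar_volume * participation_rate
    filled = min(abs(order_qty), capacity)
    return filled, filled / abs(order_qty)


qty = size_order(target_weight=0.25, equity=100_000, position_qty=100, price=100.0)
filled, fill_ratio = cap_fill(qty, bar_volume=1_000, participation_rate=0.1)
print(qty, filled, round(fill_ratio, 3))
```

Here the order asks for 150 shares but the bar only allows 100, so the fill ratio is about 0.667 and the remainder would be a partial fill in the report.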

Execution Calibration Boundary

Execution realism is only meaningful when its assumptions are visible. TradeArena therefore separates the simulator equation from parameter calibration:

| Parameter | Default role | Calibration source needed |
| --- | --- | --- |
| `commission_bps` | explicit fee on traded notional | broker or exchange fee schedule |
| `spread_bps` | full quoted spread; market orders cross half | quote/NBBO or order-book snapshots |
| `base_slippage_bps` | residual shortfall before spread, impact, and bar volatility | historical order/fill logs |
| `participation_rate` | cap on fillable bar volume | execution policy or parent-order participation target |
| `latency_steps` | bar-delay before an order is eligible | submission, acknowledgement, and fill timestamps |
| `market_impact` | coefficient on participation | regression of implementation shortfall on participation |

The tracked Yahoo Finance OHLCV files can estimate bar range, tail range, dollar volume, and participation-cap diagnostics. They cannot identify quoted spread, queue depth, fee tier, latency, or realized shortfall. For that reason, current public benchmark results should be read as execution-stress comparisons under shared assumptions. Live-market execution claims require replacing the defaults with quote/fill-calibrated parameters.
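To make the boundary concrete, here is a small sketch of what bar-only data can support. It is not scripts/calibrate_execution_model.py: the function name, the dict-based bar format, and the chosen statistics are assumptions. Note that nothing in it touches spread, queue depth, fees, or latency, because OHLCV bars do not contain them.

```python
from statistics import median, quantiles


def bar_diagnostics(bars: list[dict], participation_rate: float = 0.05) -> dict:
    """Summaries recoverable from OHLCV bars alone (needs at least two bars)."""
    rel_range = [(b["high"] - b["low"]) / b["close"] for b in bars]
    dollar_volume = [b["close"] * b["volume"] for b in bars]
    return {
        "median_range": median(rel_range),
        # 95th percentile; "inclusive" avoids extrapolating past the observed max
        "tail_range_p95": quantiles(rel_range, n=20, method="inclusive")[-1],
        "median_dollar_volume": median(dollar_volume),
        "median_capacity_notional": median(dollar_volume) * participation_rate,
    }


# Made-up bars for illustration.
demo_bars = [
    {"high": 101.0, "low": 99.0, "close": 100.0, "volume": 1_000},
    {"high": 102.0, "low": 98.0, "close": 100.0, "volume": 2_000},
    {"high": 103.0, "low": 100.0, "close": 100.0, "volume": 3_000},
]
diag = bar_diagnostics(demo_bars)
print(diag)
```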

Run the diagnostic:

python scripts/calibrate_execution_model.py --data-dir data/real/yahoo_intraday_1h_50

This writes docs/results/execution_calibration_intraday_1h.json and docs/results/execution_calibration_intraday_1h.md. Full details are in docs/execution_model.md.

Risk control is an auditable gate, not a hidden post-processing step. MaxPositionRiskManager runs checks at three stages:

pre-trade approval clips per-symbol weights to max_abs_weight, blocks decisions below min_confidence, rescales gross exposure above max_gross_exposure, and reports projected turnover above max_single_step_turnover;
in-trade monitoring checks realized participation, latency, and slippage against max_order_participation, max_latency_steps, and max_slippage_bps;
post-trade attribution reports realized PnL, commission, slippage cost, and final exposures.
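The pre-trade stage can be sketched as a pure function that returns both the approved weights and a record of every intervention. This is an assumption-laden illustration, not MaxPositionRiskManager: the order of checks, the default limits, the violation labels, and the choice to drop (rather than zero) low-confidence decisions are all hypothetical.

```python
def pre_trade_gate(
    weights: dict[str, float],
    confidences: dict[str, float],
    max_abs_weight: float = 0.4,      # hypothetical default
    min_confidence: float = 0.2,      # hypothetical default
    max_gross_exposure: float = 1.0,  # hypothetical default
) -> tuple[dict[str, float], list[tuple[str, str]]]:
    approved: dict[str, float] = {}
    violations: list[tuple[str, str]] = []
    for symbol, weight in weights.items():
        if confidences.get(symbol, 0.0) < min_confidence:
            violations.append((symbol, "below_min_confidence"))
            continue  # blocked: no target is passed on for this symbol
        clipped = max(-max_abs_weight, min(max_abs_weight, weight))
        if clipped != weight:
            violations.append((symbol, "clipped_to_max_abs_weight"))
        approved[symbol] = clipped
    gross = sum(abs(w) for w in approved.values())
    if gross > max_gross_exposure:
        scale = max_gross_exposure / gross
        approved = {s: w * scale for s, w in approved.items()}
        violations.append(("*", "gross_exposure_rescaled"))
    return approved, violations


approved, violations = pre_trade_gate(
    weights={"A": 0.5, "B": 0.4, "C": 0.2, "D": 0.3},
    confidences={"A": 0.9, "B": 0.9, "C": 0.1, "D": 0.8},
)
print(approved)
print(violations)
```

Returning the violations alongside the weights, instead of silently adjusting them, mirrors the idea that every intervention should be visible in the trajectory.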

Every intervention is serialized as a RiskReport with RiskCheck and RiskViolation records. The trajectory therefore preserves both the model's original intent and the executable decision after risk feedback, which is the core substrate for risk-feedback, representation-drift, and hallucination-audit experiments.

Quick Start: Deterministic Smoke Test

python -m pip install tradearena-benchmark
tradearena --benchmark tradearena-core

This default command intentionally does not call an LLM. It is a no-key smoke test for the runner, trajectory schema, risk gate, execution simulator, and metric stack. It uses deterministic analysts so every new checkout can pass CI-style validation before provider keys, model routing, or billing enter the loop.

The PyPI distribution is tradearena-benchmark because tradearena is already occupied on PyPI by an unrelated project. The import namespace and CLI remain tradearena.

To run the full local showcase:

git clone https://github.com/weich97/TradeArena.git
cd TradeArena
python -m pip install -e ".[dev]"
python scripts/run_showcase.py

Then open:

outputs/examples/index.html

The first-run path uses deterministic agents, tracked snapshots, and local demo artifacts. It does not call DeepSeek, Poe, OpenAI, Hugging Face, AkShare, Yahoo Finance, or broker APIs unless you opt into the model or data commands below.

LLM Run Paths

TradeArena supports LLM trading-agent experiments, but the repository keeps live provider calls out of the default path. Use the path that matches what you want to verify:

| Path | Calls an LLM? | Purpose |
| --- | --- | --- |
| `tradearena --benchmark tradearena-core` | No | Deterministic smoke test for core mechanics |
| `python examples/llm_cache_replay_demo.py` | No | Redacted manifest of prior LLM experiment coverage; no raw prompts or responses |
| `tradearena --benchmark llm-smoke ...` | Yes, unless a matching cache row exists | Minimal live/cache-backed LLM analyst run |
| `tradearena --paper-output ...` | Optional | Larger paper-grade suite with cache-first LLM sections |

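Minimal live LLM smoke test through Poe (PowerShell syntax; adapt the environment-variable assignment and line continuations for other shells):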

$env:POE_API_KEY="..."
tradearena --benchmark llm-smoke `
  --analysts poe-llm `
  --llm-model gpt-5.5 `
  --periods 3 `
  --symbols SYN,ALT `
  --llm-cache outputs/examples/poe_llm_smoke_cache.jsonl

Minimal live LLM smoke test through DeepSeek (PowerShell syntax):

$env:DEEPSEEK_API_KEY="..."
tradearena --benchmark llm-smoke `
  --analysts deepseek-llm `
  --llm-model deepseek-v4-flash `
  --periods 3 `
  --symbols SYN,ALT `
  --llm-cache outputs/examples/deepseek_llm_smoke_cache.jsonl

These commands run one LLM analyst case and write cache entries locally. The cache is deliberately ignored by Git because raw prompts and responses can carry provider, licensing, privacy, or portfolio constraints.
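The cache-first behaviour ("calls an LLM unless a matching cache row exists") can be illustrated with a small JSONL lookup. This is a sketch only: the row schema, the SHA-256 key over provider, model, and prompt, and the names `cache_key` and `cached_call` are assumptions, not the format TradeArena writes to its `--llm-cache` file.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(provider: str, model: str, prompt: str) -> str:
    payload = json.dumps(
        {"provider": provider, "model": model, "prompt": prompt}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(
    cache_path: Path, provider: str, model: str, prompt: str,
    call_provider: Callable[[str], str],
) -> tuple[str, bool]:
    """Return (response, was_cache_hit); only call the provider on a miss."""
    key = cache_key(provider, model, prompt)
    if cache_path.exists():
        with cache_path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                row = json.loads(line)
                if row["key"] == key:
                    return row["response"], True
    response = call_provider(prompt)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"key": key, "response": response}) + "\n")
    return response, False


def _must_not_run(prompt: str) -> str:
    raise AssertionError("provider should not be called on a cache hit")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache.jsonl"
    first = cached_call(path, "demo", "demo-model", "hello", lambda p: p.upper())
    second = cached_call(path, "demo", "demo-model", "hello", _must_not_run)
print(first, second)
```

Because the second call is served from the file, a replay needs no provider key, which is the property the cache-backed paths rely on.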

Advanced Integrations Safety

DeepSeek, Poe-hosted models, OpenAI-compatible chat endpoints, AkShare, Yahoo Finance, and broker-facing workflows are opt-in advanced paths. They are not part of the first-run command, and they must stay inside an explicit audit boundary:

| Surface | Default boundary | Public artifact policy |
| --- | --- | --- |
| LLM providers | Environment-variable keys, cache-first replay, signals only | Track metrics and redacted manifests, not raw prompt/response caches |
| Yahoo Finance / AkShare | Download to normalized OHLCV CSV with source metadata | Record source, frequency, symbols, timestamp policy, and adjustment mode |
| Execution model | Stress assumptions unless calibrated with quote/fill logs | State parameter sources; do not call bar-only diagnostics broker-grade |
| Broker adapters | Paper export or human-review sandbox only | No live submission in default examples; no credentials in artifacts |

Use per-session environment variables or an OS secret manager. Do not commit .env files, provider JSONL caches, broker tokens, account statements, or private holdings. If a run needs to be shared, publish a redacted submission or cache manifest instead of raw provider text.
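One way to follow the environment-variable guidance in your own scripts is to fail fast when a key is missing and to redact it before it can reach a log. The helpers below are hypothetical and not part of the tradearena package; the demo key is a placeholder.

```python
import os
from typing import Mapping


def require_key(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Read a provider key from the environment or stop with a clear error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it for this session only")
    return value


def redact(secret: str, keep: int = 4) -> str:
    """Mask all but the last few characters so logs never hold the full key."""
    if len(secret) <= keep:
        return "*" * len(secret)
    return "*" * (len(secret) - keep) + secret[-keep:]


demo_env = {"POE_API_KEY": "sk-demo-1234"}  # placeholder, not a real key
key = require_key("POE_API_KEY", demo_env)
print(redact(key))
```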

The full checklist is in docs/advanced_integrations_security.md.


Install And Run

From a clone:

python -m pip install -e ".[dev]"
tradearena --benchmark tradearena-core
python -m tradearena.cli --benchmark tradearena-core

From GitHub without cloning first:

python -m pip install "git+https://github.com/weich97/TradeArena.git"
tradearena --benchmark tradearena-core

Benchmark Result

The v0.1 benchmark card makes one compact claim:

LLM trading-agent evaluation changes materially once intended allocations pass through auditable risk gates and explicit execution-stress constraints.

Open:

Static page: weich97.github.io/TradeArena/benchmark-v0.1.html
Markdown artifact: docs/results/benchmark_v0_1.md
Community registry: docs/results/community_registry.md

Rebuild:

python scripts/build_benchmark_page.py
python scripts/build_benchmark_registry.py examples/benchmark_submissions

Submit Or Validate A Benchmark Row

TradeArena supports redacted benchmark submissions. They share scenario, execution, risk, metrics, and reproducibility metadata without exposing raw provider prompts, responses, credentials, or private portfolios.

tradearena validate-submission examples/benchmark_submissions/example_redacted_submission.json
tradearena build-registry examples/benchmark_submissions --output docs/results/community_registry.md
tradearena hash-run outputs/examples/audit_walkthrough_trajectory.json

See docs/benchmark_submissions.md.
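A reproducibility hash is only useful if the same run always serializes to the same bytes. The sketch below shows one common approach, SHA-256 over canonical JSON. It is an assumption for illustration and may differ from what `tradearena hash-run` actually computes; `hash_trajectory` and the sample rows are hypothetical.

```python
import hashlib
import json
from typing import Any


def hash_trajectory(trajectory: Any) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(
        trajectory, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = [{"t": 0, "symbol": "SYN", "approved": 0.3}]
run_b = [{"approved": 0.3, "symbol": "SYN", "t": 0}]   # same content, other key order
run_c = [{"t": 0, "symbol": "SYN", "approved": 0.25}]  # one value changed

print(hash_trajectory(run_a) == hash_trajectory(run_b))
print(hash_trajectory(run_a) == hash_trajectory(run_c))
```

Sorting keys makes the digest independent of dict ordering, while any change to a recorded value produces a different digest.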

Visual Preview

[Figures: audit lifecycle, execution stress, diagnostic loop]

The browser-playable launch video is here: weich97.github.io/TradeArena/demo_video.html.

What TradeArena Provides

| Need | TradeArena surface |
| --- | --- |
| Replayable decisions | Trajectory logs with prompts, memory digests, risk reports, fills, and metrics |
| Execution stress model | Configurable fees, spread, slippage, latency, liquidity caps, partial fills, rejections, and calibration diagnostics |
| Risk-aware evaluation | Pre-trade gates, in-trade monitors, post-trade attribution, violations |
| Extensibility | Data, analyst, strategy, risk, simulator, memory, planner, evaluator plugins |
| Community benchmarks | Redacted submission schema, registry builder, reproducibility hashes |

Extension Path

Start with one small plugin:

python examples/custom_plugin_demo.py
python examples/extension_walkthrough_demo.py

The walkthrough swaps in a custom analyst, risk manager, and evaluator while reusing the existing runner, data provider, strategy, execution simulator, memory store, trajectory logger, and metric stack.
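For a sense of what "one small plugin" might look like, here is a deterministic analyst written against a structural interface. The `Analyst` protocol, its `analyze` signature, and `MomentumAnalyst` are hypothetical; check examples/custom_plugin_demo.py for the interface the package actually expects.

```python
from typing import Protocol


class Analyst(Protocol):
    """Assumed plugin shape: map recent closes to (score, confidence) per symbol."""

    def analyze(self, closes: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
        ...


class MomentumAnalyst:
    """Scores each symbol by its return over a fixed lookback; no LLM involved."""

    def __init__(self, lookback: int = 3, confidence: float = 0.5) -> None:
        self.lookback = lookback
        self.confidence = confidence

    def analyze(self, closes: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
        signals: dict[str, tuple[float, float]] = {}
        for symbol, series in closes.items():
            if len(series) <= self.lookback:
                continue  # not enough history: emit no signal
            ret = series[-1] / series[-1 - self.lookback] - 1.0
            score = max(-1.0, min(1.0, 10 * ret))  # squash into [-1, 1]
            signals[symbol] = (score, self.confidence)
        return signals


analyst: Analyst = MomentumAnalyst()
signals = analyst.analyze({"SYN": [100.0, 101.0, 102.0, 105.0], "ALT": [50.0, 50.0]})
print(signals)
```

Because the class only has to match the protocol's method shape, it can be swapped in without inheriting from a framework base class.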


Documentation Map

Quickstart: docs/getting_started.md
Advanced integration safety: docs/advanced_integrations_security.md
Schemas: docs/schemas.md
Execution model: docs/execution_model.md
Benchmark submissions: docs/benchmark_submissions.md
Related work: docs/related_work.md
Retail planning sandbox: docs/retail_planning.md
Research protocol: docs/research_protocol.md
Security policy: SECURITY.md
Governance: GOVERNANCE.md

Local Checks

Each checkout can use its own .venv, so public and private repos do not fight over editable installs:

powershell -ExecutionPolicy Bypass -File scripts\check_local.ps1

The script installs the current checkout in editable mode, runs compile checks, Ruff critical checks, tests, release-readiness checks, submission validation, artifact-contract validation, and JSON validation.

Safety Boundary

TradeArena does not promise profitable trading, does not provide financial advice, and does not execute live trades by default. Public examples are offline, paper-only, or human-review oriented. Broker and provider integrations must follow docs/advanced_integrations_security.md, SECURITY.md, and GOVERNANCE.md.

Cite

See CITATION.cff. If you use TradeArena in research or software, cite the repository release you used.
