weich97/TradeArena

Open-source benchmark and audit framework for evaluating LLM trading agents under explicit execution, risk, and replayability constraints.

Getting started | PyPI | Project site | Benchmark card | Submit results | Demo matrix | Contribute | Security

TradeArena

TradeArena turns every trading-agent decision into a traceable trajectory:

observation -> signal -> intended allocation -> risk gate -> order
  -> fill/rejection -> portfolio state -> diagnostic report

It is not another "LLM trading bot." It is a framework for asking whether an LLM trading agent can be audited, reproduced, stress-tested, and constrained before anyone trusts its headline return.

Technical Mechanics

TradeArena is organized as a deterministic agent loop. The runner in src/tradearena/core/runner.py executes the same lifecycle at every market timestamp:

observe market snapshot
  -> collect analyst signals
  -> convert signals into target-weight decisions
  -> clip or block decisions with the risk manager
  -> convert approved targets into orders
  -> simulate fills under latency, liquidity, spread, slippage, and commission
  -> write risk reports, execution reports, memory events, and trajectory rows
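
As a rough illustration of the lifecycle above, the sketch below wires five stand-in callables into one deterministic loop and records both the intended and the approved allocation at each step. It is not the code in src/tradearena/core/runner.py: the names run_loop and StepRecord, the callable signatures, and the merged order/fill stage are all assumptions made for brevity.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One trajectory row (hypothetical schema): intent versus executable outcome."""
    timestamp: int
    signals: dict
    intended: dict
    approved: dict
    fills: dict


def run_loop(snapshots, analyst, strategy, risk_gate, simulator):
    """Run the same observe -> signal -> allocate -> gate -> fill cycle per snapshot."""
    trajectory = []
    for t, snapshot in enumerate(snapshots):
        signals = analyst(snapshot)            # collect analyst signals
        intended = strategy(signals)           # signals -> target weights
        approved = risk_gate(intended)         # clip or block decisions
        fills = simulator(approved, snapshot)  # orders and fills, merged here for brevity
        trajectory.append(StepRecord(t, signals, intended, approved, fills))
    return trajectory


# Toy components; every one is a stand-in, not a TradeArena class.
snapshots = [{"SYN": 100.0}, {"SYN": 101.0}]
trajectory = run_loop(
    snapshots,
    analyst=lambda snap: {s: 0.5 for s in snap},
    strategy=lambda sig: {s: 5 * v for s, v in sig.items()},
    risk_gate=lambda w: {s: max(-0.3, min(0.3, v)) for s, v in w.items()},
    simulator=lambda w, snap: dict(w),
)
```

Because each row keeps intended next to approved, the effect of the risk gate can be read off the trajectory without re-running the agent.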

The default allocation logic is intentionally simple and inspectable. In SignalWeightedStrategy, analyst signals are grouped by symbol, confidence-weighted, and converted into target weights:

combined_score(symbol) =
  sum(signal.score * max(0.01, signal.confidence)) /
  sum(max(0.01, signal.confidence))

target_weight = clip(5 * combined_score, -max_short_weight, max_long_weight)

Scores that fall inside the deadband become HOLD. The optional MemoryAwareSignalWeightedStrategy applies a risk-off scale when recent memory contains drawdowns, rejected orders, or risk violations. Classical baselines are also available, including equal-weight buy-and-hold and a rolling minimum-variance strategy that estimates realized covariance from the current trajectory only.
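
The two formulas above can be checked by hand with a few lines of Python. This is a minimal sketch, not the SignalWeightedStrategy source: the function names, the default limits, the deadband value, the choice to apply the deadband to the combined score, and the use of None to represent HOLD are all assumptions.

```python
def combined_score(signals):
    """Confidence-weighted mean score for one symbol.

    signals is a list of (score, confidence) pairs; each confidence is
    floored at 0.01 so that no signal is discarded entirely.
    """
    if not signals:
        return 0.0  # assumption: no signals means a neutral score
    weights = [max(0.01, confidence) for _, confidence in signals]
    weighted = sum(score * w for (score, _), w in zip(signals, weights))
    return weighted / sum(weights)


def target_weight(score, max_long_weight=0.3, max_short_weight=0.0, deadband=0.02):
    """Scale the score by 5 and clip it; None stands for HOLD (hypothetical convention)."""
    if abs(score) < deadband:
        return None  # inside the deadband: keep the current position
    return max(-max_short_weight, min(max_long_weight, 5 * score))


# Two analysts on one symbol: a confident bullish view and a weak bearish one.
score = combined_score([(0.10, 0.8), (-0.02, 0.2)])  # (0.08 - 0.004) / 1.0, about 0.076
weight = target_weight(score)                         # 5 * 0.076 = 0.38, clipped to 0.3
```

The factor of 5 means that a combined score of 0.06 already reaches a 0.3 long cap, so with these illustrative limits the clip is active for most non-trivial signals.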

Execution is split into two stages. First, TargetWeightExecutionAgent translates approved target weights into market orders by comparing current position value with target portfolio value. Trades below min_trade_value are skipped to avoid noise. Second, RealisticOrderSimulator applies a configurable paper-execution stress model:

  • submitted orders enter a pending queue and become eligible after latency_steps;
  • per-symbol fill capacity is capped by bar.volume * participation_rate;
  • buys cannot exceed available cash, and sells cannot exceed holdings unless shorting is enabled;
  • market orders cross half the configured bid-ask spread;
  • execution price includes base slippage, spread, market impact, and intrabar volatility:
slip_rate =
  spread_bps / 20000
  + base_slippage_bps / 10000
  + market_impact * (filled_quantity / volume)
  + 0.1 * ((high - low) / close)
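
A worked instance of the slip_rate expression may help when sanity-checking parameters. The function below mirrors the four terms term by term; the fill_price helper, which applies the rate symmetrically around the bar close, is an assumption, since the text above does not state the reference price.

```python
def slip_rate(spread_bps, base_slippage_bps, market_impact,
              filled_quantity, volume, high, low, close):
    """Fractional price penalty for one fill; assumes volume > 0 and close > 0."""
    half_spread = spread_bps / 20000          # bps -> fraction, then half the spread
    base = base_slippage_bps / 10000          # bps -> fraction
    impact = market_impact * (filled_quantity / volume)
    intrabar = 0.1 * ((high - low) / close)   # 10% of the bar's relative range
    return half_spread + base + impact + intrabar


def fill_price(side, close, rate):
    """Hypothetical helper: buys pay above the close, sells receive below it."""
    sign = 1 if side == "buy" else -1
    return close * (1 + sign * rate)


rate = slip_rate(spread_bps=10, base_slippage_bps=2, market_impact=0.1,
                 filled_quantity=500, volume=100_000,
                 high=101.0, low=99.0, close=100.0)
# 0.0005 + 0.0002 + 0.0005 + 0.0020 = 0.0032, i.e. 32 bps
buy_price = fill_price("buy", 100.0, rate)
```

In this example the intrabar-range term contributes 20 of the 32 bps, which shows how strongly wide bars dominate the penalty at low participation.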

The simulator records requested quantity, filled quantity, fill ratio, latency, liquidity available, commission, slippage cost, partial fills, pending orders, and rejections in an ExecutionReport. Its default settings are transparent stress-test assumptions, not a claim of broker-grade transaction-cost calibration.
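
The queueing and capacity rules in the list above can be sketched as a single function that decides how much of a pending order fills on a given bar. This is an illustrative sketch rather than RealisticOrderSimulator itself: PendingOrder, fill_quantity, the signed-quantity convention, and the default values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PendingOrder:
    symbol: str
    quantity: float       # positive = buy, negative = sell (assumed convention)
    submitted_step: int


def fill_quantity(order, step, bar_volume, price, cash, holdings, *,
                  latency_steps=1, participation_rate=0.1, allow_short=False):
    """Return the signed quantity that can fill at this step (0.0 while pending)."""
    if step - order.submitted_step < latency_steps:
        return 0.0                                    # not yet eligible
    capacity = bar_volume * participation_rate        # per-symbol liquidity cap
    qty = min(abs(order.quantity), capacity)
    if order.quantity > 0:
        qty = min(qty, cash / price)                  # buys limited by available cash
    elif not allow_short:
        qty = min(qty, max(holdings, 0.0))            # sells limited by holdings
    return qty if order.quantity > 0 else -qty


buy = PendingOrder("SYN", 2000, submitted_step=0)
pending = fill_quantity(buy, step=0, bar_volume=10_000, price=100.0,
                        cash=50_000, holdings=0)
filled = fill_quantity(buy, step=1, bar_volume=10_000, price=100.0,
                       cash=50_000, holdings=0)
fill_ratio = filled / buy.quantity   # 500 of 2000 requested
```

Here the liquidity cap alone would allow 1,000 shares, but cash limits the fill to 500, so the order is a partial fill with a ratio of 0.25.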

Execution Calibration Boundary

Execution realism is only meaningful when its assumptions are visible. TradeArena therefore separates the simulator equation from parameter calibration:

Parameter | Default role | Calibration source needed
commission_bps | explicit fee on traded notional | broker or exchange fee schedule
spread_bps | full quoted spread; market orders cross half | quote/NBBO or order-book snapshots
base_slippage_bps | residual shortfall before spread, impact, and bar volatility | historical order/fill logs
participation_rate | cap on fillable bar volume | execution policy or parent-order participation target
latency_steps | bar-delay before an order is eligible | submission, acknowledgement, and fill timestamps
market_impact | coefficient on participation | regression of implementation shortfall on participation

The tracked Yahoo Finance OHLCV files can estimate bar range, tail range, dollar volume, and participation-cap diagnostics. They cannot identify quoted spread, queue depth, fee tier, latency, or realized shortfall. For that reason, current public benchmark results should be read as execution-stress comparisons under shared assumptions. Live-market execution claims require replacing the defaults with quote/fill-calibrated parameters.
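
To make the boundary concrete, the sketch below computes the kind of statistics that OHLCV bars alone can support. It is not scripts/calibrate_execution_model.py: the function name, the dict-based bar layout, the 95th-percentile definition of the tail range, and the output keys are assumptions.

```python
import statistics


def bar_diagnostics(bars, participation_rate=0.1):
    """Summarize bar range, tail range, dollar volume, and participation caps.

    bars is a non-empty list of dicts with high, low, close, and volume keys.
    Nothing here estimates quoted spread, latency, or realized shortfall.
    """
    ranges = sorted((b["high"] - b["low"]) / b["close"] for b in bars)
    dollar_volume = [b["close"] * b["volume"] for b in bars]
    caps = [b["volume"] * participation_rate for b in bars]
    tail_index = min(len(ranges) - 1, int(0.95 * len(ranges)))
    return {
        "median_range": statistics.median(ranges),
        "tail_range_p95": ranges[tail_index],
        "median_dollar_volume": statistics.median(dollar_volume),
        "median_participation_cap": statistics.median(caps),
    }


bars = [
    {"high": 101.0, "low": 99.0, "close": 100.0, "volume": 1000},
    {"high": 102.0, "low": 100.0, "close": 101.0, "volume": 2000},
    {"high": 105.0, "low": 95.0, "close": 100.0, "volume": 3000},
]
diagnostics = bar_diagnostics(bars)
```

Every output is derived from prices and volume only, which is why such diagnostics can bound the range and liquidity terms but cannot calibrate spread_bps, latency_steps, or base_slippage_bps.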

Run the diagnostic:

python scripts/calibrate_execution_model.py --data-dir data/real/yahoo_intraday_1h_50

This writes docs/results/execution_calibration_intraday_1h.json and docs/results/execution_calibration_intraday_1h.md. Full details are in docs/execution_model.md.

Risk control is an auditable gate, not a hidden post-processing step. MaxPositionRiskManager runs three checks:

  • pre-trade approval clips per-symbol weights to max_abs_weight, blocks decisions below min_confidence, rescales gross exposure above max_gross_exposure, and reports projected turnover above max_single_step_turnover;
  • in-trade monitoring checks realized participation, latency, and slippage against max_order_participation, max_latency_steps, and max_slippage_bps;
  • post-trade attribution reports realized PnL, commission, slippage cost, and final exposures.
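
The pre-trade stage in the list above can be sketched as a pure function that returns both the approved weights and the list of interventions. This is a simplified illustration, not MaxPositionRiskManager: the function name, the default limits, the tuple-based violation records, and the choice to hold the current weight when a decision is blocked are assumptions.

```python
def pre_trade_gate(targets, confidences, current, *, max_abs_weight=0.3,
                   min_confidence=0.2, max_gross_exposure=1.0,
                   max_single_step_turnover=0.5):
    """Return (approved_weights, violations) for one decision step."""
    approved, violations = {}, []
    for symbol, weight in targets.items():
        if confidences.get(symbol, 0.0) < min_confidence:
            violations.append((symbol, "min_confidence"))
            approved[symbol] = current.get(symbol, 0.0)   # blocked: keep current weight
            continue
        clipped = max(-max_abs_weight, min(max_abs_weight, weight))
        if clipped != weight:
            violations.append((symbol, "max_abs_weight"))
        approved[symbol] = clipped
    gross = sum(abs(w) for w in approved.values())
    if gross > max_gross_exposure:
        scale = max_gross_exposure / gross                # rescale, preserving proportions
        approved = {s: w * scale for s, w in approved.items()}
        violations.append(("portfolio", "max_gross_exposure"))
    turnover = sum(abs(w - current.get(s, 0.0)) for s, w in approved.items())
    if turnover > max_single_step_turnover:
        violations.append(("portfolio", "max_single_step_turnover"))  # reported, not blocked
    return approved, violations


approved, violations = pre_trade_gate(
    targets={"SYN": 0.5, "ALT": 0.4, "XYZ": 0.2},
    confidences={"SYN": 0.9, "ALT": 0.8, "XYZ": 0.1},
    current={},
)
```

In this example SYN and ALT are clipped to 0.3, XYZ is blocked for low confidence, and the resulting 0.6 turnover is reported against the 0.5 limit without changing the weights.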

Every intervention is serialized as a RiskReport with RiskCheck and RiskViolation records. The trajectory therefore preserves both the model's original intent and the executable decision after risk feedback, which is the core substrate for risk-feedback, representation-drift, and hallucination-audit experiments.

Quick Start: Deterministic Smoke Test

python -m pip install tradearena-benchmark
tradearena --benchmark tradearena-core

This default command intentionally does not call an LLM. It is a no-key smoke test for the runner, trajectory schema, risk gate, execution simulator, and metric stack. It uses deterministic analysts so every new checkout can pass CI-style validation before provider keys, model routing, or billing enter the loop.

The PyPI distribution is tradearena-benchmark because tradearena is already occupied on PyPI by an unrelated project. The import namespace and CLI remain tradearena.

To run the full local showcase:

git clone https://github.com/weich97/TradeArena.git
cd TradeArena
python -m pip install -e ".[dev]"
python scripts/run_showcase.py

Then open:

outputs/examples/index.html

The first-run path uses deterministic agents, tracked snapshots, and local demo artifacts. It does not call DeepSeek, Poe, OpenAI, Hugging Face, AkShare, Yahoo Finance, or broker APIs unless you opt into the model or data commands below.

LLM Run Paths

TradeArena supports LLM trading-agent experiments, but the repository keeps live provider calls out of the default path. Use the path that matches what you want to verify:

Path | Calls an LLM? | Purpose
tradearena --benchmark tradearena-core | No | Deterministic smoke test for core mechanics
python examples/llm_cache_replay_demo.py | No | Redacted manifest of prior LLM experiment coverage; no raw prompts or responses
tradearena --benchmark llm-smoke ... | Yes, unless a matching cache row exists | Minimal live/cache-backed LLM analyst run
tradearena --paper-output ... | Optional | Larger paper-grade suite with cache-first LLM sections

Minimal live LLM smoke test through Poe (PowerShell syntax):

$env:POE_API_KEY="..."
tradearena --benchmark llm-smoke `
  --analysts poe-llm `
  --llm-model gpt-5.5 `
  --periods 3 `
  --symbols SYN,ALT `
  --llm-cache outputs/examples/poe_llm_smoke_cache.jsonl

Minimal live LLM smoke test through DeepSeek (PowerShell syntax):

$env:DEEPSEEK_API_KEY="..."
tradearena --benchmark llm-smoke `
  --analysts deepseek-llm `
  --llm-model deepseek-v4-flash `
  --periods 3 `
  --symbols SYN,ALT `
  --llm-cache outputs/examples/deepseek_llm_smoke_cache.jsonl

These commands run one LLM analyst case and write cache entries locally. The cache is deliberately ignored by Git because raw prompts and responses can be subject to provider, licensing, privacy, or portfolio-confidentiality constraints.
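
The cache-first behaviour ("calls an LLM unless a matching cache row exists") can be illustrated with a small JSONL lookup. The sketch below is not TradeArena's cache implementation: the row schema with key and response fields, the SHA-256 key over model and prompt, and the function names are assumptions, and the provider is a local stand-in so nothing leaves the machine.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(model, prompt):
    """Stable key for one request (hypothetical scheme)."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache_path, model, prompt, call_provider):
    """Return (response, was_cached); call the provider only on a cache miss."""
    path = Path(cache_path)
    key = cache_key(model, prompt)
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip() and json.loads(line)["key"] == key:
                    return json.loads(line)["response"], True
    response = call_provider(model, prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"key": key, "response": response}) + "\n")
    return response, False


calls = []


def fake_provider(model, prompt):
    """Stand-in for a live provider; records each call."""
    calls.append(prompt)
    return "HOLD"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "llm_smoke_cache.jsonl"
    first = cached_call(cache_file, "demo-model", "Assess SYN", fake_provider)
    second = cached_call(cache_file, "demo-model", "Assess SYN", fake_provider)
```

The second call is served from the file, so a repeated run with an unchanged prompt and model incurs no provider cost and returns the same text.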

Advanced Integrations Safety

DeepSeek, Poe-hosted models, OpenAI-compatible chat endpoints, AkShare, Yahoo Finance, and broker-facing workflows are opt-in advanced paths. They are not part of the first-run command, and they must stay inside an explicit audit boundary:

Surface | Default boundary | Public artifact policy
LLM providers | Environment-variable keys, cache-first replay, signals only | Track metrics and redacted manifests, not raw prompt/response caches
Yahoo Finance / AkShare | Download to normalized OHLCV CSV with source metadata | Record source, frequency, symbols, timestamp policy, and adjustment mode
Execution model | Stress assumptions unless calibrated with quote/fill logs | State parameter sources; do not call bar-only diagnostics broker-grade
Broker adapters | Paper export or human-review sandbox only | No live submission in default examples; no credentials in artifacts

Use per-session environment variables or an OS secret manager. Do not commit .env files, provider JSONL caches, broker tokens, account statements, or private holdings. If a run needs to be shared, publish a redacted submission or cache manifest instead of raw provider text.
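
One way to follow this guidance in a custom script is to read keys only from the process environment and to mask them before anything is logged. The helpers below are a hypothetical sketch, not part of the tradearena package; the function names and the mask format are assumptions.

```python
import os


def require_api_key(name, environ=None):
    """Fetch a key from the environment and fail loudly if it is missing."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it for this session only")
    return value


def redact(secret, keep=4):
    """Mask all but the last few characters so the value is safe to log."""
    if len(secret) <= keep:
        return "*" * len(secret)
    return "*" * (len(secret) - keep) + secret[-keep:]


# A plain dict stands in for os.environ so the example has no side effects.
key = require_api_key("POE_API_KEY", {"POE_API_KEY": "sk-abcdef1234"})
masked = redact(key)
```

Passing the environment as an argument keeps the lookup testable without touching real credentials.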

The full checklist is in docs/advanced_integrations_security.md.

No local install yet?

Open in GitHub Codespaces | Open in Colab

Install And Run

From a clone:

python -m pip install -e ".[dev]"
tradearena --benchmark tradearena-core
python -m tradearena.cli --benchmark tradearena-core

From GitHub without cloning first:

python -m pip install "git+https://github.com/weich97/TradeArena.git"
tradearena --benchmark tradearena-core

Benchmark Result

The v0.1 benchmark card makes one compact claim:

LLM trading-agent evaluation changes materially once intended allocations pass through auditable risk gates and explicit execution-stress constraints.

Open:

Rebuild:

python scripts/build_benchmark_page.py
python scripts/build_benchmark_registry.py examples/benchmark_submissions

Submit Or Validate A Benchmark Row

TradeArena supports redacted benchmark submissions. They share scenario, execution, risk, metrics, and reproducibility metadata without exposing raw provider prompts, responses, credentials, or private portfolios.

tradearena validate-submission examples/benchmark_submissions/example_redacted_submission.json
tradearena build-registry examples/benchmark_submissions --output docs/results/community_registry.md
tradearena hash-run outputs/examples/audit_walkthrough_trajectory.json

See docs/benchmark_submissions.md.
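
A reproducibility hash is only useful if it ignores incidental differences such as key order. The sketch below shows one common approach, hashing a canonical JSON encoding with SHA-256. The algorithm used by tradearena hash-run is not described here, so the function name and the canonicalization choices are assumptions.

```python
import hashlib
import json


def trajectory_hash(trajectory):
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(trajectory, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"steps": [{"symbol": "SYN", "weight": 0.3}], "seed": 7}
run_b = {"seed": 7, "steps": [{"weight": 0.3, "symbol": "SYN"}]}   # same content, reordered
run_c = {"seed": 7, "steps": [{"weight": 0.31, "symbol": "SYN"}]}  # one value changed

same = trajectory_hash(run_a) == trajectory_hash(run_b)
different = trajectory_hash(run_a) != trajectory_hash(run_c)
```

Sorting keys makes the digest independent of serialization order, while any change to a recorded value produces a different digest.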

Visual Preview

Audit lifecycle | Execution stress | Diagnostic loop
[Animation: observe-plan-risk-execute-reflect audit trace] | [Animation: execution comparison of ideal, realistic, high-spread, low-liquidity, and high-latency fills] | [Animation: representation, risk-feedback, and concentration diagnostics]

The browser-playable launch video is here: weich97.github.io/TradeArena/demo_video.html.

What TradeArena Provides

Need | TradeArena surface
Replayable decisions | Trajectory logs with prompts, memory digests, risk reports, fills, and metrics
Execution stress model | Configurable fees, spread, slippage, latency, liquidity caps, partial fills, rejections, and calibration diagnostics
Risk-aware evaluation | Pre-trade gates, in-trade monitors, post-trade attribution, violations
Extensibility | Data, analyst, strategy, risk, simulator, memory, planner, evaluator plugins
Community benchmarks | Redacted submission schema, registry builder, reproducibility hashes

Extension Path

Start with one small plugin:

python examples/custom_plugin_demo.py
python examples/extension_walkthrough_demo.py

The walkthrough swaps in a custom analyst, risk manager, and evaluator while reusing the existing runner, data provider, strategy, execution simulator, memory store, trajectory logger, and metric stack.
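
For orientation, a custom analyst in this style might look like the sketch below. The real plugin interfaces live in the repository and are not reproduced here: the Signal dataclass, the Analyst protocol, the analyze method name, and the snapshot layout (symbol mapped to a list of recent closes) are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Signal:
    """Hypothetical signal record: a score and a confidence in [0, 1]."""
    symbol: str
    score: float
    confidence: float


class Analyst(Protocol):
    """Hypothetical plugin contract: one method from snapshot to signals."""
    def analyze(self, snapshot: dict) -> list: ...


class MomentumAnalyst:
    """Deterministic analyst: score is the return over a fixed lookback."""

    def __init__(self, lookback=3):
        self.lookback = lookback

    def analyze(self, snapshot):
        signals = []
        for symbol, closes in snapshot.items():
            if len(closes) <= self.lookback:
                continue  # not enough history for this symbol
            ret = closes[-1] / closes[-1 - self.lookback] - 1.0
            signals.append(Signal(symbol, score=ret, confidence=min(1.0, abs(ret) * 10)))
        return signals


analyst: Analyst = MomentumAnalyst(lookback=3)
signals = analyst.analyze({"SYN": [100.0, 101.0, 102.0, 104.0], "ALT": [50.0, 50.0]})
```

Because the analyst is deterministic and has no provider dependency, it can be exercised in the no-key smoke-test path before any LLM-backed analyst is swapped in.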

Useful entry points:

Documentation Map

Local Checks

Each checkout can use its own .venv, so public and private repos do not fight over editable installs:

powershell -ExecutionPolicy Bypass -File scripts\check_local.ps1

The script installs the current checkout in editable mode, then runs compile checks, Ruff critical checks, tests, release-readiness checks, submission validation, artifact-contract validation, and JSON validation.

Safety Boundary

TradeArena does not promise profitable trading, does not provide financial advice, and does not execute live trades by default. Public examples are offline, paper-only, or human-review oriented. Broker and provider integrations must follow docs/advanced_integrations_security.md, SECURITY.md, and GOVERNANCE.md.

Cite

See CITATION.cff. If you use TradeArena in research or software, cite the repository release you used.
