ORACLE — Optimal Reasoning Agent for Calibrated Leveraged Execution

AI Forecasting Hackathon 2026 · Trading Track · University of Chicago

ORACLE is an autonomous AI trading agent that takes positions on real-world prediction markets using a four-stage reasoning pipeline: intelligent market selection, live web evidence gathering, superforecaster-style probability estimation, and Kelly-optimal bet sizing — all running end-to-end without human intervention.

The Problem
The Solution
Innovation
Features
User Journey
System Architecture
Workflow & Orchestration
Data Flow & State Management
Tech Stack
AI Deep Dive — Claude claude-sonnet-4-6
Impact
Real-World Use Cases
Comparison
Scalability
Responsible AI and Ethics
Evaluation Criteria Alignment
Trade-offs
Project Complexity Tiers
Installation & Setup
Why This Will Win
Future Scope
FAQ
Lessons Learned

1. The Problem

Prediction markets are among the most information-efficient markets in the world — prices aggregate the beliefs of thousands of participants and consistently outperform expert panels at forecasting real-world events. Yet most participants trade on gut feel, incomplete information, and uncalibrated confidence.

The result is systematic mispricing:

Markets overreact to recent news and underweight base rates
Low-attention markets are chronically mis-priced due to low liquidity
Anchoring bias keeps prices stuck near round numbers (50%, 70%, 90%)
Human traders hold losing positions too long (disposition effect)

The core challenge: Can an AI agent exploit these inefficiencies systematically, at scale, without human intervention — and make a profit doing it?

2. The Solution

ORACLE is a fully autonomous trading agent that:

Ingests live Kalshi prediction markets every 15 minutes via the Prophet Arena evaluation harness
Filters candidates to only the most liquid and informationally tractable markets
Researches each selected market with targeted web searches for breaking evidence
Estimates calibrated probabilities using a superforecaster-style structured reasoning prompt with Claude's extended thinking
Sizes each trade using the fractional Kelly Criterion — mathematically optimal for maximizing long-run wealth growth
Submits trade intents to the Prophet Arena harness for deterministic execution

ORACLE runs continuously for the full 2-week evaluation window (May 17–31, 2026) on the team's own API keys, with zero human intervention required after deployment.

3. Innovation

ORACLE's novelty lies in the combination of four ideas, none of which appear together in any existing open-source prediction market agent:

3.1 Superforecaster Prompt Engineering

Rather than asking an LLM "what's the probability?", ORACLE uses a structured five-step decomposition borrowed from Philip Tetlock's Good Judgment Project research:

Reference class / base rate identification
Inside view (specific evidence for this event)
Outside view (comparable historical events)
Market anchor analysis (why might the market be wrong?)
Calibration check (am I overconfident?)

This structure forces the model to reason explicitly rather than pattern-match to a number, yielding dramatically better-calibrated outputs.

3.2 Extended Thinking for Hard Markets

ORACLE uses Anthropic's extended thinking feature for the primary forecast call, giving Claude a dedicated reasoning budget (5,000 tokens) before producing its final probability. This is analogous to "slow thinking" in dual-process theory — particularly valuable for markets with ambiguous evidence or novel event types.

3.3 Fractional Kelly Criterion

Most trading bots use fixed bet sizes. ORACLE uses the Kelly Criterion — the mathematically optimal position-sizing formula for maximizing expected log-wealth. We use 25% fractional Kelly for a 4× safety margin, giving:

edge = |p_oracle - market_mid|
kelly_fraction = edge / (1 - entry_price)
notional = kelly_fraction × 0.25 × available_cash

This means ORACLE bets proportional to its conviction — high-confidence divergences get large positions, marginal edges get tiny ones.

3.4 Live Evidence Integration Per Tick

Unlike static LLM forecasters that rely on training-data knowledge, ORACLE runs Tavily web search per market per tick, pulling fresh news, polls, and expert commentary. This means ORACLE's probabilities reflect the world as it is right now, not as it was at training cutoff.

4. Features

Feature	Description
4-Stage Pipeline	REVIEW → SEARCH → FORECAST → ACTION, fully modular
Live Web Research	Tavily advanced search per market, synthesized by Claude
Extended Thinking	5,000-token reasoning budget for hard forecasting problems
Kelly Sizing	25% fractional Kelly — mathematically optimal bet sizing
Ensemble Forecasting	Weighted average of primary + secondary LLM models
Liquidity Filtering	Skips illiquid markets (low volume, wide spread)
Topic Diversification	Max 3 positions per topic family (politics, crypto, sports…)
Drawdown Halt	Stops trading if portfolio loses >30% from peak
Cash Reserve	Keeps 20% of equity as cash buffer at all times
Idempotent Intents	Each trade intent has a deterministic idempotency key
Parallel Forecasting	Up to 5 concurrent SEARCH+FORECAST threads per tick
Retry Logic	Exponential backoff on all LLM and search API calls
Rich CLI	Live logging with color, tick stats, and portfolio summaries
Zero Secrets in Repo	All keys via env vars, `.gitignore` enforced

5. User Journey

[Team] → fills .env with API keys → runs ./run.sh
                                           │
                                    [ORACLE boots]
                                    Connects to Prophet Arena API
                                    Creates experiment
                                    Registers participant
                                           │
                              ┌────────────▼────────────┐
                              │    Every ~15 minutes     │
                              │    Tick N of 96          │
                              └────────────┬────────────┘
                                           │
                           Claims tick → fetches 50+ Kalshi markets
                                           │
                              ┌────────────▼────────────┐
                              │   REVIEW (10–20 markets) │
                              │   Filter: liquidity,     │
                              │   spread, time, family   │
                              └────────────┬────────────┘
                                           │
                              ┌────────────▼────────────┐
                              │   SEARCH + FORECAST      │
                              │   (parallel, 5 threads)  │
                              │   Web search → evidence  │
                              │   Claude + thinking →    │
                              │   calibrated probability │
                              └────────────┬────────────┘
                                           │
                              ┌────────────▼────────────┐
                              │   ACTION                 │
                              │   Kelly sizing           │
                              │   Risk guards            │
                              │   Trade decisions        │
                              └────────────┬────────────┘
                                           │
                              submit_trade_intents() → fills execute
                              complete_tick() → tick finalized
                                           │
                              [Repeat for 96 ticks / 2 weeks]
                                           │
                              [Leaderboard updates continuously]
                              [Winner announced June 1, 2026]

6. System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         ORACLE Agent                                │
│                                                                     │
│  ┌──────────┐    ┌──────────────────────────────────────────────┐  │
│  │ main.py  │───▶│              loop.py                         │  │
│  │ CLI /    │    │  run_experiment() → run_tick() × 96          │  │
│  │ config   │    │  claim_tick → candidates → pipeline → submit │  │
│  └──────────┘    └──────────────────────────────────────────────┘  │
│                           │                                         │
│          ┌────────────────┼────────────────┐                        │
│          ▼                ▼                ▼                        │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐               │
│  │ pipeline/    │ │ pipeline/    │ │ pipeline/    │               │
│  │ review.py    │ │ search.py    │ │ forecast.py  │               │
│  │              │ │              │ │              │               │
│  │ Liquidity    │ │ Tavily API   │ │ Claude API   │               │
│  │ Spread       │ │ Query gen    │ │ + Thinking   │               │
│  │ Time filter  │ │ LLM synth    │ │ Ensemble     │               │
│  │ Family cap   │ │ Evidence     │ │ p ∈ [0,1]    │               │
│  └──────────────┘ └──────────────┘ └──────────────┘               │
│                                          │                          │
│                               ┌──────────▼──────────┐             │
│                               │  pipeline/action.py  │             │
│                               │  Kelly Criterion     │             │
│                               │  Drawdown halt       │             │
│                               │  Position limits     │             │
│                               └──────────────────────┘             │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    utils/                                    │  │
│  │  llm.py (Anthropic + OpenAI clients, retry)                  │  │
│  │  search.py (Tavily provider + no-op fallback)                │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
         │                                        │
         ▼                                        ▼
┌─────────────────┐                    ┌──────────────────────┐
│  Prophet Arena  │                    │  External APIs       │
│  api.aiprophet  │                    │  Anthropic (Claude)  │
│  .dev           │                    │  Tavily (Search)     │
│  ai-prophet-    │                    │  OpenAI (optional)   │
│  core SDK       │                    └──────────────────────┘
└─────────────────┘

7. Workflow & Orchestration

Experiment Lifecycle

startup
  │
  ├─ health_check()           Verify Prophet Arena connectivity
  ├─ create_or_get_experiment()  Idempotent — safe to re-run
  └─ upsert_participant()     Register model + starting cash

tick loop (up to 96 ticks)
  │
  ├─ claim_tick()             Lease a 15-min execution slot
  │    └─ if no_tick_available → sleep(retry_after_sec)
  │
  ├─ get_portfolio()          Cash, equity, open positions
  ├─ get_candidates()         50–200 live Kalshi markets
  │
  ├─ REVIEW                   Filter to 10–20 candidates
  │
  ├─ [parallel × 5]
  │    ├─ gather_evidence()   2 Tavily searches → LLM synthesis
  │    └─ estimate_probability()  Claude extended thinking → p
  │
  ├─ decide_trades()          Kelly sizing, risk guards
  │
  ├─ submit_trade_intents()   Send BUY YES/NO intents with shares
  └─ complete_tick()          Finalize; server executes fills

shutdown
  └─ log final equity, PnL, trade count

Parallelism Strategy

Up to 5 concurrent SEARCH+FORECAST threads per tick
Balances throughput (faster tick completion) vs. API rate limits
Each thread is independent — failure in one does not block others

8. Data Flow & State Management

Prophet Arena API
       │
       │  get_candidates() → CandidatesResponse
       │    └─ markets: list[MarketData]
       │         ├─ market_id, question, description
       │         ├─ resolution_time, topic, family
       │         └─ quote: MarketQuote
       │              ├─ best_bid: str  ← decimal string, cast to float
       │              ├─ best_ask: str
       │              └─ volume_24h: float
       │
       ▼
  [REVIEW stage]
  select_markets() → list[MarketData]  (filtered, scored, ranked)
       │
       ▼
  [SEARCH stage]                 [FORECAST stage]
  gather_evidence()              estimate_probability()
  → evidence: str                → (probability, reasoning, confidence)
       │                               │
       └───────────────────────────────┘
                        │
                        ▼
               (MarketData, float, str)  ← per market tuple

       ▼
  [ACTION stage]
  decide_trades() → list[TradeDecision]
    └─ TradeDecision: market_id, action, side, size (notional $)

       ▼
  shares = notional / entry_price  ← YES: ask price, NO: 1-bid

       ▼
  TradeIntentRequest(market_id, action="BUY", side="YES"|"NO",
                     shares="0.4250", idempotency_key=...)

       ▼
  submit_trade_intents()  →  TradeSubmissionResult
  complete_tick()

State is entirely server-side — ORACLE is stateless between ticks.
Portfolio, positions, and fills all live on the Prophet Arena server.

9. Tech Stack

Layer	Technology	Purpose
Agent Runtime	Python 3.12	Core language
SDK	`ai-prophet-core` 0.1.5	Prophet Arena API client, data models
Primary LLM	Anthropic Claude claude-sonnet-4-6	Forecasting with extended thinking
Ensemble LLM	Anthropic Claude Haiku 4.5	Secondary forecaster for ensemble
Web Search	Tavily Advanced Search	Real-time news and evidence retrieval
Config	Pydantic v2 + PyYAML	Typed settings with env var overrides
Retry Logic	Tenacity	Exponential backoff on all external calls
CLI / Logging	Rich	Colored terminal output, structured logs
Concurrency	`concurrent.futures.ThreadPoolExecutor`	Parallel market forecasting
Dependencies	`httpx`, `python-dotenv`	HTTP client, env loading

10. AI Deep Dive — Claude claude-sonnet-4-6

Why Claude claude-sonnet-4-6?

Claude claude-sonnet-4-6 was chosen as the primary forecaster for three reasons:

Extended Thinking — Claude's native extended thinking mode allocates a separate reasoning budget (up to 5,000 tokens) before producing a final answer. For prediction market forecasting — which requires weighing multiple competing hypotheses — this "slow thinking" dramatically improves calibration compared to single-pass generation.
Instruction Following — The superforecaster prompt requires the model to follow a strict five-step reasoning chain and output structured JSON. Claude claude-sonnet-4-6 reliably adheres to this format even under complex prompts.
World Knowledge Recency — With a knowledge cutoff of August 2025, Claude claude-sonnet-4-6 has strong baseline knowledge of geopolitical, economic, and technological events relevant to the markets we trade.

The Superforecaster Prompt

The forecast system prompt encodes Tetlock's Good Judgment Project methodology:

1. REFERENCE CLASS: What is the base rate for this category of event?
2. INSIDE VIEW: What specific evidence points toward YES? Toward NO?
3. OUTSIDE VIEW: How unusual would YES be vs. comparable past events?
4. MARKET ANCHOR: The market prices this at X%. Why might it be wrong?
5. CALIBRATION CHECK: Am I overconfident? Anchoring too much on market?
→ Output: { "probability": 0.XX, "reasoning": "...", "confidence": "..." }

Ensemble Architecture

Primary call: Claude claude-sonnet-4-6 (thinking enabled)  ──▶ p1 (weight: 75%)
                                                                      │
Ensemble call: Claude Haiku 4.5 (fast)  ───────────────────▶ p2 (weight: 25%)
                                                                      │
                                                       ▼
                                              p_final = 0.75*p1 + 0.25*p2

The ensemble reduces systematic bias from any single model's idiosyncrasies — Haiku often catches cases where claude-sonnet-4-6's extended thinking over-anchors on a narrative.

Evidence Synthesis

Before forecasting, ORACLE synthesizes web search results with a dedicated synthesis call:

[2 Tavily queries] → [raw search results] → [synthesis call: Claude Haiku]
                                                        │
                                              "Evidence paragraph"
                                                        │
                                              [Forecast call: Claude claude-sonnet-4-6 + thinking]

Using Haiku for synthesis (cheap, fast) and claude-sonnet-4-6 for forecasting (expensive, slow) optimizes the cost/quality tradeoff.

11. Impact

Quantitative Goals

PnL: Target positive return over the 2-week evaluation window
Sharpe Ratio: Aim for risk-adjusted returns superior to naive 50/50 baseline
Trade Efficiency: High edge-to-trade ratio (only trade when we have real conviction)
Calibration: Probability estimates within 5% of true frequencies over time

Broader Implications

ORACLE demonstrates that structured LLM reasoning + information retrieval + mathematical position sizing can extract systematic alpha from prediction markets. This has implications beyond hackathons:

Financial services: LLM-augmented trading in information markets
Policy forecasting: Calibrated AI predictions for government decision-making
Insurance: Better risk pricing through AI-driven probability estimation
Corporate strategy: Real-time probability estimates for strategic planning scenarios

12. Real-World Use Cases

Domain	Example Market	ORACLE's Edge
Politics	"Will Senate pass bill X by Dec 31?"	Tracks legislative calendars, whip counts, news
Economics	"Will CPI exceed 3% in Q3 2026?"	Integrates latest Fed data, inflation indicators
Technology	"Will GPT-5 be released before July 2026?"	Monitors OpenAI announcements, researcher leaks
Sports	"Will [Team] win the championship?"	Real-time injury reports, odds movement
Science	"Will FDA approve drug X by year-end?"	Tracks clinical trial results, regulatory calendars
Geopolitics	"Will Country X hold elections in 2026?"	Monitors diplomatic signals, news wires

13. Comparison

	ORACLE	Naive LLM (no structure)	Human Trader	Market Baseline
Evidence source	Live web search	Training data only	Manual research	Crowd wisdom
Probability method	Structured decomposition	Direct generation	Intuition	Aggregate bets
Position sizing	Fractional Kelly	Fixed size	Fixed size	N/A
Bias mitigation	Explicit steps 4 & 5	None	Varies	Anchoring present
Speed	~2 min/tick	~30s/tick	Hours	Instant
Cost per tick	~$0.10–0.30	~$0.01	High (human time)	Zero
Drawdown protection	30% halt + reserve	None	Varies	N/A
Scalability	100s of markets	100s of markets	~5 markets	Unlimited

14. Scalability

ORACLE is designed to scale along several dimensions:

More Markets

The max_markets_per_tick config scales from 20 → 100+ with no code changes
Parallel workers (max_workers=5) scales with API rate limits

More Models

Add models to forecast.ensemble_models in config.yaml — no code changes
Each new model gets a configurable weight in the ensemble

More Platforms

review.py accepts any list[MarketData] — plug in Polymarket, Manifold, or other sources
action.py outputs TradeDecision structs — any execution venue can consume them

Cost Scaling

Markets per tick	Search cost	LLM cost	Total per tick
10	$0.01	$0.05	~$0.06
20	$0.02	$0.10	~$0.12
50	$0.05	$0.25	~$0.30

Over 96 ticks, a 20-market config costs roughly $12–15 total in API fees.

15. Responsible AI and Ethics

Market Integrity

ORACLE trades only within the risk limits of the Prophet Arena harness
Position sizes are bounded by Kelly fractions — no infinite leverage
Drawdown halt prevents "doubling down" in a loss spiral

Transparency

All reasoning is logged (probability, reasoning, confidence per market)
No black-box decisions — every trade has an attached evidence + reasoning string
Open source under MIT license

API Usage

Rate limits respected via max_workers cap and tenacity retry
No attempt to circumvent platform limits or scrape at abusive rates
Search queries are informational — no adversarial content generation

Bias Awareness

The superforecaster prompt explicitly checks for overconfidence (step 5)
Market anchor step (step 4) reduces anchoring bias
Ensemble averaging reduces single-model systematic bias

16. Evaluation Criteria Alignment

The hackathon scores purely on PnL over the 2-week window. ORACLE aligns to this:

What drives PnL	How ORACLE addresses it
Find real edge	Structured decomposition + web evidence yields calibrated probabilities that genuinely diverge from market when justified
Don't overtrade	5% minimum edge threshold filters noise; only trade with conviction
Size correctly	Fractional Kelly maximizes expected log-wealth — the mathematically proven optimal strategy for repeated binary bets
Protect capital	30% drawdown halt + 20% cash reserve prevent ruin
Stay running	Retry logic, error handling, and stateless design keep the agent alive for all 96 ticks

17. Trade-offs

Decision	Choice Made	Alternative	Reason
Search provider	Tavily	Brave, Exa, none	Tavily advanced gives full content, not just snippets
Kelly fraction	25%	50% (more aggressive)	25% provides 4× safety margin; prediction markets are noisy
Parallelism	5 threads	10+ threads	Balances speed vs. LLM rate limits
Evidence synthesis	Separate LLM call (Haiku)	Inline in forecast prompt	Cleaner context; Haiku is 10× cheaper for this task
Thinking budget	5,000 tokens	10,000+ tokens	Diminishing returns; 5k captures most benefit at reasonable cost
Min edge	5%	3% (more trades)	Lower edge → more noise trades → worse PnL
Stateless design	Server owns all state	Local state	Crash resilience; no risk of stale local state diverging

18. Project Complexity Tiers

ORACLE spans multiple complexity tiers:

Tier 1 — Basic (working agent)

Uses ai-prophet-core SDK correctly
Claims ticks, submits valid intents, completes ticks
Doesn't crash during 96-tick run

Tier 2 — Intermediate (intelligent agent)

Filters markets by liquidity, spread, and time-to-resolution
Uses an LLM to generate probability estimates
Respects portfolio and position limits

Tier 3 — Advanced (calibrated agent)

Structured superforecaster prompt with explicit reasoning steps
Extended thinking for improved calibration
Fractional Kelly Criterion position sizing

Tier 4 — Expert (ORACLE)

Live web evidence integration per market per tick
Multi-model weighted ensemble averaging
Drawdown halt + cash reserve risk management
Parallel forecasting with error isolation
Idempotent intent submission with unique keys
Full observability via structured logging

19. Installation & Setup

Prerequisites

Python 3.11+
API keys: Prophet Arena, Anthropic (Claude), Tavily (recommended)

Quick Start

# 1. Clone
git clone https://github.com/YOUR_HANDLE/oracle-ai-forecast
cd oracle-ai-forecast

# 2. Add API keys
cp .env.example .env
# Open .env and fill in:
#   PA_SERVER_API_KEY   (from https://prophetarena.co/developer)
#   ANTHROPIC_API_KEY   (from https://console.anthropic.com)
#   TAVILY_API_KEY      (from https://tavily.com — free tier works)

# 3. Run
./run.sh

run.sh automatically creates a virtual environment, installs dependencies, validates keys, and launches the agent.

Environment Variables

Variable	Required	Description
`PA_SERVER_API_KEY`	✅	Prophet Arena API key
`ANTHROPIC_API_KEY`	✅	Anthropic Claude API key
`TAVILY_API_KEY`	Recommended	Tavily search key (free tier: 1,000 searches/month)
`OPENAI_API_KEY`	Optional	For OpenAI ensemble model
`ORACLE_SLUG`	Optional	Experiment name (default: `oracle-v1`)
`ORACLE_MAX_TICKS`	Optional	Max ticks (default: `96`)
`ORACLE_STARTING_CASH`	Optional	Starting cash in USD (default: `1000.0`)

CLI Options

./run.sh --help
./run.sh --no-search          # Disable web search (faster, ~$0.01/tick)
./run.sh --ticks 5            # Quick smoke test (5 ticks only)
./run.sh --slug test-run-v2   # Custom experiment name
./run.sh -v                   # Verbose / debug logging

Configuration Tuning

Edit config.yaml to tune any parameter without touching code:

action:
  min_edge: 0.05        # Only trade with ≥5% edge
  kelly_fraction: 0.25  # Conservative 25% Kelly
  max_position_notional: 50.0

forecast:
  thinking_budget_tokens: 5000
  ensemble_models:
    - "anthropic:claude-haiku-4-5-20251001"

20. Why This Will Win

1. Mathematically grounded sizing — Most hackathon trading agents use fixed bet sizes or arbitrary rules. ORACLE uses the Kelly Criterion, which is proven to maximize long-run wealth among all possible betting strategies. Over 96 ticks, this compounds.

2. Evidence that actually updates beliefs — Web search isn't decoration. Markets are often priced on stale information. When ORACLE finds a breaking news item that changes the true probability by 10%, and the market hasn't caught up yet, that's a tradeable edge.

3. Structured reasoning reduces LLM hallucination risk — Unstructured prompts ("what's the probability of X?") let LLMs pattern-match to a plausible-sounding number. The five-step superforecaster structure forces explicit reasoning that can be audited and is empirically more calibrated.

4. Extended thinking on hard questions — Claude's extended thinking gives ORACLE a qualitative advantage on complex, multi-factor markets where the answer isn't obvious from surface features.

5. Capital preservation first — Traders who survive all 96 ticks with capital intact and a small positive return will beat agents that make 3 big bets, win twice, and blow up on the third. ORACLE's risk management is designed for the long game.

21. Future Scope

Calibration tracking: Log predicted vs. resolved probabilities per market, use outcomes to retrain/adjust the prompt
Sentiment signals: Integrate X/Twitter and Reddit sentiment as additional evidence features
Market microstructure: Use order book depth (bid/ask sizes) to detect informed trading and update priors
Reinforcement learning loop: Train a small classifier on past ORACLE trades to predict which searches yield the most edge
Multi-platform arbitrage: Detect price discrepancies between Kalshi and Polymarket for the same underlying event
Adaptive Kelly: Dynamically adjust Kelly fraction based on recent calibration error (lower Kelly when model is overconfident)

22. FAQ

Q: Does ORACLE use real money? A: No — ORACLE trades in the Prophet Arena evaluation harness, which uses simulated capital. The harness tracks positions and PnL against live market prices but does not execute real orders on Kalshi.

Q: What happens if the agent crashes mid-run? A: ORACLE is stateless — all state (portfolio, positions, fills) lives on the Prophet Arena server. Restarting ./run.sh with the same ORACLE_SLUG resumes the same experiment via create_or_get_experiment().

Q: How much will it cost to run for 2 weeks? A: With Tavily free tier and 20 markets per tick, estimated total: $12–20 in Anthropic API fees over 96 ticks. The --no-search flag cuts this to ~$6.

Q: Can I run this without a Tavily key? A: Yes — set search.enabled: false in config.yaml or use --no-search. ORACLE will forecast based on Claude's training knowledge only, without live web evidence. Performance will be lower on fast-moving markets.

Q: Why not use GPT-4o? A: Claude claude-sonnet-4-6 was chosen for its extended thinking capability and strong instruction following. GPT-4o is supported as an ensemble model via OPENAI_API_KEY.

23. Lessons Learned

Inspect the SDK before writing code. The ai-prophet-core SDK's actual field types (cash: str, best_bid: str, shares: str) differ from what documentation suggests. Always run python -c "from ai_prophet_core import X; print(X.model_fields)" before assuming types.

Kelly without guardrails is dangerous. Full Kelly (fraction=1.0) is theoretically optimal but practically catastrophic with miscalibrated probabilities. 25% fractional Kelly accounts for the fact that our probability estimates are imperfect.

Search synthesis is as important as search retrieval. Raw search results are noisy — a separate LLM synthesis step that focuses on event-relevant facts dramatically improves forecast quality over just pasting raw snippets into the prompt.

Structured prompts beat open-ended prompts for calibration. We tested free-form probability elicitation vs. the five-step superforecaster structure. The structured version produces dramatically less overconfident outputs and more nuanced hedging on uncertain questions.

Stateless agent design is a feature, not a limitation. Keeping all state server-side means crash recovery is trivial, the agent can be restarted without data loss, and there is no risk of local cache diverging from truth.

Project Structure

oracle-ai-forecast/
├── agent/
│   ├── config.py          # Pydantic config with env var loading
│   ├── loop.py            # Experiment + tick orchestration
│   ├── main.py            # CLI entry point (argparse + Rich)
│   ├── pipeline/
│   │   ├── review.py      # Stage 1: market filtering & scoring
│   │   ├── search.py      # Stage 2: web search + LLM synthesis
│   │   ├── forecast.py    # Stage 3: superforecaster probability
│   │   └── action.py      # Stage 4: Kelly sizing + risk guards
│   └── utils/
│       ├── llm.py         # Anthropic + OpenAI wrappers + retry
│       └── search.py      # Tavily + no-op search providers
├── config.yaml            # All tunable parameters
├── pyproject.toml         # Package metadata + dependencies
├── pyrefly.toml           # Type checker configuration
├── run.sh                 # Judge-facing entry point
├── .env.example           # Environment variable template
└── LICENSE                # MIT

License

MIT — see LICENSE for details.

Built at the AI Forecasting Hackathon 2026 · University of Chicago Evaluation window: May 17 – May 31, 2026

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
agent		agent
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
index.html		index.html
pyproject.toml		pyproject.toml
pyrefly.toml		pyrefly.toml
run.sh		run.sh

Folders and files

Latest commit

History

Repository files navigation

ORACLE — Optimal Reasoning Agent for Calibrated Leveraged Execution

Table of Contents

1. The Problem

2. The Solution

3. Innovation

3.1 Superforecaster Prompt Engineering

3.2 Extended Thinking for Hard Markets

3.3 Fractional Kelly Criterion

3.4 Live Evidence Integration Per Tick

4. Features

5. User Journey

6. System Architecture

7. Workflow & Orchestration

Experiment Lifecycle

Parallelism Strategy

8. Data Flow & State Management

9. Tech Stack

10. AI Deep Dive — Claude claude-sonnet-4-6

Why Claude claude-sonnet-4-6?

The Superforecaster Prompt

Ensemble Architecture

Evidence Synthesis

11. Impact

Quantitative Goals

Broader Implications

12. Real-World Use Cases

13. Comparison

14. Scalability

More Markets

More Models

More Platforms

Cost Scaling

15. Responsible AI and Ethics

Market Integrity

Transparency

API Usage

Bias Awareness

16. Evaluation Criteria Alignment

17. Trade-offs

18. Project Complexity Tiers

19. Installation & Setup

Prerequisites

Quick Start

Environment Variables

CLI Options

Configuration Tuning

20. Why This Will Win

21. Future Scope

22. FAQ

23. Lessons Learned

Project Structure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages