Skip to content

rohanjain1648/ORACLE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ORACLE — Optimal Reasoning Agent for Calibrated Leveraged Execution

AI Forecasting Hackathon 2026 · Trading Track · University of Chicago

ORACLE is an autonomous AI trading agent that takes positions on real-world prediction markets using a four-stage reasoning pipeline: intelligent market selection, live web evidence gathering, superforecaster-style probability estimation, and Kelly-optimal bet sizing — all running end-to-end without human intervention.


Table of Contents

  1. The Problem
  2. The Solution
  3. Innovation
  4. Features
  5. User Journey
  6. System Architecture
  7. Workflow & Orchestration
  8. Data Flow & State Management
  9. Tech Stack
  10. AI Deep Dive — Claude claude-sonnet-4-6
  11. Impact
  12. Real-World Use Cases
  13. Comparison
  14. Scalability
  15. Responsible AI and Ethics
  16. Evaluation Criteria Alignment
  17. Trade-offs
  18. Project Complexity Tiers
  19. Installation & Setup
  20. Why This Will Win
  21. Future Scope
  22. FAQ
  23. Lessons Learned

1. The Problem

Prediction markets are among the most information-efficient markets in the world — prices aggregate the beliefs of thousands of participants and consistently outperform expert panels at forecasting real-world events. Yet most participants trade on gut feel, incomplete information, and uncalibrated confidence.

The result is systematic mispricing:

  • Markets overreact to recent news and underweight base rates
  • Low-attention markets are chronically mis-priced due to low liquidity
  • Anchoring bias keeps prices stuck near round numbers (50%, 70%, 90%)
  • Human traders hold losing positions too long (disposition effect)

The core challenge: Can an AI agent exploit these inefficiencies systematically, at scale, without human intervention — and make a profit doing it?


2. The Solution

ORACLE is a fully autonomous trading agent that:

  1. Ingests live Kalshi prediction markets every 15 minutes via the Prophet Arena evaluation harness
  2. Filters candidates to only the most liquid and informationally tractable markets
  3. Researches each selected market with targeted web searches for breaking evidence
  4. Estimates calibrated probabilities using a superforecaster-style structured reasoning prompt with Claude's extended thinking
  5. Sizes each trade using the fractional Kelly Criterion — mathematically optimal for maximizing long-run wealth growth
  6. Submits trade intents to the Prophet Arena harness for deterministic execution

ORACLE runs continuously for the full 2-week evaluation window (May 17–31, 2026) on the team's own API keys, with zero human intervention required after deployment.


3. Innovation

ORACLE's novelty lies in the combination of four ideas, none of which appear together in any existing open-source prediction market agent:

3.1 Superforecaster Prompt Engineering

Rather than asking an LLM "what's the probability?", ORACLE uses a structured five-step decomposition borrowed from Philip Tetlock's Good Judgment Project research:

  1. Reference class / base rate identification
  2. Inside view (specific evidence for this event)
  3. Outside view (comparable historical events)
  4. Market anchor analysis (why might the market be wrong?)
  5. Calibration check (am I overconfident?)

This structure forces the model to reason explicitly rather than pattern-match to a number, yielding dramatically better-calibrated outputs.

3.2 Extended Thinking for Hard Markets

ORACLE uses Anthropic's extended thinking feature for the primary forecast call, giving Claude a dedicated reasoning budget (5,000 tokens) before producing its final probability. This is analogous to "slow thinking" in dual-process theory — particularly valuable for markets with ambiguous evidence or novel event types.

3.3 Fractional Kelly Criterion

Most trading bots use fixed bet sizes. ORACLE uses the Kelly Criterion — the mathematically optimal position-sizing formula for maximizing expected log-wealth. We use 25% fractional Kelly for a 4× safety margin, giving:

edge = |p_oracle - market_mid|
kelly_fraction = edge / (1 - entry_price)
notional = kelly_fraction × 0.25 × available_cash

This means ORACLE bets proportional to its conviction — high-confidence divergences get large positions, marginal edges get tiny ones.

3.4 Live Evidence Integration Per Tick

Unlike static LLM forecasters that rely on training-data knowledge, ORACLE runs Tavily web search per market per tick, pulling fresh news, polls, and expert commentary. This means ORACLE's probabilities reflect the world as it is right now, not as it was at training cutoff.


4. Features

Feature Description
4-Stage Pipeline REVIEW → SEARCH → FORECAST → ACTION, fully modular
Live Web Research Tavily advanced search per market, synthesized by Claude
Extended Thinking 5,000-token reasoning budget for hard forecasting problems
Kelly Sizing 25% fractional Kelly — mathematically optimal bet sizing
Ensemble Forecasting Weighted average of primary + secondary LLM models
Liquidity Filtering Skips illiquid markets (low volume, wide spread)
Topic Diversification Max 3 positions per topic family (politics, crypto, sports…)
Drawdown Halt Stops trading if portfolio loses >30% from peak
Cash Reserve Keeps 20% of equity as cash buffer at all times
Idempotent Intents Each trade intent has a deterministic idempotency key
Parallel Forecasting Up to 5 concurrent SEARCH+FORECAST threads per tick
Retry Logic Exponential backoff on all LLM and search API calls
Rich CLI Live logging with color, tick stats, and portfolio summaries
Zero Secrets in Repo All keys via env vars, .gitignore enforced

5. User Journey

[Team] → fills .env with API keys → runs ./run.sh
                                           │
                                    [ORACLE boots]
                                    Connects to Prophet Arena API
                                    Creates experiment
                                    Registers participant
                                           │
                              ┌────────────▼────────────┐
                              │    Every ~15 minutes     │
                              │    Tick N of 96          │
                              └────────────┬────────────┘
                                           │
                           Claims tick → fetches 50+ Kalshi markets
                                           │
                              ┌────────────▼────────────┐
                              │   REVIEW (10–20 markets) │
                              │   Filter: liquidity,     │
                              │   spread, time, family   │
                              └────────────┬────────────┘
                                           │
                              ┌────────────▼────────────┐
                              │   SEARCH + FORECAST      │
                              │   (parallel, 5 threads)  │
                              │   Web search → evidence  │
                              │   Claude + thinking →    │
                              │   calibrated probability │
                              └────────────┬────────────┘
                                           │
                              ┌────────────▼────────────┐
                              │   ACTION                 │
                              │   Kelly sizing           │
                              │   Risk guards            │
                              │   Trade decisions        │
                              └────────────┬────────────┘
                                           │
                              submit_trade_intents() → fills execute
                              complete_tick() → tick finalized
                                           │
                              [Repeat for 96 ticks / 2 weeks]
                                           │
                              [Leaderboard updates continuously]
                              [Winner announced June 1, 2026]

6. System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         ORACLE Agent                                │
│                                                                     │
│  ┌──────────┐    ┌──────────────────────────────────────────────┐  │
│  │ main.py  │───▶│              loop.py                         │  │
│  │ CLI /    │    │  run_experiment() → run_tick() × 96          │  │
│  │ config   │    │  claim_tick → candidates → pipeline → submit │  │
│  └──────────┘    └──────────────────────────────────────────────┘  │
│                           │                                         │
│          ┌────────────────┼────────────────┐                        │
│          ▼                ▼                ▼                        │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐               │
│  │ pipeline/    │ │ pipeline/    │ │ pipeline/    │               │
│  │ review.py    │ │ search.py    │ │ forecast.py  │               │
│  │              │ │              │ │              │               │
│  │ Liquidity    │ │ Tavily API   │ │ Claude API   │               │
│  │ Spread       │ │ Query gen    │ │ + Thinking   │               │
│  │ Time filter  │ │ LLM synth    │ │ Ensemble     │               │
│  │ Family cap   │ │ Evidence     │ │ p ∈ [0,1]    │               │
│  └──────────────┘ └──────────────┘ └──────────────┘               │
│                                          │                          │
│                               ┌──────────▼──────────┐             │
│                               │  pipeline/action.py  │             │
│                               │  Kelly Criterion     │             │
│                               │  Drawdown halt       │             │
│                               │  Position limits     │             │
│                               └──────────────────────┘             │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    utils/                                    │  │
│  │  llm.py (Anthropic + OpenAI clients, retry)                  │  │
│  │  search.py (Tavily provider + no-op fallback)                │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
         │                                        │
         ▼                                        ▼
┌─────────────────┐                    ┌──────────────────────┐
│  Prophet Arena  │                    │  External APIs       │
│  api.aiprophet  │                    │  Anthropic (Claude)  │
│  .dev           │                    │  Tavily (Search)     │
│  ai-prophet-    │                    │  OpenAI (optional)   │
│  core SDK       │                    └──────────────────────┘
└─────────────────┘

7. Workflow & Orchestration

Experiment Lifecycle

startup
  │
  ├─ health_check()           Verify Prophet Arena connectivity
  ├─ create_or_get_experiment()  Idempotent — safe to re-run
  └─ upsert_participant()     Register model + starting cash

tick loop (up to 96 ticks)
  │
  ├─ claim_tick()             Lease a 15-min execution slot
  │    └─ if no_tick_available → sleep(retry_after_sec)
  │
  ├─ get_portfolio()          Cash, equity, open positions
  ├─ get_candidates()         50–200 live Kalshi markets
  │
  ├─ REVIEW                   Filter to 10–20 candidates
  │
  ├─ [parallel × 5]
  │    ├─ gather_evidence()   2 Tavily searches → LLM synthesis
  │    └─ estimate_probability()  Claude extended thinking → p
  │
  ├─ decide_trades()          Kelly sizing, risk guards
  │
  ├─ submit_trade_intents()   Send BUY YES/NO intents with shares
  └─ complete_tick()          Finalize; server executes fills

shutdown
  └─ log final equity, PnL, trade count

Parallelism Strategy

  • Up to 5 concurrent SEARCH+FORECAST threads per tick
  • Balances throughput (faster tick completion) vs. API rate limits
  • Each thread is independent — failure in one does not block others

8. Data Flow & State Management

Prophet Arena API
       │
       │  get_candidates() → CandidatesResponse
       │    └─ markets: list[MarketData]
       │         ├─ market_id, question, description
       │         ├─ resolution_time, topic, family
       │         └─ quote: MarketQuote
       │              ├─ best_bid: str  ← decimal string, cast to float
       │              ├─ best_ask: str
       │              └─ volume_24h: float
       │
       ▼
  [REVIEW stage]
  select_markets() → list[MarketData]  (filtered, scored, ranked)
       │
       ▼
  [SEARCH stage]                 [FORECAST stage]
  gather_evidence()              estimate_probability()
  → evidence: str                → (probability, reasoning, confidence)
       │                               │
       └───────────────────────────────┘
                        │
                        ▼
               (MarketData, float, str)  ← per market tuple

       ▼
  [ACTION stage]
  decide_trades() → list[TradeDecision]
    └─ TradeDecision: market_id, action, side, size (notional $)

       ▼
  shares = notional / entry_price  ← YES: ask price, NO: 1-bid

       ▼
  TradeIntentRequest(market_id, action="BUY", side="YES"|"NO",
                     shares="0.4250", idempotency_key=...)

       ▼
  submit_trade_intents()  →  TradeSubmissionResult
  complete_tick()

State is entirely server-side — ORACLE is stateless between ticks.
Portfolio, positions, and fills all live on the Prophet Arena server.

9. Tech Stack

Layer Technology Purpose
Agent Runtime Python 3.12 Core language
SDK ai-prophet-core 0.1.5 Prophet Arena API client, data models
Primary LLM Anthropic Claude claude-sonnet-4-6 Forecasting with extended thinking
Ensemble LLM Anthropic Claude Haiku 4.5 Secondary forecaster for ensemble
Web Search Tavily Advanced Search Real-time news and evidence retrieval
Config Pydantic v2 + PyYAML Typed settings with env var overrides
Retry Logic Tenacity Exponential backoff on all external calls
CLI / Logging Rich Colored terminal output, structured logs
Concurrency concurrent.futures.ThreadPoolExecutor Parallel market forecasting
Dependencies httpx, python-dotenv HTTP client, env loading

10. AI Deep Dive — Claude claude-sonnet-4-6

Why Claude claude-sonnet-4-6?

Claude claude-sonnet-4-6 was chosen as the primary forecaster for three reasons:

  1. Extended Thinking — Claude's native extended thinking mode allocates a separate reasoning budget (up to 5,000 tokens) before producing a final answer. For prediction market forecasting — which requires weighing multiple competing hypotheses — this "slow thinking" dramatically improves calibration compared to single-pass generation.

  2. Instruction Following — The superforecaster prompt requires the model to follow a strict five-step reasoning chain and output structured JSON. Claude claude-sonnet-4-6 reliably adheres to this format even under complex prompts.

  3. World Knowledge Recency — With a knowledge cutoff of August 2025, Claude claude-sonnet-4-6 has strong baseline knowledge of geopolitical, economic, and technological events relevant to the markets we trade.

The Superforecaster Prompt

The forecast system prompt encodes Tetlock's Good Judgment Project methodology:

1. REFERENCE CLASS: What is the base rate for this category of event?
2. INSIDE VIEW: What specific evidence points toward YES? Toward NO?
3. OUTSIDE VIEW: How unusual would YES be vs. comparable past events?
4. MARKET ANCHOR: The market prices this at X%. Why might it be wrong?
5. CALIBRATION CHECK: Am I overconfident? Anchoring too much on market?
→ Output: { "probability": 0.XX, "reasoning": "...", "confidence": "..." }

Ensemble Architecture

Primary call: Claude claude-sonnet-4-6 (thinking enabled)  ──▶ p1 (weight: 75%)
                                                                      │
Ensemble call: Claude Haiku 4.5 (fast)  ───────────────────▶ p2 (weight: 25%)
                                                                      │
                                                       ▼
                                              p_final = 0.75*p1 + 0.25*p2

The ensemble reduces systematic bias from any single model's idiosyncrasies — Haiku often catches cases where claude-sonnet-4-6's extended thinking over-anchors on a narrative.

Evidence Synthesis

Before forecasting, ORACLE synthesizes web search results with a dedicated synthesis call:

[2 Tavily queries] → [raw search results] → [synthesis call: Claude Haiku]
                                                        │
                                              "Evidence paragraph"
                                                        │
                                              [Forecast call: Claude claude-sonnet-4-6 + thinking]

Using Haiku for synthesis (cheap, fast) and claude-sonnet-4-6 for forecasting (expensive, slow) optimizes the cost/quality tradeoff.


11. Impact

Quantitative Goals

  • PnL: Target positive return over the 2-week evaluation window
  • Sharpe Ratio: Aim for risk-adjusted returns superior to naive 50/50 baseline
  • Trade Efficiency: High edge-to-trade ratio (only trade when we have real conviction)
  • Calibration: Probability estimates within 5% of true frequencies over time

Broader Implications

ORACLE demonstrates that structured LLM reasoning + information retrieval + mathematical position sizing can extract systematic alpha from prediction markets. This has implications beyond hackathons:

  • Financial services: LLM-augmented trading in information markets
  • Policy forecasting: Calibrated AI predictions for government decision-making
  • Insurance: Better risk pricing through AI-driven probability estimation
  • Corporate strategy: Real-time probability estimates for strategic planning scenarios

12. Real-World Use Cases

Domain Example Market ORACLE's Edge
Politics "Will Senate pass bill X by Dec 31?" Tracks legislative calendars, whip counts, news
Economics "Will CPI exceed 3% in Q3 2026?" Integrates latest Fed data, inflation indicators
Technology "Will GPT-5 be released before July 2026?" Monitors OpenAI announcements, researcher leaks
Sports "Will [Team] win the championship?" Real-time injury reports, odds movement
Science "Will FDA approve drug X by year-end?" Tracks clinical trial results, regulatory calendars
Geopolitics "Will Country X hold elections in 2026?" Monitors diplomatic signals, news wires

13. Comparison

ORACLE Naive LLM (no structure) Human Trader Market Baseline
Evidence source Live web search Training data only Manual research Crowd wisdom
Probability method Structured decomposition Direct generation Intuition Aggregate bets
Position sizing Fractional Kelly Fixed size Fixed size N/A
Bias mitigation Explicit steps 4 & 5 None Varies Anchoring present
Speed ~2 min/tick ~30s/tick Hours Instant
Cost per tick ~$0.10–0.30 ~$0.01 High (human time) Zero
Drawdown protection 30% halt + reserve None Varies N/A
Scalability 100s of markets 100s of markets ~5 markets Unlimited

14. Scalability

ORACLE is designed to scale along several dimensions:

More Markets

  • The max_markets_per_tick config scales from 20 → 100+ with no code changes
  • Parallel workers (max_workers=5) scales with API rate limits

More Models

  • Add models to forecast.ensemble_models in config.yaml — no code changes
  • Each new model gets a configurable weight in the ensemble

More Platforms

  • review.py accepts any list[MarketData] — plug in Polymarket, Manifold, or other sources
  • action.py outputs TradeDecision structs — any execution venue can consume them

Cost Scaling

Markets per tick Search cost LLM cost Total per tick
10 $0.01 $0.05 ~$0.06
20 $0.02 $0.10 ~$0.12
50 $0.05 $0.25 ~$0.30

Over 96 ticks, a 20-market config costs roughly $12–15 total in API fees.


15. Responsible AI and Ethics

Market Integrity

  • ORACLE trades only within the risk limits of the Prophet Arena harness
  • Position sizes are bounded by Kelly fractions — no infinite leverage
  • Drawdown halt prevents "doubling down" in a loss spiral

Transparency

  • All reasoning is logged (probability, reasoning, confidence per market)
  • No black-box decisions — every trade has an attached evidence + reasoning string
  • Open source under MIT license

API Usage

  • Rate limits respected via max_workers cap and tenacity retry
  • No attempt to circumvent platform limits or scrape at abusive rates
  • Search queries are informational — no adversarial content generation

Bias Awareness

  • The superforecaster prompt explicitly checks for overconfidence (step 5)
  • Market anchor step (step 4) reduces anchoring bias
  • Ensemble averaging reduces single-model systematic bias

16. Evaluation Criteria Alignment

The hackathon scores purely on PnL over the 2-week window. ORACLE aligns to this:

What drives PnL How ORACLE addresses it
Find real edge Structured decomposition + web evidence yields calibrated probabilities that genuinely diverge from market when justified
Don't overtrade 5% minimum edge threshold filters noise; only trade with conviction
Size correctly Fractional Kelly maximizes expected log-wealth — the mathematically proven optimal strategy for repeated binary bets
Protect capital 30% drawdown halt + 20% cash reserve prevent ruin
Stay running Retry logic, error handling, and stateless design keep the agent alive for all 96 ticks

17. Trade-offs

Decision Choice Made Alternative Reason
Search provider Tavily Brave, Exa, none Tavily advanced gives full content, not just snippets
Kelly fraction 25% 50% (more aggressive) 25% provides 4× safety margin; prediction markets are noisy
Parallelism 5 threads 10+ threads Balances speed vs. LLM rate limits
Evidence synthesis Separate LLM call (Haiku) Inline in forecast prompt Cleaner context; Haiku is 10× cheaper for this task
Thinking budget 5,000 tokens 10,000+ tokens Diminishing returns; 5k captures most benefit at reasonable cost
Min edge 5% 3% (more trades) Lower edge → more noise trades → worse PnL
Stateless design Server owns all state Local state Crash resilience; no risk of stale local state diverging

18. Project Complexity Tiers

ORACLE spans multiple complexity tiers:

Tier 1 — Basic (working agent)

  • Uses ai-prophet-core SDK correctly
  • Claims ticks, submits valid intents, completes ticks
  • Doesn't crash during 96-tick run

Tier 2 — Intermediate (intelligent agent)

  • Filters markets by liquidity, spread, and time-to-resolution
  • Uses an LLM to generate probability estimates
  • Respects portfolio and position limits

Tier 3 — Advanced (calibrated agent)

  • Structured superforecaster prompt with explicit reasoning steps
  • Extended thinking for improved calibration
  • Fractional Kelly Criterion position sizing

Tier 4 — Expert (ORACLE)

  • Live web evidence integration per market per tick
  • Multi-model weighted ensemble averaging
  • Drawdown halt + cash reserve risk management
  • Parallel forecasting with error isolation
  • Idempotent intent submission with unique keys
  • Full observability via structured logging

19. Installation & Setup

Prerequisites

  • Python 3.11+
  • API keys: Prophet Arena, Anthropic (Claude), Tavily (recommended)

Quick Start

# 1. Clone
git clone https://github.com/YOUR_HANDLE/oracle-ai-forecast
cd oracle-ai-forecast

# 2. Add API keys
cp .env.example .env
# Open .env and fill in:
#   PA_SERVER_API_KEY   (from https://prophetarena.co/developer)
#   ANTHROPIC_API_KEY   (from https://console.anthropic.com)
#   TAVILY_API_KEY      (from https://tavily.com — free tier works)

# 3. Run
./run.sh

run.sh automatically creates a virtual environment, installs dependencies, validates keys, and launches the agent.

Environment Variables

Variable Required Description
PA_SERVER_API_KEY Prophet Arena API key
ANTHROPIC_API_KEY Anthropic Claude API key
TAVILY_API_KEY Recommended Tavily search key (free tier: 1,000 searches/month)
OPENAI_API_KEY Optional For OpenAI ensemble model
ORACLE_SLUG Optional Experiment name (default: oracle-v1)
ORACLE_MAX_TICKS Optional Max ticks (default: 96)
ORACLE_STARTING_CASH Optional Starting cash in USD (default: 1000.0)

CLI Options

./run.sh --help
./run.sh --no-search          # Disable web search (faster, ~$0.01/tick)
./run.sh --ticks 5            # Quick smoke test (5 ticks only)
./run.sh --slug test-run-v2   # Custom experiment name
./run.sh -v                   # Verbose / debug logging

Configuration Tuning

Edit config.yaml to tune any parameter without touching code:

action:
  min_edge: 0.05        # Only trade with ≥5% edge
  kelly_fraction: 0.25  # Conservative 25% Kelly
  max_position_notional: 50.0

forecast:
  thinking_budget_tokens: 5000
  ensemble_models:
    - "anthropic:claude-haiku-4-5-20251001"

20. Why This Will Win

1. Mathematically grounded sizing — Most hackathon trading agents use fixed bet sizes or arbitrary rules. ORACLE uses the Kelly Criterion, which is proven to maximize long-run wealth among all possible betting strategies. Over 96 ticks, this compounds.

2. Evidence that actually updates beliefs — Web search isn't decoration. Markets are often priced on stale information. When ORACLE finds a breaking news item that changes the true probability by 10%, and the market hasn't caught up yet, that's a tradeable edge.

3. Structured reasoning reduces LLM hallucination risk — Unstructured prompts ("what's the probability of X?") let LLMs pattern-match to a plausible-sounding number. The five-step superforecaster structure forces explicit reasoning that can be audited and is empirically more calibrated.

4. Extended thinking on hard questions — Claude's extended thinking gives ORACLE a qualitative advantage on complex, multi-factor markets where the answer isn't obvious from surface features.

5. Capital preservation first — Traders who survive all 96 ticks with capital intact and a small positive return will beat agents that make 3 big bets, win twice, and blow up on the third. ORACLE's risk management is designed for the long game.


21. Future Scope

  • Calibration tracking: Log predicted vs. resolved probabilities per market, use outcomes to retrain/adjust the prompt
  • Sentiment signals: Integrate X/Twitter and Reddit sentiment as additional evidence features
  • Market microstructure: Use order book depth (bid/ask sizes) to detect informed trading and update priors
  • Reinforcement learning loop: Train a small classifier on past ORACLE trades to predict which searches yield the most edge
  • Multi-platform arbitrage: Detect price discrepancies between Kalshi and Polymarket for the same underlying event
  • Adaptive Kelly: Dynamically adjust Kelly fraction based on recent calibration error (lower Kelly when model is overconfident)

22. FAQ

Q: Does ORACLE use real money? A: No — ORACLE trades in the Prophet Arena evaluation harness, which uses simulated capital. The harness tracks positions and PnL against live market prices but does not execute real orders on Kalshi.

Q: What happens if the agent crashes mid-run? A: ORACLE is stateless — all state (portfolio, positions, fills) lives on the Prophet Arena server. Restarting ./run.sh with the same ORACLE_SLUG resumes the same experiment via create_or_get_experiment().

Q: How much will it cost to run for 2 weeks? A: With Tavily free tier and 20 markets per tick, estimated total: $12–20 in Anthropic API fees over 96 ticks. The --no-search flag cuts this to ~$6.

Q: Can I run this without a Tavily key? A: Yes — set search.enabled: false in config.yaml or use --no-search. ORACLE will forecast based on Claude's training knowledge only, without live web evidence. Performance will be lower on fast-moving markets.

Q: Why not use GPT-4o? A: Claude claude-sonnet-4-6 was chosen for its extended thinking capability and strong instruction following. GPT-4o is supported as an ensemble model via OPENAI_API_KEY.


23. Lessons Learned

Inspect the SDK before writing code. The ai-prophet-core SDK's actual field types (cash: str, best_bid: str, shares: str) differ from what documentation suggests. Always run python -c "from ai_prophet_core import X; print(X.model_fields)" before assuming types.

Kelly without guardrails is dangerous. Full Kelly (fraction=1.0) is theoretically optimal but practically catastrophic with miscalibrated probabilities. 25% fractional Kelly accounts for the fact that our probability estimates are imperfect.

Search synthesis is as important as search retrieval. Raw search results are noisy — a separate LLM synthesis step that focuses on event-relevant facts dramatically improves forecast quality over just pasting raw snippets into the prompt.

Structured prompts beat open-ended prompts for calibration. We tested free-form probability elicitation vs. the five-step superforecaster structure. The structured version produces dramatically less overconfident outputs and more nuanced hedging on uncertain questions.

Stateless agent design is a feature, not a limitation. Keeping all state server-side means crash recovery is trivial, the agent can be restarted without data loss, and there is no risk of local cache diverging from truth.


Project Structure

oracle-ai-forecast/
├── agent/
│   ├── config.py          # Pydantic config with env var loading
│   ├── loop.py            # Experiment + tick orchestration
│   ├── main.py            # CLI entry point (argparse + Rich)
│   ├── pipeline/
│   │   ├── review.py      # Stage 1: market filtering & scoring
│   │   ├── search.py      # Stage 2: web search + LLM synthesis
│   │   ├── forecast.py    # Stage 3: superforecaster probability
│   │   └── action.py      # Stage 4: Kelly sizing + risk guards
│   └── utils/
│       ├── llm.py         # Anthropic + OpenAI wrappers + retry
│       └── search.py      # Tavily + no-op search providers
├── config.yaml            # All tunable parameters
├── pyproject.toml         # Package metadata + dependencies
├── pyrefly.toml           # Type checker configuration
├── run.sh                 # Judge-facing entry point
├── .env.example           # Environment variable template
└── LICENSE                # MIT

License

MIT — see LICENSE for details.

Built at the AI Forecasting Hackathon 2026 · University of Chicago Evaluation window: May 17 – May 31, 2026

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors