# Kalshi Market Calibration Analysis

**Purpose**: Validate that Kalshi market prices are well-calibrated probability estimates.

**Why This Matters**: A well-calibrated market means:
- Events predicted at 70% actually happen ~70% of the time
- Trading strategies can trust price signals
- Edge exists through timing, not market inefficiency

**Phase 1C Requirement**: Demonstrate calibration quality with reliability diagram.
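
As a quick illustration of the 70% bullet above, the sketch below simulates a hypothetical, perfectly calibrated forecaster on synthetic data (not Kalshi prices). The quoted probability, sample size, and seed are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical forecaster: quotes 70% on 10,000 independent events
# whose true probability really is 0.7 (synthetic data).
quoted_prob = 0.7
simulated_outcomes = rng.random(10_000) < quoted_prob

# For a calibrated forecaster, the observed frequency should sit close
# to the quoted probability, up to sampling noise.
observed_freq = simulated_outcomes.mean()
print(f"Quoted: {quoted_prob:.2f} | Observed frequency: {observed_freq:.3f}")
```

The rest of this notebook asks the same question of real market prices, one probability bin at a time.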

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Import utils
from utils import get_engine, load_market_data
from utils.visualization import plot_reliability_diagram

sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (14, 7)

%matplotlib inline

## Load Market Data

In [None]:
engine = get_engine()

# Load all market snapshots
df = load_market_data(engine, min_snapshots=10)

print(f"Loaded {len(df):,} snapshots")
print(f"Unique tickers: {df['ticker'].nunique()}")
print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
df.head()

## Identify Resolved Markets

**Challenge**: We need market outcomes to assess calibration.

**Approach**:
1. Find markets that stopped updating (likely resolved)
2. Infer outcome from the final price level (above 0.9 → YES, below 0.1 → NO, otherwise excluded)
3. Validate calibration on resolved markets

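**Note**: This is a heuristic. A market that has merely gone quiet is treated as resolved, and its outcome is guessed from price rather than read from a settlement record. Production systems would use Kalshi's settlement API.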

In [None]:
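# Find markets that haven't updated recently (likely resolved)
# Note: datetime.now() is timezone-naive local time, so this assumes
# df['timestamp'] is stored the same way
now = datetime.now()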
ticker_last_update = df.groupby('ticker')['timestamp'].max()
hours_since_update = (now - ticker_last_update).dt.total_seconds() / 3600

# Consider markets resolved if no update in 2+ hours
resolved_tickers = hours_since_update[hours_since_update > 2].index

print(f"Found {len(resolved_tickers)} potentially resolved markets")
print(f"Sample tickers: {list(resolved_tickers[:5])}")

In [None]:
# For resolved markets, infer outcome from final price
# If yes_prob > 0.9 at end, assume YES outcome (1)
# If yes_prob < 0.1 at end, assume NO outcome (0)
# Otherwise, exclude (ambiguous)

def infer_outcome(ticker_df):
    """Infer market outcome from final price."""
    final_price = ticker_df.iloc[-1]['yes_prob']
    
    if final_price > 0.9:
        return 1  # YES outcome
    elif final_price < 0.1:
        return 0  # NO outcome
    else:
        return None  # Ambiguous

outcomes = {}
for ticker in resolved_tickers:
    ticker_df = df[df['ticker'] == ticker].sort_values('timestamp')
    outcome = infer_outcome(ticker_df)
    if outcome is not None:
        outcomes[ticker] = outcome

print(f"Inferred outcomes for {len(outcomes)} markets")
print(f"YES outcomes: {sum(outcomes.values())}")
print(f"NO outcomes: {len(outcomes) - sum(outcomes.values())}")

## Prepare Calibration Data

For each market snapshot, we need:
- Predicted probability (yes_prob)
- Actual outcome (0 or 1)

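We'll use all snapshots from resolved markets. Two consequences are worth keeping in mind: markets with more snapshots carry more weight, and snapshots from the same market share one outcome, so they are not independent observations.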
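
Because each outcome label is inferred from the market's final snapshot, that snapshot agrees with its label by construction. One possible refinement, which is not part of the pipeline in this notebook, is to drop each ticker's last snapshot before scoring. The sketch below shows the idea on a toy table with the same columns as the calibration dataset built in the next cell. `drop_final_snapshot` is a hypothetical helper name.

```python
import pandas as pd

def drop_final_snapshot(calib_df: pd.DataFrame) -> pd.DataFrame:
    """Remove each ticker's latest snapshot (hypothetical helper)."""
    # Latest timestamp per ticker, broadcast back onto every row
    last_ts = calib_df.groupby('ticker')['timestamp'].transform('max')
    return calib_df[calib_df['timestamp'] < last_ts].reset_index(drop=True)

# Toy data: two markets, the last row of each is the outcome-defining snapshot
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime([
        '2024-01-01 10:00', '2024-01-01 11:00', '2024-01-01 12:00',
        '2024-01-01 10:00', '2024-01-01 11:30',
    ]),
    'predicted_prob': [0.55, 0.80, 0.97, 0.30, 0.04],
    'actual_outcome': [1, 1, 1, 0, 0],
})

trimmed = drop_final_snapshot(toy)
print(trimmed)
```

Whether to apply this filter is a judgment call: it removes the circular rows, but snapshots taken shortly before the last one remain closely tied to the inferred label.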

In [None]:
# Build calibration dataset: one row per snapshot of every market with an
# inferred outcome. The vectorized form keeps the expected columns even when
# `outcomes` is empty, so the summary prints below do not raise a KeyError.
calib_df = (
    df.loc[df['ticker'].isin(list(outcomes)), ['ticker', 'timestamp', 'yes_prob']]
    .rename(columns={'yes_prob': 'predicted_prob'})
    .reset_index(drop=True)
)

# Attach the inferred outcome (0 or 1) to every snapshot of its market
calib_df['actual_outcome'] = calib_df['ticker'].map(outcomes)

print(f"Calibration dataset: {len(calib_df):,} predictions")
print(f"From {calib_df['ticker'].nunique()} markets")
calib_df.head()

## Reliability Diagram

**The CRITICAL visualization for Phase 1C.**

This plot shows:
- **X-axis**: Predicted probability (what market says)
- **Y-axis**: Actual frequency (what actually happened)
- **Perfect calibration**: Points on diagonal line
- **Overestimating**: Points below diagonal (predicted > actual)
- **Underestimating**: Points above diagonal (predicted < actual)
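
The plotting helper `plot_reliability_diagram` lives in `utils.visualization` and is not shown in this notebook. As a rough sketch of what the plotted points represent, the hypothetical function `reliability_points` below bins predictions into equal-width bins and reports, for each non-empty bin, the mean prediction (x), the observed YES frequency (y), and the sample count. This is an assumption about the general technique, not a description of the helper's actual implementation.

```python
import numpy as np
import pandas as pd

def reliability_points(predicted_probs, actual_outcomes, n_bins=10):
    """Per-bin mean prediction, observed frequency, and count (hypothetical helper)."""
    predicted = np.asarray(predicted_probs, dtype=float)
    actual = np.asarray(actual_outcomes, dtype=float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1;
    # a prediction of exactly 1.0 falls into the last bin.
    bin_idx = np.digitize(predicted, edges[1:-1])

    frame = pd.DataFrame({'bin': bin_idx, 'predicted': predicted, 'actual': actual})
    return frame.groupby('bin').agg(
        mean_predicted=('predicted', 'mean'),
        observed_freq=('actual', 'mean'),
        count=('predicted', 'size'),
    ).reset_index()

# Toy data, not market data
toy_predicted = [0.05, 0.15, 0.15, 0.85, 0.95, 0.95]
toy_actual = [0, 0, 1, 1, 1, 1]

points = reliability_points(toy_predicted, toy_actual)
print(points)
```

Keeping the per-bin count alongside each point helps when reading the diagram, since a bin with only a handful of snapshots can sit far from the diagonal by chance.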

In [None]:
if len(calib_df) > 0:
    fig = plot_reliability_diagram(
        predicted_probs=calib_df['predicted_prob'],
        actual_outcomes=calib_df['actual_outcome'],
        n_bins=10,
        figsize=(10, 10)
    )
    plt.show()
else:
    print("⚠️ Not enough resolved markets yet. Let poller run longer.")
    print("Need markets to settle (close) before calibration analysis is possible.")

## Calibration Metrics

Quantitative measures of calibration quality.
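
For reference, the three metrics computed in the next cell can be written as follows. Here $p_i$ is the predicted probability of snapshot $i$, $o_i \in \{0, 1\}$ is its outcome, $N$ is the number of snapshots, $B$ is the set of non-empty bins, and $\bar{p}_b$ and $\bar{o}_b$ are the mean prediction and observed frequency in bin $b$. This notation is introduced here for illustration and does not appear in the code.

$$
\text{Brier} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2,
\qquad
\text{MSE}_{\text{cal}} = \frac{1}{|B|}\sum_{b \in B}(\bar{p}_b - \bar{o}_b)^2,
\qquad
\text{MAE}_{\text{cal}} = \frac{1}{|B|}\sum_{b \in B}\lvert \bar{p}_b - \bar{o}_b \rvert
$$

The calibration MSE and MAE weight every non-empty bin equally, regardless of how many snapshots it contains, whereas the Brier score averages over individual snapshots.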

In [None]:
def calculate_calibration_metrics(predicted_probs, actual_outcomes, n_bins=10):
    """Calculate calibration quality metrics."""
    bins = np.linspace(0, 1, n_bins + 1)
    
    predicted_freq = []
    actual_freq = []
    
    for i in range(n_bins):
        mask = (predicted_probs >= bins[i]) & (predicted_probs < bins[i + 1])
        if i == n_bins - 1:
            mask = mask | (predicted_probs == bins[i + 1])
        
        if mask.sum() > 0:
            predicted_freq.append(predicted_probs[mask].mean())
            actual_freq.append(actual_outcomes[mask].mean())
    
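    # Calibration MSE: squared gap between mean prediction and observed
    # frequency, averaged over non-empty bins (each bin weighted equally)
    mse = np.mean((np.array(predicted_freq) - np.array(actual_freq)) ** 2)
    
    # Calibration MAE: same per-bin gap, absolute instead of squared
    mae = np.mean(np.abs(np.array(predicted_freq) - np.array(actual_freq)))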
    
    # Brier Score (lower is better)
    brier = np.mean((predicted_probs - actual_outcomes) ** 2)
    
    return {
        'calibration_mse': mse,
        'calibration_mae': mae,
        'brier_score': brier,
        'n_predictions': len(predicted_probs)
    }

if len(calib_df) > 0:
    metrics = calculate_calibration_metrics(
        calib_df['predicted_prob'].values,
        calib_df['actual_outcome'].values
    )
    
    print("\n" + "="*50)
    print("CALIBRATION QUALITY METRICS")
    print("="*50)
    print(f"Calibration MSE:      {metrics['calibration_mse']:.4f}")
    print(f"Calibration MAE:      {metrics['calibration_mae']:.4f}")
    print(f"Brier Score:          {metrics['brier_score']:.4f}")
    print(f"Total Predictions:    {metrics['n_predictions']:,}")
    print("="*50)
    
    # Interpretation
    if metrics['calibration_mse'] < 0.01:
        print("\n✅ EXCELLENT calibration (MSE < 0.01)")
    elif metrics['calibration_mse'] < 0.05:
        print("\n✅ GOOD calibration (MSE < 0.05)")
    elif metrics['calibration_mse'] < 0.10:
        print("\n⚠️ FAIR calibration (MSE < 0.10)")
    else:
        print("\n❌ POOR calibration (MSE >= 0.10)")
else:
    print("⚠️ No calibration data available yet.")

## Calibration by Time to Resolution

Check whether calibration varies with the time remaining until resolution. Each market's last snapshot is used as a proxy for its resolution time.

In [None]:
if len(calib_df) > 0:
    # Hours from each snapshot to its market's last snapshot, which serves
    # as a proxy for resolution time (vectorized per-ticker max)
    last_seen = calib_df.groupby('ticker')['timestamp'].transform('max')
    calib_df['hours_to_resolution'] = (
        (last_seen - calib_df['timestamp']).dt.total_seconds() / 3600
    )
    
    # Group by time buckets
    time_buckets = [
        (0, 1, 'Last Hour'),
        (1, 6, '1-6 Hours'),
        (6, 24, '6-24 Hours'),
        (24, float('inf'), '24+ Hours')
    ]
    
    print("\nCalibration by Time to Resolution:")
    print("-" * 60)
    
    for min_h, max_h, label in time_buckets:
        mask = (calib_df['hours_to_resolution'] >= min_h) & (calib_df['hours_to_resolution'] < max_h)
        subset = calib_df[mask]
        
        if len(subset) > 10:  # Need minimum data
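            # Separate name so the overall `metrics` dict from the
            # previous cell is not overwritten
            bucket_metrics = calculate_calibration_metrics(
                subset['predicted_prob'].values,
                subset['actual_outcome'].values
            )
            print(f"{label:15} | MSE: {bucket_metrics['calibration_mse']:.4f} | N: {len(subset):>5}")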
else:
    print("⚠️ Not enough data for time-based analysis.")

## Interpretation & Next Steps

### What Good Calibration Means
- Markets are efficient at aggregating information
- Prices reflect true probabilities
- Trading edge comes from timing, not market mispricing

### What to Do Next

**If calibration is good (MSE < 0.05)**:
- ✅ Proceed with strategy development
- Focus on timing (mean reversion, momentum)
- Risk management is critical (markets are smart)

**If calibration is poor (MSE ≥ 0.10)**:
- 🔍 Investigate why:
  - Not enough resolved markets?
  - Outcome inference incorrect?
  - Markets need more time to settle?
- Consider focusing on specific market types
- May indicate inefficiencies to exploit

**Data Requirements**:
- Need ≥100 predictions from ≥10 markets for reliable calibration
- Markets must have resolved (settled)
- May take 1-7 days depending on market types
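
The first requirement above can be checked before trusting any metric. The sketch below is one minimal way to do so. `has_enough_data` is a hypothetical helper, its default thresholds mirror the numbers in the list, and it assumes a `ticker` column as in the calibration dataset.

```python
import pandas as pd

def has_enough_data(calib_df, min_predictions=100, min_markets=10):
    """True when both minimum-size thresholds are met (hypothetical helper)."""
    n_predictions = len(calib_df)
    n_markets = calib_df['ticker'].nunique() if n_predictions else 0
    return n_predictions >= min_predictions and n_markets >= min_markets

# Toy examples: 12 markets x 10 snapshots vs. 2 markets x 60 snapshots
enough = pd.DataFrame({'ticker': [f"MKT{i}" for i in range(12) for _ in range(10)]})
too_few_markets = pd.DataFrame({'ticker': ['MKT0'] * 60 + ['MKT1'] * 60})

print(has_enough_data(enough))           # 120 predictions from 12 markets
print(has_enough_data(too_few_markets))  # 120 predictions from only 2 markets
```

The second toy table has plenty of rows but fails the market-count threshold, which is the case where snapshot counts alone would be misleading.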

### Phase 1C Completion
Once we have:
1. ✅ Reliability diagram (this notebook)
2. ✅ Calibration metrics calculated
3. ✅ Strategy with Sharpe >0.5 (from 02_strategy_backtest.ipynb)

→ **Phase 1C COMPLETE** → Proceed to Phase 2 (Real-time monitoring)

In [None]:
# Save calibration results for future reference
if len(calib_df) > 0:
    calib_df.to_csv('../data/calibration_results.csv', index=False)
    print("\n✅ Calibration results saved to data/calibration_results.csv")