## Potential Analyses with the Dataset

Now that we have identified matches that appear in both datasets, we can perform integrated analyses combining match events with prediction market data.

### 5 Key Analyses

This section gives examples of integrated analyses the team can perform, based on the findings from EDA:

1. **Market Efficiency**: Do betting odds accurately predict match outcomes?
2. **Live Event Impact**: How do goals affect betting odds in real-time?
3. **xG vs Market Odds**: Can Expected Goals models identify value bets?
4. **Betting Volume Patterns**: Which teams attract the most betting interest?
5. **Playing Style Mispricing**: Do markets correctly price different team styles?

Analyses 1-3 each include the following (Analyses 4 and 5 are outlined more briefly):
- **Business Question**: What we're trying to answer
- **Method**: How to perform the analysis
- **Expected Insights**: What patterns to look for
- **Code Template**: Sample implementation

### Analysis 1: Market Efficiency - Do Odds Predict Outcomes?

**Business Question**: Are Polymarket betting odds accurate predictors of match results?

**Method**:
1. Get final pre-match odds from Polymarket
2. Get actual match outcomes from StatsBomb
3. Calculate Brier score (prediction accuracy metric)
4. Identify systematic biases (e.g., favorite-longshot bias)

**Expected Insights**:
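- Well-calibrated markets are expected to score roughly 0.20-0.25 on the Brier score (0.25 is what a constant 50% forecast earns)
- Markets may systematically misprice favorites or underdogs (the classic favorite-longshot bias overprices longshots and underprices favorites)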
- Certain competitions or match types may be harder to predict
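
A favorite-longshot bias would show up as a gap between the predicted probability and the realized win rate in the extreme probability bins. The sketch below is one way to build such a calibration table. It runs on synthetic data, and the column names `final_odds` (borrowed from the code template) and `won` (a hypothetical 0/1 outcome column that only exists after matching markets to StatsBomb results) are assumptions.

```python
import numpy as np
import pandas as pd

def calibration_table(df, prob_col="final_odds", outcome_col="won", n_bins=5):
    """Compare mean predicted probability with realized frequency per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    binned = pd.cut(df[prob_col], bins=bins, include_lowest=True)
    table = df.groupby(binned, observed=True).agg(
        mean_predicted=(prob_col, "mean"),
        realized_rate=(outcome_col, "mean"),
        n_markets=(outcome_col, "size"),
    )
    # Positive gap: outcomes happened more often than the market implied
    table["gap"] = table["realized_rate"] - table["mean_predicted"]
    return table

# Synthetic, well-calibrated toy data (illustrative only, not real market data)
rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=2000)
toy = pd.DataFrame({
    "final_odds": probs,
    "won": (rng.uniform(size=2000) < probs).astype(int),
})

calib = calibration_table(toy)
print(calib)
```

On real data, a consistently negative gap in the low-probability bins would suggest that longshots are overpriced.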

**Key Metric**: Brier Score = Average of (predicted_probability - actual_outcome)^2

The Brier score measures the accuracy of probabilistic forecasts for binary outcomes:
- Lower is better (0 = perfect predictions)
- Always predicting 50% = 0.25
- Professional markets = roughly 0.20-0.23
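
As a minimal sketch of the metric above, the helper below computes the Brier score with NumPy. The function name `brier_score` is illustrative and is not taken from the project code.

```python
import numpy as np

def brier_score(predicted, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    predicted = np.asarray(predicted, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predicted - outcomes) ** 2))

# A 65% forecast scores (0.65 - 1)^2 if the team wins and (0.65 - 0)^2 if it loses
single_win = brier_score([0.65], [1])
single_loss = brier_score([0.65], [0])

# A constant 50% forecast scores 0.25 regardless of the outcomes
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

print(round(single_win, 4), round(single_loss, 4), round(coin_flip, 4))
```

Once markets are matched to results, the same function can be applied to the full columns of final odds and outcomes.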

In [4]:
# Analysis 1: Market Efficiency 

# STEP 1: Load required data
print("="*80)
print("ANALYSIS 1: MARKET EFFICIENCY")
print("="*80)

# Load StatsBomb matches (already loaded above)
# Load Polymarket markets and odds
pm_markets = pd.read_parquet('../data/Polymarket/soccer_markets.parquet')
pm_odds = pd.read_parquet('../data/Polymarket/soccer_odds_history.parquet')
pm_tokens = pd.read_parquet('../data/Polymarket/soccer_tokens.parquet')

print(f"\nLoaded Polymarket data:")
print(f"  Markets: {len(pm_markets):,}")
print(f"  Odds snapshots: {len(pm_odds):,}")
print(f"  Tokens: {len(pm_tokens):,}")

# STEP 2: Get final pre-match odds for each market
print("\n--- Getting Final Pre-Match Odds ---")

# Sort by timestamp to get chronological order
pm_odds_sorted = pm_odds.sort_values(['market_id', 'timestamp'])

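# Keep the last snapshot per market. tail(1) returns the actual last row;
# groupby(...).last() takes the last non-null value per column, which can mix rows.
# Caveat: this snapshot is only "pre-match" if the odds history ends before kickoff,
# so filter on kickoff time once markets are matched to StatsBomb games.
final_odds = pm_odds_sorted.groupby('market_id').tail(1)
final_odds = final_odds[['market_id', 'token_id', 'price']].rename(columns={'price': 'final_odds'})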

print(f"Final odds for {len(final_odds):,} market-token pairs")

# STEP 3: Join with market and token information
markets_with_odds = pm_markets.merge(
    final_odds,
    on='market_id',
    how='inner'
)

markets_with_odds = markets_with_odds.merge(
    pm_tokens[['market_id', 'token_id', 'outcome']],
    on=['market_id', 'token_id'],
    how='inner'
)

print(f"Markets with final odds and outcomes: {len(markets_with_odds):,}")

# STEP 4: Match with StatsBomb results
# This requires team name mapping - see Integration Guide for full implementation
print("\n--- Team Mapping Required ---")
print("""
To complete this analysis, we can:
1. Create team_mapping.csv (StatsBomb team → Polymarket slug)
2. Match markets to StatsBomb games by date and teams
3. Compare final odds with actual outcomes
4. Calculate Brier score

See statsbomb_polymarket_integration.py for complete implementation.
""")

# STEP 5: Calculate Brier Score (conceptual)
print("\n--- Brier Score Calculation (Conceptual) ---")
print("""
Once matched, calculate Brier score:

brier_score = mean((predicted_probability - actual_outcome)²)

Example:
- Market: "Will Arsenal win?" 
- Final odds: 0.65 (65% probability)
- Actual: Arsenal wins (1) or loses (0)
- Brier: (0.65 - 1)² = 0.1225 (if win)
- Brier: (0.65 - 0)² = 0.4225 (if loss)

Average across all markets to get overall accuracy.
""")


print("="*80)

ANALYSIS 1: MARKET EFFICIENCY

Loaded Polymarket data:
  Markets: 8,549
  Odds snapshots: 666,837
  Tokens: 17,096

--- Getting Final Pre-Match Odds ---
Final odds for 8,322 market-token pairs
Markets with final odds and outcomes: 8,322


### Analysis 2: Live Event Impact - How Goals Affect Odds

**Business Question**: How do in-game goals affect betting odds in real-time?

**Method**:
1. Extract goal events with timestamps from StatsBomb
2. Get odds snapshots before and after each goal (±5 minutes)
3. Measure odds change magnitude and speed
4. Compare early goals vs late goals
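
The before/after lookup in the steps above can be sketched with `pandas.merge_asof`. This is a toy example under stated assumptions: StatsBomb events carry match-clock fields (`minute`, `second`), so a wall-clock `kickoff` timestamp (hypothetical here) is needed to align goals with odds snapshots, and second-half goals would also need the half-time break added. The `timestamp` and `price` columns follow the names used in the Analysis 1 template.

```python
import pandas as pd

kickoff = pd.Timestamp("2024-01-01 15:00:00")  # hypothetical kickoff time
window = pd.Timedelta(minutes=5)

# Toy odds history for one market
odds = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 15:10:00", "2024-01-01 15:18:00", "2024-01-01 15:22:00",
        "2024-01-01 15:24:00", "2024-01-01 15:28:00", "2024-01-01 15:35:00",
    ]),
    "price": [0.44, 0.45, 0.45, 0.60, 0.62, 0.63],
})
odds["timestamp"] = odds["timestamp"].astype("datetime64[ns]")
odds = odds.sort_values("timestamp")

# One first-half goal at 23:00 on the match clock
goals = pd.DataFrame({"minute": [23], "second": [0], "period": [1]})
elapsed = pd.to_timedelta(goals["minute"] * 60 + goals["second"], unit="s")
goals["goal_time"] = (kickoff + elapsed).astype("datetime64[ns]")
goals["before_cutoff"] = goals["goal_time"] - window
goals["after_cutoff"] = goals["goal_time"] + window

# Last snapshot at or before (goal - 5 min)
before = pd.merge_asof(
    goals[["goal_time", "before_cutoff"]].sort_values("before_cutoff"),
    odds, left_on="before_cutoff", right_on="timestamp", direction="backward",
)
# First snapshot at or after (goal + 5 min)
after = pd.merge_asof(
    goals[["goal_time", "after_cutoff"]].sort_values("after_cutoff"),
    odds, left_on="after_cutoff", right_on="timestamp", direction="forward",
)

impact = pd.DataFrame({
    "goal_time": before["goal_time"],
    "odds_before": before["price"],
    "odds_after": after["price"],
})
impact["change_pct"] = (
    (impact["odds_after"] - impact["odds_before"]) / impact["odds_before"] * 100
)
print(impact)
```

With real data the same join would be run per market, for example by passing `by="market_id"` to `merge_asof` once goals carry a market identifier.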

**Expected Insights**:
- Odds are expected to shift by roughly 10-30% (relative change) after a goal
- Late goals (80+ min) are expected to cause larger swings because little time remains
- Score state matters: 1-0 → 1-1 has different impact than 3-0 → 3-1
- Markets are expected to react within seconds to minutes of a goal

**Why This Matters**:
- Understanding market reaction speed
- Identifying arbitrage opportunities (slow market updates)
- Predicting odds movements for live trading strategies

In [14]:
# Analysis 2: Live Event Impact 

print("="*80)
print("ANALYSIS 2: LIVE EVENT IMPACT")
print("="*80)

# STEP 1: Extract goals with timing from StatsBomb
print("\n--- Extracting Goal Events ---")

# Load events
events = pd.read_parquet(events_path)

goals = events[
    (events['type'] == 'Shot') & 
    (events['shot_outcome'] == 'Goal') &
    (events['minute'].notna())
].copy()

print(f"Total goals with timing data: {len(goals):,}")

if len(goals) > 0:
    # Goal timing distribution
    print(f"\nGoal timing:")
    print(f"  Earliest: {goals['minute'].min():.0f} min")
    print(f"  Latest: {goals['minute'].max():.0f} min")
    print(f"  Average: {goals['minute'].mean():.1f} min")
    
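    # Goals by period. In StatsBomb data, periods 3-4 are extra time and
    # period 5 is the penalty shootout; exclude period 5 before matching
    # goals to live odds, since shootout goals are not in-play goals.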
    print(f"\nGoals by period:")
    for period in sorted(goals['period'].unique()):
        count = len(goals[goals['period'] == period])
        pct = count / len(goals) * 100
        print(f"  Period {period}: {count:,} ({pct:.1f}%)")

# STEP 2: Match goals to Polymarket odds snapshots
print("\n--- Matching Goals to Odds Snapshots ---")
print("""
To analyze live impact, we can:

1. Match each goal to its corresponding market
2. Find odds snapshots:
   - 5 minutes before goal
   - Immediately after goal (0-2 min)
   - 5 minutes after goal
3. Calculate odds change:
   odds_change = (odds_after - odds_before) / odds_before * 100

Example:
- Goal at 23:00 (23rd minute)
- Odds before (22:58): 0.45 (45% win probability)
- Odds after (23:02): 0.62 (62% win probability)
- Change: (0.62 - 0.45) / 0.45 = +37.8%
""")

# STEP 3: Analyze impact by timing
print("\n--- Impact by Goal Timing ---")
print("""
Expected patterns:

Early goals (0-30 min):
- Moderate odds shift (10-20%)
- Plenty of time for comeback
- Market has time to adjust

Mid-game goals (31-60 min):
- Larger odds shift (15-25%)
- Less recovery time
- Momentum becomes important

Late goals (61-90+ min):
- Dramatic odds shift (20-40%)
- Very limited recovery time
- Score state becomes dominant
""")

# STEP 4: Visualize impact
print("\n--- Visualization Template ---")
print("""
Create scatter plot:
- X-axis: Goal minute
- Y-axis: Odds change (%)
- Color: Score state (equalizer, go-ahead, insurance)
- Size: Market volume

This reveals:
- Time pressure effect (late goals, bigger swings)
- Score context effect (1-1 vs 3-1)
- Market liquidity impact (volume affects volatility)
""")

print("="*80)

ANALYSIS 2: LIVE EVENT IMPACT

--- Extracting Goal Events ---
Total goals with timing data: 9,790

Goal timing:
  Earliest: 0 min
  Latest: 139 min
  Average: 51.4 min

Goals by period:
  Period 1: 4,233 (43.2%)
  Period 2: 5,285 (54.0%)
  Period 3: 11 (0.1%)
  Period 4: 19 (0.2%)
  Period 5: 242 (2.5%)


### Analysis 3: Expected Goals (xG) vs Market Odds

**Business Question**: Can xG models identify mispriced betting markets?

**Method**:
1. Calculate team xG from StatsBomb shot data
2. Convert xG to win probabilities using statistical models
3. Compare with Polymarket implied probabilities from odds
4. Find discrepancies (our model vs market consensus)
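
Step 2 can be sketched with a simple independent-Poisson model, which is one of several possible choices. The sketch assumes each team's goals follow a Poisson distribution with mean equal to its xG and that the two teams score independently; `match_probabilities` is an illustrative name, not part of the project code.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

def match_probabilities(xg_a, xg_b, max_goals=10):
    """Return (P(A wins), P(draw), P(B wins)) under independent Poisson scoring."""
    p_a = [poisson_pmf(i, xg_a) for i in range(max_goals + 1)]
    p_b = [poisson_pmf(j, xg_b) for j in range(max_goals + 1)]
    win = draw = loss = 0.0
    for i, pa in enumerate(p_a):
        for j, pb in enumerate(p_b):
            joint = pa * pb
            if i > j:
                win += joint
            elif i == j:
                draw += joint
            else:
                loss += joint
    # Renormalize to absorb the small probability mass above max_goals
    total = win + draw + loss
    return win / total, draw / total, loss / total

p_win, p_draw, p_loss = match_probabilities(1.8, 0.9)
print(f"A wins: {p_win:.3f}, draw: {p_draw:.3f}, B wins: {p_loss:.3f}")
```

Because real scorelines are not perfectly independent, the output is best treated as a baseline to compare against the historical-calibration approach rather than as a final estimate.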

**Expected Insights**:
- Markets sometimes underprice high-xG, low-goals teams
- Markets may overreact to recent form vs underlying performance
- Small edges (e.g., 3-5%) can compound over many bets
- xG provides more objective team strength assessment

**Value Bet Detection**:
If our_win_prob - market_implied_prob > threshold (e.g., 10 percentage points), the market is a potential value bet.
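
The rule above can be sketched as follows. The column names `model_prob` and `market_prob` and both helper functions are hypothetical, and the toy numbers are illustrative only. The `normalize_overround` helper shows one simple way to strip the overround when the prices of all outcomes in a market sum to more than 1.

```python
import pandas as pd

def normalize_overround(prices):
    """Rescale the outcome prices of one market so they sum to 1."""
    total = sum(prices)
    return [p / total for p in prices]

def find_value_bets(df, threshold=0.10):
    """Flag rows where the model probability exceeds the market probability by more than threshold."""
    out = df.copy()
    out["edge"] = out["model_prob"] - out["market_prob"]
    out["is_value"] = out["edge"] > threshold
    return out

# Home / draw / away prices with roughly 5% overround (toy numbers)
fair_probs = normalize_overround([0.52, 0.27, 0.26])

toy = pd.DataFrame({
    "market": ["A", "B", "C"],
    "model_prob": [0.65, 0.40, 0.30],
    "market_prob": [0.50, 0.38, 0.45],
})
flagged = find_value_bets(toy, threshold=0.10)
print(flagged)
```

A large edge is as likely to indicate a model error as a mispriced market, so flagged rows are candidates for review rather than automatic bets.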

In [15]:
# Analysis 3: xG vs Market Odds

print("="*80)
print("ANALYSIS 3: xG vs MARKET ODDS")
print("="*80)

# STEP 1: Calculate team xG from shots
print("\n--- Calculating Team xG ---")

# Filter shots with xG data
shots_with_xg = events[
    (events['type'] == 'Shot') & 
    (events['shot_statsbomb_xg'].notna())
].copy()

print(f"Shots with xG data: {len(shots_with_xg):,}")

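# Aggregate xG by team. Totals scale with how many shots (and matches) each team
# has in the dataset, so compare avg_xg_per_shot or per-match xG across teams.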
team_xg = shots_with_xg.groupby('team').agg({
    'shot_statsbomb_xg': ['sum', 'mean', 'count']
}).round(3)

team_xg.columns = ['total_xg', 'avg_xg_per_shot', 'total_shots']
team_xg = team_xg.sort_values('total_xg', ascending=False)

print(f"\nTop 10 teams by total xG:")
print(team_xg.head(10))

# STEP 2: Convert xG to win probability
print("\n--- Converting xG to Win Probability ---")
print("""
Method 1: Poisson Distribution
- Model goals as Poisson process with λ = xG
- P(team_A wins) = Σ P(A scores i) × P(B scores < i)

Method 2: Historical Calibration
- Group matches by xG difference bins
- Calculate actual win rate in each bin
- Use as lookup table for new matches

Method 3: Machine Learning
- Train model on features: xG, xG_against, home/away, form
- Predict win probability
- Compare with market odds

Example calculation:
- Team A xG: 1.8
- Team B xG: 0.9
- xG difference: +0.9
- Historical win rate with +0.9 xG diff: ~65%
- Model win probability: 65%
""")

# STEP 3: Compare with market
print("\n--- Comparing with Market Odds ---")
print("""
Steps:
1. Get our model's win probability (e.g., 65%)
2. Get market implied probability from odds (e.g., 55%)
3. Calculate edge: 65% - 55% = +10%

Value bet criteria:
- Edge > 5%: Small value
- Edge > 10%: Significant value
- Edge > 15%: Strong value (or model error!)

Important: Account for market vig (typical 2-5% overround)
""")

# STEP 4: Backtest performance
print("\n--- Backtesting Strategy ---")
print("""
Test the xG-based predictions:

1. For each historical match:
   - Calculate the xG-based win probability from prior matches only (no look-ahead)
   - Get market odds
   - Identify value bets (our prob > market prob)

2. Simulate betting:
   - Bet on all value opportunities
   - Track ROI over time
   - Calculate Sharpe ratio

3. Expected results:
   - Strong model: ~3-5% ROI (markets are largely efficient)
   - Random model: about -5% ROI (lost to the vig)
   - Breakeven: 0% ROI
""")

print("="*80)

ANALYSIS 3: xG vs MARKET ODDS

--- Calculating Team xG ---
Shots with xG data: 88,023

Top 10 teams by total xG:
                        total_xg  avg_xg_per_shot  total_shots
team                                                          
Barcelona            1069.078979            0.127         8401
Paris Saint-Germain   196.520004            0.137         1432
Chelsea FCW           143.257996            0.124         1153
Manchester City WFC   137.809998            0.127         1088
Arsenal WFC           132.705994            0.137          968
Real Madrid           132.020004            0.111         1188
Arsenal               126.009003            0.112         1127
Bayer Leverkusen      122.494003            0.111         1103
Manchester United     110.219002            0.105         1051
Bayern Munich          88.399002            0.119          742


### Analyses 4 & 5: Additional Integration Analyses to Consider

**Analysis 4: Betting Volume Patterns**

**Question**: Which teams attract disproportionate betting interest?

**Method**:
- Aggregate Polymarket trade volume by team
- Compare with team performance (win rate, goals)
- Identify "public teams" (high volume, average performance)

**Insight**: Popular teams may be overbet, creating value on opponents
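
A minimal sketch of the aggregation described above, on toy data. The `team`, `volume`, and `win_rate` columns are assumptions about what would be available after team mapping, and the median-based definition of a "public team" is only one possible rule.

```python
import pandas as pd

# Toy market-level volume (hypothetical columns and values)
markets = pd.DataFrame({
    "team": ["A", "A", "B", "C", "C", "D"],
    "volume": [500, 700, 150, 900, 300, 100],
})
# Toy team performance, e.g. derived from StatsBomb match results
performance = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "win_rate": [0.45, 0.60, 0.50, 0.55],
})

volume_by_team = markets.groupby("team", as_index=False)["volume"].sum()
summary = volume_by_team.merge(performance, on="team", how="inner")
summary["volume_share"] = summary["volume"] / summary["volume"].sum()

# "Public team": above-median volume with at-or-below-median performance
summary["public_team"] = (
    (summary["volume"] > summary["volume"].median())
    & (summary["win_rate"] <= summary["win_rate"].median())
)
print(summary.sort_values("volume", ascending=False))
```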

---

**Analysis 5: Playing Style vs Market Perception**

**Question**: Do markets correctly price different playing styles?

**Method**:
- Classify teams by style (possession %, defensive actions, pass length)
- Analyze odds for style matchups (possession vs counter-attack)
- Test if certain styles are systematically mispriced

**Insight**: Defensive teams might be undervalued in certain matchups
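
One way to test for style-based mispricing is sketched below on toy data. The 50% possession cutoff, the column names (`possession_pct`, `market_prob`, `won`), and the two style labels are all assumptions for illustration; a real analysis would use richer style features and far more matches.

```python
import pandas as pd

# Toy team-level style features (hypothetical values)
teams = pd.DataFrame({
    "team": ["P1", "P2", "C1", "C2"],
    "possession_pct": [62, 58, 44, 41],
})
# Simple rule-based style label; the 50% cutoff is an assumption
teams["style"] = teams["possession_pct"].ge(50).map(
    {True: "possession", False: "counter-attack"}
)

# Toy match-level rows: market-implied win probability and actual result
matches = pd.DataFrame({
    "team": ["P1", "P1", "P2", "P2", "C1", "C1", "C2", "C2"],
    "market_prob": [0.70, 0.60, 0.65, 0.55, 0.30, 0.40, 0.35, 0.25],
    "won": [1, 0, 1, 0, 1, 0, 1, 0],
})

by_style = (
    matches.merge(teams[["team", "style"]], on="team", how="inner")
    .groupby("style")
    .agg(
        mean_market_prob=("market_prob", "mean"),
        win_rate=("won", "mean"),
        n_matches=("won", "size"),
    )
)
# Positive value: the style wins more often than the market implies
by_style["mispricing"] = by_style["win_rate"] - by_style["mean_market_prob"]
print(by_style)
```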

---

