# Polymarket × StatsBomb Data Matching

## Overview

This notebook experiments with matching soccer prediction-market data (Polymarket) to match-level event data (StatsBomb). The goal is to explore possible ways these datasets can be linked and to surface a few preliminary observations. It's mainly a sandbox for testing ideas and matching logic. Use the sections below to navigate the notebook.

## Dataset Characteristics

### Polymarket 2024
- **1,054 markets** 
- **Market types:** Match outcomes, team futures, tournament winners, etc. 
- **Temporal:** Markets created 0-7 days before events
- **Volume:** Total betting volume across markets (data available but not analyzed yet)

### StatsBomb 2024
- **153 matches** across 4 competitions
- **Competitions:** UEFA Euro (51), AFCON (52), Copa America (32), Bundesliga (18)
- **Coverage:** 88% international tournaments, 12% club matches
- **Players:** 1,927 unique players across lineups

### Why 2024?
- Best temporal overlap: both datasets have substantial 2024 coverage
- 2024 is a tournament-heavy year (Euro, Copa America, AFCON), where Polymarket is most active
- 2025 is incomplete on the StatsBomb side (31 matches so far), and 2023 has far fewer Polymarket markets (69)

## Matching Pipeline

### Team Detection
Rule-based name matching (whole-word regular expressions) with strict guardrails:
- **National teams:** Exact full name required ("United States" only, not "USA" or "US")
- **Club teams:** Full name or a known short name from a nickname map ("Bayern" → "Bayern Munich", "Dortmund" → "Borussia Dortmund")
- **Common words excluded:** FC, United, City, Union (prevents "Union Berlin" matching "Union Saint-Gilloise")
- **Multi-word clubs:** All distinctive parts required (prevents "Borussia Mönchengladbach" matching "Borussia Dortmund")
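
The multi-word rule above can be sketched in isolation. This is a minimal illustration, not the notebook's full function: `COMMON_WORDS` is a shortened, hypothetical stop-word list, and `distinctive_parts` and `mentions_team` are helper names introduced here for the example.

```python
import re

# Hypothetical, shortened stop-word list; the notebook's own list is longer.
COMMON_WORDS = {"fc", "united", "city", "union", "real", "club"}

def distinctive_parts(team_name, min_len=5):
    """Return the lowercase words of a team name that are not generic."""
    return [w for w in team_name.lower().split()
            if w not in COMMON_WORDS and len(w) >= min_len]

def mentions_team(text, team_name):
    """True if every distinctive word of team_name appears as a whole word in text."""
    parts = distinctive_parts(team_name)
    if not parts:
        return False
    text_lower = text.lower()
    return all(re.search(r"\b" + re.escape(p) + r"\b", text_lower) for p in parts)

# "Borussia" alone is not enough: "dortmund" must also appear.
print(mentions_team("Will Borussia Mönchengladbach win?", "Borussia Dortmund"))  # False
# "union" is a common word, so only "berlin" is distinctive for "Union Berlin".
print(mentions_team("Fenerbahce vs. Union Saint-Gilloise", "Union Berlin"))      # False
```

Requiring every distinctive word trades recall for precision: a question that only says "Dortmund" is not caught by this rule and has to be handled by a separate short-name lookup.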

### Player Detection
Strict matching to avoid false positives:
- **2-name players:** Both names required ("Lionel Messi", not "Messi")
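- **3+ name players:** First and last name parts within 50 characters of each other ("Cody Gakpo" matches "Cody Mathès Gakpo"). Only the final name part counts as the last name, so "Cristiano Ronaldo" would not match "Cristiano Ronaldo dos Santos Aveiro"
- **No single-name matches:** A lone surname such as "Haaland" is never matched to "Erling Haaland"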
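
The first-plus-last rule for longer names can be sketched as a small standalone check. The function name `first_last_close` and the `max_gap` parameter are introduced here for illustration; the 50-character window follows the rule stated above.

```python
import re

def first_last_close(text, full_name, max_gap=50):
    """True if the first and last words of full_name both occur in text
    as whole words, with start positions at most max_gap characters apart."""
    parts = full_name.lower().split()
    if len(parts) < 3:
        return False  # this shortcut only applies to names with 3+ parts
    text_lower = text.lower()
    first_pat = r"\b" + re.escape(parts[0]) + r"\b"
    last_pat = r"\b" + re.escape(parts[-1]) + r"\b"
    firsts = [m.start() for m in re.finditer(first_pat, text_lower)]
    lasts = [m.start() for m in re.finditer(last_pat, text_lower)]
    return any(abs(f - l) <= max_gap for f in firsts for l in lasts)

question = "Will Cody Gakpo be Player of the Tournament for Euro 2024?"
print(first_last_close(question, "Cody Mathès Gakpo"))            # True
print(first_last_close("Will Gakpo score?", "Cody Mathès Gakpo"))  # False
```

Because only the final name part is treated as the last name, a player listed under a long legal name is found only when the question uses that final part.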

## Observations

### Match Statistics
- 9 head-to-head markets (both teams detected)
- 129 single-team markets (one team detected)
- 15 player prop markets (one player detected)

### Tournament Bias 
88% of StatsBomb matches are international tournaments vs. 12% club. Limits generalizability to regular season club soccer. 
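
The 88% / 12% split follows directly from the per-competition match counts listed under Dataset Characteristics. A short check, with the counts copied from that list and the club/international labeling written out by hand:

```python
# Match counts per competition, copied from the StatsBomb 2024 summary
matches_2024 = {
    "UEFA Euro": 51,
    "AFCON": 52,
    "Copa America": 32,
    "Bundesliga": 18,
}
club_competitions = {"Bundesliga"}  # the only club competition in the list

total = sum(matches_2024.values())
club = sum(n for name, n in matches_2024.items() if name in club_competitions)
international = total - club

print(f"International: {international}/{total} = {international / total:.0%}")
print(f"Club: {club}/{total} = {club / total:.0%}")
```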

## Data Quality

**Strengths:** 
- Clean match-to-market mappings for 130+ markets 

**Limitations:**
- Most major club leagues absent: no Premier League, La Liga, or Serie A matches in 2024
- Small sample (130+) limits statistical power
- Tournament-heavy bias (88% of matches are international)

## Analysis Considerations

1. **Match-level correlations:** xG vs. betting volume, possession vs. market odds
2. **Team aggregations:** Total xG across all Argentina matches vs. "Will Argentina win Copa America?" market
3. **Out of scope:** Predictive modeling (sample too small) and cross-league comparisons (only 4 competitions)
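
Match-level correlations first require tying each head-to-head market to one specific match. One possible approach, sketched here under assumptions, is to compare the detected team pair and the market's end date against the match list. The records, the `candidate_matches` helper, and the one-day window are all hypothetical and are not part of the notebook's pipeline.

```python
from datetime import date, timedelta

# Toy records with hypothetical IDs; real rows would come from the two DataFrames.
matches = [
    {"match_id": 1, "match_date": date(2024, 6, 14), "home_team": "Germany", "away_team": "Scotland"},
    {"match_id": 2, "match_date": date(2024, 6, 19), "home_team": "Germany", "away_team": "Hungary"},
]

def candidate_matches(found_teams, end_date, matches, window_days=1):
    """Return matches whose two teams equal found_teams and whose date is
    within window_days of the market's end_date."""
    wanted = set(found_teams)
    window = timedelta(days=window_days)
    return [
        m for m in matches
        if {m["home_team"], m["away_team"]} == wanted
        and abs(m["match_date"] - end_date) <= window
    ]

hits = candidate_matches(["Scotland", "Germany"], date(2024, 6, 14), matches)
print([m["match_id"] for m in hits])
```

Comparing the teams as a set makes the check independent of home/away order, which a market question does not reliably encode.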

In [2]:
!pip install fuzzywuzzy python-Levenshtein
import pandas as pd
import numpy as np
import re
from datetime import datetime, timedelta
from collections import Counter
from fuzzywuzzy import fuzz, process
from pathlib import Path
import os
import warnings

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Suppress common warnings for cleaner notebook output
warnings.filterwarnings("ignore")


In [3]:
# Load Polymarket data
DATA_DIR = Path("..") / "data" 
polymarket_df = pd.read_parquet(DATA_DIR / "Polymarket/soccer_markets.parquet")

# Convert date columns
polymarket_df['end_date'] = pd.to_datetime(polymarket_df['end_date'])
polymarket_df['created_at'] = pd.to_datetime(polymarket_df['created_at'])

polymarket_df.head()

Unnamed: 0,market_id,question,slug,event_slug,category,volume,active,closed,created_at,end_date
0,242920,Will Ukraine qualify for the 2022 FIFA World Cup?,will-ukraine-qualify-to-the-2022-fifa-world-cup,will-ukraine-qualify-to-the-2022-fifa-world-cup,Sports,4766.88,True,True,2022-04-06 07:51:48,2022-06-30
1,244963,UEFA Europa League final: Who will win Eintracht vs. Rangers?,uefa-europa-league-final-who-will-win-eintracht-vs-rangers,uefa-europa-league-final-who-will-win-eintracht-vs-rangers,Sports,1543.29,True,True,2022-05-18 14:16:53,2022-05-18
2,246443,Soccer: Who will win the United States vs. Uruguay international friendly game on June 5?,soccer-who-will-win-the-united-states-vs-uruguay-international-friendly-game-on-june-5,soccer-who-will-win-the-united-states-vs-uruguay-international-friendly-game-on-june-5,Sports,1363.07,True,True,2022-06-05 12:45:16,2022-06-05
3,246490,UEFA Nations League: Who will win the Germany vs. England game on June 7?,uefa-nations-league-who-will-win-the-germany-vs-england-game-on-june-7,uefa-nations-league-who-will-win-the-germany-vs-england-game-on-june-7,Sports,1031.58,True,True,2022-06-06 17:09:19,2022-06-07
4,246661,2022 Wimbledon Championships: Who will win Kyrgios vs. Nadal?,2022-wimbledon-championships-who-will-win-kyrgios-vs-nadal,2022-wimbledon,Sports,3098.29,True,True,2022-07-06 19:33:08,2022-07-08


In [4]:
statsbomb_df = pd.read_parquet(DATA_DIR / "StatsBomb/matches.parquet")
statsbomb_df['match_date'] = pd.to_datetime(statsbomb_df['match_date'])
statsbomb_df.head()

Unnamed: 0,match_id,match_date,match_week,match_status,match_status_360,kickoff,home_score,away_score,competition_id,competition,competition_stage,season_id,season,home_team_id,home_team,home_managers,away_team_id,away_team,away_managers,stadium_id,stadium,referee_id,referee,last_updated,last_updated_360,data_version,shot_fidelity_version,xy_fidelity_version,competition_name,gender,is_youth,is_international,country_name,season_name,match_updated,match_available_360
0,9880,2018-04-14,32,available,scheduled,16:15:00,2,1,11,La Liga,Regular Season,1,2017/2018,217,Barcelona,"[{""id"":227,""name"":""Ernesto Valverde Tejedor"",""nickname"":""Ernesto Valverde"",""dob"":""1964-02-09"",""c...",207,Valencia,"[{""id"":211,""name"":""Marcelino García Toral"",""nickname"":""Marcelino"",""dob"":""1965-08-14"",""country"":{...",342.0,Spotify Camp Nou,2728.0,Carlos del Cerro Grande,2023-02-08T17:23:53.901920,2021-06-13T16:17:31.694,1.1.0,2,2,La Liga,male,False,False,Spain,2017/2018,2025-07-14T10:01:16.674864,
1,9912,2018-04-29,35,available,scheduled,20:45:00,2,4,11,La Liga,Regular Season,1,2017/2018,219,RC Deportivo La Coruña,"[{""id"":371,""name"":""Clarence Seedorf"",""nickname"":null,""dob"":""1976-04-01"",""country"":{""id"":160,""nam...",217,Barcelona,"[{""id"":227,""name"":""Ernesto Valverde Tejedor"",""nickname"":""Ernesto Valverde"",""dob"":""1964-02-09"",""c...",4658.0,Estadio Abanca-Riazor,2602.0,Ricardo De Burgos Bengoetxea,2022-12-05T14:42:44.641092,2021-06-13T16:17:31.694,1.1.0,2,2,La Liga,male,False,False,Spain,2017/2018,2025-07-14T10:01:16.674864,
2,9924,2018-05-06,36,available,scheduled,20:45:00,2,2,11,La Liga,Regular Season,1,2017/2018,217,Barcelona,"[{""id"":227,""name"":""Ernesto Valverde Tejedor"",""nickname"":""Ernesto Valverde"",""dob"":""1964-02-09"",""c...",220,Real Madrid,"[{""id"":56,""name"":""Zinédine Zidane"",""nickname"":null,""dob"":""1972-06-23"",""country"":{""id"":78,""name"":...",342.0,Spotify Camp Nou,2608.0,Alejandro José Hernández Hernández,2022-12-01T03:25:12.063586,2021-06-13T16:17:31.694,1.1.0,2,2,La Liga,male,False,False,Spain,2017/2018,2025-07-14T10:01:16.674864,
3,9855,2018-03-18,29,available,scheduled,16:15:00,2,0,11,La Liga,Regular Season,1,2017/2018,217,Barcelona,"[{""id"":227,""name"":""Ernesto Valverde Tejedor"",""nickname"":""Ernesto Valverde"",""dob"":""1964-02-09"",""c...",215,Athletic Club,"[{""id"":210,""name"":""José Ángel Ziganda Lacunza"",""nickname"":""Cuco Ziganda"",""dob"":""1966-10-01"",""cou...",342.0,Spotify Camp Nou,2575.0,Santiago Jaime Latre,2022-12-01T02:33:31.178193,2021-06-13T16:17:31.694,1.1.0,2,2,La Liga,male,False,False,Spain,2017/2018,2025-07-14T10:01:16.674864,
4,9827,2018-03-01,26,available,scheduled,21:00:00,1,1,11,La Liga,Regular Season,1,2017/2018,208,Las Palmas,"[{""id"":220,""name"":""Francisco Jémez Martín"",""nickname"":""Paco Jémez"",""dob"":""1970-04-18"",""country"":...",217,Barcelona,"[{""id"":227,""name"":""Ernesto Valverde Tejedor"",""nickname"":""Ernesto Valverde"",""dob"":""1964-02-09"",""c...",357.0,Estadio de Gran Canaria,180.0,Antonio Miguel Mateu Lahoz,2022-08-04T17:18:06.540844,2021-06-13T16:17:31.694,1.1.0,2,2,La Liga,male,False,False,Spain,2017/2018,2025-07-14T10:01:16.674864,


In [5]:
# Check temporal coverage of both datasets to find overlap periods
print("POLYMARKET DATA:")
print("=" * 80)
print(f"Earliest market created: {polymarket_df['created_at'].min()}")
print(f"Latest market created: {polymarket_df['created_at'].max()}")
print(f"Earliest end_date: {polymarket_df['end_date'].min()}")
print(f"Latest end_date: {polymarket_df['end_date'].max()}")

print("\n\nSTATSBOMB DATA:")
print("=" * 80)
print(f"Earliest match: {statsbomb_df['match_date'].min()}")
print(f"Latest match: {statsbomb_df['match_date'].max()}")

# Identify overlapping time period where both datasets have data
print("\n\nOVERLAP ANALYSIS:")
print("=" * 80)
pm_start = polymarket_df['end_date'].min()
pm_end = polymarket_df['end_date'].max()
sb_start = statsbomb_df['match_date'].min()
sb_end = statsbomb_df['match_date'].max()

overlap_start = max(pm_start, sb_start)
overlap_end = min(pm_end, sb_end)

if overlap_start < overlap_end:
    print(f"Overlap period: {overlap_start.date()} to {overlap_end.date()}")
    print(f"Duration: {(overlap_end - overlap_start).days} days")
else:
    print("No overlap between datasets")

POLYMARKET DATA:
Earliest market created: 2021-04-12 19:50:01
Latest market created: 2025-12-09 15:31:51
Earliest end_date: 2021-04-13 00:00:00
Latest end_date: 2026-07-20 00:00:00


STATSBOMB DATA:
Earliest match: 1958-06-24 00:00:00
Latest match: 2025-07-27 00:00:00


OVERLAP ANALYSIS:
Overlap period: 2021-04-13 to 2025-07-27
Duration: 1566 days


In [6]:
# Extract year from dates for temporal analysis
polymarket_df['year'] = polymarket_df['end_date'].dt.year
statsbomb_df['year'] = statsbomb_df['match_date'].dt.year

# Find years present in both datasets
polymarket_years = set(polymarket_df['year'].unique())
statsbomb_years = set(statsbomb_df['year'].unique())
overlap_years = sorted(polymarket_years & statsbomb_years)

print(f"Years in both datasets: {overlap_years}")

if overlap_years:
    # Filter both datasets to years where we have data from both sources
    polymarket_overlap = polymarket_df[polymarket_df['year'].isin(overlap_years)].copy()
    statsbomb_overlap = statsbomb_df[statsbomb_df['year'].isin(overlap_years)].copy()
    
    print(f"\nPolymarket markets in overlap years: {len(polymarket_overlap)}")
    print(f"StatsBomb matches in overlap years: {len(statsbomb_overlap)}")
    
    # Show year-by-year breakdown to identify best year for analysis
    print("\nBreakdown by year:")
    for year in overlap_years:
        pm_count = len(polymarket_overlap[polymarket_overlap['year'] == year])
        sb_count = len(statsbomb_overlap[statsbomb_overlap['year'] == year])
        print(f"  {year}: {pm_count} markets, {sb_count} matches")
else:
    print("No overlapping years")

Years in both datasets: [np.float64(2021.0), np.float64(2022.0), np.float64(2023.0), np.float64(2024.0), np.float64(2025.0)]

Polymarket markets in overlap years: 7799
StatsBomb matches in overlap years: 686

Breakdown by year:
  2021.0: 98 markets, 204 matches
  2022.0: 54 markets, 193 matches
  2023.0: 69 markets, 105 matches
  2024.0: 1054 markets, 153 matches
  2025.0: 6524 markets, 31 matches
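
The `2021.0`-style labels in the breakdown suggest that `year` is stored as a float, which usually happens in pandas when some dates are missing. A minimal sketch of one way to keep integer years, using toy dates (hypothetical) and the nullable `Int64` dtype:

```python
import pandas as pd

# Toy dates (hypothetical); the None stands in for a market without an end_date
end_dates = pd.to_datetime(pd.Series(["2024-06-14", None, "2025-01-05"]))

year_float = end_dates.dt.year                 # missing dates force a float result
year_int = end_dates.dt.year.astype("Int64")   # nullable integer, missing stays <NA>

print(year_float.dtype, year_int.dtype)
print(year_int.dropna().tolist())
```

Whether this applies to the real data depends on `end_date` actually containing missing values, which the output above hints at but does not confirm.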


In [7]:
# Focus analysis on 2024 - the year with most data overlap
pm_2024 = polymarket_df[polymarket_df['year'] == 2024].copy()
sb_2024 = statsbomb_df[statsbomb_df['year'] == 2024].copy()

# Check which competitions StatsBomb covers in 2024
print("2024 COMPETITIONS:")
print(sb_2024['competition'].value_counts())

# Sample Polymarket questions to understand market types
print("\n2024 SAMPLE MARKETS:")
print(pm_2024['question'].sample(10).tolist())

2024 COMPETITIONS:
competition
African Cup of Nations    52
UEFA Euro                 51
Copa America              32
1. Bundesliga             18
Name: count, dtype: int64

2024 SAMPLE MARKETS:
['Will Everton vs. Nottingham Forest end in a draw?', 'Will Feyenoord beat Benfica?', 'Will Wolves win on 2024-10-26?', 'Will Chelsea win on 2024-12-22?', 'Will the match between Brighton and Tottenham end in a draw?', 'Will the match between Stade Brestois and PSV end in a draw?', 'Will Tottenham beat Brentford?', 'Will Inter Milan beat Red Star Belgrade?', 'Will the match between Braga vs. Maccabi TLV end in a draw?', 'Will Manchester City vs. Nottingham Forest end in a draw?']


In [8]:
# Get all unique teams from StatsBomb 2024
sb_teams_2024 = pd.concat([
    sb_2024[['home_team_id', 'home_team']].rename(columns={'home_team_id': 'team_id', 'home_team': 'team_name'}),
    sb_2024[['away_team_id', 'away_team']].rename(columns={'away_team_id': 'team_id', 'away_team': 'team_name'})
]).drop_duplicates().sort_values('team_name').reset_index(drop=True)

print(f"Total unique teams in StatsBomb 2024: {len(sb_teams_2024)}")
print("\nAll StatsBomb teams:")
print(sb_teams_2024)

# Create a list for easy searching
sb_team_names = sb_teams_2024['team_name'].tolist()
sb_team_names_lower = [name.lower() for name in sb_team_names]

Total unique teams in StatsBomb 2024: 82

All StatsBomb teams:
    team_id      team_name
0       906        Albania
1      4898        Algeria
2      4901         Angola
3       779      Argentina
4       172       Augsburg
..      ...            ...
77     3563      Venezuela
78      174  VfB Stuttgart
79      176  Werder Bremen
80      179      Wolfsburg
81     4963         Zambia

[82 rows x 2 columns]


In [9]:
def find_statsbomb_teams_in_text_fuzzy(text, sb_team_names):
    """
    Match team names using whole-word regex rules, being strict about country names.
    - Club teams: allow matches on distinctive name parts and known nicknames
    - National teams: require full name only
    """
    if pd.isna(text):
        return []
    
    text_lower = text.lower()
    found_teams = []
    
    # List of national teams that should ONLY match on full name
    national_teams = {
        'Albania', 'Algeria', 'Angola', 'Argentina', 'Austria', 'Belgium', 
        'Bolivia', 'Brazil', 'Burkina Faso', 'Cameroon', 'Canada', 
        'Cape Verde Islands', 'Chile', 'Colombia', 'Congo DR', 'Costa Rica', 
        'Croatia', 'Czech Republic', 'Côte d\'Ivoire', 'Denmark', 'Ecuador', 
        'Egypt', 'England', 'Equatorial Guinea', 'France', 'Gambia', 'Georgia', 
        'Germany', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Hungary', 'Italy', 
        'Jamaica', 'Mali', 'Mauritania', 'Mexico', 'Morocco', 'Mozambique', 
        'Namibia', 'Netherlands', 'Nigeria', 'Panama', 'Paraguay', 'Peru', 
        'Poland', 'Portugal', 'Romania', 'Scotland', 'Senegal', 'Serbia', 
        'Slovakia', 'Slovenia', 'South Africa', 'Spain', 'Switzerland', 
        'Tanzania', 'Tunisia', 'Turkey', 'Ukraine', 'United States', 'Uruguay', 
        'Venezuela', 'Wales', 'Zambia'
    }
    
    for team_name in sb_team_names:
        team_lower = team_name.lower()
        
        # For national teams, ONLY allow exact full name match
        if team_name in national_teams:
            pattern = r'\b' + re.escape(team_lower) + r'\b'
            if re.search(pattern, text_lower):
                found_teams.append(team_name)
            continue
        
        # For club teams, try exact match first
        pattern = r'\b' + re.escape(team_lower) + r'\b'
        if re.search(pattern, text_lower):
            found_teams.append(team_name)
            continue
        
        # For multi-word club names, only match if ALL distinctive parts appear
        team_parts = team_lower.split()
        
        if len(team_parts) >= 2:
            common_words = {'fc', 'cf', 'ac', 'sc', 'united', 'city', 'athletic', 
                          'sporting', 'real', 'club', 'de', 'del', 'la', 'el',
                          'vfb', 'fsv', 'rb', '98', '05', 'islands', 'dr', 'union'}
            
            distinctive_parts = [p for p in team_parts if p not in common_words and len(p) >= 5]
            
            # ALL distinctive parts must appear in text
            if distinctive_parts:
                all_parts_found = True
                for part in distinctive_parts:
                    part_pattern = r'\b' + re.escape(part) + r'\b'
                    if not re.search(part_pattern, text_lower):
                        all_parts_found = False
                        break
                
                if all_parts_found:
                    found_teams.append(team_name)
        
        elif len(team_parts) == 1 and len(team_lower) >= 5:
            part_pattern = r'\b' + re.escape(team_lower) + r'\b'
            if re.search(part_pattern, text_lower):
                found_teams.append(team_name)
    
    # Handle club nicknames only (full match required, not partial)
    club_nickname_map = {
        'leverkusen': 'Bayer Leverkusen',
        'dortmund': 'Borussia Dortmund',
        'gladbach': 'Borussia Mönchengladbach',
        'monchengladbach': 'Borussia Mönchengladbach',
        'mönchengladbach': 'Borussia Mönchengladbach',
        'frankfurt': 'Eintracht Frankfurt',
        'koln': 'FC Köln',
        'köln': 'FC Köln',
        'cologne': 'FC Köln',
        'mainz': 'FSV Mainz 05',
        'heidenheim': 'FC Heidenheim',
        'leipzig': 'RB Leipzig',
        'stuttgart': 'VfB Stuttgart',
        'bremen': 'Werder Bremen',
        'bayern': 'Bayern Munich',
    }
    
    for nickname, full_name in club_nickname_map.items():
        if full_name in sb_team_names and full_name not in found_teams:
            pattern = r'\b' + re.escape(nickname) + r'\b'
            if re.search(pattern, text_lower):
                found_teams.append(full_name)
    
    return list(dict.fromkeys(found_teams))

# Re-run team detection with stricter rules
pm_2024['found_teams'] = pm_2024['question'].apply(
    lambda x: find_statsbomb_teams_in_text_fuzzy(x, sb_team_names)
)

pm_2024['found_teams_slug'] = pm_2024['slug'].apply(
    lambda x: find_statsbomb_teams_in_text_fuzzy(x, sb_team_names)
)

pm_2024['all_found_teams'] = pm_2024.apply(
    lambda row: list(set(row['found_teams'] + row['found_teams_slug'])), 
    axis=1
)

pm_2024['num_teams_found'] = pm_2024['all_found_teams'].apply(len)

# Test the problem cases
test_cases = [
    "Will Juventus beat Stuttgart?",
    "Will the match between Fenerbahce vs. Union Saint-Gilloise end in a draw?",
    "Will Atletico Madrid vs Borussia Dortmund be a draw?"
]

print("Testing problem cases:")
for test in test_cases:
    found = find_statsbomb_teams_in_text_fuzzy(test, sb_team_names)
    print(f"\n'{test}'")
    print(f"  → {found}")

print("\n\nOverall detection results:")
print(pm_2024['num_teams_found'].value_counts().sort_index())

Testing problem cases:

'Will Juventus beat Stuttgart?'
  → ['VfB Stuttgart']

'Will the match between Fenerbahce vs. Union Saint-Gilloise end in a draw?'
  → []

'Will Atletico Madrid vs Borussia Dortmund be a draw?'
  → ['Borussia Dortmund']


Overall detection results:
num_teams_found
0    916
1    129
2      9
Name: count, dtype: int64
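
The nickname map lists accented and unaccented spellings ("köln", "koln") as separate keys. An alternative, shown here only as a sketch, is to strip accents from both sides before comparing. `strip_accents` and `contains_word` are names introduced for this example and are not part of the notebook.

```python
import re
import unicodedata

def strip_accents(text):
    """Lowercase text and drop combining marks (e.g. 'ö' -> 'o')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_word(text, word):
    """Whole-word, accent-insensitive containment check."""
    pattern = r"\b" + re.escape(strip_accents(word)) + r"\b"
    return re.search(pattern, strip_accents(text)) is not None

print(strip_accents("Mönchengladbach"))                     # monchengladbach
print(contains_word("Will Monchengladbach win?", "Mönchengladbach"))  # True
print(contains_word("will-fc-koln-win", "Köln"))            # True
```

Translated names such as "Cologne" are not covered by accent stripping and would still need an explicit alias.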


In [10]:
# Load player lineup data
lineups_path = DATA_DIR / "StatsBomb/lineups.parquet"  # same directory casing as the matches file
lineups = pd.read_parquet(lineups_path)

# Filter to 2024 matches only
match_ids_2024 = sb_2024['match_id'].unique()
lineups_2024 = lineups[lineups['match_id'].isin(match_ids_2024)].copy()

print(f"Loaded {len(lineups_2024)} player appearances across {len(match_ids_2024)} matches")
print(f"Unique players: {lineups_2024['player_name'].nunique()}")

lineups_2024.head(10)

Loaded 9651 player appearances across 153 matches
Unique players: 1927


Unnamed: 0,match_id,team_id,player_id,player_name,player_nickname,jersey_number,country_id,country_name,team_name,position_name,from_time,to_time,from_period,to_period,card_time,card_type,card_reason
96851,3895194,172,9198,Niklas Dorsch,,30,85.0,Germany,Augsburg,Center Defensive Midfield,00:00,,1.0,,43:13,Yellow Card,Foul Committed
96852,3895194,172,25118,Iago Amaral Borduchi,Iago,22,31.0,Brazil,Augsburg,Left Back,00:00,87:09,1.0,2.0,18:43,Yellow Card,Bad Behaviour
96853,3895194,172,25118,Iago Amaral Borduchi,Iago,22,31.0,Brazil,Augsburg,Left Center Back,87:09,,2.0,,18:43,Yellow Card,Bad Behaviour
96854,3895194,172,30401,Ruben Vargas,,16,221.0,Switzerland,Augsburg,Center Attacking Midfield,00:00,61:26,1.0,2.0,17:56,Yellow Card,Bad Behaviour
96855,3895194,904,8804,Jonas Hofmann,,7,85.0,Germany,Bayer Leverkusen,Right Attacking Midfield,00:00,,1.0,,45:24,Yellow Card,Foul Committed
96856,3895202,904,8221,Jonathan Tah,,4,85.0,Germany,Bayer Leverkusen,Center Back,00:00,,1.0,,10:55,Yellow Card,Foul Committed
96857,3895202,904,40724,Florian Wirtz,,10,85.0,Germany,Bayer Leverkusen,Left Attacking Midfield,00:00,,1.0,,65:47,Yellow Card,Foul Committed
96858,3895202,904,49337,Josip Stanišić,,2,56.0,Croatia,Bayer Leverkusen,Right Center Back,00:00,,1.0,,37:02,Yellow Card,Foul Committed
96859,3895202,182,8769,Xaver Schlager,,24,15.0,Austria,RB Leipzig,Left Defensive Midfield,00:00,,1.0,,86:13,Yellow Card,Foul Committed
96860,3895202,182,39167,Xavi Simons,,20,160.0,Netherlands,RB Leipzig,Right Attacking Midfield,00:00,62:40,1.0,2.0,70:47,Yellow Card,Foul Committed


In [11]:
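# Get all unique (player, team) pairs from 2024 matches.
# A player who appears for more than one team is counted once per team,
# so this total can exceed the unique-name count printed earlier.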
if 'player_name' in lineups_2024.columns:
    players_2024 = lineups_2024[['player_name', 'team_name']].drop_duplicates().sort_values('player_name')
    
    print(f"UNIQUE PLAYERS IN STATSBOMB 2024:")
    print("=" * 80)
    print(f"Total: {len(players_2024)}")
    
    # Show sample
    print("\nSample players:")
    print(players_2024.head(10))
    
    # Create list for searching
    player_names = players_2024['player_name'].tolist()
    player_names_lower = [name.lower() for name in player_names]
    
else:
    print("'player_name' column not found. Available columns:")
    print(lineups_2024.columns.tolist())
    print("\nPlease check the column names and adjust accordingly")

UNIQUE PLAYERS IN STATSBOMB 2024:
Total: 2011

Sample players:
                            player_name     team_name
144568                   Aaron Ramsdale       England
110480                   Aaron Tshibola      Congo DR
144823       Aarón Moisés Cruz Esquivel    Costa Rica
151784               Abdallah Dipo Sima       Senegal
146696                 Abdelkabir Abqar       Morocco
110049            Abdessamad Ezzalzouli       Morocco
137962                       Abdi Banda      Tanzania
113887  Abdiel Armando Ayarza Cocanegra        Panama
109223                     Abdou Diallo       Senegal
109246            Abdoul Fessal Tapsoba  Burkina Faso
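
The total here (2011) is higher than the 1927 unique players reported when the lineups were loaded. A likely explanation is that this cell deduplicates on (player, team) pairs rather than on player name alone. A toy illustration with hypothetical rows:

```python
# Toy lineup rows (hypothetical): one player appears for both a club and a national team
appearances = [
    ("Florian Wirtz", "Bayer Leverkusen"),
    ("Florian Wirtz", "Germany"),
    ("Harry Kane", "England"),
    ("Harry Kane", "England"),  # repeated appearance for the same team
]

# Comparable to drop_duplicates() on the two columns
unique_pairs = set(appearances)
# Comparable to nunique() on the player_name column
unique_names = {name for name, _team in appearances}

print(len(unique_pairs), len(unique_names))
```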


In [12]:
def find_players_in_text_strict(text, player_names):
    """
    Match player names with strict requirements to avoid false positives.
    - 2-name players: requires full name (e.g., "Lionel Messi")
    - 3+ name players: requires first + last name parts (e.g., "Cody Gakpo" for "Cody Mathès Gakpo")
    - No single-name matches allowed
    """
    if pd.isna(text):
        return []
    
    text_lower = text.lower()
    found_players = []
    
    for player_name in player_names:
        player_lower = player_name.lower()
        name_parts = player_name.split()
        
        # Try exact full name match
        full_name_pattern = r'\b' + re.escape(player_lower) + r'\b'
        if re.search(full_name_pattern, text_lower):
            found_players.append(player_name)
            continue
        
        # For players with 3+ names, allow first + last name match
        # (handles cases like "Cody Gakpo" matching "Cody Mathès Gakpo")
        if len(name_parts) >= 3:
            first_name = name_parts[0]
            last_name = name_parts[-1]
            
            first_pattern = r'\b' + re.escape(first_name.lower()) + r'\b'
            last_pattern = r'\b' + re.escape(last_name.lower()) + r'\b'
            
            if re.search(first_pattern, text_lower) and re.search(last_pattern, text_lower):
                # Require the first and last names to start within 50 characters
                # of each other, to avoid matching unrelated occurrences
                first_matches = [m.start() for m in re.finditer(first_pattern, text_lower)]
                last_matches = [m.start() for m in re.finditer(last_pattern, text_lower)]

                if any(abs(f - l) <= 50 for f in first_matches for l in last_matches):
                    found_players.append(player_name)
    
    return found_players

# Detect players in market text
pm_2024['found_players_question'] = pm_2024['question'].apply(
    lambda x: find_players_in_text_strict(x, player_names)
)

pm_2024['found_players_slug'] = pm_2024['slug'].apply(
    lambda x: find_players_in_text_strict(x, player_names)
)

pm_2024['all_found_players'] = pm_2024.apply(
    lambda row: list(set(row['found_players_question'] + row['found_players_slug'])),
    axis=1
)

pm_2024['num_players_found'] = pm_2024['all_found_players'].apply(len)

# Check results
print("Player detection results:")
print(pm_2024['num_players_found'].value_counts().sort_index())

markets_with_players = pm_2024[pm_2024['num_players_found'] > 0]
if len(markets_with_players) > 0:
    print(f"\nFound {len(markets_with_players)} markets mentioning players\n")
    sample = markets_with_players.sample(min(10, len(markets_with_players)))
    
    for idx, row in sample.iterrows():
        print(f"{row['question'][:65]}")
        print(f"  → {row['all_found_players']}\n")

Player detection results:
num_players_found
0    1039
1      15
Name: count, dtype: int64

Found 15 markets mentioning players

Will Xavi Simons be Player of the Tournament for Euro 2024?
  → ['Xavi Simons']

Will Florian Wirtz be Euro 2024 top scorer?
  → ['Florian Wirtz']

Will Harry Kane be Euro 2024 top scorer?
  → ['Harry Kane']

Will Cody Gakpo be Player of the Tournament for Euro 2024?
  → ['Cody Mathès Gakpo']

Will Jude Bellingham be Euro 2024 top scorer?
  → ['Jude Bellingham']

Will Toni Kroos be Player of the Tournament for Euro 2024?
  → ['Toni Kroos']

Will Kai Havertz be Euro 2024 top scorer?
  → ['Kai Havertz']

Will Jamal Musiala be Euro 2024 top scorer?
  → ['Jamal Musiala']

Will Antoine Griezmann be Euro 2024 top scorer?
  → ['Antoine Griezmann']

Will Phil Foden be Euro 2024 top scorer?
  → ['Phil Foden']



In [13]:
# Prepare export with all relevant columns
export_cols = ['market_id', 'question', 'slug', 'category', 'volume', 
               'active', 'closed', 'created_at', 'end_date', 'year']

# Add detection columns
export_cols.extend(['num_teams_found', 'all_found_teams', 
                    'num_players_found', 'all_found_players'])

# Only keep columns that actually exist
export_cols = [c for c in export_cols if c in pm_2024.columns]

results_df = pm_2024[export_cols].copy()

# Convert lists to semicolon-separated strings for CSV compatibility
for col in ['all_found_players', 'all_found_teams']:
    if col in results_df.columns:
        results_df[col] = results_df[col].apply(
            lambda x: '; '.join(x) if isinstance(x, list) and len(x) > 0 else ''
        )

# Save to CSV
out_dir = Path('..') / 'data' / 'Polymarket'
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / 'pm_2024_detection_results.csv'
results_df.to_csv(out_path, index=False)

print(f"Exported {len(results_df)} markets to: {out_path}")
print(f"Columns: {', '.join(results_df.columns)}")

Exported 1054 markets to: ../data/Polymarket/pm_2024_detection_results.csv
Columns: market_id, question, slug, category, volume, active, closed, created_at, end_date, year, num_teams_found, all_found_teams, num_players_found, all_found_players
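
Because the list columns are flattened to `'; '`-joined strings for the CSV, anything that reads the file back has to split them again. A minimal sketch of that step; `split_names` is a helper name introduced here, and it treats empty or missing cells as "no detections".

```python
def split_names(cell):
    """Turn a '; '-joined string back into a list; empty or missing cells give []."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    return [name.strip() for name in cell.split(";") if name.strip()]

print(split_names("Germany; Scotland"))  # ['Germany', 'Scotland']
print(split_names(""))                   # []
print(split_names(float("nan")))         # []
```

After loading the file with `pd.read_csv`, the helper could be applied to the `all_found_teams` and `all_found_players` columns with `.apply(split_names)`. The missing-value branch matters there because pandas reads empty CSV cells as NaN by default.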
