## 📊 Analysing Expected Goals (xG) – Women’s Euros 2025 & World Cup 2023

**Competitions:** UEFA Women’s Euros 2025 & FIFA Women’s World Cup 2023  
**Purpose:** Analyse and compare team and player performance using Expected Goals (xG) metrics  
**Methods:** Event data analysis, xG modelling, exploratory visualisation  
**Author:** [Victoria Friss de Kereki](https://www.linkedin.com/in/victoria-friss-de-kereki/)  

---

**Notebook first written:** `01/01/2026`  
**Last updated:** `04/01/2026`  

> This notebook explores Expected Goals (xG) as a performance metric in elite women’s international football. Using match event data from major tournaments, it analyses shot quality, chance creation, and finishing efficiency at both team and player level. The framework can be extended to other competitions or adapted for club football with minimal changes.


## Packages and configuration

In [46]:
from statsbombpy import sb
import pandas as pd
from mplsoccer import VerticalPitch, Pitch
from highlight_text import ax_text, fig_text
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
import seaborn as sns
import pprint
import numpy as np

## Load Competition, Match, and Event Data from statsbombpy

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Call the statsbombpy API to get all free competitions, then check the women's competitions
free_comps = sb.competitions()
women_comps = free_comps[free_comps['competition_gender'] == 'female']
women_comps

Unnamed: 0,competition_id,season_id,country_name,competition_name,competition_gender,competition_youth,competition_international,season_name,match_updated,match_updated_360,match_available_360,match_available
25,37,90,England,FA Women's Super League,female,False,False,2020/2021,2025-04-23T14:16:46.924831,2021-06-13T16:17:31.694,,2025-04-23T14:16:46.924831
26,37,42,England,FA Women's Super League,female,False,False,2019/2020,2024-02-12T15:05:34.211400,2021-06-13T16:17:31.694,,2024-02-12T15:05:34.211400
27,37,4,England,FA Women's Super League,female,False,False,2018/2019,2024-08-07T17:22:40.334287,2021-06-13T16:17:31.694,,2024-08-07T17:22:40.334287
63,49,3,United States of America,NWSL,female,False,False,2018,2024-12-15T12:31:48.035735,2021-06-13T16:17:31.694,,2024-12-15T12:31:48.035735
71,53,315,Europe,UEFA Women's Euro,female,False,True,2025,2025-07-28T14:19:20.467348,2025-07-29T16:03:07.355174,2025-07-29T16:03:07.355174,2025-07-28T14:19:20.467348
72,53,106,Europe,UEFA Women's Euro,female,False,True,2022,2024-02-13T13:27:17.178263,2024-02-13T13:30:52.820588,2024-02-13T13:30:52.820588,2024-02-13T13:27:17.178263
73,72,107,International,Women's World Cup,female,False,True,2023,2025-07-14T10:07:06.620906,2025-07-14T10:10:27.224586,2025-07-14T10:10:27.224586,2025-07-14T10:07:06.620906
74,72,30,International,Women's World Cup,female,False,True,2019,2024-08-08T15:57:56.748740,2021-06-13T16:17:31.694,,2024-08-08T15:57:56.748740
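
The competition and season IDs used in the rest of the notebook (53/315 and 72/107) are read off the table above by eye. A minimal sketch of looking them up by name instead, assuming a small stand-in for `women_comps` and a hypothetical helper `lookup_ids` that is not part of statsbombpy:

```python
import pandas as pd

# Stand-in for `women_comps`, copied from a few rows of the table above
women_comps = pd.DataFrame({
    "competition_id": [37, 53, 53, 72, 72],
    "season_id": [90, 315, 106, 107, 30],
    "competition_name": [
        "FA Women's Super League", "UEFA Women's Euro", "UEFA Women's Euro",
        "Women's World Cup", "Women's World Cup",
    ],
    "season_name": ["2020/2021", "2025", "2022", "2023", "2019"],
})


def lookup_ids(comps, competition_name, season_name):
    """Hypothetical helper: return (competition_id, season_id) for one exact match."""
    mask = (
        (comps["competition_name"] == competition_name)
        & (comps["season_name"] == season_name)
    )
    match = comps.loc[mask]
    if len(match) != 1:
        raise ValueError(f"Expected exactly one match, found {len(match)}")
    row = match.iloc[0]
    return int(row["competition_id"]), int(row["season_id"])


euros_ids = lookup_ids(women_comps, "UEFA Women's Euro", "2025")
worlds_ids = lookup_ids(women_comps, "Women's World Cup", "2023")
print(euros_ids, worlds_ids)
```

Raising on zero or multiple matches is a design choice: it surfaces a renamed competition or a duplicated season early, rather than silently loading the wrong matches.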


## EUROS

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

matches_euros = sb.matches(competition_id=53, season_id=315)
matches_euros.head(2)

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,match_status_360,last_updated,last_updated_360,match_week,competition_stage,stadium,referee,home_managers,away_managers,data_version,shot_fidelity_version,xy_fidelity_version
0,4020846,2025-07-27,16:00:00.000,Europe - UEFA Women's Euro,2025,England Women's,Spain Women's,1,1,available,available,2025-07-28T14:19:20.467348,2025-07-29T16:03:07.355174,6,Final,St. Jakob-Park,Stéphanie Frappart,Sarina Glotzbach-Wiegman,Montserrat Tomé Vázquez,1.1.0,2,2
1,4020077,2025-07-23,19:00:00.000,Europe - UEFA Women's Euro,2025,Germany Women's,Spain Women's,0,1,available,available,2025-07-24T19:44:48.774783,2025-07-25T15:22:27.432293,5,Semi-finals,Stadion Letzigrund,Edina Alves Batista,Christian Richard Wück,Montserrat Tomé Vázquez,1.1.0,2,2


### EUROS final - events

In [4]:
final_match = matches_euros[matches_euros['competition_stage'] == 'Final'].iloc[0]
final_match_id = final_match['match_id']

events_df = sb.events(match_id=final_match_id)

In [5]:
shots_df = events_df[events_df['type'] == 'Shot'].copy()

# Mark goals
shots_df['goal'] = shots_df['shot_outcome'].apply(lambda x: 1 if x == 'Goal' else 0)

In [6]:
# LINEUP

# Select first two Starting XI rows
xi_rows = events_df.loc[events_df['type'] == 'Starting XI'].iloc[:2]

# Create an empty list to store DataFrames
lineups = []

for _, xi_row in xi_rows.iterrows():
    tactics = xi_row['tactics']
    lineup_df = pd.DataFrame(tactics['lineup'])
    
    # Extract player and position info
    lineup_df['player_id'] = lineup_df['player'].apply(lambda x: x['id'])
    lineup_df['player'] = lineup_df['player'].apply(lambda x: x['name'])
    
    lineup_df['position_id'] = lineup_df['position'].apply(lambda x: x['id'])
    lineup_df['position'] = lineup_df['position'].apply(lambda x: x['name'])
    
    # Add team name
    lineup_df['team'] = xi_row['team']
    
    lineups.append(lineup_df)

# Combine home and away into one DataFrame
lineup_df = pd.concat(lineups, ignore_index=True)

lineup_df_euros = lineup_df
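
The loop above unpacks the nested `player` and `position` dictionaries column by column. As a possible alternative, `pd.json_normalize` can flatten them in one call. The sketch below runs on a hypothetical two-player `tactics` dictionary. The player names and IDs come from the pass table later in this notebook, while the jersey numbers, position IDs and formation are illustrative values only.

```python
import pandas as pd

# Hypothetical excerpt shaped like the `tactics` dict of a Starting XI event.
# Jersey numbers, position ids and formation are illustrative assumptions.
tactics = {
    "formation": 433,
    "lineup": [
        {"player": {"id": 22032, "name": "Hannah Hampton"},
         "position": {"id": 1, "name": "Goalkeeper"},
         "jersey_number": 1},
        {"player": {"id": 10178, "name": "Lucy Bronze"},
         "position": {"id": 2, "name": "Right Back"},
         "jersey_number": 2},
    ],
}

# Flatten nested dicts into dotted column names, then rename them
flat = pd.json_normalize(tactics["lineup"]).rename(columns={
    "player.id": "player_id",
    "player.name": "player",
    "position.id": "position_id",
    "position.name": "position",
})

# Attach team and formation, which live outside the lineup list
flat["team"] = "England Women's"
flat["formation"] = tactics["formation"]
print(flat)
```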

In [7]:
# Keep shots taken before minute 120 (excludes the penalty shootout, and also any shots in extra-time stoppage time)
shots_up_to_120 = shots_df[shots_df['minute'] < 120]

# Calculate total xG, goals, number of shots, and avg xG per shot for each player
player_xg_summary = shots_up_to_120.groupby(['player', 'team']).agg(
    shots=('shot_statsbomb_xg', 'count'),
    total_xg=('shot_statsbomb_xg', 'sum'),
    xg_per_shot=('shot_statsbomb_xg', 'mean'),
    goals=('goal', 'sum')
).sort_values('total_xg', ascending=False)

# Merge position info from the lineup
player_xg_summary = player_xg_summary.reset_index().merge(
    lineup_df_euros[['player', 'team', 'position', 'jersey_number']],
    on=['player', 'team'],
    how='left'
)

# Fill missing positions and jersey numbers for substitutes
player_xg_summary['position'] = player_xg_summary['position'].fillna('Sub')
player_xg_summary['jersey_number'] = player_xg_summary['jersey_number'].fillna(-1).astype(int)

# Reorder columns
player_xg_summary = player_xg_summary[
    ['player', 'team', 'position', 'jersey_number', 'shots', 'total_xg', 'xg_per_shot', 'goals']
]

player_xg_summary_euros = player_xg_summary
player_xg_summary_euros

Unnamed: 0,player,team,position,jersey_number,shots,total_xg,xg_per_shot,goals
0,María Francesca Caldentey Oliver,Spain Women's,Left Wing,8,3,0.547321,0.18244,1
1,Salma Paralluelo Ayingono,Spain Women's,Sub,-1,3,0.441645,0.147215,0
2,Lauren Hemp,England Women's,Right Wing,11,1,0.33025,0.33025,0
3,Esther Gonzalez Rodríguez,Spain Women's,Center Forward,9,3,0.304451,0.101484,0
4,Alessia Russo,England Women's,Center Forward,23,2,0.277606,0.138803,1
5,Victoria López,Spain Women's,Sub,-1,3,0.249601,0.0832,0
6,Aitana Bonmati Conca,Spain Women's,Right Center Midfield,6,4,0.216761,0.05419,0
7,Claudia Pina Medina,Spain Women's,Sub,-1,2,0.138034,0.069017,0
8,Athenea del Castillo Belvide,Spain Women's,Right Wing,10,3,0.115528,0.038509,0
9,Chloe Kelly,England Women's,Sub,-1,2,0.108643,0.054321,0
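
The summary above removes shootout kicks with a minute cut-off. A sketch of a period-based filter follows, assuming the StatsBomb convention (as I understand the data specification) that period 5 marks the penalty shootout. The four shots are made-up rows chosen to show where the two filters differ.

```python
import pandas as pd

# Made-up shots: C is taken in extra-time stoppage time (period 4, minute 121),
# D is a shootout kick (period 5)
shots = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "period": [1, 4, 4, 5],
    "minute": [25, 110, 121, 122],
    "shot_statsbomb_xg": [0.10, 0.05, 0.20, 0.78],
})

# Minute-based filter, as used in the cell above
by_minute = shots[shots["minute"] < 120]

# Period-based filter: keeps open-play shots in stoppage time, drops the shootout
by_period = shots[shots["period"] < 5]

print(by_minute["player"].tolist())
print(by_period["player"].tolist())
```

The two filters agree whenever no shot is taken after minute 120 of extra time, so the tables in this notebook may be unaffected. The period-based version is simply less likely to drop a late open-play shot.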


In [8]:
# Total xG by substitutes
subs_xg = player_xg_summary_euros.loc[player_xg_summary_euros['position'] == 'Sub', 'total_xg'].sum()

# Total xG by other positions
starters_xg = player_xg_summary_euros.loc[player_xg_summary_euros['position'] != 'Sub', 'total_xg'].sum()

# Combine in a DataFrame for easy view
xg_split = pd.DataFrame({
    'Category': ['Subs', 'Starters/Other'],
    'Total_xG': [subs_xg, starters_xg]
})

xg_split

Unnamed: 0,Category,Total_xG
0,Subs,0.937922
1,Starters/Other,2.080566


In [9]:
# Calculate total xG and goals for each team
team_xg_summary = shots_up_to_120.groupby('team').agg(
    shots=('shot_statsbomb_xg', 'count'),
    total_xg=('shot_statsbomb_xg', 'sum'),
    goals=('goal', 'sum'),
    xg_per_shot=('shot_statsbomb_xg', 'mean')
).sort_values('total_xg', ascending=False)

team_xg_summary

Unnamed: 0_level_0,shots,total_xg,goals,xg_per_shot
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Spain Women's,23,2.13598,1,0.092869
England Women's,8,0.882508,1,0.110313
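
One way to read the team table above is to set goals against xG. The sketch below re-enters the two rows by hand and adds two derived columns, `goals_minus_xg` and `xg_share`. Both column names are choices made for this example and do not come from statsbombpy.

```python
import pandas as pd

# Values copied from the team summary above (shots before minute 120)
team_xg = pd.DataFrame(
    {"shots": [23, 8], "total_xg": [2.13598, 0.882508], "goals": [1, 1]},
    index=pd.Index(["Spain Women's", "England Women's"], name="team"),
)

# Positive: scored more than the chances were worth; negative: fewer
team_xg["goals_minus_xg"] = team_xg["goals"] - team_xg["total_xg"]

# Each team's share of the total chance quality created in the match
team_xg["xg_share"] = team_xg["total_xg"] / team_xg["total_xg"].sum()

print(team_xg.round(3))
```

A single match is a very small sample, so a gap between goals and xG here says little about finishing skill on its own.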


### Passes

In [10]:
passes_euros_final = events_df[events_df['type'] == 'Pass'].copy()

passes_euros_final.head(2)

Unnamed: 0,50_50,ball_receipt_outcome,ball_recovery_recovery_failure,carry_end_location,clearance_aerial_won,clearance_body_part,clearance_head,clearance_left_foot,clearance_other,clearance_right_foot,counterpress,dribble_nutmeg,dribble_outcome,duel_outcome,duel_type,duration,foul_committed_advantage,foul_committed_card,foul_won_advantage,foul_won_defensive,goalkeeper_body_part,goalkeeper_end_location,goalkeeper_outcome,goalkeeper_position,goalkeeper_technique,goalkeeper_type,id,index,injury_stoppage_in_chain,interception_outcome,location,match_id,minute,miscontrol_aerial_won,off_camera,out,pass_aerial_won,pass_angle,pass_assisted_shot_id,pass_body_part,pass_cross,pass_cut_back,pass_deflected,pass_end_location,pass_goal_assist,pass_height,pass_inswinging,pass_length,pass_outcome,pass_outswinging,pass_recipient,pass_recipient_id,pass_shot_assist,pass_straight,pass_switch,pass_technique,pass_through_ball,pass_type,period,play_pattern,player,player_id,position,possession,possession_team,possession_team_id,related_events,second,shot_body_part,shot_end_location,shot_first_time,shot_freeze_frame,shot_key_pass_id,shot_one_on_one,shot_outcome,shot_statsbomb_xg,shot_technique,shot_type,substitution_outcome,substitution_outcome_id,substitution_replacement,substitution_replacement_id,tactics,team,team_id,timestamp,type,under_pressure
12,,,,,,,,,,,,,,,,2.551716,,,,,,,,,,,8b621ae4-ea81-415c-af41-9669db9bdd93,5,,,"[61.0, 40.1]",4020846,0,,,,,3.049369,,Right Foot,,,,"[26.4, 43.3]",,Ground Pass,,34.74766,,,Hannah Hampton,22032.0,,,,,,Kick Off,1,From Kick Off,Ella Toone,31534.0,Center Attacking Midfield,2,England Women's,865,[4706efbe-767c-45aa-9351-09528a77d135],0,,,,,,,,,,,,,,,,England Women's,865,00:00:00.682,Pass,
13,,,,,,,,,,,,,,,,3.108339,,,,,,,,,,,27fa7d4d-d637-4487-98e2-5c078ad600c7,9,,,"[28.5, 43.8]",4020846,0,,,,,0.269167,,Right Foot,,,,"[83.6, 59.0]",,High Pass,,57.158115,,,Lucy Bronze,10178.0,,,,,,,1,From Kick Off,Hannah Hampton,22032.0,Goalkeeper,2,England Women's,865,"[6fd0aa81-8d8c-47aa-965b-d5737b19b868, 764d437...",4,,,,,,,,,,,,,,,,England Women's,865,00:00:04.396,Pass,True


In [25]:
pass_columns = [
    # Identifiers & context
    "id", "match_id", "period", "minute", "second", "timestamp",
    "team", "team_id", "player", "player_id", "position",
    "possession", "possession_team", "possession_team_id",
    "play_pattern", "under_pressure",

    # Locations
    "location", "pass_end_location",

    # Pass characteristics
    "pass_length", "pass_angle", "pass_height",
    "pass_body_part", "pass_technique", "pass_type",

    # Pass intent / style
    "pass_cross", "pass_through_ball", "pass_switch",
    "pass_cut_back", "pass_straight",
    "pass_inswinging", "pass_outswinging",

    # Outcomes & links
    "pass_outcome",
    "pass_recipient", "pass_recipient_id",
    "pass_assisted_shot_id", "pass_shot_assist",
    "pass_goal_assist",

    # Pressure & context
    "counterpress"
]

In [32]:
passes_euros_final_clean = passes_euros_final[pass_columns].copy()
passes_euros_final_clean.head(4)

Unnamed: 0,id,match_id,period,minute,second,timestamp,team,team_id,player,player_id,position,possession,possession_team,possession_team_id,play_pattern,under_pressure,location,pass_end_location,pass_length,pass_angle,pass_height,pass_body_part,pass_technique,pass_type,pass_cross,pass_through_ball,pass_switch,pass_cut_back,pass_straight,pass_inswinging,pass_outswinging,pass_outcome,pass_recipient,pass_recipient_id,pass_assisted_shot_id,pass_shot_assist,pass_goal_assist,counterpress
12,8b621ae4-ea81-415c-af41-9669db9bdd93,4020846,1,0,0,00:00:00.682,England Women's,865,Ella Toone,31534.0,Center Attacking Midfield,2,England Women's,865,From Kick Off,,"[61.0, 40.1]","[26.4, 43.3]",34.74766,3.049369,Ground Pass,Right Foot,,Kick Off,,,,,,,,,Hannah Hampton,22032.0,,,,
13,27fa7d4d-d637-4487-98e2-5c078ad600c7,4020846,1,0,4,00:00:04.396,England Women's,865,Hannah Hampton,22032.0,Goalkeeper,2,England Women's,865,From Kick Off,True,"[28.5, 43.8]","[83.6, 59.0]",57.158115,0.269167,High Pass,Right Foot,,,,,,,,,,,Lucy Bronze,10178.0,,,,
14,dec91305-49ce-463b-bda2-876bbc37660f,4020846,1,0,7,00:00:07.505,England Women's,865,Lucy Bronze,10178.0,Right Back,2,England Women's,865,From Kick Off,,"[83.6, 59.0]","[91.8, 69.5]",13.322537,0.907778,High Pass,Head,,,,,,,,,,,Lauren Hemp,15555.0,,,,
15,ace589b9-15e1-4326-af54-7928fe9178df,4020846,1,0,9,00:00:09.296,England Women's,865,Lauren Hemp,15555.0,Right Wing,2,England Women's,865,From Kick Off,True,"[91.8, 69.5]","[95.1, 60.9]",9.211406,-1.204402,Low Pass,,,,,,,,,,,,Alessia Russo,47521.0,,,,


In [29]:
print("Total pass events in the Euros final:", len(passes_euros_final_clean))

Total pass events in the Euros final: 1306


In [37]:
# Descriptive analysis of columns indicating pass difficulty

df = passes_euros_final_clean.copy()

# treat IDs / state variables as categorical
for col in [
    "possession",
    "possession_team_id",
    "team_id",
    "player_id",
    "pass_recipient_id",
    "match_id"
]:
    if col in df.columns:
        df[col] = df[col].astype("category")

# split columns
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# numeric summary
df[num_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
period,1306.0,1.916539,0.97631,1.0,1.0,2.0,2.0,4.0
minute,1306.0,57.112557,35.530791,0.0,27.0,53.0,91.0,122.0
second,1306.0,28.705207,17.405542,0.0,13.0,28.0,44.0,59.0
pass_length,1306.0,19.59055,13.816006,1.0,10.171888,16.23961,24.184021,100.25108
pass_angle,1306.0,0.035255,1.55025,-3.122727,-1.186932,0.045143,1.200095,3.141593


In [38]:
# categorical options
for c in cat_cols:
    print(c)
    print(df[c].value_counts(dropna=False))
    print()

counterpress
counterpress
NaN     1300
True       6
Name: count, dtype: int64

id
id
8b621ae4-ea81-415c-af41-9669db9bdd93    1
f48a2e91-6be1-417b-b141-948e6ef217a9    1
f707b0b7-e000-4244-877c-88905a28ea41    1
1e890c1d-f822-4022-ad7a-ac2b046f2d39    1
8826bdd1-1202-427d-8d2d-351021b9ed47    1
                                       ..
1f6398b0-0b87-4329-a6b5-177a38853ee9    1
01b15b2b-1ce3-4b83-8b02-8065c77d6da1    1
95f8048e-4ce7-47c9-b9d1-64485dabfcd6    1
6c2ac10e-9bd3-4f5f-9aa7-2e2a5b4ac5a4    1
20fc3418-afd8-49f0-936e-88682d7e62d8    1
Name: count, Length: 1306, dtype: int64

location
location
[61.0, 40.1]     6
[120.0, 0.1]     6
[120.0, 80.0]    4
[6.0, 30.0]      2
[66.8, 31.9]     2
                ..
[66.0, 80.0]     1
[55.1, 76.0]     1
[66.2, 2.2]      1
[50.0, 77.3]     1
[105.2, 76.3]    1
Name: count, Length: 1287, dtype: int64

match_id
match_id
4020846    1306
Name: count, dtype: int64

pass_assisted_shot_id
pass_assisted_shot_id
NaN                                    

In [41]:
# Basic (raw) pass completion rate
df = passes_euros_final_clean.copy()
df["completed"] = df["pass_outcome"].isna().astype(int)
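
The `completed` flag above relies on `pass_outcome` being empty for completed passes. A small sketch on made-up rows shows that logic, together with an optional stricter denominator. The `"Unknown"` label is an assumption about the outcome vocabulary and should be checked against the `value_counts` output for the real data.

```python
import numpy as np
import pandas as pd

# Made-up passes: a missing outcome means the pass reached a teammate
passes = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "pass_outcome": [np.nan, "Incomplete", np.nan, "Out", "Unknown"],
})
passes["completed"] = passes["pass_outcome"].isna().astype(int)

# Outcome breakdown with completed passes made explicit
print(passes["pass_outcome"].fillna("Complete").value_counts())

# Optional: exclude passes whose outcome is labelled "Unknown" (assumed label)
known = passes[passes["pass_outcome"] != "Unknown"]
rate_known = known.groupby("player")["completed"].mean()
print(rate_known)
```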

In [44]:
# Per player
player_basic = (
    df
    .groupby("player")
    .agg(
        passes=("id", "count"),
        completed=("completed", "sum")
    )
)

player_basic["completion_rate"] = (
    player_basic["completed"] / player_basic["passes"]
)

player_basic

Unnamed: 0_level_0,passes,completed,completion_rate
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aitana Bonmati Conca,86,73,0.848837
Alessia Russo,8,4,0.5
Alex Greenwood,51,42,0.823529
Alexia Putellas Segura,49,39,0.795918
Athenea del Castillo Belvide,29,19,0.655172
Bethany Mead,5,5,1.0
Catalina Thomas Coll Lluch,32,26,0.8125
Chloe Kelly,18,11,0.611111
Claudia Pina Medina,33,20,0.606061
Ella Toone,21,18,0.857143


In [45]:
team_basic = (
    df
    .groupby("team")
    .agg(
        passes=("id", "count"),
        completed=("completed", "sum")
    )
)

team_basic["completion_rate"] = (
    team_basic["completed"] / team_basic["passes"]
)

team_basic

Unnamed: 0_level_0,passes,completed,completion_rate
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
England Women's,474,341,0.719409
Spain Women's,832,699,0.840144


#### Adjusted pass completion rate

Raw completion rate treats every pass as equally hard. Here a pass is assumed to be more difficult when it is:

- **(a) Longer:** more time in flight and more interception risk → `pass_length`
- **(b) Played under pressure:** reduced decision time and poorer body orientation → `under_pressure`
- **(c) Aerial:** harder to control and receive, with more variance → `pass_height` (High Pass > Low Pass > Ground Pass)
- **(d) Intentionally risky:** designed to break the defensive structure → `pass_through_ball`, `pass_cross`, `pass_switch`
- **(e) Progressive:** moves the ball meaningfully forward → derived from `location` and `pass_end_location`

In [48]:
df["under_pressure"] = df["under_pressure"].fillna(False).astype(bool).astype(int)
df["through_ball"] = df["pass_through_ball"].fillna(False).astype(bool).astype(int)
df["cross"] = df["pass_cross"].fillna(False).astype(bool).astype(int)
df["switch"] = df["pass_switch"].fillna(False).astype(bool).astype(int)
df["high_pass"] = (df["pass_height"] == "High Pass").astype(int)

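# x-coordinates of the pass start and end points
df["start_x"] = df["location"].str[0]
df["end_x"] = df["pass_end_location"].str[0]
# Progressive: the pass gains more than 10 pitch units along the x-axis
df["progressive"] = (df["end_x"] - df["start_x"] > 10).astype(int)

# Heuristic difficulty score: the weights below are hand-picked, not fitted
df["difficulty"] = (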
    0.03 * df["pass_length"] +
    0.5  * df["under_pressure"] +
    0.7  * df["high_pass"] +
    0.8  * df["through_ball"] +
    0.6  * df["cross"] +
    0.6  * df["switch"] +
    0.5  * df["progressive"]
)

player_pass = (
    df.groupby("player")
      .agg(
          passes=("id", "count"),
          completed=("completed", "sum"),
          total_difficulty=("difficulty", "sum"),
          completed_difficulty=("difficulty", lambda x: x[df.loc[x.index, "completed"] == 1].sum())
      )
)

player_pass["completion_rate"] = player_pass["completed"] / player_pass["passes"]
player_pass["difficulty_adjusted_rate"] = (
    player_pass["completed_difficulty"] / player_pass["total_difficulty"]
)

team_pass = (
    df.groupby("team")
      .agg(
          passes=("id", "count"),
          completed=("completed", "sum"),
          total_difficulty=("difficulty", "sum"),
          completed_difficulty=("difficulty", lambda x: x[df.loc[x.index, "completed"] == 1].sum())
      )
)

team_pass["completion_rate"] = team_pass["completed"] / team_pass["passes"]
team_pass["difficulty_adjusted_rate"] = (
    team_pass["completed_difficulty"] / team_pass["total_difficulty"]
)



In [49]:
player_pass

Unnamed: 0_level_0,passes,completed,total_difficulty,completed_difficulty,completion_rate,difficulty_adjusted_rate
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Aitana Bonmati Conca,86,73,60.27429,49.789467,0.848837,0.826048
Alessia Russo,8,4,8.174196,4.996706,0.5,0.611278
Alex Greenwood,51,42,57.554178,37.987062,0.823529,0.660023
Alexia Putellas Segura,49,39,37.599232,22.03647,0.795918,0.586088
Athenea del Castillo Belvide,29,19,17.744875,8.121691,0.655172,0.457692
Bethany Mead,5,5,4.129474,4.129474,1.0,1.0
Catalina Thomas Coll Lluch,32,26,47.170994,32.74354,0.8125,0.694146
Chloe Kelly,18,11,26.55307,12.45725,0.611111,0.469145
Claudia Pina Medina,33,20,45.42844,19.65113,0.606061,0.432573
Ella Toone,21,18,18.440899,15.793045,0.857143,0.856414


In [50]:
team_pass

Unnamed: 0_level_0,passes,completed,total_difficulty,completed_difficulty,completion_rate,difficulty_adjusted_rate
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
England Women's,474,341,624.779088,342.754552,0.719409,0.548601
Spain Women's,832,699,672.578646,496.60408,0.840144,0.738358
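
The difficulty-adjusted rate is the share of total difficulty that was attached to completed passes, so a failed hard pass costs more than a failed easy one. A toy sketch with made-up numbers follows. It uses a helper column instead of the lambda in the cell above, which should give the same result and is arguably easier to read.

```python
import pandas as pd

# Made-up passes: team X misses its hardest pass, team Y completes everything
toy = pd.DataFrame({
    "team": ["X", "X", "X", "Y", "Y"],
    "difficulty": [0.5, 2.0, 1.5, 1.0, 3.0],
    "completed": [1, 0, 1, 1, 1],
})

# Difficulty counts towards the numerator only when the pass was completed
toy["completed_difficulty"] = toy["difficulty"] * toy["completed"]

summary = toy.groupby("team").agg(
    passes=("completed", "size"),
    completed=("completed", "sum"),
    total_difficulty=("difficulty", "sum"),
    completed_difficulty=("completed_difficulty", "sum"),
)
summary["completion_rate"] = summary["completed"] / summary["passes"]
summary["difficulty_adjusted_rate"] = (
    summary["completed_difficulty"] / summary["total_difficulty"]
)
print(summary)
```

Team X completes two of three passes (about 0.67 raw) but only half of its total difficulty (0.5 adjusted), which mirrors the gap between the two rate columns in the tables above.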


In [51]:
# Build an xP (expected pass completion) model: feature engineering

df["start_x"] = df["location"].str[0]
df["start_y"] = df["location"].str[1]
df["end_x"] = df["pass_end_location"].str[0]
df["end_y"] = df["pass_end_location"].str[1]

df["dx"] = df["end_x"] - df["start_x"]
df["dy"] = df["end_y"] - df["start_y"]

df["distance"] = np.sqrt(df["dx"]**2 + df["dy"]**2)
df["forward"] = (df["dx"] > 0).astype(int)

df["under_pressure"] = df["under_pressure"].fillna(False).astype(bool).astype(int)
df["through_ball"] = df["pass_through_ball"].fillna(False).astype(bool).astype(int)
df["cross"] = df["pass_cross"].fillna(False).astype(bool).astype(int)
df["switch"] = df["pass_switch"].fillna(False).astype(bool).astype(int)

df["high_pass"] = (df["pass_height"] == "High Pass").astype(int)



In [52]:
from sklearn.linear_model import LogisticRegression

features = [
    "distance",
    "forward",
    "under_pressure",
    "high_pass",
    "through_ball",
    "cross",
    "switch"
]

X = df[features]
y = df["completed"]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
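
The model above is fitted on every pass in the match and later used to predict those same passes, so its probabilities are in-sample. A minimal sketch of a held-out check follows, run on synthetic data. The data-generating coefficients, the 70/30 split and the choice of Brier score are assumptions made for this example, not part of the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for two of the features used above
distance = rng.uniform(1, 60, n)
under_pressure = rng.integers(0, 2, n)

# Assumed relationship, for illustration only: longer and pressured passes fail more
logit = 2.5 - 0.05 * distance - 0.6 * under_pressure
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([distance, under_pressure])

# Hold out 30% of passes that the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
xp_test = model.predict_proba(X_test)[:, 1]

# Brier score: mean squared error of the predicted probabilities (lower is better)
brier_model = brier_score_loss(y_test, xp_test)
# Baseline: predict the training completion rate for every pass
brier_baseline = brier_score_loss(y_test, np.full(len(y_test), y_train.mean()))
print(f"model: {brier_model:.4f}  baseline: {brier_baseline:.4f}")
```

With a single match of roughly 1,300 passes, a split like this leaves little data on either side, so pooling several matches before fitting may be the more practical route.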

In [53]:
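# xP: predicted probability that each pass is completed (in-sample predictions)
df["xP"] = model.predict_proba(X)[:, 1]
# Overwrites the earlier heuristic difficulty score with the model-based one
df["difficulty"] = 1 - df["xP"]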


In [56]:
player_xp = (
    df.groupby("player")
      .agg(
          passes=("id", "count"),
          completed=("completed", "sum"),
          expected_completed=("xP", "sum")
      )
)

player_xp["raw_completion"] = (
    player_xp["completed"] / player_xp["passes"]
)

player_xp["xp_adjusted"] = (
    player_xp["completed"] - player_xp["expected_completed"]
)

player_xp

Unnamed: 0_level_0,passes,completed,expected_completed,raw_completion,xp_adjusted
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Aitana Bonmati Conca,86,73,73.712751,0.848837,-0.712751
Alessia Russo,8,4,6.621285,0.5,-2.621285
Alex Greenwood,51,42,39.874229,0.823529,2.125771
Alexia Putellas Segura,49,39,40.816219,0.795918,-1.816219
Athenea del Castillo Belvide,29,19,24.26503,0.655172,-5.26503
Bethany Mead,5,5,4.410629,1.0,0.589371
Catalina Thomas Coll Lluch,32,26,22.468441,0.8125,3.531559
Chloe Kelly,18,11,12.237217,0.611111,-1.237217
Claudia Pina Medina,33,20,24.653198,0.606061,-4.653198
Ella Toone,21,18,18.550292,0.857143,-0.550292
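
Because `xp_adjusted` is a count (passes completed above expectation), it grows with pass volume. A sketch of a per-100-passes version follows, using three rows copied from the table above. The `min_passes` threshold of 20 is an arbitrary choice for illustration.

```python
import pandas as pd

# Three rows copied from the player table above
player_xp = pd.DataFrame(
    {
        "passes": [86, 51, 5],
        "completed": [73, 42, 5],
        "expected_completed": [73.712751, 39.874229, 4.410629],
    },
    index=pd.Index(
        ["Aitana Bonmati Conca", "Alex Greenwood", "Bethany Mead"], name="player"
    ),
)

player_xp["xp_adjusted"] = player_xp["completed"] - player_xp["expected_completed"]

# Normalise by volume so high- and low-volume passers are comparable
player_xp["xp_adjusted_per_100"] = 100 * player_xp["xp_adjusted"] / player_xp["passes"]

# Hide very small samples, where the rate is dominated by noise
min_passes = 20  # arbitrary threshold (assumption)
qualified = player_xp[player_xp["passes"] >= min_passes]
print(qualified.sort_values("xp_adjusted_per_100", ascending=False))
```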


In [57]:
team_xp = (
    df.groupby("team")
      .agg(
          passes=("id", "count"),
          completed=("completed", "sum"),
          expected_completed=("xP", "sum")
      )
)

team_xp["raw_completion"] = (
    team_xp["completed"] / team_xp["passes"]
)

team_xp["xp_adjusted"] = (
    team_xp["completed"] - team_xp["expected_completed"]
)

team_xp

Unnamed: 0_level_0,passes,completed,expected_completed,raw_completion,xp_adjusted
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
England Women's,474,341,347.150823,0.719409,-6.150823
Spain Women's,832,699,692.851022,0.840144,6.148978


## WORLDS

In [10]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

matches_worlds = sb.matches(competition_id=72, season_id=107)
matches_worlds.head(2)

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,match_status_360,last_updated,last_updated_360,match_week,competition_stage,stadium,referee,home_managers,away_managers,data_version,shot_fidelity_version,xy_fidelity_version
0,3904629,2023-08-16,13:00:00.000,International - Women's World Cup,2023,Australia Women's,England Women's,1,3,available,available,2023-08-30T11:15:11.306289,2023-08-30T11:17:47.551826,6,Semi-finals,Accor Stadium,Tori Penso,Tony Gustavsson,Sarina Glotzbach-Wiegman,1.1.0,2,2
1,3906390,2023-08-20,13:00:00.000,International - Women's World Cup,2023,Spain Women's,England Women's,1,0,available,available,2023-08-22T19:29:29.948278,2023-08-22T19:38:43.965521,7,Final,Accor Stadium,Tori Penso,Jorge Vilda,Sarina Glotzbach-Wiegman,1.1.0,2,2


In [11]:
final_match = matches_worlds[matches_worlds['competition_stage'] == 'Final'].iloc[0]
final_match_id = final_match['match_id']

events_df = sb.events(match_id=final_match_id)

In [12]:
shots_df = events_df[events_df['type'] == 'Shot'].copy()

# Mark goals
shots_df['goal'] = shots_df['shot_outcome'].apply(lambda x: 1 if x == 'Goal' else 0)

In [13]:
# LINEUP

# Select first two Starting XI rows
xi_rows = events_df.loc[events_df['type'] == 'Starting XI'].iloc[:2]

# Create an empty list to store DataFrames
lineups = []

for _, xi_row in xi_rows.iterrows():
    tactics = xi_row['tactics']
    lineup_df = pd.DataFrame(tactics['lineup'])
    
    # Extract player and position info
    lineup_df['player_id'] = lineup_df['player'].apply(lambda x: x['id'])
    lineup_df['player'] = lineup_df['player'].apply(lambda x: x['name'])
    
    lineup_df['position_id'] = lineup_df['position'].apply(lambda x: x['id'])
    lineup_df['position'] = lineup_df['position'].apply(lambda x: x['name'])
    
    # Add team name
    lineup_df['team'] = xi_row['team']
    
    lineups.append(lineup_df)

# Combine home and away into one DataFrame
lineup_df = pd.concat(lineups, ignore_index=True)

lineup_df_worlds = lineup_df

In [14]:
# Keep shots taken before minute 120 (excludes any penalty shootout, and also any shots in extra-time stoppage time)
shots_up_to_120 = shots_df[shots_df['minute'] < 120]

# Calculate total xG, goals, number of shots, and avg xG per shot for each player
player_xg_summary = shots_up_to_120.groupby(['player', 'team']).agg(
    shots=('shot_statsbomb_xg', 'count'),
    total_xg=('shot_statsbomb_xg', 'sum'),
    xg_per_shot=('shot_statsbomb_xg', 'mean'),
    goals=('goal', 'sum')
).sort_values('total_xg', ascending=False)

# Merge position info from the lineup
player_xg_summary = player_xg_summary.reset_index().merge(
    lineup_df_worlds[['player', 'team', 'position', 'jersey_number']],
    on=['player', 'team'],
    how='left'
)

# Fill missing positions and jersey numbers for substitutes
player_xg_summary['position'] = player_xg_summary['position'].fillna('Sub')
player_xg_summary['jersey_number'] = player_xg_summary['jersey_number'].fillna(-1).astype(int)

# Reorder columns
player_xg_summary = player_xg_summary[
    ['player', 'team', 'position', 'jersey_number', 'shots', 'total_xg', 'xg_per_shot', 'goals']
]

player_xg_summary_worlds = player_xg_summary
player_xg_summary_worlds

Unnamed: 0,player,team,position,jersey_number,shots,total_xg,xg_per_shot,goals
0,Jennifer Hermoso Fuentes,Spain Women's,Center Attacking Midfield,10,2,0.843641,0.421821,0
1,Salma Paralluelo Ayingono,Spain Women's,Center Forward,18,3,0.503586,0.167862,0
2,Alba María Redondo Ferrer,Spain Women's,Right Wing,17,2,0.471368,0.235684,0
3,Lauren Hemp,England Women's,Left Center Forward,11,4,0.397953,0.099488,0
4,Alexia Putellas Segura,Spain Women's,Sub,-1,1,0.098619,0.098619,0
5,Aitana Bonmati Conca,Spain Women's,Right Center Midfield,6,2,0.091389,0.045694,0
6,Olga Carmona García,Spain Women's,Left Back,19,1,0.054861,0.054861,1
7,Millie Bright,England Women's,Center Back,6,2,0.0507,0.02535,0
8,Irene Paredes Hernandez,Spain Women's,Right Center Back,4,1,0.0498,0.0498,0
9,María Francesca Caldentey Oliver,Spain Women's,Left Wing,8,1,0.046465,0.046465,0


In [15]:
# Total xG by substitutes
subs_xg = player_xg_summary_worlds.loc[player_xg_summary_worlds['position'] == 'Sub', 'total_xg'].sum()

# Total xG by other positions
starters_xg = player_xg_summary_worlds.loc[player_xg_summary_worlds['position'] != 'Sub', 'total_xg'].sum()

# Combine in a DataFrame for easy view
xg_split = pd.DataFrame({
    'Category': ['Subs', 'Starters/Other'],
    'Total_xG': [subs_xg, starters_xg]
})

xg_split

Unnamed: 0,Category,Total_xG
0,Subs,0.112913
1,Starters/Other,2.550889


In [16]:
# Calculate total xG and goals for each team
team_xg_summary = shots_up_to_120.groupby('team').agg(
    shots=('shot_statsbomb_xg', 'count'),
    total_xg=('shot_statsbomb_xg', 'sum'),
    goals=('goal', 'sum'),
    xg_per_shot=('shot_statsbomb_xg', 'mean')
).sort_values('total_xg', ascending=False)

team_xg_summary

Unnamed: 0_level_0,shots,total_xg,goals,xg_per_shot
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Spain Women's,14,2.194123,1,0.156723
England Women's,8,0.469679,0,0.05871
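
Since both finals involve the same two teams, the team summaries can be placed side by side. The sketch below re-enters the four rows from the two team tables by hand. The `final` labels and the pivoted layout are choices made for this example.

```python
import pandas as pd

# Rows copied from the Euros 2025 and World Cup 2023 team summaries
finals = pd.DataFrame({
    "final": ["Euros 2025", "Euros 2025", "World Cup 2023", "World Cup 2023"],
    "team": ["Spain Women's", "England Women's", "Spain Women's", "England Women's"],
    "shots": [23, 8, 14, 8],
    "total_xg": [2.13598, 0.882508, 2.194123, 0.469679],
    "goals": [1, 1, 1, 0],
})
finals["xg_per_shot"] = finals["total_xg"] / finals["shots"]

# One row per team, one column block per metric, one sub-column per final
comparison = finals.pivot(
    index="team",
    columns="final",
    values=["shots", "total_xg", "xg_per_shot", "goals"],
)
print(comparison.round(3))
```

Read this way, Spain's total xG is similar in both finals while its xG per shot is lower in 2025, meaning more shots of lower average quality. That is a description of two matches only, not a trend.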
