NAME: Farhan Sakif (preferred name Sakif)



EMAIL: farhan.sakif@gmail.com



TIMEZONE: Pacific Standard Time (PST)

# PART 1: DATA COLLECTION, PREPARATION AND AGGREGATION

## Imports

In [1]:
from statsbombpy import sb
import pandas as pd

## Loading in all events from the 2020/2021 FA WSL (data provided by Statsbomb)

In [2]:
events = sb.competition_events(
    country="England",
    division= "FA Women's Super League",
    season="2020/2021",
    gender='female',
    split=True
)



In [3]:
#All the event data available in the dataset:

for i in events:
    print(i)

starting_xis
half_starts
passes
ball_receipts
carrys
pressures
dispossesseds
duels
ball_recoverys
dribbled_pasts
dribbles
goal_keepers
clearances
shots
blocks
foul_committeds
miscontrols
interceptions
foul_wons
offsides
shields
half_ends
substitutions
50/50s
injury_stoppages
referee_ball_drops
tactical_shifts
errors
player_offs
player_ons
bad_behaviours
own_goal_fors
own_goal_againsts


## Identifying CDMs in the league

- The first task is to identify the players that would be playing the role of a CDM in the FA WSL.

- To do so, players' stats for passes, ball recoveries, interceptions and duels were looked at while filtering for the positions of "Center Defensive Midfield", "Right Defensive Midfield", and "Left Defensive Midfield".

- It is important to note that a higher count for a player in these statistics also suggests that they are likely to be playing more minutes. These nuances get more attention later, when evaluating how good the CDMs actually are; for now the task is only to identify the players who regularly play the CDM role in the WSL.
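The filter-and-count step described above is repeated for each event type, so it can be sketched once as a small helper. This is a minimal sketch, not the notebook's own code: `CDM_POSITIONS`, `count_cdm_events` and the toy rows are illustrative names and made-up data, and only the `player` and `position` columns are assumed.

```python
import pandas as pd

# The three position labels treated as "CDM" in this study
CDM_POSITIONS = [
    "Center Defensive Midfield",
    "Left Defensive Midfield",
    "Right Defensive Midfield",
]

def count_cdm_events(df, label):
    """Count event rows per player, keeping only rows logged in a CDM position."""
    cdm_rows = df[df["position"].isin(CDM_POSITIONS)]
    counts = cdm_rows.groupby("player").size().sort_values(ascending=False)
    return counts.rename(label).to_frame()

# Toy rows standing in for events['passes'] (hypothetical players, not real data)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "player": ["Player A", "Player A", "Player B", "Player C"],
    "position": [
        "Center Defensive Midfield",
        "Left Defensive Midfield",
        "Right Defensive Midfield",
        "Center Forward",
    ],
})
toy_counts = count_cdm_events(toy, "passes")
print(toy_counts)
```

Applied to the real data, the same helper would take `events['passes']`, `events['interceptions']` and so on, with a threshold filter applied afterwards.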

In [8]:
#Getting the event data for passes made by players in the WSL

passing = events['passes']

#Filter for players listed as CDM

df_pass = passing[(passing['position'] == 'Center Defensive Midfield')
                 |(passing['position'] == 'Left Defensive Midfield')
                 |(passing['position'] == 'Right Defensive Midfield')]

#Since this is passing event data, every row is a pass made by a player and so a groupby using the player column
# alongside the count() function should return the total number of passes

total_passes = df_pass.groupby(['player']).count()
total_passes = total_passes[['id']]

#Sort in descending order (highest count first)

total_passes = total_passes.sort_values(by='id', ascending=False)
most_passes = total_passes[(total_passes['id'] > 300)]
most_passes = most_passes.rename(columns={'id': 'passes'})

most_passes

player,passes
Keira Walsh,1259
Lia Wälti,750
Melanie Leupolz,728
Katie Zelem,664
Angharad James,535
Alanna Stephanie Kennedy,528
Isobel Mary Christiansen,515
Jessica Fishlock,461
Jackie Groenen,401
Sophie Louise Ingle,395


In [9]:
#Using the same process for interceptions event data

interceptions = events['interceptions']

df_int = interceptions[(interceptions['position'] == 'Center Defensive Midfield')
                      |(interceptions['position'] == 'Left Defensive Midfield')
                      |(interceptions['position'] == 'Right Defensive Midfield')]


total_int = df_int.groupby(['player']).count()
total_int = total_int[['id']]
total_int = total_int.sort_values(by='id', ascending=False)
most_int = total_int[(total_int['id'] > 10)]
most_int = most_int.rename(columns={'id': 'interceptions'})

most_int

player,interceptions
Marisa Ewers,28
Aimee Palmer,24
Chloe Arthur,23
Keira Walsh,23
Alanna Stephanie Kennedy,22
Angharad James,22
Megan Connolly,22
Jessica Fishlock,22
Katie Zelem,21
Melanie Leupolz,21


In [10]:
#Now for ball recoveries

recoveries = events['ball_recoverys']

df_rec = recoveries[(recoveries['position'] == 'Center Defensive Midfield')
                   |(recoveries['position'] == 'Left Defensive Midfield')
                   |(recoveries['position'] == 'Right Defensive Midfield')]

total_rec = df_rec.groupby(['player']).count()
total_rec = total_rec[['id']]
total_rec = total_rec.sort_values(by='id', ascending=False)
most_rec = total_rec[(total_rec['id'] > 50)]
most_rec = most_rec.rename(columns={'id': 'recoveries'})

most_rec

player,recoveries
Keira Walsh,109
Katie Zelem,100
Angharad James,83
Alanna Stephanie Kennedy,81
Christie Murray,80
Kate Longhurst,79
Jessica Fishlock,78
Jackie Groenen,75
Melanie Leupolz,74
Lia Wälti,73


In [11]:
# And finally, duels

duels = events['duels']

df_duels = duels[(duels['position'] == 'Center Defensive Midfield')
                      |(duels['position'] == 'Left Defensive Midfield')
                      |(duels['position'] == 'Right Defensive Midfield')]

total_duels = df_duels.groupby(['player']).count()
total_duels = total_duels[['id']]
total_duels = total_duels.sort_values(by='id', ascending=False)
most_duels = total_duels[(total_duels['id'] > 20)]
most_duels = most_duels.rename(columns={'id': 'duels'})


most_duels

player,duels
Melanie Leupolz,56
Christie Murray,54
Lia Wälti,54
Angharad James,52
Keira Walsh,48
Katie Zelem,45
Alanna Stephanie Kennedy,44
Marisa Ewers,42
Jessica Fishlock,40
Chloe Arthur,39


- Appearing in several of the above dataframes is a good sign that the player regularly plays as a CDM in the FA WSL.

- Given that criterion, the players examined more closely in this study to find the top 3 CDMs in the league are:

    - Keira Walsh
    - Katie Zelem
    - Melanie Leupolz
    - Lia Wälti
    - Marisa Ewers
    - Jill Scott
    - Angharad James	
    - Jessica Fishlock	
    - Jackie Groenen
    - Alanna Stephanie Kennedy
    - Christie Murray
    - Josie Green
    - Sophie Louise Ingle	
    - Kate Longhurst
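The "appears multiple times" criterion can also be checked programmatically. The sketch below counts how many shortlists each name appears in. The function name `appearance_counts`, the toy shortlists and the threshold of two appearances are all assumptions made for illustration, not values taken from the study.

```python
from collections import Counter

def appearance_counts(*shortlists):
    """Count in how many shortlists each player name appears."""
    counts = Counter()
    for names in shortlists:
        # set() so a name is counted at most once per shortlist
        counts.update(set(names))
    return counts

# Hypothetical shortlists standing in for the four tables above
toy_shortlists = [
    ["Player A", "Player B"],
    ["Player A", "Player C"],
    ["Player A", "Player B"],
]
counts = appearance_counts(*toy_shortlists)

# Assumed threshold: keep players who appear in at least two shortlists
regulars = sorted(name for name, n in counts.items() if n >= 2)
print(regulars)
```

In the notebook the inputs would be `most_passes.index`, `most_int.index`, `most_rec.index` and `most_duels.index`.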

## Data Processing and Aggregation

#### Aggregated stats at the player-season or player-match level are a paid feature of the Statsbomb API. Therefore the aggregates for the selected players have to be built from the raw event data.
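As one illustration of building such aggregates from raw event rows, total passes and pass accuracy can be computed in a single `groupby`. This is a sketch on made-up rows: it assumes the `id`, `player` and `pass_outcome` columns and counts only the `'Incomplete'` outcome as a failed pass, so other outcomes such as `'Out'` are not treated as incomplete here.

```python
import pandas as pd

# Toy pass events (hypothetical players); a missing pass_outcome is treated as completed
toy_passes = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "player": ["Player A"] * 4 + ["Player B"] * 2,
    "pass_outcome": [None, "Incomplete", "Out", None, None, None],
})

pass_stats = toy_passes.groupby("player").agg(
    passes=("id", "count"),
    incomplete_passes=("pass_outcome", lambda s: (s == "Incomplete").sum()),
)

# Share of passes not flagged as 'Incomplete', in percent
pass_stats["pass_accuracy"] = (
    1 - pass_stats["incomplete_passes"] / pass_stats["passes"]
) * 100
print(pass_stats)
```

Whether outcomes other than `'Incomplete'` should also count against accuracy is a modelling choice; the sketch leaves it out.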

In [35]:
#Helper functions

def player_filter(event_name):
    '''
    - Function that filters the selected players based on passed in event (e.g passes or interceptions)
    '''

    # Shortlisted players identified in the previous section
    selected_players = [
        'Keira Walsh', 'Katie Zelem', 'Melanie Leupolz', 'Lia Wälti',
        'Marisa Ewers', 'Jill Scott', 'Angharad James', 'Jessica Fishlock',
        'Jackie Groenen', 'Alanna Stephanie Kennedy', 'Christie Murray',
        'Josie Green', 'Sophie Louise Ingle', 'Kate Longhurst',
    ]

    df = events[event_name]

    # isin() replaces the chain of equality checks joined with |
    return df[df['player'].isin(selected_players)]

def incomplete_passes(player):
    '''
    - Function to return the number of incomplete passes made by a player
    '''
    
    return len(passing.loc[(passing.player == player) & (passing.pass_outcome == 'Incomplete')])

def assists(player):
    '''
    - Function to calculate number of assists given by a player
    '''
    
    df = passing.loc[(passing.player == player)]
    assists = df.loc[df.pass_goal_assist == True]
    return len(assists)

def int_per_match(player):
    '''
    - Function to calculate interceptions made per match by a player
    - The idea is to get per match values since mins played is not available in the data
    - The unique values in the match_id column are used as the number of matches played
    - Note: this counts only matches in which the player made at least one interception,
      so the rate is overstated for players who had matches without one
    '''
    
    df = interceptions.loc[(interceptions.player == player)]
    
    return len(df)/(len(set(df.match_id)))

def rec_per_match(player):
    '''
    - Function to calculate recoveries per match for a player (similar to interceptions per match)
    '''
    
    df = recoveries.loc[(recoveries.player == player)]
    
    return len(df)/(len(set(df.match_id)))

def duel_success(player):
    '''
    - Return duel success rate for a given player
    '''
    successful_duels = len(duels.loc[(duels.player == player)
                                     &((duels.duel_outcome == 'Success In Play')
                                     |(duels.duel_outcome == 'Success Out')
                                     |(duels.duel_outcome == 'Won'))])
    
    return (successful_duels / len(duels.loc[(duels.player == player)])) * 100

def dribble_success_rate(player):
    '''
    - Return dribble success rate of a player
    '''
    successful_dribbles = len(dribbles.loc[(dribbles.player == player)&(dribbles.dribble_outcome == 'Complete')])
                                       
    return successful_dribbles / len(dribbles.loc[(dribbles.player == player)]) * 100 
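The per-match helpers above divide by the number of matches in which the player logged at least one event of that type. A possible alternative, sketched below, is to estimate matches played from several event tables at once. `matches_played` and `per_match` are hypothetical helpers and the rows are toy data. Only the `player` and `match_id` columns are assumed.

```python
import pandas as pd

def matches_played(event_tables, player):
    """Count distinct match_id values in which `player` logged any event."""
    match_ids = set()
    for df in event_tables:
        match_ids.update(df.loc[df["player"] == player, "match_id"])
    return len(match_ids)

def per_match(df, player, event_tables):
    """Events of one type per match, using matches seen across all tables."""
    n_matches = matches_played(event_tables, player)
    if n_matches == 0:
        return float("nan")  # player never appears in the data
    return (df["player"] == player).sum() / n_matches

# Toy data: Player A passes in three matches but intercepts in only one
toy_passes = pd.DataFrame({"player": ["Player A"] * 3, "match_id": [1, 2, 3]})
toy_ints = pd.DataFrame({"player": ["Player A"] * 2, "match_id": [1, 1]})

rate = per_match(toy_ints, "Player A", [toy_passes, toy_ints])
print(rate)  # 2 interceptions over 3 matches
```

Dividing by matches with an interception alone would give 2.0 for the same toy data, which shows how the choice of denominator changes the rate.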

In [39]:
#Aggregating the total passing and passing accuracy for selected players

passing = player_filter('passes')
total_passes = passing.groupby(['player']).count()
total_passes = total_passes[['id']]

total_passes = total_passes.sort_values(by='id', ascending=False)
total_passes = total_passes.rename(columns={'id': 'passes'})

incomplete_passes_list = []

for i in total_passes.index:
    inc_pass = incomplete_passes(i)
    incomplete_passes_list.append(inc_pass)
    
total_passes['incomplete_passes'] = incomplete_passes_list
total_passes['pass_accuracy'] = (1 - (total_passes.incomplete_passes/total_passes.passes)) * 100

#Identifying assists

asst_list = []
for i in (total_passes.index):
    asst = assists(i)
    asst_list.append(asst)
    
total_passes['assists'] = asst_list

#Interceptions and interceptions per match

interceptions = player_filter('interceptions')

total_int = interceptions.groupby(['player']).count()
total_int = total_int[['id']]
total_int = total_int.sort_values(by='id', ascending=False)
total_int = total_int.rename(columns={'id': 'interceptions'})

int_list = []
for i in (total_int.index):
    ints = int_per_match(i)
    int_list.append(ints)

total_int["ints_per_match"] = int_list

#Recoveries and recoveries per match

recoveries = player_filter('ball_recoverys')

total_rec = recoveries.groupby(['player']).count()
total_rec = total_rec[['id']]
total_rec = total_rec.sort_values(by='id', ascending=False)
total_rec = total_rec.rename(columns={'id': 'recoveries'})

rec_list = []
for i in (total_rec.index):
    recs = rec_per_match(i)
    rec_list.append(recs)

total_rec["recs_per_match"] = rec_list

#Duels and duel success rate

duels = player_filter('duels')

total_duels = duels.groupby(['player']).count()
total_duels = total_duels[['id']]
total_duels = total_duels.sort_values(by='id', ascending=False)
total_duels = total_duels.rename(columns={'id': 'duels'})

duel_rate_list = []

for i in total_duels.index:
    duel = duel_success(i)
    duel_rate_list.append(duel)
    
total_duels['duel_success_rate'] = duel_rate_list

#Dribbles and dribble success rate

dribbles = player_filter('dribbles')

total_dribbles = dribbles.groupby(['player']).count()
total_dribbles = total_dribbles[['id']]

total_dribbles = total_dribbles.sort_values(by='id', ascending=False)
total_dribbles = total_dribbles.rename(columns={'id': 'dribbles'})

drb_success_rate = []

for i in total_dribbles.index:
    drbl = dribble_success_rate(i)
    drb_success_rate.append(drbl)
    
total_dribbles['dribble_success_rate'] = drb_success_rate

#Fouls

fouls = player_filter('foul_committeds')

total_fouls = fouls.groupby(['player']).count()
total_fouls = total_fouls[['id']]
total_fouls = total_fouls.sort_values(by='id', ascending=False)
total_fouls = total_fouls.rename(columns={'id': 'fouls'})

#Miscontrols

miscontrols = player_filter('miscontrols')

total_miscontrols = miscontrols.groupby(['player']).count()
total_miscontrols = total_miscontrols[['id']]
total_miscontrols = total_miscontrols.sort_values(by='id', ascending=False)
total_miscontrols = total_miscontrols.rename(columns={'id': 'miscontrols'})

In [57]:
#Merge all the above dataframes to create one

merge = total_passes.join([total_int, total_rec, total_dribbles, total_duels, total_miscontrols, total_fouls])

# merge.to_csv("FAWSL_CDM.csv")
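Because `join` keeps the index of the left dataframe, a player with no rows in one of the other tables ends up with `NaN` in that table's columns. The sketch below shows this on toy tables and fills the missing counts with zero while leaving rate columns as `NaN`. The table contents and the `count_cols` list are illustrative assumptions.

```python
import pandas as pd

# Toy per-player tables (hypothetical names and numbers)
toy_total_passes = pd.DataFrame(
    {"passes": [10, 5]},
    index=pd.Index(["Player A", "Player B"], name="player"),
)
toy_total_dribbles = pd.DataFrame(
    {"dribbles": [3], "dribble_success_rate": [66.7]},
    index=pd.Index(["Player A"], name="player"),
)

# Left join on the index, as in the merge step above
toy_merged = toy_total_passes.join([toy_total_dribbles])

# A missing count means "no such events"; a missing rate stays undefined
count_cols = ["dribbles"]
toy_merged[count_cols] = toy_merged[count_cols].fillna(0).astype(int)
print(toy_merged)
```

Leaving the rate as `NaN` avoids implying a 0% success rate for a player who never attempted a dribble.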