# Final Feature Exploration

Having completed initial exploration of every feature in the dataset, we now collate the features identified as potentially useful for predicting or explaining goals scored. We then explore these features further, either to engineer new features or to eliminate more of them.

In [1]:
import pandas as pd 
import numpy as np
import os
import matplotlib.pyplot as plt

In [2]:
#load the att_explore dataframe in 
att_final = pd.read_csv('att_explore.csv')
att_final.head()

Unnamed: 0.1,Unnamed: 0,Player ID,Day,Matchweek,Venue,Result,Team,Opponent,Start,Position,...,kickoff_time,own_goals,saves,selected,threat,total_points,transfers_balance,transfers_in,transfers_out,value
0,10000,140,Sun,32,Away,L 1–2,Crystal Palace,Leicester City,Y,DM,...,2022-04-10 13:00:00+00:00,0,0,6955,36.0,2,27,343,316,54
1,24977,340,Sat,29,Away,L 1–2,Bournemouth,Liverpool,N,LW,...,2020-03-07 12:30:00+00:00,0,0,24392,0.0,1,-359,133,492,52
2,37756,498,Sun,37,Away,D 0–0,Huddersfield,Manchester City,Y,CM,...,2018-05-06 12:30:00+00:00,0,0,10349,0.0,3,-47,129,176,44
3,18759,262,Sun,34,Away,D 2–2,Southampton,Brighton,N,LM,...,2022-04-24 13:00:00+00:00,0,0,19229,5.0,1,-1446,288,1734,58
4,168,3,Sun,38,Home,W 5–0,Manchester City,Norwich City,Y*,LM,...,2020-07-26 15:00:00+00:00,0,0,801163,101.0,3,198047,228466,30419,75


### General Features

In [3]:
#remove the Unnamed:0 column
att_final = att_final.drop(columns=['Unnamed: 0'])

#removing observations where 'Minutes Played = 0'
att_final = att_final[att_final['Minutes Played'] != 0]
att_final.head()

Unnamed: 0,Player ID,Day,Matchweek,Venue,Result,Team,Opponent,Start,Position,Minutes Played,...,kickoff_time,own_goals,saves,selected,threat,total_points,transfers_balance,transfers_in,transfers_out,value
0,140,Sun,32,Away,L 1–2,Crystal Palace,Leicester City,Y,DM,90,...,2022-04-10 13:00:00+00:00,0,0,6955,36.0,2,27,343,316,54
1,340,Sat,29,Away,L 1–2,Bournemouth,Liverpool,N,LW,23,...,2020-03-07 12:30:00+00:00,0,0,24392,0.0,1,-359,133,492,52
2,498,Sun,37,Away,D 0–0,Huddersfield,Manchester City,Y,CM,90,...,2018-05-06 12:30:00+00:00,0,0,10349,0.0,3,-47,129,176,44
3,262,Sun,34,Away,D 2–2,Southampton,Brighton,N,LM,25,...,2022-04-24 13:00:00+00:00,0,0,19229,5.0,1,-1446,288,1734,58
4,3,Sun,38,Home,W 5–0,Manchester City,Norwich City,Y*,LM,84,...,2020-07-26 15:00:00+00:00,0,0,801163,101.0,3,198047,228466,30419,75


From 'General_FeatureExplore', the following features were identified as promising. 
* Venue - We didn't see a significant difference in the proportion of goalscoring observations between 'Home' and 'Away', but we know contextually that the venue is generally an important feature, so we will keep it for now.

In [4]:
#we drop 'Day' and 'Matchweek' from att_final
att_final = att_final.drop(columns = ['Day', 'Matchweek'])
att_final.head()

Unnamed: 0,Player ID,Venue,Result,Team,Opponent,Start,Position,Minutes Played,Goals,Assists,...,kickoff_time,own_goals,saves,selected,threat,total_points,transfers_balance,transfers_in,transfers_out,value
0,140,Away,L 1–2,Crystal Palace,Leicester City,Y,DM,90,0,0,...,2022-04-10 13:00:00+00:00,0,0,6955,36.0,2,27,343,316,54
1,340,Away,L 1–2,Bournemouth,Liverpool,N,LW,23,0,0,...,2020-03-07 12:30:00+00:00,0,0,24392,0.0,1,-359,133,492,52
2,498,Away,D 0–0,Huddersfield,Manchester City,Y,CM,90,0,0,...,2018-05-06 12:30:00+00:00,0,0,10349,0.0,3,-47,129,176,44
3,262,Away,D 2–2,Southampton,Brighton,N,LM,25,0,0,...,2022-04-24 13:00:00+00:00,0,0,19229,5.0,1,-1446,288,1734,58
4,3,Home,W 5–0,Manchester City,Norwich City,Y*,LM,84,0,0,...,2020-07-26 15:00:00+00:00,0,0,801163,101.0,3,198047,228466,30419,75


* Result - We split this feature into two: 'Outcome' (whether the game ended in a win, loss or draw) and 'Score' (the final score of the game). We then derived 'Team Goals' from 'Score', as the number of goals scored by the player's team in that game. The number of goals conceded in a game turned out not to matter, so we drop 'Score' and keep only 'Outcome' and 'Team Goals'.

In [5]:
#create new dataframe with just the 'Result' column
result_transform = att_final[['Result']].copy()

#strip the result column of any whitespace, to make it easier to process the string 
result_transform.loc[:, 'Result'] = result_transform['Result'].str.strip()

#use str.extract method to extract the relevant strings from the result column. the purpose of this is to create two new features (outcome and score)
result_transform[['Outcome', 'Score']] = result_transform['Result'].str.extract(r'([LWD])\s+(\d+[–-]\d+)')

#drop the result column, as we no longer need this 
result_transform = result_transform.drop('Result', axis = 1)

#replace the en dash in the score column with a hyphen, to make it easier to work with in the future 
result_transform['Score'] = result_transform['Score'].str.replace('\u2013', '-', regex = False)

#create 'Team Goals' column
result_transform['Team Goals'] = result_transform['Score'].str.split('-').str[0].astype(int)

#remove 'Score' column 
result_transform = result_transform.drop(columns = ['Score'])

#append back onto att_final
att_final = pd.concat([att_final, result_transform], axis = 1)

#remove 'Result' from att_final
att_final = att_final.drop(columns = ['Result'])
att_final.head()

Unnamed: 0,Player ID,Venue,Team,Opponent,Start,Position,Minutes Played,Goals,Assists,Penalties Scored,...,saves,selected,threat,total_points,transfers_balance,transfers_in,transfers_out,value,Outcome,Team Goals
0,140,Away,Crystal Palace,Leicester City,Y,DM,90,0,0,0,...,0,6955,36.0,2,27,343,316,54,L,1
1,340,Away,Bournemouth,Liverpool,N,LW,23,0,0,0,...,0,24392,0.0,1,-359,133,492,52,L,1
2,498,Away,Huddersfield,Manchester City,Y,CM,90,0,0,0,...,0,10349,0.0,3,-47,129,176,44,D,0
3,262,Away,Southampton,Brighton,N,LM,25,0,0,0,...,0,19229,5.0,1,-1446,288,1734,58,D,2
4,3,Home,Manchester City,Norwich City,Y*,LM,84,0,0,0,...,0,801163,101.0,3,198047,228466,30419,75,W,5
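
The parsing step above can be checked on a handful of strings before it is applied to the full dataframe. The following is a minimal sketch on toy data, using a variant of the pattern that captures the two goal counts separately; the 'Opponent Goals' column name is an assumption introduced only for illustration. It also counts rows that fail to match, since those would come back as NaN and break the cast to `int`.

```python
import pandas as pd

# Toy 'Result' strings in the same shape as the dataset (en dash or hyphen between scores).
results = pd.Series(['L 1–2', 'W 5–0', ' D 0–0 ', 'W 2-1'], name='Result')

# Variant of the pattern above: outcome letter, whitespace, team goals, dash, opponent goals.
parsed = results.str.strip().str.extract(r'([LWD])\s+(\d+)[–-](\d+)')
parsed.columns = ['Outcome', 'Team Goals', 'Opponent Goals']  # last name is hypothetical

# Rows that do not match the pattern are returned as NaN, so count them before casting.
unparsed = parsed['Outcome'].isna().sum()
parsed['Team Goals'] = parsed['Team Goals'].astype(int)

print(parsed[['Outcome', 'Team Goals']])
print('unparsed rows:', unparsed)
```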


* Team - Certain teams were associated with higher proportions of goalscoring observations, and these are the stronger teams in the league. We should keep this feature, as it is a useful identifier for incorporating team statistics into the model.

* Opponent - Once again, we saw that certain teams were associated with higher proportions of goalscoring observations, where the teams in question this time are the weaker teams in the league. We also did some feature transformation by looking at the relationship between goalscoring observations and the final league position of the opposing team. Here, we saw that there is a higher proportion of goalscoring observations when playing against teams at the bottom of the table, which is what we expected. 

* Start - Observations where the player started the game had a higher proportion of goalscoring observations. However, we need to transform this feature by combining the Y and Y* entries, because Y* (which indicates that a player started the game as captain) makes no meaningful difference.

In [6]:
#replacing all Y* entries with Y in 'Start' column 
att_final['Start'] = att_final['Start'].replace('Y*', 'Y')
att_final['Start'].unique()

array(['Y', 'N'], dtype=object)
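
The opponent league-position transformation described in the 'Opponent' bullet is not shown in this notebook. One way it could be sketched is as a lookup merge, assuming a table with one row per team and season. The table, the team names, the positions and the 'Opponent Final Position' column name are all invented for illustration, and the sketch assumes a 'Season' column like the one derived later in this notebook.

```python
import pandas as pd

# Toy match rows; team names are placeholders, not real data.
toy = pd.DataFrame({
    'Opponent': ['Team X', 'Team Y', 'Team X', 'Team Z'],
    'Season': ['2020-2021', '2020-2021', '2021-2022', '2021-2022'],
})

# Hypothetical lookup table: final league position per team and season (invented values).
table = pd.DataFrame({
    'Opponent': ['Team X', 'Team Y', 'Team X', 'Team Z'],
    'Season': ['2020-2021', '2020-2021', '2021-2022', '2021-2022'],
    'Opponent Final Position': [1, 18, 3, 20],
})

# A left merge keeps every match row and leaves NaN where the lookup has no entry.
toy = toy.merge(table, on=['Opponent', 'Season'], how='left')
print(toy)
```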

* Position - A player may play several positions in one match, and these are recorded together (for example, a player who started at DM and moved to RM has the entry DM, RM), so this column has a large number of unique values. We therefore one-hot encoded it into the individual positions that appear in the data, then combined related positions into groups (for example, an observation with a 1 in either LW or RW gets a 1 in 'Winger'). The result is a set of one-hot encoded columns, with a 1 if the player played in that position group in that game and a 0 otherwise.

In [7]:
#manually inputting the position for these 3 observations, as they were missing
att_final.loc[12664, 'Position'] = 'FW'
att_final.loc[16504, 'Position'] = 'FW'
att_final.loc[25979, 'Position'] = 'RW'

#performing the one-hot encoding (entries with several positions are split on the comma)
positions_encode = att_final['Position'].str.get_dummies(sep = ',')


#for any observation that has a 1 in 'RB', 'LB' or 'CB', we also enter 1 in 'Defender' 
positions_encode['Defender'] = positions_encode[['RB', 'LB', 'CB']].any(axis = 1).astype(int)
#we now remove 'RB', 'LB' and 'CB'
positions_encode = positions_encode.drop(columns = ['RB', 'CB', 'LB'])

#for any observation that has a 1 in 'DM' or 'CM', we also enter 1 in 'Midfielder' 
positions_encode['Midfielder'] = positions_encode[['DM', 'CM']].any(axis = 1).astype(int)
#we now remove 'DM' and 'CM'
positions_encode = positions_encode.drop(columns = ['DM', 'CM'])

#for any observation that has a 1 in 'LM' or 'RM', we also enter 1 in 'Wide Midfielder' 
positions_encode['Wide Midfielder'] = positions_encode[['LM', 'RM']].any(axis = 1).astype(int)
#we now remove 'LM' and 'RM'
positions_encode = positions_encode.drop(columns = ['LM', 'RM'])

#for any observation that has a 1 in 'LW' or 'RW', we also enter 1 in 'Winger' 
positions_encode['Winger'] = positions_encode[['LW', 'RW']].any(axis = 1).astype(int)
#we now remove 'LW' and 'RW'
positions_encode = positions_encode.drop(columns = ['LW', 'RW'])


#rename the remaining single-position columns to their full names
positions_encode = positions_encode.rename(columns={
    'AM': 'Attacking Midfielder',
    'FW': 'Forward',
    'WB': 'Wingback'
})

#append back onto att_final
att_final = pd.concat([att_final, positions_encode], axis = 1)

#remove 'Position' from att_final
att_final = att_final.drop(columns = ['Position'])
att_final.head()

Unnamed: 0,Player ID,Venue,Team,Opponent,Start,Minutes Played,Goals,Assists,Penalties Scored,Penalties Attempted,...,value,Outcome,Team Goals,Attacking Midfielder,Forward,Wingback,Defender,Midfielder,Wide Midfielder,Winger
0,140,Away,Crystal Palace,Leicester City,Y,90,0,0,0,0,...,54,L,1,0,0,0,0,1,0,0
1,340,Away,Bournemouth,Liverpool,N,23,0,0,0,0,...,52,L,1,0,0,0,0,0,0,1
2,498,Away,Huddersfield,Manchester City,Y,90,0,0,0,0,...,44,D,0,0,0,0,0,1,0,0
3,262,Away,Southampton,Brighton,N,25,0,0,0,0,...,58,D,2,0,0,0,0,0,1,0
4,3,Home,Manchester City,Norwich City,Y,84,0,0,0,0,...,75,W,5,0,0,0,0,0,1,0
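
To make the encoding above concrete, here is a minimal sketch on toy position strings. It strips spaces before splitting, on the assumption that some entries might be written as 'DM, RM' rather than 'DM,RM'. The `groups` dictionary is a hypothetical helper that mirrors some of the merges performed above.

```python
import pandas as pd

# Toy position strings; a comma marks a player who changed position during the game.
positions = pd.Series(['DM', 'LW', 'DM, RM', 'FW,RW'], name='Position')

# Remove spaces first so 'DM, RM' and 'DM,RM' produce the same columns.
encoded = positions.str.replace(' ', '', regex=False).str.get_dummies(sep=',')

# Hypothetical grouping table mirroring the merges done above.
groups = {
    'Midfielder': ['DM', 'CM'],
    'Wide Midfielder': ['LM', 'RM'],
    'Winger': ['LW', 'RW'],
}
for new_col, old_cols in groups.items():
    present = [c for c in old_cols if c in encoded.columns]  # tolerate absent positions
    encoded[new_col] = encoded[present].any(axis=1).astype(int)
    encoded = encoded.drop(columns=present)

print(encoded)
```

Because the toy data contains no 'CM', 'LM' or similar entries, the `present` filter avoids a `KeyError`; the notebook's own code does not need this guard as long as every listed position occurs in the data.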


* Minutes Played - We know from contextual information that this is an important feature, although we are unsure about its direct relationship with goals. We will keep it because it is needed to convert other features into per-90 values.
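
As an illustration of the per-90 conversion mentioned above, a minimal sketch might look like the following. The 'Shots' column name and the 'Shots per 90' output name are assumptions, and the values are toy data. Division by zero is not a concern here because observations with zero minutes played were removed earlier.

```python
import pandas as pd

# Toy rows; 'Shots' is an assumed column name, 'Minutes Played' is taken from the dataset.
df = pd.DataFrame({'Minutes Played': [90, 23, 45], 'Shots': [3, 1, 0]})

# Per-90 rate: the raw count scaled up (or down) to a full 90-minute match.
df['Shots per 90'] = df['Shots'] * 90 / df['Minutes Played']
print(df)
```

Short substitute appearances can produce large per-90 values from a single event, so these rates may need to be read alongside 'Minutes Played' itself.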

### Performance Features

* Penalties Attempted - Players who attempted a penalty in a game scored a goal almost 80% of the time. In other words, if we know a player is going to have a penalty attempt in a game, the probability of them scoring a goal is quite high. However, this isn't necessarily the most relevant information. Penalty Success Rate (the proportion of successful penalties over penalties attempted) didn't provide any additional information either. We did, however, engineer a feature that marks certain Player IDs as 'Designated Penalty Takers', and the proportion of goalscoring observations was much higher for designated penalty takers than for non-designated ones. Based on this, we will include the 'Penalties Attempted' and 'Designated Penalty Taker' features. The 'Penalties Scored' feature will be removed for now.

(As part of the 'Designated Penalty Takers' feature engineering, we also needed to add another feature called 'Season'. This is essentially just a simplified version of the kickoff_time feature, with the date and time of the match stripped away)


In [8]:
# Convert 'kickoff_time' to datetime
att_final['kickoff_time'] = pd.to_datetime(att_final['kickoff_time'])

# Function to determine the season
def determine_season(kickoff_time):
    month = kickoff_time.month
    year = kickoff_time.year
    if month >= 8:  # August to December
        return f'{year}-{year + 1}'  # Current year to next year
    else:  # January to July
        return f'{year - 1}-{year}'  # Previous year to current year

# Apply the function to create the 'season' column
att_final['Season'] = att_final['kickoff_time'].apply(determine_season)
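
The same August cut-off rule can also be written without `apply`. This is a minimal vectorized sketch, checked on two kickoff times that appear in the data above plus one hypothetical August date; it is an alternative formulation, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

kickoffs = pd.to_datetime(pd.Series([
    '2022-04-10 13:00:00+00:00',  # April, so the season started the previous year
    '2020-07-26 15:00:00+00:00',  # July, so the season started the previous year
    '2019-08-10 14:00:00+00:00',  # hypothetical August date, season starts this year
]))

# Season start year: the kickoff year if month >= 8, otherwise the year before.
start_year = pd.Series(
    np.where(kickoffs.dt.month >= 8, kickoffs.dt.year, kickoffs.dt.year - 1),
    index=kickoffs.index,
)
seasons = start_year.astype(str) + '-' + (start_year + 1).astype(str)
print(seasons.tolist())
```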

In [9]:
#total penalties attempted by each Player ID
pen_group = att_final.groupby('Player ID', as_index = False)['Penalties Attempted'].sum()

#remove obs with 0 penalties attempted 
pen_group = pen_group[pen_group['Penalties Attempted'] > 0 ]

#create new dataframe which has 'kickoff_time', 'Season', 'Penalties Attempted' and 'Team' in it 
team_pens = att_final[['kickoff_time', 'Season', 'Penalties Attempted', 'Team']].copy()

#now we group by team and season to compute how many penalties were taken by each team in each season
team_pens_summary = team_pens.groupby(['Season', 'Team'], as_index=False)['Penalties Attempted'].sum()
team_pens_summary.rename(columns={'Penalties Attempted': 'Team Penalties'}, inplace=True)

#keep only the observations with at least 1 penalty taken, along with the Player ID, team and season they belong to 
#(this selects the same rows as looping over every team and season and concatenating the filtered results)
pen_prop = att_final.loc[
    att_final['Penalties Attempted'] > 0,
    ['Player ID', 'Penalties Attempted', 'Team', 'Season']
].reset_index(drop=True)

#adding a new column into pen_prop called 'Team Penalties' which merges the relevant information from team_pens_summary
pen_prop = pen_prop.merge(team_pens_summary, on=['Team', 'Season'], how='left')

#we now merge rows that have the same player ID, team and season together. For the rows that satisfy this, we sum the penalties attempted to 
#reflect the number of penalties a particular player ID took in a given season 
merged_penprop = pen_prop.groupby(['Team', 'Season', 'Player ID'], as_index=False).agg({
    'Penalties Attempted': 'sum',
    'Team Penalties': 'first'  
})
merged_penprop = merged_penprop.sort_values(by='Player ID')

#adding new column called Proportion of Team Penalties Taken
merged_penprop['Proportion of Team Penalties Taken'] = (
    merged_penprop['Penalties Attempted'] / merged_penprop['Team Penalties']
)

#final dataframe which merges the rows based on Player ID. Each row now corresponds to one unique player ID, the penalties attempted and team 
#penalties columns are now summed. The proportion is then recalculated 
penprop_summary = merged_penprop.groupby('Player ID').agg(
    Penalties_Attempted=('Penalties Attempted', 'sum'),
    Team_Penalties=('Team Penalties', 'sum')
).reset_index()

penprop_summary['Proportion of Team Penalties Taken'] = (
    penprop_summary['Penalties_Attempted'] / penprop_summary['Team_Penalties'])

#first, we include all Player IDs that took 100% of their team's penalties as 'designated penalty takers'
desig_pen_takers = penprop_summary.loc[penprop_summary['Proportion of Team Penalties Taken'] == 1, 'Player ID'].tolist()

#we now look at the rest of the observations. let's remove the Player IDs that are already included in desig_pen_takers from penprop_summary 
#for clarity 
penprop_summary = penprop_summary[~penprop_summary['Player ID'].isin(desig_pen_takers)]
penprop_summary = penprop_summary.sort_values(by='Penalties_Attempted', ascending=False)

#we now add the Player IDs of players that took more than 50% of their team's penalties 
additional_takers = penprop_summary.loc[penprop_summary['Proportion of Team Penalties Taken'] > 0.5, 'Player ID'].tolist()
desig_pen_takers.extend(additional_takers)

#we now remove the rows corresponding to the Player IDs that we just added to desig_pen_takers
penprop_summary = penprop_summary[~penprop_summary['Player ID'].isin(desig_pen_takers)]

#construct 'Designated Penalty Taker' feature 
att_final['Designated Penalty Taker'] = att_final['Player ID'].isin(desig_pen_takers).astype(int)
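
The penalty-share calculation above can be traced on a small example. The following is a compact sketch on invented toy data that uses `groupby(...).transform('sum')` in place of the separate team summary and merge; the 'Share' column name is an assumption. Because every player with a share of exactly 1 also exceeds 0.5, a single `> 0.5` filter selects the same players as the two-step selection above.

```python
import pandas as pd

# Toy match-level rows; all values are invented for illustration.
toy = pd.DataFrame({
    'Player ID': [1, 1, 2, 3, 3, 4],
    'Team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Season': ['2020-2021'] * 6,
    'Penalties Attempted': [1, 2, 1, 1, 0, 0],
})

# Penalties per player within each team-season, with the team-season total alongside.
per_player = toy.groupby(['Team', 'Season', 'Player ID'], as_index=False)['Penalties Attempted'].sum()
per_player['Team Penalties'] = per_player.groupby(['Team', 'Season'])['Penalties Attempted'].transform('sum')

# Keep actual takers, pool across seasons per player, and recompute the share.
takers = per_player[per_player['Penalties Attempted'] > 0]
summary = takers.groupby('Player ID')[['Penalties Attempted', 'Team Penalties']].sum()
summary['Share'] = summary['Penalties Attempted'] / summary['Team Penalties']

# Same threshold as above: more than half of the team's penalties.
designated = summary.index[summary['Share'] > 0.5].tolist()
print(designated)
```

Here player 1 took 3 of team A's 4 penalties and player 3 took team B's only penalty, so both are marked as designated takers, while player 2 (1 of 4) is not.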

* Shots - Shots was a useful predictor of goals, as expected. 
* Shots on Target - SOT (Shots on Target) was also a useful predictor of goals, and potentially a less noisy one, so it may be more valuable than Shots in the model. The two features are correlated, but not strongly enough to justify keeping only one of them in the final list. 
* Yellow/Red Cards - Neither yellow nor red cards were good predictors of goals, so we remove these features. 

In [10]:
#remove 'Yellow Cards' and 'Red Cards' from att_final
att_final = att_final.drop(columns = ['Yellow Cards', 'Red Cards'])
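
The correlation check between Shots and Shots on Target mentioned above could be reproduced along the following lines. This is a minimal sketch on toy data; the column names 'Shots' and 'Shots on Target' are assumptions about how these features are labelled in `att_final`, and the numbers are invented.

```python
import pandas as pd

# Toy data; column names are assumed and values are invented for illustration.
toy = pd.DataFrame({
    'Shots': [0, 1, 2, 3, 4, 5],
    'Shots on Target': [0, 0, 1, 1, 3, 2],
    'Goals': [0, 0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations between the two shot features and goals.
corr = toy[['Shots', 'Shots on Target', 'Goals']].corr()
print(corr.round(2))
```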