# Final Feature Exploration

Having completed the initial exploration of every feature in the dataset, we now collate the features identified as potentially useful for predicting or explaining goals scored. We then explore these promising features further, either to engineer new features or to eliminate additional ones.

In [1]:
import pandas as pd 
import numpy as np
import os
import matplotlib.pyplot as plt

In [2]:
#load the att_explore dataframe in 
att_final = pd.read_csv('att_explore.csv')
att_final.head()

Unnamed: 0.1,Unnamed: 0,Player ID,Day,Matchweek,Venue,Result,Team,Opponent,Start,Position,...,RW,WB,Defenders,Midfielders,Wide Midfielders,Wingers,Penalty Success Rate,SOT Percentage,Team Goals,Pass Completion Percentage
0,10000,140,Sun,32,Away,L 1–2,Crystal Palace,Leicester City,Y,DM,...,0,0,0,1,0,0,,0.0,1,83.783784
1,24977,340,Sat,29,Away,L 1–2,Bournemouth,Liverpool,N,LW,...,0,0,0,0,0,1,,,1,71.428571
2,37756,498,Sun,37,Away,D 0–0,Huddersfield,Manchester City,Y,CM,...,0,0,0,1,0,0,,,0,58.823529
3,18759,262,Sun,34,Away,D 2–2,Southampton,Brighton,N,LM,...,0,0,0,0,1,0,,0.0,2,81.818182
4,168,3,Sun,38,Home,W 5–0,Manchester City,Norwich City,Y*,LM,...,0,0,0,0,1,0,,0.2,5,87.5


### General Features

In [3]:
#remove the 'Unnamed: 0' column
att_final = att_final.drop(columns=['Unnamed: 0'])
att_final.head()

Unnamed: 0,Player ID,Day,Matchweek,Venue,Result,Team,Opponent,Start,Position,Minutes Played,...,RW,WB,Defenders,Midfielders,Wide Midfielders,Wingers,Penalty Success Rate,SOT Percentage,Team Goals,Pass Completion Percentage
0,140,Sun,32,Away,L 1–2,Crystal Palace,Leicester City,Y,DM,90,...,0,0,0,1,0,0,,0.0,1,83.783784
1,340,Sat,29,Away,L 1–2,Bournemouth,Liverpool,N,LW,23,...,0,0,0,0,0,1,,,1,71.428571
2,498,Sun,37,Away,D 0–0,Huddersfield,Manchester City,Y,CM,90,...,0,0,0,1,0,0,,,0,58.823529
3,262,Sun,34,Away,D 2–2,Southampton,Brighton,N,LM,25,...,0,0,0,0,1,0,,0.0,2,81.818182
4,3,Sun,38,Home,W 5–0,Manchester City,Norwich City,Y*,LM,84,...,0,0,0,0,1,0,,0.2,5,87.5


From 'General_FeatureExplore', the following features were identified as promising:
* Venue - We did not see a significant difference in the proportion of goalscoring observations between 'Home' and 'Away', but we know from context that venue is generally an important feature, so we keep it for now.
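
The "proportion of goalscoring observations" used throughout this section can be computed per category with a group-by. The sketch below is a minimal illustration on toy data; the values are hypothetical and are not taken from att_explore.csv, and the helper column 'Scored' is an assumption introduced for the example.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from att_explore.csv
toy = pd.DataFrame({
    'Venue': ['Home', 'Home', 'Away', 'Away', 'Home', 'Away'],
    'Goals': [1, 0, 0, 2, 0, 0],
})

# An observation counts as "goalscoring" if the player scored at least once
# ('Scored' is a helper column introduced for this example)
toy['Scored'] = toy['Goals'] > 0

# The mean of a boolean column is the proportion of True values in each group
venue_rate = toy.groupby('Venue')['Scored'].mean()
print(venue_rate)
```

The same pattern should apply to any other categorical feature, such as 'Team', 'Opponent' or 'Start', by changing the group-by column.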

In [4]:
#we drop 'Day' and 'Matchweek' from att_final
att_final = att_final.drop(columns = ['Day', 'Matchweek'])
att_final.head()

Unnamed: 0,Player ID,Venue,Result,Team,Opponent,Start,Position,Minutes Played,Goals,Assists,...,RW,WB,Defenders,Midfielders,Wide Midfielders,Wingers,Penalty Success Rate,SOT Percentage,Team Goals,Pass Completion Percentage
0,140,Away,L 1–2,Crystal Palace,Leicester City,Y,DM,90,0,0,...,0,0,0,1,0,0,,0.0,1,83.783784
1,340,Away,L 1–2,Bournemouth,Liverpool,N,LW,23,0,0,...,0,0,0,0,0,1,,,1,71.428571
2,498,Away,D 0–0,Huddersfield,Manchester City,Y,CM,90,0,0,...,0,0,0,1,0,0,,,0,58.823529
3,262,Away,D 2–2,Southampton,Brighton,N,LM,25,0,0,...,0,0,0,0,1,0,,0.0,2,81.818182
4,3,Home,W 5–0,Manchester City,Norwich City,Y*,LM,84,0,0,...,0,0,0,0,1,0,,0.2,5,87.5


* Result - We transformed this feature by splitting it into two features: 'Outcome' (whether the game ended in a win, loss or draw) and 'Score' (the final score of the game). We then derived 'Team Goals' from 'Score' by taking the number of goals scored by the player's team in that game. The number of goals conceded turned out not to matter, so we drop 'Score' and keep only 'Outcome' and 'Team Goals'.

In [5]:
#create new dataframe with just result and goals 
result_transform = att_final[['Result']].copy()

#strip the result column of any whitespace, to make it easier to process the string 
result_transform.loc[:, 'Result'] = result_transform['Result'].str.strip()

#use str.extract method to extract the relevant strings from the result column. the purpose of this is to create two new features (outcome and score)
result_transform[['Outcome', 'Score']] = result_transform['Result'].str.extract(r'([LWD])\s+(\d+[–-]\d+)')

#drop the result column, as we no longer need this 
result_transform = result_transform.drop('Result', axis = 1)

#replace the en dash in the score column with a hyphen, so the score can be split on '-' 
result_transform['Score'] = result_transform['Score'].str.replace('\u2013', '-', regex = False)

#create 'Team Goals' column
result_transform['Team Goals'] = result_transform['Score'].str.split('-').str[0].astype(int)

#remove 'Score' column 
result_transform = result_transform.drop(columns = ['Score'])

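#append back onto att_final
#note: att_final already has a 'Team Goals' column loaded from att_explore.csv, so this adds a second copy (shown as 'Team Goals.1' in the output)
att_final = pd.concat([att_final, result_transform], axis = 1)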

#remove 'Result' from att_final
att_final = att_final.drop(columns = ['Result'])
att_final.head()

Unnamed: 0,Player ID,Venue,Team,Opponent,Start,Position,Minutes Played,Goals,Assists,Penalties Scored,...,Defenders,Midfielders,Wide Midfielders,Wingers,Penalty Success Rate,SOT Percentage,Team Goals,Pass Completion Percentage,Outcome,Team Goals.1
0,140,Away,Crystal Palace,Leicester City,Y,DM,90,0,0,0,...,0,1,0,0,,0.0,1,83.783784,L,1
1,340,Away,Bournemouth,Liverpool,N,LW,23,0,0,0,...,0,0,0,1,,,1,71.428571,L,1
2,498,Away,Huddersfield,Manchester City,Y,CM,90,0,0,0,...,0,1,0,0,,,0,58.823529,D,0
3,262,Away,Southampton,Brighton,N,LM,25,0,0,0,...,0,0,1,0,,0.0,2,81.818182,D,2
4,3,Home,Manchester City,Norwich City,Y*,LM,84,0,0,0,...,0,0,1,0,,0.2,5,87.5,W,5


* Team - Certain teams, namely the stronger teams in the league, were associated with higher proportions of goalscoring observations. We keep this feature because it also serves as an identifier for incorporating team-level statistics into the model.

* Opponent - Similarly, certain opponents were associated with higher proportions of goalscoring observations; this time they were the weaker teams in the league. We also transformed this feature by relating goalscoring observations to the opposing team's final league position. As expected, the proportion of goalscoring observations was higher against teams at the bottom of the table.
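
One way the league-position transformation might look is sketched below. It assumes a lookup from team name to final league position; the team names, positions, and the 'Opponent Position' and 'Opponent Bottom Half' columns are all hypothetical placeholders, not values or names from the original analysis.

```python
import pandas as pd

# Hypothetical lookup of final league positions (placeholder teams and values)
final_position = {'Team A': 1, 'Team B': 2, 'Team C': 20}

# Toy observations: hypothetical values, not taken from att_explore.csv
toy = pd.DataFrame({
    'Opponent': ['Team A', 'Team C', 'Team B', 'Team C'],
    'Goals': [0, 1, 0, 2],
})

# Attach the opponent's final league position to each observation
toy['Opponent Position'] = toy['Opponent'].map(final_position)

# Flag goalscoring observations and opponents that finished in the bottom half
# (a 20-team league is assumed, so positions 11-20 are the bottom half)
toy['Scored'] = toy['Goals'] > 0
toy['Opponent Bottom Half'] = toy['Opponent Position'] > 10

# Proportion of goalscoring observations against top-half vs bottom-half opponents
bottom_half_rate = toy.groupby('Opponent Bottom Half')['Scored'].mean()
print(bottom_half_rate)
```

Grouping directly on 'Opponent Position' instead of the bottom-half flag would give a finer-grained view, at the cost of fewer observations per group.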

* Start - Observations where the player started the game were associated with higher proportions of goalscoring observations. However, we need to transform this feature by merging the Y and Y* entries, because Y* (which indicates that the player started the game as captain) does not have a significant impact.

In [6]:
#replacing all Y* entries with Y in 'Start' column 
att_final['Start'] = att_final['Start'].replace('Y*', 'Y')
att_final['Start'].unique()

array(['Y', 'N'], dtype=object)

* Position - We transformed this feature by one-hot encoding it into the set of unique positions found in the data. This was necessary because the column contained a large number of distinct values: a player may play several positions in one match, and all of them are recorded. For example, a player who started at DM and then moved to RM has the entry DM, RM. We then grouped related positions (for instance, observations with a 1 in either LW or RW were marked with a 1 in Wingers). The result is a set of one-hot encoded columns in place of 'Position', with a 1 if the player played in that position in that game and a 0 otherwise.

In [None]:
#removing this observation as player didn't play 
att_final = att_final.drop(index = 7668)

#manually inputting the position for these 3 observations, as they were missing
att_final.loc[17110, 'Position'] = 'FW'
att_final.loc[3733, 'Position'] = 'FW'
att_final.loc[26879, 'Position'] = 'RW'

#performing the one-hot encoding (a comma-separated entry gets a 1 in the column for each listed position)
positions_encode = att_final['Position'].str.get_dummies(sep = ',')


#for any observation that has a 1 in 'RB', 'LB' or 'CB', we also enter 1 in 'Defender' 
positions_encode['Defender'] = positions_encode[['RB', 'LB', 'CB']].any(axis = 1).astype(int)
#we now remove 'RB', 'LB' and 'CB'
positions_encode = positions_encode.drop(columns = ['RB', 'CB', 'LB'])

#for any observation that has a 1 in 'DM' or 'CM', we also enter 1 in 'Midfielder' 
positions_encode['Midfielder'] = positions_encode[['DM', 'CM']].any(axis = 1).astype(int)
#we now remove 'DM' and 'CM'
positions_encode = positions_encode.drop(columns = ['DM', 'CM'])

#for any observation that has a 1 in 'LM' or 'RM', we also enter 1 in 'Wide Midfielder' 
positions_encode['Wide Midfielder'] = positions_encode[['LM', 'RM']].any(axis = 1).astype(int)
#we now remove 'LM' and 'RM'
positions_encode = positions_encode.drop(columns = ['LM', 'RM'])

#for any observation that has a 1 in 'LW' or 'RW', we also enter 1 in 'Winger' 
positions_encode['Winger'] = positions_encode[['LW', 'RW']].any(axis = 1).astype(int)
#we now remove 'LW' and 'RW'
positions_encode = positions_encode.drop(columns = ['LW', 'RW'])


#rename the remaining abbreviated columns to full position names
positions_encode = positions_encode.rename(columns={'AM': 'Attacking Midfielder', 'FW': 'Forward', 'WB': 'Wingback'})

#append back onto att_final
att_final = pd.concat([att_final, positions_encode], axis = 1)

#remove 'Position' from att_final
att_final = att_final.drop(columns = ['Position'])
att_final.head()
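
To make the multi-position encoding concrete, here is a small sketch on toy entries. The position strings are hypothetical, and the whitespace clean-up step is an assumption added for safety in case entries are stored as 'DM, RM' rather than 'DM,RM'; it is not part of the cell above.

```python
import pandas as pd

# Toy position entries (hypothetical); one entry deliberately has a space after the comma
toy_positions = pd.Series(['DM,RM', 'LW', 'FW, RW', 'CM'], name='Position')

# Normalise separators so that 'FW, RW' does not produce a ' RW' column
cleaned = toy_positions.str.replace(r'\s*,\s*', ',', regex=True)

# One column per unique position, with a 1 where the player played that position
encoded = cleaned.str.get_dummies(sep=',')

# Group LW and RW into a single 'Winger' indicator, mirroring the grouping step
encoded['Winger'] = encoded[['LW', 'RW']].any(axis=1).astype(int)
encoded = encoded.drop(columns=['LW', 'RW'])
print(encoded)
```

Without the clean-up step, `str.get_dummies(sep=',')` would treat ' RW' and 'RW' as different positions, so it may be worth checking the raw 'Position' values before encoding.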

* Minutes Played - We know from context that this is an important feature, although its direct relationship with goals is unclear. We keep it because it is needed to convert certain features to a per-90-minute (per90) basis.
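
A per90 conversion scales a raw count by the minutes played, so that players with different amounts of game time are comparable. The sketch below is a minimal illustration; 'Shots' and 'Shots per90' are hypothetical column names used only for this example, and the values are toy data.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values; 'Shots' is a placeholder for any count feature
toy = pd.DataFrame({
    'Minutes Played': [90, 45, 30, 0],
    'Shots': [3, 1, 2, 0],
})

# Treat zero minutes as missing so the division does not produce inf
minutes = toy['Minutes Played'].replace(0, np.nan)

# per90 value = raw count * 90 / minutes played
toy['Shots per90'] = toy['Shots'] * 90 / minutes
print(toy)
```

Note that per90 values from very short appearances can be extreme (2 shots in 30 minutes becomes 6 per90), so a minimum-minutes threshold may be worth considering before using them.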

### Performance Features