## Transfers Capstone - Data Cleaning

### Data Collection: 2016 - 2020 Transfers Data
Data sourced from: https://github.com/fivethirtyeight/data/tree/master/soccer-spi

 - The FiveThirtyEight soccer ranking data is compiled from ESPN and the Engsoccerdata GitHub repository. Their ranking indexes clubs using a model that analyzes offensive and defensive production, based on goals for and goals against in a given match. More on the index and ranking data here: https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/
 - For my purposes, I am really interested in the home and away SPI ratings for the EPL and EFL-C clubs during 2019 through 2020.
     - Engsoccerdata GitHub (for reference only): https://github.com/jalapic/engsoccerdata

### I. Loading Data - Home & Away SPI Data for PL & EFL-C

In [3]:
# Loading packages and reading the SPI match data (all leagues and seasons) into one DataFrame "spi"
import pandas as pd
import glob
import os
import numpy as np
from pandas_profiling import ProfileReport

path = "/home/tdraths/sb_assignments/Transfers_Capstone/data/original_data_sources/spi_matches.csv"
spi = pd.read_csv(path)

spi.head()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,...,,,2.0,0.0,,,,,,
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,...,,,2.0,0.0,,,,,,
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,...,,,1.0,1.0,,,,,,
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.0,52.35,0.4289,0.2699,...,,,0.0,0.0,,,,,,
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,...,,,1.0,2.0,,,,,,


### II. Examining the data

In [7]:
spi.columns

Index(['season', 'date', 'league_id', 'league', 'team1', 'team2', 'spi1',
       'spi2', 'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2',
       'importance1', 'importance2', 'score1', 'score2', 'xg1', 'xg2', 'nsxg1',
       'nsxg2', 'adj_score1', 'adj_score2'],
      dtype='object')

In [8]:
spi.league.value_counts()
# Before I start dropping columns and messing with the data, I'm going to subset spi 
# I only need EFL-C and PL data, seen here as "English League Championship" and "Barclays Premier League"

English League Championship                 2223
Italy Serie A                               1900
Spanish Primera Division                    1900
Barclays Premier League                     1900
French Ligue 1                              1900
Spanish Segunda Division                    1865
Italy Serie B                               1594
English League Two                          1554
German Bundesliga                           1530
French Ligue 2                              1520
Brasileiro Série A                          1520
English League One                          1514
United Soccer League                        1487
Major League Soccer                         1459
Turkish Turkcell Super Lig                  1338
Portuguese Liga                             1224
German 2. Bundesliga                        1224
Dutch Eredivisie                            1224
Argentina Primera Division                   979
Russian Premier Liga                         960
Swedish Allsvenskan 

In [11]:
efl_c = spi[spi['league'] == 'English League Championship']
epl = spi[spi['league'] == 'Barclays Premier League']

spi = pd.concat([efl_c, epl])
display(spi.head())
spi.tail()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
2992,2017,2017-08-04,2412,English League Championship,Sunderland,Derby County,50.39,40.83,0.5266,0.2184,...,,,1.0,1.0,2.24,1.23,1.92,1.38,1.05,1.05
2994,2017,2017-08-04,2412,English League Championship,Nottingham Forest,Millwall,35.55,28.23,0.5149,0.2186,...,,,1.0,0.0,0.45,3.49,1.26,2.73,1.05,0.0
3004,2017,2017-08-05,2412,English League Championship,Sheffield United,Brentford,27.72,39.7,0.3031,0.4486,...,,,1.0,0.0,0.72,1.84,0.97,1.43,1.05,0.0
3005,2017,2017-08-05,2412,English League Championship,Queens Park Rangers,Reading,36.33,34.9,0.442,0.2823,...,,,2.0,0.0,2.15,0.29,1.27,0.51,2.1,0.0
3006,2017,2017-08-05,2412,English League Championship,Fulham,Norwich City,43.0,42.6,0.4434,0.3142,...,,,1.0,1.0,1.19,1.71,2.35,1.88,1.05,1.05


Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
42041,2020,2021-05-23,2411,Barclays Premier League,Manchester City,Everton,94.16,75.8,0.7825,0.0736,...,,,,,,,,,,
42042,2020,2021-05-23,2411,Barclays Premier League,Liverpool,Crystal Palace,91.38,69.71,0.7854,0.0661,...,,,,,,,,,,
42043,2020,2021-05-23,2411,Barclays Premier League,Wolverhampton,Manchester United,78.46,85.38,0.3142,0.4122,...,,,,,,,,,,
42044,2020,2021-05-23,2411,Barclays Premier League,Arsenal,Brighton and Hove Albion,78.83,70.26,0.5391,0.2142,...,,,,,,,,,,
42045,2020,2021-05-23,2411,Barclays Premier League,West Ham United,Southampton,69.01,70.94,0.4094,0.3385,...,,,,,,,,,,
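
The two filters plus `pd.concat` above can also be expressed as a single `isin` mask. This is only a sketch on a toy frame (the rows and the names `toy`, `keep`, and `subset` are made up for illustration). One difference to be aware of: `isin` keeps the original row order instead of stacking all Championship rows ahead of the Premier League rows.

```python
import pandas as pd

# Toy stand-in for the full spi DataFrame; the rows are illustrative only.
toy = pd.DataFrame({
    "league": ["Barclays Premier League", "Italy Serie A",
               "English League Championship", "French Ligue 1"],
    "team1": ["Arsenal", "Juventus", "Fulham", "Lyon"],
})

keep = ["English League Championship", "Barclays Premier League"]

# One boolean mask instead of two filters plus pd.concat.
# .copy() makes the subset independent of the original frame.
subset = toy[toy["league"].isin(keep)].copy()
print(subset)
```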


In [12]:
display(spi.dtypes)
spi.info()

season           int64
date            object
league_id        int64
league          object
team1           object
team2           object
spi1           float64
spi2           float64
prob1          float64
prob2          float64
probtie        float64
proj_score1    float64
proj_score2    float64
importance1    float64
importance2    float64
score1         float64
score2         float64
xg1            float64
xg2            float64
nsxg1          float64
nsxg2          float64
adj_score1     float64
adj_score2     float64
dtype: object

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4123 entries, 2992 to 42045
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       4123 non-null   int64  
 1   date         4123 non-null   object 
 2   league_id    4123 non-null   int64  
 3   league       4123 non-null   object 
 4   team1        4123 non-null   object 
 5   team2        4123 non-null   object 
 6   spi1         4123 non-null   float64
 7   spi2         4123 non-null   float64
 8   prob1        4123 non-null   float64
 9   prob2        4123 non-null   float64
 10  probtie      4123 non-null   float64
 11  proj_score1  4123 non-null   float64
 12  proj_score2  4123 non-null   float64
 13  importance1  2929 non-null   float64
 14  importance2  2929 non-null   float64
 15  score1       3233 non-null   float64
 16  score2       3233 non-null   float64
 17  xg1          3229 non-null   float64
 18  xg2          3229 non-null   float64
 19  ns

In [13]:
spi.shape

(4123, 23)

In [14]:
spi.nunique()

season            5
date            690
league_id         2
league            2
team1            51
team2            51
spi1           2448
spi2           2479
prob1          2888
prob2          2809
probtie        1347
proj_score1     253
proj_score2     253
importance1     782
importance2     771
score1            9
score2            9
xg1             385
xg2             342
nsxg1           351
nsxg2           317
adj_score1      232
adj_score2      173
dtype: int64

In [20]:
display(spi.league.value_counts())
spi.season.value_counts()
# Looks like I'm missing the English League Championship data for 2016
# Might have to disregard 2016 from my analysis when I combine spi with my transfers data

English League Championship    2223
Barclays Premier League        1900
Name: league, dtype: int64

2017    937
2018    937
2019    937
2020    932
2016    380
Name: season, dtype: int64
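
The 2016 count of 380 is exactly one 20-club Premier League season (20 × 19 matches), which is consistent with the Championship being absent that year. Cross-tabulating season against league would show this directly. A minimal sketch on toy data (the frame `toy` is invented for illustration; on the real data the same call would be `pd.crosstab(spi["season"], spi["league"])`):

```python
import pandas as pd

# Toy data: 2016 has Premier League rows only, 2017 has both leagues.
toy = pd.DataFrame({
    "season": [2016, 2016, 2017, 2017, 2017],
    "league": ["Barclays Premier League", "Barclays Premier League",
               "Barclays Premier League", "English League Championship",
               "English League Championship"],
})

# Rows are seasons, columns are leagues, cells are match counts.
# A zero marks a season/league combination with no data.
counts = pd.crosstab(toy["season"], toy["league"])
print(counts)
```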

In [17]:
spi.isna().sum()

season            0
date              0
league_id         0
league            0
team1             0
team2             0
spi1              0
spi2              0
prob1             0
prob2             0
probtie           0
proj_score1       0
proj_score2       0
importance1    1194
importance2    1194
score1          890
score2          890
xg1             894
xg2             894
nsxg1           894
nsxg2           894
adj_score1      894
adj_score2      894
dtype: int64

In [21]:
spi_nulls = pd.DataFrame(spi.isnull().sum().sort_values(ascending=False) / len(spi),
                        columns=['percent'])
percent_null = spi_nulls['percent'] > 0
spi_nulls[percent_null]

Unnamed: 0,percent
importance2,0.289595
importance1,0.289595
adj_score2,0.216832
adj_score1,0.216832
nsxg2,0.216832
nsxg1,0.216832
xg2,0.216832
xg1,0.216832
score2,0.215862
score1,0.215862
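
The same null fractions can be computed a little more compactly with `isna().mean()`, since the mean of a boolean column is the share of `True` values. A sketch on a toy frame (values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: spi1 is complete, the other two columns have gaps.
toy = pd.DataFrame({
    "spi1": [50.0, 60.0, 70.0, 80.0],
    "xg1": [1.2, np.nan, 0.8, np.nan],
    "importance1": [np.nan, np.nan, np.nan, 30.0],
})

# Fraction of missing values per column, largest first.
null_share = toy.isna().mean().sort_values(ascending=False)

# Keep only the columns that actually have missing values.
null_share = null_share[null_share > 0]
print(null_share)
```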


*__Missing Data__*
- I have a lot of missing data in the last ten columns of 'spi'
- They also happen to be columns I'm not using for my analysis, as they deal with specific matches
- I also will not use 'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2'

In [23]:
columns = ['importance2', 'importance1', 'adj_score2', 'nsxg2',
           'nsxg1', 'xg2', 'xg1', 'adj_score1', 'score2', 'score1',
           'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2']

spi.drop(columns, inplace=True, axis=1)
spi.columns

Index(['season', 'date', 'league_id', 'league', 'team1', 'team2', 'spi1',
       'spi2'],
      dtype='object')

##### What's happened so far:
 - I used a data set with match day specifics for teams in a wide range of leagues across the globe, which I've narrowed down to only the EPL and EFL-C.
 - I have data for multiple seasons, which match those of the transfers dataset from the previous step, and I have SPI scores for each team, home and away, for each game per season.
 - The SPI scores will help me assess how much a club's performance has improved (or not) using an objective metric that takes into account only the team's performance.
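
To make the last point concrete, here is one possible way to turn match-level SPI into a per-club, per-season figure. This is a sketch under my own assumptions, not a settled method: it stacks the `team1` (home) and `team2` (away) ratings into one long table and takes the season mean, and the toy values and the names `long`, `season_spi`, and `change` are made up for illustration.

```python
import pandas as pd

# Toy match rows with the same column names as the cleaned spi frame.
toy = pd.DataFrame({
    "season": [2017, 2017, 2018, 2018],
    "team1": ["Fulham", "Reading", "Fulham", "Reading"],
    "team2": ["Reading", "Fulham", "Reading", "Fulham"],
    "spi1": [43.0, 35.0, 60.0, 33.0],
    "spi2": [34.0, 45.0, 32.0, 62.0],
})

# Stack home and away ratings so each row is one club in one match.
home = toy[["season", "team1", "spi1"]].rename(columns={"team1": "club", "spi1": "spi"})
away = toy[["season", "team2", "spi2"]].rename(columns={"team2": "club", "spi2": "spi"})
long = pd.concat([home, away], ignore_index=True)

# Mean SPI per club per season, one column per season.
season_spi = long.groupby(["club", "season"])["spi"].mean().unstack("season")

# Season-over-season change (assumed metric: difference of season means).
season_spi["change"] = season_spi[2018] - season_spi[2017]
print(season_spi)
```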

### III. Fixing club name issues
 - I want to standardize my club names since I'll be combining two dataframes, and I do not want to lose any data when the club names are merged
 - I'll build a dictionary of the new names I want for my clubs, and add them to the dataframe
 - I'll keep the original club names as well, just in case I need them for reference or future use

In [24]:
spi.team1.value_counts()

Middlesbrough               112
Swansea City                108
Stoke City                  107
West Bromwich Albion        104
AFC Bournemouth              99
Watford                      99
Liverpool                    95
Chelsea                      95
Manchester United            95
Burnley                      95
Crystal Palace               95
Southampton                  95
Tottenham Hotspur            95
Leicester City               95
West Ham United              95
Arsenal                      95
Manchester City              95
Everton                      95
Derby County                 94
Brentford                    94
Birmingham                   92
Preston North End            92
Bristol City                 92
Reading                      92
Nottingham Forest            92
Millwall                     92
Queens Park Rangers          92
Sheffield Wednesday          92
Cardiff City                 89
Leeds United                 89
Norwich City                 88
Hull Cit

In [25]:
spi_names_short = {'Middlesbrough': 'MID', 'Swansea City': 'SWA', 'Stoke City': 'STO',
               'West Bromwich Albion': 'WBA', 'AFC Bournemouth': 'BOU', 'Watford': 'WAT',
               'Tottenham Hotspur':'TOT', 'Burnley': 'BUR', 'Manchester United': 'MNU',
               'Liverpool': 'LIV', 'Leicester City': 'LEI','Arsenal':'ARS','Crystal Palace': 'CRY',
               'Manchester City': 'MNC', 'West Ham United': 'WHU', 'Everton': 'EVE', 'Southampton': 'SOU',
               'Chelsea': 'CHE', 'Derby County': 'DER', 'Brentford': 'BRE', 'Birmingham': 'BRM', 
               'Bristol City': 'BRS', 'Queens Park Rangers': 'QPR', 'Preston North End':'PRE', 
               'Nottingham Forest': 'NOT', 'Millwall': 'MIL', 'Sheffield Wednesday': 'SHW', 'Reading': 'REA',
               'Leeds United': 'LEE', 'Cardiff City': 'CAR', 'Hull City': 'HUL', 'Aston Villa': 'AST', 
               'Norwich City': 'NOR', 'Fulham': 'FUL', 'Huddersfield Town': 'HUD', 'Sheffield United': 'SHU',
               'Wolverhampton': 'WLV', 'Newcastle': 'NEW', 'Brighton and Hove Albion': 'BHA',
               'Blackburn': 'BLA', 'Barnsley': 'BAR', 'Ipswich Town': 'IPS', 'Wigan': 'WIG', 
               'Rotherham United': 'ROT', 'Bolton': 'BOL', 'Luton Town': 'LUT', 'Sunderland': 'SUN', 
               'Burton Albion': 'BRT', 'Charlton Athletic': 'CHA', 'Coventry City': 'COV', 
               'Wycombe Wanderers': 'WYC'}

In [27]:
spi['team1_short'] = spi['team1'].replace(spi_names_short)
spi['team2_short'] = spi['team2'].replace(spi_names_short)
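
Because `replace` leaves any name that is not in the dictionary untouched, a missing key would pass silently and keep its long name. A quick check with `Series.map`, which returns NaN for unmapped values, can surface such gaps. The sketch below uses a small illustrative dictionary and series (`name_map` and `teams` are stand-ins, not the full mapping above):

```python
import pandas as pd

# Small illustrative subset of a name -> short code mapping.
name_map = {"Fulham": "FUL", "Reading": "REA"}
teams = pd.Series(["Fulham", "Reading", "Hull City"])

# Series.map returns NaN for names missing from the dictionary,
# whereas Series.replace would silently keep the long name.
mapped = teams.map(name_map)

# Names that still need an entry in the dictionary.
unmapped = sorted(teams[mapped.isna()].unique())
print(unmapped)
```

An empty list here would mean every club name found a short code.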

In [29]:
display(spi.team1.value_counts())
spi.team1_short.value_counts()

Middlesbrough               112
Swansea City                108
Stoke City                  107
West Bromwich Albion        104
AFC Bournemouth              99
Watford                      99
Liverpool                    95
Chelsea                      95
Manchester United            95
Burnley                      95
Crystal Palace               95
Southampton                  95
Tottenham Hotspur            95
Leicester City               95
West Ham United              95
Arsenal                      95
Manchester City              95
Everton                      95
Derby County                 94
Brentford                    94
Birmingham                   92
Preston North End            92
Bristol City                 92
Reading                      92
Nottingham Forest            92
Millwall                     92
Queens Park Rangers          92
Sheffield Wednesday          92
Cardiff City                 89
Leeds United                 89
Norwich City                 88
Hull Cit

MID    112
SWA    108
STO    107
WBA    104
BOU     99
WAT     99
LEI     95
EVE     95
SOU     95
ARS     95
CHE     95
MNU     95
CRY     95
TOT     95
LIV     95
WHU     95
BUR     95
MNC     95
DER     94
BRE     94
NOT     92
QPR     92
SHW     92
BRM     92
BRS     92
MIL     92
PRE     92
REA     92
CAR     89
LEE     89
HUL     88
AST     88
NOR     88
FUL     86
SHU     84
HUD     84
WLV     80
NEW     76
BHA     76
BLA     69
BAR     69
ROT     46
LUT     46
IPS     46
BOL     46
WIG     46
SUN     42
WYC     23
CHA     23
BRT     23
COV     23
Name: team1_short, dtype: int64

In [28]:
# Reorganizing my columns
cols = ['season', 'date', 'league_id', 'league', 'team1', 'team1_short', 'team2', 'team2_short',
                  'spi1', 'spi2']
spi = spi[cols]
spi.head()

Unnamed: 0,season,date,league_id,league,team1,team1_short,team2,team2_short,spi1,spi2
2992,2017,2017-08-04,2412,English League Championship,Sunderland,SUN,Derby County,DER,50.39,40.83
2994,2017,2017-08-04,2412,English League Championship,Nottingham Forest,NOT,Millwall,MIL,35.55,28.23
3004,2017,2017-08-05,2412,English League Championship,Sheffield United,SHU,Brentford,BRE,27.72,39.7
3005,2017,2017-08-05,2412,English League Championship,Queens Park Rangers,QPR,Reading,REA,36.33,34.9
3006,2017,2017-08-05,2412,English League Championship,Fulham,FUL,Norwich City,NOR,43.0,42.6


In [30]:
spi.to_csv('/home/tdraths/sb_assignments/Transfers_Capstone/data/data_cleaning_outputs/spi_best.csv')
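
One thing to watch when this file is read back in: `to_csv` writes the row index by default, and `read_csv` then loads it as an extra column named `Unnamed: 0`. If the original row labels are not needed, passing `index=False` avoids that. A self-contained sketch using in-memory buffers instead of real files (the toy frame is illustrative):

```python
import io
import pandas as pd

# Toy frame with a non-default index, like the filtered spi frame.
toy = pd.DataFrame({"team1": ["Fulham", "Reading"], "spi1": [43.0, 36.33]},
                   index=[3006, 3005])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```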