## Transfers Capstone - Data Cleaning

### Data Collection: 2016 - 2020 Transfers Data, English Premier League & EFL Championship
Data sourced from: https://github.com/ewenme/transfers

### I. Loading Data

In [1]:
# Loading packages and combining multiple seasons' data from PL & EFL-C into one .csv file
import pandas as pd
import glob
import os
import numpy as np
from pandas_profiling import ProfileReport

cleaning_path = "/home/tdraths/sb_assignments/Transfers_Capstone/data/original_data_sources/season_data"
combined_file = 'combined_seasons_data.csv'

os.chdir(cleaning_path)
extension = 'csv'
# Skip the combined output so re-running this cell does not read it back in as a season file
files = [i for i in glob.glob('*.{}'.format(extension)) if i != combined_file]

combined_data = pd.concat([pd.read_csv(f) for f in files])

combined_data.to_csv(combined_file, index=False)

In [2]:
# Loading new csv into 'transfers' dataframe
transfers = pd.read_csv('combined_seasons_data.csv', low_memory=False)
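
Because the combined file is written into the same folder that the season files are read from, a glob over `*.csv` can pick it up again on a later run. The sketch below shows one way to guard against that. The helper name `combine_season_files` and the temporary demo files are assumptions for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_season_files(folder, output_name="combined_seasons_data.csv"):
    """Concatenate every season .csv in `folder`, skipping the combined output itself."""
    folder = Path(folder)
    season_files = sorted(p for p in folder.glob("*.csv") if p.name != output_name)
    combined = pd.concat([pd.read_csv(p) for p in season_files], ignore_index=True)
    combined.to_csv(folder / output_name, index=False)
    return combined


# Tiny demonstration with made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"club_name": ["A"], "year": [2016]}).to_csv(Path(tmp) / "s2016.csv", index=False)
    pd.DataFrame({"club_name": ["B"], "year": [2017]}).to_csv(Path(tmp) / "s2017.csv", index=False)
    first_run = combine_season_files(tmp)
    second_run = combine_season_files(tmp)  # the output file now exists but is skipped
```

Both runs return the same two rows, so re-running the step does not inflate the record count.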

### II. Examining 'transfers' dataframe

In [3]:
transfers.head()

| | club_name | player_name | age | position | club_involved_name | fee | transfer_movement | transfer_period | fee_cleaned | league_name | year | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal FC | Thomas | 27 | Defensive Midfield | Atlético Madrid | £45.00m | in | Summer | 45.0 | Premier League | 2020 | 2020/2021 |
| 1 | Arsenal FC | Gabriel | 22 | Centre-Back | LOSC Lille | £23.40m | in | Summer | 23.4 | Premier League | 2020 | 2020/2021 |
| 2 | Arsenal FC | Pablo Marí | 26 | Centre-Back | Flamengo | £7.20m | in | Summer | 7.2 | Premier League | 2020 | 2020/2021 |
| 3 | Arsenal FC | Rúnar Alex Rúnarsson | 25 | Goalkeeper | Dijon | £1.80m | in | Summer | 1.8 | Premier League | 2020 | 2020/2021 |
| 4 | Arsenal FC | Cédric Soares | 28 | Right-Back | Southampton | Free transfer | in | Summer | 0.0 | Premier League | 2020 | 2020/2021 |


In [4]:
transfers.columns

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [5]:
display(transfers.dtypes)
transfers.info()

club_name              object
player_name            object
age                     int64
position               object
club_involved_name     object
fee                    object
transfer_movement      object
transfer_period        object
fee_cleaned           float64
league_name            object
year                    int64
season                 object
dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8597 entries, 0 to 8596
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   club_name           8597 non-null   object 
 1   player_name         8597 non-null   object 
 2   age                 8597 non-null   int64  
 3   position            8597 non-null   object 
 4   club_involved_name  8597 non-null   object 
 5   fee                 8597 non-null   object 
 6   transfer_movement   8597 non-null   object 
 7   transfer_period     1200 non-null   object 
 8   fee_cleaned         7884 non-null   float64
 9   league_name         8597 non-null   object 
 10  year                8597 non-null   int64  
 11  season              8597 non-null   object 
dtypes: float64(1), int64(2), object(9)
memory usage: 806.1+ KB


In [6]:
transfers.shape

(8597, 12)

From the steps above:
 - I have a dataframe 'transfers' with 12 columns and 8,597 records.
 - Nine columns are object-type; only two are int-type and one is float-type.
 - The columns in 'transfers' should help me create features later on, so I'll keep most of them for now.
 - **transfers['fee']** and **transfers['fee_cleaned']** both show the amount paid to a club for the player: 'fee' as an object-type string and 'fee_cleaned' as a float.
 - I'll use 'fee_cleaned' to calculate how much each club is spending per season, per player, per position, etc.
 - I'll drop 'fee', since 'fee_cleaned' already gives me a numeric version to work with.
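
The kind of spending summary described above can be sketched with a `groupby` on 'fee_cleaned'. This is a minimal illustration under two assumptions: that rows with `transfer_movement == 'in'` represent money spent by `club_name`, and that 'fee_cleaned' is in millions of pounds (which the `head()` output suggests). The sample rows are made up.

```python
import pandas as pd

# Made-up rows that mimic the notebook's columns; the values are illustrative only
sample = pd.DataFrame({
    "club_name": ["Arsenal FC", "Arsenal FC", "Arsenal FC", "Chelsea FC"],
    "position": ["Centre-Back", "Goalkeeper", "Centre-Back", "Left-Back"],
    "transfer_movement": ["in", "in", "out", "in"],
    "fee_cleaned": [23.4, 1.8, 5.0, 45.0],
    "season": ["2020/2021"] * 4,
})

# Keep only incoming transfers, then total the fees by club and season
incoming = sample[sample["transfer_movement"] == "in"]
spend_per_club_season = incoming.groupby(["club_name", "season"])["fee_cleaned"].sum()

# The same pattern works for any other grouping, e.g. by position
spend_per_position = incoming.groupby("position")["fee_cleaned"].sum()
```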

In [7]:
transfers.nunique()

club_name               51
player_name           2698
age                     27
position                16
club_involved_name     803
fee                    718
transfer_movement        2
transfer_period          2
fee_cleaned            365
league_name              2
year                     5
season                   5
dtype: int64

In [8]:
transfers.club_name.value_counts()

Nottingham Forest          294
Chelsea FC                 283
Wolverhampton Wanderers    282
Manchester City            246
Birmingham City            245
Reading FC                 244
Watford FC                 242
Brighton & Hove Albion     241
Norwich City               234
Leeds United               233
Bristol City               231
Aston Villa                212
Fulham FC                  211
Cardiff City               205
Queens Park Rangers        202
Barnsley FC                202
Huddersfield Town          199
Everton FC                 198
Swansea City               198
Newcastle United           196
Middlesbrough FC           184
Derby County               183
Wigan Athletic             181
Sheffield United           172
Preston North End          167
Stoke City                 167
Hull City                  166
AFC Bournemouth            161
West Ham United            160
Sheffield Wednesday        153
Crystal Palace             148
Brentford FC               147
Southamp

In [9]:
transfers.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 8597.0 | 24.719786 | 4.31577 | 16.0 | 21.0 | 24.0 | 28.0 | 43.0 |
| fee_cleaned | 7884.0 | 1.827636 | 6.698353 | 0.0 | 0.0 | 0.0 | 0.0 | 130.5 |
| year | 8597.0 | 2017.793649 | 1.359691 | 2016.0 | 2017.0 | 2018.0 | 2019.0 | 2020.0 |


In [10]:
transfers.agg(['min', 'max']).T

| | min | max |
|---|---|---|
| club_name | AFC Bournemouth | Wycombe Wanderers |
| player_name | Aapo Halme | Örjan Nyland |
| age | 16 | 43 |
| position | Attacking Midfield | Second Striker |
| club_involved_name | 1. FC Köln | Östersund |
| fee | - | £990k |
| transfer_movement | in | out |
| fee_cleaned | 0.0 | 130.5 |
| league_name | Championship | Premier League |
| year | 2016 | 2020 |


In [11]:
transfers.isna().sum()
# transfer_period has a lot of null values; I'll drop it


club_name                0
player_name              0
age                      0
position                 0
club_involved_name       0
fee                      0
transfer_movement        0
transfer_period       7397
fee_cleaned            713
league_name              0
year                     0
season                   0
dtype: int64

In [12]:
transfers.drop(columns=['fee', 'transfer_period'], inplace=True)

In [13]:
transfers.columns

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'transfer_movement', 'fee_cleaned', 'league_name', 'year', 'season'],
      dtype='object')

In [14]:
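# Handling the null values in 'fee_cleaned'
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which may not update 'transfers' in newer pandas versions
transfers['fee_cleaned'] = transfers['fee_cleaned'].fillna(0)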
transfers.isna().sum()

club_name             0
player_name           0
age                   0
position              0
club_involved_name    0
transfer_movement     0
fee_cleaned           0
league_name           0
year                  0
season                0
dtype: int64
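
Filling the nulls with 0 makes a missing fee indistinguishable from a genuine 0.0 fee such as a free transfer. If that distinction matters later, one option (not something the notebook does) is to record which rows were null before filling. The column name `fee_missing` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: a known fee, a missing fee, and a genuine zero fee
sample = pd.DataFrame({
    "player_name": ["P1", "P2", "P3"],
    "fee_cleaned": [7.2, np.nan, 0.0],
})

# Record which fees were missing, then fill them with 0 as in the cell above
sample["fee_missing"] = sample["fee_cleaned"].isna()
sample["fee_cleaned"] = sample["fee_cleaned"].fillna(0)
```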

In [15]:
transfers.duplicated().sum()

142

#### Example of duplicate records from web-scraped data
 - The subset below shows duplicate records for one player; other players have similar duplicates.
 - A player can move in and out of a club more than once in a season, but I do not need two records of the exact same transfer.

In [16]:
transfers[transfers.player_name == 'Örjan Nyland']

| | club_name | player_name | age | position | club_involved_name | transfer_movement | fee_cleaned | league_name | year | season |
|---|---|---|---|---|---|---|---|---|---|---|
| 36 | Aston Villa | Örjan Nyland | 30 | Goalkeeper | Without Club | out | 0.0 | Premier League | 2020 | 2020/2021 |
| 502 | Aston Villa | Örjan Nyland | 30 | Goalkeeper | Without Club | out | 0.0 | Premier League | 2020 | 2020/2021 |
| 5257 | Aston Villa | Örjan Nyland | 27 | Goalkeeper | FC Ingolstadt | in | 2.7 | Championship | 2018 | 2018/2019 |


In [17]:
transfers = transfers.drop_duplicates()
transfers.shape

(8455, 10)
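
The new shape is consistent with the duplicate count: 8597 - 142 = 8455. As a small sketch of how the duplicate groups can be inspected before dropping them, `duplicated(keep=False)` flags every copy of a repeated row, while the default flags only the extra copies. The sample rows are a trimmed, made-up version of the example above.

```python
import pandas as pd

# Trimmed sample: two identical 'out' records and one distinct 'in' record
sample = pd.DataFrame({
    "club_name": ["Aston Villa"] * 3,
    "player_name": ["Örjan Nyland"] * 3,
    "transfer_movement": ["out", "out", "in"],
    "year": [2020, 2020, 2018],
})

all_copies = sample[sample.duplicated(keep=False)]  # every row that has a twin
n_extra = sample.duplicated().sum()                 # only the copies that would be dropped
deduped = sample.drop_duplicates()                  # keeps the first copy of each row
```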

In [18]:
transfers.describe()

| | age | fee_cleaned | year |
|---|---|---|---|
| count | 8455.0 | 8455.0 | 8455.0 |
| mean | 24.73152 | 1.669604 | 2017.766647 |
| std | 4.322691 | 6.434017 | 1.346455 |
| min | 16.0 | 0.0 | 2016.0 |
| 25% | 21.0 | 0.0 | 2017.0 |
| 50% | 24.0 | 0.0 | 2018.0 |
| 75% | 28.0 | 0.0 | 2019.0 |
| max | 43.0 | 130.5 | 2020.0 |


### III. Fixing club name issues
 - I want to standardize my club names because I'll be combining two dataframes, and I do not want to lose rows when they are merged on club name.
 - I'll build a dictionary that maps each club name to the shortened name I want, and add the shortened names to the dataframe as a new column.
 - I'll keep the original club names as well, in case I need them for reference or future use.

In [19]:
transfers.club_name.value_counts()

Nottingham Forest          286
Wolverhampton Wanderers    279
Chelsea FC                 277
Birmingham City            242
Manchester City            241
Reading FC                 240
Watford FC                 236
Brighton & Hove Albion     233
Norwich City               231
Leeds United               230
Bristol City               227
Aston Villa                210
Fulham FC                  208
Cardiff City               203
Huddersfield Town          199
Queens Park Rangers        198
Barnsley FC                197
Newcastle United           194
Swansea City               193
Everton FC                 193
Middlesbrough FC           184
Derby County               181
Wigan Athletic             179
Sheffield United           169
Hull City                  166
Preston North End          165
Stoke City                 165
AFC Bournemouth            159
West Ham United            155
Sheffield Wednesday        153
Crystal Palace             146
Brentford FC               144
Southamp

In [20]:
# Creating a dictionary of shortened team names using common three-letter abbreviations
transfer_names_short = {
    'Nottingham Forest': 'NOT', 'Chelsea FC': 'CHE', 'Wolverhampton Wanderers': 'WLV', 'Manchester City': 'MNC', 'Birmingham City': 'BRM',
    'Reading FC': 'REA', 'Watford FC': 'WAT', 'Brighton & Hove Albion': 'BHA', 'Norwich City': 'NOR', 'Leeds United': 'LEE',
    'Bristol City': 'BRS', 'Aston Villa': 'AST', 'Fulham FC': 'FUL', 'Cardiff City': 'CAR', 'Queens Park Rangers': 'QPR', 'Barnsley FC': 'BAR',
    'Huddersfield Town': 'HUD', 'Everton FC': 'EVE', 'Swansea City': 'SWA', 'Newcastle United': 'NEW', 'Middlesbrough FC': 'MID', 
    'Derby County': 'DER', 'Wigan Athletic': 'WIG', 'Sheffield United': 'SHU', 'Stoke City': 'STO', 'Preston North End': 'PRE', 
    'Hull City': 'HUL', 'AFC Bournemouth': 'BOU', 'West Ham United': 'WHU', 'Sheffield Wednesday': 'SHW', 'Crystal Palace': 'CRY', 
    'Brentford FC': 'BRE', 'Southampton FC': 'SOU', 'Leicester City': 'LEI', 'Liverpool FC': 'LIV', 'Arsenal FC': 'ARS', 
    'West Bromwich Albion': 'WBA', 'Burnley FC': 'BUR', 'Millwall FC': 'MIL', 'Ipswich Town': 'IPS', 'Rotherham United': 'ROT', 
    'Blackburn Rovers': 'BLA', 'Manchester United': 'MNU', 'Burton Albion': 'BRT', 'Tottenham Hotspur': 'TOT', 'Sunderland AFC': 'SUN', 
    'Bolton Wanderers': 'BOL', 'Luton Town': 'LUT', 'Charlton Athletic': 'CHA', 'Coventry City': 'COV', 'Wycombe Wanderers': 'WYC'
}

In [21]:
# Creating a new column of shortened names
transfers['team_short'] = transfers['club_name'].replace(transfer_names_short)
transfers.columns

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'transfer_movement', 'fee_cleaned', 'league_name', 'year', 'season',
       'team_short'],
      dtype='object')
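
One thing to be aware of with `replace`: a club name that is missing from the dictionary passes through unchanged, so a gap in the mapping would not raise an error. A minimal sketch of how `map` can be used to surface such names is below. The two-entry dictionary and the sample Series are made up for illustration.

```python
import pandas as pd

# Deliberately incomplete mapping to show the difference in behavior
names_short = {"Arsenal FC": "ARS", "Chelsea FC": "CHE"}
clubs = pd.Series(["Arsenal FC", "Chelsea FC", "Luton Town"], name="club_name")

via_replace = clubs.replace(names_short)  # unmapped names pass through unchanged
via_map = clubs.map(names_short)          # unmapped names become NaN

# Names that have no entry in the dictionary
unmapped = clubs[via_map.isna()].unique()
```

An empty `unmapped` array would indicate that every club name in the column has an entry in the dictionary.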

In [22]:
# Reorganizing my columns
cols = ['club_name','team_short', 'player_name', 'age', 'position', 'club_involved_name', 
        'transfer_movement', 'fee_cleaned', 'league_name', 'year', 'season']
transfers = transfers[cols]

transfers.head()

| | club_name | team_short | player_name | age | position | club_involved_name | transfer_movement | fee_cleaned | league_name | year | season |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal FC | ARS | Thomas | 27 | Defensive Midfield | Atlético Madrid | in | 45.0 | Premier League | 2020 | 2020/2021 |
| 1 | Arsenal FC | ARS | Gabriel | 22 | Centre-Back | LOSC Lille | in | 23.4 | Premier League | 2020 | 2020/2021 |
| 2 | Arsenal FC | ARS | Pablo Marí | 26 | Centre-Back | Flamengo | in | 7.2 | Premier League | 2020 | 2020/2021 |
| 3 | Arsenal FC | ARS | Rúnar Alex Rúnarsson | 25 | Goalkeeper | Dijon | in | 1.8 | Premier League | 2020 | 2020/2021 |
| 4 | Arsenal FC | ARS | Cédric Soares | 28 | Right-Back | Southampton | in | 0.0 | Premier League | 2020 | 2020/2021 |


In [23]:
display(transfers.club_name.value_counts())
transfers.team_short.value_counts()

Nottingham Forest          286
Wolverhampton Wanderers    279
Chelsea FC                 277
Birmingham City            242
Manchester City            241
Reading FC                 240
Watford FC                 236
Brighton & Hove Albion     233
Norwich City               231
Leeds United               230
Bristol City               227
Aston Villa                210
Fulham FC                  208
Cardiff City               203
Huddersfield Town          199
Queens Park Rangers        198
Barnsley FC                197
Newcastle United           194
Swansea City               193
Everton FC                 193
Middlesbrough FC           184
Derby County               181
Wigan Athletic             179
Sheffield United           169
Hull City                  166
Preston North End          165
Stoke City                 165
AFC Bournemouth            159
West Ham United            155
Sheffield Wednesday        153
Crystal Palace             146
Brentford FC               144
Southamp

NOT    286
WLV    279
CHE    277
BRM    242
MNC    241
REA    240
WAT    236
BHA    233
NOR    231
LEE    230
BRS    227
AST    210
FUL    208
CAR    203
HUD    199
QPR    198
BAR    197
NEW    194
SWA    193
EVE    193
MID    184
DER    181
WIG    179
SHU    169
HUL    166
STO    165
PRE    165
BOU    159
WHU    155
SHW    153
CRY    146
BRE    144
LEI    140
SOU    140
LIV    139
WBA    138
ARS    137
BUR    136
MIL    135
IPS    127
ROT    123
BLA    118
MNU    109
BRT    108
TOT     95
SUN     93
BOL     86
LUT     55
CHA     45
COV     29
WYC     19
Name: team_short, dtype: int64
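
The two outputs list matching counts, and the 51 codes line up with the 51 unique club names found earlier. Rather than comparing the lists by eye, the one-to-one relationship can also be checked programmatically. This is a sketch on made-up rows; on the real data the same two `groupby` calls would run against `transfers`.

```python
import pandas as pd

# Made-up rows using names and codes from the dictionary above
sample = pd.DataFrame({
    "club_name": ["Stoke City", "Preston North End", "Stoke City"],
    "team_short": ["STO", "PRE", "STO"],
})

# Each code should belong to exactly one club, and each club to exactly one code
clubs_per_code = sample.groupby("team_short")["club_name"].nunique()
codes_per_club = sample.groupby("club_name")["team_short"].nunique()
is_one_to_one = bool((clubs_per_code == 1).all() and (codes_per_club == 1).all())
```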

In [24]:
transfers.to_csv('/home/tdraths/sb_assignments/Transfers_Capstone/data/data_cleaning_outputs/transfers_best.csv')
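
Note that `to_csv` writes the dataframe index by default, so reloading this file will produce an extra unnamed first column holding the old row labels (which have gaps after the duplicates were dropped). Whether that is wanted depends on the downstream notebook, so the sketch below only demonstrates the difference, using an in-memory buffer and made-up rows.

```python
import io

import pandas as pd

# Non-contiguous index, similar to what remains after drop_duplicates()
sample = pd.DataFrame(
    {"club_name": ["Arsenal FC", "Aston Villa"], "year": [2020, 2018]},
    index=[0, 36],
)

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```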

#### What's happened so far:
 - I have five seasons (2016/17 to 2020/21) of data showing the names, positions, clubs, fees and other data points for every transfer in the EPL and EFL-C.
 - I combined those season .csv files into one .csv file and loaded it into a dataframe 'transfers'.
 - I dropped two columns that were unnecessary: 'fee' because it was an object-type column and I already had a float-type to use, and 'transfer_period' because it had too much missing data.
 - I filled the 713 null values in 'fee_cleaned' with 0.
 - I dropped all 142 duplicate rows, leaving 8,455 records.
 - I added a 'team_short' column of shortened club names and saved the result to 'transfers_best.csv'.
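
The cleaning steps in this notebook can also be gathered into a single function so they are easy to re-run on a fresh download. This is a sketch under assumptions: the function name `clean_transfers` is hypothetical, and the sample rows are made up to exercise each step.

```python
import numpy as np
import pandas as pd


def clean_transfers(raw, names_short):
    """Apply the notebook's cleaning steps in order and return a new dataframe."""
    cleaned = raw.drop(columns=["fee", "transfer_period"])
    cleaned = cleaned.assign(fee_cleaned=cleaned["fee_cleaned"].fillna(0))
    cleaned = cleaned.drop_duplicates()
    cleaned = cleaned.assign(team_short=cleaned["club_name"].replace(names_short))
    return cleaned


# Made-up rows: one duplicated record with a missing fee, plus one distinct record
raw = pd.DataFrame({
    "club_name": ["Aston Villa", "Aston Villa", "Arsenal FC"],
    "player_name": ["X", "X", "Y"],
    "fee": ["-", "-", "£7.20m"],
    "transfer_period": [None, None, "Summer"],
    "fee_cleaned": [np.nan, np.nan, 7.2],
})
result = clean_transfers(raw, {"Aston Villa": "AST", "Arsenal FC": "ARS"})
```

The function returns a new dataframe and leaves `raw` untouched, which makes it straightforward to compare the data before and after cleaning.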