In [1]:
!pip3 install quickda





# Combining Transfer Data Sets

I'm using datasets forked from https://github.com/ewenme/transfers. It'll take a bit of manual work to do this in a way that continually works with Colab, but here's a brief outline of what happens in this notebook.

There are nine professional soccer leagues with available transfer data:
 - Dutch Eredivisie
 - English Championship
 - English Premier League
 - German Bundesliga
 - French Ligue 1
 - Italian Serie A
 - Portuguese Primeira Liga
 - Russian Premier Liga
 - Spanish La Liga

The data was scraped from https://www.transfermarkt.com/, and I have selected the years 2016 - 2020 for each league. The final, compiled dataset will include every transfer involving clubs from each league, over five seasons. Later, I will combine this data with a separate dataset, but this notebook will be limited to simply compiling the scraped data into one workable .csv file to be uploaded to the repository for this project: https://github.com/tdraths/spi_transfers_global

The steps I'll use to compile this data for each league are:
  - Define path variables for each of the five 'season' urls for each league
  - Read those path variables into dataframes
  - Check that the columns for each dataframe match in name and number
  - Concatenate those dataframes into one league-specific dataframe
  - Repeat that process across all nine leagues (somewhat manual, but necessary)
  - Concatenate the final nine league dataframes into one master dataframe
  - Save the data as a .csv file in the appropriate folder in the repository
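
The first step above can be sketched without typing out all 45 paths by hand. This is only an outline under my own assumptions: the names `BASE_URL`, `LEAGUE_FILES`, `YEARS`, `season_urls` and `all_urls` are hypothetical helpers, while the file slugs and URL pattern are copied from the paths used in the cells below.

```python
# Base of the raw-file URLs used throughout this notebook.
BASE_URL = "https://raw.githubusercontent.com/tdraths/transfers/master/data"

# File slugs as they appear in the repository (including its spelling "portugese").
LEAGUE_FILES = [
    "dutch_eredivisie",
    "english_championship",
    "english_premier_league",
    "french_ligue_1",
    "german_bundesliga_1",
    "italian_serie_a",
    "portugese_liga_nos",
    "russian_premier_liga",
    "spanish_primera_division",
]

YEARS = range(2016, 2021)  # 2016 through 2020 inclusive


def season_urls(league_file, years=YEARS, base_url=BASE_URL):
    """Return one CSV URL per season for a single league."""
    return [f"{base_url}/{year}/{league_file}.csv" for year in years]


# Hypothetical lookup: league slug -> list of five season URLs.
all_urls = {league: season_urls(league) for league in LEAGUE_FILES}

for url in all_urls["dutch_eredivisie"]:
    print(url)
```

Each list in `all_urls` could then be passed through `pd.read_csv` one URL at a time, which is what the cells below do by hand.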

### Naming Convention
From experience with this dataset, I know I need to standardize how clubs are named: when I combine this data with data from another source, the same club may appear under different names, e.g. 'AFC Bournemouth' and 'Bournemouth'.

In my opinion, it's more manageable to take each league in turn, so I'll do that here and then again when I look at the other dataset.
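
A minimal sketch of what that standardization could look like, assuming a hand-built lookup table. `CLUB_NAME_MAP` and `standardize_club` are hypothetical names, and only the Bournemouth pair comes from the example above.

```python
# Hypothetical mapping from a source-specific name to the name I want to keep.
# Only the Bournemouth pair comes from the text; a real table would be longer.
CLUB_NAME_MAP = {
    "AFC Bournemouth": "Bournemouth",
}


def standardize_club(name, mapping=CLUB_NAME_MAP):
    """Return the standard name for a club, or the stripped input if unmapped."""
    cleaned = name.strip()
    return mapping.get(cleaned, cleaned)


print(standardize_club("AFC Bournemouth"))
print(standardize_club(" Bournemouth "))
```

On a dataframe, the same function could be applied with something like `df['club_name'].map(standardize_club)`.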

### NETHERLANDS - Eredivisie

In [55]:
import numpy as np  # used later for np.where when filling fee_cleaned
import pandas as pd
from quickda.explore_data import *
from quickda.clean_data import *
from quickda.explore_numeric import *
from quickda.explore_categoric import *
from quickda.explore_numeric_categoric import *
from quickda.explore_time_series import *

path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/dutch_eredivisie.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/dutch_eredivisie.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/dutch_eredivisie.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/dutch_eredivisie.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/dutch_eredivisie.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')
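
Eyeballing five `Index` printouts works, but the same check can be done in code. Here is a small sketch, assuming a hypothetical helper `columns_match`; the tiny frames are made-up stand-ins for the downloaded season files.

```python
import pandas as pd


def columns_match(frames):
    """Return True when every frame has the same column names in the same order."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames[1:])


# Made-up single-row frames standing in for the season dataframes.
a = pd.DataFrame({"club_name": ["PSV Eindhoven"], "fee_cleaned": [4.28]})
b = pd.DataFrame({"club_name": ["FC Utrecht"], "fee_cleaned": [0.0]})
c = pd.DataFrame({"club": ["FC Utrecht"], "fee_cleaned": [0.0]})

print(columns_match([a, b]))  # same names, same order
print(columns_match([a, c]))  # first column is named differently
```

Calling `columns_match([df1, df2, df3, df4, df5])` before concatenating would flag a mismatch without having to read the output.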

In [3]:
frames = [df1, df2, df3, df4, df5]

dutch_eredivisie = pd.concat(frames)
dutch_eredivisie.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,PSV Eindhoven,Bart Ramselaar,20,Attacking Midfield,FC Utrecht,£4.28m,in,Summer,4.28,Eredivisie,2016,2016/2017
1,PSV Eindhoven,Hidde Jurjus,22,Goalkeeper,De Graafschap,£450Th.,in,Summer,0.45,Eredivisie,2016,2016/2017
2,PSV Eindhoven,Daniel Schwaab,27,Centre-Back,VfB Stuttgart,Free transfer,in,Summer,0.0,Eredivisie,2016,2016/2017
3,PSV Eindhoven,Siem de Jong,27,Attacking Midfield,Newcastle,Loan,in,Summer,0.0,Eredivisie,2016,2016/2017
4,PSV Eindhoven,Oleksandr Zinchenko,19,Left-Back,Man City,Loan,in,Summer,0.0,Eredivisie,2016,2016/2017
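
One thing worth knowing about `pd.concat`: by default it keeps each input frame's row labels, so the combined frame can contain the same index value several times. The sketch below uses made-up two-row frames to show the difference that `ignore_index=True` makes; it is an illustration, not a change to the cells in this notebook.

```python
import pandas as pd

# Made-up stand-ins for two season dataframes.
season_a = pd.DataFrame({"player_name": ["A", "B"], "year": [2016, 2016]})
season_b = pd.DataFrame({"player_name": ["C", "D"], "year": [2017, 2017]})

kept = pd.concat([season_a, season_b])                      # labels repeat
fresh = pd.concat([season_a, season_b], ignore_index=True)  # labels renumbered

print(list(kept.index))
print(list(fresh.index))
```

Repeated labels are harmless for the value counts and null checks done here, but they matter if rows are later selected by label with `.loc`.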


### ENGLAND - EFL Championship

In [4]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/english_championship.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/english_championship.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/english_championship.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/english_championship.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/english_championship.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [5]:
frames = [df1, df2, df3, df4, df5]

english_championship = pd.concat(frames)
english_championship.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Newcastle United,Matt Ritchie,26,Left-Back,Bournemouth,£10.80m,in,Summer,10.8,Championship,2016,2016/2017
1,Newcastle United,Dwight Gayle,25,Centre-Forward,Crystal Palace,£10.80m,in,Summer,10.8,Championship,2016,2016/2017
2,Newcastle United,Grant Hanley,24,Centre-Back,Blackburn,£5.94m,in,Summer,5.94,Championship,2016,2016/2017
3,Newcastle United,Matz Sels,24,Goalkeeper,KAA Gent,£5.94m,in,Summer,5.94,Championship,2016,2016/2017
4,Newcastle United,Ciaran Clark,26,Centre-Back,Aston Villa,£5.40m,in,Summer,5.4,Championship,2016,2016/2017


### ENGLAND - Premier League

In [6]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/english_premier_league.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/english_premier_league.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/english_premier_league.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/english_premier_league.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/english_premier_league.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [7]:
frames = [df1, df2, df3, df4, df5]

premier_league = pd.concat(frames)
premier_league.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Leicester City,Islam Slimani,28,Centre-Forward,Sporting CP,£27.45m,in,Summer,27.45,Premier League,2016,2016/2017
1,Leicester City,Ahmed Musa,23,Left Winger,CSKA Moscow,£17.55m,in,Summer,17.55,Premier League,2016,2016/2017
2,Leicester City,Nampalys Mendy,24,Defensive Midfield,OGC Nice,£13.95m,in,Summer,13.95,Premier League,2016,2016/2017
3,Leicester City,Bartosz Kapustka,19,Attacking Midfield,Cracovia,£4.50m,in,Summer,4.5,Premier League,2016,2016/2017
4,Leicester City,Ron-Robert Zieler,27,Goalkeeper,Hannover 96,£3.15m,in,Summer,3.15,Premier League,2016,2016/2017


### FRANCE - Ligue 1

In [8]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/french_ligue_1.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/french_ligue_1.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/french_ligue_1.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/french_ligue_1.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/french_ligue_1.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [9]:
frames = [df1, df2, df3, df4, df5]

ligue_1 = pd.concat(frames)
ligue_1.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Paris Saint-Germain,Grzegorz Krychowiak,26,Defensive Midfield,Sevilla FC,£24.75m,in,Summer,24.75,Ligue 1,2016,2016/2017
1,Paris Saint-Germain,Jesé,23,Left Winger,Real Madrid,£22.50m,in,Summer,22.5,Ligue 1,2016,2016/2017
2,Paris Saint-Germain,Giovani Lo Celso,20,Central Midfield,CA Rosario,£9.00m,in,Summer,9.0,Ligue 1,2016,2016/2017
3,Paris Saint-Germain,Thomas Meunier,24,Right-Back,Club Brugge,£5.40m,in,Summer,5.4,Ligue 1,2016,2016/2017
4,Paris Saint-Germain,Hatem Ben Arfa,29,Attacking Midfield,OGC Nice,Free transfer,in,Summer,0.0,Ligue 1,2016,2016/2017


### GERMANY - Bundesliga

In [10]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/german_bundesliga_1.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/german_bundesliga_1.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/german_bundesliga_1.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/german_bundesliga_1.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/german_bundesliga_1.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [11]:
frames = [df1, df2, df3, df4, df5]

bundesliga = pd.concat(frames)
bundesliga.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Bayern Munich,Renato Sanches,18,Central Midfield,Benfica,£31.50m,in,Summer,31.5,1 Bundesliga,2016,2016/2017
1,Bayern Munich,Mats Hummels,27,Centre-Back,Bor. Dortmund,£31.50m,in,Summer,31.5,1 Bundesliga,2016,2016/2017
2,Bayern Munich,Niklas Dorsch,18,Defensive Midfield,FC Bayern II,-,in,Summer,0.0,1 Bundesliga,2016,2016/2017
3,Bayern Munich,Fabian Benko,18,Attacking Midfield,FC Bayern II,-,in,Summer,0.0,1 Bundesliga,2016,2016/2017
4,Bayern Munich,Pierre-Emile Höjbjerg,20,Central Midfield,FC Schalke 04,"End of loanJun 30, 2016",in,Summer,0.0,1 Bundesliga,2016,2016/2017


### ITALY - Serie A

In [12]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/italian_serie_a.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/italian_serie_a.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/italian_serie_a.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/italian_serie_a.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/italian_serie_a.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [13]:
frames = [df1, df2, df3, df4, df5]

serie_a = pd.concat(frames)
serie_a.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Atalanta BC,Alberto Paloschi,26,Centre-Forward,Swansea,£6.03m,in,Summer,6.03,Serie A,2016,2016/2017
1,Atalanta BC,Bryan Cabezas,19,Left Winger,Independiente,£2.25m,in,Summer,2.25,Serie A,2016,2016/2017
2,Atalanta BC,Ervin Zukanovic,29,Centre-Back,AS Roma,Loan fee:£900Th.,in,Summer,0.9,Serie A,2016,2016/2017
3,Atalanta BC,Etrit Berisha,27,Goalkeeper,Lazio,Loan fee:£630Th.,in,Summer,0.63,Serie A,2016,2016/2017
4,Atalanta BC,Alberto Grassi,21,Central Midfield,SSC Napoli,Loan fee:£360Th.,in,Summer,0.36,Serie A,2016,2016/2017


### PORTUGAL - Liga Nos

In [14]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/portugese_liga_nos.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/portugese_liga_nos.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/portugese_liga_nos.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/portugese_liga_nos.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/portugese_liga_nos.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [15]:
frames = [df1, df2, df3, df4, df5]

portuguese_liga = pd.concat(frames)
portuguese_liga.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,SL Benfica,Rafa Silva,23.0,Left Midfield,Braga,£14.40m,in,Summer,14.4,Liga Nos,2016,2016/2017
1,SL Benfica,Konstantinos Mitroglou,28.0,Centre-Forward,Fulham,£6.30m,in,Summer,6.3,Liga Nos,2016,2016/2017
2,SL Benfica,Franco Cervi,22.0,Left Midfield,CA Rosario,£5.13m,in,Summer,5.13,Liga Nos,2016,2016/2017
3,SL Benfica,Oscar Benítez,23.0,Left Winger,Lanús,£3.96m,in,Summer,3.96,Liga Nos,2016,2016/2017
4,SL Benfica,Guillermo Celis,23.0,Defensive Midfield,Junior FC,£2.07m,in,Summer,2.07,Liga Nos,2016,2016/2017


### RUSSIA - Premier Liga

In [16]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/russian_premier_liga.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/russian_premier_liga.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/russian_premier_liga.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/russian_premier_liga.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/russian_premier_liga.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [17]:
frames = [df1, df2, df3, df4, df5]

russian_premier_liga = pd.concat(frames)
russian_premier_liga.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,Amkar Perm,Darko Bodul,27,Centre-Forward,Dundee United,Free transfer,in,Summer,0.0,Premier Liga,2016,2016/2017
1,Amkar Perm,Stanislav Prokofjev,29,Centre-Forward,SKA Khabarovsk,Free transfer,in,Summer,0.0,Premier Liga,2016,2016/2017
2,Amkar Perm,Mikhail Kostyukov,24,Right Winger,Volga NN,Free transfer,in,Summer,0.0,Premier Liga,2016,2016/2017
3,Amkar Perm,Anton Shynder,29,Centre-Forward,Shakhtar D.,Free transfer,in,Summer,0.0,Premier Liga,2016,2016/2017
4,Amkar Perm,Aleksandr Budakov,31,Goalkeeper,Isloch,Free transfer,in,Summer,0.0,Premier Liga,2016,2016/2017


### SPAIN - La Liga

In [18]:
path1 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2016/spanish_primera_division.csv'
path2 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2017/spanish_primera_division.csv'
path3 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2018/spanish_primera_division.csv'
path4 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2019/spanish_primera_division.csv'
path5 = 'https://raw.githubusercontent.com/tdraths/transfers/master/data/2020/spanish_primera_division.csv'

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)
df5 = pd.read_csv(path5)

display(df1.columns)
display(df2.columns)
display(df3.columns)
display(df4.columns)
display(df5.columns)

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [19]:
frames = [df1, df2, df3, df4, df5]

la_liga = pd.concat(frames)
la_liga.head()

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
0,FC Barcelona,André Gomes,22,Central Midfield,Valencia,£33.30m,in,Summer,33.3,Primera Division,2016,2016/2017
1,FC Barcelona,Paco Alcácer,23,Centre-Forward,Valencia,£27.00m,in,Summer,27.0,Primera Division,2016,2016/2017
2,FC Barcelona,Samuel Umtiti,22,Centre-Back,Olympique Lyon,£22.50m,in,Summer,22.5,Primera Division,2016,2016/2017
3,FC Barcelona,Lucas Digne,22,Left-Back,Paris SG,£14.85m,in,Summer,14.85,Primera Division,2016,2016/2017
4,FC Barcelona,Jasper Cillessen,27,Goalkeeper,Ajax,£11.70m,in,Summer,11.7,Primera Division,2016,2016/2017


### EUROPE - Full Dataframe of Transfers

In [20]:
frames = [dutch_eredivisie, english_championship, premier_league, ligue_1, bundesliga, serie_a, portuguese_liga, russian_premier_liga, la_liga]

df = pd.concat(frames)
df.describe()

Unnamed: 0,age,fee_cleaned,year
count,35928.0,31493.0,35930.0
mean,24.420981,1.489763,2017.9398
std,4.183484,5.984241,1.400794
min,12.0,0.0,2016.0
25%,21.0,0.0,2017.0
50%,24.0,0.0,2018.0
75%,27.0,0.0,2019.0
max,43.0,199.8,2020.0


In [22]:
explore(df)

Unnamed: 0,dtypes,count,null_sum,null_pct,nunique,min,25%,50%,75%,max,mean,median,std,skew
age,float64,35928,2,0.0,30,12.0,21.0,24.0,27.0,43.0,24.420981,24.0,4.183484,0.707532
club_involved_name,object,35930,0,0.0,2471,1. FC Köln,-,-,-,Ümraniyespor,-,-,-,-
club_name,object,35930,0,0.0,235,1. FC Köln,-,-,-,Zenit St. Petersburg,-,-,-,-
fee,object,35930,0,0.0,1128,-,-,-,-,£9Th.,-,-,-,-
fee_cleaned,float64,31493,4437,0.123,559,0.0,0.0,0.0,0.0,199.8,1.489763,0.0,5.984241,10.109474
league_name,object,35930,0,0.0,9,1 Bundesliga,-,-,-,Serie A,-,-,-,-
player_name,object,35930,0,0.0,11315,Aapo Halme,-,-,-,Özkan Yildirim,-,-,-,-
position,object,35930,0,0.0,17,Attacking Midfield,-,-,-,midfield,-,-,-,-
season,object,35930,0,0.0,5,2016/2017,-,-,-,2020/2021,-,-,-,-
transfer_movement,object,35930,0,0.0,2,in,-,-,-,out,-,-,-,-


In [23]:
df.columns

Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',
       'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',
       'league_name', 'year', 'season'],
      dtype='object')

In [24]:
df.league_name.value_counts()

Serie A             5911
Championship        5513
Liga Nos            4838
Primera Division    3826
Premier League      3707
Ligue 1             3516
Eredivisie          3002
Premier Liga        2821
1 Bundesliga        2796
Name: league_name, dtype: int64

In [25]:
df = clean(df, method='duplicates')
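
I have not checked QuickDA's internals, so treating the call above as roughly equivalent to pandas' own `drop_duplicates` is an assumption on my part. For reference, this is what the plain-pandas version looks like on a made-up three-row sample (the repeated row is invented for illustration).

```python
import pandas as pd

# Made-up sample: the first two rows are exact duplicates.
rows = pd.DataFrame(
    {
        "club_name": ["Atalanta BC", "Atalanta BC", "Genoa CFC"],
        "player_name": ["Andrea Masiello", "Andrea Masiello", "Andrea Masiello"],
        "transfer_movement": ["out", "out", "in"],
    }
)

# Drop rows that match an earlier row in every column, then renumber.
deduped = rows.drop_duplicates().reset_index(drop=True)
print(len(rows), "->", len(deduped))
```

Note that the 'in' and 'out' records of the same transfer are not duplicates of each other, because `club_name` and `transfer_movement` differ.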

### Cleaning `fee_cleaned`
The `fee_cleaned` column is an important one for my analysis. It records the amount in `fee`, but as a float instead of an object. From the QuickDA report above, it's also missing about 12% of its values. The easy way out would be to replace them all with 0.0, or the average, or some other figure. In this case, however, I'm going to take a few extra steps to fill those `fee_cleaned` nulls with values as close to correct as possible. Here's what I'm going to try to do:

 - Check the various types of transfers in the `fee` column
 - Cross-reference the missing `fee_cleaned` values with the `fee` transfer type
 - Loan transfers, free transfers: fill with 0.0
 - Determine a strategy to fill any other transfer type

First, I'll create a dataframe filtered to the null rows so that I can see which types of transfers are most often missing `fee_cleaned` values.

In [26]:
null_fee = pd.isnull(df['fee_cleaned'])
nulls = df[null_fee]
nulls['fee'].value_counts()

?                1705
loan transfer    1643
free transfer    1074
£900                9
draft               3
£90                 2
£450                1
Name: fee, dtype: int64

In [30]:
# Loans, free transfers and drafts carry no fee, so fill their missing values with 0.0
df['fee_cleaned'] = np.where(df['fee'] == 'loan transfer', df['fee_cleaned'].fillna(0.0), df['fee_cleaned'])

df['fee_cleaned'] = np.where(df['fee'] == 'free transfer', df['fee_cleaned'].fillna(0.0), df['fee_cleaned'])

df['fee_cleaned'] = np.where(df['fee'] == 'draft', df['fee_cleaned'].fillna(0.0), df['fee_cleaned'])
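
The three `np.where` calls above can also be written as one masked assignment. This is an alternative sketch on a made-up six-row sample, not the code that produced the outputs in this notebook; `zero_fee_types` and `mask` are names I'm introducing here.

```python
import numpy as np
import pandas as pd

# Made-up sample covering the fee types seen in the null report.
sample = pd.DataFrame(
    {
        "fee": ["loan transfer", "free transfer", "draft", "?", "£900", "£4.28m"],
        "fee_cleaned": [np.nan, np.nan, np.nan, np.nan, np.nan, 4.28],
    }
)

# Rows whose fee type implies no money changed hands AND whose value is missing.
zero_fee_types = ["loan transfer", "free transfer", "draft"]
mask = sample["fee"].isin(zero_fee_types) & sample["fee_cleaned"].isna()
sample.loc[mask, "fee_cleaned"] = 0.0

print(sample)
```

The `isna()` half of the mask keeps any value that was already present, which mirrors what `fillna` does inside the `np.where` version.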

In [31]:
null_fee = pd.isnull(df['fee_cleaned'])
nulls = df[null_fee]
nulls['fee'].value_counts()

?       1705
£900       9
£90        2
£450       1
Name: fee, dtype: int64

These remaining fee values are incredibly small, so I think I'll have a look at a few of them to see if there's something off about the data. It was scraped from Transfermarkt, so I'm sure the value listed there is what's on the site, but I want to get a better understanding of what the value might signify before simply filling with such low amounts.

In [39]:
display(df.loc[df['fee'] ==  '£450'])
display(df.loc[df['fee'] == '£90'])
display(df.loc[df['fee'] == '£900'])

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
998,Udinese Calcio,Guglielmo Vicario,19.0,Goalkeeper,Venezia,£450,out,Summer,,Serie A,2016,2016/2017


Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
963,Atalanta BC,Andrea Masiello,33.0,Centre-Back,Genoa,£90,out,Winter,,Serie A,2019,2019/2020
1027,Genoa CFC,Andrea Masiello,33.0,Centre-Back,Atalanta,£90,in,Winter,,Serie A,2019,2019/2020


Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
226,Chievo Verona,Alessandro Bassoli,27.0,Centre-Back,Pordenone,£900,out,Summer,,Serie A,2017,2017/2018
372,Genoa CFC,Francesco Renzetti,30.0,Left-Back,Cremonese,£900,out,Summer,,Serie A,2018,2018/2019
691,AS Roma,Matteo Ricci,24.0,Defensive Midfield,Spezia Calcio,£900,out,Summer,,Serie A,2018,2018/2019
1124,Parma Calcio 1913,Abdou Diakhate,20.0,Central Midfield,Fiorentina U19,£900,in,Winter,,Serie A,2018,2018/2019
1201,Udinese Calcio,Simone Pontisso,21.0,Defensive Midfield,LR Vicenza,£900,out,Winter,,Serie A,2018,2018/2019
462,SS Lazio,Luca Germoni,21.0,Left-Back,Virtus Entella,£900,in,Summer,,Serie A,2019,2019/2020
803,US Sassuolo,Pietro Cianci,23.0,Centre-Forward,Teramo,£900,out,Summer,,Serie A,2019,2019/2020
804,US Sassuolo,Martin Erlic,21.0,Centre-Back,Spezia Calcio,£900,out,Summer,,Serie A,2019,2019/2020
805,US Sassuolo,Filippo Bandinelli,24.0,Central Midfield,FC Empoli,£900,out,Summer,,Serie A,2019,2019/2020


#### Transfer Data for Guglielmo Vicario
*Screenshot from transfermarkt.com*

![image.png](attachment:image.png)

In [36]:
df.loc[df.fee_cleaned == 199.8]

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
443,Paris Saint-Germain,Neymar,25.0,Left Winger,FC Barcelona,£199.80m,in,Summer,199.8,Ligue 1,2017,2017/2018
29,FC Barcelona,Neymar,25.0,Left Winger,Paris SG,£199.80m,out,Summer,199.8,Primera Division,2017,2017/2018


From the transfermarkt.com screenshot above, it looks like Vicario was purchased for £450 as a young player who had been moving in and out on loan from his home squad, Udinese U19. Given that context, I should fill `fee_cleaned` with the actual transfer value.

But there's one more thing to consider...

The `fee_cleaned` column contains float values expressed in millions of pounds. Note the value for Neymar in the `fee_cleaned` column: a fee of £199.80m is stored as 199.8.

I don't think scaling a low transfer value of, say, ***£450*** down to 0.00045 is all that useful, ***but*** since I know the context and the conversion to millions is technically possible, I might as well fill with the exact value and press on.

In [41]:
# Fill the remaining nulls for the three very low fees, expressed in £ millions
df['fee_cleaned'] = np.where(df['fee'] == '£900', df['fee_cleaned'].fillna(0.00090), df['fee_cleaned'])

df['fee_cleaned'] = np.where(df['fee'] == '£90', df['fee_cleaned'].fillna(0.00009), df['fee_cleaned'])

df['fee_cleaned'] = np.where(df['fee'] == '£450', df['fee_cleaned'].fillna(0.00045), df['fee_cleaned'])
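
The three fills above follow the same pattern, so they could also be driven by a small lookup. This is a minimal sketch on a toy frame rather than the real data; the `low_fees` dictionary and the toy rows are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe (illustrative rows only)
toy = pd.DataFrame({
    'fee': ['£900', '£90', '£450', '?', '£199.80m'],
    'fee_cleaned': [np.nan, np.nan, np.nan, np.nan, 199.8],
})

# Hypothetical lookup: raw fee string -> value in £ millions
low_fees = {
    '£900': 900 / 1_000_000,
    '£90': 90 / 1_000_000,
    '£450': 450 / 1_000_000,
}

for raw_fee, fee_in_millions in low_fees.items():
    # Only touch rows with this raw fee that are still missing a cleaned value
    mask = (toy['fee'] == raw_fee) & toy['fee_cleaned'].isna()
    toy.loc[mask, 'fee_cleaned'] = fee_in_millions
```

Writing the values as `450 / 1_000_000` keeps the conversion to millions visible instead of relying on hand-typed decimals.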

In [42]:
null_fee = pd.isnull(df['fee_cleaned'])
nulls = df[null_fee]
nulls['fee'].value_counts()

?    1705
Name: fee, dtype: int64
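
Since this null check is repeated after each fill, it could be wrapped in a small helper. A minimal sketch, assuming the same `fee` and `fee_cleaned` column names; the function name `remaining_fee_labels` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def remaining_fee_labels(frame):
    """Count raw `fee` strings on rows where `fee_cleaned` is still null."""
    still_null = frame['fee_cleaned'].isna()
    return frame.loc[still_null, 'fee'].value_counts()

# Toy frame for illustration only
toy = pd.DataFrame({
    'fee': ['?', '?', 'Loan', '£1.50m'],
    'fee_cleaned': [np.nan, np.nan, 0.0, 1.5],
})
counts = remaining_fee_labels(toy)
```

An empty result from the helper means every row has a cleaned fee.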

#### What's happened so far:
 - 'Loan transfer', 'free transfer' and 'draft' transfer types have been assigned a value of 0.0 in the `fee_cleaned` column.
 - I had to do a bit of Googling to determine what to do with the super-low transfer fees in the `fee` column. Ultimately, I filled them all with the correct value in the `fee_cleaned` column, expressed in millions of pounds.
 - Interestingly, all of those low transfer values were from Serie A, and one of the players was involved in a major match-fixing scandal...
 
Next, I'll investigate the remaining '?' values from `fee`, and see what the correct course of action should be with them.

In [44]:
question_marks = df.loc[df.fee == '?']

question_marks.sample(5)

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
349,Boavista FC,Cassiano,30.0,Centre-Forward,CSA,?,in,Summer,,Liga Nos,2019,2019/2020
725,CS Marítimo,Andreas Karo,24.0,Centre-Back,Lazio,?,in,Winter,,Liga Nos,2020,2020/2021
559,Portimonense SC,Fernandinho,24.0,Left Midfield,CD Fátima,?,out,Summer,,Liga Nos,2019,2019/2020
35,Aston Villa,James Bree,22.0,Right-Back,Luton,?,out,Summer,,Premier League,2020,2020/2021
390,Moreirense FC,Lazar Rosic,26.0,Centre-Back,Braga,?,in,Summer,,Liga Nos,2019,2019/2020


Looking through just this sample, it seems these players were mostly loaned in and out of clubs, with few transfers carrying an actual fee. I'm going to fill them with 0.0, as if they were loan transfers, and crack on.

In [45]:
df['fee_cleaned'] = np.where(df['fee'] == '?', df['fee_cleaned'].fillna(0.0), df['fee_cleaned'])
null_fee = pd.isnull(df['fee_cleaned'])
nulls = df[null_fee]
nulls['fee'].value_counts()

Series([], Name: fee, dtype: int64)
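
Once the '?' rows are filled with 0.0, they can no longer be told apart from genuine loans and free transfers. If that distinction might matter later, one option is to record a flag before filling. A minimal sketch on toy data; the `fee_unknown` column name is an assumption, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    'fee': ['?', 'Loan', '?', '£2.00m'],
    'fee_cleaned': [np.nan, 0.0, np.nan, 2.0],
})

# Hypothetical flag: remember which fees were unknown before overwriting them
toy['fee_unknown'] = toy['fee'] == '?'

# Fill the unknown fees with 0.0, using a boolean mask
toy.loc[toy['fee_unknown'] & toy['fee_cleaned'].isna(), 'fee_cleaned'] = 0.0
```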

In [48]:
explore(df)

Unnamed: 0,dtypes,count,null_sum,null_pct,nunique,min,25%,50%,75%,max,mean,median,std,skew
age,float64,35926,2,0.0,30,12.0,21.0,24.0,27.0,43.0,24.421283,24.0,4.183405,0.707516
club_involved_name,object,35928,0,0.0,2471,1. FC Köln,-,-,-,Ümraniyespor,-,-,-,-
club_name,object,35928,0,0.0,235,1. FC Köln,-,-,-,Zenit St. Petersburg,-,-,-,-
fee,object,35928,0,0.0,1128,-,-,-,-,£9Th.,-,-,-,-
fee_cleaned,float64,35928,0,0.0,562,0.0,0.0,0.0,0.0,199.8,1.305865,0.0,5.624109,10.770908
league_name,object,35928,0,0.0,9,1 Bundesliga,-,-,-,Serie A,-,-,-,-
player_name,object,35928,0,0.0,11315,Aapo Halme,-,-,-,Özkan Yildirim,-,-,-,-
position,object,35928,0,0.0,17,Attacking Midfield,-,-,-,midfield,-,-,-,-
season,object,35928,0,0.0,5,2016/2017,-,-,-,2020/2021,-,-,-,-
transfer_movement,object,35928,0,0.0,2,in,-,-,-,out,-,-,-,-


The last null values to deal with are the two in the `age` column. I'll check who the players are and look up their ages to fill in the proper value.

In [47]:
df.loc[df.age.isna()]

Unnamed: 0,club_name,player_name,age,position,club_involved_name,fee,transfer_movement,transfer_period,fee_cleaned,league_name,year,season
979,CS Marítimo,Abdul Basit,,Attacking Midfield,FC Stumbras,Loan,in,Winter,0.0,Liga Nos,2016,2016/2017
454,CS Marítimo,Abdul Basit,,Attacking Midfield,FC Stumbras,"End of loanJun 30, 2017",out,Summer,0.0,Liga Nos,2017,2017/2018


In [49]:
df.age.fillna('24', inplace=True)

In [50]:
explore(df)

Unnamed: 0,dtypes,count,null_sum,null_pct,nunique,min,25%,50%,75%,max,mean,median,std,skew
age,object,35928,0,0.0,31,-,-,-,-,-,-,-,-,-
club_involved_name,object,35928,0,0.0,2471,1. FC Köln,-,-,-,Ümraniyespor,-,-,-,-
club_name,object,35928,0,0.0,235,1. FC Köln,-,-,-,Zenit St. Petersburg,-,-,-,-
fee,object,35928,0,0.0,1128,-,-,-,-,£9Th.,-,-,-,-
fee_cleaned,float64,35928,0,0.0,562,0.0,0.0,0.0,0.0,199.8,1.305865,0.0,5.624109,10.770908
league_name,object,35928,0,0.0,9,1 Bundesliga,-,-,-,Serie A,-,-,-,-
player_name,object,35928,0,0.0,11315,Aapo Halme,-,-,-,Özkan Yildirim,-,-,-,-
position,object,35928,0,0.0,17,Attacking Midfield,-,-,-,midfield,-,-,-,-
season,object,35928,0,0.0,5,2016/2017,-,-,-,2020/2021,-,-,-,-
transfer_movement,object,35928,0,0.0,2,in,-,-,-,out,-,-,-,-
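
In the summary above, `age` now reports `object` dtype with no numeric statistics, because the fill value `'24'` was a string. A minimal sketch, on toy data, of filling with a number instead, or of converting an already-mixed column back with `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only
ages = pd.Series([19.0, np.nan, 33.0])

# Filling with a string leaves a mix of floats and text in the column
mixed = ages.fillna('24')

# Filling with a number keeps the column as float64
numeric = ages.fillna(24.0)

# An already-mixed column can be converted back to numbers
restored = pd.to_numeric(mixed)
```

Either approach keeps `age` usable for numeric summaries such as the mean and median.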


Now that there are no more null values to sort out, I no longer need the `fee` column; the `fee_cleaned` column has values of the right type, and I can move on using it as my monetary value feature.

In [51]:
df.drop(columns='fee', inplace=True)

## What's Happened So Far:
 - Started with nine smaller datasets, nearly 36,000 records in total, covering the transfer data for nine major European leagues:
     - England: English Premier League
     - England: Football League Championship
     - France: Ligue 1 
     - Germany: Bundesliga
     - Italy: Serie A
     - Portugal: Liga Nos
     - Spain: La Liga
     - Netherlands: Eredivisie
     - Russia: Premier Liga
     
 - Concatenated those nine datasets together into one larger dataset
 - Used values from the object-type `fee` column to fill null values in the float-type `fee_cleaned` column
 - Filled null values in the `age` column
 - Dropped the `fee` column; using `fee_cleaned` from now on.
 

**Next Steps**
 - Look at a dataset of SPI rankings from FiveThirtyEight.
 - Combine SPI rankings and transfer data into one dataframe
 - Shape the dataframe to include 1 record per team per year
 - Explore that dataframe visually for interesting patterns
 - Begin the modelling portion of this project.
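
As a rough idea of what "1 record per team per year" could look like, here is a minimal sketch on toy rows that reuse this dataset's column names. The `spend`, `income` and `net_spend` names are assumptions for illustration, and the real reshaping may well differ.

```python
import pandas as pd

# Toy rows using the notebook's column names (values are illustrative)
toy = pd.DataFrame({
    'club_name': ['Club A', 'Club A', 'Club A', 'Club B'],
    'year': [2019, 2019, 2019, 2019],
    'transfer_movement': ['in', 'in', 'out', 'out'],
    'fee_cleaned': [10.0, 5.0, 8.0, 2.5],
})

# Sum fees per club, year and direction, giving one row per club and year
per_club_year = (
    toy.pivot_table(index=['club_name', 'year'],
                    columns='transfer_movement',
                    values='fee_cleaned',
                    aggfunc='sum',
                    fill_value=0.0)
       .rename(columns={'in': 'spend', 'out': 'income'})
       .reset_index()
)

# Hypothetical derived feature: net spend in £ millions
per_club_year['net_spend'] = per_club_year['spend'] - per_club_year['income']
```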

In [53]:
df.to_csv('concatenated_datasets.csv', index=False)