# Applied ML

We will apply machine learning to a dataset about soccer players and referees.
The work is divided into two parts: first, a pre-processing and visualization pipeline to get comfortable with the data; then the prediction tasks, where the players' skin color is inferred from the other parameters (features).

In [1]:
# A number of libraries will be used:
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline

### 1. Pre-processing and Visualization

#### 1.1 Pre-processing

In [25]:
#Loading the data to a DataFrame
df = pd.read_csv('CrowdstormingDataJuly1st.csv')
df.head()

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,...,0.5,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,...,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
2,abdon-prats,Abdón Prats,RCD Mallorca,Spain,17.12.1992,181.0,79.0,,1,0,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3,pablo-mari,Pablo Marí,RCD Mallorca,Spain,31.08.1993,191.0,87.0,Center Back,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
4,ruben-pena,Rubén Peña,Real Valladolid,Spain,18.07.1991,172.0,70.0,Right Midfielder,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002


In [26]:
# Just by describing the data we can see how incomplete it is
df.describe()



Unnamed: 0,height,weight,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,rater1,rater2,refNum,refCountry,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
count,145765.0,143785.0,146028.0,146028.0,146028.0,146028.0,146028.0,146028.0,146028.0,146028.0,124621.0,124621.0,146028.0,146028.0,145865.0,145865.0,145865.0,145865.0,145865.0,145865.0
mean,181.935938,76.075662,2.921166,1.278344,0.708241,0.934581,0.338058,0.385364,0.011381,0.012559,0.264255,0.302862,1534.827444,29.642842,0.346276,19697.41,0.0006310849,0.452026,20440.23,0.002994
std,6.738726,7.140906,3.413633,1.790725,1.116793,1.383059,0.906481,0.795333,0.107931,0.112889,0.295382,0.29302,918.736625,27.496189,0.032246,127126.2,0.004735857,0.217469,130615.7,0.019723
min,161.0,54.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.047254,2.0,2.235373e-07,-1.375,2.0,1e-06
25%,,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,641.0,7.0,,,,,,
50%,,,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,,,1604.0,21.0,,,,,,
75%,,,3.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,,,2345.0,44.0,,,,,,
max,203.0,100.0,47.0,29.0,14.0,18.0,23.0,14.0,3.0,2.0,1.0,1.0,3147.0,161.0,0.573793,1975803.0,0.2862871,1.8,2029548.0,1.06066


Keeping in mind that our final goal is to predict skin tone, we can already discard all the rows that lack a skin-tone rating. We also need to aggregate the two raters' scores into a single label for classification.

In [44]:
# Keep only the rows rated by both raters; .copy() makes df1 an independent DataFrame,
# so adding a column below does not trigger a SettingWithCopyWarning
df1 = df.dropna(axis=0, subset=['rater1', 'rater2'], how='any').copy()

# For the aggregate, the simplest option is the mean, although it increases the number of
# possible skin tones from 5 to 9!
df1['Skintone'] = (df1['rater1'] + df1['rater2']) / 2
df1.Skintone.value_counts()


0.250    38517
0.000    33723
0.125    17876
0.500     8989
1.000     7570
0.750     7079
0.375     5609
0.875     2841
0.625     2417
Name: Skintone, dtype: int64
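
To see why averaging raises the number of levels from 5 to 9, one can enumerate every pair of ratings. The sketch below assumes each rater uses five evenly spaced levels between 0 and 1, which is consistent with the minimum and maximum reported by `describe()`; the variable names are illustrative.

```python
from itertools import product

# Assumed rating scale for each rater: five evenly spaced levels
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

# Every distinct mean that two ratings can produce
means = sorted({(a + b) / 2 for a, b in product(levels, repeat=2)})

print(len(means))
print(means)
```

The in-between values (0.125, 0.375, 0.625, 0.875) appear exactly when the two raters disagree by one level.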

Many parameters simply won't help to discriminate between skin tones. We can drop them.

In [None]:
# Columns that will not be used as features, with the reason for dropping each
cols_to_drop = [
    'player',                 # The name is of no use; we keep playerShort as the identifier
    'photoID',                # Not informative
    'rater1', 'rater2',       # Not needed anymore
    'refNum',                 # Referee should be independent or at most correlated through country
    'refCountry', 'Alpha_3',  # It feels like cheating to look into the country of origin
    # We are going to group by playerShort, and referee data cannot be mixed
    'meanIAT', 'nIAT', 'seIAT', 'meanExp', 'nExp', 'seExp',
]
df1 = df1.drop(cols_to_drop, axis=1)

In [46]:
df1.head()

Unnamed: 0,playerShort,club,leagueCountry,birthday,height,weight,position,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,Skintone
0,lucas-wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,0,1,0,0,0,0,0.375
1,john-utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,0,1,0,1,0,0,0.75
5,aaron-hughes,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,1,0,0,1,0,0,0,0,0.125
6,aleksandar-kolarov,Manchester City,England,10.11.1985,187.0,80.0,Left Fullback,1,1,0,0,0,0,0,0,0.125
7,alexander-tettey,Norwich City,England,04.04.1986,180.0,68.0,Defensive Midfielder,1,0,0,1,0,0,0,0,1.0


In [59]:
df1.shape

(124621, 16)

In [61]:
# We eliminate all the rows with missing values in the remaining columns
df2 = df1.dropna(axis=0, how='any')
df2.shape

(115603, 16)

We have discarded about 7% of the rows (9,018 of 124,621), which we consider acceptable because over 100k entries remain.
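
Before dropping rows, it can help to see which columns carry the missing values. The following is a minimal sketch on a made-up frame (the values are illustrative and not taken from the dataset); on the real data the same calls could be applied to `df1`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df1; all values are made up
toy = pd.DataFrame({
    'playerShort': ['a', 'b', 'c', 'd'],
    'height': [180.0, np.nan, 175.0, 190.0],
    'weight': [75.0, 80.0, np.nan, np.nan],
    'position': ['Goalkeeper', None, 'Center Back', 'Left Winger'],
})

# Number of missing values per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Fraction of rows that dropna(how='any') would remove
dropped_fraction = 1 - len(toy.dropna(how='any')) / len(toy)
print(dropped_fraction)
```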

Now we split the data in two to make the aggregation by player easier, separating the summable features from the non-summable ones. We assume that each player stays at the same "club" (and consequently in the same "leagueCountry") and in the same "position" for the entire season (2012-2013).

In [78]:
#In both cases we keep the identifier
df_summable = df2.loc[:,["playerShort", "games", "victories", "ties", "defeats", "goals", "yellowCards", "yellowReds", "redCards"]]
df_non_summable = df2.loc[:,["playerShort", "club", "leagueCountry", "birthday", "height", "weight", "position", "Skintone"]]


In [79]:
# Sum the summable features for each player
df_g_summable = df_summable.groupby(['playerShort']).sum()
df_g_summable.head()

playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards
aaron-hughes,654,247,179,228,9,19,0,0
aaron-hunt,336,141,73,122,62,42,0,1
aaron-lennon,412,200,97,115,31,11,0,0
aaron-ramsey,260,150,42,68,39,31,0,1
abdelhamid-el-kaoutari,124,41,40,43,1,8,4,2


In [80]:
# Simply drop duplicates for the non-summables
df_g_non_summable = df_non_summable.drop_duplicates(subset='playerShort', keep='first').set_index(['playerShort'])
df_g_non_summable.sort_index().head()

playerShort,club,leagueCountry,birthday,height,weight,position,Skintone
aaron-hughes,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,0.125
aaron-hunt,Werder Bremen,Germany,04.09.1986,183.0,73.0,Attacking Midfielder,0.125
aaron-lennon,Tottenham Hotspur,England,16.04.1987,165.0,63.0,Right Midfielder,0.25
aaron-ramsey,Arsenal FC,England,26.12.1990,178.0,76.0,Center Midfielder,0.0
abdelhamid-el-kaoutari,Montpellier HSC,France,17.03.1990,180.0,73.0,Center Back,0.25
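
Keeping the first row per player is only safe if the non-summable columns really are constant for each player. One way to check this, sketched here on a made-up frame, is to count the distinct values per player; on the real data the same call could be run on `df_non_summable`.

```python
import pandas as pd

# Toy stand-in for df_non_summable; all values are made up
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'b', 'b'],
    'club': ['X', 'X', 'Y', 'Z'],
    'position': ['Center Back', 'Center Back', 'Goalkeeper', 'Goalkeeper'],
})

# Number of distinct values of each column, per player
distinct = toy.groupby('playerShort').nunique()

# Players for which at least one column takes more than one value
inconsistent = distinct[(distinct > 1).any(axis=1)].index.tolist()
print(inconsistent)
```

An empty list would support the assumption; any player listed has conflicting values, and `drop_duplicates` would silently keep only the first one.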


In [82]:
# We check that each has the same number of rows
print(df_g_summable.shape)
print(df_g_non_summable.shape)

(1419, 8)
(1419, 7)
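
Equal row counts do not by themselves guarantee that both frames describe the same players. A stricter check, sketched here with made-up frames standing in for `df_g_summable` and `df_g_non_summable`, compares the index labels directly.

```python
import pandas as pd

# Toy stand-ins for the two grouped frames; note the different row order
left = pd.DataFrame({'games': [10, 20]}, index=['a', 'b'])
right = pd.DataFrame({'club': ['X', 'Y']}, index=['b', 'a'])

# Compare the sorted index labels rather than only the shapes
same_players = left.index.sort_values().equals(right.index.sort_values())
print(same_players)
```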


In [102]:
# Merging the two again
df_by_player = pd.concat([df_g_non_summable, df_g_summable], axis=1, join='outer')
df_by_player.head()

Unnamed: 0,club,leagueCountry,birthday,height,weight,position,Skintone,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards
aaron-hughes,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,0.125,654,247,179,228,9,19,0,0
aaron-hunt,Werder Bremen,Germany,04.09.1986,183.0,73.0,Attacking Midfielder,0.125,336,141,73,122,62,42,0,1
aaron-lennon,Tottenham Hotspur,England,16.04.1987,165.0,63.0,Right Midfielder,0.25,412,200,97,115,31,11,0,0
aaron-ramsey,Arsenal FC,England,26.12.1990,178.0,76.0,Center Midfielder,0.0,260,150,42,68,39,31,0,1
abdelhamid-el-kaoutari,Montpellier HSC,France,17.03.1990,180.0,73.0,Center Back,0.25,124,41,40,43,1,8,4,2
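
As an alternative to splitting and concatenating, `groupby(...).agg` accepts a per-column mapping, so the sums and the first values can be computed in one pass. This is a sketch on made-up data, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for df2; all values are made up
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'b'],
    'club': ['X', 'X', 'Y'],
    'games': [3, 2, 5],
    'goals': [1, 0, 2],
})

# 'first' for the non-summable column, 'sum' for the summable ones
by_player = toy.groupby('playerShort').agg({
    'club': 'first',
    'games': 'sum',
    'goals': 'sum',
})
print(by_player)
```

Because a single call produces the result, there is no need to verify afterwards that two intermediate frames line up.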


An extra transformation that seems reasonable is to replace the "birthday" parameter with an "age" parameter.

In [103]:
# Convert to datetime and subtract the birth year from 2012, the season's year when the data was collected
df_by_player['age'] = pd.to_datetime(df_by_player.birthday).map(lambda x: 2012 - x.year)
df_by_player.age.head()

aaron-hughes              33
aaron-hunt                26
aaron-lennon              25
aaron-ramsey              22
abdelhamid-el-kaoutari    22
Name: age, dtype: int64
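
The birthdays are written as day.month.year, so passing an explicit `format` to `pd.to_datetime` removes any ambiguity between day and month when parsing. A sketch with two birthdays taken from the table above:

```python
import pandas as pd

birthdays = pd.Series(['08.11.1979', '04.09.1986'],
                      index=['aaron-hughes', 'aaron-hunt'])

# Parse with an explicit day.month.year format
parsed = pd.to_datetime(birthdays, format='%d.%m.%Y')

# Age as a difference of years, with 2012 as the reference year
age = 2012 - parsed.dt.year
print(age.tolist())
```

Note that a difference of years ignores whether the birthday has already passed in the reference year, so the result can be off by one compared with the exact age.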

In [104]:
# We can now drop the "birthday" parameter
df_by_player.drop('birthday', axis=1, inplace=True)
df_by_player.head()

Unnamed: 0,club,leagueCountry,height,weight,position,Skintone,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,age
aaron-hughes,Fulham FC,England,182.0,71.0,Center Back,0.125,654,247,179,228,9,19,0,0,33
aaron-hunt,Werder Bremen,Germany,183.0,73.0,Attacking Midfielder,0.125,336,141,73,122,62,42,0,1,26
aaron-lennon,Tottenham Hotspur,England,165.0,63.0,Right Midfielder,0.25,412,200,97,115,31,11,0,0,25
aaron-ramsey,Arsenal FC,England,178.0,76.0,Center Midfielder,0.0,260,150,42,68,39,31,0,1,22
abdelhamid-el-kaoutari,Montpellier HSC,France,180.0,73.0,Center Back,0.25,124,41,40,43,1,8,4,2,22


The data is ready. Now we can do some extra visualizations or go directly to the machine learning tasks.

#### 1.2 Visualization

### 2. Machine Learning

#### 2.1 Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
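
A minimal sketch of how a table shaped like `df_by_player` could feed a random forest, using randomly generated stand-in data. The binarisation threshold of 0.5, the chosen feature columns, and the hyperparameters are all assumptions made for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 200

# Random stand-in for df_by_player (values are not from the dataset)
toy = pd.DataFrame({
    'height': rng.normal(182, 7, n),
    'weight': rng.normal(76, 7, n),
    'position': rng.choice(['Goalkeeper', 'Center Back', 'Left Winger'], n),
    'games': rng.randint(1, 500, n),
    'Skintone': rng.choice([0.0, 0.125, 0.25, 0.5, 0.75, 1.0], n),
})

# Assumption: turn the averaged rating into two classes at 0.5
y = (toy['Skintone'] >= 0.5).astype(int)

# One-hot encode the categorical column; numeric columns pass through
X = pd.get_dummies(toy.drop('Skintone', axis=1), columns=['position'], dtype=float)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # accuracy on 5 folds
print(scores.mean())
```

Since the stand-in labels are random, the score here says nothing about the real task; the point is only the shape of the pipeline (encode, fit, cross-validate).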

#### 2.2 Unsupervised Learning 

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)  # Open question: is k = 2 the right choice?
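
A minimal sketch of what a k-means run with k = 2 could look like, on randomly generated stand-in features. The choice of features, the scaling step, and the `n_init` and `random_state` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Random stand-in for a few numeric player features
X = np.column_stack([
    rng.normal(182, 7, 100),   # height
    rng.normal(76, 7, 100),    # weight
    rng.poisson(20, 100),      # yellowCards
])

# Scale first so that no feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
print(np.bincount(labels))  # number of players in each cluster
```

On the real data, the cluster labels could then be compared with the binarised skin-tone labels to see whether the two groups coincide.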