# 04 - Applied ML

## Deadline
Tuesday November 22, 2016 at 11:59PM

## Background
In this homework we will gain experience with applied machine learning by exploring an interesting dataset about soccer players and referees.
You can find all the data in the `CrowdstormingDataJuly1st.csv` file, and a thorough [dataset description here](DATA.md).
Given that the focus of this homework is machine learning, I recommend that you first take a look at [this notebook](http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb),
which contains solid pre-processing and visualization work on the given dataset. You are *not* allowed to just copy/paste the pre-processing steps
performed by the notebook authors; you are still expected to perform your own data analysis for the homework. Still, I'm confident that first consulting
the work done by expert data analysts will tangibly speed up your effort (i.e., they have already found many glitches in the data for you :)


## Assignment
1. Train a `sklearn.ensemble.RandomForestClassifier` that, given a soccer player's description, outputs the player's skin color. Show how different parameters
passed to the classifier affect overfitting. Perform cross-validation to mitigate the overfitting of your model. Once you have assessed your model,
inspect the `feature_importances_` attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even
before feeding them to the classifier), can you obtain a substantially different `feature_importances_` attribute?

  *BONUS*: plot the learning curves for at least 2 different sets of parameters passed to your Random Forest. To obtain smooth curves, partition
your data into at least 20 folds. Can you find a set of parameters that leads to high bias, and one that does not?

2. Aggregate the referee information grouping by soccer player, and use an unsupervised learning technique to cluster the soccer players into 2 disjoint
clusters. Remove features iteratively, and at each step perform the clustering again and compute the silhouette score -- can you find a configuration of features with a high silhouette
score where players with dark and light skin colors belong to different clusters? Discuss the obtained results.


----

# Data Description

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.
 
Player photos were available from the source for 1586 out of 2053 players. Players’ skin tone was coded by two independent raters blind to the research question who, based on their profile photo, categorized players on a 5-point scale ranging from “very light skin” to “very dark skin” with “neither dark nor light skin” as the center value. 

Additionally, implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both these measures were created by aggregating data from many online users in referee countries taking these tests on [Project Implicit](http://projectimplicit.net).

In all, the dataset has a total of 146028 dyads of players and referees. A detailed description of all variables in the dataset can be seen in the list below.

## Variables:

*playerShort* - short player ID

*player* - player name

*club* - player club

*leagueCountry* - country of player club (England, Germany, France, and Spain)

*birthday* - player birthday

*height* - player height (in cm)

*weight* - player weight (in kg)

*position* - detailed player position

*games* - number of games in the player-referee dyad

*victories* - victories in the player-referee dyad

*ties* - ties in the player-referee dyad

*defeats* - losses in the player-referee dyad

*goals* - goals scored by a player in the player-referee dyad

*yellowCards* - number of yellow cards player received from referee

*yellowReds* - number of yellow-red cards player received from referee

*redCards* - number of red cards player received from referee

*photoID* - ID of player photo (if available)

*rater1* - skin rating of photo by rater 1 (5-point scale ranging from “very light skin” to “very dark skin”)

*rater2* - skin rating of photo by rater 2 (5-point scale ranging from “very light skin” to “very dark skin”)

*refNum* - unique referee ID number (referee name removed for anonymizing purposes)

*refCountry* - unique referee country ID number (country name removed for anonymizing purposes)

*meanIAT* - mean implicit bias score (using the race IAT) for referee country, higher values correspond to faster white | good, black | bad associations

*nIAT* - sample size for race IAT in that particular country

*seIAT* - standard error for mean estimate of race IAT

*meanExp* - mean explicit bias score (using a racial thermometer task) for referee country, higher values correspond to greater feelings of warmth toward whites versus blacks

*nExp* - sample size for explicit bias in that particular country

*seExp* - standard error for mean estimate of explicit bias measure


---
# Links & Resources

sklearn : 
* [Feature importances with forests of trees](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)
* [Label encoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html)
* [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

---
# Experiments

In [3]:
# Import stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline


In [4]:
filename='CrowdstormingDataJuly1st.csv'
df = pd.read_csv(filename)

In [6]:
df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
13281,adrien-rabiot,Adrien Rabiot,Paris Saint-Germain,France,03.04.1995,188.0,71.0,Defensive Midfielder,3,2,...,0.0,282,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586
123622,eloge-enza-yamissi,Eloge Enza Yamissi,ESTAC Troyes,France,23.01.1983,175.0,70.0,,4,0,...,,2635,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586
118400,cheik-tiote,Cheik Tioté,Newcastle United,England,21.06.1986,180.0,76.0,Defensive Midfielder,4,3,...,1.0,2521,64,NLD,0.35292,5952.0,7e-05,0.445679,6121.0,0.000269
132576,remy-cabella,Rémy Cabella,Montpellier HSC,France,08.03.1990,171.0,62.0,Attacking Midfielder,1,0,...,0.0,2814,45,SCOT,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
116199,luis-suarez_5,Luis Suárez,Liverpool FC,England,24.01.1987,182.0,81.0,Right Winger,7,4,...,0.25,2430,64,NLD,0.35292,5952.0,7e-05,0.445679,6121.0,0.000269
22215,bryan-ruiz,Bryan Ruiz,Fulham FC,England,18.08.1985,186.0,70.0,Right Winger,1,1,...,0.25,429,94,GTM,0.344396,263.0,0.00153,0.829787,282.0,0.007309
76582,david-vaughan,David Vaughan,Sunderland AFC,England,18.02.1983,170.0,70.0,,1,0,...,,1664,112,EST,0.429315,201.0,0.002194,1.180488,205.0,0.012096
4079,fabio-coentrao,Fábio Coentrão,Real Madrid,Spain,11.03.1988,179.0,70.0,Left Fullback,1,1,...,0.25,113,52,RUS,0.398174,526.0,0.000809,1.212727,550.0,0.004521
129803,milan-bisevac,Milan Biševac,Olympique Lyon,France,31.08.1983,185.0,81.0,Center Back,8,3,...,0.25,2792,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586
7391,jeremy-mathieu,Jérémy Mathieu,Valencia CF,Spain,29.10.1983,192.0,83.0,Left Fullback,2,0,...,0.0,172,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586


In [10]:
df['skintone']=(df['rater1']+df['rater2'])/2
df['allreds']=df['yellowReds']+df['redCards']
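
To make the two derived columns concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` values are made up for illustration and are not taken from the dataset). It applies the same formulas as the cell above and shows that a player without ratings ends up with a `NaN` skin tone, while `allreds` simply adds the two kinds of red cards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dyads; the last player has no photo, hence no ratings.
toy = pd.DataFrame({
    "rater1": [0.0, 0.25, np.nan],
    "rater2": [0.25, 0.25, np.nan],
    "yellowReds": [0, 1, 0],
    "redCards": [1, 0, 0],
})

# Same formulas as in the notebook cell.
toy["skintone"] = (toy["rater1"] + toy["rater2"]) / 2
toy["allreds"] = toy["yellowReds"] + toy["redCards"]

print(toy[["skintone", "allreds"]])
```

Because the sum of a number and `NaN` is `NaN`, a rating missing from either rater is enough to leave `skintone` undefined for that row.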

In [11]:
df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,skintone,allreds,allredsStrict
38873,frederic-sammaritano,Frédéric Sammaritano,AC Ajaccio,France,23.03.1986,162.0,61.0,,3,0,...,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,0.25,0,0
133281,ali-al-habsi,Ali Al Habsi,Wigan Athletic,England,30.12.1981,194.0,88.0,Goalkeeper,2,1,...,MYS,0.39375,862.0,0.000485,0.599356,931.0,0.001805,,0,0
19505,rob-friend,Rob Friend,Eintracht Frankfurt,Germany,23.01.1981,195.0,94.0,Center Forward,2,1,...,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.0,0,0
27029,klaas-jan-huntelaar,Klaas-Jan Huntelaar,FC Schalke 04,Germany,12.08.1983,186.0,78.0,Center Forward,4,2,...,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.125,0,0
117899,granddi-ngoyi,Granddi Ngoyi,ESTAC Troyes,France,17.05.1988,186.0,77.0,Defensive Midfielder,1,0,...,ITA,0.386174,1761.0,0.000232,0.529815,1895.0,0.001091,,0,0
22208,alex_4,Alex,Paris Saint-Germain,France,17.06.1982,188.0,92.0,Center Back,1,1,...,GTM,0.344396,263.0,0.00153,0.829787,282.0,0.007309,0.5,0,0
51051,abdoul-camara,Abdoul Camara,FC Sochaux,France,20.02.1990,177.0,70.0,Left Winger,5,1,...,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,,0,0
10971,kenwyne-jones,Kenwyne Jones,Stoke City,England,05.10.1984,188.0,78.0,Right Winger,2,0,...,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,1.0,0,0
22999,iago-aspas,Iago Aspas,Celta Vigo,Spain,01.08.1987,176.0,67.0,,6,2,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0,0
11963,artur-boruc,Artur Boruc,Southampton FC,England,20.02.1980,193.0,88.0,Goalkeeper,5,1,...,ITA,0.386174,1761.0,0.000232,0.529815,1895.0,0.001091,0.125,0,0


This step is explained in [this notebook](https://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb). The idea is to "disaggregate" the data: each row of the current `df` is a player-referee dyad, which sums up every game the two have in common. We split each dyad so that every row corresponds to a single game between a player and a referee, rather than to the total over all of their games.

In [15]:
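# Pre-allocate one slot per game: each dyad is expanded into `games` rows
disag = [None] * int(df['games'].sum())
j = 0

for _, row in df.iterrows():
    reds_row = row['allreds']
    for game in range(int(row['games'])):
        # The first `allreds` games of the dyad get a red card, the others get none
        row['allreds'] = 1 if (reds_row - game > 0) else 0
        disag[j] = list(row)
        j += 1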
        
pd.DataFrame(disag, columns=list(df.columns)).to_csv('crowdstorm_disaggregated.csv', index=False) 
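
Row-by-row iteration with `iterrows` tends to be slow on a frame of this size. As a hedged alternative, the same expansion can be sketched with vectorized pandas operations. The function name `disaggregate` and the `toy` frame are hypothetical, and the sketch assumes the input has a unique index and integer `games` and `allreds` columns.

```python
import pandas as pd


def disaggregate(dyads: pd.DataFrame) -> pd.DataFrame:
    """Expand each dyad into one row per game (assumes a unique index)."""
    # Repeat every dyad row `games` times.
    out = dyads.loc[dyads.index.repeat(dyads["games"])].copy()
    # Position of each repeated row inside its dyad: 0, 1, ..., games - 1.
    game_idx = out.groupby(level=0).cumcount()
    # As in the loop: the first `allreds` games carry a red card, the rest none.
    out["allreds"] = (game_idx.to_numpy() < out["allreds"].to_numpy()).astype(int)
    return out.reset_index(drop=True)


# Hypothetical toy data: player "a" met referee 1 three times with one red card.
toy = pd.DataFrame({
    "playerShort": ["a", "b"],
    "refNum": [1, 2],
    "games": [3, 2],
    "allreds": [1, 0],
})
flat = disaggregate(toy)
print(flat)
```

The result has one row per game (five rows here), and the total number of red cards is unchanged.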

The authors of that notebook noticed that many referees have fewer than 22 dyads (the median is only 11). This should not be possible: there are 22 players on the pitch during a game, so a referee who officiates a single game should have a dyad with each of those 22 players. In other words, any referee with at least one game should have at least 22 dyads.

Apparently the issue is that the referee numbers cover each player's entire career: e.g. if a referee gave a player a red card in 2002, it appears in the dyad between the two. However, the player data only covers the 2012-2013 season. We therefore filter the data so that every referee has at least 22 entries (fewer than that corresponds to old interactions). Note that the cell below counts rows of the disaggregated file, i.e. one row per game rather than one row per dyad.

In [17]:
dfd = pd.read_csv('crowdstorm_disaggregated.csv')

allRefs = dfd.refNum.value_counts()
goodRefs = allRefs[allRefs > 21]
# Adapted from
# http://stackoverflow.com/questions/12065885/how-to-filter-the-dataframe-rows-of-pandas-by-within-in
#
# Keep only the rows whose referee passes the filter above (more than 21 rows)
disag_good = dfd[dfd['refNum'].isin(goodRefs.index.values)]
disag_good.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,skintone,allreds,allredsStrict
289820,pavle-ninkov,Pavle Ninkov,Toulouse FC,France,20.04.1985,183.0,81.0,Right Fullback,4,3,...,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,,0,0
149793,yohan-cabaye,Yohan Cabaye,Newcastle United,England,14.01.1986,175.0,69.0,Defensive Midfielder,12,5,...,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,0.0,0,0
154311,calatayud,Calatayud,RCD Mallorca,Spain,21.12.1979,191.0,88.0,Goalkeeper,1,1,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.0,0,0
65153,pablo-sarabia,Pablo Sarabia,Getafe CF,Spain,11.05.1992,174.0,70.0,Left Midfielder,6,0,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0,0
288198,movilla,Movilla,Real Zaragoza,Spain,08.02.1975,171.0,70.0,Defensive Midfielder,17,3,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0,0
35573,arribas_2,Arribas,CA Osasuna,Spain,01.05.1989,185.0,77.0,Center Back,9,1,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0,0
8990,moya,Moyá,Getafe CF,Spain,02.04.1984,189.0,82.0,Goalkeeper,11,4,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.125,0,0
389011,joleon-lescott,Joleon Lescott,Manchester City,England,16.08.1982,188.0,83.0,Center Back,12,5,...,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.5,0,0
329535,kevin-grosskreutz,Kevin Großkreutz,Borussia Dortmund,Germany,19.07.1988,186.0,72.0,Left Winger,19,10,...,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.0,0,0
200734,emir-spahic,Emir Spahić,Sevilla FC,Spain,18.08.1980,185.0,80.0,Center Back,3,1,...,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0,0
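
The threshold above is applied to the disaggregated rows. If one instead wants to count dyads, as the reasoning about 22 players per game suggests, a minimal sketch could look as follows. The helper name `keep_refs_with_min_dyads` and the toy data are hypothetical, and applying the filter before disaggregation is an assumption rather than what the cell above does.

```python
import pandas as pd


def keep_refs_with_min_dyads(dyads: pd.DataFrame, min_dyads: int = 22) -> pd.DataFrame:
    """Keep only the dyads whose referee appears in at least `min_dyads` dyads."""
    # Broadcast each referee's dyad count back onto that referee's rows.
    dyads_per_ref = dyads.groupby("refNum")["refNum"].transform("size")
    return dyads[dyads_per_ref >= min_dyads]


# Hypothetical toy data: referee 1 has 22 dyads, referee 2 only 3.
toy = pd.DataFrame({
    "refNum": [1] * 22 + [2] * 3,
    "playerShort": [f"p{i}" for i in range(25)],
})
kept = keep_refs_with_min_dyads(toy)
print(kept["refNum"].value_counts())
```

Counting dyads and counting games give different sets of referees, so whichever definition is used should be stated alongside the results.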


---
# 1. Random Forest Classifier

We want to train a `RandomForestClassifier` to predict a player's skin color based on the player's description.

In [46]:
# Import stuff
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np


### 1.1 Loading the data

The first step is of course to load the data provided to us in `CrowdstormingDataJuly1st.csv`. Its fields are described above for reference.

In [27]:
filename='CrowdstormingDataJuly1st.csv'
df = pd.read_csv(filename)
df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
83069,tim-krul,Tim Krul,Newcastle United,England,03.04.1988,188.0,74.0,Goalkeeper,1,1,...,0.25,1811,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
95981,danny-fox,Danny Fox,Southampton FC,England,29.05.1986,183.0,78.0,Left Fullback,1,0,...,,2060,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
73091,predrag-stevanovic,Predrag Stevanović,Werder Bremen,Germany,03.03.1991,178.0,63.0,Attacking Midfielder,1,0,...,0.25,1605,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
16022,marc-planus,Marc Planus,Girondins Bordeaux,France,07.03.1982,183.0,76.0,Center Back,3,0,...,0.25,343,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586
144806,marc-andre-ter-stegen,Marc-André ter Stegen,Bor. Mönchengladbach,Germany,30.04.1992,189.0,85.0,Goalkeeper,10,3,...,0.25,3099,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
84857,marcel-schmelzer,Marcel Schmelzer,Borussia Dortmund,Germany,22.01.1988,181.0,74.0,Left Fullback,1,0,...,0.0,1846,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
64429,palop,Palop,Sevilla FC,Spain,22.10.1973,184.0,83.0,Goalkeeper,1,0,...,0.25,1353,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
52427,federico-macheda,Federico Macheda,Manchester United,England,22.08.1991,185.0,72.0,Center Forward,1,0,...,0.0,1014,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
38136,sebastian-polter,Sebastian Polter,1. FC Nürnberg,Germany,01.04.1991,190.0,88.0,Center Forward,3,0,...,0.0,681,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
20240,idrissa-gueye,Idrissa Gueye,Lille OSC,France,26.09.1989,174.0,64.0,Defensive Midfielder,1,0,...,1.0,393,6,MAR,0.322177,140.0,0.003344,0.117647,136.0,0.013721


### 1.2 Cleaning the data

A few things have to be done to clean the data before we feed it to the classifier.

First, let's average the two skin color ratings into a single score, which will act as the label for the classifier, and remove the two rating columns.

In [35]:
df['skintone']=(df['rater1']+df['rater2'])/2
# Pass `columns=` explicitly: the positional `axis` argument is no longer accepted by recent pandas
df = df.drop(columns=['rater1', 'rater2'])

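Second, let's remove the players who don't have this rating, since we can't train the classifier on examples without a label. They are currently stored with a `NaN` score in the `skintone` column. Note that `dropna(axis=0)` without a `subset` drops every row that has a missing value in *any* column, so rows with a missing `position`, `height`, etc. are removed as well.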

In [88]:
clean_df = df.copy()
clean_df = clean_df.dropna(axis=0)
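
To illustrate the difference between dropping rows with any missing value and dropping only unlabeled rows, here is a small sketch on a hypothetical frame (the `toy` values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" lacks a position, "c" lacks a skin tone rating.
toy = pd.DataFrame({
    "playerShort": ["a", "b", "c"],
    "position": ["Goalkeeper", None, "Center Back"],
    "skintone": [0.25, 0.5, np.nan],
})

# What the cell above does: drop rows with a missing value in any column.
any_missing = toy.dropna(axis=0)
# Alternative: drop only the rows that have no label.
label_only = toy.dropna(subset=["skintone"])

print(len(any_missing), len(label_only))
```

With `subset=["skintone"]` more labeled rows survive, but the remaining missing feature values would then have to be handled (for example imputed) before encoding.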

In [89]:
clean_df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,skintone
2219,vincent-kompany,Vincent Kompany,Manchester City,England,10.04.1986,190.0,85.0,Center Back,5,4,...,77,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.5
18066,dariusz-dudka,Dariusz Dudka,Levante UD,Spain,09.12.1983,183.0,80.0,Defensive Midfielder,1,0,...,363,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,0.25
101519,mesut-oezil,Mesut Özil,Real Madrid,Spain,15.10.1988,183.0,76.0,Attacking Midfielder,5,4,...,2176,48,ITA,0.386174,1761.0,0.000232,0.529815,1895.0,0.001091,0.125
137396,sven-bender,Sven Bender,Borussia Dortmund,Germany,27.04.1989,185.0,72.0,Defensive Midfielder,2,0,...,2931,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.0
131380,sidney-govou,Sidney Govou,Évian Thonon Gaillard,France,27.07.1979,175.0,72.0,Right Winger,9,6,...,2797,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,0.875
91999,fernando-torres,Fernando Torres,Chelsea FC,England,20.03.1984,183.0,70.0,Center Forward,1,1,...,1971,48,ITA,0.386174,1761.0,0.000232,0.529815,1895.0,0.001091,0.125
49118,lars-stindl,Lars Stindl,Hannover 96,Germany,26.08.1988,180.0,78.0,Attacking Midfielder,2,0,...,936,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.25
101947,brandao,Brandão,AS Saint-Étienne,France,16.06.1980,189.0,78.0,Center Forward,5,1,...,2181,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,0.5
142164,nelson-valdez,Nelson Valdez,Valencia CF,Spain,28.11.1983,178.0,73.0,Center Forward,2,1,...,3037,52,RUS,0.398174,526.0,0.000809,1.212727,550.0,0.004521,0.25
41439,marcell-jansen,Marcell Jansen,Hamburger SV,Germany,04.11.1985,191.0,89.0,Left Midfielder,7,3,...,758,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.0


We can now extract the labels that will be used to train the classifier.

In [121]:
labels = np.array(clean_df['skintone'])
le_labels = LabelEncoder()
labels = le_labels.fit_transform(labels)
labels

array([3, 6, 1, ..., 3, 2, 1])
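
Since each rater uses a 5-point scale from 0 to 1, the averaged score can take up to nine distinct values (0, 0.125, ..., 1), and `LabelEncoder` maps each of them to an integer class. The sketch below, on a hypothetical array of scores, shows how to inspect that mapping through `classes_` and how to go back with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical averaged skin tone scores.
skintone = np.array([0.25, 0.75, 0.125, 0.25])

encoder = LabelEncoder()
codes = encoder.fit_transform(skintone)

# classes_ holds the sorted distinct values; a code is an index into it.
print(encoder.classes_)
print(codes)

# inverse_transform expects a 1-D array of codes and returns the original values.
decoded = encoder.inverse_transform(codes)
```

Treating the scores as unordered classes discards their ordering; whether that is acceptable depends on how the predictions are going to be evaluated.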

We'll start by using all the features provided by the dataset to train the classifier, later figuring out which ones are actually interesting.

In [122]:
features = clean_df.drop(columns='skintone')
f = features.copy()

le = LabelEncoder()
for col in features.columns.values:
    f[col] = le.fit_transform(features[col])
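
The loop above also re-encodes numeric columns such as `height` or `games`, replacing each value by its rank among the distinct values. A possible variant, sketched here under the assumption that only non-numeric columns need encoding, leaves numeric columns untouched. The helper name `encode_categoricals` and the toy frame are hypothetical.

```python
import pandas as pd


def encode_categoricals(features: pd.DataFrame) -> pd.DataFrame:
    """Replace every non-numeric column by integer category codes."""
    encoded = features.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        # Codes follow the sorted order of the distinct values; missing values become -1.
        encoded[col] = encoded[col].astype("category").cat.codes
    return encoded


# Hypothetical toy features.
toy = pd.DataFrame({
    "club": ["Real Madrid", "Fulham FC", "Real Madrid"],
    "height": [179.0, 186.0, 185.0],
})
encoded = encode_categoricals(toy)
print(encoded)
```

Identifier-like columns such as `player` or `playerShort` still become features this way, so it is worth considering whether they should be dropped before training.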

### 1.3 Training the classifier

We've now got our training data so let's try to train the `RandomForestClassifier`.

In [112]:
forest = RandomForestClassifier(n_estimators = 100)
clf = forest.fit(f, labels)
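
To see how a parameter such as `max_depth` affects overfitting, one option is to compare the training score with the cross-validated score. The sketch below uses synthetic data from `make_classification` as a stand-in for `f` and `labels`; the parameter values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features and labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

results = {}
for max_depth in (2, None):  # a shallow forest vs. fully grown trees
    forest = RandomForestClassifier(n_estimators=50, max_depth=max_depth, random_state=0)
    cv = cross_validate(forest, X, y, cv=5, return_train_score=True)
    # Mean accuracy on the training folds and on the held-out folds.
    results[max_depth] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(max_depth, results[max_depth])
```

A large gap between the two numbers for a given setting suggests overfitting; two low numbers that are close to each other suggest high bias.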

In [123]:
# Get a player
clean_df.iloc[[2]]['skintone']

5    0.125
Name: skintone, dtype: float64

In [125]:
# See what the predicted skin tone is (note: this row was part of the training data)
le_labels.inverse_transform(clf.predict(f.iloc[[2]]))[0]

0.125
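
The check above uses a row that the forest was trained on, so the match does not tell us how well the model generalizes. In addition, each player appears in many dyads, so a random split would put rows of the same player in both the training and the test set. One way to avoid this, sketched here on synthetic data, is `GroupKFold` with the player as the group. The data generation below is entirely hypothetical; with the real data the groups would presumably be the `playerShort` column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_players, rows_per_player = 40, 5

# Each synthetic "player" contributes several nearly identical rows (the dyads).
groups = np.repeat(np.arange(n_players), rows_per_player)
player_features = rng.normal(size=(n_players, 3))
X = player_features[groups] + rng.normal(scale=0.1, size=(len(groups), 3))
y = (player_features[:, 0] > 0).astype(int)[groups]

cv = GroupKFold(n_splits=5)
# No player may appear on both sides of a split.
for train_idx, test_idx in cv.split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(forest, X, y, groups=groups, cv=cv)
print(scores.mean())
```

Scores obtained this way estimate how the classifier behaves on players it has never seen.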