# 04 - Applied ML

## Deadline
Tuesday November 22, 2016 at 11:59PM

## Background
In this homework we will gain experience in applied machine learning by exploring an interesting dataset about soccer players and referees.
You can find all the data in the `CrowdstormingDataJuly1st.csv` file, and you can read a thorough [dataset description here](DATA.md).
Given that the focus of this homework is machine learning, I recommend that you first take a look at [this notebook](http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb),
which contains solid pre-processing and visualization work on the given dataset. You are *not* allowed to just copy/paste the pre-processing steps
performed by the notebook authors -- you are still supposed to perform your own data analysis for the homework. Still, I'm confident that first consulting
the work done by expert data analysts will tangibly speed up your effort (i.e., they have already found many glitches in the data for you :)


## Assignment
1. Train a `sklearn.ensemble.RandomForestClassifier` that, given a soccer player description, outputs his skin color. Show how different parameters
passed to the Classifier affect the overfitting issue. Perform cross-validation to mitigate the overfitting of your model. Once you have assessed your model,
inspect the `feature_importances_` attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even
before feeding them to the classifier), can you obtain a substantially different `feature_importances_` attribute?

  *BONUS*: plot the learning curves against at least 2 different sets of parameters passed to your Random Forest. To obtain smooth curves, partition
your data in at least 20 folds. Can you find a set of parameters that leads to high bias, and one which does not?

2. Aggregate the referee information grouping by soccer player, and use an unsupervised learning technique to cluster the soccer players in 2 disjoint
clusters. Remove features iteratively, and at each step perform again the clustering and compute the silhouette score -- can you find a configuration of features with high silhouette
score where players with dark and light skin colors belong to different clusters? Discuss the obtained results.
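
As a rough illustration of the loop described in task 2, the sketch below clusters a synthetic table into 2 groups and recomputes the silhouette score after dropping each feature in turn. The data, the feature layout and the choice of `KMeans` are assumptions made for this example; the real input would be the per-player aggregate of the dyads.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the per-player table (assumption): features 0 and 1
# separate the two groups, feature 2 is pure noise (same centre for both).
X_toy, _ = make_blobs(n_samples=200, centers=[[0, 0, 0], [8, 8, 0]],
                      cluster_std=1.0, random_state=0)


def silhouette_for(X):
    """Cluster X into 2 groups with KMeans and return the silhouette score."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)


# Score with every feature, then with each feature removed in turn
baseline = silhouette_for(X_toy)
scores = {col: silhouette_for(np.delete(X_toy, col, axis=1))
          for col in range(X_toy.shape[1])}

print("all features:", round(baseline, 3))
for col, score in scores.items():
    print("without feature", col, ":", round(score, 3))
```

In this toy setup, removing the noise feature should give the highest score. On the real data the extra step is to compare the cluster assignment with the skin tone of the players in each cluster.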


----

# Data Description

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.
 
Player photos were available from the source for 1586 out of 2053 players. Players’ skin tone was coded by two independent raters blind to the research question who, based on their profile photo, categorized players on a 5-point scale ranging from “very light skin” to “very dark skin” with “neither dark nor light skin” as the center value. 

Additionally, implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both these measures were created by aggregating data from many online users in referee countries taking these tests on [Project Implicit](http://projectimplicit.net).

In all, the dataset has a total of 146028 dyads of players and referees. A detailed description of all variables in the dataset can be seen in the list below.

## Variables:

*playerShort* - short player ID

*player* - player name

*club* - player club

*leagueCountry* - country of player club (England, Germany, France, and Spain)

*birthday* - player birthday

*height* - player height (in cm)

*weight* - player weight (in kg)

*position* - detailed player position

*games* - number of games in the player-referee dyad

*victories* - victories in the player-referee dyad

*ties* - ties in the player-referee dyad

*defeats* - losses in the player-referee dyad

*goals* - goals scored by a player in the player-referee dyad

*yellowCards* - number of yellow cards player received from referee

*yellowReds* - number of yellow-red cards player received from referee

*redCards* - number of red cards player received from referee

*photoID* - ID of player photo (if available)

*rater1* - skin rating of photo by rater 1 (5-point scale ranging from “very light skin” to “very dark skin”)

*rater2* - skin rating of photo by rater 2 (5-point scale ranging from “very light skin” to “very dark skin”)

*refNum* - unique referee ID number (referee name removed for anonymizing purposes)

*refCountry* - unique referee country ID number (country name removed for anonymizing purposes)

*meanIAT* - mean implicit bias score (using the race IAT) for referee country, higher values correspond to faster white | good, black | bad associations

*nIAT* - sample size for race IAT in that particular country

*seIAT* - standard error for mean estimate of race IAT

*meanExp* - mean explicit bias score (using a racial thermometer task) for referee country, higher values correspond to greater feelings of warmth toward whites versus blacks

*nExp* - sample size for explicit bias in that particular country

*seExp* - standard error for mean estimate of explicit bias measure


---
# Links & Resources

sklearn : 
* [Feature importances with forests of trees](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)
* [Label encoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html)
* [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

---
# Experiments

In [3]:
# Import stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline


In [4]:
filename='CrowdstormingDataJuly1st.csv'
df = pd.read_csv(filename)

In [5]:
df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
136466,sejad-salihovic,Sejad Salihović,1899 Hoffenheim,Germany,08.10.1984,180.0,81.0,Center Midfielder,14,3,...,0.25,2902,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
79102,ryan-bertrand,Ryan Bertrand,Chelsea FC,England,05.08.1989,179.0,85.0,Left Fullback,3,2,...,0.75,1707,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
60223,lisandro-lopez,Lisandro López,Olympique Lyon,France,02.03.1983,175.0,72.0,Center Forward,1,1,...,0.25,1219,15,TUR,0.354707,656.0,0.000606,0.182081,692.0,0.002717
114531,mapou-yanga-mbiwa,Mapou Yanga-Mbiwa,Montpellier HSC,France,15.05.1989,184.0,77.0,Center Back,4,1,...,1.0,2412,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586
1006,leo-franco,Leo Franco,Real Zaragoza,Spain,20.05.1977,188.0,79.0,Goalkeeper,1,0,...,0.0,66,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
82946,peter-crouch,Peter Crouch,Stoke City,England,30.01.1981,202.0,80.0,Center Forward,1,0,...,0.0,1811,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
37895,igor-de-camargo,Igor de Camargo,Bor. Mönchengladbach,Germany,12.05.1983,187.0,83.0,Center Forward,2,0,...,0.5,681,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
6123,sebastien-squillaci,Sébastien Squillaci,Arsenal FC,England,11.08.1980,183.0,76.0,Center Back,1,0,...,0.25,139,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
94664,ivan-klasnic,Ivan Klasnić,1. FSV Mainz 05,Germany,29.01.1980,186.0,82.0,Center Forward,1,1,...,0.0,2036,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
88221,steven-sidwell,Steven Sidwell,Fulham FC,England,14.12.1982,178.0,70.0,Center Midfielder,2,1,...,0.25,1915,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05


In [6]:
df['skintone']=(df['rater1']+df['rater2'])/2
df['allreds']=df['yellowReds']+df['redCards']

In [7]:
df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,skintone,allreds
140723,giorgios-karagounis,Giorgios Karagounis,Fulham FC,England,06.03.1977,176.0,83.0,Attacking Midfielder,2,0,...,32,CHE,0.345305,1886.0,0.000219,0.377193,1938.0,0.000823,0.0,0
60671,mikele-leigertwood,Mikele Leigertwood,Reading FC,England,12.11.1982,186.0,72.0,,4,1,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.5,0
50068,goran-popov,Goran Popov,West Bromwich Albion,England,02.10.1984,189.0,89.0,Left Fullback,1,1,...,58,BEL,0.36272,3219.0,0.000128,0.568785,3351.0,0.000575,0.125,0
115365,javier-pastore,Javier Pastore,Paris Saint-Germain,France,20.06.1989,187.0,78.0,Attacking Midfielder,2,2,...,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,0.25,0
119698,srdjan-lakic,Srđan Lakić,VfL Wolfsburg,Germany,02.10.1983,186.0,82.0,Center Forward,1,1,...,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.0,0
16588,lukas-rupp,Lukas Rupp,Bor. Mönchengladbach,Germany,08.01.1991,178.0,73.0,Right Winger,2,0,...,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.25,0
30973,matthew-upson,Matthew Upson,Stoke City,England,18.04.1979,185.0,72.0,Center Back,3,2,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.125,0
17956,yannick-sagbo,Yannick Sagbo,Évian Thonon Gaillard,France,12.07.1988,183.0,78.0,Center Forward,3,1,...,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,,0
20118,maurice-edu,Maurice Edu,Stoke City,England,18.04.1986,183.0,85.0,Defensive Midfielder,2,2,...,45,SCOT,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.875,0
6323,markel-bergara,Markel Bergara,Real Sociedad,Spain,05.05.1986,181.0,78.0,Defensive Midfielder,3,2,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.0,0


This step is explained in [this notebook](https://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb). The idea is to "disaggregate" the data: each row of the current `df` is a dyad that sums up all interactions between a player and a referee, who may have met in several games. We therefore split every dyad so that each row describes a single game between a player and a referee, not the total over all their games.

In [9]:
#disag = [0 for _ in range(sum(df['games']))]
#j=0

#for _, row in df.iterrows():
#    reds_row = row['allreds']
#    for game in range(row['games']):
#        row['allreds'] = 1 if (reds_row - game > 0) else 0
#        disag[j] = list(row)
#        j+=1
        
#pd.DataFrame(disag, columns=list(df.columns)).to_csv('crowdstorm_disaggregated.csv', index=False) 
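
The same disaggregation can be sketched without an explicit Python loop. The example below is a minimal version on a toy frame with made-up values: each dyad is repeated once per game, and the first `allreds` games of a dyad are marked with a red card, which mirrors the `reds_row - game > 0` rule of the loop above.

```python
import pandas as pd

# Toy dyads (hypothetical values): one row per player-referee pair
dyads = pd.DataFrame({
    'playerShort': ['a', 'b'],
    'refNum': [1, 1],
    'games': [3, 2],
    'allreds': [1, 0],
})

# Repeat each dyad once per game the pair played together
games = dyads.loc[dyads.index.repeat(dyads['games'])].copy()

# Rank of each game inside its dyad: 0, 1, 2, ...
game_no = games.groupby(level=0).cumcount()

# The first `allreds` games of a dyad carry a red card, the others none
games['allreds'] = (game_no < games['allreds']).astype(int)
games = games.reset_index(drop=True)

print(games)
```

The total number of red cards is preserved, and the number of rows equals the sum of the `games` column.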

The notebook authors noticed that many referees have fewer than 22 dyads (the median is only 11), which should not be possible: there are 22 players on the pitch at kick-off, so a referee who officiates a single game has a dyad with each of those 22 players. In other words, any referee with at least one game should have at least 22 dyads.

Apparently the issue is that the referee data cover the players' entire careers: if a referee gave a red card to a player in 2002, it appears in the dyad between the two. The player list, however, only covers the 2012-2013 season. So we filter the data to keep only referees with at least 22 rows in the disaggregated data (one row per player per game); referees with fewer rows correspond to old interactions.

In [10]:
dfd = pd.read_csv('crowdstorm_disaggregated.csv')

allRefs = dfd.refNum.value_counts()
goodRefs = allRefs[allRefs > 21]
#Copying from 
#http://stackoverflow.com/questions/12065885/how-to-filter-the-dataframe-rows-of-pandas-by-within-in
#
#This line defines a new dataframe based on our >21 games filter
disag_good = dfd[dfd['refNum'].isin(goodRefs.index.values)]
disag_good.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,skintone,allreds
34783,angel,Ángel,Real Betis,Spain,10.03.1981,180.0,71.0,Right Fullback,5,3,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0
198108,matthew-lowton,Matthew Lowton,Aston Villa,England,09.06.1989,180.0,78.0,,7,3,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,,0
303912,fabricio-coloccini,Fabricio Coloccini,Newcastle United,England,22.01.1982,183.0,76.0,Center Back,4,2,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.25,0
297515,noel-hunt,Noel Hunt,Reading FC,England,26.12.1982,173.0,72.0,,3,0,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,,0
68963,pepe_2,Pepe,Real Madrid,Spain,26.02.1983,188.0,81.0,Center Back,9,6,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.25,0
22411,philipp-lahm,Philipp Lahm,Bayern München,Germany,11.11.1983,170.0,66.0,Left Fullback,1,0,...,40,SWE,0.340205,5223.0,8.1e-05,0.626401,5621.0,0.000373,0.25,0
184199,adil-rami,Adil Rami,Valencia CF,Spain,27.12.1985,190.0,88.0,Center Back,6,4,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,0.125,0
109038,zdenk-pospch,Zdeněk Pospěch,1. FSV Mainz 05,Germany,14.12.1978,174.0,72.0,Right Fullback,9,2,...,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225,0.125,0
12113,florent-malouda,Florent Malouda,Chelsea FC,England,13.06.1980,181.0,73.0,Left Winger,2,2,...,52,RUS,0.398174,526.0,0.000809,1.212727,550.0,0.004521,0.625,0
257062,james-collins,James Collins,West Ham United,England,23.08.1983,188.0,83.0,Center Back,1,0,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,0.0,0


---
# 1. Random Forest Classifier

We want to train a `RandomForestClassifier` to predict a player's skin color based on the player's description.

In [11]:
# Import stuff
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np


### 1.1 Loading the data

The first step is of course to load the data provided to us in `CrowdstormingDataJuly1st.csv`. Its fields are described above for reference.

In [12]:
filename='CrowdstormingDataJuly1st.csv'
df = pd.read_csv(filename)
df.sample(10)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
77715,mohammed-rabiu,Mohammed Rabiu,Évian Thonon Gaillard,France,31.12.1989,192.0,80.0,Defensive Midfielder,2,1,...,,1696,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586
39977,julian-schieber,Julian Schieber,Borussia Dortmund,Germany,13.02.1989,188.0,83.0,Center Forward,1,1,...,0.0,747,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
7998,adam-smith_3,Adam Smith,Millwall FC,England,29.04.1991,173.0,73.0,Right Fullback,2,0,...,0.0,194,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
87408,andy-marshall,Andy Marshall,Aston Villa,England,14.04.1975,188.0,86.0,Goalkeeper,6,2,...,,1909,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
93994,marc-antoine-fortune,Marc-Antoine Fortuné,West Bromwich Albion,England,02.07.1981,182.0,81.0,Center Forward,2,0,...,0.75,2027,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05
27236,tobias-werner,Tobias Werner,FC Augsburg,Germany,19.07.1985,176.0,66.0,Left Midfielder,13,3,...,0.0,494,8,DEU,0.336628,7749.0,5.5e-05,0.335967,7974.0,0.000225
83811,lolo_4,Lolo,Real Valladolid,Spain,07.04.1993,182.0,78.0,,2,1,...,,1817,72,PRT,0.396803,1079.0,0.000392,0.790366,1121.0,0.001798
1880,hiroshi-kiyotake,Hiroshi Kiyotake,1. FC Nürnberg,Germany,12.11.1989,172.0,63.0,Attacking Midfielder,1,1,...,0.5,73,23,AUS,0.334126,16261.0,2.6e-05,0.301962,16969.0,0.0001
123819,denis-cheryshev,Denis Cheryshev,Real Madrid,Spain,26.12.1990,179.0,74.0,,2,0,...,0.0,2639,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
104098,adam-morgan,Adam Morgan,Liverpool FC,England,21.04.1994,179.0,,,1,1,...,0.25,2239,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05


### 1.2 Cleaning the data

A few things have to be done to clean the data before we feed it to the classifier.

Firstly, let's merge the two skin color ratings into a mean one, since this score will act as our label for the classifier, and remove the two rating columns.

In [13]:
df['skintone']=(df['rater1']+df['rater2'])/2
# Drop the two individual rater columns now that they are merged
df = df.drop(columns=['rater1', 'rater2'])

Secondly, let's remove players who don't have this rating, since we won't be able to train the classifier with these examples. They are currently stored as a `NaN` score in the `skintone` column.

In [14]:
clean_df = df.copy()
len(clean_df)
#clean_df = clean_df['skintone'].dropna(axis=0)

146028

In [15]:
clean_df = clean_df.dropna(axis=0, subset = ['skintone'])
len(clean_df) - len(clean_df.dropna(axis=0))

9164

Let's see which missing values are left in the other features.

In [17]:
for column in clean_df:
    # Number of missing values in each column
    print(column + ':', clean_df[column].isnull().sum())

playerShort: 0
player: 0
club: 0
leagueCountry: 0
birthday: 0
height: 46
weight: 753
position: 8461
games: 0
victories: 0
ties: 0
defeats: 0
goals: 0
yellowCards: 0
yellowReds: 0
redCards: 0
photoID: 0
refNum: 0
refCountry: 0
Alpha_3: 1
meanIAT: 153
nIAT: 153
seIAT: 153
meanExp: 153
nExp: 153
seExp: 153
skintone: 0


First, the features meanIAT, nIAT, seIAT, meanExp, nExp and seExp are all missing the same number of entries (153), which suggests they are missing in the same rows. Second, we have to look at their distribution by country, because that is how they were collected.

In [19]:
for country in clean_df['leagueCountry'].unique():
    print(country + '\n', clean_df['meanIAT'][clean_df['leagueCountry'] == country].describe())
for country in clean_df['leagueCountry'].unique():
    print(country + '\n', clean_df['nIAT'][clean_df['leagueCountry'] == country].describe())
    

#for country in clean_df['leagueCountry'].unique():
#    print country + '\n', clean_df['seExp'][clean_df['leagueCountry'] == country].describe()

#clean_df['leagueCountry'].iloc[clean_df['meanIAT'].isnull().nonzero()]
#clean_df['seExp'][clean_df['leagueCountry']== 'England'].plot(kind='bar')

Spain
 count    30968.000000
mean         0.360991
std          0.029947
min         -0.047254
25%               NaN
50%               NaN
75%               NaN
max          0.573793
Name: meanIAT, dtype: float64
France
 count    18916.000000
mean         0.343024
std          0.037566
min         -0.047254
25%               NaN
50%               NaN
75%               NaN
max          0.573793
Name: meanIAT, dtype: float64
England
 count    35145.000000
mean         0.343323
std          0.032956
min         -0.047254
25%               NaN
50%               NaN
75%               NaN
max          0.573793
Name: meanIAT, dtype: float64
Germany
 count    39439.000000
mean         0.344741
std          0.026943
min         -0.047254
25%               NaN
50%               NaN
75%               NaN
max          0.573793
Name: meanIAT, dtype: float64
Spain
 count    3.096800e+04
mean     1.452977e+04
std      1.394920e+05
min      2.000000e+00
25%               NaN
50%               NaN
75% 



We conclude that only meanIAT has a mean that makes sense given its standard deviation. In any case, given the size of our dataset, we can drop these entries from our dataframe.

In [20]:
# drop=True keeps the old index from being added back as an extra 'index' column
clean_df = clean_df.dropna(axis=0,subset=['meanIAT','nIAT','seIAT','meanExp','nExp','seExp']).reset_index(drop=True)
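
Since the bias scores are defined per referee country, one quick sanity check is to confirm that `meanIAT` takes a single value inside each `refCountry`. The sketch below does this on a toy frame; the three country codes and scores are copied from the sample rows shown earlier, the rest of the layout is an assumption.

```python
import pandas as pd

# Toy rows: refCountry codes and meanIAT values taken from the samples above
toy = pd.DataFrame({
    'refCountry': [8, 8, 44, 44, 3],
    'meanIAT': [0.336628, 0.336628, 0.32669, 0.32669, 0.369894],
})

# Number of distinct meanIAT values observed in each referee country
values_per_country = toy.groupby('refCountry')['meanIAT'].nunique()
is_constant = bool((values_per_country <= 1).all())

# One row per referee country, which is the natural unit for describe()
per_country = toy.drop_duplicates('refCountry').set_index('refCountry')['meanIAT']

print(is_constant)
print(per_country.describe())
```

If the check holds on the full data, describing one row per referee country avoids weighting each score by the number of dyads in that country.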

For the 'position' feature, we concluded that it was not an important feature for our model. Finally, for the missing height and weight values, we assume that almost all soccer players share an 'athletic' build, so a missing weight can be estimated from the players with the same height, birth year and skin tone.

In [21]:
# Index labels of the rows with a missing weight / height
# (after reset_index, labels and positions coincide)
idx_weight = clean_df.index[clean_df['weight'].isnull()]
idx_height = clean_df.index[clean_df['height'].isnull()]
# Birth year of every row, taken from the 'dd.mm.yyyy' birthday string
birth_year = clean_df['birthday'].str.split('.').str[2]
for idx in idx_weight:
    # Mean weight of the rows sharing this row's height, birth year and skin tone
    mean_weight = clean_df['weight'][(clean_df['height'] == clean_df.at[idx, 'height']) &
                                     (birth_year == birth_year[idx]) &
                                     (clean_df['skintone'] == clean_df.at[idx, 'skintone'])].mean()

    clean_df.at[idx, 'weight'] = mean_weight

#for idx in idx_height:
#    mean_height = clean_df['height'][(clean_df['weight'] == clean_df.iloc[100].height) &
#                                     ([year[2] for year in clean_df['birthday'].str.split('.')] == clean_df.iloc[idx].birthday.split('.')[2]) &
#                                     (clean_df['skintone'] == clean_df.iloc[idx].skintone)].mean()
#                        
#    clean_df.set_value(idx,'height',mean_height)

#clean_df.iloc[3788]
#idx_weight    #mean_weight
#clean_df['player'].iloc[clean_df['weight'].isnull().nonzero()]
#clean_df.iloc[idx_weight]
#clean_df.iloc[97]

In [None]:
clean_df.iloc[idx_height]
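
The group-mean imputation described above can also be sketched with `groupby(...).transform('mean')`, which computes the mean weight of each (height, birth year, skin tone) group in one pass. The toy values below are made up for illustration, and the `birth_year` helper column is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy players (hypothetical values); two weights are missing
toy = pd.DataFrame({
    'height':   [180.0, 180.0, 180.0, 175.0],
    'birthday': ['01.01.1990', '05.06.1990', '09.09.1990', '02.02.1985'],
    'skintone': [0.25, 0.25, 0.25, 0.0],
    'weight':   [80.0, 76.0, np.nan, np.nan],
})

# Birth year extracted from the 'dd.mm.yyyy' string
toy['birth_year'] = toy['birthday'].str.split('.').str[2]

# Mean weight of every (height, birth year, skin tone) group, aligned row by row
group_mean = (toy.groupby(['height', 'birth_year', 'skintone'])['weight']
                 .transform('mean'))

# Fill only the missing weights with the mean of their group
toy['weight'] = toy['weight'].fillna(group_mean)

print(toy[['height', 'birth_year', 'skintone', 'weight']])
```

A row whose group contains no known weight stays `NaN` (the last row here), so such rows still need a fallback, for example a coarser grouping or dropping them.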

We can now extract the labels that will be used to train the classifier.

In [None]:
labels = np.array(clean_df['skintone'])
le_labels = LabelEncoder()
labels = le_labels.fit_transform(labels)
labels
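
To see what these labels look like: the sample rows above show the two ratings taking values from 0 to 1 in steps of 0.25, so their mean can take at most 9 distinct values, and `LabelEncoder` maps them to the integers 0 to 8 in sorted order. The short sketch below enumerates them; it assumes every combination of ratings can occur.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The 5 possible ratings of a single rater, as coded in the dataset
ratings = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# All possible means of two ratings (assumption: any pair may occur)
possible_means = np.unique([(a + b) / 2 for a in ratings for b in ratings])

# LabelEncoder assigns one integer class per distinct value, in sorted order
encoder = LabelEncoder()
codes = encoder.fit_transform(possible_means)

for value, code in zip(possible_means, codes):
    print(value, "->", code)
```

The classifier therefore solves a problem with up to 9 classes; `encoder.inverse_transform` recovers the original mean rating from a predicted class.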

We'll start by using all the features provided by the dataset to train the classifier, later figuring out which ones are actually interesting.

In [None]:
features = clean_df.drop(columns='skintone')
f = features.copy()

le = LabelEncoder()
for col in features.columns.values:
    f[col] = le.fit_transform(features[col])
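
As a small illustration of what `LabelEncoder` does to a text column such as `position`, the sketch below encodes a toy series. Replacing missing entries with the placeholder string `'missing'` is an assumption of this example, not something done in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy 'position' column with one missing entry
positions = pd.Series(['Goalkeeper', 'Center Back', None, 'Goalkeeper'])

# Give missing entries an explicit category so every value is a string
filled = positions.fillna('missing')

encoder = LabelEncoder()
codes = encoder.fit_transform(filled)

print(list(encoder.classes_))  # categories in sorted order
print(list(codes))             # integer code of each row
```

The integer codes follow the sorted order of the categories, so they carry no meaningful ordering for a text feature; tree-based models can still split on them.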

### 1.3 Training the classifier

We've now got our training data so let's try to train the `RandomForestClassifier`.

In [None]:
from sklearn.model_selection import train_test_split

# Split into a test and a train set
X_train, X_test, y_train, y_test = train_test_split(f, labels, test_size=0.4, random_state=0)

# Create the RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
# Train the model
clf = forest.fit(X_train, y_train)

Let's now see how accurate our model is :

In [None]:
clf.score(X_test, y_test)
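
One caveat when reading this score: every player appears in many dyads, so a random row split can put rows of the same player in both the train and the test set. A possible alternative, sketched below on toy data, is to split by player with `GroupShuffleSplit`. The toy arrays are made up, and using the player ID as the group is an assumption of this sketch.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 dyads belonging to 4 players (hypothetical IDs)
players = np.array(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd'])
X_toy = np.arange(len(players)).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])

# Hold out 25% of the players (not 25% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=players))

# No player should appear on both sides of the split
shared = set(players[train_idx]) & set(players[test_idx])
print("players in both sets:", shared)
```

With such a split the test score measures how well the model generalizes to players it has never seen.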

A single train/test split gives only one estimate of the accuracy, so let's perform cross-validation to check that the score is stable across different splits.

In [None]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; cross_val_score fits a fresh clone of the estimator on each fold
scores = cross_val_score(clf, X_test, y_test, cv=5)

print("Accuracy: %0.6f (+/- %0.6f)" % (scores.mean(), scores.std() * 2))
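
To show how a parameter affects overfitting, one option is to compare the training score with the cross-validated score for several settings. The sketch below does this for `max_depth` on synthetic data from `make_classification`; the data and the two depth values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=4, random_state=0)

results = {}
for depth in (2, None):  # shallow trees vs. fully grown trees
    model = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                   random_state=0)
    cv = cross_validate(model, X_toy, y_toy, cv=5, return_train_score=True)
    # Mean accuracy on the training folds and on the held-out folds
    results[depth] = (cv['train_score'].mean(), cv['test_score'].mean())
    print("max_depth=%s: train %.3f, validation %.3f"
          % (depth, results[depth][0], results[depth][1]))
```

A large gap between the two scores points to overfitting, while two low scores point to a model that is too constrained.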

### 1.4 Learning Curves

We want to see how our model's accuracy evolves with the number of samples. For this we can plot a learning curve, which will randomly split the data into test and training sets using a `ShuffleSplit`. It will then train the model and compute its score over the test set. We can then plot the accuracy over the number of training examples.

In [None]:
# Plot code adapted from :
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
#
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

# Legends
title  = "Learning Curves (Random Forest Classifier)"
xlabel = "Training examples"
ylabel = "Score"

# Plots params
y_lim = (0.95, 1.01)

# CV params
train_sizes = np.linspace(.1, 1.0, 5)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

plt.figure()
plt.title(title)
plt.ylim(*y_lim)
plt.xlabel(xlabel)
plt.ylabel(ylabel)

train_sizes, train_scores, test_scores = learning_curve(forest, f, labels, cv=cv, n_jobs=4, train_sizes=train_sizes)

train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean  = np.mean(test_scores , axis=1)

train_scores_std  = np.std(train_scores, axis=1)
test_scores_std   = np.std(test_scores , axis=1)

plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")
plt.legend(loc="best")


plt.show()
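
For the bonus question, the same `learning_curve` call can be repeated for several parameter sets with 20 shuffled splits. The sketch below computes the curves for a very shallow and a fully grown forest on synthetic data; the data and both parameter sets are assumptions, and plotting is left out to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in for the real features and labels
X_toy, y_toy = make_classification(n_samples=400, n_features=10,
                                   n_informative=4, random_state=0)

# 20 random train/test splits give smoother mean curves than 5
cv20 = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
sizes = np.linspace(0.1, 1.0, 5)

param_sets = {
    "shallow (max_depth=1)": dict(n_estimators=20, max_depth=1),
    "deep (max_depth=None)": dict(n_estimators=20, max_depth=None),
}

curves = {}
for name, params in param_sets.items():
    model = RandomForestClassifier(random_state=0, **params)
    n_train, train_scores, test_scores = learning_curve(
        model, X_toy, y_toy, cv=cv20, train_sizes=sizes)
    # Average over the 20 splits for each training-set size
    curves[name] = (n_train, train_scores.mean(axis=1), test_scores.mean(axis=1))
    print(name, "final train / CV score:",
          round(curves[name][1][-1], 3), round(curves[name][2][-1], 3))
```

A high-bias setting typically shows training and cross-validation curves that converge to a similar, low score, whereas a flexible setting keeps a high training score with a visible gap to the cross-validation curve.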

### 1.5 Feature importances

We would like to figure out which of the features are the most relevant to our model, i.e. the features that help the classifier most when it makes its decision. In scikit-learn these importances are impurity-based: a feature scores high when splitting on it reduces the impurity a lot across the trees of the forest.

We have access to this information in `RandomForestClassifier.feature_importances_`, so we can plot their importance.

In [None]:
# Plot code adapted from :
# http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
#
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
feature_names = f.columns.values.tolist()

# Print the feature ranking
print("Feature ranking:")

for j in range(f.shape[1]):
    print("%d. feature #%d %s (%f)" % (j + 1, indices[j], feature_names[indices[j]], importances[indices[j]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(f.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(f.shape[1]), indices)
plt.xlim([-1, f.shape[1]])
plt.show()
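
To explore whether dropping features changes the ranking, one can refit the forest without the top feature and compare the two sets of importances. The sketch below does this on synthetic data; the data, the feature names `f0` to `f5` and the helper `importances_for` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
names = np.array(["f%d" % i for i in range(X_toy.shape[1])])


def importances_for(X, cols):
    """Fit a forest on X and return {feature name: importance}."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y_toy)
    return dict(zip(cols, model.feature_importances_))


# Importances with every feature, then without the most important one
full = importances_for(X_toy, names)
top = max(full, key=full.get)
keep = names != top
reduced = importances_for(X_toy[:, keep], names[keep])

print("most important feature:", top)
for name in names[keep]:
    print(name, round(full[name], 3), "->", round(reduced[name], 3))
```

Because the importances always sum to 1, removing a dominant feature redistributes its share over the remaining ones, so the values are only comparable within one fitted model.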