# Libraries

In [1]:
# Notebook settings
%load_ext autoreload
%autoreload 2
%matplotlib inline

from IPython.display import Image, display
import warnings
warnings.simplefilter("ignore")

In [2]:
# data manipulation libraries
import numpy as np
import pandas as pd

# machine learning
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pydot # decision tree visualization

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score

# utilities
from time import time
from functions import *

# Contents

1. [Datasets](#datasets)
   * [Background](#background)  
   * [Load Datasets](#loaddatasets)  
   * [Rough Dataset Understanding](#roughdatasetunderstanding)  

2. [Feature Engineering](#featureengineering)  
3. [Final Prep for Input Dataset](#finalprepforinputdataset)
    * [Labeling](#labeling)
    * [Train Test Split](#traintestsplit)  
   
4. [Model Training](#modeltraining)  
5. [Evaluation](#evaluation)  
   * [Simple Evaluation](#simpleevaluation)  
   * [Confused Evaluation](#confusedevaluation) 
   * [Feature Importance](#featureimportance)  

6. [Further References](#furtherreferences)  

# Datasets <a id="datasets"></a>

### Background <a id="background"></a>

Sourced from : https://www.kaggle.com/hugomathien/soccer/home

**The ultimate Soccer database for data analysis and machine learning**  
What you get:

- +25,000 matches  
- +10,000 players  
- 11 European Countries with their lead championship  
- Seasons 2008 to 2016  
- Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates  
- Team line up with squad formation (X, Y coordinates)  
- Betting odds from up to 10 providers  
- Detailed match events (goal types, possession, corner, cross, fouls, cards etc...) for +10,000 matches  


*16th Oct 2016: New table containing teams' attributes from FIFA !  

**Original Data Source**:

You can easily find data about soccer matches but they are usually scattered across different websites. A thorough data collection and processing has been done to make your life easier. I must insist that you do not make any commercial use of the data. The data was sourced from:

http://football-data.mx-api.enetscores.com/ : scores, lineup, team formation and events

http://www.football-data.co.uk/ : betting odds. Click here to understand the column naming system for betting odds:

http://sofifa.com/ : players and teams attributes from EA Sports FIFA games. FIFA series and all FIFA assets property of EA Sports.

When you have a look at the database, you will notice foreign keys for players and matches are the same as the original data sources. I have called those foreign keys "api_id".

**From Author:**  
*"You will notice that some players are missing from the lineup (NULL values). This is because I have not been able to source their attributes from FIFA. This will be fixed overtime as the crawling algorithm is being improved. The dataset will also be expanded to include international games, national cups, Champion's League and Europa League."*

**Final Notes**:  

The bookies use 3 classes (Home Win, Draw, Away Win). They get it right about 53% of the time. This is also what I've achieved so far using my own SVM. Though it may sound high for such a random sport game, you've got to know that the home team wins about 46% of the time. So the base case (constantly predicting Home Win) has indeed 46% precision.

When running a multi-class classifier like SVM you could also output a probability estimate and compare it to the betting odds. Have a look at your variance vs odds and see for what games you had very different predictions.

### Load Datasets <a id="loaddatasets"></a>

In [3]:
country = pd.read_csv('../data/main_Country.csv')
league = pd.read_csv('../data/main_League.csv')
match = pd.read_csv('../data/main_Match.csv', parse_dates=["date"])
playerAttr = pd.read_csv('../data/main_Player_Attributes.csv', parse_dates=["date"])
player_p1 = pd.read_csv('../data/main_Player_part1.csv', parse_dates=["birthday"])
player_p2 = pd.read_csv('../data/main_Player_part2.csv', parse_dates=["birthday"])
team = pd.read_csv('../data/main_Team.csv')
teamAttr = pd.read_csv('../data/main_Team_Attributes.csv', parse_dates=["date"])

In [4]:
# merge the player part 1 and part 2 tables
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
players = pd.concat([player_p1, player_p2])

### Rough Dataset Understanding <a id="roughdatasetunderstanding"></a>

#### Country

In [5]:
# shape of country df
print('dataframe shape: {}'.format(country.shape))
country.head()

dataframe shape: (11, 2)


Unnamed: 0,id,name
0,1,Belgium
1,1729,England
2,4769,France
3,7809,Germany
4,10257,Italy


In [6]:
# unique names of the countries
country.name.unique()

array(['Belgium', 'England', 'France', 'Germany', 'Italy', 'Netherlands',
       'Poland', 'Portugal', 'Scotland', 'Spain', 'Switzerland'], dtype=object)

#### League

In [7]:
# shape of league df
print(f'dataframe shape: {league.shape}')
league.head()

dataframe shape: (11, 3)


Unnamed: 0,id,country_id,name
0,1,1,Belgium Jupiler League
1,1729,1729,England Premier League
2,4769,4769,France Ligue 1
3,7809,7809,Germany 1. Bundesliga
4,10257,10257,Italy Serie A


In [8]:
# unique league names
league.name.unique()

array(['Belgium Jupiler League', 'England Premier League',
       'France Ligue 1', 'Germany 1. Bundesliga', 'Italy Serie A',
       'Netherlands Eredivisie', 'Poland Ekstraklasa',
       'Portugal Liga ZON Sagres', 'Scotland Premier League',
       'Spain LIGA BBVA', 'Switzerland Super League'], dtype=object)

#### <div style="color:blue">Match</div>

In [9]:
# print shape of Match df (25979 rows) and first 3 rows
print(f'dataframe shape: {match.shape}')
match.head(3)

dataframe shape: (25979, 115)


Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,1,1,1,2008/2009,1,2008-08-17,492473,9987,9993,1,...,4.0,1.65,3.4,4.5,1.78,3.25,4.0,1.73,3.4,4.2
1,2,1,1,2008/2009,1,2008-08-16,492474,10000,9994,0,...,3.8,2.0,3.25,3.25,1.85,3.25,3.75,1.91,3.25,3.6
2,3,1,1,2008/2009,1,2008-08-16,492475,9984,8635,0,...,2.5,2.35,3.25,2.65,2.5,3.2,2.5,2.3,3.2,2.75


In [10]:
# match column names (first 11): id columns and match results

#match.columns[:11]
match.iloc[:,:11].head()

Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal
0,1,1,1,2008/2009,1,2008-08-17,492473,9987,9993,1,1
1,2,1,1,2008/2009,1,2008-08-16,492474,10000,9994,0,0
2,3,1,1,2008/2009,1,2008-08-16,492475,9984,8635,0,3
3,4,1,1,2008/2009,1,2008-08-17,492476,9991,9998,5,0
4,5,1,1,2008/2009,1,2008-08-16,492477,7947,9985,1,3


In [11]:
# match columns 55 to 66: home player ids

#match.columns[55 :66]
match[match.iloc[:,55 :66].notnull().any(axis=1)].iloc[:5,55 :66] # get rows with non-nulls in columns 55 :66

Unnamed: 0,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,home_player_9,home_player_10,home_player_11
144,39890.0,,38788.0,38312.0,26235.0,,,,26916.0,,94289.0
145,38327.0,67950.0,67958.0,67959.0,37112.0,36393.0,148286.0,67898.0,164352.0,38801.0,26502.0
146,95597.0,,,38435.0,94462.0,46004.0,164732.0,,38246.0,38423.0,38419.0
147,,39580.0,30692.0,37861.0,47411.0,119117.0,35412.0,39631.0,39591.0,25957.0,38369.0
148,30934.0,38292.0,11569.0,38273.0,14642.0,38945.0,38290.0,95609.0,38257.0,,121639.0


In [12]:
# match columns 66 to 77: away player ids

#match.columns[66 :77]
match[match.iloc[:,66 :77].notnull().any(axis=1)].iloc[:5,66 :77] # get rows with non-nulls in columns 66 :77

Unnamed: 0,away_player_1,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11
144,34480.0,38388.0,26458.0,13423.0,38389.0,38798.0,30949.0,38253.0,106013.0,38383.0,46552.0
145,37937.0,38293.0,148313.0,104411.0,148314.0,37202.0,43158.0,9307.0,42153.0,32690.0,38782.0
146,38252.0,39156.0,39151.0,166554.0,15652.0,39145.0,46890.0,38947.0,46881.0,39158.0,119118.0
147,36835.0,37047.0,37021.0,38186.0,27110.0,32863.0,37957.0,37909.0,104386.0,38251.0,37065.0
148,104378.0,27838.0,36841.0,38337.0,,33662.0,37044.0,32760.0,38229.0,12574.0,46335.0


#### Players Table

In [13]:
# check and print players table shape
print(players.shape)
players.head()

(11060, 7)


Unnamed: 0,id,player_api_id,player_name,player_fifa_api_id,birthday,height,weight
0,1,505942,Aaron Appindangoye,218353,1992-02-29,182.88,187
1,2,155782,Aaron Cresswell,189615,1989-12-15,170.18,146
2,3,162549,Aaron Doran,186170,1991-05-13,170.18,163
3,4,30572,Aaron Galindo,140161,1982-05-08,182.88,198
4,5,23780,Aaron Hughes,17725,1979-11-08,182.88,154


#### <div style="color:blue">Player Attributes</div>

In [14]:
# print Player df shape, column names and top 3 rows ( what we really need is id and 'overall_rating')
print(f'dataframe shape: {playerAttr.shape}')
print(playerAttr.columns)
playerAttr.head(3)

dataframe shape: (183978, 42)
Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')


Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,1,218353,505942,2016-02-18,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,2,218353,505942,2015-11-19,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,3,218353,505942,2015-09-21,62.0,66.0,right,medium,medium,49.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0


#### Team

In [15]:
# check team df shape and peek df (just team names and id)
print(team.shape)
team.head()

(299, 5)


Unnamed: 0,id,team_api_id,team_fifa_api_id,team_long_name,team_short_name
0,1,9987,673.0,KRC Genk,GEN
1,2,9993,675.0,Beerschot AC,BAC
2,3,10000,15005.0,SV Zulte-Waregem,ZUL
3,4,9994,2007.0,Sporting Lokeren,LOK
4,5,9984,1750.0,KSV Cercle Brugge,CEB


#### Team Attributes

In [16]:
# peek at the team attributes table and print its shape
print(teamAttr.shape)
teamAttr.head(3)

(1458, 25)


Unnamed: 0,id,team_fifa_api_id,team_api_id,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribbling,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,...,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,1,434,9930,2010-02-22,60,Balanced,,Little,50,Mixed,...,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
1,2,434,9930,2014-09-19,52,Balanced,48.0,Normal,56,Mixed,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
2,3,434,9930,2015-09-10,47,Balanced,41.0,Normal,54,Mixed,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover


In [17]:
# print both parts of team df

display(teamAttr.iloc[:3,:13])
display(teamAttr.iloc[:3,13:25])

Unnamed: 0,id,team_fifa_api_id,team_api_id,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribbling,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,buildUpPlayPositioningClass,chanceCreationPassing,chanceCreationPassingClass
0,1,434,9930,2010-02-22,60,Balanced,,Little,50,Mixed,Organised,60,Normal
1,2,434,9930,2014-09-19,52,Balanced,48.0,Normal,56,Mixed,Organised,54,Normal
2,3,434,9930,2015-09-10,47,Balanced,41.0,Normal,54,Mixed,Organised,54,Normal


Unnamed: 0,chanceCreationCrossing,chanceCreationCrossingClass,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,65,Normal,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
1,63,Normal,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
2,63,Normal,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover


# Feature Engineering <a id="featureengineering"></a>

Feature engineering is the exercise whereby we define more "columns" of numerical variables or "flags" to let the model "understand/learn" more about our data. This is often the key determining factor for accurate models.

1. As we have statistics for each player over many time periods, we need the latest stats for each player prior to their match date. I have done this with the help of functions developed by [Airback](https://www.kaggle.com/airback), from this [kaggle kernel](https://www.kaggle.com/airback/match-outcome-prediction-in-football?scriptVersionId=796746), and placed them in the file `functions.py` in this folder, which is why we imported them via `from functions import *`.    
<br>
2. Next, we add some results from the last 10 matches that a particular pair of teams played against each other before their latest matchup.    
<br>
3. We also flag the leagues that the teams are in and add some bookie odds. Note that each flag must have a column of its own (we do this via `pd.get_dummies()` in pandas).    
<br>
4. Lastly, a match outcome is normally "Win", "Draw" or "Lose", which makes this a multi-class problem. However, for beginners and ease of explanation, we treat this as a binary classification problem and label the outcome as either "Win" or "No-Win".
<br>
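To make steps 1 and 3 a bit more concrete, here is a minimal sketch on toy data. It is not the code in `functions.py`; it assumes that "latest stats prior to the match date" means the most recent rating dated strictly before the match, and it uses `pd.merge_asof` as one possible way to get that. The two small tables are made up for illustration (the ratings reuse the three rows shown for player 505942 above).

```python
import pandas as pd

# Toy stand-ins for the Match and Player Attributes tables (hypothetical rows).
matches = pd.DataFrame({
    "match_api_id": [1, 2],
    "league_id": [1, 1729],
    "date": pd.to_datetime(["2015-10-01", "2016-03-01"]),
    "home_player_1": [505942, 505942],
})
ratings = pd.DataFrame({
    "player_api_id": [505942, 505942, 505942],
    "date": pd.to_datetime(["2015-09-21", "2015-11-19", "2016-02-18"]),
    "overall_rating": [62.0, 67.0, 67.0],
})

# Step 1: for each match, take the player's most recent rating dated
# strictly before the match. merge_asof needs both frames sorted by the time key.
latest = pd.merge_asof(
    matches.sort_values("date"),
    ratings.sort_values("date"),
    on="date",
    left_by="home_player_1",
    right_by="player_api_id",
    direction="backward",
    allow_exact_matches=False,
)
print(latest[["match_api_id", "date", "overall_rating"]])

# Step 3: one flag column per league.
league_flags = pd.get_dummies(matches["league_id"], prefix="League")
print(league_flags)
```

Each row of `league_flags` has exactly one flag set, which is what "each flag must have a column of its own" means in practice.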

In [18]:
# Specify only those column names that we want to use
keepcols = ["country_id", "league_id", "season", "stage", "date", "match_api_id", "home_team_api_id", 
        "away_team_api_id", "home_team_goal", "away_team_goal", "home_player_1", "home_player_2",
        "home_player_3", "home_player_4", "home_player_5", "home_player_6", "home_player_7", 
        "home_player_8", "home_player_9", "home_player_10", "home_player_11", "away_player_1",
        "away_player_2", "away_player_3", "away_player_4", "away_player_5", "away_player_6",
        "away_player_7", "away_player_8", "away_player_9", "away_player_10", "away_player_11"]

# drop rows that have a NA/null in any of the keepcols columns (all columns are kept)
match.dropna(subset = keepcols, how='any', inplace = True)

In [33]:
# Load/Read in Player/Match Ratings
fifa_data = pd.read_pickle('../data/fifa_data.pkl')
print(fifa_data.columns)
fifa_data.head()

Index(['home_player_1_overall_rating', 'home_player_2_overall_rating',
       'home_player_3_overall_rating', 'home_player_4_overall_rating',
       'home_player_5_overall_rating', 'home_player_6_overall_rating',
       'home_player_7_overall_rating', 'home_player_8_overall_rating',
       'home_player_9_overall_rating', 'home_player_10_overall_rating',
       'home_player_11_overall_rating', 'away_player_1_overall_rating',
       'away_player_2_overall_rating', 'away_player_3_overall_rating',
       'away_player_4_overall_rating', 'away_player_5_overall_rating',
       'away_player_6_overall_rating', 'away_player_7_overall_rating',
       'away_player_8_overall_rating', 'away_player_9_overall_rating',
       'away_player_10_overall_rating', 'away_player_11_overall_rating',
       'match_api_id'],
      dtype='object')


Unnamed: 0,home_player_1_overall_rating,home_player_2_overall_rating,home_player_3_overall_rating,home_player_4_overall_rating,home_player_5_overall_rating,home_player_6_overall_rating,home_player_7_overall_rating,home_player_8_overall_rating,home_player_9_overall_rating,home_player_10_overall_rating,...,away_player_3_overall_rating,away_player_4_overall_rating,away_player_5_overall_rating,away_player_6_overall_rating,away_player_7_overall_rating,away_player_8_overall_rating,away_player_9_overall_rating,away_player_10_overall_rating,away_player_11_overall_rating,match_api_id
145,58.0,57.0,67.0,53.0,60.0,63.0,60.0,66.0,50.0,65.0,...,59.0,55.0,54.0,72.0,67.0,65.0,70.0,68.0,63.0,493017.0
153,64.0,64.0,63.0,62.0,62.0,72.0,68.0,67.0,69.0,68.0,...,66.0,67.0,66.0,70.0,69.0,68.0,67.0,73.0,68.0,493025.0
155,67.0,72.0,69.0,69.0,72.0,75.0,74.0,70.0,74.0,64.0,...,61.0,60.0,49.0,64.0,67.0,66.0,55.0,58.0,64.0,493027.0
162,58.0,57.0,67.0,65.0,66.0,60.0,53.0,60.0,50.0,64.0,...,72.0,66.0,67.0,75.0,70.0,74.0,74.0,70.0,69.0,493034.0
168,61.0,66.0,61.0,61.0,60.0,64.0,67.0,66.0,64.0,58.0,...,57.0,57.0,51.0,58.0,66.0,57.0,60.0,63.0,65.0,493040.0


In [25]:
# Load/Read in Final Dataset
feables = pd.read_pickle('../data/feables.pkl')
print(feables.columns)
feables.head()

Index(['match_api_id', 'home_team_goals_difference',
       'away_team_goals_difference', 'games_won_home_team',
       'games_won_away_team', 'games_against_won', 'games_against_lost',
       'League_1.0', 'League_1729.0', 'League_4769.0', 'League_7809.0',
       'League_10257.0', 'League_13274.0', 'League_15722.0', 'League_17642.0',
       'League_19694.0', 'League_21518.0', 'League_24558.0',
       'home_player_1_overall_rating', 'home_player_2_overall_rating',
       'home_player_3_overall_rating', 'home_player_4_overall_rating',
       'home_player_5_overall_rating', 'home_player_6_overall_rating',
       'home_player_7_overall_rating', 'home_player_8_overall_rating',
       'home_player_9_overall_rating', 'home_player_10_overall_rating',
       'home_player_11_overall_rating', 'away_player_1_overall_rating',
       'away_player_2_overall_rating', 'away_player_3_overall_rating',
       'away_player_4_overall_rating', 'away_player_5_overall_rating',
       'away_player_6_overall_ra

Unnamed: 0,match_api_id,home_team_goals_difference,away_team_goals_difference,games_won_home_team,games_won_away_team,games_against_won,games_against_lost,League_1.0,League_1729.0,League_4769.0,...,away_player_9_overall_rating,away_player_10_overall_rating,away_player_11_overall_rating,B365_Win,B365_Draw,B365_Defeat,BW_Win,BW_Draw,BW_Defeat,label
0,493017,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,...,70.0,68.0,63.0,0.313804,0.276886,0.40931,0.307825,0.27941,0.412765,Win
1,493025,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,...,67.0,73.0,68.0,0.327179,0.286281,0.38654,0.290493,0.300176,0.409331,Defeat
2,493027,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,...,55.0,58.0,64.0,0.672897,0.209346,0.117757,0.672269,0.226891,0.10084,Win
3,493034,1.0,2.0,1.0,1.0,0.0,0.0,1,0,0,...,74.0,70.0,69.0,0.207407,0.259259,0.533333,0.192717,0.274476,0.532807,Win
4,493040,-2.0,0.0,0.0,0.0,0.0,0.0,1,0,0,...,60.0,63.0,65.0,0.535211,0.267606,0.197183,0.565759,0.25499,0.17925,Draw
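The `B365_*` and `BW_*` columns above sum to roughly 1 in each row, which is consistent with bookie odds that have been turned into normalized probabilities. The pickled table is not built in this notebook, so the following is only a sketch of how such columns could be derived, assuming decimal odds as input. The column names `H`, `D`, `A` and the odds values are illustrative.

```python
import pandas as pd

# Hypothetical decimal odds for home win (H), draw (D) and away win (A).
odds = pd.DataFrame({"H": [1.78, 1.85], "D": [3.25, 3.25], "A": [4.0, 3.75]})

# 1 / odds is the implied probability, but the three values add up to more
# than 1 because of the bookmaker's margin, so we rescale each row to sum to 1.
implied = 1.0 / odds
probs = implied.div(implied.sum(axis=1), axis=0)
probs.columns = ["Win", "Draw", "Defeat"]
print(probs.round(4))
```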


# Final Prep for Input Dataset <a id="finalprepforinputdataset"></a>

1. Our match outcome is in the column 'label'. Sklearn has evolved quite a bit and some models now accept text for classes, but others require us to encode the classes as integers or "flags" so that they work with numpy and the ML computation. Here, we just change them to 0s and 1s as a simple exercise. (More on multiclass/multilabel classification can be found [here](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#multiclass-vs-multilabel-fitting) and [here2](https://stats.stackexchange.com/questions/11859/what-is-the-difference-between-multiclass-and-multilabel-problem))  
<br>
2. We cannot just train and evaluate on the entire dataset: a model that is "overfitting" would look accurate on data it has already seen, and we would have no way of noticing. Hence, we use sklearn's helper function `train_test_split()` to hold out a test set.
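A quick way to see why the held-out set matters is to fit an unrestricted decision tree on noisy synthetic data and compare its accuracy on the rows it was trained on with its accuracy on rows it has never seen. The data below is random and has nothing to do with the soccer dataset; it is only a sketch of the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 10))
# The label depends weakly on the first feature; the rest is noise.
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# No depth limit: the tree is free to memorise the training rows.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = deep_tree.score(X_tr, y_tr)
val_acc = deep_tree.score(X_val, y_val)
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

The gap between the two numbers is the part of the training score that came from memorising noise rather than learning something that generalises.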

Excerpt from Jeremy Howard's FastAI [machine learning course](http://forums.fast.ai/t/another-treat-early-access-to-intro-to-machine-learning-videos/6826?source_topic_id=9285&source_topic_id=9594):  


"
Possibly **the most important idea** in machine learning is that of having separate training & validation data sets. As motivation, suppose you don't divide up your data, but instead use all of it.  And suppose you have lots of parameters:

<img src="images/overfitting2.png" alt="" style="width: 70%"/>
<center>
[Underfitting and Overfitting](https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted)
</center>

The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it's not the best choice.  Why is that?  If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.

This illustrates how using all our data can lead to **overfitting**. A validation set helps diagnose this problem.
"

## Labeling <a id="labeling"></a>

In [28]:
# rename input df and drop match id keys
inputs = feables.drop('match_api_id', axis = 1)

# convert training label to integer based
#inputs['label'] = inputs.label.apply(lambda x: 0 if x=='Defeat' else (1 if x=='Draw' else 2))

# convert training label to win / no-win
inputs['label'] = inputs.label.apply(lambda x: 1 if x =='Win' else 0)

# split/separate out the training label and features
labels = inputs['label']
features = inputs.drop(['label'], axis = 1)

In [29]:
# Calculate training label ratio

win_rows = inputs['label'].sum()
total_rows = inputs['label'].size
perc_wins = win_rows / total_rows * 100

print('there are {0}({1:.2f}%) wins out of a total of {2} rows/matches'.format(win_rows, perc_wins, total_rows))

there are 9034(45.92%) wins out of a total of 19673 rows/matches


## Train Test Split <a id="traintestsplit"></a>

In [30]:
#Splitting the data into Train and Test data sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, 
                                                    test_size = 0.2, 
                                                    random_state = 88, 
                                                    stratify = labels)
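The `stratify = labels` argument asks sklearn to keep the win / no-win ratio the same in the train and test parts. A small check, using a made-up label array with the same class counts as printed above (9034 wins out of 19673) and a dummy feature column, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported above; X is a placeholder.
y = np.array([1] * 9034 + [0] * (19673 - 9034))
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=88, stratify=y
)
print(f"test rows: {len(y_te)}")
print(f"win ratio overall: {y.mean():.4f}, train: {y_tr.mean():.4f}, test: {y_te.mean():.4f}")
```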

# Model Training <a id="modeltraining"></a>

1. Sklearn comes with many [models](http://scikit-learn.org/stable/user_guide.html). For simplicity, today we just try out the Decision Tree and Random Forest classifiers. (Note: sklearn also has models for regression problems.)    
<br>
2. Training a model and predicting with it is mostly simple. Sklearn's API standardises most models to a `model.fit()`, `model.predict()` form. (more [sklearn tutorials](http://scikit-learn.org/stable/tutorial/index.html))

<img src="images/decisiontree.png" alt="" style="width: 50%"/>
<center>
[Decision Tree](https://towardsdatascience.com/the-decision-tree-of-life-12f1eef603ba)
</center>

### Decision Tree (Background)

- From "[Decision Trees Explained Easily](https://medium.com/@chiragsehra42/decision-trees-explained-easily-28f23241248)" by Chirag Sehra:  
*"Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression."*  


- Terminology from "[Decision tree Intuition](https://medium.com/greyatom/decision-tree-intuition-a38669005cb7)" by Biraj Parikh:
<img src="images/decisiontree_terminology.PNG" alt="" style="width: 70%"/>
<center>
[Decision Tree Terminology](https://medium.com/greyatom/decision-tree-intuition-a38669005cb7)
</center>

    * Root Node: Represents the entire population or sample, which is then divided into two or more homogeneous sets.
    * Splitting: The process of dividing a node into two or more sub-nodes.
    * Decision Node: A sub-node that splits into further sub-nodes.
    * Leaf / Terminal Node: A node that does not split.
    * Pruning: Removing the sub-nodes of a decision node.
    * Branch / Sub-Tree: A subsection of the entire tree.



**How do we decide which variable to split on first?**  

from "[Decision tree Intuition](https://medium.com/greyatom/decision-tree-intuition-a38669005cb7)" by Biraj Parikh:    

**ENTROPY**  
*A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero and if the sample is equally divided it has entropy of one.*

<img src="images/entropy_curve.PNG" alt="" style="width: 45%"/>
<center>
[Entropy](https://medium.com/greyatom/decision-tree-intuition-a38669005cb7)
</center>

**INFORMATION GAIN**  
*Entropy gives measure of impurity in a node. In a decision tree building process, two important decisions are to be made — what is the best split(s) and which is the best variable to split a node.*  
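As a small worked example of the two ideas quoted above, the sketch below computes the base-2 entropy of a list of labels and the information gain of a candidate split (parent entropy minus the size-weighted entropy of the children). The function names are my own, not from any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * np.log2(1.0 / p)).sum())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]            # equally divided: entropy 1
pure_split = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
useless_split = information_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0])
print(entropy(parent), pure_split, useless_split)
```

A split that separates the classes perfectly gains the full 1 bit, while a split that leaves both children as mixed as the parent gains nothing. The tree prefers the split with the highest gain.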

<br>
**A few commonly used decision tree algorithms are:**

- ID3
- C4.5
- CART
- CHAID (Chi-squared Automatic Interaction Detector)

<br>
**GINI (other metric for splitting)**:  
*from [Gini Impurity vs Entropy (stackexchange)](https://datascience.stackexchange.com/questions/10228/gini-impurity-vs-entropy)*:  
<img src="images/gini_formula.PNG" alt="" style="width: 30%"/>
<center>
[Gini vs Entropy](https://datascience.stackexchange.com/questions/10228/gini-impurity-vs-entropy)
</center>
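Gini impurity is commonly defined as $G = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of samples in the node that belong to class $i$. A minimal sketch of that definition (the function name is my own):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

print(gini([1, 1, 0, 0]))   # evenly mixed binary node
print(gini([1, 1, 1, 1]))   # pure node
```

For a binary node, Gini peaks at 0.5 for an even mix and drops to 0 for a pure node, so it ranks splits in much the same way as entropy without needing a logarithm.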

<br>

**So various implementation breakdown:**
<img src="images/dt_summary.PNG" alt="" style="width: 45%"/>
<center>
[Algorithm Implementations](https://medium.com/greyatom/decision-tree-intuition-a38669005cb7)
</center>

<br>
**Reference to Python's sklearn Decision Tree API:** [BUILDING DECISION TREE ALGORITHM IN PYTHON WITH SCIKIT LEARN](http://dataaspirant.com/2017/02/01/decision-tree-algorithm-python-with-scikit-learn/)

### Random Forest (Background)

An **[Ensemble Learning](https://en.wikipedia.org/wiki/Ensemble_learning)** method, according to Wikipedia, uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.  

A **Random Forest** is basically an ensemble of decision trees.

According to William Koehrsen in "[Random Forest Simple Explanation](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)":  

*"In technical terms, the predictions have variance because they will be widely spread around the right answer. Now, what if we take predictions from hundreds or thousands of individuals, some of which are high and some of which are low, and decided to average them together? Well, congratulations, we have created a random forest!"*  

<br>
**The [official page of the algorithm](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#remarks) and this [stackexchange](https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-random-forest-depend-on-the-number-of-pred) state that**:  

*"Random forest uses bagging (picking a sample of observations rather than all of them) and random subspace method (picking a sample of features rather than all of them, in other words - attribute bagging) to grow a tree."*  

<br>

<img src="images/randomforest.PNG" alt="" style="width: 55%"/>
<center>
[random forest](https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11)
</center>
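The two ingredients quoted above, bagging and random feature subsets, can be sketched by hand with plain decision trees. This is an illustration on synthetic data, not how `RandomForestClassifier` is implemented internally. Note that sklearn's `max_features` draws a fresh feature subset at every split rather than once per tree, and the real random forest averages class probabilities, whereas this sketch takes a simple majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
rng = np.random.RandomState(0)

trees = []
for i in range(25):
    # Bagging: sample rows with replacement (a bootstrap sample).
    rows = rng.randint(0, len(X), size=len(X))
    # Random feature subsets: only sqrt(n_features) candidates per split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[rows], y[rows]))

# Majority vote across the trees.
votes = np.mean([t.predict(X) for t in trees], axis=0)
forest_pred = (votes >= 0.5).astype(int)
print(f"agreement with training labels: {(forest_pred == y).mean():.3f}")
```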

<br>
**Other References on implementation**:  
[The Random Forest Algorithm](https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd)  
[Intuitive Interpretation of Random Forest](https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45)  



### Build Models

In [31]:
# Initialize model objects

# Decision Tree
DT_clf = DecisionTreeClassifier(min_samples_split=20, random_state=88, criterion='gini', class_weight = 'balanced')

# Random Forest (oob_score=True also computes an out-of-bag accuracy estimate during fit)
RF_clf = RandomForestClassifier(n_estimators = 200, random_state = 88, criterion='gini', class_weight = 'balanced', oob_score=True)

In [32]:
# Train Model on Training Set

model_dt = DT_clf.fit(X_train, y_train)
model_rf = RF_clf.fit(X_train, y_train)
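Since the Random Forest above is created with `oob_score=True`, the fitted model also carries an `oob_score_` attribute: an accuracy estimate computed for each row using only the trees that did not see that row in their bootstrap sample. The sketch below shows the idea on synthetic data, so the numbers are not the ones you would get for the soccer features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=88).fit(X, y)
train_acc = rf.score(X, y)
print(f"training accuracy: {train_acc:.3f}, out-of-bag accuracy: {rf.oob_score_:.3f}")
```

The out-of-bag figure is usually a more honest estimate than the training accuracy, and it comes for free without touching the test set.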

In [33]:
# Predict on Test Set

predictDT_on_test = model_dt.predict(X_test) # model.predict_proba(X_test) # for predicting in probabilities
predictDT_on_test

predictRF_on_test = model_rf.predict(X_test) # model.predict_proba(X_test) # for predicting in probabilities
predictRF_on_test

array([0, 0, 0, ..., 1, 1, 0], dtype=int64)
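The comments in the cell above mention `predict_proba`. As a sketch of what it returns and how it feeds into `roc_auc_score` (imported at the top but not used so far), here is a small example on synthetic data. The model settings are illustrative, not the ones used for the soccer features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=88, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=88).fit(X_tr, y_tr)

# One row per sample, one column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X_te)
win_proba = proba[:, 1]              # probability of class 1

# AUC is computed from the scores, not from the hard 0/1 predictions.
auc = roc_auc_score(y_te, win_proba)
print(clf.classes_, proba[:3].round(2), f"AUC: {auc:.3f}")
```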

# Evaluation <a id="evaluation"></a>

1. We can take plain accuracy to gauge our model's success. However, if a dataset is dominated by one class, guessing just that class would also score well on accuracy. Hence, we need to look at the metrics describing [False/True Positives and Negatives](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c).    
<br>
2. Once we know about False/True Positives and Negatives, we should touch on the usual way of presenting them, i.e. a [confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/).  
<br>
3. And finally, we should understand a bit about how to make sense of these numbers with [ROC curves and AUC](https://www.dataschool.io/roc-curves-and-auc-explained/) ([another ref](https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it)).
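To put a number on the "just guess the dominant class" idea, sklearn's `DummyClassifier` can act as a baseline. The sketch below uses a made-up label array with the same class counts as our labels (9034 wins out of 19673); the feature column is a placeholder because this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with the class counts reported in the Labeling section.
y = np.array([1] * 9034 + [0] * (19673 - 9034))
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(f"majority-class baseline accuracy: {baseline_acc:.4f}")
```

Any model accuracy is worth reading against this figure: always predicting "no win" is already right a little over half the time.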

### Simple Evaluation <a id="simpleevaluation"></a>

In [35]:
# Get accuracy score for Decision Tree
accuracy_DT_on_testset = accuracy_score(y_test, predictDT_on_test)
accuracy_DT_on_testset

0.57153748411689964

In [36]:
# Get accuracy score for Random Forest
accuracy_RF_on_testset = accuracy_score(y_test, predictRF_on_test)
accuracy_RF_on_testset

0.63659466327827197

In [37]:
# cross validation for Decision Tree Accuracy
cv_accuracy = cross_val_score(model_dt, features, labels, cv=10)
cv_accuracy

# the parameter scoring can be specified with 'roc_auc', 'precision', 'recall', 'f1' etc

array([ 0.56199187,  0.58739837,  0.54268293,  0.54420732,  0.55821047,
        0.53990849,  0.58922217,  0.59938993,  0.526182  ,  0.57884028])

In [38]:
# cross validation for Random Forest Accuracy
cv_accuracy = cross_val_score(model_rf, features, labels, cv=10)
cv_accuracy

# the parameter scoring can be specified with 'roc_auc', 'precision', 'recall', 'f1' etc

array([ 0.6504065 ,  0.65752033,  0.60873984,  0.61686992,  0.62785968,
        0.6273513 ,  0.66903915,  0.69801729,  0.64107778,  0.67192269])
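The `scoring` parameter mentioned in the comments can be sketched as follows. The data is synthetic and the tree settings are only loosely modelled on the Decision Tree above, so treat the numbers as illustrative. Summarising each set of fold scores as mean and standard deviation makes them easier to compare than the raw arrays.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
clf = DecisionTreeClassifier(min_samples_split=20, random_state=88)

results = {}
for metric in ("accuracy", "roc_auc", "f1"):
    # One score per fold; the estimator is refit from scratch for every fold.
    scores = cross_val_score(clf, X, y, cv=10, scoring=metric)
    results[metric] = scores
    print(f"{metric:>8}: mean {scores.mean():.3f} (std {scores.std():.3f})")
```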

### Confused Evaluation :P <a id="confusedevaluation"></a>

In [39]:
# Confusion Matrix for Decision Tree
# sklearn expects (y_true, y_pred): rows are actual classes, columns are predicted classes
confusion_matrix(y_test, predictDT_on_test)

array([[1247,  881],
       [ 805, 1002]], dtype=int64)

In [40]:
# Classification Report for Decision Tree
print(classification_report(y_test, predictDT_on_test, labels=[0,1], target_names=['no win', 'Win']))

             precision    recall  f1-score   support

     no win       0.61      0.59      0.60      2128
        Win       0.53      0.55      0.54      1807

avg / total       0.57      0.57      0.57      3935



In [41]:
# Confusion Matrix for Random Forest
# sklearn expects (y_true, y_pred): rows are actual classes, columns are predicted classes
confusion_matrix(y_test, predictRF_on_test)

array([[1598,  530],
       [ 900,  907]], dtype=int64)

In [42]:
# Classification Report for Random Forest
print(classification_report(y_test, predictRF_on_test, labels=[0,1], target_names=['no win', 'Win']))

             precision    recall  f1-score   support

     no win       0.64      0.75      0.69      2128
        Win       0.63      0.50      0.56      1807

avg / total       0.64      0.64      0.63      3935
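The numbers in the classification report can be recomputed by hand from the four confusion matrix counts. Taking the Random Forest results above, with "Win" as the positive class: 907 wins were predicted as wins, 530 non-wins were predicted as wins, 900 wins were predicted as non-wins, and 1598 non-wins were predicted as non-wins.

```python
# Counts for the Random Forest on the test set, "Win" as the positive class.
tp, fp, fn, tn = 907, 530, 900, 1598

precision = tp / (tp + fp)          # of the predicted wins, how many were wins
recall = tp / (tp + fn)             # of the actual wins, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision: {precision:.2f}, recall: {recall:.2f}, f1: {f1:.2f}, accuracy: {accuracy:.4f}")
```

These match the "Win" row of the report and the accuracy score printed in the Simple Evaluation section.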



### Feature Importance <a id="featureimportance"></a>

In [43]:
# feature importance for Decision Tree
DT_feature_impt = pd.DataFrame({'features':features.columns, 'importance':model_dt.feature_importances_}).sort_values('importance', ascending=False)
DT_feature_impt.head(20)

Unnamed: 0,features,importance
44,BW_Defeat,0.171363
39,B365_Win,0.094419
42,BW_Win,0.035259
43,BW_Draw,0.034477
26,home_player_10_overall_rating,0.033535
1,away_team_goals_difference,0.033481
41,B365_Defeat,0.030593
27,home_player_11_overall_rating,0.029866
28,away_player_1_overall_rating,0.028361
35,away_player_8_overall_rating,0.02704


In [44]:
# feature importance for Random Forest
RF_feature_impt = pd.DataFrame({'features':features.columns, 'importance':model_rf.feature_importances_}).sort_values('importance', ascending=False)
RF_feature_impt.head(20)

Unnamed: 0,features,importance
44,BW_Defeat,0.057375
42,BW_Win,0.055406
39,B365_Win,0.054353
41,B365_Defeat,0.051659
43,BW_Draw,0.040754
40,B365_Draw,0.038714
1,away_team_goals_difference,0.031431
0,home_team_goals_difference,0.031398
24,home_player_8_overall_rating,0.026641
25,home_player_9_overall_rating,0.026473
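With many similar columns (several bookie odds, 22 player ratings), it can be easier to read the importances per feature group than per column. The sketch below shows one way to do that. The grouping rule and the short column list are my own choices for illustration, and the model is fit on random synthetic data, so only the mechanics carry over, not the numbers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

cols = ["B365_Win", "B365_Draw", "B365_Defeat",
        "home_player_1_overall_rating", "away_player_1_overall_rating",
        "games_won_home_team"]
X, y = make_classification(n_samples=500, n_features=len(cols), n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=88).fit(X, y)

impt = pd.Series(rf.feature_importances_, index=cols)

def group_of(name):
    """Hypothetical grouping rule based on the column naming seen above."""
    if name.startswith(("B365_", "BW_")):
        return "bookie odds"
    if name.endswith("_overall_rating"):
        return "player ratings"
    return "match history"

# Importances are normalised to sum to 1, so group sums are shares of the total.
grouped = impt.groupby(group_of).sum().sort_values(ascending=False)
print(grouped)
```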


# Further References <a id="furtherreferences"></a>

*placeholder*