# Data Science Sydney
## Week 6 Lesson 2 - Lab: Ensembling

### Introduction to Ensembling

*Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*

Let's pretend that instead of building a single model to solve a classification problem, you created **five independent models**, and each model was correct 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?

Let's simulate it to find out!

The code in the next cell simulates how a group of independent models does better when their votes are combined. To simulate the models, the code builds five sets of "correct prediction" flags: each flag is 1 if the model's 0/1 prediction for that observation was right, and 0 otherwise.

In [1]:
import numpy as np

# set a seed for reproducibility
np.random.seed(1234)

# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations
# this step is just a random number generator; each number is later compared
# with a threshold to decide whether the model got that observation right
rand1 = np.random.rand(1000)
rand2 = np.random.rand(1000)
rand3 = np.random.rand(1000)
rand4 = np.random.rand(1000)
rand5 = np.random.rand(1000)

In [2]:
rand1


array([ 0.19151945,  0.62210877,  0.43772774,  0.78535858,  0.77997581,
        0.27259261,  0.27646426,  0.80187218,  0.95813935,  0.87593263,
        0.35781727,  0.50099513,  0.68346294,  0.71270203,  0.37025075,
        0.56119619,  0.50308317,  0.01376845,  0.77282662,  0.88264119,
        0.36488598,  0.61539618,  0.07538124,  0.36882401,  0.9331401 ,
        0.65137814,  0.39720258,  0.78873014,  0.31683612,  0.56809865,
        0.86912739,  0.43617342,  0.80214764,  0.14376682,  0.70426097,
        0.70458131,  0.21879211,  0.92486763,  0.44214076,  0.90931596,
        0.05980922,  0.18428708,  0.04735528,  0.67488094,  0.59462478,
        0.53331016,  0.04332406,  0.56143308,  0.32966845,  0.50296683,
        0.11189432,  0.60719371,  0.56594464,  0.00676406,  0.61744171,
        0.91212289,  0.79052413,  0.99208147,  0.95880176,  0.79196414,
        0.28525096,  0.62491671,  0.4780938 ,  0.19567518,  0.38231745,
        0.05387369,  0.45164841,  0.98200474,  0.1239427 ,  0.11

In [3]:
# each model independently predicts 1 (the "correct response") if its random number is greater than 0.3
# with the threshold set at 0.3, each model is correct about 70% of the time
correctPreds1 = np.where(rand1 > 0.3, 1, 0)
correctPreds2 = np.where(rand2 > 0.3, 1, 0)
correctPreds3 = np.where(rand3 > 0.3, 1, 0)
correctPreds4 = np.where(rand4 > 0.3, 1, 0)
correctPreds5 = np.where(rand5 > 0.3, 1, 0)

# print the first 20 predictions from each model
print(correctPreds1[:20])
print(correctPreds2[:20])
print(correctPreds3[:20])
print(correctPreds4[:20])
print(correctPreds5[:20])

[0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1]
[1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0]
[1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1]
[1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0]
[0 0 1 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 1]


In [4]:
# add the predictions together
sum_of_preds = correctPreds1 + correctPreds2 + correctPreds3 + correctPreds4 + correctPreds5
sum_of_preds

array([3, 4, 4, 3, 2, 2, 4, 2, 3, 3, 3, 3, 5, 5, 5, 3, 5, 2, 4, 3, 4, 5, 3,
       4, 4, 4, 5, 4, 3, 4, 4, 3, 4, 4, 3, 5, 2, 4, 2, 3, 3, 1, 3, 5, 4, 4,
       3, 4, 5, 3, 3, 4, 4, 2, 5, 4, 2, 5, 4, 2, 3, 3, 3, 4, 4, 3, 4, 3, 4,
       3, 3, 4, 4, 3, 4, 4, 3, 3, 3, 5, 4, 4, 4, 4, 4, 2, 5, 4, 4, 4, 2, 3,
       5, 5, 5, 4, 5, 4, 3, 3, 4, 5, 5, 4, 3, 3, 4, 4, 4, 3, 3, 2, 4, 3, 4,
       2, 4, 3, 4, 4, 3, 3, 4, 4, 4, 3, 4, 4, 1, 2, 3, 4, 3, 4, 1, 3, 3, 3,
       3, 3, 4, 4, 3, 5, 4, 5, 5, 2, 4, 4, 2, 5, 3, 2, 4, 4, 5, 5, 2, 4, 2,
       4, 4, 4, 5, 4, 5, 2, 3, 4, 5, 4, 4, 3, 2, 3, 4, 4, 4, 5, 3, 4, 5, 3,
       5, 2, 4, 4, 4, 4, 3, 2, 4, 3, 4, 3, 4, 2, 4, 5, 4, 4, 4, 3, 4, 4, 3,
       5, 2, 4, 4, 4, 4, 5, 4, 5, 4, 2, 4, 1, 4, 3, 2, 3, 5, 5, 4, 4, 4, 4,
       2, 4, 3, 3, 3, 3, 4, 5, 4, 4, 4, 2, 2, 5, 3, 3, 3, 4, 3, 4, 4, 5, 3,
       3, 4, 4, 2, 4, 4, 3, 5, 4, 2, 4, 3, 2, 2, 5, 4, 3, 5, 5, 5, 4, 5, 4,
       3, 4, 4, 5, 2, 4, 3, 3, 3, 4, 5, 4, 3, 3, 1, 3, 4, 4, 5, 5, 4, 4, 5,
       1, 2,

In [5]:
# ensemble predicts 1 (the "correct response") if at least 3 models predict 1
# This is an example of a majority vote
ensemble_preds = np.where(sum_of_preds >=3 , 1, 0)
ensemble_preds

array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0,

In [6]:
# print the ensemble's first 20 predictions
print(ensemble_preds[:20])

[1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1]


How accurate was the ensemble?

In [7]:
# Single 'model' average correct
correctPreds1.mean() # average of 'correct' responses for 1 model

0.71299999999999997

In [8]:
# Ensemble average correct
ensemble_preds.mean() # average of 'correct' responses for ensemble

0.84099999999999997

Amazing, right?

**Ensemble learning (or "ensembling")** is simply the process of combining several models to solve a prediction problem, with the goal of producing a combined model that is more accurate than any individual model. For **classification** problems, the combination is often done by majority vote. For **regression** problems, the combination is often done by taking an average of the predictions.

For ensembling to work well, the individual models must meet two conditions:

- Models should be **accurate** (they must outperform random guessing)
- Models should be **independent** (their predictions are not correlated with one another)

The idea, then, is that if you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when averaging the models.
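To see why the independence condition matters, here is a minimal sketch (not part of the original lab) that repeats the simulation with models that are each still right about 70% of the time but share most of their mistakes. The `copy_prob` value of 0.8 and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_obs, p_correct = 5, 100_000, 0.7
copy_prob = 0.8  # assumed chance that a model simply repeats a shared outcome

def majority_accuracy(correct):
    """Share of observations where more than half of the models are correct."""
    return (correct.sum(axis=0) > correct.shape[0] / 2).mean()

# independent models: every entry is its own coin flip
independent = rng.random((n_models, n_obs)) < p_correct

# correlated models: most of the time each model copies one shared outcome
shared = rng.random(n_obs) < p_correct
own = rng.random((n_models, n_obs)) < p_correct
copies = rng.random((n_models, n_obs)) < copy_prob
correlated = np.where(copies, shared, own)

acc_independent = majority_accuracy(independent)
acc_correlated = majority_accuracy(correlated)
print("single model (correlated set): ", correlated[0].mean())
print("ensemble of independent models:", acc_independent)
print("ensemble of correlated models: ", acc_correlated)
```

Under these assumptions the correlated ensemble should gain little over a single model, because the models tend to be wrong on the same observations.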

It turns out that as you add more models to the voting process, the probability of error decreases (provided each model is better than random guessing and the models are independent). This is known as [Condorcet's Jury Theorem](http://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem), which was developed by the Marquis de Condorcet, a French philosopher and mathematician, in the 18th century.
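The simulated result can also be checked with the binomial distribution. The sketch below assumes fully independent models that are each correct with the same probability; the function name `majority_vote_accuracy` is made up for this illustration. For five models at 70% it gives about 0.837, close to the simulated 0.841.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """P(more than half of n independent models are correct)."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct)**(n_models - k)
        for k in range(needed, n_models + 1)
    )

# accuracy of the majority vote as the number of models grows
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Odd numbers of models are used so that a tie cannot occur.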

Anyway, we'll see examples of ensembling below.

## Bootstrapping

**Some preliminary terminology:** In statistics, "bootstrapping" refers to the process of using "bootstrap samples" to quantify the uncertainty of a model. Bootstrap samples are simply random samples with replacement:

In [9]:
# set a seed for reproducibility
np.random.seed(1)

# create an array of 0 to 9, then sample 10 times with replacement
np.random.choice(a=10, size=10, replace=True)

array([5, 8, 9, 5, 0, 0, 1, 7, 6, 9])
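As a rough sketch of how bootstrap samples quantify uncertainty, the example below estimates the standard error of a sample mean by resampling. The data are made up purely for illustration, and the choice of 2000 resamples is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative data only: 50 made-up measurements
data = rng.normal(loc=100, scale=15, size=50)

# draw 2000 bootstrap samples (with replacement) and record each sample's mean
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

boot_se = boot_means.std(ddof=1)
formula_se = data.std(ddof=1) / np.sqrt(data.size)
print("sample mean:", data.mean())
print("bootstrap standard error:", boot_se)
print("formula-based standard error:", formula_se)
```

The spread of the bootstrap means should be close to the textbook standard error, which is the sense in which bootstrapping measures the uncertainty of an estimate.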

## Bagging

On their own, decision trees are not competitive with the best supervised learning methods in terms of **predictive accuracy**. However, they can be used as the basis for more sophisticated methods that have much higher accuracy!

One of the main issues with decision trees is **high variance**, meaning that different splits in the training data can lead to very different trees. **"Bootstrap aggregation" (aka "bagging")** is a general purpose procedure for reducing the variance of a machine learning method, but is particularly useful for decision trees.

What is the bagging process (in general)?

- Take repeated bootstrap samples (random samples with replacement) from the training data set
- Train our method on each bootstrapped training set and make predictions
- Average the predictions

This increases predictive accuracy by **reducing the variance**, similar to how cross-validation reduces the variance associated with the test set approach (for estimating out-of-sample error) by splitting many times and averaging the results.
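The three steps can be sketched as a short loop. This is a minimal illustration on made-up data (a noisy sine curve), not the lab's vehicle example; `bagged_predict` is a hypothetical helper name.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, n_trees=50, seed=0):
    """Fit one deep tree per bootstrap sample and average their predictions."""
    rng = np.random.default_rng(seed)
    n_rows = X_train.shape[0]
    all_preds = []
    for _ in range(n_trees):
        rows = rng.choice(n_rows, size=n_rows, replace=True)  # bootstrap sample
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[rows], y_train[rows])
        all_preds.append(tree.predict(X_new))
    return np.mean(all_preds, axis=0)  # average over the trees

# made-up data for illustration: a noisy sine curve
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_new = np.array([[1.0], [3.0], [5.0]])

preds = bagged_predict(X, y, X_new)
print(preds)
```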

## Applying bagging to decision trees

So how exactly can bagging be used with decision trees? Here's how it applies to **regression trees**:

- Grow B regression trees using B bootstrapped training sets
- Grow each tree deep so that each one has low bias
- Every tree makes a numeric prediction, and the predictions are averaged (to reduce the variance)

It is applied in a similar fashion to **classification trees**, except that during the prediction stage, the overall prediction is based upon a majority vote of the trees.

**What value should be used for B?** Simply use a large enough value that the error seems to have stabilized. (Choosing a value of B that is "too large" will generally not lead to overfitting.)

## Manually implementing bagged decision trees (with B=3)

In this example we are going to predict the price of used vehicles.

In [10]:
import pandas as pd

# read in vehicle data
vehicles = pd.read_csv('used_vehicles.csv')
vehicles

Unnamed: 0,price,year,miles,doors,type
0,22000,2012,13000,2,car
1,14000,2010,30000,2,car
2,13000,2010,73500,4,car
3,9500,2009,78000,4,car
4,9000,2007,47000,4,car
5,4000,2006,124000,2,car
6,3000,2004,177000,4,car
7,2000,2004,209000,4,truck
8,3000,2003,138000,2,car
9,1900,2003,160000,4,car


Question: Can we model the data in scikit-learn as is?

In [11]:
from sklearn.dummy import DummyClassifier

In [None]:
?DummyClassifier

In [12]:

# convert car to 0 and truck to 1
vehicles['type'] = vehicles.type.map({'car':0, 'truck':1})

# print out data
vehicles

Unnamed: 0,price,year,miles,doors,type
0,22000,2012,13000,2,0
1,14000,2010,30000,2,0
2,13000,2010,73500,4,0
3,9500,2009,78000,4,0
4,9000,2007,47000,4,0
5,4000,2006,124000,2,0
6,3000,2004,177000,4,0
7,2000,2004,209000,4,1
8,3000,2003,138000,2,0
9,1900,2003,160000,4,0


Q: Calculate the number of rows in the vehicles dataset

In [13]:
#len(vehicles.type)
n_rows = vehicles.shape[0]

Create three bootstrap sample indexes

In [14]:
import numpy as np

# set a seed for reproducibility
np.random.seed(123)

# create three bootstrap samples (will be used to select rows from the DataFrame)
sample1 = np.random.choice(a=n_rows, size=n_rows, replace=True)
sample2 = np.random.choice(a=n_rows, size=n_rows, replace=True)
sample3 = np.random.choice(a=n_rows, size=n_rows, replace=True)

# print samples
# these are indexes to pick records that will make up each sample
print(sample1)
print(sample2)
print(sample3)

[13  2 12  2  6  1  3 10 11  9  6  1  0  1]
[ 9  0  0  9  3 13  4  0  0  4  1  7  3  2]
[ 4  7  2  4  8 13  0  7  9  3 12 12  4  6]


Q: How can we sample the records from the DataFrame?

In [15]:
vehicles.iloc[sample1, 1:]

Unnamed: 0,year,miles,doors,type
13,1997,138000,4,0
2,2010,73500,4,0
12,1999,163000,2,1
2,2010,73500,4,0
6,2004,177000,4,0
1,2010,30000,2,0
3,2009,78000,4,0
10,2003,190000,2,1
11,2001,62000,4,0
9,2003,160000,4,0


Here we will build 3 regression trees with Price as the target.

Q: how can we retrieve our target variable?

In [16]:
vehicles.iloc[sample1, 0]
vehicles.iloc[sample1, :]['price']

13     1300
2     13000
12     1800
2     13000
6      3000
1     14000
3      9500
10     2500
11     5000
9      1900
6      3000
1     14000
0     22000
1     14000
Name: price, dtype: int64

In [17]:
from sklearn.tree import DecisionTreeRegressor

# grow one regression tree with each bootstrapped training set
treereg1 = DecisionTreeRegressor(random_state=123)
treereg1.fit(vehicles.iloc[sample1, 1:], vehicles.iloc[sample1, 0])

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=123,
           splitter='best')

In [18]:
treereg2 = DecisionTreeRegressor(random_state=123)
treereg2.fit(vehicles.iloc[sample2, 1:], vehicles.iloc[sample2, 0])

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=123,
           splitter='best')

In [19]:
treereg3 = DecisionTreeRegressor(random_state=123)
treereg3.fit(vehicles.iloc[sample3, 1:], vehicles.iloc[sample3, 0])

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=123,
           splitter='best')

To test the models we will use a pre-prepared out-of-sample data set, to which we need to apply the same preparation steps.

In [20]:
# read in out-of-sample data
oos = pd.read_csv('used_vehicles_oos.csv')
oos

Unnamed: 0,price,year,miles,doors,type
0,3000,2003,130000,4,truck
1,6000,2005,82500,4,car
2,12000,2010,60000,2,car


In [21]:
# convert car to 0 and truck to 1
oos['type'] = oos.type.map({'car':0, 'truck':1})

# print data
oos

Unnamed: 0,price,year,miles,doors,type
0,3000,2003,130000,4,1
1,6000,2005,82500,4,0
2,12000,2010,60000,2,0


Q: How do we get predictions from our models?

In [22]:
feature_cols = vehicles.columns[1:]
feature_cols

Index(['year', 'miles', 'doors', 'type'], dtype='object')

In [23]:
# select feature columns (every column except for the 0th column)
feature_cols = vehicles.columns[1:]

# make predictions on out-of-sample data
preds1 = treereg1.predict(oos[feature_cols])
preds2 = treereg2.predict(oos[feature_cols])
preds3 = treereg3.predict(oos[feature_cols])

# print predictions for Price from each model
print(preds1)
print(preds2)
print(preds3)

[  1300.   5000.  14000.]
[  1300.   1300.  13000.]
[  3000.   3000.  13000.]


Average the predictions for each record and compare to actual values

In [24]:
# average predictions and compare to actual values
print((preds1 + preds2 + preds3)/3)

[  1866.66666667   3100.          13333.33333333]


In [25]:
# The actual values
print(oos.price.values)

[ 3000  6000 12000]


In [26]:
# copy the oos dataframe
oos_pred = oos.copy()

# create a new predicted column
oos_pred['predPrice'] = (preds1 + preds2 + preds3)/3
oos_pred

Unnamed: 0,price,year,miles,doors,type,predPrice
0,3000,2003,130000,4,1,1866.666667
1,6000,2005,82500,4,0,3100.0
2,12000,2010,60000,2,0,13333.333333


## Estimating out-of-sample error

Bagged models have a very nice property: **out-of-sample error can be estimated without using the test set approach or cross-validation!**

Here's how the out-of-sample estimation process works with bagged trees:

- On average, each bagged tree uses about two-thirds of the observations. **For each tree, the remaining observations are called "out-of-bag" observations.**
- For the first observation in the training data, predict its response using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a majority vote (for classification).
- Repeat this process for every observation in the training data.
- Compare all predictions to the actual responses in order to compute a mean squared error or classification error. This is known as the **out-of-bag error**.

**When B is sufficiently large, the out-of-bag error is an accurate estimate of out-of-sample error.**
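The "about two-thirds" figure can be checked by simulation. The chance that a given row is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e (about 0.368), so roughly 63% of rows end up in each bootstrap sample. The sketch below assumes 1000 rows and 500 repetitions, both arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000

# for many bootstrap samples, record the share of distinct rows that were drawn
in_bag_share = np.mean([
    np.unique(rng.choice(n_rows, size=n_rows, replace=True)).size / n_rows
    for _ in range(500)
])

print("average in-bag share:", in_bag_share)
print("limit 1 - 1/e:       ", 1 - np.exp(-1))
```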

In [27]:
# set is a data structure used to identify unique elements
print(set(range(14)))

# only show the unique elements in sample1
print(set(sample1))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}
{0, 1, 2, 3, 6, 9, 10, 11, 12, 13}


In [28]:
# use the "set difference" to identify the out-of-bag observations for each tree
print(sorted(set(range(14)) - set(sample1)))
print(sorted(set(range(14)) - set(sample2)))
print(sorted(set(range(14)) - set(sample3)))

[4, 5, 7, 8]
[5, 6, 8, 10, 11, 12]
[1, 5, 10, 11]


Thus, we would predict the response for **observation 4** by using tree 1 (because it is only out-of-bag for tree 1). We would predict the response for **observation 5** by averaging the predictions from trees 1, 2, and 3 (since it is out-of-bag for all three trees). We would repeat this process for all observations, and then calculate the MSE using those predictions.
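The bookkeeping behind that statement can be sketched as follows, using the three bootstrap index arrays copied from the output of the sampling cell above. The dictionary name `oob_trees` is an assumption made for this illustration.

```python
import numpy as np

# bootstrap index arrays copied from the output of the sampling cell above
samples = {
    1: np.array([13, 2, 12, 2, 6, 1, 3, 10, 11, 9, 6, 1, 0, 1]),
    2: np.array([9, 0, 0, 9, 3, 13, 4, 0, 0, 4, 1, 7, 3, 2]),
    3: np.array([4, 7, 2, 4, 8, 13, 0, 7, 9, 3, 12, 12, 4, 6]),
}
n_rows = 14

# for each observation, list the trees that never saw it during training
oob_trees = {
    obs: [tree for tree, rows in samples.items() if obs not in rows]
    for obs in range(n_rows)
}

for obs, trees in oob_trees.items():
    print(obs, trees)
```

With only three trees, some observations (observation 0, for example) appear in every bootstrap sample and so receive no out-of-bag prediction at all, which is one reason B needs to be reasonably large.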

## Estimating variable importance

Although bagging **increases predictive accuracy**, it **decreases model interpretability** because it's no longer possible to visualize the tree to understand the importance of each variable.

However, we can still obtain an overall summary of "variable importance" from bagged models:

- To compute variable importance for bagged regression trees, we can calculate the **total amount that the mean squared error is decreased due to splits over a given predictor, averaged over all trees**.
- A similar process is used for bagged classification trees, except we use the Gini index instead of the mean squared error.
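As a rough sketch of the regression case, the code below bags 25 trees by hand on made-up data and averages each tree's `feature_importances_`, which scikit-learn reports as the normalised total impurity (here MSE) reduction attributable to each feature. The data, in which only the first feature matters, are an assumption for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up data for illustration: only the first of three features drives y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.5, size=300)

importances = []
for _ in range(25):
    rows = rng.choice(300, size=300, replace=True)  # bootstrap sample
    tree = DecisionTreeRegressor(random_state=0).fit(X[rows], y[rows])
    # per-tree importance: normalised MSE reduction from splits on each feature
    importances.append(tree.feature_importances_)

mean_importance = np.mean(importances, axis=0)  # average over all trees
print(mean_importance)
```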

(We'll see an example of this below.)

## Random Forests

Random Forests is a **slight variation of bagged trees** that has even better performance! Here's how it works:

- Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
- However, when building each tree, **each time a split is considered**, a random sample of m predictors is chosen as split candidates from the full set of p predictors. **The split is only allowed to use one of those m predictors.**

Notes:

- A new random sample of predictors is chosen for **every single tree at every single split**.
- For **classification**, m is typically chosen to be the square root of p. For **regression**, m is typically chosen to be somewhere between p/3 and p.

What's the point?

- Suppose there is one very strong predictor in the data set. When using bagged trees, most of the trees will use that predictor as the top split, resulting in an ensemble of similar trees that are "highly correlated".
- Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
- **By randomly leaving out candidate predictors from each split, Random Forests "decorrelates" the trees**, such that the averaging process can reduce the variance of the resulting model.
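The decorrelation effect can be measured directly. The sketch below, on made-up data with one very strong predictor, compares the average pairwise correlation between individual trees' predictions when every split may use all predictors (plain bagging) and when each split sees only 2 randomly chosen predictors. The data, the value m = 2, and the helper name `mean_tree_correlation` are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# made-up data for illustration: feature 0 is a very strong predictor
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = 10 * X[:, 0] + X[:, 1] + rng.normal(size=500)
X_train, y_train, X_test = X[:300], y[:300], X[300:]

def mean_tree_correlation(max_features):
    """Average pairwise correlation between the trees' test-set predictions."""
    forest = RandomForestRegressor(
        n_estimators=50, max_features=max_features, random_state=1
    ).fit(X_train, y_train)
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()

corr_bagging = mean_tree_correlation(max_features=None)  # all 5 predictors: bagging
corr_forest = mean_tree_correlation(max_features=2)      # 2 random predictors per split
print("bagged trees: ", corr_bagging)
print("random forest:", corr_forest)
```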

In [29]:
import pandas as pd

# read in the Titanic data
titanic = pd.read_csv('titanic.csv')

Next we use [pandas-profiling](https://github.com/JosPolfliet/pandas-profiling) to get a quick overview of the data.

In [45]:
!conda install -c jos_pol pandas-profiling -y

Fetching package metadata .........
Solving package specifications: ..........

# All requested packages already installed.
# packages in environment at /Users/andrew/miniconda2/envs/keras:
#
pandas-profiling          1.4.0                    py35_0  


In [30]:
import pandas_profiling
pandas_profiling.ProfileReport(titanic)

[Interactive profile report; its overview and per-variable summary are condensed below.]

Dataset info,Value
Number of variables,12
Number of observations,891
Total Missing (%),8.1%
Total size in memory,83.6 KiB
Average record size in memory,96.1 B

Variable type,Count
Numeric,7
Categorical,4
Date,0
Text (Unique),1
Rejected,0

Variable,Type,Distinct count,Missing (n),Missing (%),Mean,Minimum,Maximum,Zeros (%)
Age,Numeric,89,177,19.9%,29.699,0.42,80,0.0%
Cabin,Categorical,148,687,77.1%,,,,
Embarked,Categorical,4,2,0.2%,,,,
Fare,Numeric,248,0,0.0%,32.204,0,512.33,1.7%
Name,Text (Unique),,,,,,,
Parch,Numeric,7,0,0.0%,0.38159,0,6,76.1%
PassengerId,Numeric,891,0,0.0%,446,1,891,0.0%
Pclass,Numeric,3,0,0.0%,2.3086,1,3,0.0%
Sex,Categorical,2,0,0.0%,,,,
SibSp,Numeric,7,0,0.0%,0.52301,0,8,68.2%
Survived,Numeric,2,0,0.0%,0.38384,0,1,61.6%
Ticket,Categorical,681,0,0.0%,,,,

Sample of the data:

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [33]:
titanic['Cabin'].unique().shape

(148,)

#### Some Cleaning and Feature Engineering Steps:

In [34]:
# encode sex feature
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})

In [36]:
# fill in missing values for age with the mean age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())

In [38]:
# create three dummy variables, drop the first dummy variable, and store this as a DataFrame
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:]

# concatenate the two dummy variable columns onto the original DataFrame
# note: axis=0 means rows, axis=1 means columns
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [39]:
# create a list of feature columns
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']

In [40]:
# print the updated DataFrame
titanic.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,0,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S,0,1
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,S,0,1
5,6,0,3,"Moran, Mr. James",1,29.699118,0,0,330877,8.4583,,Q,1,0
6,7,0,1,"McCarthy, Mr. Timothy J",1,54.0,0,0,17463,51.8625,E46,S,0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",1,2.0,3,1,349909,21.075,,S,0,1
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",0,27.0,0,2,347742,11.1333,,S,0,1
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",0,14.0,1,0,237736,30.0708,,C,0,0


Question: Can you think of other feature engineering steps? List some thoughts

-

In [41]:
# import class, instantiate estimator, fit with all data
from sklearn.ensemble import RandomForestClassifier
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
rfclf = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1)
rfclf.fit(titanic[feature_cols], titanic.Survived)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=True, random_state=1,
            verbose=0, warm_start=False)

These are the most important tuning parameters for Random Forests:

- **n_estimators:** more estimators (trees) increases performance but decreases speed
- **max_features:** The number of features to consider when looking for the best split

Cross-validate to try different combinations of parameters and find the values that give the best score.
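A minimal sketch of such a search uses scikit-learn's `GridSearchCV`. The data here are generated with `make_classification` as a stand-in for the Titanic features, and the grid values are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# made-up data for illustration (stands in for the Titanic features)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=1)

# every combination of these values is scored with 5-fold cross-validation
param_grid = {
    'n_estimators': [50, 100],
    'max_features': [1, 'sqrt', None],
}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```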

In [None]:
?RandomForestClassifier

In [42]:
feat_imp = pd.DataFrame({'Features':feature_cols, 'Importance':rfclf.feature_importances_})
feat_imp

Unnamed: 0,Features,Importance
0,Pclass,0.160553
1,Sex,0.3667
2,Age,0.434528
3,Embarked_Q,0.012129
4,Embarked_S,0.026089


In [43]:
# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':rfclf.feature_importances_})

Unnamed: 0,feature,importance
0,Pclass,0.160553
1,Sex,0.3667
2,Age,0.434528
3,Embarked_Q,0.012129
4,Embarked_S,0.026089


In [44]:
# compute the out-of-bag classification accuracy
rfclf.oob_score_

0.80022446689113358

### Exercise: Try other parameter settings for RandomForest

Have a look at the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

Then create new classifier objects and fit them using different parameter settings. Try to maximise the out-of-bag score.

## More Random Forest

Read the following recent post on the Yhat blog, which talks about Random Forests, and try running the code there as well:
http://blog.yhat.com/posts/python-random-forest.html

## Boosting

To go over Extreme Gradient Boosting (XGBoost), have a look at the following Kaggle script:

[Understanding XGBoost on the Otto Dataset](https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data/notebook)

Or try installing it yourself [from GitHub](https://github.com/dmlc/xgboost/blob/master/doc/build.md).




## Wrapping up ensembling

Ensembling is incredibly popular when the **primary goal is predictive accuracy**. For example, the team that eventually won the $1 million [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_Prize) used an [ensemble of 107 models](http://www2.research.att.com/~volinsky/papers/chance.pdf) early on in the competition.

There was a recent paper in the Journal of Machine Learning Research titled "[Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?](http://jmlr.csail.mit.edu/papers/volume15/delgado14a/delgado14a.pdf)" (**Spoiler alert:** Random Forests did very well.) In the [comments about the paper](https://news.ycombinator.com/item?id=8719723) on Hacker News, Ben Hamner (Kaggle's chief scientist) said the following:

> This is consistent with our experience running hundreds of Kaggle competitions: for most classification problems, some variation on ensembled decision trees (random forests, gradient boosted machines, etc.) performs the best. This is typically in conjunction with clever data processing, feature selection, and internal validation.

> One key exception is where the data is richly and hierarchically structured. Text, speech, and visual data falls under this category. In many cases here, variations of neural networks (deep neural nets/CNN's/RNN's/etc.) provide very dramatic improvements.

But as you can imagine, ensembling is often not practical in a real-time environment, because every model in the ensemble has to produce a prediction.

**You can also build your own ensembles:** just build a variety of models and average them together! Here are some strategies for building independent models:

- using different models
- choosing different combinations of features
- changing the tuning parameters
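The first strategy (using different models) can be sketched with scikit-learn's `VotingClassifier`, which combines classifiers by majority vote when `voting='hard'`. The three models chosen, their settings, and the generated data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# made-up data for illustration
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

models = [
    ('logreg', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier(n_neighbors=15)),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=0)),
]
ensemble = VotingClassifier(estimators=models, voting='hard')  # majority vote

# cross-validated accuracy of each model and of the combined vote
for name, model in models + [('ensemble', ensemble)]:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(name, round(scores.mean(), 3))
```

Whether the vote beats its best member depends on how accurate and how independent the individual models are, so it is worth comparing the scores rather than assuming an improvement.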

## Resources

- scikit-learn documentation: [Ensemble Methods](http://scikit-learn.org/stable/modules/ensemble.html)
- Quora: [How do random forests work in layman's terms?](http://www.quora.com/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1)