# Ensemble Learning Methods Introduction using the Titanic dataset

This notebook covers the different methods of ensemble learning with cleaned data.

## Ensemble Methods:

### **B**ootstrap **Agg**regat**ing** or [Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
* [Scikit-learn Reference](http://scikit-learn.org/stable/modules/ensemble.html#bagging)
* Bootstrap sampling: Sampling with replacement
* Combine by averaging the output (regression)
* Combine by voting (classification)
* Can be applied to many classifiers, including ANN, CART, etc.

### [Pasting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
* Sampling without replacement
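
The difference between bagging and pasting comes down to how each base learner's training subset is drawn. The following is a minimal NumPy sketch; the dataset size and the seed are illustrative assumptions, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
n_samples = 10                   # hypothetical dataset size
indices = np.arange(n_samples)

# Bagging: draw n_samples indices WITH replacement (a bootstrap sample),
# so some rows can repeat and others can be left out.
bagging_idx = rng.choice(indices, size=n_samples, replace=True)

# Pasting: draw a subset WITHOUT replacement, so no row repeats.
pasting_idx = rng.choice(indices, size=n_samples // 2, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

In scikit-learn's `BaggingClassifier`, `bootstrap=True` corresponds to bagging and `bootstrap=False` to pasting.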

### [Boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning))
* Train weak classifiers
* Add them to a final strong classifier, each weighted (typically) by its accuracy
* Once a weak classifier is added, the data are reweighted
  * Misclassified samples gain weight
  * Correctly classified samples lose weight (exceptions: boost by majority and BrownBoost, which decrease the weight of repeatedly misclassified examples)
  * Subsequent weak learners are thereby forced to learn more from the misclassified samples
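
The reweighting step can be sketched with the classic discrete AdaBoost update, used here as one concrete instance of the scheme above. The labels and predictions are made-up toy values.

```python
import numpy as np

# Toy labels and one weak learner's predictions, both in {-1, +1} (made-up values).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second sample is misclassified

weights = np.full(len(y_true), 1 / len(y_true))   # start with uniform weights

# Weighted error of the weak learner and its vote in the final classifier.
miss = y_pred != y_true
error = np.sum(weights[miss])
alpha = 0.5 * np.log((1 - error) / error)   # more accurate learner -> larger vote

# Misclassified samples gain weight, correctly classified ones lose weight.
weights = weights * np.exp(-alpha * y_true * y_pred)
weights = weights / weights.sum()           # renormalise so the weights sum to 1

print(alpha, weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.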
  
    
### [Stacking](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)
* Also known as Stacked generalization
* [From Kaggle:](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/) Combine information from multiple predictive models to generate a new model. Often the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discredit each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different.
* Train a learning algorithm to combine the predictions of several other learning algorithms.
  * Step 1: Train the base learning algorithms
  * Step 2: Train the combiner algorithm on the predictions from step 1
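
The two steps above can be sketched by hand with scikit-learn. This is only an illustration under stated assumptions: the synthetic data, the choice of base models and the logistic-regression combiner are all hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (an assumption of this sketch).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [DecisionTreeClassifier(max_depth=3, random_state=0),
               KNeighborsClassifier(n_neighbors=5)]

# Step 1: out-of-fold predictions from each base model become meta-features.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 2: the combiner is trained on the base models' predictions.
combiner = LogisticRegression().fit(meta_train, y_tr)

# At prediction time, base models refit on all training rows feed the combiner.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
stacked_accuracy = combiner.score(meta_test, y_te)
print(stacked_accuracy)
```

Using out-of-fold predictions in step 1 keeps the combiner from being trained on predictions the base models made for rows they had already seen.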
  

### Other Ensemble Methods:

[Wikipedia](https://en.wikipedia.org/wiki/Ensemble_learning)
* Bayes optimal classifier
  * An ensemble of all the hypotheses in the hypothesis space. 
  * Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. 
  * To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. 
* Bayesian parameter averaging
  * An ensemble technique that seeks to approximate the Bayes optimal classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law.
  * Unlike the Bayes optimal classifier, Bayesian model averaging (BMA) can be practically implemented. 
  * Hypotheses are typically sampled using a Monte Carlo sampling technique such as MCMC. 
* Bayesian model combination
  * Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). 
  * This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. 
  * Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. The results from BMC have been shown to be better on average (with statistical significance) than those of BMA and bagging.
* Bucket of models
  * An ensemble technique in which a model selection algorithm is used to choose the best model for each problem. 
  * When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.
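
A bucket of models can be sketched with cross-validation acting as the model selection algorithm. The candidate models and the synthetic data below are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic problem (hypothetical stand-in for a real dataset).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

bucket = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "naive_bayes": GaussianNB(),
}

# Score every model in the bucket with 5-fold cross-validation
# and keep the one with the highest mean accuracy for this problem.
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in bucket.items()}
best_name = max(mean_scores, key=mean_scores.get)
print(best_name, mean_scores[best_name])
```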


R packages that provide Bayesian model averaging tools:
* BMS (an acronym for Bayesian Model Selection) package
* BAS (an acronym for Bayesian Adaptive Sampling) package
* BMA package

**Note: Ensemble methods**

* Work best with independent predictors

* Best to utilise different algorithms
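
Why independence matters can be seen from the variance of an averaged prediction. As a sketch, assume $n$ predictors $f_1, \dots, f_n$ that each have variance $\sigma^2$ and share a pairwise correlation $\rho$ (these symbols are introduced here for illustration and are not used elsewhere in the notebook):

$$
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} f_i\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
$$

The second term shrinks as predictors are added, while the first is a floor set by their correlation, so less-correlated predictors leave more room for the ensemble to help.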


# Let's start with the examples

# Bagging Machine Learning Algorithm

### **B**ootstrap **Agg**regat**ing** or [Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
* [Scikit-learn Reference](http://scikit-learn.org/stable/modules/ensemble.html#bagging)
* Bootstrap sampling: Sampling with replacement
* Combine by averaging the output (regression)
* Combine by voting (classification)
* Can be applied to many classifiers, including ANN, CART, etc.

# Data processing

In [92]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

In [93]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]
train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [94]:
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

"After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape

Before (891, 12) (418, 11) (891, 12) (418, 11)


('After', (891, 10), (418, 9), (891, 10), (418, 9))

In [95]:
for dataset in combine:
    # Raw string so that `\.` is read as a regex escape (a literal dot)
    dataset['Title'] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])

Title,female,male
Capt,0,1
Col,0,2
Countess,1,0
Don,0,1
Dr,1,6
Jonkheer,0,1
Lady,1,0
Major,0,2
Master,0,40
Miss,182,0
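
To see what the regular expression captures, here is a small standalone sketch on three names taken from the table above: it grabs the first run of letters that is immediately followed by a dot.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris",
                   "Heikkinen, Miss. Laina",
                   "Palsson, Master. Gosta Leonard"])

# One or more letters immediately followed by a literal dot.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())   # ['Mr', 'Miss', 'Master']
```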


In [96]:
dataset['Title'].head()

0     Mr
1    Mrs
2     Mr
3     Mr
4    Mrs
Name: Title, dtype: object

In [97]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Unnamed: 0,Title,Survived
0,Master,0.575
1,Miss,0.702703
2,Mr,0.156673
3,Mrs,0.793651
4,Rare,0.347826


In [98]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,3
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,3
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,1


In [99]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,892,3,"Kelly, Mr. James",male,34.5,0,0,7.8292,Q,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,7.0,S,3
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,9.6875,Q,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,8.6625,S,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,12.2875,S,3


In [100]:
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

((891, 9), (418, 9))

In [101]:
type(combine)

list

In [102]:
combine

[     Survived  Pclass     Sex   Age  SibSp  Parch      Fare Embarked  Title
 0           0       3    male  22.0      1      0    7.2500        S      1
 1           1       1  female  38.0      1      0   71.2833        C      3
 2           1       3  female  26.0      0      0    7.9250        S      2
 3           1       1  female  35.0      1      0   53.1000        S      3
 4           0       3    male  35.0      0      0    8.0500        S      1
 5           0       3    male   NaN      0      0    8.4583        Q      1
 6           0       1    male  54.0      0      0   51.8625        S      1
 7           0       3    male   2.0      3      1   21.0750        S      4
 8           1       3  female  27.0      0      2   11.1333        S      3
 9           1       2  female  14.0      1      0   30.0708        C      3
 10          1       3  female   4.0      1      1   16.7000        S      2
 11          1       1  female  58.0      0      0   26.5500        S      2

In [103]:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,0,22.0,1,0,7.25,S,1
1,1,1,1,38.0,1,0,71.2833,C,3
2,1,3,1,26.0,0,0,7.925,S,2
3,1,1,1,35.0,1,0,53.1,S,3
4,0,3,0,35.0,0,0,8.05,S,1


In [104]:
guess_ages = np.zeros((2,3))
guess_ages

array([[0., 0., 0.],
       [0., 0., 0.]])

In [105]:
dataset['Sex'].head()

0    0
1    1
2    0
3    0
4    1
Name: Sex, dtype: int32

In [106]:
for i in range(0,2):
    print (i)    

0
1


In [107]:
dataset[(dataset['Sex'] == 0) & (dataset['Pclass'] == 1)].isnull().sum()

PassengerId    0
Pclass         0
Sex            0
Age            7
SibSp          0
Parch          0
Fare           0
Embarked       0
Title          0
dtype: int64

In [108]:
dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == 0) & (dataset.Pclass == 1),'Age']

41    NaN
146   NaN
148   NaN
191   NaN
205   NaN
266   NaN
290   NaN
Name: Age, dtype: float64

In [109]:
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,0,22,1,0,7.25,S,1
1,1,1,1,38,1,0,71.2833,C,3
2,1,3,1,26,0,0,7.925,S,2
3,1,1,1,35,1,0,53.1,S,3
4,0,3,0,35,0,0,8.05,S,1
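
The same per-group median imputation can also be written without explicit loops. The sketch below is an alternative under stated assumptions, not the notebook's method: it runs on a made-up toy frame and skips the rounding to the nearest 0.5 that the loop performs before casting to `int`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names follow the notebook.
df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age within each (Sex, Pclass) group, aligned to the original rows.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages, then cast to int as the notebook does.
df["Age"] = df["Age"].fillna(group_median).astype(int)
print(df["Age"].tolist())   # [40, 50, 45, 20, 25, 30]
```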


In [110]:
dataset.head(8)

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,892,3,0,34,0,0,7.8292,Q,1
1,893,3,1,47,1,0,7.0,S,3
2,894,2,0,62,0,0,9.6875,Q,1
3,895,3,0,27,0,0,8.6625,S,1
4,896,3,1,22,1,1,12.2875,S,3
5,897,3,0,14,0,0,9.225,S,1
6,898,3,1,30,0,0,7.6292,Q,2
7,899,2,0,26,1,1,29.0,S,1


In [111]:
pd.cut(train_df['Age'], 5)

0       (16.0, 32.0]
1       (32.0, 48.0]
2       (16.0, 32.0]
3       (32.0, 48.0]
4       (32.0, 48.0]
5       (16.0, 32.0]
6       (48.0, 64.0]
7      (-0.08, 16.0]
8       (16.0, 32.0]
9      (-0.08, 16.0]
10     (-0.08, 16.0]
11      (48.0, 64.0]
12      (16.0, 32.0]
13      (32.0, 48.0]
14     (-0.08, 16.0]
15      (48.0, 64.0]
16     (-0.08, 16.0]
17      (16.0, 32.0]
18      (16.0, 32.0]
19      (16.0, 32.0]
20      (32.0, 48.0]
21      (32.0, 48.0]
22     (-0.08, 16.0]
23      (16.0, 32.0]
24     (-0.08, 16.0]
25      (32.0, 48.0]
26      (16.0, 32.0]
27      (16.0, 32.0]
28      (16.0, 32.0]
29      (16.0, 32.0]
           ...      
861     (16.0, 32.0]
862     (32.0, 48.0]
863     (16.0, 32.0]
864     (16.0, 32.0]
865     (32.0, 48.0]
866     (16.0, 32.0]
867     (16.0, 32.0]
868     (16.0, 32.0]
869    (-0.08, 16.0]
870     (16.0, 32.0]
871     (32.0, 48.0]
872     (32.0, 48.0]
873     (32.0, 48.0]
874     (16.0, 32.0]
875    (-0.08, 16.0]
876     (16.0, 32.0]
877     (16.0

In [112]:
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

Unnamed: 0,AgeBand,Survived
0,"(-0.08, 16.0]",0.55
1,"(16.0, 32.0]",0.337374
2,"(32.0, 48.0]",0.412037
3,"(48.0, 64.0]",0.434783
4,"(64.0, 80.0]",0.090909


In [120]:
dataset.loc[ dataset['Age'] <= 16,].head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,892,3,0,2,0,0,7.8292,Q,1
1,893,3,1,2,1,0,7.0,S,3
2,894,2,0,3,0,0,9.6875,Q,1
3,895,3,0,1,0,0,8.6625,S,1
4,896,3,1,1,1,1,12.2875,S,3


In [119]:
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4  # assign the fifth band; the original line selected these rows without assigning
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,AgeBand
0,0,3,0,1,1,0,7.25,S,1,"(16.0, 32.0]"
1,1,1,1,2,1,0,71.2833,C,3,"(32.0, 48.0]"
2,1,3,1,1,0,0,7.925,S,2,"(16.0, 32.0]"
3,1,1,1,2,1,0,53.1,S,3,"(32.0, 48.0]"
4,0,3,0,2,0,0,8.05,S,1,"(32.0, 48.0]"
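
The same banding can be done in a single `pd.cut` call with explicit edges. Because every value is mapped in one pass, the result does not depend on the order of the assignments or on whether the cell is run twice. This is a sketch on made-up ages; the edges are the ones shown in the `AgeBand` table, with the last band left open-ended.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2, 16, 17, 40, 64, 80])   # made-up ages

# Right-closed bins: (-1, 16], (16, 32], (32, 48], (48, 64], (64, inf]
edges = [-1, 16, 32, 48, 64, np.inf]

# labels=False returns the integer band index instead of an interval label.
age_band = pd.cut(ages, bins=edges, labels=False)
print(age_band.tolist())   # [0, 0, 1, 2, 3, 4]
```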


In [123]:
dataset.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize
0,892,3,0,2,0,0,7.8292,Q,1,1
1,893,3,1,2,1,0,7.0,S,3,2
2,894,2,0,3,0,0,9.6875,Q,1,1
3,895,3,0,1,0,0,8.6625,S,1,1
4,896,3,1,1,1,1,12.2875,S,3,3


In [124]:
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize
0,0,3,0,1,1,0,7.25,S,1,2
1,1,1,1,2,1,0,71.2833,C,3,2
2,1,3,1,1,0,0,7.925,S,2,1
3,1,1,1,2,1,0,53.1,S,3,2
4,0,3,0,2,0,0,8.05,S,1,1


In [127]:
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,FamilySize,Survived
3,4,0.724138
2,3,0.578431
1,2,0.552795
6,7,0.333333
0,1,0.303538
4,5,0.2
5,6,0.136364
7,8,0.0
8,11,0.0


In [129]:
dataset.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize
0,892,3,0,2,0,0,7.8292,Q,1,1
1,893,3,1,2,1,0,7.0,S,3,2
2,894,2,0,3,0,0,9.6875,Q,1,1
3,895,3,0,1,0,0,8.6625,S,1,1
4,896,3,1,1,1,1,12.2875,S,3,3


In [130]:
dataset.loc[dataset['FamilySize'] == 1,]

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize
0,892,3,0,2,0,0,7.8292,Q,1,1
2,894,2,0,3,0,0,9.6875,Q,1,1
3,895,3,0,1,0,0,8.6625,S,1,1
5,897,3,0,0,0,0,9.2250,S,1,1
6,898,3,1,1,0,0,7.6292,Q,2,1
8,900,3,1,1,0,0,7.2292,C,3,1
10,902,3,0,1,0,0,7.8958,S,1,1
11,903,1,0,2,0,0,26.0000,S,1,1
16,908,2,0,2,0,0,12.3500,Q,1,1
17,909,3,0,1,0,0,7.2250,C,1,1


In [131]:
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

Unnamed: 0,IsAlone,Survived
0,0,0.50565
1,1,0.303538


In [132]:
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,IsAlone
0,0,3,0,1,1,0,7.25,S,1,2,0
1,1,1,1,2,1,0,71.2833,C,3,2,0
2,1,3,1,1,0,0,7.925,S,2,1,1
3,1,1,1,2,1,0,53.1,S,3,2,0
4,0,3,0,2,0,0,8.05,S,1,1,1


In [133]:
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone
0,0,3,0,1,7.25,S,1,0
1,1,1,1,2,71.2833,C,3,0
2,1,3,1,1,7.925,S,2,1
3,1,1,1,2,53.1,S,3,0
4,0,3,0,2,8.05,S,1,1


In [137]:
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

Unnamed: 0,Age*Class,Age,Pclass
0,3,1,3
1,2,2,1
2,3,1,3
3,2,2,1
4,6,2,3
5,3,1,3
6,3,3,1
7,0,0,3
8,3,1,3
9,0,0,2


In [139]:
dataset.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,892,3,0,2,7.8292,Q,1,1,6
1,893,3,1,2,7.0,S,3,0,6
2,894,2,0,3,9.6875,Q,1,1,6
3,895,3,0,1,8.6625,S,1,1,3
4,896,3,1,1,12.2875,S,3,0,3


In [141]:
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,0,3,0,1,7.25,S,1,0,3
1,1,1,1,2,71.2833,C,3,0,2
2,1,3,1,1,7.925,S,2,1,3
3,1,1,1,2,53.1,S,3,0,2
4,0,3,0,2,8.05,S,1,1,6


In [142]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,892,3,0,2,7.8292,Q,1,1,6
1,893,3,1,2,7.0,S,3,0,6
2,894,2,0,3,9.6875,Q,1,1,6
3,895,3,0,1,8.6625,S,1,1,3
4,896,3,1,1,12.2875,S,3,0,3


In [143]:
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port

'S'

In [144]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Embarked,Survived
0,C,0.553571
1,Q,0.38961
2,S,0.339009


In [145]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,0,3,0,1,7.25,0,1,0,3
1,1,1,1,2,71.2833,1,3,0,2
2,1,3,1,1,7.925,0,2,1,3
3,1,1,1,2,53.1,0,3,0,2
4,0,3,0,2,8.05,0,1,1,6


In [146]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
test_df.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,892,3,0,2,7.8292,2,1,1,6
1,893,3,1,2,7.0,0,3,0,6
2,894,2,0,3,9.6875,2,1,1,6
3,895,3,0,1,8.6625,0,1,1,3
4,896,3,1,1,12.2875,0,3,0,3


In [147]:
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

Unnamed: 0,FareBand,Survived
0,"(-0.001, 7.91]",0.197309
1,"(7.91, 14.454]",0.303571
2,"(14.454, 31.0]",0.454955
3,"(31.0, 512.329]",0.581081


In [148]:
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
    
train_df.head(10)

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,0,3,0,1,0,0,1,0,3
1,1,1,1,2,3,1,3,0,2
2,1,3,1,1,1,0,2,1,3
3,1,1,1,2,3,0,3,0,2
4,0,3,0,2,1,0,1,1,6
5,0,3,0,1,1,2,1,1,3
6,0,1,0,3,3,0,1,1,3
7,0,3,0,0,2,0,4,0,0
8,1,3,1,1,1,0,3,0,3
9,1,2,1,0,2,1,3,0,0
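
Rather than copying the quartile edges by hand, they can be learned from the training data and reused on the test data. The sketch below uses made-up fares; `retbins=True` makes `pd.qcut` return the bin edges alongside the bands.

```python
import numpy as np
import pandas as pd

train_fare = pd.Series([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)   # made-up fares
test_fare = pd.Series([0.5, 2.0, 5.0, 100.0])                   # made-up fares

# Learn the quartile edges on the training data only.
_, edges = pd.qcut(train_fare, 4, retbins=True)

# Open the outer edges so unseen test values still fall into a band.
edges[0], edges[-1] = -np.inf, np.inf

train_band = pd.cut(train_fare, bins=edges, labels=False)
test_band = pd.cut(test_fare, bins=edges, labels=False)
print(train_band.tolist(), test_band.tolist())
```

A test fare far above the training maximum (100.0 here) lands in the top band instead of becoming `NaN`.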


### Fit model

In [149]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

In [150]:
from sklearn.model_selection import train_test_split

In [151]:
X_train, X_test, y_train, y_test = train_test_split(train_df.drop("Survived", axis=1), train_df["Survived"], test_size=0.3)
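
The split above is drawn at random on every run, so the scores reported below will vary between runs. As a hedged sketch on made-up data, `random_state` fixes the shuffle and `stratify` keeps the class ratio the same in both parts; neither option is used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3,
    random_state=42,   # fixes the shuffle so reruns give the same split
    stratify=y,        # keeps the class ratio equal in both parts
)
print(y_tr.mean(), y_te.mean())
```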

In [152]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [153]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    Print the accuracy score, classification report and confusion matrix of a classifier.
    '''
    if train:
        # Training performance
        y_pred = clf.predict(X_train)
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))

        # 10-fold cross-validated accuracy on the training split
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
    else:
        # Test performance
        y_pred = clf.predict(X_test)
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
        

## Decision Tree

In [154]:
clf = DecisionTreeClassifier(random_state=42)

clf.fit(X_train, y_train)

print_score(clf, X_train, y_train, X_test, y_test, train=True)

print_score(clf, X_train, y_train, X_test, y_test, train=False) # Test



Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.95      0.91       383
          1       0.90      0.77      0.83       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[363  20]
 [ 55 185]]

Average Accuracy: 	 0.8043
Accuracy SD: 		 0.0385
Test Result:

accuracy score: 0.7985

Classification Report: 
              precision    recall  f1-score   support

          0       0.80      0.90      0.85       166
          1       0.80      0.63      0.70       102

avg / total       0.80      0.80      0.79       268


Confusion Matrix: 
 [[150  16]
 [ 38  64]]



So our decision tree reaches an accuracy of about 0.80 on the held-out test split (0.7985), against 0.8796 on the training data.

## Bagging (oob_score=False)

In [155]:
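# The base estimator is passed positionally because its keyword differs across
# scikit-learn releases (`base_estimator` in older ones, `estimator` in newer ones).
bag_clf = BaggingClassifier(clf, n_estimators=1000,
                            bootstrap=True, n_jobs=-1,
                            random_state=42)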

bag_clf.fit(X_train, y_train)

print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)

print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.89      0.92      0.90       383
          1       0.87      0.81      0.84       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[354  29]
 [ 46 194]]

Average Accuracy: 	 0.8106
Accuracy SD: 		 0.0386
Test Result:

accuracy score: 0.7985

Classification Report: 
              precision    recall  f1-score   support

          0       0.82      0.87      0.84       166
          1       0.76      0.69      0.72       102

avg / total       0.80      0.80      0.80       268


Confusion Matrix: 
 [[144  22]
 [ 32  70]]



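The bagging model has an average cross-validated accuracy of 0.8106 and a test accuracy of 0.7985, the same test accuracy as the single decision tree.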

## Bagging (oob_score=True)

Use out-of-bag samples to estimate the generalization accuracy
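
Out-of-bag samples exist because a bootstrap sample of size $n$ leaves out roughly $e^{-1} \approx 37\%$ of the rows on average. A small NumPy sketch of that fraction follows; the size `n` and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                                   # hypothetical training-set size

bootstrap_idx = rng.integers(0, n, size=n)   # one bootstrap sample of row indices
in_bag = np.zeros(n, dtype=bool)
in_bag[bootstrap_idx] = True                 # mark every row drawn at least once

oob_fraction = 1 - in_bag.mean()             # share of rows never drawn
print(oob_fraction, (1 - 1 / n) ** n)        # both close to exp(-1), about 0.368
```

Each base estimator can therefore be scored on the rows it never saw during training, which is what `oob_score=True` reports.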

In [156]:
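# The base estimator is passed positionally because its keyword differs across
# scikit-learn releases (`base_estimator` in older ones, `estimator` in newer ones).
bag_clf = BaggingClassifier(clf, n_estimators=1000,
                            bootstrap=True, oob_score=True,
                            n_jobs=-1, random_state=42)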

In [157]:
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=1000, n_jobs=-1, oob_score=True,
         random_state=42, verbose=0, warm_start=False)

In [158]:
bag_clf.oob_score_

0.8138041733547352

In [159]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.89      0.92      0.90       383
          1       0.87      0.81      0.84       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[354  29]
 [ 46 194]]

Average Accuracy: 	 0.8106
Accuracy SD: 		 0.0386


In [160]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7985

Classification Report: 
              precision    recall  f1-score   support

          0       0.82      0.87      0.84       166
          1       0.76      0.69      0.72       102

avg / total       0.80      0.80      0.80       268


Confusion Matrix: 
 [[144  22]
 [ 32  70]]



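Setting `oob_score=True` leaves the train and test scores unchanged; the out-of-bag estimate (0.8138) is close to the cross-validated accuracy (0.8106).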

# Random Forest

[paper](http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf)

* Ensemble of Decision Trees

* Trained via the bagging method (repeated sampling with replacement)
  * Bagging: sample from the observations
  * RF: additionally sample from the predictors, typically $m=\sqrt{p}$ of the $p$ predictors for classification and $m=p/3$ for regression problems

* Utilises uncorrelated trees

Random Forest
* Samples both observations and features of the training data

Bagging
* Samples only observations at random
* Each decision tree selects the best feature among all features when splitting a node
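
The feature-sampling rule can be made concrete with scikit-learn's `max_features="sqrt"` option. The sketch below runs on synthetic data with an assumed $p = 16$ predictors, so each split considers $m = 4$ of them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with p = 16 predictors (an assumption of this sketch).
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

p = X.shape[1]
m = int(np.sqrt(p))   # 4 candidate features per split when p = 16

# max_features="sqrt" asks each split to consider sqrt(p) randomly chosen features.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0).fit(X, y)
print(m, len(forest.estimators_))
```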

In [161]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [163]:
rf_clf = RandomForestClassifier(random_state=42)

In [164]:
rf_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [165]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.8780

Classification Report: 
              precision    recall  f1-score   support

          0       0.88      0.92      0.90       383
          1       0.87      0.80      0.84       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[354  29]
 [ 47 193]]

Average Accuracy: 	 0.8074
Accuracy SD: 		 0.0421


In [166]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7799

Classification Report: 
              precision    recall  f1-score   support

          0       0.80      0.86      0.83       166
          1       0.74      0.65      0.69       102

avg / total       0.78      0.78      0.78       268


Confusion Matrix: 
 [[143  23]
 [ 36  66]]



The test accuracy came out at 0.7799 (0.8074 cross-validated), but we can tune the parameters using grid search.

## Grid Search

In [167]:
from sklearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV

In [168]:
rf_clf = RandomForestClassifier(random_state=42)

In [169]:
params_grid = {"max_depth": [3, None],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ['gini', 'entropy']}

In [170]:
grid_search = GridSearchCV(rf_clf, params_grid,
                           n_jobs=-1, cv=5,
                           verbose=1, scoring='accuracy')
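
The size of the search follows directly from the grid: 2 × 3 × 3 × 2 × 2 = 72 parameter combinations, each fitted once per fold. A short sketch with `ParameterGrid` confirms the count; the grid is repeated here so the snippet is self-contained.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above, repeated so this snippet runs on its own.
params_grid = {"max_depth": [3, None],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ["gini", "entropy"]}

n_candidates = len(ParameterGrid(params_grid))
n_fits = n_candidates * 5   # 5-fold cross-validation, as in cv=5
print(n_candidates, n_fits)   # 72 360
```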

In [171]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Done  83 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:    6.7s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': [3, None], 'min_samples_split': [2, 3, 10], 'min_samples_leaf': [1, 3, 10], 'bootstrap': [True, False], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [172]:
grid_search.best_score_

0.826645264847512

In [173]:
grid_search.best_estimator_.get_params()

{'bootstrap': False,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [174]:
print_score(grid_search, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.8620

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.91      0.89       383
          1       0.85      0.78      0.81       240

avg / total       0.86      0.86      0.86       623


Confusion Matrix: 
 [[349  34]
 [ 52 188]]

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Done  86 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:    5.7s finished


Average Accuracy: 	 0.8122
Accuracy SD: 		 0.0314




In [175]:
print_score(grid_search, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8134

Classification Report: 
              precision    recall  f1-score   support

          0       0.83      0.88      0.85       166
          1       0.78      0.71      0.74       102

avg / total       0.81      0.81      0.81       268


Confusion Matrix: 
 [[146  20]
 [ 30  72]]



# Extra-Trees (Extremely Randomized Trees) Ensemble

[scikit-learn](http://scikit-learn.org/stable/modules/ensemble.html#bagging)

* Random Forest is built upon Decision Trees
* Decision Tree node splitting searches for the best threshold according to a criterion such as Gini impurity or entropy
* Extra-Trees instead draw a random threshold for each candidate feature and keep the best of those random splits, unlike a Decision Tree
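
The difference between the two split rules can be illustrated with a small sketch. It is not scikit-learn's implementation: the helper names (`gini`, `split_impurity`, `best_threshold`, `random_threshold`) and the toy data are assumptions made up for illustration, and only a single feature is considered.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting feature x at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(x, y):
    """Decision-tree style: scan the midpoints between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    scores = [split_impurity(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    """Extra-Trees style: draw one threshold uniformly within the feature range."""
    return rng.uniform(x.min(), x.max())

# Toy single-feature data (hypothetical, not the Titanic features)
rng = np.random.default_rng(42)
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print("best threshold:  ", t_best, "impurity:", split_impurity(x, y, t_best))
print("random threshold:", t_rand, "impurity:", split_impurity(x, y, t_rand))
```

The random split is never purer than the exhaustively searched one, but it is much cheaper to compute, and the extra randomness decorrelates the trees in the ensemble.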


In [176]:
from sklearn.ensemble import ExtraTreesClassifier

In [177]:
xt_clf = ExtraTreesClassifier(random_state=42)

xt_clf.fit(X_train, y_train)

print_score(xt_clf, X_train, y_train, X_test, y_test, train=True)

print_score(xt_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.95      0.91       383
          1       0.90      0.77      0.83       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[363  20]
 [ 55 185]]

Average Accuracy: 	 0.8218
Accuracy SD: 		 0.0299
Test Result:

accuracy score: 0.7948

Classification Report: 
              precision    recall  f1-score   support

          0       0.79      0.90      0.85       166
          1       0.80      0.62      0.70       102

avg / total       0.80      0.79      0.79       268


Confusion Matrix: 
 [[150  16]
 [ 39  63]]



In [178]:
Y_pred = xt_clf.predict(test_df.drop('PassengerId',axis=1))

Y_pred

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissions_xt.csv', index=False)

# That is all for bagging. Now moving on to boosting

# Boosting (Hypothesis Boosting)

* Combine several weak learners into a strong learner. 

* Train predictors sequentially

# AdaBoost / Adaptive Boosting

[Robert Schapire](http://rob.schapire.net/papers/explaining-adaboost.pdf)

[Wikipedia](https://en.wikipedia.org/wiki/AdaBoost)

[Chris McCormick](http://mccormickml.com/2013/12/13/adaboost-tutorial/)

[Scikit Learn AdaBoost](http://scikit-learn.org/stable/modules/ensemble.html#adaboost)

AdaBoost was introduced by Yoav Freund and Robert Schapire in 1995.

As above for Boosting:
* Similar to human learning, the algorithm learns from past mistakes by focusing more on the difficult cases it got wrong in earlier rounds. 
* In machine learning terms, each new predictor pays more attention to the training instances that its predecessors underfitted (misclassified).

Source: Scikit-Learn:

* Fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. 
* The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
* The data modifications at each so-called boosting iteration consist of applying weights $w_1, w_2, \ldots, w_N$ to each of the training samples. 
* Initially, those weights are all set to $w_i = 1/N$, so that the first step simply trains a weak learner on the original data. 
* For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. 
* At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. 
* As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.
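
The reweighting loop described above can be sketched by hand. This is a minimal discrete AdaBoost for binary labels, not the code inside `AdaBoostClassifier`; the synthetic data from `make_classification`, the number of rounds, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
y_pm = np.where(y_demo == 1, 1, -1)          # recode labels to {-1, +1}

n_rounds = 10
n = len(y_pm)
w = np.full(n, 1.0 / n)                      # initial weights w_i = 1/N
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree (stump) fitted on the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_pm, sample_weight=w)
    pred = stump.predict(X_demo)

    err = np.sum(w[pred != y_pm])            # weighted error (weights sum to 1)
    err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against log(0)
    alpha = 0.5 * np.log((1 - err) / err)    # vote weight of this weak learner

    # Misclassified samples gain weight, correctly classified ones lose weight
    w = w * np.exp(-alpha * y_pm * pred)
    w = w / w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted majority vote of all weak learners
scores = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(scores)
first_acc = np.mean(stumps[0].predict(X_demo) == y_pm)
ensemble_acc = np.mean(ensemble_pred == y_pm)
print("first stump train accuracy:", first_acc)
print("ensemble train accuracy:   ", ensemble_acc)
```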



In [179]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier()

ada_clf.fit(X_train, y_train)

print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8395

Classification Report: 
              precision    recall  f1-score   support

          0       0.85      0.90      0.87       383
          1       0.82      0.75      0.78       240

avg / total       0.84      0.84      0.84       623


Confusion Matrix: 
 [[343  40]
 [ 60 180]]

Average Accuracy: 	 0.7992
Accuracy SD: 		 0.0287
Test Result:

accuracy score: 0.7948

Classification Report: 
              precision    recall  f1-score   support

          0       0.82      0.86      0.84       166
          1       0.75      0.69      0.72       102

avg / total       0.79      0.79      0.79       268


Confusion Matrix: 
 [[143  23]
 [ 32  70]]



In [None]:
Y_pred = ada_clf.predict(test_df.drop('PassengerId',axis=1))

Y_pred

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissions_ada.csv', index=False)

## AdaBoost with Random Forest

In [180]:
from sklearn.ensemble import RandomForestClassifier

In [181]:
ada_clf = AdaBoostClassifier(RandomForestClassifier())

ada_clf.fit(X_train, y_train)

print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

ada_clf = AdaBoostClassifier(base_estimator=RandomForestClassifier())

ada_clf.fit(X_train, y_train)

print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8780

Classification Report: 
              precision    recall  f1-score   support

          0       0.89      0.91      0.90       383
          1       0.85      0.83      0.84       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[348  35]
 [ 41 199]]

Average Accuracy: 	 0.8122
Accuracy SD: 		 0.0411
Test Result:

accuracy score: 0.7985

Classification Report: 
              precision    recall  f1-score   support

          0       0.82      0.86      0.84       166
          1       0.76      0.70      0.72       102

avg / total       0.80      0.80      0.80       268


Confusion Matrix: 
 [[143  23]
 [ 31  71]]

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.89      0.92      0.90       383
          1       0.86      0.82      0.84       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[3

In [None]:
Y_pred = ada_clf.predict(test_df.drop('PassengerId',axis=1))

Y_pred

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissions_ada_random.csv', index=False)

# Gradient Boosting

Works for both regression and classification

[Wikipedia](https://en.wikipedia.org/wiki/Gradient_boosting)

* Sequentially adding predictors
* Each one correcting its predecessor
* Fit new predictor to the residual errors

Compare this to AdaBoost: 
* AdaBoost alters the instance weights at every iteration instead of fitting the residual errors


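**Step 1.** Fit a first model $F$ to the target; $\epsilon$ is its residual error:

  $$Y = F(x) + \epsilon$$

**Step 2.** Fit a second model $G$ to the residual $\epsilon$:

  $$\epsilon = G(x) + \epsilon_2$$

  Substituting (2) into (1), we get:
  
  $$Y = F(x) + G(x) + \epsilon_2$$
    
**Step 3.** Fit a third model $H$ to the remaining residual $\epsilon_2$:

  $$\epsilon_2 = H(x)  + \epsilon_3$$

Now:
  
  $$Y = F(x) + G(x) + H(x)  + \epsilon_3$$
  
Finally, weighting each model's contribution gives:  
  
  $$Y = \alpha F(x) + \beta G(x) + \gamma H(x)  + \epsilon_4$$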
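
The three steps can be reproduced by hand with small regression trees. This is a sketch under stated assumptions: the noisy quadratic data is synthetic, the tree depth is arbitrary, and the weights $\alpha, \beta, \gamma$ are all taken as 1, so it corresponds to the unweighted sum $F + G + H$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(42)
x_demo = rng.uniform(-0.5, 0.5, size=(200, 1))
y_demo = 3 * x_demo[:, 0] ** 2 + 0.05 * rng.normal(size=200)

def mse(residual):
    """Mean squared value of a residual vector."""
    return float(np.mean(residual ** 2))

# Step 1: fit F to Y; eps1 is what F leaves unexplained
F = DecisionTreeRegressor(max_depth=2, random_state=42).fit(x_demo, y_demo)
eps1 = y_demo - F.predict(x_demo)

# Step 2: fit G to the residual of F
G = DecisionTreeRegressor(max_depth=2, random_state=42).fit(x_demo, eps1)
eps2 = eps1 - G.predict(x_demo)

# Step 3: fit H to the residual of F + G
H = DecisionTreeRegressor(max_depth=2, random_state=42).fit(x_demo, eps2)
eps3 = eps2 - H.predict(x_demo)

# The ensemble prediction is the sum of the three models
y_hat = F.predict(x_demo) + G.predict(x_demo) + H.predict(x_demo)

print("training MSE after F:        ", mse(eps1))
print("training MSE after F + G:    ", mse(eps2))
print("training MSE after F + G + H:", mse(eps3))
```

Each tree predicts the mean residual in its leaves, so the training error cannot increase from one step to the next.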

Gradient boosting involves three elements:

* **Loss function to be optimized**: The loss function depends on the type of problem being solved. Regression problems typically use mean squared error, and classification problems typically use logarithmic loss. At each boosting stage, the loss left unexplained by prior iterations is optimized rather than starting from scratch.

* **Weak learner to make predictions**: Decision trees are used as the weak learners in gradient boosting.

* **Additive model to add weak learners to minimize the loss function**: Trees are added one at a time and existing trees in the model are not changed. The gradient descent procedure is used to minimize the loss when adding trees.
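
One way to watch the additive model at work is to track the training loss as trees are added. The sketch below uses `staged_predict_proba`, which yields the predicted probabilities after each boosting stage; the synthetic data and the hyperparameters are assumptions, not the settings used on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Synthetic stand-in data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

gb_demo = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                     max_depth=3, random_state=42)
gb_demo.fit(X_demo, y_demo)

# Training log loss after 1, 2, ..., 50 trees; earlier trees are never modified
stage_losses = [log_loss(y_demo, proba)
                for proba in gb_demo.staged_predict_proba(X_demo)]

print("log loss after  1 tree :", round(stage_losses[0], 4))
print("log loss after 50 trees:", round(stage_losses[-1], 4))
```

The same loop run on a held-out set is a common way to choose `n_estimators`, since the held-out loss eventually stops improving while the training loss keeps falling.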

In [182]:
from sklearn.ensemble import GradientBoostingClassifier

In [183]:
gbc_clf = GradientBoostingClassifier()
gbc_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [184]:
print_score(gbc_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.8555

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.90      0.88       383
          1       0.83      0.79      0.81       240

avg / total       0.85      0.86      0.85       623


Confusion Matrix: 
 [[344  39]
 [ 51 189]]

Average Accuracy: 	 0.8026
Accuracy SD: 		 0.0278


In [185]:
print_score(gbc_clf, X_train, y_train, X_test, y_test, train=False) # Test

Test Result:

accuracy score: 0.7687

Classification Report: 
              precision    recall  f1-score   support

          0       0.80      0.83      0.82       166
          1       0.71      0.67      0.69       102

avg / total       0.77      0.77      0.77       268


Confusion Matrix: 
 [[138  28]
 [ 34  68]]



In [186]:
Y_pred = gbc_clf.predict(test_df.drop('PassengerId',axis=1))

Y_pred

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissions_gbc.csv', index=False)

# XGBoost (Extreme Gradient Boosting)

[Documentation](http://xgboost.readthedocs.io/en/latest/)

[tqchen github](https://github.com/tqchen/xgboost/tree/master/demo/guide-python)

[dmlc github](https://github.com/dmlc/xgboost)

* “Gradient Boosting” was proposed in the paper *Greedy Function Approximation: A Gradient Boosting Machine* by Friedman. 
* XGBoost is based on this original model. 

* XGBoost is used for supervised learning problems

## Objective Function : Training Loss + Regularization

$$\mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta)$$

* $L$ is the training loss function, and 
* $\Omega$ is the regularization term. 

### Training Loss

The training loss measures how predictive our model is on training data.

Example 1, Mean Squared Error for Linear Regression:

$$L(\theta) = \sum_i (y_i - \hat{y}_i)^2$$

Example 2, Logistic Loss for Logistic Regression:

$$L(\theta) = \sum_i \left[ y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i) \ln(1 + e^{\hat{y}_i}) \right]$$
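
Both losses are easy to evaluate directly. In this sketch the function names and the toy arrays are made up for illustration, and $\hat{y}_i$ in the logistic loss is taken to be the raw score (log-odds) rather than a probability, which is what the formula above assumes.

```python
import numpy as np

def squared_error_loss(y, y_hat):
    """Sum of squared errors between targets and predictions."""
    return float(np.sum((y - y_hat) ** 2))

def logistic_loss(y, y_hat):
    """Logistic loss for labels y in {0, 1} and raw scores y_hat (log-odds)."""
    # np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way
    return float(np.sum(y * np.logaddexp(0, -y_hat)
                        + (1 - y) * np.logaddexp(0, y_hat)))

# Regression example: errors are 0 and -2, so the loss is 0 + 4 = 4.0
y_reg = np.array([1.0, 2.0])
pred_reg = np.array([1.0, 4.0])
print(squared_error_loss(y_reg, pred_reg))   # 4.0

# Classification example: a score of 0 means "50/50", costing ln(2) per sample
y_cls = np.array([1, 0])
score_cls = np.array([0.0, 0.0])
print(logistic_loss(y_cls, score_cls))       # 2 * ln(2), about 1.386
```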

### Regularization Term

The regularization term controls the complexity of the model, which helps us to avoid overfitting. 
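
As one concrete instance, the XGBoost documentation defines the complexity of a single tree $f$ with $T$ leaves and leaf scores $w_j$ as shown below. Relating $\gamma$ and $\lambda$ to the `gamma` and `reg_lambda` parameters of `XGBClassifier` is how those parameters are usually read, but treat that mapping as an assumption to check against the documentation for your version.

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

A larger $\gamma$ penalizes every additional leaf, and a larger $\lambda$ shrinks the leaf scores toward zero; both push the model toward simpler trees.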

In [187]:
import xgboost as xgb

In [188]:
xgb_clf = xgb.XGBClassifier(max_depth=5, n_estimators=10000, learning_rate=0.3,
                            n_jobs=-1)

In [189]:
xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.3, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=10000,
       n_jobs=-1, nthread=None, objective='binary:logistic',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [190]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.8780

Classification Report: 
              precision    recall  f1-score   support

          0       0.88      0.93      0.90       383
          1       0.88      0.79      0.83       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[358  25]
 [ 51 189]]

Average Accuracy: 	 0.8186
Accuracy SD: 		 0.0280


In [191]:
print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8134

Classification Report: 
              precision    recall  f1-score   support

          0       0.82      0.90      0.86       166
          1       0.80      0.68      0.73       102

avg / total       0.81      0.81      0.81       268


Confusion Matrix: 
 [[149  17]
 [ 33  69]]


# XGBoost has the best test accuracy among the boosting models above (0.8134)

In [None]:
Y_pred = xgb_clf.predict(test_df.drop('PassengerId',axis=1))

Y_pred

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissions_xgb.csv', index=False)

# Ensemble of ensembles - model stacking

* **Ensemble with different types of classifiers**: 
  * Different types of classifiers (E.g., logistic regression, decision trees, random forest, etc.) are fitted on the same training data
  * Results are combined based on either 
    * majority voting (classification) or 
    * average (regression)
  

* **Ensemble with a single type of classifier**: 
  * Bootstrap samples are drawn from the training data 
  * A model (e.g., a decision tree or a random forest) is fitted on each bootstrap sample 
  * All the results are combined to create an ensemble. 
  * Suitable for highly flexible models that are prone to overfitting / high variance. 
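
The two kinds of ensemble described above map onto two scikit-learn classes. The sketch below is illustrative only: the data is synthetic, and the choice of base models and of `n_estimators=25` is an assumption rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Different types of classifiers fitted on the same data, combined by majority vote
voting_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
voting_demo.fit(X_tr, y_tr)

# A single type of classifier, each copy fitted on its own bootstrap sample
bagging_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=25,
    bootstrap=True,
    random_state=42,
)
bagging_demo.fit(X_tr, y_tr)

voting_acc = voting_demo.score(X_te, y_te)
bagging_acc = bagging_demo.score(X_te, y_te)
print("voting ensemble test accuracy: ", voting_acc)
print("bagging ensemble test accuracy:", bagging_acc)
```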

***

## Combining Method

* **Majority voting or average**: 
  * Classification: Largest number of votes (mode) 
  * Regression problems: Average (mean).
  
  
* **Meta-classifier applied to outcomes**: 
  * Binary outcomes (0 / 1) are obtained from the individual classifiers
  * A meta-classifier is trained on top of these outcomes. 
  
  
* **Meta-classifier applied to probabilities**: 
  * Probabilities are obtained from the individual classifiers. 
  * A meta-classifier is trained on top of these probabilities
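
Majority voting on binary outcomes and averaging of probabilities can disagree, which a few made-up numbers make visible. The probability table below is hypothetical; it stands for the class-1 probabilities that three classifiers might assign to four samples.

```python
import numpy as np

# Hypothetical P(class 1) from three classifiers (rows) for four samples (columns)
proba = np.array([
    [0.90, 0.60, 0.40, 0.55],   # classifier A
    [0.80, 0.40, 0.60, 0.55],   # classifier B
    [0.70, 0.70, 0.30, 0.10],   # classifier C
])

# Binary outcomes: each classifier votes 1 if its probability is at least 0.5
votes = (proba >= 0.5).astype(int)

# Majority voting: class 1 wins if more than half of the classifiers vote for it
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Averaging: threshold the mean probability instead of counting votes
mean_proba = proba.mean(axis=0)
soft = (mean_proba >= 0.5).astype(int)

print("majority vote:       ", majority)   # [1 1 0 1]
print("mean probability:    ", mean_proba.round(3))
print("averaged-probability:", soft)       # [1 1 0 0]
```

The last sample shows the difference: two lukewarm votes for class 1 win the majority vote, but one confident vote against pulls the mean probability below 0.5. A meta-classifier generalizes this by learning how much to trust each base model instead of weighting them equally.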
  

## Model 1 : Decision Trees

In [192]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.95      0.91       383
          1       0.90      0.77      0.83       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[363  20]
 [ 55 185]]

Average Accuracy: 	 0.8075
Accuracy SD: 		 0.0385
Test Result:

accuracy score: 0.7985

Classification Report: 
              precision    recall  f1-score   support

          0       0.80      0.90      0.85       166
          1       0.80      0.63      0.70       102

avg / total       0.80      0.80      0.79       268


Confusion Matrix: 
 [[150  16]
 [ 38  64]]



## Model 2: Random Forest

In [193]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train.ravel())

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.89      0.92      0.90       383
          1       0.87      0.81      0.84       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[353  30]
 [ 45 195]]

Average Accuracy: 	 0.8073
Accuracy SD: 		 0.0401
Test Result:

accuracy score: 0.7873

Classification Report: 
              precision    recall  f1-score   support

          0       0.82      0.84      0.83       166
          1       0.73      0.71      0.72       102

avg / total       0.79      0.79      0.79       268


Confusion Matrix: 
 [[139  27]
 [ 30  72]]



In [194]:
en_en = pd.DataFrame()

In [195]:
tree_clf.predict_proba(X_train)

array([[1.        , 0.        ],
       [0.92727273, 0.07272727],
       [0.        , 1.        ],
       ...,
       [0.87804878, 0.12195122],
       [0.        , 1.        ],
       [0.85714286, 0.14285714]])

In [196]:
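# Column 1 of predict_proba is the predicted probability of class 1 (survived)
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
# Append the true labels as an extra column, aligned by position
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)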

In [197]:
en_en.head()

| | tree_clf | rf_clf | Survived |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0 |
| 1 | 0.072727 | 0.063031 | 0 |
| 2 | 1.0 | 1.0 | 1 |
| 3 | 0.0 | 0.125 | 0 |
| 4 | 0.130435 | 0.136922 | 0 |


In [198]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

# Meta Classifier

In [199]:
from sklearn.linear_model import LogisticRegression

m_clf = LogisticRegression(fit_intercept=False)

m_clf.fit(en_en[['tree_clf', 'rf_clf']], en_en['ind'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [200]:
en_test = pd.DataFrame()

In [201]:
# Class-1 probabilities from each base model on the test set
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_test))[1]
# Meta-classifier prediction from the two base-model probabilities
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf']])

In [202]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [203]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [204]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [205]:
en_test.columns = tmp

In [206]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined   0    1
ind              
0         62  104
1         12   90


In [207]:
print(round(accuracy_score(en_test['ind'], en_test['combined']), 4))

0.5672


In [208]:
print(classification_report(en_test['ind'], en_test['combined']))

             precision    recall  f1-score   support

          0       0.84      0.37      0.52       166
          1       0.46      0.88      0.61       102

avg / total       0.70      0.57      0.55       268
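
The stacked accuracy above (0.5672) is well below that of either base model. Two likely contributors, offered as a hypothesis rather than a diagnosis, are that the meta-classifier was trained on in-sample probabilities (which are close to 0 or 1 for fully grown trees) and that `fit_intercept=False` forces the decision boundary through the origin. A common alternative is to train the meta-classifier on out-of-fold probabilities and keep the intercept. The sketch below shows that variant on synthetic data; the dataset, the 5-fold setting, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_models = {
    "tree_clf": DecisionTreeClassifier(random_state=42),
    "rf_clf": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Step 1: out-of-fold class-1 probabilities, so no base model scores rows it was fitted on
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Refit each base model on the full training set to build the test-set meta-features
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for model in base_models.values()
])

# Step 2: the combiner is trained on the out-of-fold probabilities (intercept kept)
meta_clf = LogisticRegression()
meta_clf.fit(meta_train, y_tr)

stacked_acc = accuracy_score(y_te, meta_clf.predict(meta_test))
print("stacked test accuracy:", round(stacked_acc, 4))
```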



# Using Single Classifier

In [209]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import BaggingClassifier

from sklearn.ensemble import AdaBoostClassifier

In [210]:
pd.Series(list(y_train)).value_counts(normalize=True)

0    0.614767
1    0.385233
dtype: float64

In [211]:
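# Weights mirror the class frequencies above, which up-weights the majority class; class_weight='balanced' would do the opposite
class_weight = {0:0.61, 1:0.38}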

In [212]:
forest = RandomForestClassifier(class_weight=class_weight)

In [213]:
ada = AdaBoostClassifier(base_estimator=forest, n_estimators=100,
                         learning_rate=0.5, random_state=42)

In [214]:
ada.fit(X_train, y_train.ravel())

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight={0: 0.61, 1: 0.38},
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
          learning_rate=0.5, n_estimators=100, random_state=42)

In [215]:
print_score(ada, X_train, y_train, X_test, y_test, train=True)
print_score(ada, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8796

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.95      0.91       383
          1       0.90      0.77      0.83       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[363  20]
 [ 55 185]]

Average Accuracy: 	 0.8154
Accuracy SD: 		 0.0301
Test Result:

accuracy score: 0.7948

Classification Report: 
              precision    recall  f1-score   support

          0       0.80      0.90      0.84       166
          1       0.79      0.63      0.70       102

avg / total       0.79      0.79      0.79       268


Confusion Matrix: 
 [[149  17]
 [ 38  64]]



In [216]:
bag_clf = BaggingClassifier(base_estimator=ada, n_estimators=50,
                            max_samples=1.0, max_features=1.0, bootstrap=True,
                            bootstrap_features=False, n_jobs=-1,
                            random_state=42)

In [217]:
bag_clf.fit(X_train, y_train.ravel())

BaggingClassifier(base_estimator=AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight={0: 0.61, 1: 0.38},
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impur...       verbose=0, warm_start=False),
          learning_rate=0.5, n_estimators=100, random_state=42),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=50, n_jobs=-1, oob_score=False,
         random_state=42, verbose=0, warm_start=False)

In [218]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.8780

Classification Report: 
              precision    recall  f1-score   support

          0       0.88      0.93      0.90       383
          1       0.88      0.80      0.83       240

avg / total       0.88      0.88      0.88       623


Confusion Matrix: 
 [[356  27]
 [ 49 191]]

Average Accuracy: 	 0.8073
Accuracy SD: 		 0.0317
Test Result:

accuracy score: 0.8172

Classification Report: 
              precision    recall  f1-score   support

          0       0.83      0.89      0.86       166
          1       0.79      0.71      0.75       102

avg / total       0.82      0.82      0.81       268


Confusion Matrix: 
 [[147  19]
 [ 30  72]]



In [219]:
Y_pred = bag_clf.predict(test_df.drop('PassengerId',axis=1))

Y_pred

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissions_bag_last.csv', index=False)