# Introduction to Ensemble Methods

## What are Ensemble Methods
Ensemble methods is a machine learning technique that combines several base models in order to produce one optimal predictive model. This helps us to decrease variance (bagging), bias (boosting), or improve predictions (stacking).

Ensemble methods can be divided into two groups:
1. Sequential where the base learners are generated sequentially (e.g. AdaBoost).
2. Parallel where the base learners are generated in parallel (e.g. Random Forest). 

Emsemble methods can also be categorized as:
1. Homogenous which use a single base learning algorithm to produce homogeneous base learners, i.e. learners of the same type
2. Hetrogenous which use heterogeneous learners, i.e. learners of different types

*Note: In order for ensemble methods to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible*

## Terminologies

**Bagging:**  Bagging stands for **B**ootstrap **agg**regation. One way to reduce the variance of an estimate is to average together multiple estimates. Variance is the amount that the estimate of the target function will change if different training data was used. Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.
*Note: The bootstrap sampling method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples. Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.

Imagine we have a dataset with 6 observations:

In [1]:
dataset = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

Now we are going to choose a random observation from the above dataset.

In [2]:
sample = [0.2]

This observation is returned to the dataset and we repeat this step 3 more times.

In [3]:
sample = [0.2, 0.1, 0.5, 0.1]

We now have our data sample. The example purposefully demonstrates that the same value can appear zero, one or more times in the sample. Here the observation 0.1 appears twice.

An estimate can then be calculated on the drawn sample. Those observations not chosen for the sample may be used as out of sample observations.

In [4]:
oosb = [0.3, 0.4, 0.6]

In the case of evaluating a machine learning model, the model is fit on the drawn sample and evaluated on the out-of-bag sample. We do not have to implement the bootstrap method manually. The scikit-learn library provides an implementation that will create a single bootstrap sample of a dataset.

The resample() scikit-learn function can be used.

![Bagging Picture](./bagging.png)

In [5]:
# Standard Libraries required for reading data & perform exploratory analytics
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

In [6]:
train = pd.read_csv('train.csv')

In [7]:
train.shape

(891, 12)

In [8]:
train.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
838,839,1,3,"Chip, Mr. Chang",male,32.0,0,0,1601,56.4958,,S
521,522,0,3,"Vovk, Mr. Janko",male,22.0,0,0,349252,7.8958,,S
410,411,0,3,"Sdycoff, Mr. Todor",male,,0,0,349222,7.8958,,S
334,335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinshei...",female,,1,0,PC 17611,133.65,,S
836,837,0,3,"Pasic, Mr. Jakob",male,21.0,0,0,315097,8.6625,,S


In [9]:
#  Check  for missing values in the target field which is Survived
print(train.isnull().any())

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool


In [10]:
# Create Titles
train['Title'] = train.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train['Title'], train['Sex'])

Sex,female,male
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Capt,0,1
Col,0,2
Countess,1,0
Don,0,1
Dr,1,6
Jonkheer,0,1
Lady,1,0
Major,0,2
Master,0,40
Miss,182,0


In [11]:
# Combine multiple titles to simplify
train['Title'] = train['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
pd.crosstab(train['Title'], train['Sex'])

Sex,female,male
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Master,0,40
Miss,185,0
Mr,0,517
Mrs,126,0
Rare,3,20


In [12]:
# Convert Titles to Numbers
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
train['Title'] = train['Title'].map(title_mapping)
train['Title'] = train['Title'].fillna(0)
pd.crosstab(train['Title'], train['Sex'])

Sex,female,male
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0,517
2,185,0
3,126,0
4,0,40
5,3,20


In [13]:
# Dropping Name and Passenger Ids as it will not add any value
train.drop(['Name', 'PassengerId'], axis=1, inplace=True)
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,0,3,male,22.0,1,0,A/5 21171,7.25,,S,1
1,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,3
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,2
3,1,1,female,35.0,1,0,113803,53.1,C123,S,3
4,0,3,male,35.0,0,0,373450,8.05,,S,1


In [14]:
# Encoding values for sex category
train['Sex'] = train['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

In [15]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,0,3,0,22.0,1,0,A/5 21171,7.25,,S,1
1,1,1,1,38.0,1,0,PC 17599,71.2833,C85,C,3
2,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,,S,2
3,1,1,1,35.0,1,0,113803,53.1,C123,S,3
4,0,3,0,35.0,0,0,373450,8.05,,S,1


In [16]:
# Looking for NaN
print(train.isnull().any())

Survived    False
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Ticket      False
Fare        False
Cabin        True
Embarked     True
Title       False
dtype: bool


In [17]:
train.Age.value_counts()

24.00    30
22.00    27
18.00    26
28.00    25
19.00    25
         ..
55.50     1
74.00     1
0.92      1
70.50     1
12.00     1
Name: Age, Length: 88, dtype: int64

In [18]:
train[train.Age.isnull()]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
5,0,3,0,,0,0,330877,8.4583,,Q,1
17,1,2,0,,0,0,244373,13.0000,,S,1
19,1,3,1,,0,0,2649,7.2250,,C,3
26,0,3,0,,0,0,2631,7.2250,,C,1
28,1,3,1,,0,0,330959,7.8792,,Q,2
...,...,...,...,...,...,...,...,...,...,...,...
859,0,3,0,,0,0,2629,7.2292,,C,1
863,0,3,1,,8,2,CA. 2343,69.5500,,S,2
868,0,3,0,,0,0,345777,9.5000,,S,1
878,0,3,0,,0,0,349217,7.8958,,S,1


We see age with decimal values, for simpicity lets transform age to proper absoluete numbers and also populate the NaN records. We will populate the NaN records using the Median age of the person. As the median age of the person can vary based on his gender and also has dependency on which class he is travelling in. 

In [19]:
train.Pclass.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [20]:
male_age = [0, 0, 0]
female_age = [0, 0, 0]

for j in range(1, 4):
    no_nan_df_male_age = train[(train['Sex'] == 0) & (train['Pclass'] == j)]['Age'].dropna()
    no_nan_df_female_age = train[(train['Sex'] == 1) & (train['Pclass'] == j)]['Age'].dropna()
    male_age[j-1] = no_nan_df_male_age.median()
    female_age[j-1] = no_nan_df_female_age.median()

for j in range(1, 4):
    train.loc[(train.Age.isnull()) & (train.Sex == 0) & (train.Pclass == j),'Age'] = male_age[j-1]
    train.loc[(train.Age.isnull()) & (train.Sex == 1) & (train.Pclass == j),'Age'] = female_age[j-1]

print(train.isnull().any())

Survived    False
Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Ticket      False
Fare        False
Cabin        True
Embarked     True
Title       False
dtype: bool


In [21]:
# Convert Age from floating point to numbers
train['Age'] = train['Age'].astype(int)

In [22]:
train.sample(5)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
103,0,3,0,33,0,0,7540,8.6542,,S,1
231,0,3,0,29,0,0,347067,7.775,,S,1
739,0,3,0,25,0,0,349218,7.8958,,S,1
198,1,3,1,21,0,0,370370,7.75,,Q,2
183,1,2,0,1,2,1,230136,39.0,F4,S,4


In [23]:
train['AgeGroup'] = pd.cut(train['Age'], 5)
train[['AgeGroup', 'Survived']].groupby(['AgeGroup'], as_index=False).mean().sort_values(by='AgeGroup', ascending=True)

Unnamed: 0,AgeGroup,Survived
0,"(-0.08, 16.0]",0.55
1,"(16.0, 32.0]",0.337374
2,"(32.0, 48.0]",0.412037
3,"(48.0, 64.0]",0.434783
4,"(64.0, 80.0]",0.090909


We can clearly observe that Agegroup 64 to 80 the survival rate is very low

In [24]:
train.loc[ train['Age'] <= 16, 'Age'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <= 32), 'Age'] = 1
train.loc[(train['Age'] > 32) & (train['Age'] <= 48), 'Age'] = 2
train.loc[(train['Age'] > 48) & (train['Age'] <= 64), 'Age'] = 3
train.loc[train['Age'] > 64, 'Age'] = 5
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,AgeGroup
0,0,3,0,1,1,0,A/5 21171,7.25,,S,1,"(16.0, 32.0]"
1,1,1,1,2,1,0,PC 17599,71.2833,C85,C,3,"(32.0, 48.0]"
2,1,3,1,1,0,0,STON/O2. 3101282,7.925,,S,2,"(16.0, 32.0]"
3,1,1,1,2,1,0,113803,53.1,C123,S,3,"(32.0, 48.0]"
4,0,3,0,2,0,0,373450,8.05,,S,1,"(32.0, 48.0]"


In [25]:
train.drop('AgeGroup', axis=1, inplace=True)

In [26]:
train[['SibSp', 'Parch', 'Survived']].groupby(['SibSp','Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Unnamed: 0,SibSp,Parch,Survived
17,3,0,1.0
3,0,3,1.0
16,2,3,1.0
14,2,1,0.857143
2,0,2,0.724138
1,0,1,0.657895
8,1,2,0.631579
7,1,1,0.596491
6,1,0,0.520325
15,2,2,0.5


Let's consider a person is alone if SibSp + Parch = 0 otherwise he is with family

In [27]:
train['IsAlone'] = 0
train['FamilySize'] = train['SibSp'] + train['Parch']
train.loc[train['FamilySize']==0, 'IsAlone'] = 1
train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,IsAlone,Survived
0,0,0.50565
1,1,0.303538


In [28]:
train = train.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)

In [29]:
# Let's fill in other NAN records
print(train.isnull().any())

Survived    False
Pclass      False
Sex         False
Age         False
Ticket      False
Fare        False
Cabin        True
Embarked     True
Title       False
IsAlone     False
dtype: bool


In [30]:
# Cabin, Embarked neeed to be handled
freq_port = train.Embarked.dropna().mode()[0]
freq_port

'S'

In [31]:
train['Embarked'] = train['Embarked'].fillna(freq_port)
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,IsAlone
0,0,3,0,1,A/5 21171,7.25,,S,1,0
1,1,1,1,2,PC 17599,71.2833,C85,C,3,0
2,1,3,1,1,STON/O2. 3101282,7.925,,S,2,1
3,1,1,1,2,113803,53.1,C123,S,3,0
4,0,3,0,2,373450,8.05,,S,1,1


In [32]:
train.Cabin.value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
D              3
              ..
C50            1
E12            1
D15            1
E77            1
D9             1
Name: Cabin, Length: 147, dtype: int64

In [33]:
# Dropping Cabin column altogether
train.drop('Cabin', axis=1, inplace=True)

In [34]:
train.Fare.value_counts()

8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
50.4958     1
13.8583     1
8.4583      1
7.7250      1
7.5208      1
Name: Fare, Length: 248, dtype: int64

In [35]:
#Let's group the fares like Age
train.loc[ train['Fare'] <= 7.91, 'Fare'] = 0
train.loc[(train['Fare'] > 7.91) & (train['Fare'] <= 14.454), 'Fare'] = 1
train.loc[(train['Fare'] > 14.454) & (train['Fare'] <= 31), 'Fare']   = 2
train.loc[ train['Fare'] > 31, 'Fare'] = 3
train['Fare'] = train['Fare'].astype(int)

In [36]:
print(train.isnull().any())

Survived    False
Pclass      False
Sex         False
Age         False
Ticket      False
Fare        False
Embarked    False
Title       False
IsAlone     False
dtype: bool


In [37]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Ticket,Fare,Embarked,Title,IsAlone
0,0,3,0,1,A/5 21171,0,S,1,0
1,1,1,1,2,PC 17599,3,C,3,0
2,1,3,1,1,STON/O2. 3101282,1,S,2,1
3,1,1,1,2,113803,3,S,3,0
4,0,3,0,2,373450,1,S,1,1


In [38]:
# Drop Tikcet column
train.drop('Ticket', axis=1, inplace=True)

In [39]:
# Convert Embarked to Ordinal
train['Embarked'] = train['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

In [40]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone
0,0,3,0,1,0,0,1,0
1,1,1,1,2,3,1,3,0
2,1,3,1,1,1,0,2,1
3,1,1,1,2,3,0,3,0
4,0,3,0,2,1,0,1,1


Let's work on the model now

In [41]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

In [42]:
X_train, X_test, y_train, y_test = train_test_split(train.drop("Survived", axis=1), train["Survived"], test_size=0.3, random_state=100)

In [43]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# print all the scorers of classifier: accuracy score, classification report and confusion matrix
def scores(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        print("\nTrain Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(X_train))))
        print("Classification Report: \n {}\n".format(classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, clf.predict(X_train))))

        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
    else:
        print("\nTest Result:\n")        
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(X_test))))
        print("Classification Report: \n {}\n".format(classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, clf.predict(X_test))))    

In [44]:
# Decision Treee
clf = DecisionTreeClassifier(random_state=100)

clf.fit(X_train, y_train)

scores(clf, X_train, y_train, X_test, y_test, train=True)

scores(clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8716

Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.95      0.90       390
           1       0.90      0.74      0.81       233

    accuracy                           0.87       623
   macro avg       0.88      0.85      0.86       623
weighted avg       0.87      0.87      0.87       623


Confusion Matrix: 
 [[370  20]
 [ 60 173]]

Average Accuracy: 	 0.8026
Accuracy SD: 		 0.0360

Test Result:

accuracy score: 0.8060

Classification Report: 
               precision    recall  f1-score   support

           0       0.79      0.91      0.85       159
           1       0.84      0.65      0.73       109

    accuracy                           0.81       268
   macro avg       0.81      0.78      0.79       268
weighted avg       0.81      0.81      0.80       268


Confusion Matrix: 
 [[145  14]
 [ 38  71]]



So our decision tree has an accuracy of 0.8060

Let's try bagging now

In [45]:
bag_clf = BaggingClassifier(base_estimator=clf, n_estimators=1000,
                            bootstrap=True, oob_score=True, n_jobs=-1,
                            random_state=100)

bag_clf.fit(X_train, y_train)

scores(bag_clf, X_train, y_train, X_test, y_test, train=True)

scores(bag_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8716

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.94      0.90       390
           1       0.88      0.76      0.82       233

    accuracy                           0.87       623
   macro avg       0.87      0.85      0.86       623
weighted avg       0.87      0.87      0.87       623


Confusion Matrix: 
 [[365  25]
 [ 55 178]]

Average Accuracy: 	 0.8010
Accuracy SD: 		 0.0313

Test Result:

accuracy score: 0.8172

Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.91      0.85       159
           1       0.83      0.69      0.75       109

    accuracy                           0.82       268
   macro avg       0.82      0.80      0.80       268
weighted avg       0.82      0.82      0.81       268


Confusion Matrix: 
 [[144  15]
 [ 34  75]]



We can see the impact of bagging and the accuracy score now is 0.8172 (vs 0.8060 for decision tree)


**Bagging - RandomForest:** Random Forest Models can be thought of as BAGGing, with a slight tweak. When deciding where to split and how to make decisions, BAGGed Decision Trees have the full disposal of features to choose from. Therefore, although the bootstrapped samples may be slightly different, the data is largely going to break off at the same features throughout each model. In contrary, Random Forest models decide where to split based on a random selection of features. Rather than splitting at similar features at each node throughout, Random Forest models implement a level of differentiation because each tree will split based on different features. This level of differentiation provides a greater ensemble to aggregate over, ergo producing a more accurate predictor. Refer to the image for a better understanding.

In [46]:
from sklearn.ensemble import RandomForestClassifier

In [47]:
rf_clf = RandomForestClassifier(random_state=100)
rf_clf.fit(X_train, y_train)
scores(rf_clf, X_train, y_train, X_test, y_test, train=True)
scores(rf_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8716

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.94      0.90       390
           1       0.88      0.76      0.82       233

    accuracy                           0.87       623
   macro avg       0.87      0.85      0.86       623
weighted avg       0.87      0.87      0.87       623


Confusion Matrix: 
 [[365  25]
 [ 55 178]]

Average Accuracy: 	 0.8010
Accuracy SD: 		 0.0243

Test Result:

accuracy score: 0.8172

Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.90      0.85       159
           1       0.83      0.70      0.76       109

    accuracy                           0.82       268
   macro avg       0.82      0.80      0.80       268
weighted avg       0.82      0.82      0.81       268


Confusion Matrix: 
 [[143  16]
 [ 33  76]]



The accuuracy score is 0.8172, same as with bagging. Let's try using GridSearch

In [48]:
from sklearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV

In [49]:
rf_clf = RandomForestClassifier(random_state=100)
grid_params = {"max_depth": [3, None],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ['gini', 'entropy']}
grid_search = GridSearchCV(rf_clf, grid_params,
                           n_jobs=-1, cv=5,
                           verbose=1, scoring='accuracy')

In [50]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   11.1s finished


GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=100),
             n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [3, None], 'min_samples_leaf': [1, 3, 10],
                         'min_samples_split': [2, 3, 10]},
             scoring='accuracy', verbose=1)

In [51]:
grid_search.best_score_

0.8105935483870967

In [52]:
grid_search.best_estimator_.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 10,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 100,
 'verbose': 0,
 'warm_start': False}

In [53]:
scores(grid_search, X_train, y_train, X_test, y_test, train=True)
scores(grid_search, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8347

Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.94      0.88       390
           1       0.87      0.66      0.75       233

    accuracy                           0.83       623
   macro avg       0.84      0.80      0.81       623
weighted avg       0.84      0.83      0.83       623


Confusion Matrix: 
 [[366  24]
 [ 79 154]]

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   11.5s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   13.3s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   12.8s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   13.7s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   13.7s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   12.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.6s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   14.5s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   15.5s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   14.0s finished


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    7.2s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   13.1s finished


Average Accuracy: 	 0.8026
Accuracy SD: 		 0.0345

Test Result:

accuracy score: 0.8172

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.93      0.86       159
           1       0.87      0.65      0.74       109

    accuracy                           0.82       268
   macro avg       0.83      0.79      0.80       268
weighted avg       0.82      0.82      0.81       268


Confusion Matrix: 
 [[148  11]
 [ 38  71]]



We can clearly see GridSearch also provided the same accuracy score, so the default params of Random Forest works better in this case.

We tried bagging techniques, lets try the Boosting techniques now.

**Boosting:** Boosting refers to a family of algorithms that are able to convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners − models that are only slightly better than random guessing, such as small decision trees − to weighted versions of the data. More weight is given to examples that were misclassified by earlier rounds. The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.

Following are the common types of boosting algorithms:
1. **AdaBoost:** Widely used form of boosting algorithm called AdaBoost, which stands for adaptive boosting.
2. **GradientBoost:** Utilizes the gradient descent to pinpoint the challenges in the learners’ predictions used previously
3. **XGBoost:** Implements decision trees with boosted gradient, enhanced performance, and speed

*Note: The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance.*
![BIAS vs Variance](biasvsvariance.png)


In [54]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier()

ada_clf.fit(X_train, y_train)

scores(ada_clf, X_train, y_train, X_test, y_test, train=True)

scores(ada_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8250

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.88      0.86       390
           1       0.79      0.73      0.76       233

    accuracy                           0.83       623
   macro avg       0.82      0.80      0.81       623
weighted avg       0.82      0.83      0.82       623


Confusion Matrix: 
 [[345  45]
 [ 64 169]]

Average Accuracy: 	 0.8044
Accuracy SD: 		 0.0506

Test Result:

accuracy score: 0.8097

Classification Report: 
               precision    recall  f1-score   support

           0       0.82      0.87      0.84       159
           1       0.80      0.72      0.75       109

    accuracy                           0.81       268
   macro avg       0.81      0.79      0.80       268
weighted avg       0.81      0.81      0.81       268


Confusion Matrix: 
 [[139  20]
 [ 31  78]]



We can observe the accuracy score as 0.8097 which is more or less same as that of DecisionTree Classifier(0.8060). Lets try adaboost over Randomforest.

In [55]:
ada_clf = AdaBoostClassifier(RandomForestClassifier())

ada_clf.fit(X_train, y_train)

scores(ada_clf, X_train, y_train, X_test, y_test, train=True)

scores(ada_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8716

Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.92      0.90       390
           1       0.85      0.79      0.82       233

    accuracy                           0.87       623
   macro avg       0.87      0.86      0.86       623
weighted avg       0.87      0.87      0.87       623


Confusion Matrix: 
 [[358  32]
 [ 48 185]]

Average Accuracy: 	 0.7994
Accuracy SD: 		 0.0217

Test Result:

accuracy score: 0.8134

Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.89      0.85       159
           1       0.82      0.70      0.75       109

    accuracy                           0.81       268
   macro avg       0.81      0.80      0.80       268
weighted avg       0.81      0.81      0.81       268


Confusion Matrix: 
 [[142  17]
 [ 33  76]]



We can observe that adaboost with Randomforest(0.8134)is also more or less same as Randomforest (0.8172)

In [56]:
ada_clf = AdaBoostClassifier(base_estimator=RandomForestClassifier())

ada_clf.fit(X_train, y_train)

scores(ada_clf, X_train, y_train, X_test, y_test, train=True)

scores(ada_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8716

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.94      0.90       390
           1       0.88      0.76      0.82       233

    accuracy                           0.87       623
   macro avg       0.87      0.85      0.86       623
weighted avg       0.87      0.87      0.87       623


Confusion Matrix: 
 [[365  25]
 [ 55 178]]

Average Accuracy: 	 0.8042
Accuracy SD: 		 0.0281

Test Result:

accuracy score: 0.8172

Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.90      0.85       159
           1       0.83      0.70      0.76       109

    accuracy                           0.82       268
   macro avg       0.82      0.80      0.80       268
weighted avg       0.82      0.82      0.81       268


Confusion Matrix: 
 [[143  16]
 [ 33  76]]



Let's try GradientBoost

In [57]:
from sklearn.ensemble import GradientBoostingClassifier

gbc_clf = GradientBoostingClassifier()
gbc_clf.fit(X_train, y_train)

scores(gbc_clf, X_train, y_train, X_test, y_test, train=True)
scores(gbc_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8523

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.95      0.89       390
           1       0.90      0.68      0.78       233

    accuracy                           0.85       623
   macro avg       0.87      0.82      0.83       623
weighted avg       0.86      0.85      0.85       623


Confusion Matrix: 
 [[372  18]
 [ 74 159]]

Average Accuracy: 	 0.8090
Accuracy SD: 		 0.0427

Test Result:

accuracy score: 0.8172

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.93      0.86       159
           1       0.87      0.65      0.74       109

    accuracy                           0.82       268
   macro avg       0.83      0.79      0.80       268
weighted avg       0.82      0.82      0.81       268


Confusion Matrix: 
 [[148  11]
 [ 38  71]]



Gradient Boost also yielded same as Randomforest. So for we saw 0.8172 as the highest accuracy score.

Let's try XGBoost

In [58]:
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(max_depth=5, n_estimators=10000, learning_rate=0.3,
                            n_jobs=-1, use_label_encoder=False)
                            
xgb_clf.fit(X_train, y_train)
scores(xgb_clf, X_train, y_train, X_test, y_test, train=True)
scores(xgb_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8716

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.94      0.90       390
           1       0.88      0.76      0.82       233

    accuracy                           0.87       623
   macro avg       0.87      0.85      0.86       623
weighted avg       0.87      0.87      0.87       623


Confusion Matrix: 
 [[365  25]
 [ 55 178]]

Average Accuracy: 	 0.7994
Accuracy SD: 		 0.0337

Test Result:

accuracy score: 0.8209

Classification Report: 
               precision    recall  f1-score   support

           0       0.81      0.91      0.86       159
           1       0.84      0.70      0.76       109

    accuracy                           0.82       268
   macro avg       0.82      0.80      0.81       268
weighted avg       0.82      0.82      0.82       268


Confusion Matrix: 
 [[144  15]
 [ 33  76]]



XGBoost gave us a marginal improvement taking the score to 0.8209, which is the highest now.

Now lets try stacking techniques to combine multiple models and predict.

**Stacking:** Stacked Generalization or “Stacking” for short is an ensemble machine learning algorithm. It involves combining the predictions from multiple machine learning models on the same dataset, like bagging and boosting. Stacking answers the following question: "Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)?"

The approach to this question is to use another machine learning model that learns when to use or trust each model in the ensemble.
* Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset).
* Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models).

The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.

* **Level-0 Models (Base-Models):** Models fit on the training data and whose predictions are compiled.
* **Level-1 Model (Meta-Model):** Model that learns how to best combine the predictions of the base models.

In [59]:
# Let's Recap the Decession Tree and GradientBoost Classifier score
clf = DecisionTreeClassifier()
gbc_clf = GradientBoostClassifier()
gbc_clf.fit(X_train, y_train)
clf.fit(X_train, y_train)
print("Decision Tree Scores")
scores(clf, X_train, y_train, X_test, y_test, train=False)
print("GradientBoost Scores")
scores(gbc_clf, X_train, y_train, X_test, y_test, train=False)
# Let's try Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Logistic Regression Scores")
scores(gbc_clf, X_train, y_train, X_test, y_test, train=False)

Decision Tree Scores

Test Result:

accuracy score: 0.8060

Classification Report: 
               precision    recall  f1-score   support

           0       0.79      0.91      0.85       159
           1       0.84      0.65      0.73       109

    accuracy                           0.81       268
   macro avg       0.81      0.78      0.79       268
weighted avg       0.81      0.81      0.80       268


Confusion Matrix: 
 [[145  14]
 [ 38  71]]

GradientBoost Scores

Test Result:

accuracy score: 0.8172

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.93      0.86       159
           1       0.87      0.65      0.74       109

    accuracy                           0.82       268
   macro avg       0.83      0.79      0.80       268
weighted avg       0.82      0.82      0.81       268


Confusion Matrix: 
 [[148  11]
 [ 38  71]]

Logistic Regression Scores

Test Result:

accuracy score: 0.8172

Classification Repor

We can clearly observe the following accuracy scores:
* DecissionTree = 0.8060
* GradientBoost = 0.8172
* Logistic = 0.8172

Now lets stack these models and use Voting Technique to predict.

In [60]:
# importing voting classifer
from sklearn.ensemble import VotingClassifier

model1 = DecisionTreeClassifier()
model2 = GradientBoostingClassifier()
model3 = LogisticRegression()

# Making the final model using voting classifier
final_model = VotingClassifier(estimators=[('clf', model1), ('gbc_clf', model2), ('lr', model3)], voting='hard')

# training all the model on the train dataset
final_model.fit(X_train, y_train)

# Print Scores
scores(final_model, X_train, y_train, X_test, y_test, train=True)
scores(final_model, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8555

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.95      0.89       390
           1       0.89      0.70      0.78       233

    accuracy                           0.86       623
   macro avg       0.86      0.83      0.84       623
weighted avg       0.86      0.86      0.85       623


Confusion Matrix: 
 [[369  21]
 [ 69 164]]

Average Accuracy: 	 0.8090
Accuracy SD: 		 0.0434

Test Result:

accuracy score: 0.8209

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.92      0.86       159
           1       0.86      0.67      0.75       109

    accuracy                           0.82       268
   macro avg       0.83      0.80      0.81       268
weighted avg       0.83      0.82      0.82       268


Confusion Matrix: 
 [[147  12]
 [ 36  73]]



You can observe how marginally the accuracy improved. Now we get the accuracy as 0.8209 which is pretty much same as XGBoost but without using it.

Let's use the meta classifier as Logistic Regression and instead of using Voting, we will use the StackingClassifier.

In [61]:
# importing stacking lib
from sklearn.ensemble import StackingClassifier

# putting all base model objects in one list
all_models = [('clf', model1), ('gbc_clf', model2), ('lr', model3)]

# Define Meta Learner Model
meta_model = model3

# Stacking Classifier
final_model = StackingClassifier(estimators=all_models, final_estimator=meta_model, cv=5)

# training all the model on the train dataset
final_model.fit(X_train, y_train)

# Print Scores
scores(final_model, X_train, y_train, X_test, y_test, train=True)
scores(final_model, X_train, y_train, X_test, y_test, train=False)


Train Result:

accuracy score: 0.8507

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.93      0.89       390
           1       0.86      0.72      0.78       233

    accuracy                           0.85       623
   macro avg       0.85      0.82      0.83       623
weighted avg       0.85      0.85      0.85       623


Confusion Matrix: 
 [[363  27]
 [ 66 167]]

Average Accuracy: 	 0.8218
Accuracy SD: 		 0.0389

Test Result:

accuracy score: 0.8134

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.91      0.85       159
           1       0.84      0.67      0.74       109

    accuracy                           0.81       268
   macro avg       0.82      0.79      0.80       268
weighted avg       0.82      0.81      0.81       268


Confusion Matrix: 
 [[145  14]
 [ 36  73]]



The Stacking using a meta model in this case gives more of less same accuracy score as the previous models.

## Conclusion

XGBoost and Stacking gave the best accuracy(0.8209) for this data set. We also learnt various ensemble techniques.

## Reference Links
1. [https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/](https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/)
2. [https://blog.statsbot.co/ensemble-learning-d1dcd548e936](https://blog.statsbot.co/ensemble-learning-d1dcd548e936)
3. [https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f](https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)
4. [https://en.wikipedia.org/wiki/AdaBoost](https://en.wikipedia.org/wiki/AdaBoost)
5. [https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/](https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/)
6. [https://corporatefinanceinstitute.com/resources/knowledge/other/boosting/](https://corporatefinanceinstitute.com/resources/knowledge/other/boosting/)
7. [https://www.kaggle.com/arthurtok/employee-attrition-via-ensemble-tree-based-methods/notebook](https://www.kaggle.com/arthurtok/employee-attrition-via-ensemble-tree-based-methods/notebook)
8. [https://www.geeksforgeeks.org/ensemble-methods-in-python/](https://www.geeksforgeeks.org/ensemble-methods-in-python/)