# Ensemble of ensembles - model stacking

* **Ensemble with different types of classifiers**: 
  * Different types of classifiers (e.g., logistic regression, decision trees, random forest) are fitted on the same training data.
  * Their predictions are combined by either 
    * majority voting (classification) or 
    * averaging (regression)
  

* **Ensemble with a single type of classifier**: 
  * Bootstrap samples are drawn from the training data.
  * A model of the same type (e.g., decision trees, random forest) is fitted on each bootstrap sample.
  * The individual results are combined to create the ensemble.
  * Suitable for highly flexible models that are prone to overfitting / high variance.
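
The two ensemble styles above can be sketched with scikit-learn's `VotingClassifier` and `BaggingClassifier`. This is a minimal illustration on synthetic data, not the attrition dataset used later; the variable names (`X_demo`, `voting`, `bagging`, and so on) and the choice of base models are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Different classifier types, same training data, combined by majority voting.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
voting.fit(Xd_train, yd_train)

# One classifier type, each copy fitted on its own bootstrap sample.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
bagging.fit(Xd_train, yd_train)

print("voting accuracy: ", voting.score(Xd_test, yd_test))
print("bagging accuracy:", bagging.score(Xd_test, yd_test))
```

An odd number of voters is used here so that a hard vote on a binary target cannot tie.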

***

## Combining Method

* **Majority voting or average**: 
  * Classification: Largest number of votes (mode) 
  * Regression problems: Average (mean).
  
  
* **Meta-classifier applied to outcomes**: 
  * Each individual classifier outputs a binary prediction (0 / 1).
  * A meta-classifier is fitted on top of these predictions.
  
  
* **Meta-classifier applied to probabilities**: 
  * Each individual classifier outputs a predicted probability.
  * A meta-classifier is fitted on top of these probabilities.
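
As a rough sketch of the combining methods listed above, the snippet below applies each one to made-up outputs from three base classifiers. The arrays `labels`, `probas` and `y_true` are hypothetical values chosen for illustration, and logistic regression is just one possible meta-classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical outputs of three base classifiers for six samples.
labels = np.array([[0, 0, 1],
                   [1, 1, 0],
                   [1, 1, 1],
                   [0, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])        # hard 0/1 predictions, one column per model
probas = np.array([[0.2, 0.4, 0.6],
                   [0.9, 0.7, 0.4],
                   [0.8, 0.9, 0.7],
                   [0.1, 0.2, 0.3],
                   [0.4, 0.6, 0.8],
                   [0.6, 0.3, 0.1]])  # P(class 1), one column per model
y_true = np.array([0, 1, 1, 0, 1, 0])

# Majority voting: a sample is labelled 1 when more than half the models say 1.
majority = (labels.sum(axis=1) > labels.shape[1] / 2).astype(int)

# Averaging: the row mean (the usual choice for regression outputs).
average = probas.mean(axis=1)

# Meta-classifier on outcomes: the 0/1 predictions are the input features.
meta_on_labels = LogisticRegression().fit(labels, y_true)

# Meta-classifier on probabilities: the predicted probabilities are the features.
meta_on_probas = LogisticRegression().fit(probas, y_true)

print(majority)  # [0 1 1 0 1 0]
print(average.round(2))
```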
  

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [105]:
df = pd.read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.pop('EmployeeNumber')
df.pop('Over18')
df.pop('StandardHours')
df.pop('EmployeeCount')
y = df['Attrition']
X = df
X.pop('Attrition')
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)
ind_BusinessTravel = pd.get_dummies(df['BusinessTravel'], prefix='BusinessTravel')
ind_Department = pd.get_dummies(df['Department'], prefix='Department')
ind_EducationField = pd.get_dummies(df['EducationField'], prefix='EducationField')
ind_Gender = pd.get_dummies(df['Gender'], prefix='Gender')
ind_JobRole = pd.get_dummies(df['JobRole'], prefix='JobRole')
ind_MaritalStatus = pd.get_dummies(df['MaritalStatus'], prefix='MaritalStatus')
ind_OverTime = pd.get_dummies(df['OverTime'], prefix='OverTime')
# Join the indicator columns and the integer columns side by side (axis=1)
df1 = pd.concat([ind_BusinessTravel, ind_Department, ind_EducationField, ind_Gender, 
                 ind_JobRole, ind_MaritalStatus, ind_OverTime, df.select_dtypes(['int64'])], axis=1)
df1.dropna(inplace=True)
df1.shape

(1470, 51)
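
The cell above builds one indicator frame per categorical column and concatenates them. A more compact variant, sketched here on a toy frame with made-up values, passes the column list to `pd.get_dummies` in one call. Only the column names mirror the dataset; `toy` and `encoded` are hypothetical names. Unlike the cell above, this keeps every column that is not listed, whatever its dtype.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [41, 49, 37],
})

# Encode the listed categorical columns in one call; other columns pass through.
encoded = pd.get_dummies(toy, columns=["OverTime", "Gender"])
print(encoded.columns.tolist())
```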

In [106]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y)

In [107]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [108]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    """Print the accuracy score, classification report and confusion matrix of a classifier."""
    if train:
        # Training performance
        y_pred = clf.predict(X_train)
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))

        # 10-fold cross-validated accuracy on the training data
        res = cross_val_score(clf, X_train, y_train.ravel(), cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))

    else:
        # Test performance
        y_pred = clf.predict(X_test)
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
        

## Model 1: Decision Tree

In [109]:
from sklearn.tree import DecisionTreeClassifier

In [110]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train.ravel())

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [111]:
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 1.0000

Classification Report: 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00       922
          1       1.00      1.00      1.00       180

avg / total       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[922   0]
 [  0 180]]

Average Accuracy: 	 0.7586
Accuracy SD: 		 0.0472


In [112]:
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7989

Classification Report: 
              precision    recall  f1-score   support

          0       0.89      0.87      0.88       311
          1       0.37      0.40      0.38        57

avg / total       0.81      0.80      0.80       368


Confusion Matrix: 
 [[271  40]
 [ 34  23]]



## Model 2: Random Forest

In [113]:
from sklearn.ensemble import RandomForestClassifier

In [114]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train.ravel())

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [115]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9837

Classification Report: 
              precision    recall  f1-score   support

          0       0.98      1.00      0.99       922
          1       0.99      0.91      0.95       180

avg / total       0.98      0.98      0.98      1102


Confusion Matrix: 
 [[921   1]
 [ 17 163]]

Average Accuracy: 	 0.8476
Accuracy SD: 		 0.0196


In [116]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8614

Classification Report: 
              precision    recall  f1-score   support

          0       0.87      0.99      0.92       311
          1       0.75      0.16      0.26        57

avg / total       0.85      0.86      0.82       368


Confusion Matrix: 
 [[308   3]
 [ 48   9]]



## Model 3: AdaBoost (Random Forest as base estimator)

In [117]:
from sklearn.ensemble import AdaBoostClassifier

In [145]:
forest = RandomForestClassifier()
ab_clf = AdaBoostClassifier(base_estimator=forest, n_estimators=100,
                         learning_rate=0.5, random_state=42)

In [146]:
ab_clf.fit(X_train, y_train.ravel())

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=0.5, n_estimators=100, random_state=42)

In [147]:
print_score(ab_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 1.0000

Classification Report: 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00       922
          1       1.00      1.00      1.00       180

avg / total       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[922   0]
 [  0 180]]

Average Accuracy: 	 0.8585
Accuracy SD: 		 0.0146


In [148]:
print_score(ab_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8560

Classification Report: 
              precision    recall  f1-score   support

          0       0.86      0.99      0.92       311
          1       0.70      0.12      0.21        57

avg / total       0.84      0.86      0.81       368


Confusion Matrix: 
 [[308   3]
 [ 50   7]]



### Build a new matrix with each model's predicted probability as a column and the y values as the last column

`predict_proba` returns an array of shape `[n_samples, n_classes]` holding the class probabilities of the input samples. The order of the columns corresponds to the order of the classes in the attribute `classes_`.

We want only the column for class 1, i.e. column index `[1]`.
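
A small sketch of that column selection, on a tiny made-up dataset: `X_toy`, `y_toy` and `clf` are hypothetical names, and looking the column up through `classes_` is a defensive habit rather than something the notebook itself does.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical dataset: one feature, binary target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)        # shape (n_samples, n_classes)
pos_col = list(clf.classes_).index(1)   # column that holds P(y = 1)
p_positive = proba[:, pos_col]

print(clf.classes_)   # [0 1]
print(p_positive)
```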

In [122]:
en_en = pd.DataFrame()

In [123]:
rf_clf.predict_proba(X_train)

array([[ 1. ,  0. ],
       [ 1. ,  0. ],
       [ 1. ,  0. ],
       ..., 
       [ 0.4,  0.6],
       [ 1. ,  0. ],
       [ 0.9,  0.1]])

In [124]:
tree_clf.predict_proba(X_train)

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])

In [125]:
ab_clf.predict_proba(X_train)

array([[ 0.51206476,  0.48793524],
       [ 0.51921006,  0.48078994],
       [ 0.51268083,  0.48731917],
       ..., 
       [ 0.4991434 ,  0.5008566 ],
       [ 0.51663155,  0.48336845],
       [ 0.50084399,  0.49915601]])

In [126]:
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_train))[1]
en_en['ab_clf'] =  pd.DataFrame(ab_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)

In [127]:
pd.DataFrame(y_train).head()

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0


In [128]:
en_en.head()

Unnamed: 0,tree_clf,rf_clf,ab_clf,0
0,0.0,0.0,0.487935,0
1,0.0,0.0,0.48079,0
2,0.0,0.0,0.487319,0
3,0.0,0.0,0.488915,0
4,0.0,0.2,0.492957,0


In [129]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

In [130]:
en_en.head()

Unnamed: 0,tree_clf,rf_clf,ab_clf,ind
0,0.0,0.0,0.487935,0
1,0.0,0.0,0.48079,0
2,0.0,0.0,0.487319,0
3,0.0,0.0,0.488915,0
4,0.0,0.2,0.492957,0


# Meta Classifier

The meta-classifier is fitted on the base models' training-set probabilities and then evaluated on `X_test`.

In [131]:
from sklearn.linear_model import LogisticRegression

In [132]:
m_clf = LogisticRegression(fit_intercept=False)

In [133]:
m_clf.fit(en_en[['tree_clf', 'rf_clf', 'ab_clf']], en_en['ind'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [134]:
en_test = pd.DataFrame()

In [135]:
# Probability of class 1 from each base model on the test set
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_test))[1]
en_test['ab_clf'] =  pd.DataFrame(ab_clf.predict_proba(X_test))[1]
# Meta-classifier prediction based on the three probability columns
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf', 'ab_clf']])

In [136]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [137]:
tmp

['tree_clf', 'rf_clf', 'ab_clf', 'combined', 'ind']

In [138]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [139]:
en_test.columns = tmp

In [140]:
en_test

Unnamed: 0,tree_clf,rf_clf,ab_clf,combined,ind
0,1.0,0.1,0.489807,1,0
1,0.0,0.1,0.496837,0,0
2,0.0,0.1,0.487061,0,0
3,0.0,0.1,0.477193,0,0
4,0.0,0.2,0.497060,0,1
5,0.0,0.0,0.474707,0,0
6,0.0,0.0,0.479865,0,0
7,0.0,0.1,0.481657,0,0
8,0.0,0.2,0.483909,0,0
9,0.0,0.1,0.476543,0,0


In [142]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined    0   1
ind              
0         271  40
1          34  23


In [143]:
print(round(accuracy_score(en_test['ind'], en_test['combined']), 4))

0.7989


In [144]:
print(classification_report(en_test['ind'], en_test['combined']))

             precision    recall  f1-score   support

          0       0.89      0.87      0.88       311
          1       0.37      0.40      0.38        57

avg / total       0.81      0.80      0.80       368
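
The crosstab and accuracy above are identical to the decision tree's own test result (0.7989, with the same confusion matrix). That is what one would expect when the meta-classifier is fitted on in-sample probabilities: the tree reproduces the training labels perfectly, so its column alone predicts `ind`. A common alternative, sketched below on synthetic data, is to fit the meta-classifier on out-of-fold probabilities from `cross_val_predict`. All names here (`X_syn`, `base_models`, `oof_train`, and so on) are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the attrition dataset).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=12, weights=[0.8, 0.2], random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

base_models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold P(y = 1): each training row is scored by a model fitted without it.
oof_train = np.column_stack([
    cross_val_predict(model, Xs_train, ys_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# The meta-classifier only ever sees out-of-fold probabilities.
meta = LogisticRegression().fit(oof_train, ys_train)

# For the test set, refit each base model on all training data, then score.
test_features = np.column_stack([
    model.fit(Xs_train, ys_train).predict_proba(Xs_test)[:, 1]
    for model in base_models.values()
])
stacked_pred = meta.predict(test_features)

print("stacked test accuracy:", accuracy_score(ys_test, stacked_pred))
```

scikit-learn also ships `sklearn.ensemble.StackingClassifier`, which wraps this cross-validated scheme in a single estimator.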



***