# Ensembles of Ensembles - Model Stacking

* **Ensemble with different types of classifiers**: 
  * Different types of classifiers (e.g., logistic regression, decision tree, random forest) are fitted on the same training data.
  * Their predictions are combined by either 
    * majority voting (classification) or 
    * averaging (regression)
  

* **Ensemble with a single type of classifier**: 
  * Bootstrap samples are drawn from the training data. 
  * A model of the same type (e.g., a decision tree or a random forest) is fitted on each bootstrap sample. 
  * The predictions of all fitted models are combined to form the ensemble. 
  * Suitable for highly flexible models that are prone to overfitting (high variance). 

***

## Combining Method

* **Majority voting or average**: 
  * Classification: Largest number of votes (mode) 
  * Regression problems: Average (mean).
  
  
* **Meta-classifier on predicted outcomes**: 
  * Each individual classifier outputs a binary outcome (0 / 1).
  * A meta-classifier is fitted on these outcomes to produce the final prediction. 
  
  
* **Meta-classifier on predicted probabilities**: 
  * Each individual classifier outputs a predicted probability. 
  * A meta-classifier is fitted on these probabilities to produce the final prediction.
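
The first combining method (mode for classification, mean for regression) can be written in a few lines of NumPy. The arrays below are made-up values for illustration only, and the names `votes`, `reg_preds`, `majority` and `averaged` are hypothetical.

```python
import numpy as np

# Hypothetical 0/1 predictions: three classifiers (rows) on five samples (columns).
votes = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 1],
])

# Majority vote: with binary labels, the mode is 1 exactly when
# more than half of the classifiers predict 1.
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(majority)  # [1 0 1 0 1]

# Hypothetical regression predictions: three regressors (rows) on two samples (columns).
reg_preds = np.array([
    [2.0, 4.0],
    [3.0, 5.0],
    [4.0, 6.0],
])

# Averaging: the ensemble prediction is the mean over the individual models.
averaged = reg_preds.mean(axis=0)
print(averaged)  # [3. 5.]
```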
  

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
df = pd.read_csv("IBM_HR_project/data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Drop the columns that are not used as features
df = df.drop(columns=['EmployeeNumber', 'Over18', 'StandardHours', 'EmployeeCount'])

# Separate the target from the features
y = df.pop('Attrition')
X = df

# Encode the target as 0 / 1 (No -> 0, Yes -> 1)
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)

# One-hot encode the categorical features
ind_BusinessTravel = pd.get_dummies(df['BusinessTravel'], prefix='BusinessTravel')
ind_Department = pd.get_dummies(df['Department'], prefix='Department')
ind_EducationField = pd.get_dummies(df['EducationField'], prefix='EducationField')
ind_Gender = pd.get_dummies(df['Gender'], prefix='Gender')
ind_JobRole = pd.get_dummies(df['JobRole'], prefix='JobRole')
ind_MaritalStatus = pd.get_dummies(df['MaritalStatus'], prefix='MaritalStatus')
ind_OverTime = pd.get_dummies(df['OverTime'], prefix='OverTime')

# Join the indicator columns with the integer features, column-wise
df1 = pd.concat([ind_BusinessTravel, ind_Department, ind_EducationField, ind_Gender, 
                 ind_JobRole, ind_MaritalStatus, ind_OverTime, df.select_dtypes(['int64'])], axis=1)
df1.dropna(inplace=True)
df1.shape

(1470, 51)

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y)
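
The split above relies on the defaults of `train_test_split`: 25% of the rows go to the test set and the shuffle changes on every run. With an imbalanced target, a stratified split with a fixed seed keeps the class proportions equal in both parts and makes the results repeatable. The sketch below uses made-up labels (84% zeros, 16% ones), and the names `X_demo`, `y_demo`, `Xd_*` and `yd_*` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels and a dummy feature column, for illustration only.
y_demo = np.array([0] * 840 + [1] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify keeps the share of each class the same in train and test;
# random_state makes the split reproducible.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("share of class 1 in train:", yd_train.mean())
print("share of class 1 in test: ", yd_test.mean())
```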

In [7]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [8]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    Print the accuracy score, classification report and confusion matrix of a classifier.
    With train=True, also report the 10-fold cross-validated accuracy on the training data.
    '''
    if train:
        # training performance
        y_pred = clf.predict(X_train)
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))

        # cross_val_score refits a clone of clf on each of the 10 folds
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))

    else:
        # test performance
        y_pred = clf.predict(X_test)
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))
        

# Model 1: Decision Tree

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

In [12]:
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       931
           1       1.00      1.00      1.00       171

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[931   0]
 [  0 171]]

Average Accuracy: 	 0.7713
Accuracy SD: 		 0.0325


In [13]:
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.7880

Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.89      0.87       302
           1       0.39      0.32      0.35        66

    accuracy                           0.79       368
   macro avg       0.62      0.60      0.61       368
weighted avg       0.77      0.79      0.78       368


Confusion Matrix: 
 [[269  33]
 [ 45  21]]



# Model 2: Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
rf_clf = RandomForestClassifier()
y_train = y_train.ravel()
rf_clf.fit(X_train, y_train)

In [18]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       931
           1       1.00      1.00      1.00       171

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[931   0]
 [  0 171]]

Average Accuracy: 	 0.8612
Accuracy SD: 		 0.0156


In [19]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8370

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.99      0.91       302
           1       0.80      0.12      0.21        66

    accuracy                           0.84       368
   macro avg       0.82      0.56      0.56       368
weighted avg       0.83      0.84      0.78       368


Confusion Matrix: 
 [[300   2]
 [ 58   8]]



In [63]:
en_en = pd.DataFrame()

In [64]:
tree_clf.predict_proba(X_train)

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])

In [65]:
# # What is predict_proba?
# tree_clf.predict_proba?
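
On the question in the cell above: `predict_proba` returns an array of shape `(n_samples, n_classes)` whose columns follow the order of `clf.classes_`. For a decision tree, each entry is the fraction of training samples of that class in the leaf the sample falls into. A fully grown tree tends to have pure leaves, so on its own training data it returns only 0 and 1, as in the output above. The sketch below illustrates this on synthetic data; the data and the names `X_demo`, `y_demo`, `full_tree` and `shallow_tree` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption; it stands in for the training set).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Fully grown tree: every leaf is pure, so training probabilities are 0 or 1.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
proba_full = full_tree.predict_proba(X_demo)

# Depth-limited tree: leaves mix both classes, so probabilities can be fractional.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
proba_shallow = shallow_tree.predict_proba(X_demo)

print(full_tree.classes_)    # column order of predict_proba
print(np.unique(proba_full)) # distinct probabilities of the fully grown tree
print(proba_shallow[:3])     # first rows of the depth-limited tree
```

Column `[1]`, used in the next cell, is therefore the predicted probability of the class labelled 1.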

In [66]:
en_en["tree_clf"] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en["rf_clf"] = pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)

In [67]:
en_en.head()

   tree_clf  rf_clf  0
0       1.0    0.71  1
1       0.0    0.02  0
2       0.0    0.01  0
3       0.0    0.03  0
4       0.0    0.02  0


In [68]:
tmp = list(col_name)
tmp.append("ind")
en_en.columns = tmp

In [69]:
en_en.head()

   tree_clf  rf_clf  ind
0       1.0    0.71    1
1       0.0    0.02    0
2       0.0    0.01    0
3       0.0    0.03    0
4       0.0    0.02    0


# Meta Classifier

In [70]:
from sklearn.linear_model import LogisticRegression

In [71]:
m_clf = LogisticRegression(fit_intercept=False)

In [76]:
m_clf.fit(en_en[["tree_clf", "rf_clf"]], en_en["ind"])

In [85]:
en_test = pd.DataFrame()

In [86]:
# Base-model probabilities of class 1 on the test set are the meta-features
en_test["tree_clf"] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test["rf_clf"] = pd.DataFrame(rf_clf.predict_proba(X_test))[1]
# Final prediction of the meta-classifier
en_test["combined"] = m_clf.predict(en_test[["tree_clf", "rf_clf"]])

In [87]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [92]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [91]:
en_test.head()

   tree_clf  rf_clf  combined
0       0.0    0.04         0
1       0.0    0.01         0
2       0.0    0.11         0
3       1.0    0.26         1
4       0.0    0.31         0


In [93]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [94]:
en_test.columns = tmp

In [95]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined    0   1
ind              
0         269  33
1          45  21


In [96]:
print(round(accuracy_score(en_test["ind"], en_test["combined"]), 4))

0.788


In [97]:
print(classification_report(en_test['ind'], en_test['combined']))

              precision    recall  f1-score   support

           0       0.86      0.89      0.87       302
           1       0.39      0.32      0.35        66

    accuracy                           0.79       368
   macro avg       0.62      0.60      0.61       368
weighted avg       0.77      0.79      0.78       368
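
The stacked result above has the same confusion matrix as the decision tree on its own (269 / 33 / 45 / 21). A likely reason is that the meta-classifier was fitted on probabilities the base models produced for their own training rows, where the fully grown tree already reproduces the labels exactly, so the meta-classifier learns to follow the tree. A common remedy is to fit the meta-classifier on out-of-fold probabilities from `cross_val_predict`. The sketch below shows the idea on synthetic data; the data, the fold count, and the names `Xs_*`, `ys_*`, `base_models`, `meta_train`, `meta_test` and `meta_clf` are assumptions, not the notebook's own objects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (an assumption; roughly 84% / 16% like the attrition target).
Xs, ys = make_classification(n_samples=600, n_features=10,
                             weights=[0.84, 0.16], random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, stratify=ys, random_state=0
)

base_models = {
    "tree_clf": DecisionTreeClassifier(random_state=0),
    "rf_clf": RandomForestClassifier(n_estimators=100, random_state=0),
}

oof_cols, test_cols = [], []
for name, model in base_models.items():
    # Out-of-fold probabilities: each training row is scored by a model
    # that did not see that row during fitting.
    oof = cross_val_predict(model, Xs_train, ys_train, cv=5,
                            method="predict_proba")[:, 1]
    oof_cols.append(oof)
    # Refit on the full training set to score the test rows.
    model.fit(Xs_train, ys_train)
    test_cols.append(model.predict_proba(Xs_test)[:, 1])

meta_train = np.column_stack(oof_cols)
meta_test = np.column_stack(test_cols)

# Meta-classifier fitted on the out-of-fold probabilities.
meta_clf = LogisticRegression()
meta_clf.fit(meta_train, ys_train)
stacked_pred = meta_clf.predict(meta_test)
print("stacked test accuracy:", round((stacked_pred == ys_test).mean(), 4))
```

scikit-learn's `StackingClassifier` wraps the same procedure in a single estimator, which may be more convenient when the stacked model itself needs to be cross-validated.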



# Single Classifier

Ensemble with a single type of classifier

In [100]:
df = pd.read_csv("IBM_HR_project/data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [101]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [102]:
df.Attrition.value_counts() / df.Attrition.count()

Attrition
No     0.838776
Yes    0.161224
Name: count, dtype: float64

In [103]:
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier

In [104]:
class_weight = {0:0.838776, 1:0.161224}

In [106]:
pd.Series(list(y_train)).value_counts() / pd.Series(list(y_train)).count()

0    0.844828
1    0.155172
Name: count, dtype: float64

In [107]:
rf_clf = RandomForestClassifier(class_weight=class_weight)
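
The `class_weight` dictionary above gives the larger weight to the majority class (0). If the aim is to counteract the class imbalance, weights are more commonly set inversely proportional to the class frequencies, which is what scikit-learn's `"balanced"` option computes. The sketch below shows those weights for made-up labels with an 84% / 16% split; `y_demo` and `balanced_weight` are hypothetical names, and whether reweighting helps on this dataset would need to be checked.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with roughly the same imbalance as the attrition target.
y_demo = np.array([0] * 84 + [1] * 16)
classes = np.unique(y_demo)

# "balanced" weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
balanced_weight = dict(zip(classes.tolist(), weights.tolist()))
print(balanced_weight)
```

The same effect is available directly through `RandomForestClassifier(class_weight="balanced")`.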

In [108]:
ada_clf = AdaBoostClassifier(estimator=rf_clf, n_estimators=100,
                             learning_rate = 0.5, random_state=42)

In [109]:
ada_clf.fit(X_train, y_train.ravel())



In [111]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       931
           1       1.00      1.00      1.00       171

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[931   0]
 [  0 171]]

Average Accuracy: 	 0.8639
Accuracy SD: 		 0.0221


In [112]:
print_score(ada_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8397

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.99      0.91       302
           1       0.82      0.14      0.23        66

    accuracy                           0.84       368
   macro avg       0.83      0.56      0.57       368
weighted avg       0.84      0.84      0.79       368


Confusion Matrix: 
 [[300   2]
 [ 57   9]]



In [113]:
bag_clf = BaggingClassifier(estimator=ada_clf, n_estimators=50,
                            max_features=1.0, max_samples=1.0,
                            bootstrap=True, bootstrap_features=False,
                            n_jobs=-1, random_state=42)

In [114]:
bag_clf.fit(X_train, y_train.ravel())

In [115]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)

Train Result:

accuracy score: 0.9673

Classification Report: 
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       931
           1       1.00      0.79      0.88       171

    accuracy                           0.97      1102
   macro avg       0.98      0.89      0.93      1102
weighted avg       0.97      0.97      0.97      1102


Confusion Matrix: 
 [[931   0]
 [ 36 135]]

Average Accuracy: 	 0.8612
Accuracy SD: 		 0.0176


In [116]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

Test Result:

accuracy score: 0.8397

Classification Report: 
               precision    recall  f1-score   support

           0       0.84      0.99      0.91       302
           1       0.82      0.14      0.23        66

    accuracy                           0.84       368
   macro avg       0.83      0.56      0.57       368
weighted avg       0.84      0.84      0.79       368


Confusion Matrix: 
 [[300   2]
 [ 57   9]]



***