## 5.2.7 Ensemble Learning Implementation

Ensemble modeling is a process in which multiple diverse base models are built, using different modeling algorithms or different training data sets, to predict the same outcome. The ensemble model then aggregates the base models' predictions into a single final prediction for unseen data.
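
As a minimal sketch of the aggregation step, the snippet below combines class labels from three base models by majority vote, which is one common aggregation rule for classifiers. The function name `majority_vote` and the prediction lists are hypothetical and used only for illustration; they are not taken from the churn data set.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class labels from several base models by majority vote.

    predictions_per_model: list of equal-length label lists, one per base model.
    Returns one aggregated label per sample.
    """
    aggregated = []
    # zip(*...) groups the votes that all models cast for the same sample
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count
        label, _count = Counter(sample_votes).most_common(1)[0]
        aggregated.append(label)
    return aggregated

# Hypothetical predictions from three base models for four customers (1 = churn)
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]

final_prediction = majority_vote([model_a, model_b, model_c])
print(final_prediction)  # [1, 1, 1, 0]
```

With an odd number of base models and two classes, a tie cannot occur, which is one reason small voting ensembles often use three or five members.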

A brief overview of the different ensemble methods is given in Section 5.2.6. In this section, we apply several of these methods to the customer churn classification problem, using Python for the implementation.

For the full list of ensemble methods available in scikit-learn and their APIs, students can refer to https://scikit-learn.org/stable/modules/ensemble.html?highlight=xgboost#


The following Python code implements Bagging-based, Boosting-based, and Voting-based ensemble models. The embedded comments explain the rationale behind each step.

In [5]:
#import necessary libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

#Loading Dataset and specify inputs/output
data = pd.read_csv('data/ChurnFinal.csv')
X = pd.get_dummies(data[['Gender', 'Age', 'PostalCode', 'Cash', 'CreditCard', 
        'Cheque', 'SinceLastTrx', 'SqrtTotal', 'SqrtMax', 'SqrtMin']])
Y = data['Churn']

#Splitting dataset into training and testing dataset
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, stratify=Y, 
        test_size=0.2, random_state=1)

# feature scaling
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)                     # learn mean and std from the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # apply the training statistics to the test data

# Bagging-based Ensemble using Decision Trees for classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier()
num_trees = 100

# create the sub models using multiple Decision Trees, e.g. 100 trees
es_Bag = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=7)

# Boosting-based Ensemble using AdaBoost Classification with Decision Trees
from sklearn.ensemble import AdaBoostClassifier
num_trees = 100
# create the sub models, e.g. 100 trees; AdaBoost's default base learner is a depth-1 decision tree (stump)
es_Boost = AdaBoostClassifier(n_estimators=num_trees, random_state=7)

# Voting-based Ensemble for Classification using various methods
# (DecisionTreeClassifier is already imported above)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# create the sub models: logistic regression, decision tree, and support vector classifier
estimators = []
model1 = LogisticRegression(random_state=7)  # Logistic Regression Classifier
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier(random_state=7) # Decision Tree Classifier
estimators.append(('cart', model2))
model3 = SVC(random_state=7) #Support Vector Classifier
estimators.append(('svc', model3))

# create the ensemble model (default voting='hard', i.e. majority vote on predicted labels)
es_Vote = VotingClassifier(estimators)

# Model performance
from sklearn import metrics 
# use for-loop to print the performance of each ensemble model
for model in (es_Bag, es_Boost, es_Vote):
    model.fit(X_train, Y_train)
    y_predict = model.predict(X_test)
    print(model.__class__.__name__, ' Ensemble accuracy: ',
          metrics.accuracy_score(Y_test, y_predict))

BaggingClassifier  Ensemble accuracy:  0.74
AdaBoostClassifier  Ensemble accuracy:  0.735
VotingClassifier  Ensemble accuracy:  0.73


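These results indicate that, with the parameter settings defined in the code, the Bagging-based ensemble performs slightly better on the customer churn data set, with an accuracy of 0.74 compared with 0.735 for the Boosting-based ensemble and 0.73 for the Voting-based ensemble. The differences are small and come from a single train/test split, so they should not be read as a definitive ranking of the three methods.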
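
Because the accuracies above come from one train/test split, a cross-validated comparison can help judge whether such small gaps are stable. The sketch below is one possible way to do this, under several assumptions: synthetic data from `make_classification` stands in for the churn data, 25 estimators are used instead of 100 to keep it quick, and the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are illustrative. Its scores say nothing about the churn results.

```python
# Sketch: compare the three ensemble types with 5-fold cross-validation.
# ASSUMPTION: synthetic data replaces ChurnFinal.csv, so scores are not churn results.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)

candidates = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=7),
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=7),
    "voting": VotingClassifier([
        ("logistic", LogisticRegression(random_state=7)),
        ("cart", DecisionTreeClassifier(random_state=7)),
        ("svc", SVC(random_state=7)),
    ]),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = {}
for name, model in candidates.items():
    # The scaler sits inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

To apply this to the churn data, `X_demo` and `y_demo` would be replaced with the unscaled `X` and `Y` from the code above. If the spread of fold scores is larger than the gap between the mean accuracies, the ranking of the methods is not well supported.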