# Big Data Processes Exercises - Week 04, <b>*part 2*</b>
# <font color= Pink> Ensemble methods </font>

#### What we will cover today

<ol>
    <li>Importing packages and libraries</li>
    <li>Loading the dataset</li>
    <li>Selecting target features</li>
    <li>Splitting and Scaling</li>
    <li>Bagging</li>
    <ol>
        <li>Random Forest</li>
    </ol>
    <li>Boosting</li>
    <ol>
        <li>AdaBoost</li>
        <li>XGBoost</li>
    </ol>
    <li>Ensemble voting</li>
</ol>

### What are ensemble methods?

Ensemble methods combine several machine learning models to get a better result than a single model would give. The models can either be combined side by side or stacked on top of each other.

In our case, we want to predict attrition in the IBM dataset using ensemble methods. We first train several models on our training data. Then, to make predictions, we run the models on our test data. Because the models differ, they may make different predictions. One model may predict that an employee will leave the company (= attrition), while another says the employee won't leave (= no attrition). How do we decide which prediction to stick with? We simply pick the majority vote. In other words, if two of our three models conclude that the employee will leave, we pick this option.
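
The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration: the `majority_vote` helper and the three example predictions are hypothetical and are not part of the notebook's pipeline.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most models for one employee."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical predictions from three models for the same employee
votes = ["Yes", "No", "Yes"]
print(majority_vote(votes))
```

With an odd number of models and two classes there is always a clear winner, which is one reason three models is a convenient choice.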


### TA Tip:

Think of ensemble methods as the "wisdom of the crowd": the combined opinion of a group of people is often more accurate, useful, or correct than the opinion of any individual in the group.

***
***
***

## 1. Importing various libraries

In [1]:
# Libraries to work with the data object
import pandas as pd 
import numpy as np

# libraries to visualize
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz
from io import StringIO  # standard library; avoids the extra 'six' dependency
from IPython.display import Image

import graphviz
import pydotplus

# sklearn packages for Decision Tree, KNN, RandomForest and Logistic Regression
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

***
***
## 2. Loading the dataset

In this notebook, we will be using our 'good old' IBM-employee-attrition dataset, so let's just load this into our notebook:

In [2]:
df = pd.read_csv("IBM-Employee-Attrition.csv", delimiter=',')

***
***
## 3. Selecting target features

We will select the same features as we've used before for our models:

In [3]:
#Create the feature and target variables
#From list of feature(s) 'X', the model will guess/predict the 'y' feature (our target)
X = df[['EnvironmentSatisfaction', 'JobSatisfaction', 'JobInvolvement', 'YearsAtCompany', 'StockOptionLevel', 'YearsWithCurrManager', 'Age', 'MonthlyIncome', 'YearsInCurrentRole', 'JobLevel', 'TotalWorkingYears']].values

y = df['Attrition'].values

***
***
## 4. Splitting and Scaling



We split the data as usual, such that we have some data to test the accuracy of our models:

In [4]:
# split data into test and train - 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We also scale our data. Further down we use KNN, which classifies by the distance between data points, so without scaling a feature with large values (such as MonthlyIncome) would dominate features with small values (such as JobLevel):

In [5]:
# create a standard scaler object and fit it to the training data
scaler = StandardScaler()
scaler.fit(X_train)

# transform the training and test data using the scaler
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
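
To make the fit/transform split more concrete, here is a minimal sketch of what standardization does for a single feature, written without scikit-learn. The income values are made up for illustration and are not taken from the IBM dataset.

```python
from statistics import fmean, pstdev

# Hypothetical monthly incomes; NOT taken from the IBM dataset
train_income = [2000.0, 4000.0, 6000.0]
test_income = [5000.0]

# "fit": learn the mean and standard deviation from the training data only
mu = fmean(train_income)
sigma = pstdev(train_income)  # population standard deviation, as StandardScaler uses

# "transform": apply the same training statistics to both train and test
train_std = [(x - mu) / sigma for x in train_income]
test_std = [(x - mu) / sigma for x in test_income]

print(train_std)
print(test_std)
```

The test value is scaled with the training mean and standard deviation, so no information from the test set leaks into the preprocessing step.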

***
***

## 5. Bagging

![title](Bagging_Boosting.png)

Bagging and boosting are ensemble learning techniques that we can use to improve the performance of our classification models. In this section, we look at bagging. We will get to boosting in the following section. However, both techniques entail training so-called base models (also called 'weak learners') of the same type using our training data. In the image above, the base models are pictured as small robots: <img src="https://em-content.zobj.net/source/google/387/robot_1f916.png" width="20"/> <img src="https://em-content.zobj.net/source/google/387/robot_1f916.png" width="20"/> <img src="https://em-content.zobj.net/source/google/387/robot_1f916.png" width="20"/>


In bagging, we train several base models of the same type, each on its own random sample of the training data. These samples are typically drawn with replacement (bootstrap samples), so a data point can appear in several samples, or more than once in the same sample. In the illustration above, you can see how subsamples (the purple sheets) are taken from the original training data (the green sheet). The robots represent classification base models. Each robot is trained on its own subsample. Next, we give the robots our test data and ask them to make a prediction. The bagging ensemble model then chooses the prediction made by the majority of the robots/models. For instance, robots A and B might predict that an employee will stay, while robot C predicts that the employee will leave. In this case, the bagging ensemble model predicts that the employee stays.
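
As a rough sketch of the sampling step, the snippet below draws three samples from a tiny stand-in training set. It assumes sampling with replacement (bootstrap sampling), which is how bagging is usually done; the row indices and the number of models are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Stand-in for the row indices of a tiny training set
training_rows = list(range(10))

# Draw three bootstrap samples: same size as the data, WITH replacement
n_models = 3
bootstrap_samples = [
    rng.choices(training_rows, k=len(training_rows)) for _ in range(n_models)
]

for i, sample in enumerate(bootstrap_samples, start=1):
    print(f"model {i} trains on rows: {sorted(sample)}")
```

Because rows are drawn with replacement, some rows show up more than once in a sample and others not at all, so every base model sees a slightly different version of the training data.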

Last week, you learned about two types of classification models: decision trees and KNN.

If we want to use several decision trees in our bagging ensemble model, we can use what is called **random forests**, see section 5.1.

(Optional: read more about this classifier here, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

We can also use another type of model, for instance KNN, see section 5.2.

### 5.1 Random Forest

Random forest is a special case of bagging. Again, several models (here, decision trees) are each trained on a subset of the data.

The fundamental difference between bagging and random forests is that in random forests only a subset of features are selected at random out of the total. Then, the best split feature from the subset is used to split each node in a tree, unlike in bagging where all features are considered for splitting a node. 
-- Stackoverflow


When making a random forest, we start by creating a random forest object, which we choose to call 'RF_classifier'. 

The **n_estimators** parameter defines the number of trees in the forest. In other words, here we state the number of decision trees (= robots) we want to train. If we do not specify anything, the default value is 100. We use a small number here to keep training fast; adding more trees mainly costs more computation and does not in general make the forest overfit.

We also specify the maximum depth of each tree, which is the parameter that limits overfitting here (tip: we explain this and other relevant parameters for decision trees in the notebook on hyperparameter tuning).

In [6]:
# making a random forest object with 5 decision trees, each with a maximum depth of 5
RF_classifier = RandomForestClassifier(n_estimators = 5, max_depth = 5)

Next, we train our model object on the training data that we created in section 4.

In [7]:
RF_classifier.fit(X_train,y_train)

Then, we test our new random forest classification model on our test data, i.e., we ask the model to predict attrition:

In [8]:
y_pred = RF_classifier.predict(X_test)

Finally, we measure the accuracy of the model:

In [9]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.8605442176870748


As you can see, our random forest model was correct approximately 86% of the time (N.B.: because of randomness built into the model, the accuracy may vary a little from run to run).
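
If you want the same result on every run, one option is to fix the `random_state` parameter of the forest. The sketch below illustrates this on synthetic data generated with `make_classification`; the data, the `fit_and_score` helper and the seed values are assumptions for illustration and are not part of the notebook above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; NOT the IBM attrition dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def fit_and_score(seed):
    """Train a small forest with a fixed seed and return its test accuracy."""
    forest = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=seed)
    forest.fit(Xd_train, yd_train)
    return forest.score(Xd_test, yd_test)

# The same seed builds the same forest, hence the same accuracy
print(fit_and_score(seed=1) == fit_and_score(seed=1))
```

Fixing the seed is handy for teaching and debugging; it does not make the model better, it only makes the run repeatable.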

### 5.2 Bagging classifier

Okay, now let's try to make a bagging ensemble model using KNN models instead of decision trees!

Again, we start by making an object of the bagging classifier, telling it to use KNN. We also provide the **n_estimators** parameter. This is the number of estimators/robots we want the bagging classifier to use (in our case, the number of KNN models). Increasing the number of estimators generally improves performance but also increases computational cost, so it's a trade-off.

You can also provide additional parameters, such as max_samples and max_features.

- **max_samples** is the number of samples to draw from X in order to train each base estimator (here, each KNN model)

- **max_features** is the number of features to draw from X in order to train each base estimator (here, each KNN model)


(Optional: Learn more about the bagging classifier in the documentation, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
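
As a hedged illustration of `max_samples` and `max_features`, the sketch below trains a bagged KNN model on synthetic data. The data, the variable names and the chosen fractions (0.8 and 0.5) are assumptions for demonstration only; a float is interpreted as a fraction of the rows or columns, an integer as an absolute count.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 10 features; NOT the IBM attrition dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

bagged_knn = BaggingClassifier(
    KNeighborsClassifier(),
    n_estimators=10,
    max_samples=0.8,   # each KNN model is trained on 80% of the rows
    max_features=0.5,  # each KNN model is trained on 50% of the columns
    random_state=0,
)
bagged_knn.fit(X_demo, y_demo)

print(len(bagged_knn.estimators_))         # number of fitted KNN models
print(bagged_knn.estimators_features_[0])  # column indices used by the first model
```

Giving each base model a different slice of rows and columns makes the models less alike, which is what lets the vote average out their individual mistakes.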

In [10]:
BG_KNN_classifier = BaggingClassifier(KNeighborsClassifier(), n_estimators = 100)

Next, we train our model object on the scaled training data that we created in section 4.

In [11]:
BG_KNN_classifier.fit(X_train_std,y_train)

Finally, we measure the accuracy of the model. Again, the accuracy will vary a bit from run to run due to randomness in the model.

In [12]:
accuracy_BG = round(BG_KNN_classifier.score(X_test_std,y_test),4)
print("The model's accuracy is: \t", accuracy_BG)

The model's accuracy is: 	 0.8673


***
***
## 6. Boosting

![title](Bagging_Boosting.png)

Boosting is another ensemble learning technique that we can use to improve the performance of our classification models. Unlike bagging, boosting trains a series of base models/weak learners/robots *sequentially*, again all of the same type (for instance decision trees). Each new base model tries to correct the errors made by the previous ones, so the model at each iteration depends on how the earlier models performed. Boosting does this by assigning weights to the training data points, typically giving more weight to points that were misclassified - that's why the sheets in the image have different colors. When the boosting model makes a prediction, this prediction is a weighted combination of the predictions made by all the base models, where the weights are determined during the boosting process.

(Optional: read more in the scikit-learn user guide on ensemble methods, https://scikit-learn.org/stable/modules/ensemble.html)
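
To give a feel for the reweighting idea, here is a minimal sketch of one boosting round using the classic AdaBoost update formulas. It is an illustration under stated assumptions: the five data points and the outcome of the first weak learner are made up, and scikit-learn's own implementation may differ in its details.

```python
import math

# Five training points, all starting with the same weight
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
# Hypothetical outcome of the first weak learner: True = classified correctly
correct = [True, True, True, True, False]

# Weighted error of this learner
error = sum(w for w, ok in zip(weights, correct) if not ok)

# How much say this learner gets in the final vote (classic AdaBoost formula)
alpha = 0.5 * math.log((1 - error) / error)

# Increase the weights of misclassified points, decrease the others, renormalise
updated = [
    w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)
]
total = sum(updated)
weights = [w / total for w in updated]

print(round(alpha, 4))
print([round(w, 3) for w in weights])
```

After this round the single misclassified point carries half of the total weight, so the next weak learner is pushed to get that point right.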

Now, we will go through two methods for boosting: AdaBoost and XGBoost.

### 6.1 AdaBoost

AdaBoost stands for Adaptive Boosting and is well suited to binary (two-class) classification such as ours.

The **n_estimators** parameter works as in bagging, but AdaBoost does not take **max_samples** or **max_features**. The parameters we use here are:

- the base estimator (first argument) is the weak learner to boost, here a decision tree with a maximum depth of 2

- **n_estimators** is the maximum number of decision trees (base estimators) trained one after another in the ensemble

- **learning_rate** scales how much each base estimator contributes to the final prediction

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [13]:
# First we create an object from the model's class.
model_BO = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 2), n_estimators = 2, learning_rate=1)

In [14]:
# We fit/train our model on our training data, which are the feature vectors (X_train) and the target vector (y_train)
model_BO.fit(X_train, y_train)

In [15]:
# We make predictions on the new/unseen test dataset
y_pred4 = model_BO.predict(X_test)

In [16]:
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred4))

Accuracy:  0.8571428571428571


In [17]:
# All your burning questions about the confusion matrix will have to wait until week 5:)) 
# We use it here to illustrate the number of correct and incorrect predictions
mtr = confusion_matrix(y_test, y_pred4)

print("Correct predictions:", (mtr[0,0] + mtr[1,1]))
print("Incorrect predictions:", (mtr[0,1] + mtr[1,0]))
print("Total predictions:", (mtr.sum()))

Correct predictions: 252
Incorrect predictions: 42
Total predictions: 294
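
The accuracy printed earlier can be recovered directly from these counts, since accuracy is the number of correct predictions divided by the total number of predictions. A small check, using the numbers from the output above:

```python
# Counts copied from the output above
correct = 252
incorrect = 42
total = correct + incorrect

accuracy = correct / total
print(total)
print(accuracy)
```

This gives roughly 0.857, the same value that `metrics.accuracy_score` reported for the AdaBoost model.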


### 6.2 XGBoost

XGBoost, short for **Extreme Gradient Boosting**, is an effective machine learning model, even on datasets where the class distribution is skewed.

In [19]:
#%pip install xgboost

import xgboost as xgb

from xgboost import XGBClassifier

In [20]:
bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
# fit model
bst.fit(X_train, y_train)
# make predictions
preds = bst.predict(X_test)
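
Depending on the installed XGBoost version, `XGBClassifier.fit` may expect the class labels as the integers 0 and 1 rather than as text. If `Attrition` is stored as 'Yes'/'No' in your copy of the CSV (an assumption; you can check with `df['Attrition'].unique()`), one possible workaround is to encode the labels first. The sketch below uses a small made-up label list in place of `y_train`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels, standing in for y_train / y_test
y_text = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_text)  # classes are sorted: No -> 0, Yes -> 1

print(list(encoder.classes_))
print(list(y_encoded))

# Predictions made on encoded labels can be mapped back to text
print(list(encoder.inverse_transform([1, 0])))
```

If you encode the training labels this way, remember to encode the test labels with the same fitted encoder before computing the accuracy.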

In [21]:
print("Accuracy: ", metrics.accuracy_score(y_test, preds))
# All your burning questions about the confusion matrix will have to wait until week 5:)) 
# We use it here to illustrate the number of correct and incorrect predictions
mtr = confusion_matrix(y_test, preds)

print("Correct predictions:", (mtr[0,0] + mtr[1,1]))
print("Incorrect predictions:", (mtr[0,1] + mtr[1,0]))
print("Total predictions:", (mtr.sum()))

Accuracy:  0.8537414965986394
Correct predictions: 251
Incorrect predictions: 43
Total predictions: 294


In [22]:
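# DMatrix is XGBoost's own data container, used by its native training API (xgb.train).
# The two objects are created here for reference only; they are not used further in this notebook.
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical=True)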

***
***

## 7. Voting

As you now know, bagging and boosting involve training several base models and using them all to make a prediction. However, you can also use multiple base models *of different types* in the same ensemble.

In this section we will look at how you create an ensemble consisting of three different types of classification models:
- Decision tree
- KNN
- Logistic regression

The VotingClassifier aggregates the predictions of the individual models.

(Optional: read more here, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)



### 7.1 Making an ensemble model using the VotingClassifier

Our first step is to make an object of each of the three base models we want to use:

In [23]:
# OBS! Even though it's called Logistic Regression, it is actually a classification technique
model_LR = LogisticRegression(solver='liblinear')
model_KNN = KNeighborsClassifier()
model_DT = DecisionTreeClassifier()

Then, we 'load' the models into a voting classifier:

In [24]:
# You can either do hard voting or soft voting.
# In hard voting, we combine the outputs by returning the mode - the most frequently occurring label among the base classifiers' outputs.
# In soft voting, the base classifiers output class probabilities; these are averaged and the class with the highest average probability is returned.
VC = VotingClassifier(estimators= [("model_LR",model_LR),("model_KNN", model_KNN),("model_DT", model_DT)], voting = 'hard')
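
Hard and soft voting can disagree. The following plain-Python sketch shows one such case for a single employee; the three probabilities are hypothetical and the 0.5 threshold is an assumption made for illustration.

```python
# Hypothetical probabilities that one employee leaves, from three classifiers
p_leave = [0.45, 0.45, 0.90]

# Hard voting: each classifier casts one vote, the majority label wins
votes = ["Yes" if p >= 0.5 else "No" for p in p_leave]
hard = max(set(votes), key=votes.count)

# Soft voting: average the probabilities, then pick the more likely class
mean_p = sum(p_leave) / len(p_leave)
soft = "Yes" if mean_p >= 0.5 else "No"

print(votes, "->", hard)
print(round(mean_p, 2), "->", soft)
```

Two classifiers are mildly against attrition and one is strongly for it, so the majority says "No" while the averaged probability says "Yes". Soft voting only works if every base model can output probabilities.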

Then we train the voting classifier model on our training data:

In [25]:
VC.fit(X_train_std,  y_train)

Finally, we test our model on our test data and assess its accuracy:

In [26]:
y_pred_VC = VC.predict(X_test_std)

In [27]:
accuracy_VC = accuracy_score(y_test, y_pred_VC)
print("The model's accuracy is: \t", accuracy_VC)

The model's accuracy is: 	 0.8843537414965986


### 7.2 Understanding the results of the VotingClassifier

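One problem with voting is that it is not clear which classifier to trust. To see how the individual classifiers compare with the majority-rule ensemble, we score each of them, and the ensemble, with 5-fold cross-validation. Note that the cell below uses a slightly different set of base models (a random forest instead of a single decision tree) and runs the cross-validation on the test split.


You can also use the **.transform**() method of a fitted voting classifier to get the prediction of each base classifier. If we put these next to the final prediction of the voting process and the actual target labels, we get a really nice overview of our classifiers.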

In [28]:
model_LR = LogisticRegression(random_state=1)
model_RFC = RandomForestClassifier(n_estimators=50, random_state=1)
model_KNNC = KNeighborsClassifier()

eclf = VotingClassifier(estimators=[('lr', model_LR), ('rf', model_RFC), ('knn', model_KNNC)],voting='hard')

for clf, label in zip([model_LR, model_RFC, model_KNNC, eclf], ['Logistic Regression', 'Random Forest', 'KNN', 'Ensemble']):
    scores = cross_val_score(clf, X_test_std, y_test, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.87 (+/- 0.02) [Logistic Regression]
Accuracy: 0.86 (+/- 0.02) [Random Forest]
Accuracy: 0.84 (+/- 0.01) [KNN]
Accuracy: 0.87 (+/- 0.01) [Ensemble]


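We can see that Logistic Regression and the ensemble reach the highest mean accuracy (about 0.87), ahead of Random Forest (0.86) and KNN (0.84). The differences are small compared with the spread across the folds.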
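
The `.transform()` method mentioned above can be sketched as follows. The example runs on synthetic data from `make_classification`, and the names `demo_vc`, `individual` and `final` are hypothetical; it is meant as an illustration, not as part of the analysis of the IBM dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; NOT the IBM attrition dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_vc = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=10, random_state=1)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
demo_vc.fit(X_demo, y_demo)

# One row per sample, one column per base classifier (lr, rf, knn)
individual = demo_vc.transform(X_demo[:5])
# The final label is the majority of each row
final = demo_vc.predict(X_demo[:5])

print(individual)
print(final)
```

Reading the rows next to the final prediction shows where the base classifiers disagree and which of them was outvoted.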

***
***
***
# Take home messages

After finishing this notebook, you should know:
- How to use multiple models to perform bagging
- How to use boosting
- How to use multiple models to perform voting