# DS4023 Machine Learning : Ensemble Learning Exercise

In this exercise, you'll explore different ensemble methods and how ensembling improves the performance of a machine learning model. The exercise has three parts:
1. Simple ensemble strategy: majority voting
2. Bagging Method
3. Boosting Method: AdaBoost

The dataset for this exercise is a cancer dataset with 699 instances and 9 features. Each instance is labeled as either benign or malignant (0 for benign, 1 for malignant). The dataset contains only numeric values and has been normalized.

Many of the methods rely on a random number generator, e.g., the train-test split, the decision tree model, and the bootstrap sample generation in bagging. We therefore fix the seed so that the results are reproducible.
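As a small illustration of why the seed matters, the sketch below splits a toy array twice with the same ``random_state`` and checks that both splits select the same rows. The toy data and variable names are assumptions made for illustration; they are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data, used only for illustration
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

seed = 7

# Two splits with the same seed select exactly the same test rows
_, X_test_a, _, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=seed)
_, X_test_b, _, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=seed)

same_split = np.array_equal(X_test_a, X_test_b)
print("same seed gives the same split:", same_split)
```

The same idea applies to the estimators used later: passing a fixed ``random_state`` to each of them makes repeated runs comparable.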

## Load Dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection
data = pd.read_csv("cancer_normalized.csv")
data.describe()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|----|----|----|----|----|----|----|----|----|----|----|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 0.379749 | 0.237164 | 0.245271 | 0.200763 | 0.246225 | 0.346352 | 0.270863 | 0.207439 | 0.06549 | 0.344778 |
| std | 0.31286 | 0.339051 | 0.330213 | 0.317264 | 0.246033 | 0.364071 | 0.270929 | 0.339293 | 0.190564 | 0.475636 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.111111 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.111111 | 0.0 | 0.0 | 0.0 |
| 50% | 0.333333 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.222222 | 0.0 | 0.0 | 0.0 |
| 75% | 0.555556 | 0.444444 | 0.444444 | 0.333333 | 0.333333 | 0.5 | 0.444444 | 0.333333 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## 1.  Simple Ensemble Strategies

In this section, we will look at a simple ensemble technique for classification: majority voting. In this method, multiple models make a prediction for each data instance. Each model's prediction counts as a **vote**, and the prediction made by the majority of the models is used as the final prediction.

Scikit-Learn provides us with some handy functions that we can use to accomplish this.
- The ``VotingClassifier`` takes a list of different estimators and a voting method as arguments. The ``hard`` voting method uses the predicted labels and a majority-rule system, while the ``soft`` voting method picks the class with the largest sum (equivalently, average) of predicted probabilities across the estimators.
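To make the difference between the two voting methods concrete, here is a minimal sketch that applies both rules by hand to a single instance. The probabilities are made-up numbers chosen so that the two rules disagree; nothing here comes from the cancer dataset.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 from three classifiers
# for one instance (illustrative numbers only).
p_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier votes for its own predicted label,
# and the label with the most votes wins.
labels = (p_class1 >= 0.5).astype(int)               # votes: [0, 0, 1]
hard_prediction = int(np.bincount(labels, minlength=2).argmax())

# Soft voting: average the class probabilities, then take the argmax.
mean_proba = np.array([1 - p_class1.mean(), p_class1.mean()])
soft_prediction = int(mean_proba.argmax())

print("hard voting ->", hard_prediction)  # 0: two of the three votes go to class 0
print("soft voting ->", soft_prediction)  # 1: the mean probability of class 1 is 0.6
```

Soft voting lets one very confident classifier outweigh two marginally uncertain ones, which is why the two rules can give different answers.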

Here, we use three models, *Decision Tree*, *SVM*, and *Logistic Regression*, for voting and adopt 10-fold cross-validation. **Report the mean accuracy of the individual classifiers and of the ensemble obtained with the majority voting strategy (hard voting).** Compare the performance.
- Note: In the ``DecisionTreeClassifier`` implementation, the features are always randomly permuted at each split, so the best split found may vary even with the same training data. Fix ``random_state`` to obtain deterministic behaviour.

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

from sklearn.ensemble import VotingClassifier
import random


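seed       = 7
# Note: this seeds only Python's built-in `random` module; the scikit-learn
# estimators below are made reproducible through their random_state argument.
random.seed(seed)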

# your implementation here
# Load data
y          = np.array(data.iloc[:, -1])
X          = np.array(data.iloc[:, :-1])

# Initialize the k-fold splitter (no shuffling, so the folds are deterministic)
kf         = model_selection.KFold(n_splits = 10)

# divide the data set into 10 folds and do the fitting
score      = {"Decision tree": [], "SVM": [], "Logistic": [], "Ensemble learning": []}
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Individual models
    # Decision Tree
    tree   = DecisionTreeClassifier(random_state = seed)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    score["Decision tree"].append(np.round(accuracy_score(y_test, y_pred) * 100, 3))
    
    # SVM
    svm    = SVC(random_state = seed)
    svm.fit(X_train, y_train)
    y_pred = svm.predict(X_test)
    score["SVM"].append(np.round(accuracy_score(y_test, y_pred) * 100, 3))
    
    # Logistic
    lg     = LogisticRegression(random_state = seed)
    lg.fit(X_train, y_train)
    y_pred = lg.predict(X_test)
    score["Logistic"].append(np.round(accuracy_score(y_test, y_pred) * 100, 3))
    
    # Ensemble learning
    vote   = VotingClassifier(estimators=[
                                            ('Decision Tree', tree), 
                                            ('svm', svm), 
                                            ('Logistic', lg)
                                         ], voting='hard')
    vote.fit(X_train, y_train)
    y_pred = vote.predict(X_test)
    score["Ensemble learning"].append(np.round(accuracy_score(y_test, y_pred) * 100, 3))

print("10 Folds Cross Validation:")
display(pd.DataFrame(score).mean())

10 Folds Cross Validation:


Decision tree        92.2796
SVM                  96.2836
Logistic             96.2856
Ensemble learning    96.1407
dtype: float64

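### Comparison:  
Logistic regression has the highest mean accuracy (96.29%), marginally ahead of the SVM (96.28%). The hard-voting ensemble (96.14%) clearly improves on the single decision tree (92.28%), but it does not beat the two strongest individual classifiers.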

## 2. Bagging Method

In this section, we will explore the bagging method, using a decision tree as the base learning algorithm. Scikit-Learn provides the ``BaggingClassifier`` class, which takes the base learning model and the number of estimators as arguments. Set the number of estimators to 100 and report the mean accuracy of the ensemble using 10-fold cross-validation. Compare the performance with a single decision tree model.
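Bagging trains each estimator on a bootstrap sample, i.e. rows drawn with replacement from the training set. The sketch below draws one such sample with NumPy to show what it looks like; the training-set size is a hypothetical number, and the code only illustrates the sampling step, not how ``BaggingClassifier`` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(7)   # fixed seed for reproducibility
n_samples = 10_000               # hypothetical training-set size

# A bootstrap sample draws n_samples row indices uniformly WITH replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Some rows appear several times, others not at all ("out-of-bag" rows).
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"fraction of distinct rows in the sample: {unique_fraction:.3f}")
# For large n_samples this fraction is close to 1 - 1/e (about 0.632).
```

Because every estimator sees a different resampled view of the data, their errors are partly independent, and averaging them reduces variance.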

In [18]:
from sklearn.ensemble import BaggingClassifier

num_trees = 100

# your implementation here
score      = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
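    # With no base estimator given, BaggingClassifier bags decision trees (its default).
    # max_samples=0.1 trains each tree on a sample of 10% of the training fold.
    clf    = BaggingClassifier(n_estimators=num_trees, max_samples=0.1, 
                             random_state=seed
                          ).fit(X_train, y_train)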

    y_pred = clf.predict(X_test)
    score.append(np.round(accuracy_score(y_test, y_pred) * 100, 3))

print("10 Folds Cross Validation:")
print("The accuracy is: %.2f%%"%np.round(np.mean(score), 3))

10 Folds Cross Validation:
The accuracy is: 96.14%


### Comparison:  
The mean accuracies are as follows:  

| Name | Accuracy |
|----|---------|
|BaggingClassifier | 96.14% |
|Decision Tree | 92.28% |  

The BaggingClassifier improves the mean accuracy by about 3.9 percentage points over the single decision tree.

We reused the 10-fold cross-validation splitter and built a bagging ensemble of 100 decision trees, each trained on a sample of 10% of the training fold. The mean accuracy improved from 92.28% to 96.14%.

Scikit-Learn also provides the ``RandomForestClassifier``, an ensemble of decision trees that extends bagging by also considering only a random subset of the features at each split. Use a random forest model with the number of trees set to 100 and report the mean accuracy using 10-fold cross-validation.

**Compare the performance of RandomForestClassifier with bagged decision tree and give the analysis.**

In [16]:
from sklearn.ensemble import RandomForestClassifier
# your implementation here...
score      = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    forest = RandomForestClassifier(n_estimators = 100, random_state = seed)
    forest.fit(X_train, y_train)
    y_pred = forest.predict(X_test)
    score.append(np.round(accuracy_score(y_test, y_pred) * 100, 3))

print("10 Folds Cross Validation:")
print("The accuracy is: %.2f%%"%np.round(np.mean(score), 3))

10 Folds Cross Validation:
The accuracy is: 96.71%


### Comparison and analysis:

The mean accuracies are as follows:  

| Name | Accuracy |
|----|---------|
|BaggingClassifier | 96.14% |
|Random Forest | 96.71% |  

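The Random Forest is slightly more accurate (96.71% vs. 96.14%). A likely reason is that, in addition to bootstrap sampling, a random forest considers only a random subset of the features at each split, which decorrelates the trees. The bagging run above also trained each tree on only 10% of the training samples (``max_samples=0.1``), whereas the random forest by default draws bootstrap samples as large as the training set. The gap is small, so part of it may simply be fold-to-fold variation.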
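One structural difference between the two ensembles is per-split feature sampling. The sketch below compares plain bagged trees, bagged trees restricted to a random subset of features at each split, and a random forest. It runs on synthetic data from ``make_classification`` (an assumption standing in for the cancer dataset) with a smaller number of trees to keep it fast, so the printed numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=7)

# Bagging over trees that may use ALL features at every split
bagged_full = BaggingClassifier(DecisionTreeClassifier(random_state=7),
                                n_estimators=25, random_state=7)

# Bagging over trees that consider a random sqrt(n_features) subset per split,
# the extra ingredient a random forest adds on top of bagging
bagged_sqrt = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", random_state=7),
    n_estimators=25, random_state=7)

forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                                random_state=7)

results = {}
for name, model in [("bagging, all features", bagged_full),
                    ("bagging, sqrt features", bagged_sqrt),
                    ("random forest", forest)]:
    # Mean accuracy over 5 cross-validation folds
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name:24s} {results[name]:.3f}")
```

The base estimator is passed positionally to ``BaggingClassifier`` so that the sketch does not depend on the keyword name used by a particular scikit-learn version.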

## 3. Boosting Method: AdaBoost

In this section, you will use AdaBoost classification, boosting a ``decision stump`` (**one-level decision tree**). Set the number of rounds to 100 and report the performance of the ensemble. Compare the performance with a single decision tree model.
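To show what one boosting round does, here is a minimal sketch of the classic binary AdaBoost weight update, worked by hand on five made-up samples. The labels and stump predictions are hypothetical. Scikit-Learn's ``AdaBoostClassifier`` implements the SAMME family of algorithms, whose scaling differs, so treat this as a conceptual illustration rather than the library's exact procedure.

```python
import numpy as np

# Hypothetical labels in {-1, +1} and the predictions of one decision stump
y_true = np.array([+1, +1, -1, -1, +1])
y_stump = np.array([+1, +1, -1, -1, -1])   # the last sample is misclassified

# Round 1 starts from uniform sample weights
w = np.full(y_true.size, 1 / y_true.size)

# Weighted error of the stump and the weight of its vote (alpha)
miss = y_stump != y_true
err = w[miss].sum()                        # 0.2
alpha = 0.5 * np.log((1 - err) / err)      # 0.5 * ln(4), about 0.693

# Raise the weights of misclassified samples, lower the others, renormalise
w_new = w * np.exp(-alpha * y_true * y_stump)
w_new /= w_new.sum()

print("alpha:", round(float(alpha), 3))
print("new weights:", np.round(w_new, 3))
```

After the update the single misclassified sample carries half of the total weight, so the next stump is pushed to classify it correctly.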

In [19]:
from sklearn.ensemble import AdaBoostClassifier
# your implementation here...
score      = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
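    # With no base estimator given, AdaBoostClassifier boosts decision stumps
    # (DecisionTreeClassifier(max_depth=1)), as the exercise requires.
    # Note: algorithm='SAMME.R' may be deprecated or unavailable in recent
    # scikit-learn releases; check the documentation of your installed version.
    ada = AdaBoostClassifier(n_estimators = 100, learning_rate = 1.0, 
                         algorithm = 'SAMME.R', random_state = seed)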
    ada.fit(X_train, y_train)
    y_pred = ada.predict(X_test)
    score.append(np.round(accuracy_score(y_test, y_pred) * 100, 3))

print("10 Folds Cross Validation:")
print("The accuracy is: %.2f%%"%np.round(np.mean(score), 3))

10 Folds Cross Validation:
The accuracy is: 95.71%


### Comparison:  
The mean accuracies are as follows:  

| Name | Accuracy |
|----|---------|
|AdaBoostClassifier | 95.71% |
|Decision Tree | 92.28% |  

The AdaBoostClassifier improves the mean accuracy by about 3.4 percentage points over the single decision tree.