# DS4023 Machine Learning: Ensemble Learning Exercise

In this exercise, you'll explore different ensemble methods and how ensembling improves the performance of a machine learning model. The exercise has three parts:
1. Simple ensemble strategy: majority voting
2. Bagging Method
3. Boosting Method: AdaBoost

The dataset for this exercise is a cancer dataset with 699 instances and 9 features. Each instance is labeled as either benign (0) or malignant (1). The dataset contains only numeric values and has been normalized.

Many of the methods rely on a random number generator (for example the train-test split, the decision tree model, and bootstrap sample generation in bagging), so we set the seed to a fixed number to get the same results on every run.
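As a minimal sketch of why a fixed seed matters, the snippet below splits a toy array twice with the same ``random_state`` and checks that both splits pick the same test rows. The data and the names ``X_demo``, ``y_demo`` and ``seed_demo`` are made up for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical values).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

seed_demo = 7  # any fixed integer works

# Two splits with the same seed select exactly the same test rows.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=seed_demo)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=seed_demo)

same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)
```

Leaving ``random_state`` unset would generally give a different split on each call, which makes scores hard to compare between runs.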

## Load Dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection
data = pd.read_csv("cancer_normalized.csv")
data.describe()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 0.379749 | 0.237164 | 0.245271 | 0.200763 | 0.246225 | 0.346352 | 0.270863 | 0.207439 | 0.06549 | 0.344778 |
| std | 0.31286 | 0.339051 | 0.330213 | 0.317264 | 0.246033 | 0.364071 | 0.270929 | 0.339293 | 0.190564 | 0.475636 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.111111 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.111111 | 0.0 | 0.0 | 0.0 |
| 50% | 0.333333 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.222222 | 0.0 | 0.0 | 0.0 |
| 75% | 0.555556 | 0.444444 | 0.444444 | 0.333333 | 0.333333 | 0.5 | 0.444444 | 0.333333 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## 1.  Simple Ensemble Strategies

In this section, we will look at a simple ensemble technique for classification: majority voting. In this method, multiple models make a prediction for each data instance, and each model's prediction counts as a **vote**. The prediction made by the majority of the models is used as the final prediction.

Scikit-Learn provides us with some handy functions that we can use to accomplish this.
- The ``VotingClassifier`` takes a list of estimators and a voting method as arguments. The ``hard`` voting method uses the predicted labels and a majority-rules system, while the ``soft`` voting method predicts the label with the largest sum of predicted probabilities.
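The difference between the two voting methods can be worked through by hand. The sketch below uses made-up probabilities from three hypothetical classifiers for a single instance; it illustrates the idea only and is not how ``VotingClassifier`` is implemented internally.

```python
import numpy as np

# Hypothetical P(class = 1) from three classifiers for a single instance.
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its predicted label.
votes = (proba_class1 >= 0.5).astype(int)  # labels 0, 0, 1
hard_label = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the probabilities, then pick the likelier class.
mean_proba = np.array([1 - proba_class1.mean(), proba_class1.mean()])
soft_label = int(mean_proba.argmax())

print(hard_label, soft_label)  # prints: 0 1
```

Two models lean weakly towards class 0 and one leans strongly towards class 1, so hard voting returns 0 while soft voting returns 1.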

Here, we use three models, *Decision Tree*, *SVM* and *Logistic Regression*, for voting and evaluate them with 10-fold cross-validation. Report the mean accuracy of the **individual classifiers** and of the **ensemble using the majority voting strategy (hard voting)**. Compare the performance.
- Note: In the DecisionTreeClassifier() implementation, the features are always randomly permuted at each split. Therefore, the best split found may vary, even with the same training data, unless ``random_state`` is fixed.

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
seed = 7

# your implementation here

# Binary classification problem: every column except the last is a feature,
# and the last column is the class label (0 = benign, 1 = malignant)
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Individual classifiers
decision_tree_cls = DecisionTreeClassifier(random_state=seed)
logistic_reg_cls = LogisticRegression(random_state=seed)
svc_cls = SVC(random_state=seed)

# Majority-voting ensemble over fresh copies of the same three models
# (voting="hard" is the default; it is spelled out here for clarity)
voting_cls = VotingClassifier(
    estimators=[("log_clf", LogisticRegression(random_state=seed)),
                ("svm_clf", SVC(random_state=seed)),
                ("dec_clf", DecisionTreeClassifier(random_state=seed))],
    voting="hard")

# Mean accuracy over 10-fold cross-validation for each model
mean_scores_voting = cross_val_score(voting_cls, x, y, cv=10).mean()
mean_score_des = cross_val_score(decision_tree_cls, x, y, cv=10).mean()
mean_score_svc = cross_val_score(svc_cls, x, y, cv=10).mean()
mean_score_lr = cross_val_score(logistic_reg_cls, x, y, cv=10).mean()

In [3]:
mean_score_des, mean_score_lr, mean_score_svc, mean_scores_voting

(0.9485093167701864,
 0.9628571428571429,
 0.9685507246376812,
 0.9642650103519671)

## 2. Bagging Method

In this section, we will explore the bagging method with a decision tree as the base learning algorithm. Scikit-Learn provides the ``BaggingClassifier`` class, which takes the base learning model and the number of estimators. Set the number of estimators to 100 and report the mean accuracy of the ensemble using 10-fold cross-validation. Compare the performance with a single decision tree model.

In [4]:
from sklearn.ensemble import BaggingClassifier

num_trees = 100

# your implementation here
# Bag 100 decision trees. The base estimator is passed positionally because the
# keyword was renamed from base_estimator to estimator in scikit-learn 1.2.
clf = BaggingClassifier(DecisionTreeClassifier(random_state=seed),
                        n_estimators=num_trees, random_state=seed)
bagging_score = cross_val_score(clf, x, y, cv=10).mean()

In [5]:
bagging_score, mean_score_des

(0.9557142857142857, 0.9485093167701864)

We wrapped a decision tree classifier in a bagging ensemble of 100 trees and evaluated it with 10-fold cross-validation. The mean accuracy improved from 94.85% for the single decision tree to 95.57% for the bagged ensemble.
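To make the two steps of bagging concrete, here is a minimal hand-rolled sketch: draw bootstrap samples, fit one tree per sample, and combine the trees by majority vote. It runs on synthetic data from ``make_classification`` rather than the cancer dataset, and the names (``X_demo``, ``n_trees_demo``, ``bagged_pred``) are illustrative only. ``BaggingClassifier`` itself averages predicted probabilities when the base learner provides them, so treat this as an approximation of the idea, not of the library internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the cancer dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=7)

rng = np.random.default_rng(7)
n_trees_demo = 25  # odd, so a majority vote can never tie
trees = []
for _ in range(n_trees_demo):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=7).fit(X_demo[idx], y_demo[idx]))

# Aggregate: majority vote over the trees' predicted 0/1 labels.
all_preds = np.stack([t.predict(X_demo) for t in trees])  # (n_trees, n_samples)
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

train_acc = (bagged_pred == y_demo).mean()
print(all_preds.shape, train_acc)
```

The accuracy printed here is measured on the training data, so it only shows that the aggregation works; it is not comparable to the cross-validated scores in this exercise.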

Sklearn also provides the ``RandomForestClassifier``, which extends bagged decision trees by considering only a random subset of the features at each split. Use the random forest model and report the mean accuracy using 10-fold cross-validation. Set the number of trees to 100.

**Compare the performance of RandomForestClassifier with bagged decision tree and give the analysis.**

In [6]:
from sklearn.ensemble import RandomForestClassifier
# your implementation here...
# Random forest with 100 trees; bootstrap=True (the default) gives each tree
# its own bootstrap sample of the training data
rdf_cls = RandomForestClassifier(n_estimators=num_trees, random_state=seed)
mean_score_rdf = cross_val_score(rdf_cls, x, y, cv=10).mean()

In [7]:
mean_score_rdf, bagging_score

(0.9671428571428571, 0.9557142857142857)

**Comparison and analysis**:

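## Analysis
Bagging stands for **bootstrap aggregating**. It trains a number of base learners, here decision trees, each on a different bootstrap sample of the training data, and then combines their predictions. In scikit-learn's ``BaggingClassifier`` the combination averages the trees' predicted class probabilities and falls back to majority voting only when the base learner has no ``predict_proba`` method. Random forest also draws a bootstrap sample for each tree, so every tree sees a training set of the same size but with slightly different contents; the learning ability of each tree is not reduced much, even on a small dataset. In addition, at each split a random forest considers only a random subset of the candidate features and picks the best split among them. On this dataset the random forest reaches a mean accuracy of 96.71%, compared with 95.57% for the bagged decision trees. Compared with bagging, random forest has:
1. better training efficiency, since fewer features are evaluated at each split
2. more robustness, since the trees differ both in their training samples and in the features they consider, whereas bagging only randomizes the samples.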
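The feature-subsampling difference can be inspected directly on fitted models. The sketch below uses synthetic data with 9 features (the same count as the cancer dataset) and reads the ``max_features_`` attribute of one fitted tree from each ensemble. ``max_features="sqrt"`` is set explicitly so the example does not depend on the library default; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 9 features (not the cancer dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=7)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        random_state=7).fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                random_state=7).fit(X_demo, y_demo)

# max_features_ is the number of features a fitted tree examines per split.
bag_features_per_split = bag.estimators_[0].max_features_
forest_features_per_split = forest.estimators_[0].max_features_
print(bag_features_per_split, forest_features_per_split)  # prints: 9 3
```

Each bagged tree searches all 9 features at every split, while each forest tree searches only 3, which is where both the speed-up and the extra diversity between trees come from.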

## 3. AdaBoost Method

In this section, you will use AdaBoost classification by boosting a ``decision stump`` (**one-level decision tree**). Set the number of rounds to 100 and report the performance of the ensemble. Compare the performance with a single decision tree model.
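For intuition about what one boosting round does, the sketch below works through a single weight update of the textbook discrete AdaBoost algorithm on five made-up instances. The numbers are hypothetical, and scikit-learn's ``AdaBoostClassifier`` uses its own variant of the update, so this illustrates the principle rather than the library's exact computation.

```python
import numpy as np

# Five training instances with uniform starting weights.
weights = np.full(5, 0.2)
# Hypothetical outcome of one stump: only instance 1 is misclassified.
misclassified = np.array([False, True, False, False, False])

# Weighted error of the stump and its vote strength (alpha).
err = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Up-weight mistakes, down-weight correct predictions, then renormalize.
new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
new_weights /= new_weights.sum()
print(alpha, new_weights)
```

After the update the single misclassified instance carries half of the total weight, so the next stump is pushed to classify it correctly.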

In [8]:
from sklearn.ensemble import AdaBoostClassifier
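# your implementation here...
# Boost decision stumps (max_depth=1) for 100 rounds. The base estimator is passed
# positionally because the keyword was renamed from base_estimator to estimator
# in scikit-learn 1.2.
adaboost_cls = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                  n_estimators=100, random_state=seed)
adaboost_score = cross_val_score(adaboost_cls, x, y, cv=10).mean()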

In [9]:
mean_score_des, adaboost_score

(0.9485093167701864, 0.9585507246376814)
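To compare all models at a glance, the mean accuracies reported in the outputs above can be collected and ranked. This is a small convenience sketch; the dictionary keys are labels chosen here, and the values are copied from the cell outputs rather than recomputed.

```python
# Mean 10-fold accuracies copied from the cell outputs in this exercise.
reported = {
    "Decision tree": 0.9485093167701864,
    "Logistic regression": 0.9628571428571429,
    "SVM": 0.9685507246376812,
    "Hard voting": 0.9642650103519671,
    "Bagging": 0.9557142857142857,
    "Random forest": 0.9671428571428571,
    "AdaBoost": 0.9585507246376814,
}

# Sort from best to worst mean accuracy.
ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20s} {score:.4f}")
```

Every ensemble improves on the single decision tree, while the SVM alone scores highest on this dataset.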