# DS4023 Machine Learning : Ensemble Learning Exercise

In this exercise, you'll explore different ensemble methods and how ensembling improves the performance of a machine learning model. The exercise has three parts:
1. Simple ensemble strategy: majority voting
2. Bagging Method
3. Boosting Method: AdaBoost

The dataset we use for this exercise is a cancer dataset with 699 instances and 9 features. Each instance is labeled as either benign (0) or malignant (1). The dataset contains only numeric values, which have been normalized.

Many methods rely on a random number generator, e.g., the train-test split, the decision tree model, and the bootstrap sample generation in bagging. We therefore set the seed to a fixed number so that the results are reproducible.
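
As a small illustration of why the seed matters, the sketch below splits a toy array twice with the same `random_state` and checks that both splits agree. The toy data and the names `X_demo`, `y_demo` and `same_split` are made up for this example and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, only for illustration (not the cancer dataset).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two independent calls with the same seed.
_, test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)
_, test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

# With a fixed seed, both calls select exactly the same test rows.
same_split = np.array_equal(test_a, test_b)
print(same_split)
```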

## Load Dataset

In [2]:
import numpy as np
import pandas as pd
from sklearn import model_selection
data = pd.read_csv("cancer_normalized.csv")
data.describe()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 0.379749 | 0.237164 | 0.245271 | 0.200763 | 0.246225 | 0.346352 | 0.270863 | 0.207439 | 0.06549 | 0.344778 |
| std | 0.31286 | 0.339051 | 0.330213 | 0.317264 | 0.246033 | 0.364071 | 0.270929 | 0.339293 | 0.190564 | 0.475636 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.111111 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.111111 | 0.0 | 0.0 | 0.0 |
| 50% | 0.333333 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.222222 | 0.0 | 0.0 | 0.0 |
| 75% | 0.555556 | 0.444444 | 0.444444 | 0.333333 | 0.333333 | 0.5 | 0.444444 | 0.333333 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [3]:
data.head()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.444444 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.222222 | 0.0 | 0.0 | 0.0 |
| 1 | 0.444444 | 0.333333 | 0.333333 | 0.444444 | 0.666667 | 1.0 | 0.222222 | 0.111111 | 0.0 | 0.0 |
| 2 | 0.222222 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.2 | 0.222222 | 0.0 | 0.0 | 0.0 |
| 3 | 0.555556 | 0.777778 | 0.777778 | 0.0 | 0.222222 | 0.4 | 0.222222 | 0.666667 | 0.0 | 0.0 |
| 4 | 0.333333 | 0.0 | 0.0 | 0.222222 | 0.111111 | 0.1 | 0.222222 | 0.0 | 0.0 | 0.0 |


## 1.  Simple Ensemble Strategies

In this section, we will look at a simple ensemble technique for classification: majority voting. In this method, multiple models make a prediction for each data instance, and each model's prediction counts as a **vote**. The prediction made by the majority of the models is used as the final prediction.

Scikit-Learn provides us with some handy functions that we can use to accomplish this.
- The ``VotingClassifier`` takes a list of different estimators and a voting method as arguments. The ``hard`` voting method takes the predicted labels and applies the majority rule, while the ``soft`` voting method predicts the label with the largest sum of predicted probabilities across the estimators.
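
The difference between the two voting methods can be sketched with plain NumPy. The probabilities below are invented for illustration (three hypothetical classifiers, four instances, binary labels) and the variable names are not part of the exercise; the point is only that hard and soft voting can disagree on the same inputs.

```python
import numpy as np

# Hypothetical class-1 probabilities: one row per instance, one column per classifier.
proba_class1 = np.array([
    [0.90, 0.40, 0.45],   # instance 0
    [0.20, 0.30, 0.10],   # instance 1
    [0.55, 0.60, 0.05],   # instance 2
    [0.45, 0.48, 0.95],   # instance 3
])

# Hard voting: each classifier casts a vote for its own predicted label,
# and the label with at least 2 of the 3 votes wins.
labels = (proba_class1 >= 0.5).astype(int)
hard_vote = (labels.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities first, then pick the more likely class.
soft_vote = (proba_class1.mean(axis=1) >= 0.5).astype(int)

print(hard_vote)  # [0 0 1 0]
print(soft_vote)  # [1 0 0 1]
```

Instances 0, 2 and 3 show how one very confident classifier can overturn the majority under soft voting. Note that soft voting in scikit-learn requires every estimator to provide `predict_proba`, which for `SVC` means constructing it with `probability=True`.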

Here, we use three models, *Decision Tree*, *SVM* and *Logistic Regression*, for voting and adopt 10-fold cross-validation. **Report the mean accuracy of the individual classifiers and of the ensemble obtained with the majority voting strategy (hard voting).** Compare the performance.
- Note: In the ``DecisionTreeClassifier()`` implementation, the features are always randomly permuted at each split. Therefore, the best split found may vary even with the same training data, unless ``random_state`` is fixed.

In [36]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# cross validation library
from sklearn.model_selection import cross_val_score
seed = 7

# your implementation here
# the three individual models
decision_tree_cls = DecisionTreeClassifier(random_state=seed)
logistic_reg_cls = LogisticRegression(random_state=seed)
svc_cls = SVC(random_state=seed)

# voting classifier over the same three models (hard voting = majority vote)
voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression(random_state=seed)),
    ('svm_clf', SVC(random_state=seed)),
    ('dt_clf', DecisionTreeClassifier(random_state=seed))],
    voting='hard')

In [43]:
x = data.iloc[:,:-1]
y = data.iloc[:,-1]
display(x.shape)
display(y.shape)

(699, 9)

(699,)

In [38]:
# accuracy on each of the 10 folds (cv=10)
ten_tree_acc = cross_val_score(decision_tree_cls,x,y,cv=10)
ten_logist_acc = cross_val_score(logistic_reg_cls,x,y,cv=10)
ten_svc_acc = cross_val_score(svc_cls,x,y,cv=10)
ten_vote_acc = cross_val_score(voting_clf,x,y,cv=10)

In [39]:
print("Mean tree accuracy: ", ten_tree_acc.mean())
print("Mean logistic regression accuracy: ", ten_logist_acc.mean())
print("Mean SVC accuracy: ", ten_svc_acc.mean())
print("Mean voting accuracy: ", ten_vote_acc.mean())

Mean tree accuracy:  0.9485093167701864
Mean logistic regression accuracy:  0.9628571428571429
Mean SVC accuracy:  0.9685507246376812
Mean voting accuracy:  0.9642650103519671
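
To make explicit what `cross_val_score(..., cv=10)` does for a classifier, the sketch below reproduces it by hand with `StratifiedKFold` on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, `manual_scores` and `auto_scores` are assumptions for this example only, so the numbers are unrelated to the cancer dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 9 features (not the cancer dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=7)
clf = DecisionTreeClassifier(random_state=7)

# Manual 10-fold loop: train on 9 folds, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# For a classifier and an integer cv, cross_val_score uses stratified folds as well.
auto_scores = cross_val_score(clf, X_demo, y_demo, cv=10)

# Reporting the spread next to the mean helps judge whether a gap between models is meaningful.
print(manual_scores.mean(), manual_scores.std())
```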


## 2. Bagging Method

In this section, we will explore the bagging method, using a decision tree as the base learning algorithm. Scikit-Learn provides a ``BaggingClassifier``, which takes the base learning model and the number of estimators. Set the number of estimators to 100 and report the mean accuracy of the ensemble using 10-fold cross-validation. Compare the performance with a single decision tree model.

In [50]:
from sklearn.ensemble import BaggingClassifier

num_trees = 100

# bagging ensemble of decision trees, each trained on a bootstrap sample
bag_clf = BaggingClassifier(decision_tree_cls,
                            n_estimators = num_trees,
                            bootstrap = True,
                            n_jobs = -1,
                            oob_score = True,
                            random_state = seed)
ten_bagging_tree_acc = cross_val_score(bag_clf,x,y,cv=10)

In [55]:
print("Mean tree accuracy: ", ten_tree_acc.mean())
print("Mean bagging tree accuracy: ", ten_bagging_tree_acc.mean())

Mean tree accuracy:  0.9485093167701864
Mean bagging tree accuracy:  0.9557142857142857


We evaluated the model with 10-fold cross-validation, wrapping a decision tree classifier in a bagging ensemble of 100 trees. The mean accuracy improved from 94.85% for a single tree to 95.57% for the bagged ensemble.
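
The bootstrap sampling behind bagging can be sketched in a few lines. The example below draws one bootstrap sample of 699 indices (the dataset size) and measures how many distinct instances it contains; the names `boot_idx`, `unique_fraction` and `oob_fraction` are illustrative and not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 699

# One bootstrap sample: n indices drawn uniformly with replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct instances in the sample; the rest are "out-of-bag".
unique_fraction = np.unique(boot_idx).size / n_samples
oob_fraction = 1.0 - unique_fraction
print(round(unique_fraction, 3), round(oob_fraction, 3))
```

In expectation, about 1 - 1/e ≈ 63.2% of the instances appear in each bootstrap sample. The remaining out-of-bag instances are what `oob_score = True` uses to estimate accuracy without a separate validation set.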

Scikit-Learn also provides the ``RandomForestClassifier``, a modification of bagged decision trees. Use the random forest model with the number of trees set to 100 and report the mean accuracy using 10-fold cross-validation.

**Compare the performance of RandomForestClassifier with bagged decision tree and give the analysis.**

In [52]:
from sklearn.ensemble import RandomForestClassifier
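# 100 trees, as the exercise specifies (this is also the default in scikit-learn >= 0.22)
random_forest_cls = RandomForestClassifier(n_estimators = 100, random_state = seed)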
ten_random_forest_acc = cross_val_score(random_forest_cls,x,y,cv=10)

In [53]:
print("Mean random forest accuracy: ", ten_random_forest_acc.mean())

Mean random forest accuracy:  0.9671428571428571


**Comparison and analysis**:

The **random forest** (96.71%) performs better than the **bagged decision trees** (95.57%). Bagging grows each tree on a bootstrap sample of the training instances, but every split may consider all features. A random forest adds a second source of randomness: at each split, only a random subset of the features is considered, and the best split is chosen among them. This makes the trees less correlated with one another, so combining them reduces variance further, which is probably why the random forest performs better here. Both methods are robust, but the bagged decision trees randomize only the samples, whereas the random forest randomizes both the samples and the features.
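
One way to see the relationship is to add per-split feature subsampling to a bagged tree and compare it with a random forest. The sketch below does this on synthetic data; `make_classification`, the reduced ensemble size of 25 and all variable names are assumptions made to keep the example small, so the scores say nothing about the cancer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 9 features (not the cancer dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=7)

# Plain bagging: every split may consider all 9 features.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=7),
                                 n_estimators=25, random_state=7)

# Bagging plus a random feature subset at each split: the idea behind a random forest.
bagged_random_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", random_state=7),
    n_estimators=25, random_state=7)

forest = RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=7)

# Mean 5-fold accuracy of each variant.
scores = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
          for name, model in [("bagged trees", bagged_trees),
                              ("bagged random trees", bagged_random_trees),
                              ("random forest", forest)]}
for name, score in scores.items():
    print(name, round(score, 4))
```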

## 3. AdaBoost Method

In this section, you will use AdaBoost classification by boosting a ``decision stump`` (**one-level decision tree**). Set the number of boosting rounds to 100 and report the performance of the ensemble. Compare the performance with a single decision tree model.
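
To give a feel for what one boosting round does, here is a minimal sketch of the textbook (discrete) AdaBoost weight update on five toy instances. The labels and stump predictions are invented, and scikit-learn's implementation scales the update differently, so treat this as an illustration of the idea rather than of the library's internals.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])        # labels in {-1, +1}
stump_pred = np.array([1, -1, -1, -1, 1])   # hypothetical stump output with one mistake
weights = np.full(5, 1 / 5)                 # start with uniform instance weights

# Weighted error of the stump and its resulting vote strength.
miss = stump_pred != y_true
err = weights[miss].sum()                   # 0.2
alpha = 0.5 * np.log((1 - err) / err)       # 0.5 * ln(4) = ln(2)

# Increase the weight of misclassified instances, decrease the rest, renormalize.
weights = weights * np.exp(-alpha * y_true * stump_pred)
weights = weights / weights.sum()
print(alpha, weights)
```

After the update, the single misclassified instance carries half of the total weight (0.5, against 0.125 for each of the others), so the next stump is pushed to classify it correctly.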

In [54]:
from sklearn.ensemble import AdaBoostClassifier
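# decision stump (one-level tree) as the weak learner; passed positionally because
# the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
ada_boost_tree_cls = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1,
                                                               random_state = seed),
                                        n_estimators = 100,
                                        random_state = seed)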
ten_ada_boost_tree_acc = cross_val_score(ada_boost_tree_cls,x,y,cv=10)

In [56]:
print("Mean tree accuracy: ", ten_tree_acc.mean())
print("Mean AdaBoost tree accuracy: ", ten_ada_boost_tree_acc.mean())

Mean tree accuracy:  0.9485093167701864
Mean AdaBoost tree accuracy:  0.9585507246376814
