<a href="https://colab.research.google.com/github/tommylouistaylor/CEGE0004_MachineLearning/blob/master/6%20-%20Week/model_ensembles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Ensembles

### Dataset: Breast Cancer Wisconsin (Diagnostic)

The features in this [dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

### Metadata
* 569 rows (examples) and 30 columns (features);
* a classic and easy binary classification dataset;
* of the 569 examples, 212 are classified as malignant (target `0` in scikit-learn) and 357 as benign (target `1`).
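
The figures above can be checked directly against the copy of the dataset that ships with scikit-learn. This is a small sketch; it assumes `load_breast_cancer` is available, as in the rest of this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# shape of the feature matrix: (number of examples, number of features)
print(dataset.data.shape)

# target_names lists the label that each integer class stands for, in order
print(dataset.target_names)

# count how many examples fall into each class
counts = np.bincount(dataset.target)
for name, count in zip(dataset.target_names, counts):
    print(name, count)
```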

### Attributes
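1. An id (present in the original UCI file; not part of the arrays returned by scikit-learn's loader);
1. Output Target (ys): the diagnosis (M = malignant, B = benign);
1. Input Feature Examples (xs): 30 real-valued features, namely the mean, standard error, and worst (largest) value of each of the following 10 characteristics: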
    * radius (mean distance from center to perimeter);
    * texture (standard deviation of gray-scale values);
    * perimeter;
    * area;
    * smoothness (local variation in radius lengths);
    * compactness (perimeter^2 / area - 1.0);
    * concavity (severity of concave portions of the contour);
    * concave points (number of concave portions of the contour);
    * symmetry;
    * fractal dimension ("coastline approximation" - 1).

## Prepare Dataset

### Load Dataset using scikit-learn function

In [None]:
# read dataset
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()  # load dataset
xs = dataset.data               # pass examples to var
ys = dataset.target             # pass classes to var
print('The dimensionality of the dataset is', xs.shape[1])

In [None]:
xs  # view examples array

In [None]:
ys  # view classes array

### Preprocessing: Split into Training and Test Sets, then Scale

In [9]:
# split into Training and Test
from sklearn.model_selection import train_test_split
xs_train, xs_test, ys_train, ys_test = train_test_split(xs, ys, test_size=0.20, random_state=42)  # 80% for the training set and 20% for the test set.

In [10]:
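# perform min-max scaling (all features are continuous)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()               # create the scaler
scaler.fit(xs_train)                  # learn per-feature min and max from the training set only
xs_train = scaler.transform(xs_train) # rescale training features to [0, 1]
xs_test = scaler.transform(xs_test)   # apply the same training-set statistics to the test set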
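
To make the scaling step concrete, here is a minimal sketch of the same arithmetic in plain NumPy. The arrays `train` and `test` hold made-up toy values, not rows from the dataset. The key point is that the minimum and maximum come from the training data only.

```python
import numpy as np

# toy data (made-up values): 3 training rows and 2 test rows, 2 features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
test = np.array([[2.5, 10.0], [4.0, 25.0]])

# per-feature statistics are computed on the training set only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# x_scaled = (x - min) / (max - min)
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)  # unseen data can fall outside [0, 1]
```

In this toy example the second test row has a first feature larger than anything seen in training, so its scaled value exceeds 1.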

## Combine Base Classifiers into an Ensemble using a Voting Mechanism

### Define 4 Different Classifiers

In [11]:
# define classifiers

from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

svm_clf = SVC(C=0.1)              # support vector classifier
nb_clf = MultinomialNB()          # Naive Bayes classifier (needs non-negative features, which min-max scaling provides on the training set)
dt_clf = DecisionTreeClassifier() # decision tree classifier
mlp_clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)  # neural network classifier

### Define Voting Classifier using scikit-learn

In [20]:
# define voting classifier
from sklearn.ensemble import VotingClassifier
# voting='hard' predicts the class label by majority vote;
# voting='soft' predicts the argmax of the summed predicted probabilities
voting_clf = VotingClassifier(
    estimators=[('svm', svm_clf), ('nb', nb_clf), ('dt', dt_clf), ('mlp', mlp_clf)],  # assign an id to each classifier
    voting='hard')
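
The difference between hard and soft voting can be illustrated with a small sketch. The probabilities below are made-up numbers for a single example and three hypothetical classifiers; they are not outputs of the models defined above.

```python
import numpy as np

# hypothetical predicted probabilities of class 1 from three classifiers
# for one example (made-up numbers)
proba_class1 = np.array([0.45, 0.40, 0.90])

# hard voting: each classifier casts one vote for its most likely class
votes = (proba_class1 > 0.5).astype(int)
hard_label = np.bincount(votes, minlength=2).argmax()

# soft voting: average the class probabilities, then take the argmax
mean_class1 = proba_class1.mean()
mean_proba = np.array([1 - mean_class1, mean_class1])
soft_label = mean_proba.argmax()

print('hard:', hard_label, 'soft:', soft_label)
```

Two of the three classifiers lean towards class 0, so hard voting picks class 0, while the one very confident classifier pulls the averaged probability above 0.5, so soft voting picks class 1. Note that soft voting needs every base classifier to expose `predict_proba`; for `SVC` that means constructing it with `probability=True`.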

In [None]:
# fit voting classifier
voting_clf.fit(xs_train, ys_train)

In [22]:
# fit each individual classifier independently (VotingClassifier fits clones, so the originals are still unfitted)
for clf in (svm_clf, nb_clf, dt_clf, mlp_clf):
    clf.fit(xs_train, ys_train)

### Measure Accuracy of ALL Classifiers

In [None]:
# measure train and test accuracy of each classifier, INCLUDING the ensembled classifier.
from sklearn.metrics import accuracy_score
for clf in (svm_clf, nb_clf, dt_clf, mlp_clf, voting_clf):
    print(clf.__class__.__name__)
    ys_pred = clf.predict(xs_train)
    print('\ttrain:', accuracy_score(ys_train, ys_pred))
    ys_pred = clf.predict(xs_test)
    print('\ttest:', accuracy_score(ys_test, ys_pred))

## Bagging

A bagging classifier is an ensemble meta-estimator. It fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions (by voting or averaging) to form a final prediction.
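
The idea can be sketched by hand in a few lines. This is only an illustration of the principle, not how `BaggingClassifier` is implemented internally; the number of models (11), the subset size (half of the training set), and the seed are arbitrary choices made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

rng = np.random.default_rng(0)
n_models = 11                      # odd, so a majority vote can never tie
subset_size = len(X_train) // 2    # each model sees half as many examples

all_preds = []
for _ in range(n_models):
    # draw a random subset of the training examples (with replacement)
    idx = rng.choice(len(X_train), size=subset_size, replace=True)
    model = KNeighborsClassifier().fit(X_train[idx], y_train[idx])
    all_preds.append(model.predict(X_test))

# majority vote across the models for each test example
all_preds = np.array(all_preds)    # shape: (n_models, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print('test accuracy:', bagged_acc)
```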

### Define Bagging Classifier

In [26]:
# perform bagging on a kNN classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bag_clf = BaggingClassifier(KNeighborsClassifier(),             # pass knn classifier into bagging classifier
                            max_samples=0.5, max_features=1.0)  # max_samples and max_features control fraction of examples and features in each replica of the dataset

In [None]:
# fit bagging classifier to the training set
bag_clf.fit(xs_train, ys_train)

In [None]:
# measure train and test accuracy of bagging classifier
print(bag_clf.__class__.__name__, '(kNN)')
ys_pred = bag_clf.predict(xs_train)
print('\ttrain:', accuracy_score(ys_train, ys_pred))
ys_pred = bag_clf.predict(xs_test)
print('\ttest:', accuracy_score(ys_test, ys_pred))

## Fit Multiple Decision Tree Classifiers using 'Random Forest' Ensemble Model

Random forest fits multiple decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting. When `bootstrap=True` (the default), the sub-sample size is controlled by the `max_samples` parameter; otherwise the whole dataset is used to build each tree.
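
The averaging step can be made visible by comparing the forest's predicted probabilities with the mean over its individual trees. This sketch fits a separate forest on the unscaled data with a fixed `random_state` (an assumption made here for reproducibility); `estimators_` is the fitted list of trees that scikit-learn exposes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# average over the trees: shape (n_examples, n_classes)
mean_proba = tree_probas.mean(axis=0)

# compare with the probabilities reported by the forest itself
matches = np.allclose(mean_proba, forest.predict_proba(X))
print('forest probabilities equal the tree average:', matches)
```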

In [32]:
# define random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=10)

In [None]:
# fit random forest classifier
rf_clf.fit(xs_train, ys_train)

In [None]:
# evaluate accuracy on train and test sets
print(rf_clf.__class__.__name__)
ys_pred = rf_clf.predict(xs_train)
print('\ttrain:', accuracy_score(ys_train, ys_pred))
ys_pred = rf_clf.predict(xs_test)
print('\ttest:', accuracy_score(ys_test, ys_pred))

## Boosting (AdaBoost)

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then
fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances
are adjusted such that subsequent classifiers focus more on difficult cases.
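
The reweighting loop can be sketched by hand. The code below is a simplified version of discrete two-class AdaBoost with labels recoded to -1/+1; it is meant as an illustration of the idea and is not the exact procedure used inside `AdaBoostClassifier`. The number of rounds (20) is an arbitrary choice for this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y_pm = np.where(y == 1, 1, -1)              # recode classes as -1 / +1

n_rounds = 20
weights = np.full(len(X), 1 / len(X))       # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # fit a depth-1 tree that pays more attention to heavily weighted examples
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y_pm, sample_weight=weights)
    pred = stump.predict(X)

    # weighted error rate of this stump
    err = np.sum(weights[pred != y_pm])
    err = np.clip(err, 1e-10, 1 - 1e-10)    # guard against division by zero
    alpha = 0.5 * np.log((1 - err) / err)   # vote strength of this stump

    # raise the weight of misclassified examples, lower the rest, renormalize
    weights *= np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# final prediction: sign of the weighted sum of the stump votes
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
boosted_pred = np.where(scores >= 0, 1, -1)
train_acc = (boosted_pred == y_pm).mean()
print('train accuracy:', train_acc)
```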

For this classifier we boost decision trees of depth 1, also known as decision stumps, using up to 200 of them.

In [35]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200)

The parameter `n_estimators` sets the maximum number of estimators, at which point boosting is terminated.
If a perfect fit is reached sooner, the learning procedure stops early.
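
One way to see how many estimators were actually fitted, and how accuracy evolves as estimators are added, is to inspect `estimators_` and `staged_score`. This sketch refits a smaller model (50 stumps, an arbitrary choice) on the unscaled data so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# number of estimators that were actually fitted (at most n_estimators)
n_fitted = len(clf.estimators_)
print('estimators fitted:', n_fitted)

# test accuracy after each boosting round
test_scores = list(clf.staged_score(X_test, y_test))
for n in sorted({1, min(10, n_fitted), n_fitted}):
    print(n, 'estimators -> test accuracy', test_scores[n - 1])
```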

We now fit this classifier and evaluate its accuracy on the train and test sets.

In [36]:
ada_clf.fit(xs_train, ys_train)

print(ada_clf.__class__.__name__, '(DecisionStumps)')
ys_pred = ada_clf.predict(xs_train)
print('\ttrain:', accuracy_score(ys_train, ys_pred))
ys_pred = ada_clf.predict(xs_test)
print('\ttest:', accuracy_score(ys_test, ys_pred))

AdaBoostClassifier (DecisionStumps)
	train: 1.0
	test: 0.9736842105263158


Note that in all the examples above we have not performed any validation.
How would you perform the validation of these classifiers?
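
One possible approach, sketched here under a few assumptions, is k-fold cross-validation on the training set. The choice of 5 folds, of the random forest as the example model, and of a pipeline that bundles the scaler with the classifier are all illustrative; other validation schemes (for instance a separate held-out validation set) are equally reasonable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# the pipeline refits the scaler inside each fold, so the held-out fold
# never influences the scaling statistics
model = make_pipeline(MinMaxScaler(), RandomForestClassifier(n_estimators=10, random_state=0))

# 5-fold cross-validation on the training set only; the test set stays untouched
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print('fold accuracies:', scores)
print('mean accuracy:', scores.mean())
```

The mean fold accuracy can then be used to compare classifiers or hyperparameters, with the test set kept for a single final evaluation.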