## Task

You will now use the ensemble methods you have just learned to find the best-performing model on the [Kaggle Breast Cancer Dataset](https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset), which is provided in the file `Breast_cancer_data.csv`. The dataset was obtained from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison. It contains 5 features and one target column, titled `diagnosis`, which is either 1 (has breast cancer) or 0 (does not have breast cancer).

### 1. Tune Ensemble Models

Read the dataset using Pandas and separate the target column from the features. Split the dataset using an 80/20 split and keep the test set separate for evaluation. Using what you have learned thus far, tune each of the following models as well as you can and report its F1 score on the test set: Decision Tree, Bagged Decision Tree, Random Forest, and AdaBoost. For each model type, keep track of the best trained model (you'll need it in the following task).

In [72]:
#import necessary utilities
from random import seed, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.utils import resample
print(".")

.


In [73]:
from sklearn import preprocessing
df = pd.read_csv(filepath_or_buffer='Breast_cancer_data.csv', header=None)
df = df.drop([0], axis=0) #header=None reads the header names as data row 0, so drop that row

le = preprocessing.LabelEncoder()
le.fit(df[5]) #fit the encoder on column 5 (the target attribute)
df[5] = le.transform(df[5]) #encode the target values as integers
labels = df.iloc[:,5].values #1d array of the target values
#df = df.drop([5], axis=1) #drop column 5 (the target) from dataframe
samples = df.iloc[:,:5].values #2d array of all the samples (columns 0-4, target excluded)

print('breast cancer data')
df

breast cancer data


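| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0 |
| 2 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0 |
| 3 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0 |
| 4 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0 |
| 5 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 565 | 21.56 | 22.39 | 142.0 | 1479.0 | 0.111 | 0 |
| 566 | 20.13 | 28.25 | 131.2 | 1261.0 | 0.0978 | 0 |
| 567 | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0 |
| 568 | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0 |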
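
The loading cell above reads the header row as data and then drops it. A minimal sketch of an alternative, assuming the file has a header row: let Pandas use that row for the column names and select the target by name. The feature column names below are assumptions for illustration (only `diagnosis` is named in the task text), and the five rows are copied from the output above.

```python
import io

import pandas as pd

# A few rows in the shape of the CSV. The feature column names are assumed;
# only `diagnosis` is confirmed by the task description.
csv_text = """mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis
17.99,10.38,122.8,1001.0,0.1184,0
20.57,17.77,132.9,1326.0,0.08474,0
19.69,21.25,130.0,1203.0,0.1096,0
11.42,20.38,77.58,386.1,0.1425,0
20.29,14.34,135.1,1297.0,0.1003,0
"""

# With the default header handling, the first row becomes the column names
# and the values are parsed as numbers rather than strings.
df = pd.read_csv(io.StringIO(csv_text))

labels = df["diagnosis"].to_numpy()                 # 1d target vector
samples = df.drop(columns="diagnosis").to_numpy()   # 2d feature matrix

print(samples.shape, labels.shape, samples.dtype)
```

Because the values are parsed as numbers, no label encoding step is needed for a target that is already 0/1.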


In [74]:
print(len(samples), "samples,", len(samples[0]), "attributes", type(samples))
print(len(samples))
trainsamples, testsamples = np.split(samples, [455]) #80/20 train/test split in file order (first 455 rows train, last 114 test, no shuffling)
trainlabels, testlabels = np.split(labels, [455])
print(len(trainsamples), len(trainlabels))
print(len(testsamples))

569 samples, 5 attributes <class 'numpy.ndarray'>
569
455 455
114
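
The split above takes the rows in file order, so the class balance of the test set depends on how the file happens to be sorted. A minimal sketch of a shuffled, stratified 80/20 split with `train_test_split` follows. As a stand-in (an assumption, since the CSV is not available here), it uses scikit-learn's bundled Wisconsin breast cancer data restricted to its first five columns.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): scikit-learn's bundled Wisconsin breast cancer
# set, restricted to its first five feature columns.
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :5]

# Shuffle before splitting and keep the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # share of class 1 in each part
```

Fixing `random_state` keeps the split reproducible, so scores from different models remain comparable.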


In [75]:
# Fit the model on the training set and return its weighted F1 score on the test set
# (no cross validation is performed here)
def evaluate_model(model):
    model.fit(trainsamples, trainlabels)
    prediction = model.predict(testsamples)
    scores = f1_score(testlabels, prediction, average='weighted') #support-weighted mean of the per-class f1 scores
    return scores
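
`evaluate_model` reports `average='weighted'`, which is not the same number as the default binary F1. A small worked sketch on made-up labels (the arrays are illustrative only) shows the difference:

```python
from sklearn.metrics import f1_score

# Toy labels, made up for illustration: 4 positives and 2 negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Default (average='binary'): F1 of class 1 only.
f1_pos = f1_score(y_true, y_pred)
# F1 of class 0 only.
f1_neg = f1_score(y_true, y_pred, pos_label=0)
# Mean of the per-class F1 scores, weighted by each class's support.
f1_weighted = f1_score(y_true, y_pred, average='weighted')

print('{:.3f} {:.3f} {:.3f}'.format(f1_pos, f1_neg, f1_weighted))
# prints: 0.750 0.500 0.667
```

Class 1 has precision and recall of 3/4, class 0 has precision and recall of 1/2, and the weighted score is (4 * 0.75 + 2 * 0.5) / 6. When reporting "the F1 score", it helps to state which variant is meant.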

In [98]:
#classic decision tree classifier (no ensemble method)
from sklearn.tree import DecisionTreeClassifier

#default classifier, no tuning
dtc = DecisionTreeClassifier(random_state=0)
scores = evaluate_model(dtc)
print('f1 score: {:.4f}'.format(scores.mean()))

#tuning parameter min_samples_leaf
for i in range(1,10):
    dtc = DecisionTreeClassifier(random_state=0, min_samples_leaf=i)
    scores = evaluate_model(dtc)
    print(i, 'minimum samples per leaf, f1 score: {:.4f}'.format(scores.mean()))

#min_samples_leaf=5 achieves highest f1 score
#tuning hyperparameter max_depth
for i in range(1,10):
    dtc = DecisionTreeClassifier(random_state=0, min_samples_leaf=5, max_depth=i)
    scores = evaluate_model(dtc)
    print(i, 'max depth, f1 score: {:.4f}'.format(scores.mean()))

print("\nmin_samples_leaf=5 , max_depth=1 achieve highest f1_score=.9298")

f1 score: 0.8598
1 minimum samples per leaf, f1 score: 0.8598
2 minimum samples per leaf, f1 score: 0.8214
3 minimum samples per leaf, f1 score: 0.8606
4 minimum samples per leaf, f1 score: 0.8921
5 minimum samples per leaf, f1 score: 0.9080
6 minimum samples per leaf, f1 score: 0.8921
7 minimum samples per leaf, f1 score: 0.8685
8 minimum samples per leaf, f1 score: 0.8685
9 minimum samples per leaf, f1 score: 0.8685
1 max depth, f1 score: 0.9298
2 max depth, f1 score: 0.9298
3 max depth, f1 score: 0.8528
4 max depth, f1 score: 0.8842
5 max depth, f1 score: 0.9080
6 max depth, f1 score: 0.9080
7 max depth, f1 score: 0.9080
8 max depth, f1 score: 0.9080
9 max depth, f1 score: 0.9080

min_samples_leaf=5 , max_depth=1 achieve highest f1_score=.9298
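
The loops above score every candidate with `evaluate_model`, that is, on the test set, and tune one hyperparameter at a time. A hedged sketch of an alternative that keeps the test set out of the tuning: search both hyperparameters jointly with cross validation on the training data, then score the chosen tree on the test set once. The stand-in data (scikit-learn's bundled breast cancer set, first five columns) and the 5-fold setting are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): bundled breast cancer set, first five columns.
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :5]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same ranges as the loops above, but searched jointly.
param_grid = {
    "min_samples_leaf": list(range(1, 10)),
    "max_depth": list(range(1, 10)),
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",  # same averaging as evaluate_model
    cv=5,                   # 5-fold cross validation on the training set only
)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final report.
best_tree = search.best_estimator_
test_f1 = f1_score(y_test, best_tree.predict(X_test), average="weighted")

print(search.best_params_)
print("cv f1: {:.4f}, test f1: {:.4f}".format(search.best_score_, test_f1))
```

The same pattern applies to the ensemble models below by swapping the estimator and the parameter grid.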


In [99]:
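#bagged decision tree
from sklearn.ensemble import BaggingClassifier
#trains n_estimators copies of the base tree `dtc`, each on a bootstrap sample
#note: `dtc` is the last tree built in the previous cell (min_samples_leaf=5, max_depth=9)

#baseline with n_estimators=100, before tuning
bc = BaggingClassifier(estimator=dtc, n_estimators=100, random_state=0)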
scores = evaluate_model(bc)
print('f1 score: {:.4f}'.format(scores.mean()))

#tuning n_estimators
for i in range(1, 10):
    bc = BaggingClassifier(estimator=dtc, n_estimators=i*10, random_state=0)
    scores = evaluate_model(bc)
    print(i*10, 'estimators, f1 score: {:.4f}'.format(scores.mean()))

#best f1_score was w/ n_estimators=60
#tune max_features hyperparameter
for i in range(1,6):
    bc = BaggingClassifier(estimator=dtc, n_estimators=60, random_state=0, max_features=i)
    scores = evaluate_model(bc)
    print(i, 'max features, f1 score: {:.4f}'.format(scores.mean()))

print("\nn_estimators=60, max_features=1 achieve highest f1_score=.9397")

f1 score: 0.9167
10 estimators, f1 score: 0.9160
20 estimators, f1 score: 0.9080
30 estimators, f1 score: 0.9080
40 estimators, f1 score: 0.9080
50 estimators, f1 score: 0.9080
60 estimators, f1 score: 0.9247
70 estimators, f1 score: 0.9167
80 estimators, f1 score: 0.9167
90 estimators, f1 score: 0.9167
1 max features, f1 score: 0.9397
2 max features, f1 score: 0.9328
3 max features, f1 score: 0.9247
4 max features, f1 score: 0.9167
5 max features, f1 score: 0.9247

n_estimators=60, max_features=1 achieve highest f1_score=.9397


In [100]:
#random forest
from sklearn.ensemble import RandomForestClassifier

#no tuning
rf = RandomForestClassifier(random_state=0)
scores = evaluate_model(rf)
print('f1 score: {:.4f}'.format(scores.mean()))

#tuning max_depth
for i in range(1,10):
    rf = RandomForestClassifier(random_state=0, max_depth=i)
    scores = evaluate_model(rf)
    print(i, 'max depth, f1 score: {:.4f}'.format(scores.mean()))

#highest f1_score with max_depth=1
#tuning max features
for i in range(1,6):
    rf = RandomForestClassifier(random_state=0, max_depth=1, max_features=i)
    scores = evaluate_model(rf)
    print(i, 'max features, f1 score: {:.4f}'.format(scores.mean()))

print("\nmax_depth=1, max_features=2 achieve highest f1_score=.9390")

f1 score: 0.9167
1 max depth, f1 score: 0.9390
2 max depth, f1 score: 0.9315
3 max depth, f1 score: 0.9160
4 max depth, f1 score: 0.9087
5 max depth, f1 score: 0.9167
6 max depth, f1 score: 0.9087
7 max depth, f1 score: 0.9167
8 max depth, f1 score: 0.9167
9 max depth, f1 score: 0.9080
1 max features, f1 score: 0.9307
2 max features, f1 score: 0.9390
3 max features, f1 score: 0.9307
4 max features, f1 score: 0.9144
5 max features, f1 score: 0.9225

max_depth=1, max_features=2 achieve highest f1_score=.9390


In [107]:
#AdaBoost (adaptive boosting) classifier
from sklearn.ensemble import AdaBoostClassifier

#no tuning
ada = AdaBoostClassifier(random_state=0)
scores = evaluate_model(ada)
print('f1 score: {:.4f}'.format(scores.mean()))

#tuning learning rate
for i in range(1,10):
    ada = AdaBoostClassifier(random_state=0, learning_rate=i/10)
    scores = evaluate_model(ada)
    print(i/10, 'learning rate, f1 score: {:.4f}'.format(scores.mean()))

#best result was learning_rate=0.1, now tuning estimators
for i in range(1,10):
    ada = AdaBoostClassifier(random_state=0, learning_rate=0.1, n_estimators=10*i)
    scores = evaluate_model(ada)
    print(i*10, 'estimators, f1 score: {:.4f}'.format(scores.mean()))

print("\nlearning_rate=0.1, n_estimators=60, achieves highest f1_score=.9492")

f1 score: 0.8992
0.1 learning rate, f1 score: 0.9410
0.2 learning rate, f1 score: 0.9160
0.3 learning rate, f1 score: 0.9080
0.4 learning rate, f1 score: 0.9072
0.5 learning rate, f1 score: 0.9234
0.6 learning rate, f1 score: 0.9153
0.7 learning rate, f1 score: 0.9153
0.8 learning rate, f1 score: 0.8992
0.9 learning rate, f1 score: 0.9153
10 estimators, f1 score: 0.9390
20 estimators, f1 score: 0.9315
30 estimators, f1 score: 0.9404
40 estimators, f1 score: 0.9404
50 estimators, f1 score: 0.9410
60 estimators, f1 score: 0.9492
70 estimators, f1 score: 0.9328
80 estimators, f1 score: 0.9410
90 estimators, f1 score: 0.9328

learning_rate=0.1, n_estimators=60, achieves highest f1_score=.9492
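
The `n_estimators` loop above refits AdaBoost for every candidate value. Since boosting adds estimators one after another, a single fit can be scored at every stage with `staged_predict`. A sketch under assumptions: the stand-in data plays the role of the training set, and a validation part is carved out of it so the test set stays untouched.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in for the training portion (assumption): bundled breast cancer set,
# first five columns, split again into a fitting part and a validation part.
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :5]
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

ada = AdaBoostClassifier(n_estimators=90, learning_rate=0.1, random_state=0)
ada.fit(X_fit, y_fit)

# staged_predict yields the ensemble's predictions after 1, 2, ... estimators.
staged_f1 = [
    f1_score(y_val, pred, average="weighted")
    for pred in ada.staged_predict(X_val)
]
best_n = staged_f1.index(max(staged_f1)) + 1  # stages are 1-indexed

print("best n_estimators on validation data:", best_n)
print("validation f1 at that stage: {:.4f}".format(max(staged_f1)))
```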


### 2. Ensemble of Ensembles

Another way to create an ensemble is by combining the predictions of different models, e.g., the 4 different `best` models you found earlier. As usual, one has to decide how to combine the votes of each model. Read about the VotingClassifier on the [SKLearn documentation website](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). You may also want to look at some tutorials online for using this ensemble model. Then train a VotingClassifier model that will use the 4 best models you found in step 1. Finally, use it to predict the samples in the test set and report the F1 score.

In [122]:
#ensemble of ensembles, using voting classifier
from sklearn.ensemble import VotingClassifier

dtc = DecisionTreeClassifier(random_state=0, min_samples_leaf=5, max_depth=1) #best decision tree found in step 1
#best bagging settings found in step 1; the base tree here has max_depth=1, whereas the tuning loop in step 1 used max_depth=9
bc = BaggingClassifier(estimator=dtc, n_estimators=60, random_state=0, max_features=1)
rf = RandomForestClassifier(random_state=0, max_depth=1, max_features=2) #best random forest found in step 1
ada = AdaBoostClassifier(random_state=0, learning_rate=0.1, n_estimators=60) #best AdaBoost classifier found in step 1

#create voting classifier whose four estimators are the best models found in step 1
#voting="hard": each model casts one vote and the majority class is predicted
voting = VotingClassifier(
    estimators=[("dtc", dtc), ("bc", bc), ("rf", rf), ("ada", ada)],
    voting="hard",
)

#fit to training sample, predict test samples
voting.fit(trainsamples, trainlabels)

prediction = voting.predict(testsamples)
print('predicting test labels', prediction)
print('actual test labels', testlabels)

score = f1_score(testlabels, prediction, average='weighted')
print('\nvoting classifier achieves f1_score =', score)

predicting test labels [1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0
 0 1 1 0 1 1 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
 0 0 1]
actual test labels [1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1
 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
 0 0 1]

voting classifier achieves f1_score = 0.9307207994443325
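
The cell above uses hard voting, where each model casts one vote. With four voters a 2-2 tie is possible, and scikit-learn documents that ties are resolved by ascending class label order. The other option is soft voting, which averages the predicted class probabilities instead. A minimal sketch comparing the two modes follows. The stand-in data, the three illustrative estimators, and their settings are assumptions, not the tuned models from step 1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): bundled breast cancer set, first five columns.
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :5]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative estimators; an odd count avoids ties under hard voting.
estimators = [
    ("dtc", DecisionTreeClassifier(min_samples_leaf=5, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=0)),
]

results = {}
for mode in ("hard", "soft"):
    # "hard" counts class votes; "soft" averages predict_proba outputs.
    clf = VotingClassifier(estimators=estimators, voting=mode)
    clf.fit(X_train, y_train)
    results[mode] = f1_score(y_test, clf.predict(X_test), average="weighted")
    print(mode, "voting, f1 score: {:.4f}".format(results[mode]))
```

Soft voting requires every estimator to implement `predict_proba`, which holds for the tree-based models used here.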
