## Classification of mushrooms, edible or poisonous. Download the mushroom_dataset.csv dataset file from the module content. Load the dataset in your model development framework and examine the features to confirm they are all nominal. The first column is the class, which indicates whether the mushroom is poisonous or not. Apply the necessary pre-processing, such as nominal-to-numerical conversion (e.g. pd.get_dummies). Make sure to sanity-check the pipeline, and perhaps run your favorite baseline classifier first.


### Load and Preprocess Data

In [49]:
%%time

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from tqdm import tqdm

##get current working directory
cwd = os.getcwd()

##get data path and open as a pandas dataframe
datapath = os.path.join(cwd, 'data', 'mushroom_dataset.csv')

##get dataframe
df = pd.read_csv(datapath)

##check contents, dtypes, and description:
print('\nHead:\n\n ', df.head())
print('\nDatatypes:\n\n ', df.dtypes)
print('\nData Description: \n\n', df.describe())

##drop rows where 'stalk-root' is missing (marked with a question mark)
df = df[df['stalk-root'] != "?"]

##separate the class labels from the dataframe (the column is dropped below)
y = df['class'].values

##drop the single-valued 'veil-type' feature and the class column
df.drop(['veil-type', 'class'], axis=1, inplace=True)

##store the current feature names for later use
features = list(df)
feature_count = len(features)

##encode the data
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
X = np.array(df.values)
X = X.flatten()
label_encoder.fit(X)
X_encoded = label_encoder.transform(X)
y_encoded = label_encoder.transform(y)
X_encoded = X_encoded.reshape(len(features), -1).T

##normalize
from sklearn.preprocessing import Normalizer
transformer = Normalizer().fit(X_encoded)
X_norm = transformer.transform(X_encoded)


Head:

    class cap-shape cap-surface cap-color bruises odor gill-attachment  \
0     p         x           s         n       t    p               f   
1     e         x           s         y       t    a               f   
2     e         b           s         w       t    l               f   
3     p         x           y         w       t    p               f   
4     e         x           s         g       f    n               f   

  gill-spacing gill-size gill-color  ... stalk-surface-below-ring  \
0            c         n          k  ...                        s   
1            c         b          k  ...                        s   
2            c         b          n  ...                        s   
3            c         n          n  ...                        s   
4            w         b          k  ...                        s   

  stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
0                      w                      w         p          w  

Observation:

- The dataset contains twenty-two features.
- The data is nominal and will need to be encoded.
- The dependent variable (target) is 'class', and it contains two unique classes.
- The class column is part of the dataframe, so it needs to be copied to its own variable and dropped from the feature set.
- The dataset has 8124 samples.
- Veil-type has only one value, so this feature can be dropped.
- About 2400 '?' values appear in 'stalk-root'. That is a lot, but since there are about 8000 samples, the affected rows are dropped. I know from trial and error that deleting the whole 'stalk-root' column hurts classifier performance, which is why the rows were dropped rather than the feature.
- I know from trial and error through this assignment that the exercise is resource intensive and the models take a long time to train. Therefore, the data was further split into a train set (75 percent) and a test set (25 percent), and only the train set is used for cross-validation. This was shown to have little to no effect on the random forest classifier. Furthermore, the test set can be used as a development set for further testing.
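
The checks behind these observations, followed by the one-hot encoding that the prompt suggests (pd.get_dummies), can be sketched on a toy frame. The values below are hypothetical stand-ins and not rows from mushroom_dataset.csv; only the column names mirror the dataset. This is a sketch of one possible pipeline, not the one used in the cells of this notebook.

```python
import pandas as pd

# Toy stand-in for the mushroom data (hypothetical values).
toy = pd.DataFrame({
    'class':      ['p', 'e', 'e', 'p', 'e', 'e'],
    'veil-type':  ['p', 'p', 'p', 'p', 'p', 'p'],
    'stalk-root': ['e', 'c', '?', 'e', '?', 'b'],
})

# Columns with a single unique value carry no information for a classifier.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Count the '?' placeholder per column to locate missing values.
missing_counts = (toy == '?').sum()

# Drop rows with a missing 'stalk-root', then the constant columns.
clean = toy[toy['stalk-root'] != '?'].drop(columns=constant_cols)

# One-hot encode the remaining nominal features; encode the target as 0/1.
X_onehot = pd.get_dummies(clean.drop(columns='class'))
y_binary = (clean['class'] == 'p').astype(int)

print(constant_cols)
print(missing_counts['stalk-root'])
print(X_onehot.columns.tolist())
```

With one-hot encoding each category becomes its own 0/1 column, so no artificial ordering between category letters is introduced.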

In [50]:
%%time

'''run through favorite classifier to get an idea of performance'''

##split the data 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_encoded, test_size=0.25, random_state=42)

##imports for cross-validation and metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score, classification_report

##Kfold function
def kfold_eval(_clf: object, X: np.array, y: np.array) -> np.array:
    '''
    Desc: K-fold train and validation
    
    :params _clf: Classifier to train and evaluate on each fold
    :params X: Train dataset
    :params y: Train target
    :returns: Array of per-fold accuracy scores
    '''
    ## accuracy bookkeeping
    acc = []
    ## stratified split
    kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
    ##iterate through train and test sets
    for train_index, test_index in kf.split(X, y):
        _clf.fit(X[train_index], y[train_index])
        y_pred = _clf.predict(X[test_index])
        acc += [accuracy_score(y[test_index], y_pred)]
    return np.array(acc)

##Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
acc = kfold_eval(rf_clf, X_train, y_train)
print('\nRandom forest mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))


Random forest mean-average accuracy:  0.665  Std:  0.023
Wall time: 7.66 s
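
As a cross-check on kfold_eval, the same 10-fold loop can be compared against scikit-learn's cross_val_score helper. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, not the mushroom set) and a decision tree with a fixed random_state so that both paths are deterministic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the mushroom set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=False)
clf = DecisionTreeClassifier(random_state=0)

# Manual loop, same shape as kfold_eval above.
manual = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual.append(accuracy_score(y_demo[test_idx], clf.predict(X_demo[test_idx])))
manual = np.array(manual)

# Built-in helper; the default scorer for a classifier is accuracy.
builtin = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv)

print(np.round(manual.mean(), 3), np.round(builtin.mean(), 3))
```

The manual loop is kept in this notebook because later cells need access to the per-fold indices; for a plain baseline the helper is shorter.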


In [51]:
%%time

'''
The random forest classifier did not perform well, so the least correlated features should be dropped 
from the dataset. Here, each feature column is correlated directly with the target. The nine highest 
and five lowest scoring features are displayed, as is done in module06_ensemble_notebook.html. The 
lowest scoring feature is then dropped and the random forest classifier is used again to test the 
influence of the dropped feature on the model.
'''
##Address poor score by removing lowest correlated data
##Direct correlation between each column of X and the target y
corrs = np.array([np.correlate(X_train[:,j], y_train)[0] for j in range(X_train.shape[1])])

##Reverse sort, numpy array negation reverses the order
ranks = np.argsort((-corrs))

##Display top-9 and bot-5
rankings = [(f'{corrs[j]:.1f}', df.columns[j]) for j in ranks]
display(rankings[:9])
display(rankings[-5:])

[('6572.4', 'cap-shape'),
 ('6560.4', 'cap-color'),
 ('6484.7', 'cap-surface'),
 ('6370.3', 'bruises'),
 ('6188.1', 'stalk-root'),
 ('6178.2', 'gill-color'),
 ('6146.5', 'gill-size'),
 ('6134.0', 'gill-spacing'),
 ('6125.9', 'gill-attachment')]

[('5462.3', 'stalk-color-below-ring'),
 ('5395.8', 'spore-print-color'),
 ('4979.4', 'veil-color'),
 ('4953.7', 'ring-type'),
 ('4898.6', 'ring-number')]

Wall time: 2.99 ms


In [52]:
%%time 

##drop the lowest scoring feature from the previous cell
df.drop(['ring-number'], axis=1, inplace=True)

##store the current feature names for later use
features = list(df)
feature_count = len(features)

##encode the data
label_encoder = preprocessing.LabelEncoder()
X = np.array(df.values)
X = X.flatten()
label_encoder.fit(X)
X_encoded = label_encoder.transform(X)
y_encoded = label_encoder.transform(y)
X_encoded = X_encoded.reshape(len(features), -1).T

##normalize
transformer = Normalizer().fit(X_encoded)
X_norm = transformer.transform(X_encoded)
      
##split the data 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_encoded, test_size=0.25, random_state=42)

print('\nNumber of samples: ', len(y_train))
print('Number of Features: ', feature_count)

##Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
acc = kfold_eval(rf_clf, X_norm, y_encoded)
print('\nRandom forest mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))


Number of samples:  4233
Number of Features:  20

Random forest mean-average accuracy:  0.657  Std:  0.061
Wall time: 9.97 s


Observations:

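- Removing the lowest scoring feature did not improve the random forest: mean accuracy went from 0.665 to 0.657 and the standard deviation rose from 0.023 to 0.061. The second run cross-validates on the full X_norm rather than X_train, so the two results are not strictly comparable.
- I tried this same approach with up to five of the lowest scoring features, but each attempt seemed to worsen performance.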
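
One caveat about the ranking itself: for two equal-length vectors, np.correlate returns their dot product, not a normalized correlation coefficient. That may explain why the scores above are in the thousands rather than in [-1, 1]. A minimal sketch with hypothetical values, comparing it to the Pearson coefficient from np.corrcoef:

```python
import numpy as np

# Hypothetical feature column and binary target.
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])
y = np.array([0, 0, 1, 1, 1, 0])

# For equal-length inputs, np.correlate (default mode 'valid') returns
# a single value: the dot product of the two vectors.
raw = np.correlate(x, y)[0]

# Pearson correlation coefficient, bounded in [-1, 1].
pearson = np.corrcoef(x, y)[0, 1]

print(raw, np.dot(x, y), pearson)
```

Ranking features by the absolute Pearson coefficient would be one alternative to try; whether it changes which feature ranks lowest on this dataset has not been checked here.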

## Report 10-fold CV performances of GaussianNB, linear SVC (use SVC(kernel='linear', probability=True)), MLPClassifier, and DecisionTreeClassifier with default parameters. Now report the RandomForestClassifier performance too.

In [53]:
%%time
## import scikit learn classifier base classes
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

##imports for multilayer perceptron
import warnings
from sklearn.exceptions import ConvergenceWarning

##convert encoded numpy array as a separate dataframe for further exploration
explore_df = pd.DataFrame(X_train, columns=features)
explore_df['class'] = y_train
print('\nFeatures: ', list(explore_df))

##create dictionary to hold classifier average performance
clf_performances = {'Model': [], 'Mean_Acc': [], 'Std': []}

##Naive Bayes
gb_clf = GaussianNB()
acc = kfold_eval(gb_clf, X_train, y_train)
##append to performance dict
clf_performances['Model'].append('NB')
clf_performances['Mean_Acc'].append(np.round(np.mean(acc), 3))
clf_performances['Std'].append( np.round(np.std(acc), 3))
print('\nNaive Bayes mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))

##Support Vector Classifier
svc_clf = SVC(kernel='linear', probability=True)
acc = kfold_eval(svc_clf, X_train, y_train)
##append to performance dict
clf_performances['Model'].append('SVC')
clf_performances['Mean_Acc'].append(np.round(np.mean(acc), 3))
clf_performances['Std'].append( np.round(np.std(acc), 3))
print('\nSupport Vector Classifier mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))

##Multilayer Perceptron
hid_layers = 3
mlp_clf = MLPClassifier() 
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    acc = kfold_eval(mlp_clf, X_train, y_train)
##append to performance dict
clf_performances['Model'].append('MLP')
clf_performances['Mean_Acc'].append(np.round(np.mean(acc), 3))
clf_performances['Std'].append( np.round(np.std(acc), 3))
print('\nMLP mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))

##Decision Tree
dt_clf = DecisionTreeClassifier()
acc = kfold_eval(dt_clf, X_train, y_train)
##append to performance dict
clf_performances['Model'].append('Decision Tree')
clf_performances['Mean_Acc'].append(np.round(np.mean(acc), 3))
clf_performances['Std'].append(np.round(np.std(acc), 3))
print('\nDecision Tree mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))

##Random Forest
rf_clf = RandomForestClassifier()
acc = kfold_eval(rf_clf, X_train, y_train)
##append to performance dict
clf_performances['Model'].append('Rand_Forest')
clf_performances['Mean_Acc'].append(np.round(np.mean(acc), 3))
clf_performances['Std'].append( np.round(np.std(acc), 3))
print('\nRandom forest mean-average accuracy: ', np.round(np.mean(acc), 3), 
      ' Std: ', np.round(np.std(acc), 3))


Features:  ['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-color', 'ring-type', 'spore-print-color', 'population', 'habitat', 'class']

Naive Bayes mean-average accuracy:  0.61  Std:  0.006

Support Vector Classifier mean-average accuracy:  0.614  Std:  0.001

MLP mean-average accuracy:  0.641  Std:  0.021

Decision Tree mean-average accuracy:  0.593  Std:  0.024

Random forest mean-average accuracy:  0.662  Std:  0.015
Wall time: 37.6 s


Observation:

The classifiers all scored somewhat low on performance. I think the reason has to do with the way I chose to encode the data: I encoded the category letters directly as integers instead of one-hot encoding them. Note that the scikit-learn documentation describes LabelEncoder as intended for target values (y) rather than input features (X). The direct method is documented on scikit-learn's page:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html .
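
A second possible contributing factor, offered as a hypothesis that I have not verified against the full dataset, is the flatten/reshape round trip in the preprocessing cells. NumPy flattens row by row, so reshaping with `reshape(len(features), -1).T` does not restore the original sample-by-feature layout. A minimal sketch on a hypothetical 3-by-2 array:

```python
import numpy as np

# Hypothetical 3-sample, 2-feature array standing in for df.values.
X_demo = np.array([['a', 'x'],
                   ['b', 'y'],
                   ['c', 'z']])
n_samples, n_features = X_demo.shape

flat = X_demo.flatten()  # row-major order: a, x, b, y, c, z

# Reshape pattern used in the preprocessing cells above.
as_in_notebook = flat.reshape(n_features, -1).T

# Reshape that inverts flatten().
round_trip = flat.reshape(-1, n_features)

print(as_in_notebook)
print(np.array_equal(round_trip, X_demo))
```

In the toy example the first pattern mixes values from different samples and features into the same row, while `reshape(-1, n_features)` recovers the input exactly.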

## Generate an ensemble of 100 classifiers for each of the four classifiers in Q1, stored as a list. Set the neural network hidden sizes to (3, 3), max iterations to 30, and tolerance to 1e-1. Set the decision tree parameters to max depth of 5 and max features of 5. We will evaluate these four ensemble classifiers. For each of the ensembles, report the performance of the first classifier in the ensemble.

In [54]:
'''
Define Base Classes for each of the classifiers prior to placing in list. This was done to avoid
editing any of the classifiers after instantiation and thereby enabling classifiers to be added to a 
list of preconfigured classifiers.
'''

class GaussNaiveBayesClf(GaussianNB):
    '''
    Description: Extends Naive Bayes Base Class
    '''
    def __init__(self):
        '''
        Description: Initialize parameters
        
        :params keyword: Name of classifier
        '''
        super().__init__()
        self.keyword = 'GaussNaiveBayesClf'
        
class SupportVectorClf(SVC):
    '''
    Description: Extends Support Vector Classifier Base Class
    '''
    def __init__(self):
        '''
        Description: Initialize parameters
        
        :params keyword: Name of classifier
        :params kernel: Kernel method to use
        :params probability: Enable probability estimates
        '''
        super().__init__()
        self.keyword = 'SupportVectorClf'
        self.kernel = 'linear'
        self.probability = True
        
class MultiLayerPerceptronClf(MLPClassifier):
    '''
    Description: Extends Multilayer Perceptron Base Class
    '''
    def __init__(self):
        '''
        Description: Initialize parameters
        
        :params keyword: Name of classifier
        :params tol: Train loss threshold only enabled when using SGD
        :params max_iter: Max Iterations to Train
        :params solver: Weight Optimizer
        :params hid_layers: Number of neurons in the single hidden layer
        :params hidden_layer_sizes: Number of neurons in the ith hidden layer
        '''
        super().__init__()
        self.keyword = 'MultiLayerPerceptonClf'
        self.tol=1e-1
        self.max_iter = 30
        self.solver = 'sgd'
        self.hid_layers = 5
        self.hidden_layer_sizes = (self.hid_layers,)
        
class DecisionTreeClf(DecisionTreeClassifier):
    '''
    Description: Extends Decision Tree Classifier Base Class
    '''
    def __init__(self):
        '''
        Description: Initialize parameters
        
        :params keyword: Name of classifier
        :params max_depth: Max depth that the decision tree build. If not pure, then will stop.
        :params max_features: Number of features to consider during best split calculation
        '''
        super().__init__()
        self.keyword = 'DecisionTreeClf'
        self.max_depth = 5
        self.max_features = 5
        
class RandForestClf(RandomForestClassifier):
    '''
    Description: Extends Random Forest Classifier Base Class
    '''
    def __init__(self):
        '''
        Description: Initialize parameters
        
        :params keyword: Name of classifier
        '''
        super().__init__()
        self.keyword = 'RandForestClf'
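
An alternative to subclassing, sketched here under the assumption that only constructor arguments need to be fixed in advance, is to build factories with functools.partial. The factory names are hypothetical, and the parameter values are the ones given in the assignment prompt rather than the ones set in the classes above.

```python
from functools import partial

from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Factories: calling one returns a fresh, preconfigured estimator.
# Values follow the assignment prompt; the names are hypothetical.
make_mlp = partial(MLPClassifier, hidden_layer_sizes=(3, 3), max_iter=30, tol=1e-1)
make_tree = partial(DecisionTreeClassifier, max_depth=5, max_features=5)

mlp = make_mlp()
tree = make_tree()

# Parameters passed through __init__ are visible to get_params(),
# which is what sklearn.base.clone relies on.
print(mlp.get_params()['hidden_layer_sizes'])
print(tree.get_params()['max_depth'])
```

A factory can be called the same way the classes are called in weak_fit (`_clf()`). It does not carry the `keyword` attribute, so `type(clf).__name__` would have to serve as the display name instead.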

In [55]:
%%time

from numpy.random import choice

##Reference the classifier classes; instances are created inside weak_fit
gb_clf = GaussNaiveBayesClf
svc_clf = SupportVectorClf
mlp_clf = MultiLayerPerceptronClf
dt_clf = DecisionTreeClf
rf_clf = RandForestClf

##add classifiers to list
classifiers = [gb_clf, svc_clf, mlp_clf, dt_clf, rf_clf]

##Note: weak_fit, weak_predict, features_randomsubset, and eval_singleweak are based on 
##module06_ensemble_notebook.html. They have been modified to fit this problem.
def weak_fit(_clf, _list_cols, _X, _y):
    '''Builds a single weak learner classifier and subset of features'''
    Xs = _X[:,_list_cols]
    return _clf().fit(Xs, _y)  # return a single fitted weak learner

def weak_predict(_clf, _list_cols, _X):
    '''Predicts using a single weak learner classifier'''
    Xs = _X[:,_list_cols]
    return _clf.predict(Xs), _clf.predict_proba(Xs)

def features_randomsubset(_M, _m, n_estimators=100):
    '''Returns a list of list of column choices - subset features'''
    return [choice(_M, _m, replace=True) for _ in range(n_estimators)]

def eval_singleweak(_clf, _X, _y, _niters, _nfeatures):
    '''Evaluates a single weak trained model'''
    acc = []
    for j in range(_niters):
        ##Keep the subset features (i.e. columns) the same for a 10-fold
        cols = features_randomsubset(_X.shape[1], _nfeatures, n_estimators=1)
        ##10-fold CV
        kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
        for train_index, test_index in kf.split(_X, _y):
            clf = weak_fit(_clf ,cols[0], _X[train_index], _y[train_index])
            y_pred, y_prob = weak_predict(clf, cols[0], _X[test_index])
            acc += [accuracy_score(_y[test_index], y_pred)]
    return np.array(acc)

##loop through classifiers and retrieve weak learner performance
n_iterations = 10
for classifier in tqdm(classifiers):
    ##weak learner performance
    acc = eval_singleweak(classifier, X_train, y_train, n_iterations, len(list(explore_df)))
    
    ##Summary
    print('Classifier Name: ', classifier().keyword)
    print(f'total #results= {len(acc)}')
    print(f'First Learner Acc = {acc[0]:.2f}')
    print(f'Weak learners average Acc = {np.mean(acc):.2f} {chr(177)}{np.std(acc):.3f}')
    print('\n--------------------------------------------------------------')

 20%|████████████████▊                                                                   | 1/5 [00:00<00:00,  7.50it/s]

Classifier Name:  GaussNaiveBayesClf
total #results= 100
First Learner Acc = 0.63
Weak learners average Acc = 0.60 ±0.019

--------------------------------------------------------------


 40%|█████████████████████████████████▌                                                  | 2/5 [02:02<03:35, 71.94s/it]

Classifier Name:  SupportVectorClf
total #results= 100
First Learner Acc = 0.62
Weak learners average Acc = 0.61 ±0.001

--------------------------------------------------------------


 60%|██████████████████████████████████████████████████▍                                 | 3/5 [02:05<01:21, 40.65s/it]

Classifier Name:  MultiLayerPerceptonClf
total #results= 100
First Learner Acc = 0.62
Weak learners average Acc = 0.59 ±0.057

--------------------------------------------------------------


 80%|███████████████████████████████████████████████████████████████████▏                | 4/5 [02:06<00:24, 24.85s/it]

Classifier Name:  DecisionTreeClf
total #results= 100
First Learner Acc = 0.64
Weak learners average Acc = 0.63 ±0.021

--------------------------------------------------------------


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [03:35<00:00, 43.03s/it]

Classifier Name:  RandForestClf
total #results= 100
First Learner Acc = 0.66
Weak learners average Acc = 0.63 ±0.031

--------------------------------------------------------------
Wall time: 3min 35s





## Write a function ensemble_fit() to receive the ensemble (i.e. one of the lists in Q2) and train each classifier on a subset of the training data (e.g. random.sample can generate a subset). Each classifier will therefore see a different subset of the training dataset, which is also called subsampling the input data for training. (Use all features in the subsample.)

In [56]:
%%time

from sklearn.ensemble import BaggingClassifier

def get_feat_subset(_list_cols: list, _X: np.array, _y: np.array) -> tuple:
    '''
    Gets subset of features
    
    :params _list_cols: Column indices of the features to keep
    :params _X: Train dataset
    :params _y: Train target
    '''
    Xs = _X[:,_list_cols]
    return (Xs, _y)

##Note: Based on module06_ensemble_notebook.html but modified to fit this problem
def ensemble_fit(_clf: object, _ensemble_cols: np.array, _X: np.array, _y: np.array, 
                 bag_ratio: float) -> object:
    '''
    Generate numerous trained classifiers as weak learners
    
    :params _clf: Classifier class to instantiate for each weak learner
    :params _ensemble_cols: List of feature column indices, one entry per weak learner
    :params _X: Train dataset
    :params _y: Train target
    :params bag_ratio: Subsample dataset ratio for each learner
    '''
    # the list of ensemble columns have a column list for every member of the ensemble
    n_estimators = len(_ensemble_cols)
    # list of weak learners
    ensemble_clf = []
    for j in range(n_estimators):
        Xs, y = get_feat_subset(_ensemble_cols[j], _X, _y)
        b_clf = BaggingClassifier(_clf(), max_samples=bag_ratio)
        ensemble_clf += [b_clf.fit(Xs, y)]
    return ensemble_clf

Wall time: 0 ns
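
The function above delegates the subsampling to BaggingClassifier through max_samples. For comparison, drawing the row subset directly, which is closer to the random.sample hint in the prompt, could look like the sketch below. The helper name subsample_rows and the demo arrays are hypothetical.

```python
import numpy as np

def subsample_rows(X, y, ratio, rng):
    """Return a random row subset of (X, y); at least one row is kept."""
    n_rows = X.shape[0]
    n_keep = max(1, int(ratio * n_rows))
    idx = rng.choice(n_rows, size=n_keep, replace=False)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X_demo = np.arange(200).reshape(100, 2)   # hypothetical data
y_demo = np.arange(100) % 2

Xs, ys = subsample_rows(X_demo, y_demo, ratio=0.1, rng=rng)
print(Xs.shape, ys.shape)
```

With very small ratios a subset may contain only one class, so a classifier trained on it can only ever predict that class; a stratified draw would be one way around that.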


## Write a function ensemble_predict() to receive the trained ensemble (i.e. one of the lists in Q3) and test on the input. Use a voting scheme such as a histogram on the predictions returned by c.predict() for each of the weak classifiers. The final prediction should be the np.argmax() of those counts. (Note that c.predict_proba() should have better results.)

In [57]:
%%time

##Note: Based on module06_ensemble_notebook.html but modified to fit this problem.
from collections import defaultdict

##modified 
def ensemble_predict(_ensemble_clf, _ensemble_cols, _Xtest) -> np.array:
    
    '''
    Uses trained ensemble to predict the outcome by soft voting (summed class probabilities)
    
    :params _ensemble_clf: List of trained weak learners
    :params _ensemble_cols: Ensemble columns (feature indices per learner)
    :params _Xtest: Test Data
    '''
    n_estimators = len(_ensemble_clf)
    
    ##Error check
    assert n_estimators==len(_ensemble_cols)  
    
    #weak learner predictions
    ypred_e, yprob_e = [], []
    for j in range(n_estimators):
        res = weak_predict(_ensemble_clf[j], _ensemble_cols[j], _Xtest)
        
        ##score/probability of the prediction
        ypred_e += [res[0]]
        yprob_e += [res[1]] 
        
        ##DEBUG
        #print('prediction: ', res[0])
        
    ##soft voting for each data point in _Xtest: sum class probabilities across learners
    ypred = []
    for i in range(_Xtest.shape[0]):
        ypred_scores = defaultdict(float)
        for j in range(n_estimators):
            for c, p in enumerate(yprob_e[j][i]):
                ypred_scores[c] += p
        ix = max(ypred_scores.items(), key=lambda a: a[1])
        ypred += [ix[0]]
    
    ##convert to numpy and map class index 0 to 4 and 1 to 13, the integer codes of the two class labels in y_encoded
    ypred = np.array(ypred)
    ypred =np.where(ypred == 0, 4, 13)
    return ypred

Wall time: 0 ns
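
The hard-coded mapping to 4 and 13 can be avoided by reading the labels from the fitted learners. The sketch below is a minimal soft-voting example on hypothetical data, assuming every learner saw both classes so that all of them share the same classes_ array.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the label depends only on the first feature.
# Labels are coded 4 and 13 to mimic y_encoded.
X_demo = np.column_stack([np.linspace(0, 1, 60), np.tile([0.0, 1.0], 30)])
y_demo = np.where(X_demo[:, 0] > 0.5, 13, 4)

# Three shallow trees, each fit on a different third of the rows.
learners = []
for start in range(3):
    idx = np.arange(start, 60, 3)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    learners.append(tree.fit(X_demo[idx], y_demo[idx]))

# Soft vote: sum class probabilities over the learners.
proba_sum = sum(clf.predict_proba(X_demo) for clf in learners)

# Columns of predict_proba follow clf.classes_, so the winning column
# index can be mapped back to the original label.
y_vote = learners[0].classes_[np.argmax(proba_sum, axis=1)]

print(np.unique(y_vote))
```

If a learner was trained on a subset containing a single class, its predict_proba output has only one column and the sum above would no longer line up; that case needs separate handling.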


## Report 10-fold CV performances of the ensembles with a subsample ratio of 0.1. Compare to a regular decision tree (same subsample ratio). Now repeat these for a subsample ratio of 0.001.

In [10]:
%%time

##Note: Based on module06_ensemble_notebook.html but modified to fit this problem.
def eval_ensemble(_clf: object, _X: np.array, _y: np.array, _niter: int, 
                  _n_estimators: int, _nfeatures: int, bag_ratio: float) -> np.array:
    
    '''
    Desc: 10-fold CV using ensemble_fit, ensemble_predict
    
    :params _clf: Classifier to use for ensemble training
    :params _X: Train dataset
    :params _y: Train target
    :params _niter: Number of iterations to train
    :params _n_estimators: Number of weak learners
    :params _nfeatures: Number of features per learner
    :params bag_ratio: Subsample dataset ratio for each learner
    '''
    acc = []
    for i in range(_niter):
        ##Keep subset features, columns same for a 10-fold
        cols = features_randomsubset(_X.shape[1], _nfeatures, n_estimators=_n_estimators)
        ##10-fold CV
        kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
        for train_index, test_index in kf.split(_X, _y):
            e_clf = ensemble_fit(_clf, cols, _X[train_index], _y[train_index], bag_ratio)
            y_pred = ensemble_predict(e_clf, cols, _X[test_index])
            acc += [accuracy_score(_y[test_index], y_pred)]
    return np.array(acc)

##Reference the classifier classes; instances are created inside ensemble_fit
e_gb_clf = GaussNaiveBayesClf
e_svc_clf = SupportVectorClf
e_mlp_clf = MultiLayerPerceptronClf
e_dt_clf = DecisionTreeClf

e_classifiers = [e_gb_clf, e_svc_clf, e_mlp_clf, e_dt_clf]

##hyperparameters
n_iterations = 10
n_estimators = 100
n_features = feature_count
bag_ratios = [0.01, 0.1]

e_clf_perf = {'Model': [], 'Mean_Acc': [], 'Std': [], 'Bag_Ratio':[]}
##Measure ensemble weak learners performance
for bag_ratio in tqdm(bag_ratios):
    for classifier in e_classifiers:
        acc = eval_ensemble(classifier, X_train, y_train, n_iterations, n_estimators, n_features, bag_ratio)
        print('Classifier Name: ', classifier().keyword)
        print('Subsample Ratio: ', bag_ratio)
        print(f'total #results= {len(acc)}')
        print(f'Ensemble learners average Acc= {np.mean(acc):.2f} {chr(177)}{np.std(acc):.3f}')
        print('\n--------------------------------------------------------------')

  0%|                                                                                            | 0/2 [00:00<?, ?it/s]

Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.53 ±0.198

--------------------------------------------------------------


 50%|█████████████████████████████████████████▌                                         | 1/2 [12:19<12:19, 739.68s/it]

Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.59 ±0.008

--------------------------------------------------------------
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.61 ±0.025

--------------------------------------------------------------


100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [56:54<00:00, 1707.07s/it]

Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.61 ±0.003

--------------------------------------------------------------
Wall time: 56min 54s





In [11]:
'''Any subsample ratio below 0.01 raises an error that "zero weights cannot be normalized." 
For that reason, I had to increase the smallest subsample ratio to 0.01 instead of 0.001.'''

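
I have not traced the error, but simple arithmetic gives a sense of how little data each learner receives at the smallest ratios. The sketch assumes, as an approximation, that a float ratio is converted to a row count by truncation, and it uses the 4233 training rows reported earlier with nine tenths of them available in each fold.

```python
# Rows each learner would see, assuming (as an approximation) that a
# float ratio is turned into a row count by truncation: int(ratio * n_rows).
n_train = 4233                      # training rows reported earlier
n_fold = n_train - n_train // 10    # about nine tenths are used per fold

ratios = [0.0005, 0.001, 0.005, 0.01, 0.03, 0.05, 0.1]
rows_per_learner = {r: int(r * n_fold) for r in ratios}

for r, n in rows_per_learner.items():
    print(f'ratio={r:<7} rows per learner ~ {n}')
```

With only a handful of rows, a subsample can easily contain a single class. That is one plausible, unverified trigger for the error at ratios below 0.01.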

## Report and plot 10-fold CV performances of the ensembles for the training subsample ratios of (0.0005, 0.001, 0.005, 0.01, 0.03, 0.05, 0.1) on the same graph. Add the regular classifiers to the plot with the same subsample ratios. (Hint: pass the regular classifier to the same ensemble CV in a list of one element. The same script can be used for this entire step.) Report your detailed observations.

In [29]:
##Reference the classifier classes; instances are created inside ensemble_fit
e_gb_clf = GaussNaiveBayesClf
e_svc_clf = SupportVectorClf
e_mlp_clf = MultiLayerPerceptronClf
e_dt_clf = DecisionTreeClf

e_classifiers = [e_gb_clf, e_svc_clf, e_mlp_clf, e_dt_clf]

##hyperparameters
n_iterations = 10
n_estimators = 100
n_features = len(list(explore_df))

'''
Note: The bag ratios used are all at least 0.01 because of the "zero weights cannot be normalized" 
error encountered with smaller ratios. I think this has to do with scikit-learn's bagging method, 
but I would need to spend more time on that to be certain.
'''

bag_ratios = [0.01, 0.03, 0.05, 0.1]

e_clf_perf = {'Model': [], 'Mean_Acc': [], 'Std': [], 'Bag_Ratio':[]}
##Measure ensemble weak learners performance
for bag_ratio in tqdm(bag_ratios):
    for classifier in e_classifiers:
    
        acc = eval_ensemble(classifier, X_train, y_train, n_iterations, n_estimators, n_features, bag_ratio)

        ##record accuracy scores
        e_classifier = 'Ensamble_' + classifier().keyword
        e_clf_perf['Model'].append(e_classifier)
        e_clf_perf['Mean_Acc'].append(np.round(np.mean(acc), 3))
        e_clf_perf['Std'].append(np.round(np.std(acc), 3))
        e_clf_perf['Bag_Ratio'].append(bag_ratio)
        
        print('\n- Ensamble Classifier Performance on Subsampled Data -')
        print('Classifier Name: ', classifier().keyword)
        print('Subsample Ratio: ', bag_ratio)
        print(f'total #results= {len(acc)}')
        print(f'Ensemble learners average Acc= {np.mean(acc):.2f} {chr(177)}{np.std(acc):.3f}')
        print('\n--------------------------------------------------------------')

  0%|                                                                                            | 0/4 [00:00<?, ?it/s]


- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.61 ±0.002

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.50 ±0.212

--------------------------------------------------------------


 25%|████████████████████▊                                                              | 1/4 [12:43<38:10, 763.57s/it]


- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.01
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.03
total #results= 100
Ensemble learners average Acc= 0.60 ±0.008

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.03
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.03
total #results= 100
Ensemble learners average Acc= 0.51 ±0.209

--------------------------------------------------------------


 50%|█████████████████████████████████████████▌                                         | 2/4 [29:38<30:22, 911.35s/it]


- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.03
total #results= 100
Ensemble learners average Acc= 0.61 ±0.002

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.05
total #results= 100
Ensemble learners average Acc= 0.60 ±0.007

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.05
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.05
total #results= 100
Ensemble learners average Acc= 0.47 ±0.241

--------------------------------------------------------------


 75%|█████████████████████████████████████████████████████████████▌                    | 3/4 [52:22<18:38, 1118.14s/it]


- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.05
total #results= 100
Ensemble learners average Acc= 0.61 ±0.003

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.59 ±0.010

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------

- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.61 ±0.061

--------------------------------------------------------------


100%|████████████████████████████████████████████████████████████████████████████████| 4/4 [1:37:11<00:00, 1457.87s/it]


- Ensamble Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.1
total #results= 100
Ensemble learners average Acc= 0.61 ±0.003

--------------------------------------------------------------





In [30]:
%%time

'''Subsampling with same ratios of regular classifiers'''

##Kfold function
def kfold_eval_bagging(_clf: object, X: np.ndarray, y: np.ndarray, bag_ratio: float) -> np.ndarray:
    '''
    Desc: Stratified 10-fold evaluation with subsampling
    
    :param _clf: Classifier class; instantiated in each fold as the base estimator
    :param X: Train dataset
    :param y: Train target
    :param bag_ratio: Subsample dataset ratio for each learner
    :return: Array with one accuracy score per fold
    
    '''
    ## accuracy bookkeeping
    acc = []
    ## stratified split
    kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
    ##iterate through train and test sets
    for train_index, test_index in kf.split(X, y):
        b_clf = BaggingClassifier(_clf(), max_samples=bag_ratio)
        b_clf.fit(X[train_index], y[train_index])
        y_pred = b_clf.predict(X[test_index])
        acc += [accuracy_score(y[test_index], y_pred)]
    return np.array(acc)

##Collect classifier classes (instantiated inside kfold_eval_bagging)
gb_clf = GaussNaiveBayesClf
svc_clf = SupportVectorClf
mlp_clf = MultiLayerPerceptronClf
dt_clf = DecisionTreeClf
classifiers = [gb_clf, svc_clf, mlp_clf, dt_clf]

##hyperparameters
n_features = len(list(explore_df))
bag_ratios = [0.01, 0.03, 0.05, 0.1]

clf_perf = {'Model': [], 'Mean_Acc': [], 'Std': [], 'Bag_Ratio':[]}
##Measure ensemble weak learners performance
for bag_ratio in tqdm(bag_ratios):
    for classifier in classifiers:
        acc = kfold_eval_bagging(classifier, X_train, y_train, bag_ratio)

        ##record accuracy scores
        clf_perf['Model'].append(classifier().keyword)
        clf_perf['Mean_Acc'].append(np.round(np.mean(acc), 3))
        clf_perf['Std'].append(np.round(np.std(acc), 3))
        clf_perf['Bag_Ratio'].append(bag_ratio)
        
        print('\n- Regular Classifier Performance on Subsampled Data -')
        print('Classifier Name: ', classifier().keyword)
        print('Subsample Ratio: ', bag_ratio)
        print(f'total #results= {len(acc)}')
        print(f'Ensemble learners average Acc= {np.mean(acc):.2f} {chr(177)}{np.std(acc):.3f}')
        print('\n--------------------------------------------------------------')

  0%|                                                                                            | 0/4 [00:00<?, ?it/s]


- Regular Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.01
total #results= 10
Ensemble learners average Acc= 0.59 ±0.026

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.01
total #results= 10
Ensemble learners average Acc= 0.61 ±0.003

--------------------------------------------------------------


 25%|█████████████████████                                                               | 1/4 [00:00<00:02,  1.50it/s]


- Regular Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.01
total #results= 10
Ensemble learners average Acc= 0.52 ±0.112

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.01
total #results= 10
Ensemble learners average Acc= 0.57 ±0.025

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.03
total #results= 10
Ensemble learners average Acc= 0.59 ±0.023

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.03
total #results= 10
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------


 50%|██████████████████████████████████████████                                          | 2/4 [00:01<00:01,  1.28it/s]


- Regular Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.03
total #results= 10
Ensemble learners average Acc= 0.50 ±0.097

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.03
total #results= 10
Ensemble learners average Acc= 0.59 ±0.017

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.05
total #results= 10
Ensemble learners average Acc= 0.60 ±0.016

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.05
total #results= 10
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------


 75%|███████████████████████████████████████████████████████████████                     | 3/4 [00:02<00:00,  1.05it/s]


- Regular Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.05
total #results= 10
Ensemble learners average Acc= 0.58 ±0.065

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.05
total #results= 10
Ensemble learners average Acc= 0.60 ±0.014

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  GaussNaiveBayesClf
Subsample Ratio:  0.1
total #results= 10
Ensemble learners average Acc= 0.60 ±0.015

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  SupportVectorClf
Subsample Ratio:  0.1
total #results= 10
Ensemble learners average Acc= 0.61 ±0.001

--------------------------------------------------------------


100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.27s/it]


- Regular Classifier Performance on Subsampled Data -
Classifier Name:  MultiLayerPerceptonClf
Subsample Ratio:  0.1
total #results= 10
Ensemble learners average Acc= 0.53 ±0.100

--------------------------------------------------------------

- Regular Classifier Performance on Subsampled Data -
Classifier Name:  DecisionTreeClf
Subsample Ratio:  0.1
total #results= 10
Ensemble learners average Acc= 0.62 ±0.023

--------------------------------------------------------------
Wall time: 5.1 s
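
For a sense of scale: scikit-learn documents a float `max_samples` as drawing `max_samples * X.shape[0]` samples for each base learner. The sketch below estimates how many rows that is inside one of the 10 folds. It assumes a hypothetical training set of 4,000 rows (the real size of `X_train` is not shown in this cell), and it assumes the product is truncated to a whole number; `samples_per_learner` is an illustrative helper, not part of the notebook.

```python
def samples_per_learner(n_train_rows: int, bag_ratio: float, n_splits: int = 10) -> int:
    """Approximate number of rows each bagged learner sees inside one CV fold."""
    # each fold trains on (n_splits - 1) / n_splits of the available rows
    fold_train_rows = n_train_rows * (n_splits - 1) // n_splits
    # assumption: the float ratio is truncated to a whole number of rows
    return int(bag_ratio * fold_train_rows)


n_rows = 4000  # hypothetical size of X_train
for ratio in [0.0001, 0.001, 0.01, 0.03, 0.05, 0.1]:
    print(f"bag_ratio={ratio}: {samples_per_learner(n_rows, ratio)} rows per learner")
```

Under these assumptions, ratios well below 0.01 leave each learner with only a handful of rows, or none at all, which may be too few to contain both classes.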





In [48]:
reg_df = pd.DataFrame(clf_perf)
e_df = pd.DataFrame(e_clf_perf)
##DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
results_df = pd.concat([reg_df, e_df])

import plotly.express as px

fig = px.scatter(results_df, x="Model", y="Mean_Acc", 
                 size="Std", color='Bag_Ratio', 
                 width=900, height=800, size_max=100, 
                 color_continuous_scale=px.colors.sequential.Viridis)
fig.show()

Plot Explanation:

- The interactive plot shows how the mean accuracy varies with the subsample ratio (bag_ratio) for each model. Hovering over a data point opens a window with all available data about that point: model, mean accuracy, std, and bag_ratio.


- The y-axis is the mean accuracy, and the x-axis is the model. The color encodes the subsample ratio, and the marker size encodes the standard deviation of the accuracy, used here as a proxy for generalizability. The smaller the data point, the more consistent (better generalized) the model; the larger the data point, the less generalized the model.
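
Since the regular and ensemble runs appear to report the same `Model` names, it can help to tag each row with its origin before plotting or comparing. A minimal sketch, assuming a hypothetical `Kind` column and using the two-decimal decision tree accuracies printed in the logs above as stand-ins for the actual `clf_perf` and `e_clf_perf` dictionaries:

```python
import pandas as pd

bag_ratios = [0.01, 0.03, 0.05, 0.1]

# stand-in for pd.DataFrame(clf_perf): regular decision tree, values read from the logs
reg_df = pd.DataFrame({
    "Model": "DecisionTreeClf",
    "Mean_Acc": [0.57, 0.59, 0.60, 0.62],
    "Std": [0.025, 0.017, 0.014, 0.023],
    "Bag_Ratio": bag_ratios,
})

# stand-in for pd.DataFrame(e_clf_perf): ensemble decision tree, values read from the logs
e_df = pd.DataFrame({
    "Model": "DecisionTreeClf",
    "Mean_Acc": [0.61, 0.61, 0.61, 0.61],
    "Std": [0.001, 0.002, 0.003, 0.003],
    "Bag_Ratio": bag_ratios,
})

# "Kind" is a hypothetical column that records which run a row came from
results_df = pd.concat(
    [reg_df.assign(Kind="regular"), e_df.assign(Kind="ensemble")],
    ignore_index=True,
)

# side-by-side view: one row per (model, ratio), one column per run
comparison = results_df.pivot_table(
    index=["Model", "Bag_Ratio"], columns="Kind", values=["Mean_Acc", "Std"]
)
print(comparison)
```

With the extra column in place, passing `symbol="Kind"` to `px.scatter` would be one way to keep the two runs visually distinct in the plot.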

Observations:

- Naive Bayes benefited from ensembling: it performed slightly better in both accuracy and standard deviation. The takeaway is that an ensemble of naive Bayes learners performs slightly better than a regular naive Bayes classifier, and it is also more generalized.


- The support vector machine did not seem to benefit from ensembling, since the accuracy is nearly the same (about 0.61 either way). The standard deviation is slightly lower, however.


- The ensemble MLP scored noticeably higher at a bag_ratio of 0.1 (0.61 ±0.061 versus 0.53 ±0.100 for the regular MLP), but was about the same or worse at the other bag_ratios, where its standard deviation was roughly twice as high (about 0.21 to 0.24 versus 0.07 to 0.11). In other words, a bag_ratio of 0.1 seemed to work well for the ensemble, but the regular MLP generally outperformed the ensemble MLP.


- The decision tree's standard deviation dropped sharply when ensembled (from between 0.014 and 0.025 down to between 0.001 and 0.003), and its accuracy was generally higher. I think the decision tree definitely benefits and seems to be a good candidate for ensembling methods.


- A bagging ratio of 0.1 tended to coincide with the higher-performing models, followed by 0.05.


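- Training took longer at higher bagging ratios: according to the progress bar, the ensemble run went from roughly 13 minutes for the first ratio to roughly 45 minutes for the last.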


- Bagging ratios < 0.01 raised a "divide by zero - normalization" error, most likely due to the subsample being too small for scikit-learn's bagging method.


- The way that I chose to encode the data (a single label encoder fitted over all values, followed by normalization) may have contributed to the generally low scores; I will likely go with one-hot encoding next time.


- This was a fun assignment.
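
As a follow-up to the encoding remark above, here is a minimal sketch of the one-hot route with `pd.get_dummies`, assuming a tiny made-up frame in place of `mushroom_dataset.csv` (the values are illustrative, not rows from the dataset). It also checks one detail of the flatten-and-reshape step in the preprocessing cell: a row-major `flatten()` is undone by `reshape(-1, n_features)`, whereas `reshape(n_features, -1).T` yields a different arrangement, so feature values may have ended up shuffled across samples.

```python
import numpy as np
import pandas as pd

# illustrative stand-in for the mushroom data (not real rows)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "f"],
    "odor": ["p", "a", "l", "p"],
})

# binary target: 1 = poisonous, 0 = edible
y = (toy["class"] == "p").astype(int).to_numpy()

# one indicator column per (feature, category) pair
X_onehot = pd.get_dummies(toy.drop(columns="class"))
print(X_onehot.columns.tolist())

# reshape check on the raw nominal values
values = toy.drop(columns="class").to_numpy()    # shape (n_samples, n_features)
n_features = values.shape[1]
flat = values.flatten()                           # row-major: sample by sample
restored = flat.reshape(-1, n_features)           # recovers the original layout
scrambled = flat.reshape(n_features, -1).T        # same shape, different layout

print("reshape(-1, n_features) matches original:", np.array_equal(restored, values))
print("reshape(n_features, -1).T matches original:", np.array_equal(scrambled, values))
```

One-hot encoding avoids imposing an arbitrary order on the categories, and because `pd.get_dummies` works column by column, no flatten-and-reshape step is needed.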