# 04_benchmarking

This notebook compares the baseline performance of 12 algorithms on the SEO-Effect dataset. Performance is evaluated with the following metrics: Accuracy, ROC AUC (One-vs-One), Macro Precision, Macro Recall and Macro F1, as well as fit time and score time.

The algorithms are selected from the ["scikit-learn algorithm cheat-sheet"](https://scikit-learn.org/stable/_static/ml_map.png) to compare a broad spectrum of different classification methods.

Custom parameters are kept to a minimum: <code>max_iter</code> is limited to 100, <code>max_depth</code> to 10, <code>n_neighbors</code> is set to 4 and <code>outlier_label</code> to 5. The <code>penalty</code> option is set to <code>l2</code>, and the random state of the cross-validation split is set to 22. These values are set either to limit run time or because the parameter is required.

The algorithms are cross-validated on five stratified shuffle splits with a test size of 66%. The mean and the standard deviation of each metric are saved to a dataframe and stored in <code>output/benchmarking_results_1.csv</code>.

Six of the 12 algorithms perform very well, with an accuracy of at least 90% and an F1 of at least 75%. These are (from best to worst performance): GradientBoosting, RandomForest, ExtraTrees, DecisionTree, LinearSVC and GaussianNB. However, the fit and training times of the GradientBoostingClassifier are 750 times longer than those of the Gaussian Naive Bayes classifier. Gaussian Naive Bayes is known as a quick and reliable way to test classification performance and works well with the data used here. It will therefore be used to compare data preprocessing methods in the next section.

#### 0. Imports libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_validate, StratifiedShuffleSplit
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output

#### 1. Loads cleaned dataset as pandas dataframe

In [2]:
df = pd.read_csv('output/data_cleaned.csv')

# uncomment for SECOND BENCHMARKING
#df = pd.read_csv('output/data_cleaned_balanced.csv')

#### 2. Splits data into features and target

In [3]:
X = df.drop(columns=['seo class'])
y = df['seo class']

#### 3. Creates list of classifiers

In [4]:
# list of classifiers to compare
classifiers = {'AdaBoost': AdaBoostClassifier(),
               'BernoulliNB': BernoulliNB(),
               'DecisionTree': DecisionTreeClassifier(),
               'ExtraTrees': ExtraTreesClassifier(),
               'GaussianNB': GaussianNB(),
               'GradientBoosting': GradientBoostingClassifier(),
               'KNeighbors': KNeighborsClassifier(),
               'LinearSVC': LinearSVC(),
               'RadiusNeighbors': RadiusNeighborsClassifier(),
               'RandomForest': RandomForestClassifier(),
               'SGD': SGDClassifier(),
               'SVC': SVC()}

#### 4. Creates dictionary of evaluation metrics

In [5]:
metrics = {'accuracy': 'accuracy',
           'precision': 'precision_macro', 
           'recall': 'recall_macro',
           'f1': 'f1_macro'}
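
The introduction lists ROC AUC (One-vs-One) among the metrics, while the dictionary above covers accuracy, precision, recall and F1. The following is a minimal sketch of how that metric could be added, assuming scikit-learn's built-in <code>roc_auc_ovo</code> scorer string and a synthetic three-class dataset in place of the SEO-Effect data. The names <code>X_demo</code>, <code>y_demo</code>, <code>metrics_with_auc</code> and <code>sss_demo</code> are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB

# synthetic three-class stand-in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=6, n_classes=3,
                                     random_state=22)

# same entries as the metrics dictionary, plus one-vs-one ROC AUC
metrics_with_auc = {'accuracy': 'accuracy',
                    'precision': 'precision_macro',
                    'recall': 'recall_macro',
                    'f1': 'f1_macro',
                    'roc_auc': 'roc_auc_ovo'}

sss_demo = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)

# GaussianNB provides predict_proba, which the ROC AUC scorer relies on
cv_demo = cross_validate(GaussianNB(), X_demo, y_demo,
                         scoring=metrics_with_auc, cv=sss_demo)

# one array of five scores per metric, stored under 'test_<name>'
print(sorted(cv_demo.keys()))
print(cv_demo['test_roc_auc'].mean())
```

This scorer works on predicted class probabilities, so classifiers without <code>predict_proba</code> (LinearSVC, for example) would likely need separate handling.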

#### 5. Sets minimal parameters to make sure algorithms function

In [6]:
params = {'max_iter': 100,
          'max_depth' : 10,
          'penalty': 'l2',
          'n_neighbors': 4,
          'outlier_label': 5}

#### 6. Creates stratified split for cross validation

In [7]:
sss = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)
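
To illustrate what the split above does, here is a small sketch on hypothetical, imbalanced toy labels (<code>X_toy</code> and <code>y_toy</code> are made up for this example). It prints the train and test sizes of each of the five splits and the class shares in the training part, which should stay close to the overall class distribution.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# hypothetical labels: 70% class 0, 20% class 1, 10% class 2
y_toy = np.array([0] * 700 + [1] * 200 + [2] * 100)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

sss_toy = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)

split_sizes = []
train_shares = []
for train_idx, test_idx in sss_toy.split(X_toy, y_toy):
    split_sizes.append((len(train_idx), len(test_idx)))
    # share of each class within the training part of this split
    share = np.bincount(y_toy[train_idx]) / len(train_idx)
    train_shares.append(share)
    print(len(train_idx), len(test_idx), share.round(2))
```

With a test size of 66%, each classifier is fitted on roughly a third of the data and scored on the remaining two thirds.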

#### 7. Iterates over classifiers to compare results

In [8]:
cv_results = {}

for name, clf in classifiers.items():
    # display current classifier
    # to show progress while code is running
    clear_output()
    print('Current classifier: %s' % (name))
    
    # get parameter options for current classifier
    clf_params = clf.get_params()
    
    # select matching parameters for current classifier from params
    c_params = {}
    for p in params.keys():
        if p in clf_params.keys():
            c_params[p] = params[p]
    
    # set parameters
    if c_params:
        clf.set_params(**c_params)
    
    # cross validate classifier
    cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)
    # save results of cross validation
    cv_results[name] = cv

Current classifier: SVC
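
The parameter matching inside the loop can be looked at in isolation. The sketch below uses a hypothetical helper, <code>matching_params</code>, to show which of the shared parameters three of the classifiers would actually receive. It relies only on <code>get_params()</code>, which every scikit-learn estimator provides.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# same values as the params dictionary in step 5
shared = {'max_iter': 100, 'max_depth': 10, 'penalty': 'l2',
          'n_neighbors': 4, 'outlier_label': 5}

def matching_params(clf, shared_params):
    """Return the subset of shared_params that clf accepts (hypothetical helper)."""
    accepted = clf.get_params()
    return {k: v for k, v in shared_params.items() if k in accepted}

for clf in (DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()):
    print(type(clf).__name__, matching_params(clf, shared))
# DecisionTreeClassifier {'max_depth': 10}
# KNeighborsClassifier {'n_neighbors': 4}
# GaussianNB {}
```

Classifiers that accept none of the shared parameters, such as GaussianNB, simply run with their defaults.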


#### 8. Stores results in dataframe

In [9]:
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)

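# column names for dataframe
# (taken from the first stored result; all results share the same keys)
columns = ['classifier']
for k in next(iter(cv_results.values())).keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')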
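
The flattening step above can be tried on a small, made-up result. The sketch below assumes a hypothetical <code>fake_results</code> dictionary shaped like the output of <code>cross_validate</code> (one array per key) and builds the same mean and standard deviation columns from it.

```python
import numpy as np
import pandas as pd

# hypothetical cross_validate output for a single classifier
fake_results = {'Demo': {'fit_time': np.array([1.0, 3.0]),
                         'score_time': np.array([0.5, 0.5]),
                         'test_f1': np.array([0.8, 0.9])}}

rows = []
for name, res in fake_results.items():
    row = {'classifier': name}
    for key, values in res.items():
        key = key.replace('test_', '')    # 'test_f1' becomes 'f1'
        row[key + '_mean'] = values.mean()
        row[key + '_std'] = values.std()  # population standard deviation
    rows.append(row)

demo_df = pd.DataFrame(rows)
print(list(demo_df.columns))
# ['classifier', 'fit_time_mean', 'fit_time_std', 'score_time_mean',
#  'score_time_std', 'f1_mean', 'f1_std']
```

Building each row as a dictionary keeps values and column names together, so the column list does not have to be assembled separately.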

#### 9. Saves results to csv

In [10]:
results = pd.DataFrame(data, columns=columns)
results.to_csv('output/benchmarking_results_2.csv')

#### 10. Results overview

In [11]:
# sorted by f1 mean
results.sort_values(by=['f1_mean'], ascending=False)

| | classifier | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | DecisionTree | 1.561036 | 0.022256 | 0.709853 | 0.010191 | 0.999874 | 1.1e-05 | 0.999877 | 1.1e-05 | 0.999871 | 1.2e-05 | 0.999874 | 1.1e-05 |
| 5 | GradientBoosting | 775.890066 | 43.679336 | 8.765686 | 0.353183 | 0.999843 | 1.8e-05 | 0.999845 | 1.8e-05 | 0.999841 | 1.7e-05 | 0.999843 | 1.8e-05 |
| 9 | RandomForest | 24.903184 | 0.667747 | 5.059198 | 0.130884 | 0.998124 | 0.00028 | 0.998088 | 0.000284 | 0.998069 | 0.000288 | 0.998073 | 0.000289 |
| 0 | AdaBoost | 19.863881 | 0.121957 | 5.967922 | 0.113124 | 0.992864 | 0.012768 | 0.99307 | 0.012368 | 0.992667 | 0.013149 | 0.992729 | 0.013038 |
| 3 | ExtraTrees | 11.82369 | 0.092173 | 4.582094 | 0.060801 | 0.991138 | 0.001148 | 0.991155 | 0.00114 | 0.990878 | 0.001181 | 0.990956 | 0.001182 |
| 4 | GaussianNB | 0.754486 | 0.01979 | 1.783671 | 0.074396 | 0.989987 | 0.000194 | 0.989883 | 0.000197 | 0.989842 | 0.000196 | 0.989816 | 0.000198 |
| 7 | LinearSVC | 24.671043 | 0.917064 | 0.689216 | 0.012664 | 0.984101 | 0.013518 | 0.98509 | 0.012024 | 0.984142 | 0.013164 | 0.984073 | 0.013453 |
| 1 | BernoulliNB | 0.790493 | 0.137142 | 1.049855 | 0.015084 | 0.959448 | 0.000288 | 0.961539 | 0.000269 | 0.95839 | 0.000297 | 0.957928 | 0.000301 |
| 10 | SGD | 20.641258 | 1.931221 | 0.834285 | 0.058309 | 0.933718 | 0.031441 | 0.936574 | 0.026737 | 0.932289 | 0.032241 | 0.931083 | 0.034933 |
| 6 | KNeighbors | 0.556671 | 0.043652 | 2908.589405 | 88.185894 | 0.873263 | 0.000242 | 0.871494 | 0.000268 | 0.86955 | 0.000249 | 0.867231 | 0.000259 |


In [12]:
# get standard dev columns to remove from df display
std_c = [c for c in results.columns if '_std' in c]

# set filters to narrow down results
# Filters: F1 > 75% and Accuracy > 95%
filter_ = (results['f1_mean'] > 0.75) & (results['accuracy_mean'] > 0.95)

# filter results by f1 > 75% and accuracy > 95%, sort by fit time
results[filter_].sort_values(by=['fit_time_mean']).drop(columns=std_c)

| | classifier | fit_time_mean | score_time_mean | accuracy_mean | precision_mean | recall_mean | f1_mean |
|---|---|---|---|---|---|---|---|
| 4 | GaussianNB | 0.754486 | 1.783671 | 0.989987 | 0.989883 | 0.989842 | 0.989816 |
| 1 | BernoulliNB | 0.790493 | 1.049855 | 0.959448 | 0.961539 | 0.95839 | 0.957928 |
| 2 | DecisionTree | 1.561036 | 0.709853 | 0.999874 | 0.999877 | 0.999871 | 0.999874 |
| 3 | ExtraTrees | 11.82369 | 4.582094 | 0.991138 | 0.991155 | 0.990878 | 0.990956 |
| 0 | AdaBoost | 19.863881 | 5.967922 | 0.992864 | 0.99307 | 0.992667 | 0.992729 |
| 7 | LinearSVC | 24.671043 | 0.689216 | 0.984101 | 0.98509 | 0.984142 | 0.984073 |
| 9 | RandomForest | 24.903184 | 5.059198 | 0.998124 | 0.998088 | 0.998069 | 0.998073 |
| 5 | GradientBoosting | 775.890066 | 8.765686 | 0.999843 | 0.999845 | 0.999841 | 0.999843 |
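
To put the fit times of this run into proportion, the short sketch below divides a few of the mean fit times from the table above by the GaussianNB value. The numbers are copied by hand from that table; <code>fit_time</code> and <code>ratios</code> are illustrative names.

```python
# mean fit times in seconds, copied from the filtered table above
fit_time = {'GaussianNB': 0.754486,
            'DecisionTree': 1.561036,
            'GradientBoosting': 775.890066}

baseline = fit_time['GaussianNB']

# how many times longer each classifier takes to fit than GaussianNB
ratios = {name: t / baseline for name, t in fit_time.items()}

for name, r in ratios.items():
    print(f'{name}: {r:.1f}x')
```

For the run displayed here, GradientBoosting takes on the order of a thousand times longer to fit than GaussianNB, while DecisionTree takes about twice as long.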
