# 01_benchmarking

This notebook compares the baseline performance of 12 algorithms on the SEO-Effect dataset. Performance is evaluated with the following metrics: Accuracy, Macro Precision, Macro Recall and Macro F1, as well as Fit and Score Time.
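To make the macro-averaged metrics concrete, here is a minimal sketch on a handful of made-up labels (the values are hypothetical and not taken from the SEO-Effect dataset). Macro averaging computes each metric per class and then takes the unweighted mean, so small classes count as much as large ones.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# hypothetical toy labels for three classes (not from the SEO-Effect dataset)
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# accuracy is the share of correct predictions over all samples
accuracy = accuracy_score(y_true, y_pred)

# macro metrics: computed per class, then averaged without class weights
macro_precision = precision_score(y_true, y_pred, average='macro')
macro_recall = recall_score(y_true, y_pred, average='macro')
macro_f1 = f1_score(y_true, y_pred, average='macro')

print('Accuracy: %.3f' % accuracy)
print('Macro precision: %.3f' % macro_precision)
print('Macro recall: %.3f' % macro_recall)
print('Macro F1: %.3f' % macro_f1)
```

Because every class has the same weight, a classifier can reach a high accuracy and still score a much lower macro F1 when it does poorly on a rare class.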

The algorithms are selected from the ["scikit-learn algorithm cheat-sheet"](https://scikit-learn.org/stable/_static/ml_map.png) to compare a broad spectrum of different classification methods.

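Custom parameters are kept to a minimum: <code>max_iter</code> is limited to 100, <code>max_depth</code> to 10, <code>n_neighbors</code> is set to 4, <code>outlier_label</code> to 5 and <code>penalty</code> to <code>l2</code>. Each value is set either to limit run time or because the parameter needs to be set, and it is only applied to classifiers that accept it. The random state of the cross-validation splitter is set to 22.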
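The following sketch shows how one shared parameter dictionary can be narrowed down per classifier by comparing it with the names returned by <code>get_params()</code>. The helper name <code>matching_params</code> is hypothetical and only used for illustration.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# shared parameter values as listed above
params = {'max_iter': 100,
          'max_depth': 10,
          'penalty': 'l2',
          'n_neighbors': 4,
          'outlier_label': 5}

def matching_params(clf, params):
    """Hypothetical helper: keep only the parameters that clf accepts."""
    accepted = clf.get_params()
    return {name: value for name, value in params.items() if name in accepted}

# each classifier only receives the parameters it knows about
tree_params = matching_params(DecisionTreeClassifier(), params)
knn_params = matching_params(KNeighborsClassifier(), params)
nb_params = matching_params(GaussianNB(), params)

print(tree_params)
print(knn_params)
print(nb_params)
```

Classifiers that accept none of the listed parameters, such as GaussianNB, simply keep their defaults.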

The algorithms are cross-validated on five stratified shuffle splits, each with a test size of 66%. The mean and the standard deviation of each metric are saved to a dataframe and stored in <code>output/01_benchmarking_stateless.csv</code>.
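As a small illustration of what the stratified shuffle splits do, the sketch below applies the same splitter settings to hypothetical, imbalanced labels (80 samples of one class and 20 of another, not the SEO-Effect data). Each split should hold out about 66% of the samples while keeping the class proportions close to those of the full label set.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
# placeholder features, the splitter only needs the number of samples
X = np.zeros((len(y), 1))

# same splitter settings as used in this notebook
sss = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)

test_sizes = []
test_class1_share = []
for train_idx, test_idx in sss.split(X, y):
    test_sizes.append(len(test_idx))
    # share of class 1 in the test set, to compare with the overall 20%
    test_class1_share.append(y[test_idx].mean())

print(test_sizes)
print(test_class1_share)
```

With such a large test share, each model is trained on only about a third of the data, which keeps fit times down at the cost of less training material.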

Six of the 12 algorithms perform very well, with an accuracy of at least 90% and a macro F1 of about 75% or higher. Ordered by macro precision from best to worst, these are: GradientBoosting, RandomForest, ExtraTrees, DecisionTree, LinearSVC and GaussianNB. However, fitting the GradientBoostingClassifier takes roughly 740 times longer than fitting the Gaussian Naive Bayes classifier (about 322 s versus 0.44 s). Gaussian Naive Bayes is known as a quick and reliable way to test classification performance and works well with the data used here. As such, it will be used to compare data preprocessing methods in the next section.
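As a rough check of the speed gap, the fit-time ratio can be recomputed from the mean fit times listed in the results table further down in this notebook. The two values below are copied from that table and are given in seconds.

```python
# mean fit times in seconds, copied from the results table in this notebook
fit_time_gradient_boosting = 322.213033
fit_time_gaussian_nb = 0.436892

# how many times longer GradientBoosting takes to fit than GaussianNB
ratio = fit_time_gradient_boosting / fit_time_gaussian_nb
print('Fit time ratio: %.1f' % ratio)
```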

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_validate, StratifiedShuffleSplit
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

import warnings
warnings.filterwarnings('ignore')

from IPython.display import clear_output

# load data
df = pd.read_csv('output/data_cleaned.csv')

# split data into features and target
X = df.drop(columns=['seo class'])
y = df['seo class']

# list of classifiers to compare
classifiers = {'AdaBoost': AdaBoostClassifier(),
               'BernoulliNB': BernoulliNB(),
               'DecisionTree': DecisionTreeClassifier(),
               'ExtraTrees': ExtraTreesClassifier(),
               'GaussianNB': GaussianNB(),
               'GradientBoosting': GradientBoostingClassifier(),
               'KNeighbors': KNeighborsClassifier(),
               'LinearSVC': LinearSVC(),
               'RadiusNeighbors': RadiusNeighborsClassifier(),
               'RandomForest': RandomForestClassifier(),
               'SGD': SGDClassifier(),
               'SVC': SVC()}

# dictionary of evaluation metrics
metrics = {'accuracy': 'accuracy',
           'precision': 'precision_macro', 
           'recall': 'recall_macro',
           'f1': 'f1_macro'}

# set minimal parameters to make sure algorithms function
params = {'max_iter': 100,
          'max_depth' : 10,
          'penalty': 'l2',
          'n_neighbors': 4,
          'outlier_label': 5}

# create stratified split for cross validation
sss = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)

# empty dictionary to store results
cv_results = {}

# iterate over classifiers to compare results
for name, clf in classifiers.items():
    clear_output()
    print('Current classifier: %s' % (name))
    
    # get parameter options for current classifier
    clf_params = clf.get_params()
    
    # select matching parameters for current classifier from params
    c_params = {}
    for p in params.keys():
        if p in clf_params.keys():
            c_params[p] = params[p]
    
    # set parameters
    if c_params:
        clf.set_params(**c_params)
    
    # cross validate classifier
    cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)
    # save results of cross validation
    cv_results[name] = cv
    
    # print f1/accuracy mean and std to update on results
    print('F1: %.2f±%.2f' % (cv['test_f1'].mean()*100, cv['test_f1'].std()*100))
    print('Acc: %.2f±%.2f' % (cv['test_accuracy'].mean()*100, cv['test_accuracy'].std()*100))
    
# format data for dataframe
data = []
for name, scores in cv_results.items():
    row = [name]
    for values in scores.values():
        # add mean and standard deviation to data
        row.append(values.mean())
        row.append(values.std())
    data.append(row)

# column names for dataframe, taken from the keys of the first result
# (all classifiers share the same keys in the same order)
columns = ['classifier']
for k in next(iter(cv_results.values())).keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')

# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)
# save data frame as csv file
results.to_csv('output/01_benchmarking_stateless.csv')

Current classifier: SVC
F1: 24.84±4.28
Acc: 54.97±2.50


In [3]:
# load results from csv
# (used to avoid having to run the whole code again)
import pandas as pd
r_df = pd.read_csv('output/01_benchmarking_stateless.csv', index_col=0)

# sort by best f1 score
r_df.sort_values(by=['f1_mean'], ascending=False)

| | classifier | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | GradientBoosting | 322.213033 | 15.916843 | 4.000637 | 0.12578 | 0.999935 | 2.1e-05 | 0.9924 | 0.001333 | 0.996405 | 0.003185 | 0.994376 | 0.00182 |
| 2 | DecisionTree | 0.835032 | 0.040339 | 0.369607 | 0.02607 | 0.998987 | 0.000186 | 0.919814 | 0.022215 | 0.893322 | 0.007192 | 0.905148 | 0.013507 |
| 3 | ExtraTrees | 6.34615 | 0.222098 | 2.673421 | 0.029114 | 0.989497 | 0.003568 | 0.961246 | 0.007382 | 0.794662 | 0.001794 | 0.824789 | 0.002521 |
| 9 | RandomForest | 11.501792 | 0.160823 | 2.665219 | 0.022156 | 0.997803 | 0.000407 | 0.991392 | 0.004365 | 0.792152 | 0.0025 | 0.823236 | 0.003368 |
| 4 | GaussianNB | 0.436892 | 0.022919 | 0.745969 | 0.024888 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
| 7 | LinearSVC | 14.364942 | 0.375638 | 0.329023 | 0.008011 | 0.990821 | 0.006233 | 0.763993 | 0.082316 | 0.780855 | 0.016119 | 0.745376 | 0.047975 |
| 1 | BernoulliNB | 0.493588 | 0.105435 | 0.59345 | 0.028804 | 0.916452 | 0.001131 | 0.733035 | 0.003055 | 0.939055 | 0.003086 | 0.734415 | 0.001984 |
| 0 | AdaBoost | 9.834682 | 0.188128 | 3.309323 | 0.051438 | 0.902365 | 0.143389 | 0.717554 | 0.039426 | 0.697937 | 0.081537 | 0.689125 | 0.095936 |
| 6 | KNeighbors | 0.256356 | 0.018577 | 924.195183 | 11.492634 | 0.726202 | 0.000711 | 0.589559 | 0.008776 | 0.535577 | 0.002654 | 0.545939 | 0.001965 |
| 10 | SGD | 5.30315 | 0.47989 | 0.326646 | 0.002098 | 0.925023 | 0.032688 | 0.531254 | 0.089441 | 0.522071 | 0.076451 | 0.522144 | 0.081544 |


In [4]:
# get standard dev columns to remove from df display
std_c = [c for c in r_df.columns if '_std' in c]

# set filters to narrow down results
# Filters: F1 > 75% and Accuracy > 95%
filter_ = (r_df['f1_mean'] > 0.75) & (r_df['accuracy_mean'] > 0.95)

# filter results by f1 > 75% and accuracy > 95%, sort by fit time
r_df[filter_].sort_values(by=['fit_time_mean']).drop(columns=std_c)

| | classifier | fit_time_mean | score_time_mean | accuracy_mean | precision_mean | recall_mean | f1_mean |
|---|---|---|---|---|---|---|---|
| 4 | GaussianNB | 0.436892 | 0.745969 | 0.963281 | 0.757487 | 0.97041 | 0.758505 |
| 2 | DecisionTree | 0.835032 | 0.369607 | 0.998987 | 0.919814 | 0.893322 | 0.905148 |
| 3 | ExtraTrees | 6.34615 | 2.673421 | 0.989497 | 0.961246 | 0.794662 | 0.824789 |
| 9 | RandomForest | 11.501792 | 2.665219 | 0.997803 | 0.991392 | 0.792152 | 0.823236 |
| 5 | GradientBoosting | 322.213033 | 4.000637 | 0.999935 | 0.9924 | 0.996405 | 0.994376 |
