# 02_data_optimization

This notebook uses the Gaussian Naive Bayes classifier to compare preprocessing steps for the data set. These steps make the data more uniform and help boost the performance of the algorithms. As before, the results are cross-validated with 5 stratified shuffle splits, and the mean and standard deviation of each metric are reported.

The most common techniques to handle missing data are compared. These are encoding the missing values with a placeholder (in this case -1), imputing the missing value by the mean of the feature or removing the sample in which a value for one feature is missing. Removing samples with missing data has the biggest positive impact on the data, increasing the accuracy from 96% to 99% and the f1 from 75% to 80%.
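To make the three strategies concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical and only illustrate the mechanics; the real comparison on the full data set is in section 5.1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame (hypothetical values); -1 marks a value that could not be measured
toy = pd.DataFrame({'speed': [1.2, -1.0, 3.4, 2.0],
                    'https': [1, 0, -1, 1]})

# 1) encoded: keep the -1 placeholder as-is
encoded = toy.copy()

# 2) removed: drop every row that contains a negative placeholder
removed = toy[~toy.lt(0).any(axis=1)]

# 3) imputed: turn placeholders into NaN, then fill with the column mean
imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(toy.mask(toy < 0)),
    columns=toy.columns)

print(encoded.shape, removed.shape, imputed.shape)
print(imputed)
```

Removing rows shrinks the data set, while encoding and imputing keep every sample but feed the classifier values that were never observed.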

The dataset is highly imbalanced, with a class distribution of roughly 48, 48, 1 and 1% per class. As such, methods to over- or under-sample these classes are tested. SMOTETomek, a combination of SMOTE and Tomek links in which the minority classes are over-sampled and samples from the majority classes are removed, boosts f1 from 75% to 97%, while accuracy rises only slightly, from 96% to just under 98% (see the table in section 5.2).
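SMOTETomek itself is more involved, but the basic idea of over-sampling can be sketched in plain numpy. The helper `random_oversample` below is hypothetical and is not imblearn's implementation; it simply duplicates random rows of the smaller classes until every class matches the largest one, which is roughly what `RandomOverSampler` does.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random rows of each class until all classes match the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # keep every original row, then draw the shortfall with replacement
        extra = rng.choice(members, size=target - n, replace=True)
        idx.append(np.concatenate([members, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# toy labels with a skew similar to the one described above
y_toy = np.array([0] * 48 + [1] * 48 + [2] * 2 + [3] * 2)
X_toy = np.arange(len(y_toy), dtype=float).reshape(-1, 1)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.unique(y_bal, return_counts=True))
```

SMOTE differs in that it synthesizes new minority samples by interpolating between neighbours instead of copying rows, and the Tomek step then removes samples that form cross-class nearest-neighbour pairs.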

Machine learning algorithms generally perform better when the features are on a comparable scale. As such, different methods of scaling the data are compared. Since most of the features in the data set are either binary (0 or 1) or have very low variance (except for speed), the algorithm performs best on the imbalanced data when no scaler is applied.
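The scalers compared here behave quite differently, which a small toy matrix makes visible. The values below are hypothetical (two binary flags and one wide-ranged column standing in for speed); the sketch only shows what each transformer does, not how it affects the classifier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# toy features (hypothetical): two binary flags and one wide-ranged column
X_toy = np.array([[0., 1., 0.5],
                  [1., 0., 12.0],
                  [1., 1., 3.0],
                  [0., 0., 40.0]])

X_minmax = MinMaxScaler().fit_transform(X_toy)  # each column mapped to [0, 1]
X_std = StandardScaler().fit_transform(X_toy)   # each column: mean 0, unit variance
X_norm = Normalizer().fit_transform(X_toy)      # each ROW rescaled to unit L2 norm

print(X_minmax.round(3))
print(X_std.round(3))
print(X_norm.round(3))
```

Note that `Normalizer` works per sample rather than per feature, so a large value in one column shrinks the binary flags in the same row. That is one plausible reason it scores lowest in section 5.3, although the notebook does not test this explanation.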

When testing the scaling methods on the balanced dataset, the StandardScaler increases the performance the most. The balanced and scaled model has an accuracy of 99% and an f1 of 99%. The "balanced" and the "balanced and scaled" datasets are saved to <code>output/data_balanced.csv</code> and <code>output/data_balanced_scaled.csv</code> respectively. The latter is used in the next steps.

#### 0. Import libraries

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_validate, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

from sklearn.impute import SimpleImputer

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

#### 1. Load and split data into features and target

In [3]:
# load data
df = pd.read_csv('output/data_cleaned.csv')

# split data into features and target
X = df.drop(columns=['seo class'])
y = df['seo class']

#### 2. Set evaluation metrics

In [4]:
# dictionary of evaluation metrics
metrics = {'accuracy': 'accuracy',
           'precision': 'precision_macro', 
           'recall': 'recall_macro',
           'f1': 'f1_macro'}
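
The macro-averaged metrics matter because of the class imbalance: accuracy can look good while the small classes are ignored. A minimal sketch with hypothetical toy labels (not the notebook's data) shows the gap for a classifier that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# toy labels (hypothetical): 95 majority samples, 5 minority samples
y_true = [0] * 95 + [1] * 5
# a classifier that always predicts the majority class
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 instead of warning
f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f'accuracy={acc:.3f}  f1_macro={f1_macro:.3f}')
```

Macro averaging gives every class the same weight regardless of its size, so the missed minority class pulls the score below 0.5 even though accuracy stays at 0.95.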

#### 3. Split data for cross validation

In [5]:
sss = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)
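
Stratification keeps the class proportions the same in every split, which matters when some classes hold only about 1% of the samples. The sketch below uses hypothetical toy labels and a 50% test size (not the 66% used above) so that the per-class counts come out as whole numbers.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy labels (hypothetical): two large and two small classes
y_toy = np.array([0] * 120 + [1] * 120 + [2] * 8 + [3] * 8)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=22)

# count how many samples of each class land in the test part of every split
test_counts = [np.bincount(y_toy[test_idx], minlength=4)
               for _, test_idx in splitter.split(X_toy, y_toy)]

for counts in test_counts:
    print(counts)
```

Each split draws different samples, but the small classes are represented in every test set in proportion to their share of the data.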

#### 4. Set classifier

In [6]:
clf = GaussianNB()

#### 5.1. Error encoding

In [6]:
# empty dictionary to store results
cv_results = {}

# create datasets to compare

# errors removed
# keep only rows without negative placeholders (axis must be a keyword in pandas >= 2.0)
df_clean = df[~df.lt(0).any(axis=1)]
X_clean = df_clean.drop(columns=['seo class'])
y_clean = df_clean['seo class']

# errors imputed
df_impute = df.copy(deep=True)
df_impute[df_impute < 0] = np.nan
X_impute = df_impute.drop(columns=['seo class'])
y_impute = df_impute['seo class']

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_impute = imp_mean.fit_transform(X_impute, y_impute)

# store data in dict
data = {'errors encoded': [X, y],
        'errors removed': [X_clean, y_clean],
        'errors imputed': [X_impute, y_impute]}

for method, d in data.items():
    cv = cross_validate(clf, d[0], d[1], scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')
    
# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)

# save data frame as csv file
results.to_csv('output/5_error_encoding.csv')
results.sort_values(by=['f1_mean'], ascending=False)

| | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | errors removed | 0.325752 | 0.035651 | 0.568571 | 0.007692 | 0.990434 | 0.000585 | 0.780505 | 0.00045 | 0.975232 | 0.006765 | 0.804394 | 0.000596 |
| 0 | errors encoded | 0.396002 | 0.019392 | 0.730445 | 0.04502 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
| 2 | errors imputed | 0.519994 | 0.03717 | 0.705688 | 0.043502 | 0.88655 | 0.011275 | 0.749924 | 0.000614 | 0.931416 | 0.005783 | 0.718006 | 0.004957 |


#### 5.2. Class Balancing

In [7]:
# empty dictionary to store results
cv_results = {}

# get samplers to compare
sampler = {'No Sampling': 0,
           'RandomOverSampler': RandomOverSampler(random_state=2),
           'RandomUnderSampler': RandomUnderSampler(random_state=2),
           'SMOTETomek': SMOTETomek(random_state=2)}

for method, s in sampler.items():
    # reset X and y
    X = df.drop(columns=['seo class'])
    y = df['seo class']
    
    # applies sampler to X and y
    if type(s) != int:
        X, y = s.fit_resample(X, y)
    
    cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')
    
# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)

# save data frame as csv file
results.to_csv('output/5_class_balancing.csv')
results.sort_values(by=['f1_mean'], ascending=False)

| | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | SMOTETomek | 0.891233 | 0.008455 | 2.026337 | 0.010724 | 0.977984 | 0.001153 | 0.978086 | 0.001137 | 0.977641 | 0.001193 | 0.977668 | 0.001215 |
| 1 | RandomOverSampler | 0.970513 | 0.01649 | 2.276229 | 0.149319 | 0.970725 | 0.00039 | 0.971059 | 0.000392 | 0.970725 | 0.00039 | 0.970588 | 0.000405 |
| 2 | RandomUnderSampler | 0.003409 | 0.00024 | 0.006502 | 0.000255 | 0.963804 | 0.002169 | 0.963946 | 0.002221 | 0.963819 | 0.002159 | 0.963506 | 0.002213 |
| 0 | No Sampling | 0.436416 | 0.058127 | 0.745758 | 0.057335 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |


#### 5.3. Data Scaling with imbalanced classes

In [8]:
# empty dictionary to store results
cv_results = {}

# get scalers to compare
scaler = {'No Scaling': 0,
          'MinMaxScaler': MinMaxScaler(),
          'MaxAbsScaler': MaxAbsScaler(),
          'StandardScaler': StandardScaler(),
          'Normalizer': Normalizer()}

for method, s in scaler.items():
    # reset X and y
    X = df.drop(columns=['seo class'])
    y = df['seo class']
    
    # applies scaler to X
    if type(s) != int:
        X = s.fit_transform(X)
    
    cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')
    
# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)

# save data frame as csv file
results.to_csv('output/5_data_scaling.csv')
results.sort_values(by=['f1_mean'], ascending=False)

| | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No Scaling | 0.404903 | 0.018503 | 0.673342 | 0.017692 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
| 2 | MaxAbsScaler | 0.526127 | 0.01109 | 0.696275 | 0.016563 | 0.93996 | 0.005883 | 0.754739 | 0.000698 | 0.9595 | 0.003501 | 0.7433 | 0.003031 |
| 1 | MinMaxScaler | 0.51391 | 0.01358 | 0.674625 | 0.02723 | 0.939642 | 0.005941 | 0.754716 | 0.000693 | 0.959495 | 0.003436 | 0.743228 | 0.003001 |
| 3 | StandardScaler | 0.506076 | 0.014552 | 0.658166 | 0.02382 | 0.936438 | 0.005922 | 0.754427 | 0.000622 | 0.957956 | 0.003258 | 0.741645 | 0.002881 |
| 4 | Normalizer | 0.487678 | 0.006445 | 0.630861 | 0.005555 | 0.943696 | 0.005194 | 0.560165 | 0.005137 | 0.777674 | 0.003217 | 0.596587 | 0.00704 |


#### 6. Combining best performing preprocessing methods

In [None]:
# load original data
df = pd.read_csv('output/data_cleaned.csv')
X = df.drop(columns=['seo class'])
y = df['seo class']

# removes errors
df = df[~df.lt(0).any(axis=1)]  # axis must be a keyword in pandas >= 2.0
X = df.drop(columns=['seo class'])
y = df['seo class']

# applies SMOTETomek to balance classes
s = SMOTETomek(random_state=2)
X, y = s.fit_resample(X, y)

# save balanced cleaned data
# create copy of sampled features
df_balanced = X.copy(deep=True)
# add targets to copy
df_balanced['seo class'] = y
# save balanced data
df_balanced.to_csv('output/data_cleaned_balanced.csv', index=False)

# no data scaling, because 'No Scaling' performed best on the imbalanced data (5.3)
## possible: re-test scaling after the other two preprocessing steps
## in case it improves the results further

cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)

# format data for dataframe
data = []
row = []
for k, v in cv.items():
    # add mean and standard deviation to data
    row.append(v.mean())
    row.append(v.std())
data.append(row)
    
# column names for dataframe
columns = []
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')
    
# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)

# save data frame as csv file
results.to_csv('output/5_data_optimization.csv')
results.T