# 02_data_optimization

This notebook uses a Gaussian Naive Bayes classifier to compare ways of optimizing the dataset. The steps are preprocessing methods that make the data more uniform and help boost the performance of the algorithms. As before, the results are cross-validated on 5 stratified shuffle splits, and the mean and standard deviation of each metric are reported.
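The mean and standard deviation summary used throughout this notebook can be sketched in plain Python. The fold scores and the `summarize` helper below are hypothetical and serve only as an illustration; `statistics.pstdev` is used because it follows the same population convention (`ddof=0`) that NumPy's `.std()` applies by default.

```python
from statistics import fmean, pstdev

# Hypothetical per-fold scores, shaped like the dict returned by
# sklearn's cross_validate (one value per split, 5 splits).
cv_scores = {
    "test_accuracy": [0.96, 0.97, 0.96, 0.95, 0.96],
    "test_f1": [0.75, 0.76, 0.77, 0.75, 0.77],
}


def summarize(scores):
    """Collapse per-fold scores into <metric>_mean and <metric>_std."""
    row = {}
    for key, values in scores.items():
        name = key.replace("test_", "")
        row[name + "_mean"] = fmean(values)
        # Population standard deviation (ddof=0), the same convention
        # as numpy.ndarray.std() with default arguments.
        row[name + "_std"] = pstdev(values)
    return row


summary = summarize(cv_scores)
print(summary)
```

A small standard deviation across the splits suggests that a difference between two preprocessing methods is not just an artifact of one particular split.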

The most common techniques for handling missing data are compared: encoding the missing values with a placeholder (in this case -1), imputing each missing value with the mean of its feature, or removing every sample in which the value of at least one feature is missing. Removing samples with missing data has the largest positive effect, increasing the accuracy from 96% to 99% and the macro-averaged F1 from 76% to 80%.

The dataset is highly imbalanced, with a class distribution of roughly 48%, 48%, 1% and 1%. For this reason, methods that over-sample or under-sample the classes are tested. SMOTETomek, which over-samples the minority classes with SMOTE and then removes Tomek links (pairs of nearest neighbours that belong to different classes), boosts F1 from 76% to 98%, while accuracy also increases from 96% to 98%.
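A small worked example shows why accuracy can stay high on imbalanced classes while the macro-averaged F1 lags behind. The labels and predictions below are hypothetical (100 samples split 48/48/2/2), and the two helpers are a plain-Python sketch of the standard definitions, not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels: two large classes (48 each) and two rare ones (2 each).
y_true = [0] * 48 + [1] * 48 + [2] * 2 + [3] * 2
# Hypothetical predictions: every rare sample is found, but two samples of
# each large class are mistaken for a rare class.
y_pred = [0] * 46 + [2] * 2 + [1] * 46 + [3] * 2 + [2] * 2 + [3] * 2

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro F1={f1:.3f}")
```

Only four of the 100 samples are misclassified, so accuracy is 0.96, but the precision of each rare class falls to 0.5 and pulls the macro F1 down to about 0.82. Every class counts equally in the macro average, regardless of its size.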

Machine learning algorithms generally perform better when the features are on a comparable scale, so different methods of scaling the data are compared. Since most of the features in the dataset are either binary (0 or 1) or have very low variance (except for speed), the classifier performs best on the original data when no scaler is applied.

When the scaling methods are tested on the balanced dataset, the StandardScaler improves performance the most. The balanced and scaled model has an accuracy of 99% and an F1 of 99%. The "balanced" and the "balanced and scaled" datasets are saved to <code>output/data_balanced.csv</code> and <code>output/data_balanced_scaled.csv</code> respectively. The latter is used in the next steps.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_validate, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

from sklearn.impute import SimpleImputer

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

# load data
df = pd.read_csv('output/data_cleaned.csv')

# split data into features and target
X = df.drop(columns=['seo class'])
y = df['seo class']

# dictionary of evaluation metrics
metrics = {'accuracy': 'accuracy',
           'precision': 'precision_macro', 
           'recall': 'recall_macro',
           'f1': 'f1_macro'}

# create stratified split for cross validation
sss = StratifiedShuffleSplit(n_splits=5, test_size=.66, random_state=22)

# define classifier
clf = GaussianNB()

## 02-1_error_encoding

In [2]:
# empty dictionary to store results
cv_results = {}

# get datasets to compare
# errors removed
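# keep only rows without negative placeholder values (axis must be a keyword in pandas >= 2.0)
df_clean = df[~df.lt(0).any(axis=1)]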
X_clean = df_clean.drop(columns=['seo class'])
y_clean = df_clean['seo class']

# errors imputed
df_impute = df.copy(deep=True)
df_impute[df_impute < 0] = np.nan
X_impute = df_impute.drop(columns=['seo class'])
y_impute = df_impute['seo class']

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_impute = imp_mean.fit_transform(X_impute, y_impute)

# store data in dict
data = {'errors encoded': [X, y],
        'errors removed': [X_clean, y_clean],
        'errors imputed': [X_impute, y_impute]}

for method, d in data.items():
    cv = cross_validate(clf, d[0], d[1], scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')

# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)
# save data frame as csv file
results.to_csv('output/02-1_error_encoding.csv')
results.sort_values(by=['f1_mean'], ascending=False)

| index | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | errors removed | 0.352058 | 0.010248 | 0.626672 | 0.015262 | 0.990434 | 0.000585 | 0.780505 | 0.00045 | 0.975232 | 0.006765 | 0.804394 | 0.000596 |
| 0 | errors encoded | 0.449729 | 0.024788 | 0.77306 | 0.039809 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
| 2 | errors imputed | 0.525328 | 0.036665 | 0.687711 | 0.013056 | 0.88655 | 0.011275 | 0.749924 | 0.000614 | 0.931416 | 0.005783 | 0.718006 | 0.004957 |
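The three error-handling strategies compared in this section can be illustrated on a handful of rows in plain Python. This is a sketch under assumptions: the rows are made up, `https` is a hypothetical binary feature, and any negative value is treated as a missing measurement, mirroring the `df.lt(0)` check in the cell above.

```python
from statistics import fmean

# Hypothetical rows; -1 marks a value that could not be measured.
rows = [
    {"speed": 1.0, "https": 1},
    {"speed": -1, "https": 0},
    {"speed": 3.0, "https": 1},
    {"speed": 2.0, "https": -1},
]


def is_missing(value):
    """Negative values are placeholders for missing measurements."""
    return value < 0


# 1) errors encoded: keep the placeholder as if it were a regular value
encoded = [dict(r) for r in rows]

# 2) errors removed: drop every row with at least one missing value
removed = [dict(r) for r in rows if not any(is_missing(v) for v in r.values())]

# 3) errors imputed: replace a missing value by the mean of the
#    observed (non-missing) values of the same feature
means = {
    col: fmean(r[col] for r in rows if not is_missing(r[col]))
    for col in rows[0]
}
imputed = [
    {col: means[col] if is_missing(v) else v for col, v in r.items()}
    for r in rows
]

print(len(encoded), len(removed), len(imputed))
print(imputed[1], imputed[3])
```

Removing rows shrinks the dataset (here from four rows to two), which is the price paid for the cleaner signal reported in the table above.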


## 02-2_class_balancing

In [3]:
# empty dictionary to store results
cv_results = {}

# get samplers to compare

sampler = {'No Sampling': 0,
           'RandomOverSampler': RandomOverSampler(random_state=2),
           'RandomUnderSampler': RandomUnderSampler(random_state=2),
           'SMOTETomek': SMOTETomek(random_state=2)}

for method, s in sampler.items():
    # reset X and y
    X = df.drop(columns=['seo class'])
    y = df['seo class']
    
    # applies sampler to X and y
    if type(s) != int:
        X, y = s.fit_resample(X, y)
    
    cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')

# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)
# save data frame as csv file
results.to_csv('output/02-2_class_balancing.csv')
results.sort_values(by=['f1_mean'], ascending=False)

| index | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | SMOTETomek | 0.902941 | 0.011864 | 2.05601 | 0.003684 | 0.977984 | 0.001153 | 0.978086 | 0.001137 | 0.977641 | 0.001193 | 0.977668 | 0.001215 |
| 1 | RandomOverSampler | 0.987261 | 0.047807 | 2.353501 | 0.147402 | 0.970725 | 0.00039 | 0.971059 | 0.000392 | 0.970725 | 0.00039 | 0.970588 | 0.000405 |
| 2 | RandomUnderSampler | 0.003405 | 0.000193 | 0.00663 | 0.000279 | 0.963804 | 0.002169 | 0.963946 | 0.002221 | 0.963819 | 0.002159 | 0.963506 | 0.002213 |
| 0 | No Sampling | 0.4017 | 0.019843 | 0.697778 | 0.012314 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
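The idea behind random over-sampling and under-sampling can be sketched without any library. The `random_resample` helper and the toy data below are hypothetical; this is not the imbalanced-learn implementation, and it does not cover SMOTE, which synthesizes new minority samples instead of duplicating existing ones.

```python
import random
from collections import Counter, defaultdict


def random_resample(X, y, target, seed=2):
    """Resample every class to exactly `target` rows (illustrative helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) >= target:
            # under-sample: draw `target` rows without replacement
            picked = rng.sample(rows, target)
        else:
            # over-sample: keep all rows, add random duplicates
            picked = rows + rng.choices(rows, k=target - len(rows))
        X_out.extend(picked)
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical data with a 48/48/2/2 class distribution.
X = [[i] for i in range(100)]
y = [0] * 48 + [1] * 48 + [2] * 2 + [3] * 2
counts = Counter(y)

X_over, y_over = random_resample(X, y, target=max(counts.values()))
X_under, y_under = random_resample(X, y, target=min(counts.values()))

print(Counter(y_over))
print(Counter(y_under))
```

Under-sampling to the rarest class leaves only a small fraction of the data, which is consistent with the very short fit and score times of the RandomUnderSampler row above.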


## 02-3_data_scaling

In [5]:
# empty dictionary to store results
cv_results = {}

# get scalers to compare
scaler = {'No Scaling': 0,
          'MinMaxScaler': MinMaxScaler(),
          'MaxAbsScaler': MaxAbsScaler(),
          'StandardScaler': StandardScaler(),
          'Normalizer': Normalizer()}

for method, s in scaler.items():
    # reset X and y
    X = df.drop(columns=['seo class'])
    y = df['seo class']
    
    # applies scaler to X
    if type(s) != int:
        X = s.fit_transform(X)
    
    cv = cross_validate(clf, X, y, scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')

# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)
# save data frame as csv file
results.to_csv('output/02-3_data_scaling.csv')
results.sort_values(by=['f1_mean'], ascending=False)

| index | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No Scaling | 0.411623 | 0.015341 | 0.714665 | 0.011657 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
| 2 | MaxAbsScaler | 0.523627 | 0.008802 | 0.705701 | 0.023381 | 0.93996 | 0.005883 | 0.754739 | 0.000698 | 0.9595 | 0.003501 | 0.7433 | 0.003031 |
| 1 | MinMaxScaler | 0.501623 | 0.009172 | 0.672158 | 0.014683 | 0.939642 | 0.005941 | 0.754716 | 0.000693 | 0.959495 | 0.003436 | 0.743228 | 0.003001 |
| 3 | StandardScaler | 0.529279 | 0.005894 | 0.703881 | 0.010031 | 0.936438 | 0.005922 | 0.754427 | 0.000622 | 0.957956 | 0.003258 | 0.741645 | 0.002881 |
| 4 | Normalizer | 0.520847 | 0.004463 | 0.692708 | 0.012584 | 0.943696 | 0.005194 | 0.560165 | 0.005137 | 0.777674 | 0.003217 | 0.596587 | 0.00704 |
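The column-wise formulas behind standard scaling and min-max scaling can be written out in a few lines. This is a plain-Python sketch with hypothetical helper names and made-up values, not the scikit-learn code. Note that the Normalizer works differently from the other scalers: it rescales each row (sample) to unit norm rather than each column (feature).

```python
from statistics import fmean, pstdev


def standard_scale(column):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = fmean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]


def min_max_scale(column):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


binary = [0, 1, 1, 0]          # a 0/1 feature, as most features in the dataset
speed = [1.0, 2.0, 3.0, 4.0]   # hypothetical values for a continuous feature

binary_scaled = min_max_scale(binary)
speed_scaled = standard_scale(speed)

print(binary_scaled)
print(speed_scaled)
```

A binary feature already spans the range from 0 to 1, so min-max scaling leaves it unchanged; only the continuous feature is rescaled noticeably.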


## 02-3.5_data_scaling_on_balanced_classes

In [6]:
# empty dictionary to store results
cv_results = {}

# get scalers to compare
scaler = {'No Scaling': 0,
          'MinMaxScaler': MinMaxScaler(),
          'MaxAbsScaler': MaxAbsScaler(),
          'StandardScaler': StandardScaler(),
          'Normalizer': Normalizer()}
# load data
df = pd.read_csv('output/data_cleaned.csv')

# 02-1: error encoding
df1 = df[~df.lt(0).any(axis=1)]
X1 = df1.drop(columns=['seo class'])
y1 = df1['seo class']

# 02-2: class balancing
sampler = SMOTETomek(random_state=2)
X2, y2 = sampler.fit_resample(X1, y1)

# create copy of sampled features
df2 = X2.copy(deep=True)
# add targets to copy
df2['seo class'] = y2
# save balanced data
df2.to_csv('output/data_balanced.csv', index=False)

for method, s in scaler.items():
    # reset X and y
    X3 = X2
    y3 = y2
    
    # applies scaler to X
    if type(s) != int:
        X3 = s.fit_transform(X3)
    
    cv = cross_validate(clf, X3, y3, scoring=metrics, cv=sss)
    cv_results[method] = cv
    
# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')

# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)
# save data frame as csv file
results.to_csv('output/02-3.5_balanced_data_scaling.csv', index=False)
results.sort_values(by=['f1_mean'], ascending=False)

| index | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | StandardScaler | 0.93833 | 0.020005 | 1.8307 | 0.031684 | 0.994816 | 8.7e-05 | 0.994853 | 8.9e-05 | 0.994685 | 8.8e-05 | 0.994738 | 8.8e-05 |
| 1 | MinMaxScaler | 0.938448 | 0.013297 | 1.801425 | 0.033166 | 0.994521 | 9.5e-05 | 0.994547 | 9.8e-05 | 0.994392 | 9.6e-05 | 0.99444 | 9.7e-05 |
| 2 | MaxAbsScaler | 0.923413 | 0.013727 | 1.763639 | 0.012033 | 0.994521 | 9.5e-05 | 0.994547 | 9.8e-05 | 0.994392 | 9.6e-05 | 0.99444 | 9.7e-05 |
| 0 | No Scaling | 0.818617 | 0.042544 | 1.870577 | 0.050447 | 0.989987 | 0.000194 | 0.989883 | 0.000197 | 0.989842 | 0.000196 | 0.989816 | 0.000198 |
| 4 | Normalizer | 0.936391 | 0.011571 | 1.81031 | 0.005827 | 0.981208 | 0.000264 | 0.981926 | 0.000255 | 0.980779 | 0.000263 | 0.980989 | 0.000264 |


## 02-4_scaling_balanced_data

In [8]:
# empty dictionary to store results
cv_results = {}

# load original data
df = pd.read_csv('output/data_cleaned.csv')
X = df.drop(columns=['seo class'])
y = df['seo class']

# 02-1: remove errors
df1 = df[~df.lt(0).any(axis=1)]
X1 = df1.drop(columns=['seo class'])
y1 = df1['seo class']

# 02-2: class balancing (load the balanced data saved in 02-3.5)
df2 = pd.read_csv('output/data_balanced.csv')
X2 = df2.drop(columns=['seo class'])
y2 = df2['seo class']

# 02-3: data scaling
# testing if scaling has a positive impact on balanced data
scaler = StandardScaler()
X3 = scaler.fit_transform(X2)
y3 = y2

# create copy of sampled and scaled features
df3 = pd.DataFrame(X3, columns=X.columns)
# add targets to copy
df3['seo class'] = y3
# save balanced and scaled data
df3.to_csv('output/data_balanced_scaled.csv', index=False)

# store data in dict
data = {'error encoded': [X, y],
        'errors removed': [X1, y1],
        'class balanced': [X2, y2],
        'data scaled': [X3, y3]}

for method, d in data.items():
    cv = cross_validate(clf, d[0], d[1], scoring=metrics, cv=sss)
    cv_results[method] = cv

# format data for dataframe
data = []
for name, results in cv_results.items():
    row = [name]
    for k, v in results.items():
        # add mean and standard deviation to data
        row.append(v.mean())
        row.append(v.std())
    data.append(row)
    
# column names for dataframe
columns = ['method']
for k in cv.keys():
    k = k.replace('test_', '')
    columns.append(k+'_mean')
    columns.append(k+'_std')

# create data frame to display cv results
results = pd.DataFrame(data, columns=columns)
# save data frame as csv file
results.to_csv('output/02_data_optimization.csv')
results

| index | method | fit_time_mean | fit_time_std | score_time_mean | score_time_std | accuracy_mean | accuracy_std | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | error encoded | 0.441485 | 0.057249 | 0.76477 | 0.082938 | 0.963281 | 0.001391 | 0.757487 | 0.000663 | 0.97041 | 0.001742 | 0.758505 | 0.001312 |
| 1 | errors removed | 0.361259 | 0.048442 | 0.589837 | 0.001799 | 0.990434 | 0.000585 | 0.780505 | 0.00045 | 0.975232 | 0.006765 | 0.804394 | 0.000596 |
| 2 | class balanced | 0.845261 | 0.020758 | 2.126672 | 0.189929 | 0.989987 | 0.000194 | 0.989883 | 0.000197 | 0.989842 | 0.000196 | 0.989816 | 0.000198 |
| 3 | data scaled | 0.958995 | 0.055208 | 1.812603 | 0.066597 | 0.994816 | 8.7e-05 | 0.994853 | 8.9e-05 | 0.994685 | 8.8e-05 | 0.994738 | 8.8e-05 |
