# Cross validation

The goal of this notebook is to compare the four training sets obtained so far with cross-validation, and to decide which one to use for the grid search that will look for a good model.

The model tested on each dataset is a default neural network whose number of hidden neurons equals two thirds of the number of inputs plus the number of outputs. All other parameters are kept at their defaults, and training lasts 100 epochs.

In [1]:
import sys
sys.path.append("..")
from src.model import NeuralNetwork
import pandas as pd
from sklearn.model_selection import KFold
from pprint import pprint
import numpy as np
import tensorflow


from numpy.random import seed
seed(1)
tensorflow.random.set_seed(1)

In [2]:
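def get_cross_scores(path, neurons):
    """Run 10-fold cross-validation on the CSV file at `path`.

    `neurons` is forwarded to NeuralNetwork.create_model.
    Returns, for accuracy and loss, a tuple (mean, std, per-fold values).
    """
    data = pd.read_csv(path)
    x = data.drop("class", axis=1)
    y = data["class"]

    # KFold does not shuffle by default: folds are contiguous blocks of rows.
    kf = KFold(n_splits=10)

    acc = []
    loss = []

    for train_index, test_index in kf.split(data):
        # Build a fresh model for every fold so no weights carry over.
        net = NeuralNetwork.create_model(neurons=neurons)
        net.fit(x.iloc[train_index],
                y.iloc[train_index],
                batch_size=64,
                epochs=100,
                verbose=0)
        # scores[0] is the loss, scores[1] the accuracy on the held-out fold.
        scores = net.evaluate(x.iloc[test_index],
                              y.iloc[test_index], verbose=1)
        acc.append(scores[1])
        loss.append(scores[0])

    return {"Accuracy": (np.mean(acc), np.std(acc), acc),
            "Loss": (np.mean(loss), np.std(loss), loss)}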

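`KFold(n_splits=10)` splits the rows in file order without shuffling, so if a CSV happens to be sorted by class, individual folds can see very different class mixes. Whether that applies to these datasets is not checked in this notebook. As a minimal sketch of an alternative, scikit-learn's `StratifiedKFold` with shuffling keeps the class proportions similar in every fold. The toy labels below are invented for illustration and are not the project data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the real data (an assumption): 100 rows sorted by class,
# 10 classes with 10 rows each, and 3 dummy features.
y = np.repeat(np.arange(10), 10)
x = np.zeros((len(y), 3))

# Shuffle before splitting and keep class proportions equal across folds.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Count how many rows of each class end up in every test fold.
fold_class_counts = [
    np.bincount(y[test_index], minlength=10)
    for _, test_index in skf.split(x, y)
]
```

Unlike `KFold.split`, `StratifiedKFold.split` needs the labels as its second argument, so inside `get_cross_scores` the call would become `skf.split(x, y)`.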
## First unscaled dataset
The first model is tested on the unscaled dataset, which has 132 features.

In [3]:
res_1 = get_cross_scores("../data/processed/initial/train_unscaled.csv", (132, 60, 30, 10))
pprint(res_1)

{'Accuracy': (0.15803959779441357,
              0.11748412387736448,
              [0.057777777314186096,
               0.02666666731238365,
               0.25555557012557983,
               0.04888888821005821,
               0.20222222805023193,
               0.3333333432674408,
               0.008888889104127884,
               0.12888889014720917,
               0.3400000035762787,
               0.1781737208366394]),
 'Loss': (nan,
          nan,
          [2.232120990753174,
           2.3848912715911865,
           nan,
           2.25197696685791,
           nan,
           2.9720394611358643,
           2.2974164485931396,
           nan,
           2.4739701747894287,
           3.387160539627075])}
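Three folds report a NaN loss, and a single NaN is enough to make `np.mean` and `np.std` return NaN as well. A minimal sketch of a NaN-aware summary using `np.nanmean` and `np.nanstd` follows. The helper name `summarize_folds` is hypothetical. It only summarises the folds that produced a finite loss and says nothing about why the loss became NaN in the others.

```python
import numpy as np

def summarize_folds(values):
    """Return (mean, std, number of NaN folds), ignoring NaN entries."""
    arr = np.asarray(values, dtype=float)
    n_nan = int(np.isnan(arr).sum())
    return float(np.nanmean(arr)), float(np.nanstd(arr)), n_nan

# Per-fold losses copied from the output above, rounded to 4 decimals.
nan = float("nan")
unscaled_loss = [2.2321, 2.3849, nan, 2.2520, nan,
                 2.9720, 2.2974, nan, 2.4740, 3.3872]

mean_loss, std_loss, n_nan = summarize_folds(unscaled_loss)
print(f"mean={mean_loss:.4f} std={std_loss:.4f} NaN folds={n_nan}")
```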


## First scaled dataset
The second model is tested on the scaled dataset, which has 132 features scaled with Standard Scaler.

In [4]:
res_2 = get_cross_scores("../data/processed/initial/train_scaled.csv", (132, 60, 30, 10))
pprint(res_2)

{'Accuracy': (0.5419089287519455,
              0.09469623201509875,
              [0.504444420337677,
               0.48444443941116333,
               0.5333333611488342,
               0.5711110830307007,
               0.4888888895511627,
               0.5199999809265137,
               0.7755555510520935,
               0.5622222423553467,
               0.3888888955116272,
               0.5902004241943359]),
 'Loss': (2.4311001300811768,
          0.7799554310137715,
          [2.4903297424316406,
           2.6902523040771484,
           2.352134943008423,
           2.2133467197418213,
           3.743863582611084,
           2.0728633403778076,
           0.7911885976791382,
           2.58548641204834,
           3.4808926582336426,
           1.8906430006027222])}
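As a reminder of what standard scaling computes, here is a minimal NumPy sketch of per-feature standardization, with the mean and standard deviation taken from the training rows only and reused on the held-out rows. The function name `standardize` and the toy arrays are hypothetical, and how the CSV files of this project were actually scaled is not shown in this notebook.

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the mean and std of the training rows."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)               # population std (ddof=0)
    std = np.where(std == 0, 1.0, std)    # avoid dividing by zero on constant features
    return (train - mean) / std, (test - mean) / std

# Toy data (an assumption): 4 features centred on 5 with spread 3.
rng = np.random.default_rng(1)
train = rng.normal(loc=5.0, scale=3.0, size=(90, 4))
test = rng.normal(loc=5.0, scale=3.0, size=(10, 4))

train_z, test_z = standardize(train, test)
```

After the transformation every training column has zero mean and unit variance, while the held-out columns are only approximately standardized because they reuse the training statistics.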


## Extended and scaled dataset
This dataset has more features: 144, scaled with Standard Scaler.

In [5]:
res_3 = get_cross_scores("../data/processed/extended/train_extended.csv", (144, 70, 30, 10))
pprint(res_3)

{'Accuracy': (0.5765746176242829,
              0.10803045893195957,
              [0.597777783870697,
               0.46444445848464966,
               0.6244444251060486,
               0.644444465637207,
               0.46666666865348816,
               0.5088889002799988,
               0.8199999928474426,
               0.6155555844306946,
               0.4377777874469757,
               0.5857461094856262]),
 'Loss': (2.3296024084091185,
          0.891361621883871,
          [1.707097053527832,
           3.000248432159424,
           1.69943368434906,
           1.9755271673202515,
           3.814427375793457,
           2.720547676086426,
           0.6426222324371338,
           1.9970605373382568,
           3.4562246799468994,
           2.2828352451324463])}


## PCA dataset
This is a reduced version of the extended scaled dataset, with 120 features obtained by PCA.

In [3]:
res_4 = get_cross_scores("../data/processed/extended/train_pca.csv", (120, 60, 25, 10))
pprint(res_4)

{'Accuracy': (0.5823504090309143,
              0.11846516258756422,
              [0.6355555653572083,
               0.42444443702697754,
               0.6733333468437195,
               0.6266666650772095,
               0.4355555474758148,
               0.6311110854148865,
               0.8288888931274414,
               0.5311111211776733,
               0.46000000834465027,
               0.576837420463562]),
 'Loss': (2.583436393737793,
          0.9924335616439395,
          [2.127903461456299,
           3.6515390872955322,
           1.7323474884033203,
           2.5358080863952637,
           4.280186176300049,
           1.7936289310455322,
           0.7402594089508057,
           3.10223388671875,
           3.358044147491455,
           2.512413263320923])}
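How `train_pca.csv` was produced is not shown here. For readers unfamiliar with the reduction step, the following is a minimal NumPy sketch of PCA through the singular value decomposition: centre the data, take the top `k` right singular vectors, and project onto them. The function `pca_project` and the toy matrix are hypothetical. In this notebook's setting the input would have 144 columns and `k` would be 120.

```python
import numpy as np

def pca_project(x, k):
    """Project the rows of x onto its first k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    return centered @ components.T, components

# Toy data (an assumption): 200 rows, 12 features, reduced to 5.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 12))
reduced, components = pca_project(x, k=5)
```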


## Final decision

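As seen in this notebook, the PCA training set reaches the highest mean cross-validation accuracy (0.582, against 0.577 for the extended set, 0.542 for the first scaled set and 0.158 for the unscaled set), so it is selected for hyperparameter tuning. The margin over the extended set is well within one standard deviation of the fold accuracies (about 0.11 to 0.12), and the PCA set has the higher mean loss of the two (2.58 against 2.33).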
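The ranking behind this choice can be reproduced from the summary statistics printed in the sections above. The sketch below copies the mean and standard deviation of the accuracy for each dataset, rounded to three decimals, and sorts them. The short dataset labels are chosen here for readability.

```python
# (mean, std) of the 10-fold accuracy, copied from the outputs above
# and rounded to three decimals.
accuracy = {
    "unscaled": (0.158, 0.117),
    "scaled": (0.542, 0.095),
    "extended": (0.577, 0.108),
    "pca": (0.582, 0.118),
}

# Sort dataset names by mean accuracy, best first.
ranking = sorted(accuracy, key=lambda name: accuracy[name][0], reverse=True)
best, runner_up = ranking[0], ranking[1]
gap = accuracy[best][0] - accuracy[runner_up][0]

for name in ranking:
    mean, std = accuracy[name]
    print(f"{name:<9} {mean:.3f} +/- {std:.3f}")
print(f"gap between {best} and {runner_up}: {gap:.3f}")
```

Because the gap between the two best datasets is much smaller than the spread across folds, the ordering of those two could change with a different split or seed.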