# Cross validation

The goal of this notebook is to compare the four training sets obtained so far and decide which one to use for the grid search for a good model.

The model tested on each dataset is a default neural network whose number of hidden neurons equals two thirds of the input size plus the output size. All other parameters are kept at their defaults, and training lasts 100 epochs.
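As a rough illustration of this sizing rule, the sketch below computes two thirds of the input count plus the output count for the three feature counts used in this notebook. The helper name `rule_of_thumb_hidden` is hypothetical, and the output size of 10 is an assumption read off the last entry of the `neurons` tuples passed to `get_cross_scores`; how `NeuralNetwork.create_model` actually interprets that tuple is not shown here.

```python
def rule_of_thumb_hidden(n_inputs: int, n_outputs: int) -> int:
    """Two thirds of the input size plus the output size, rounded to an integer."""
    return round(2 * n_inputs / 3) + n_outputs


# Feature counts of the datasets compared in this notebook;
# 10 outputs is an assumption, not something the notebook states.
for n_inputs in (132, 144, 120):
    print(n_inputs, "inputs ->", rule_of_thumb_hidden(n_inputs, 10), "hidden neurons")
```

The hidden sizes in the tuples used in the cells (for example 60 + 30 = 90 for 132 inputs) are close to these values without matching them exactly.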

In [1]:
import sys
sys.path.append("..")
from src.model import NeuralNetwork
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from pprint import pprint
import numpy as np
import tensorflow


from numpy.random import seed
seed(1)
tensorflow.random.set_seed(1)

In [2]:
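def get_cross_scores(path, neurons):
    """Run 5-fold cross validation on the CSV at `path`.

    Returns the mean, standard deviation and per-fold values of the
    accuracy and loss for a network built with the given `neurons`.
    """
    data = pd.read_csv(path)
    x = data.drop("class", axis=1)  # features
    y = data["class"]               # target labels

    kf = KFold(n_splits=5)  # unshuffled, contiguous folds

    acc = []
    loss = []

    # Train a fresh network on each training split and score it on the held-out fold
    for train_index, test_index in kf.split(data):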
        net = NeuralNetwork.create_model(neurons=neurons)
        net.fit(x.iloc[train_index], 
                y.iloc[train_index],
                batch_size=64, 
                epochs=100, 
                verbose=0)
        scores = net.evaluate(x.iloc[test_index], 
                              y.iloc[test_index], verbose=1)
        acc.append(scores[1])
        loss.append(scores[0])
    
    return {"Accuracy" : (np.mean(acc), np.std(acc), acc),
            "Loss" : (np.mean(loss), np.std(loss), loss)}

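`KFold(n_splits=5)` does not shuffle by default, so each test fold is a contiguous block of rows. The pure-Python sketch below mimics that index layout on a toy dataset of 11 rows; `contiguous_folds` is a hypothetical helper written only for illustration and is not part of scikit-learn. If the CSV files happen to be ordered by class (an assumption, nothing in this notebook confirms it), `KFold(shuffle=True)` or `StratifiedKFold` may give more balanced folds.

```python
def contiguous_folds(n_samples: int, n_splits: int):
    """Yield (train, test) index lists laid out like an unshuffled K-fold split."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(contiguous_folds(11, 5))
for train, test in folds:
    print("test rows:", test)
```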
## First unscaled dataset
The first model is tested on the unscaled dataset, which has 132 features.

In [3]:
res_1 = get_cross_scores("../data/processed/initial/train_unscaled.csv", (132, 60, 30, 10))
pprint(res_1)

{'Accuracy': (0.06690248474478722,
              0.04883372520743649,
              [0.06777777522802353,
               0.15555556118488312,
               0.008888889104127884,
               0.041111111640930176,
               0.061179086565971375]),
 'Loss': (nan, nan, [nan, nan, 2.3502118587493896, 2.2813379764556885, nan])}


## First scaled dataset
The second model is tested on the scaled dataset: the same 132 features, scaled with the Standard Scaler.

In [4]:
res_2 = get_cross_scores("../data/processed/initial/train_scaled.csv", (132, 60, 30, 10))
pprint(res_2)

{'Accuracy': (0.5201095044612885,
              0.06681752075511521,
              [0.46666666865348816,
               0.5788888931274414,
               0.4444444477558136,
               0.6177777647972107,
               0.4927697479724884]),
 'Loss': (2.582569193840027,
          0.4710694651929501,
          [2.8165314197540283,
           2.2383999824523926,
           3.2963662147521973,
           1.9310065507888794,
           2.6305418014526367])}


## Extended and scaled dataset
This dataset has more features: 144, scaled with the Standard Scaler.

In [5]:
res_3 = get_cross_scores("../data/processed/extended/train_extended.csv", (144, 70, 30, 10))
pprint(res_3)

{'Accuracy': (0.5585579037666321,
              0.08418217105044334,
              [0.5055555701255798,
               0.6177777647972107,
               0.4655555486679077,
               0.6933333277702332,
               0.510567307472229]),
 'Loss': (2.3159227848052977,
          0.5625732692380198,
          [2.4795339107513428,
           1.969861626625061,
           3.2567317485809326,
           1.5739480257034302,
           2.2995386123657227])}


## PCA dataset
This is a reduced version of the extended scaled dataset, with 120 features obtained by PCA.

In [6]:
res_4 = get_cross_scores("../data/processed/extended/train_pca.csv", (120, 60, 25, 10))
pprint(res_4)

{'Accuracy': (0.5607833385467529,
              0.09981858607316675,
              [0.4444444477558136,
               0.6511111259460449,
               0.47999998927116394,
               0.70333331823349,
               0.5250278115272522]),
 'Loss': (2.307897138595581,
          0.6985271471207426,
          [2.5293829441070557,
           1.7077693939208984,
           3.4707772731781006,
           1.4852628707885742,
           2.3462932109832764])}


## Final decision

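As seen in this notebook, the PCA training set gave the best cross-validation results, with the highest mean accuracy (0.561) and the lowest mean loss (2.308), although its margin over the extended dataset (0.559 and 2.316) is small compared with the fold-to-fold standard deviation. It is therefore selected for hyperparameter tuning.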
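To put the comparison between the two best candidates in context, the sketch below recomputes the mean and standard deviation of the per-fold accuracies printed above for the extended and PCA datasets. It uses `statistics.pstdev` (population standard deviation), which corresponds to the default behaviour of `np.std` used in `get_cross_scores`; the dictionary keys are illustrative names only.

```python
from statistics import mean, pstdev

# Per-fold accuracies copied from the outputs of cells 5 and 6
fold_accuracy = {
    "extended": [0.5055555701255798, 0.6177777647972107, 0.4655555486679077,
                 0.6933333277702332, 0.510567307472229],
    "pca": [0.4444444477558136, 0.6511111259460449, 0.47999998927116394,
            0.70333331823349, 0.5250278115272522],
}

# Mean and population standard deviation for each dataset
summary = {name: (mean(acc), pstdev(acc)) for name, acc in fold_accuracy.items()}
for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.4f} +/- {std:.4f}")

# Difference between the two mean accuracies
gap = summary["pca"][0] - summary["extended"][0]
print(f"gap in mean accuracy: {gap:.4f}")
```

The gap between the two means is much smaller than either standard deviation, so the ranking of these two datasets could plausibly change with a different fold assignment.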