# Stratified cross validation

The goal of this notebook is to compare the four candidate training sets and decide which one to use for the grid search that will look for a good model.

The models tested on each dataset are default neural networks whose number of hidden neurons equals two thirds of the input size plus the output size. All other parameters are kept at their defaults, and training lasts 100 epochs.
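As a worked example of the sizing rule above, the following sketch computes the hidden-neuron count for the feature counts used in this notebook. It reads the rule as "two thirds of the input size, plus the output size", which is only one possible interpretation. The helper name `hidden_neurons` and the output size of 10 are assumptions made for illustration and are not taken from `src.model`.

```python
def hidden_neurons(n_inputs, n_outputs):
    """Rule of thumb: two thirds of the input size plus the output size."""
    # Integer arithmetic avoids floating-point rounding surprises
    return (2 * n_inputs) // 3 + n_outputs


# Feature counts of the datasets compared in this notebook;
# the output size of 10 is an assumption
for n_features in (132, 144, 120):
    print(n_features, hidden_neurons(n_features, 10))
```

Under these assumptions the rule gives 98, 106, and 90 hidden neurons for 132, 144, and 120 input features respectively.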

Stratified cross-validation is performed to account for class imbalance in the training set. Class weights are also applied during training.
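To make the two ideas above concrete, here is a minimal sketch on toy data. It checks that each stratified test fold keeps the overall class ratio, and it computes "balanced" class weights, which scikit-learn defines as `n_samples / (n_classes * class_count)`. The toy labels and the variable names are assumptions for illustration only; they are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import class_weight

# Toy imbalanced labels (hypothetical): 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
x_toy = np.zeros((len(y_toy), 1))  # features do not influence the split

# Share of the minority class in each test fold
minority_share = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(x_toy, y_toy):
    minority_share.append(float(np.mean(y_toy[test_idx] == 1)))

# "Balanced" weights: rarer classes receive proportionally larger weights
classes = np.unique(y_toy)
weights = class_weight.compute_class_weight("balanced", classes=classes, y=y_toy)
weights_dict = {int(c): float(w) for c, w in zip(classes, weights)}

print(minority_share)
print(weights_dict)
```

Every test fold keeps the 20% minority share of the full toy set, and the minority class receives a weight four times larger than the majority class, mirroring the 80/20 imbalance.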

In [7]:
import sys
import warnings
from pprint import pprint

import numpy as np
import pandas as pd
import tensorflow
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import class_weight

# Make the project's src package importable from the notebook directory
sys.path.append("..")
from src.model import NeuralNetwork

# Fix the NumPy and TensorFlow seeds for reproducibility
np.random.seed(1)
tensorflow.random.set_seed(1)

warnings.filterwarnings("ignore", category=FutureWarning)

In [8]:

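def get_cross_scores(path, neurons):
    """Run 5-fold stratified cross-validation on the dataset stored at `path`.

    Returns the mean, standard deviation, and per-fold values of the
    accuracy and of the loss.
    """
    data = pd.read_csv(path)
    x = data.drop("class", axis=1)
    y = data["class"]

    kf = StratifiedKFold(n_splits=5)

    # Weight each class inversely to its frequency to offset class imbalance.
    # Recent scikit-learn versions require `classes` and `y` as keyword arguments.
    classes = np.unique(y)
    class_weights = class_weight.compute_class_weight('balanced',
                                                      classes=classes,
                                                      y=y)
    weights_dict = dict(zip(classes, class_weights))
    acc = []
    loss = []

    for train_index, test_index in kf.split(x, y):
        # Train a fresh network on each fold
        net = NeuralNetwork.create_model(neurons=neurons)
        net.fit(x.iloc[train_index],
                y.iloc[train_index],
                batch_size=64,
                epochs=100,
                verbose=0,
                class_weight=weights_dict)
        # scores[0] is the loss, scores[1] is the accuracy
        scores = net.evaluate(x.iloc[test_index],
                              y.iloc[test_index], verbose=1)
        acc.append(scores[1])
        loss.append(scores[0])

    return {"Accuracy": (np.mean(acc), np.std(acc), acc),
            "Loss": (np.mean(loss), np.std(loss), loss)}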

## First unscaled dataset
The first model is tested on the unscaled dataset, which has 132 features.

In [9]:
res_1 = get_cross_scores("../data/processed/initial/train_unscaled.csv", (132, 60, 30, 10))
pprint(res_1)

{'Accuracy': (0.11380249559879303,
              0.0038706225007881373,
              [0.1111111119389534,
               0.1111111119389534,
               0.12111110985279083,
               0.11444444209337234,
               0.11123470216989517]),
 'Loss': (nan,
          nan,
          [nan, 2.3026485443115234, 54869046067200.0, 2.300891160964966, nan])}
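The mean and standard deviation of the loss above are `nan` because a single `nan` fold propagates through the average. A minimal sketch, using the per-fold losses printed above, separates the finite folds from the failed ones. The helper name `split_finite` is hypothetical.

```python
import math

# Per-fold losses copied from the output above
fold_losses = [float("nan"), 2.3026485443115234, 54869046067200.0,
               2.300891160964966, float("nan")]


def split_finite(values):
    """Return the finite values and the number of non-finite ones."""
    finite = [v for v in values if math.isfinite(v)]
    return finite, len(values) - len(finite)


finite_losses, n_failed = split_finite(fold_losses)
print(n_failed, len(finite_losses))

# Any nan in the list makes the plain mean nan as well
print(math.isnan(sum(fold_losses) / len(fold_losses)))
```

Two of the five folds produced a `nan` loss, and one of the remaining folds has a loss many orders of magnitude larger than the others, so the summary statistics for this dataset should be read fold by fold.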


## First scaled dataset
The second model is tested on the scaled dataset, which has the same 132 features after applying a Standard Scaler.

In [44]:
res_2 = get_cross_scores("../data/processed/initial/train_scaled.csv", (132, 60, 30, 10))
pprint(res_2)

{'Accuracy': (0.5743455648422241,
              0.03242956403450645,
              [0.5666666626930237,
               0.5644444227218628,
               0.6377778053283691,
               0.5477777719497681,
               0.5550611615180969]),
 'Loss': (2.180751657485962,
          0.12919648526724084,
          [2.0357139110565186,
           2.1174960136413574,
           2.078993558883667,
           2.3363943099975586,
           2.335160493850708])}


## Extended and scaled dataset
This dataset has more features: 144, also scaled with a Standard Scaler.

In [45]:
res_3 = get_cross_scores("../data/processed/extended/train_extended.csv", (144, 70, 30, 10))
pprint(res_3)

{'Accuracy': (0.6078949451446534,
              0.04855248704962302,
              [0.6266666650772095,
               0.6066666841506958,
               0.6777777671813965,
               0.601111114025116,
               0.5272524952888489]),
 'Loss': (2.0778797388076784,
          0.1999730871773237,
          [1.913411021232605,
           1.946790337562561,
           2.209632635116577,
           1.9098454713821411,
           2.409719228744507])}


## PCA dataset
This is the extended scaled dataset reduced to 120 features by PCA.

In [46]:
res_4 = get_cross_scores("../data/processed/extended/train_pca.csv", (120, 60, 25, 10))
pprint(res_4)

{'Accuracy': (0.6143443346023559,
              0.04814182775076292,
              [0.5766666531562805,
               0.6511111259460449,
               0.6822222471237183,
               0.6122221946716309,
               0.5494994521141052]),
 'Loss': (1.9116067171096802,
          0.22695251120073814,
          [1.879289984703064,
           1.5592150688171387,
           2.2078046798706055,
           1.8132739067077637,
           2.098449945449829])}


## Final decision

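As seen in this notebook, the PCA training set achieved the best cross-validation performance, with the highest mean accuracy (0.614) and the lowest mean loss (1.912). Its margin over the extended scaled dataset (0.608 accuracy, 2.078 loss) is small compared with the fold-to-fold standard deviation of about 0.05, while the unscaled dataset did not train at all (accuracy near 0.11, `nan` loss). The PCA training set is therefore selected for hyperparameter tuning.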
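The selection above can be reproduced with a short sketch that ranks the four datasets by mean cross-validation accuracy. The means and standard deviations are copied from the outputs in this notebook; the dictionary layout and its keys are assumptions made for illustration.

```python
# Mean and standard deviation of the cross-validation accuracy per dataset,
# copied from the outputs printed in this notebook
results = {
    "unscaled": {"acc_mean": 0.11380249559879303, "acc_std": 0.0038706225007881373},
    "scaled": {"acc_mean": 0.5743455648422241, "acc_std": 0.03242956403450645},
    "extended": {"acc_mean": 0.6078949451446534, "acc_std": 0.04855248704962302},
    "pca": {"acc_mean": 0.6143443346023559, "acc_std": 0.04814182775076292},
}

# Rank datasets from highest to lowest mean accuracy
ranking = sorted(results, key=lambda name: results[name]["acc_mean"], reverse=True)
best, runner_up = ranking[0], ranking[1]

# Compare the lead of the best dataset with its fold-to-fold spread
gap = results[best]["acc_mean"] - results[runner_up]["acc_mean"]
within_one_std = gap < results[best]["acc_std"]

print(best, runner_up, round(gap, 4), within_one_std)
```

Comparing the gap with the standard deviation is a rough sanity check rather than a statistical test. It suggests that the PCA and extended datasets perform similarly with this default model.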