<center><h1>[CBIC]</h1></center>
<center><h1>Global Least Squares Support Vector Machine (G-LSSVM)</h1></center>

## Summary:
1. [Methodology](#methodology)

2. [Simulations](#simulations)
    
    2.1 [Global LSSVM](#g-lssvm)

# 1. Methodology <a class="anchor" id="methodology"></a>

The approach was:

1. Repeat 50 times:
    
    1.1 Split the data set into train and test sets in a stratified manner;
    
    1.2 Use 5-fold stratified cross-validation on the training set to choose the best hyperparameters;
    
    1.3 Fit the model on the whole training set with the best hyperparameters;
    
    1.4 Make predictions on the test set.
    
2. Evaluate the distribution of the performance metric on the train and test sets.
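
The protocol above can be sketched with scikit-learn alone. This is only an illustration under stated assumptions: an RBF `SVC` stands in for the LSSVM (whose API is not shown here), the Iris data set replaces the notebook's data sets, the hyperparameter grid is arbitrary, and the number of runs is reduced from 50 to 5 to keep it fast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in data set (assumption)

n_runs = 5  # the notebook uses 50 independent runs
# arbitrary illustrative grid for the stand-in SVC, not the notebook's grid
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": [0.1, 1.0]}

train_scores, test_scores = [], []
for seed in range(n_runs):
    # 1.1 stratified train/test split
    X_tr, X_ts, y_tr, y_ts = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed
    )
    # 1.2 5-fold stratified cross-validation on the training set only
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
    search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
    # 1.3 with refit=True (the default) the best model is refit on the whole training set
    search.fit(X_tr, y_tr)
    # 1.4 predictions on both sets
    train_scores.append(accuracy_score(y_tr, search.predict(X_tr)))
    test_scores.append(accuracy_score(y_ts, search.predict(X_ts)))

# 2. distribution of the metric over the independent runs
summary = {
    "train_mean": float(np.mean(train_scores)),
    "train_std": float(np.std(train_scores)),
    "test_mean": float(np.mean(test_scores)),
    "test_std": float(np.std(test_scores)),
}
print(summary)
```

Putting the scaler inside the pipeline means it is fit only on the training folds, so no information from the validation or test data leaks into the scaling.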

# 2. Simulations <a class="anchor" id="simulations"></a>

## 2.1 Global LSSVM <a class="anchor" id="g-lssvm"></a>

As a baseline, we will use a global model, in this case, a **Global LSSVM**.
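
For reference, the sketch below shows what a binary LSSVM classifier with a Gaussian kernel computes, following the standard textbook formulation. It is not the `lssvm.LSSVM` class imported later in this notebook, whose internals are not shown here; the function names and the toy data are hypothetical. Here `gamma` is the regularization parameter and `sigma` is the kernel width, which is how we assume the two hyperparameters of the grid search are used.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve the LSSVM dual linear system for labels y in {-1, +1}."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    # Block system: [[0, y^T], [y, omega + I / gamma]] [b; alpha] = [0; 1]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvm_predict(X_train, y_train, alpha, b, X_new, sigma):
    """Sign of the decision function sum_i alpha_i * y_i * K(x, x_i) + b."""
    scores = rbf_kernel(X_new, X_train, sigma) @ (alpha * y_train) + b
    return np.where(scores >= 0, 1, -1)

# toy data: two well-separated Gaussian blobs (illustration only)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(-2.0, 0.5, size=(20, 2)),
    rng.normal(2.0, 0.5, size=(20, 2)),
])
y_toy = np.concatenate([-np.ones(20), np.ones(20)])

b, alpha = lssvm_fit(X_toy, y_toy, gamma=1.0, sigma=1.0)
train_pred = lssvm_predict(X_toy, y_toy, alpha, b, X_toy, sigma=1.0)
train_accuracy = float(np.mean(train_pred == y_toy))
print(train_accuracy)
```

Unlike a standard SVM, training reduces to a single linear solve, and every training sample gets a multiplier in `alpha`, so the model is "global" in the sense that all training data contribute to each prediction.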

In [16]:
datasets_names = ['pk', 'vc2c', 'vc3c', 'wf2f', 'wf4f', 'wf24f']
rd_state_file  = 'simulation_results/G-LSSVM - n_init=50 - 2019-08-28 (random states).csv'

In [17]:
import numpy as np
import pandas as pd

# constant hyperparameters:
test_size = 0.5
scaleType = 'min-max'
n_init = 50 # number of independent runs

# hyperparameters grid search:
gammas = np.logspace(-6.0, 6.0, num=7).tolist()
sigmas = np.logspace(-0.5, 3.0, num=5).tolist()

print("gammas = {}".format(gammas))
print("sigmas = {}".format(sigmas))

hps_cases = [
    { "gamma": gamma,
      "sigma": sigma 
    }
    for gamma in gammas
    for sigma in sigmas
]
print("# of hps_cases = {}".format(len(hps_cases)))

# vector of random states for train/test split
random_states = np.unique(pd.read_csv(rd_state_file, usecols=['random_state']).values).tolist()

# random_states = np.random.randint(np.iinfo(np.int32).max, size=n_init).tolist()
cases = [
    {
         "dataset_name": dataset_name
        ,"random_state": random_state
    }
    # one case per (data set, random state) pair
    for dataset_name in datasets_names
    for random_state in random_states
]

print(" ")
print("# of data set runs = {}".format(len(cases)))

gammas = [1e-06, 0.0001, 0.01, 1.0, 100.0, 10000.0, 1000000.0]
sigmas = [0.31622776601683794, 2.371373705661655, 17.78279410038923, 133.3521432163324, 1000.0]
# of hps_cases = 35
 
# of data set runs = 300
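
As a side note, the same 35-point grid can be built with scikit-learn's `ParameterGrid` instead of a nested comprehension. This is an optional alternative, not what the notebook uses; the variable names `grid` and `manual` are introduced here only for the comparison.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

gammas = np.logspace(-6.0, 6.0, num=7).tolist()
sigmas = np.logspace(-0.5, 3.0, num=5).tolist()

# Cartesian product of the two hyperparameter lists, one dict per combination
grid = list(ParameterGrid({"gamma": gammas, "sigma": sigmas}))

# the nested comprehension used in the cell above, for comparison
manual = [{"gamma": g, "sigma": s} for g in gammas for s in sigmas]

print(len(grid), len(manual))
```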


In [20]:
from utils import scale_feat, dummie2multilabel
from load_dataset import datasets
from evaluation import eval_GLSSVM

from lssvm import LSSVM
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from pathlib import Path

filename = f"simulation_results/cbic/temp_glssvm_cbic/G-LSSVM - {cases[0]}.csv"
# raw strings keep the LaTeX backslashes without invalid-escape warnings
header   = ["dataset_name", "random_state", r"$\gamma$", r"$\sigma$", "eigenvalues", "eigenvalues_dtype", "cm_tr", "cm_ts"]

display(filename)

eval_GLSSVM(filename, header, cases[0], scaleType, test_size, hps_cases)

"simulation_results/cbic/temp_glssvm_cbic/G-LSSVM - {'dataset_name': 'pk', 'random_state': 73470257}.csv"

OSError: [Errno 22] Invalid argument: "simulation_results/cbic/temp_glssvm_cbic/G-LSSVM - {'dataset_name': 'pk', 'random_state': 73470257}.csv"
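
A plausible cause of the error above is the file name itself: formatting the whole `case` dict into the path introduces quotes, braces and colons, and some operating systems (Windows in particular) reject such characters in file names. A minimal sketch of a safer naming scheme follows. The helper `case_to_filename` is hypothetical and not part of the notebook's `evaluation` module, and a temporary directory is used so the example does not touch the real results folder.

```python
import tempfile
from pathlib import Path

def case_to_filename(case, out_dir, prefix="G-LSSVM"):
    """Hypothetical helper: build a path from the case fields only."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
    return out_dir / f"{prefix} - {case['dataset_name']} - {case['random_state']}.csv"

example_case = {"dataset_name": "pk", "random_state": 73470257}

with tempfile.TemporaryDirectory() as tmp:
    safe_path = case_to_filename(example_case, Path(tmp) / "temp_glssvm_cbic")
    safe_path.write_text("dataset_name,random_state\n")  # the file can now be created
    created = safe_path.exists()
    safe_name = safe_path.name

print(safe_name)
```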

In [None]:
# from random import shuffle
# shuffle(cases) # better estimation of remaining time

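from joblib import Parallel, delayed

# Same arguments as the single-run call above; the file name uses only the
# case fields, avoiding the quotes, braces and colons of the dict repr.
data = Parallel(n_jobs=4, verbose=51)(
    delayed(eval_GLSSVM)(
        f"simulation_results/cbic/temp_glssvm_cbic/G-LSSVM - {case['dataset_name']} - {case['random_state']}.csv",
        header, case, scaleType, test_size, hps_cases,
    )
    for case in reversed(cases)
)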

Aggregating results:

In [None]:
import glob

path      = r'./temp_glssvm_cbic'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

df_results = pd.concat(li, axis=0, ignore_index=True)
df_results.sort_values(by='dataset_name')

In [None]:
from IPython.display import display

lista = list(df_results.columns)[2:-4]
for name in lista:
    display(df_results[name].value_counts())

Eigenvalues of each run:

In [None]:
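import ast

for i in range(len(df_results)):
    print(df_results['dataset_name'][i])
    # the eigenvalues are stored as the text of a bytes literal; literal_eval parses it without executing code
    raw = ast.literal_eval(df_results['eigenvalues'][i])
    print(
        np.frombuffer(raw, dtype=df_results['eigenvalues_dtype'][i]).shape,
        df_results['eigenvalues_dtype'][i]
    )
    print(" ")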

How to get back `cm_tr` and `cm_ts`:

In [None]:
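import ast
import math

for i in range(len(df_results)):
    print(df_results['dataset_name'][i])
    # each matrix is stored as the text of a bytes literal; literal_eval parses it without executing code
    temp_tr = np.frombuffer(ast.literal_eval(df_results['cm_tr'][i]), dtype='int64')
    temp_ts = np.frombuffer(ast.literal_eval(df_results['cm_ts'][i]), dtype='int64')

    # confusion matrices are square, so the side is the integer square root of the flat length
    print("cm_tr:")
    print(temp_tr.reshape(math.isqrt(len(temp_tr)), -1))

    print("cm_ts:")
    print(temp_ts.reshape(math.isqrt(len(temp_ts)), -1))

    print("\n")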
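
To make the storage format concrete, the sketch below runs the full round trip on a small made-up example and derives the accuracy from the recovered matrix. The writing side (`str(cm.tobytes())`) is an assumption about how `eval_GLSSVM` stores the matrices, inferred from the way they are read back above; the labels are invented for illustration.

```python
import ast
import math

import numpy as np
from sklearn.metrics import confusion_matrix

# made-up labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred).astype("int64")

# writing side (assumed): the raw bytes of the matrix, stored as text in a CSV cell
stored = str(cm.tobytes())

# reading side: parse the bytes literal, then rebuild the square matrix
flat = np.frombuffer(ast.literal_eval(stored), dtype="int64")
n_classes = math.isqrt(flat.size)
cm_back = flat.reshape(n_classes, n_classes)

# accuracy is the sum of the diagonal divided by the total number of samples
accuracy = np.trace(cm_back) / cm_back.sum()
print(cm_back)
print(accuracy)
```

Because the shape is not stored, the recovery relies on the matrix being square; the dtype must also match the one used when writing, which is why the eigenvalues carry a separate `eigenvalues_dtype` column.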