<center><h1>Preliminary results and analysis (G-LSSVM and L-LSSVM)</h1></center>

### 1. Methodology <a class="anchor" id="methodology"></a>

The approach was:

1. Repeat 50 times:
    
    1.1 Split the data set into train and test sets in a stratified manner;
    
    1.2 Use 5-fold stratified cross-validation on the training set to choose the best hyperparameters;
    
    1.3 Fit the model on the whole training set with the best hyperparameters;
    
    1.4 Make predictions on the test set;
    
    
2. Evaluate the distribution of the performance metric on the train and test sets.
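
The steps above can be sketched with scikit-learn as follows. This is only a minimal illustration, not the code behind `eval_GLSSVM`: it assumes a synthetic data set from `make_classification`, uses `SVC` as a stand-in classifier (its `C` and `gamma` parameters are not the LSSVM's $\gamma$ and $\sigma$), skips feature scaling, and runs 5 repetitions instead of 50 to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not one of the notebook's data sets).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)

# Small illustrative grid for the stand-in classifier.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

n_runs = 5  # the notebook uses 50 independent runs
train_acc, test_acc = [], []

for random_state in range(n_runs):
    # 1.1 Stratified train/test split.
    X_tr, X_ts, y_tr, y_ts = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=random_state
    )
    # 1.2 5-fold stratified cross-validation over the grid, on the training set only.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
    # 1.3 With the default refit=True, the best model is refit on the whole training set.
    search.fit(X_tr, y_tr)
    # 1.4 Score the refit model on both sets.
    train_acc.append(search.score(X_tr, y_tr))
    test_acc.append(search.score(X_ts, y_ts))

# 2. Summarise the distribution of the metric over the runs.
print("train accuracy: mean={:.3f}, std={:.3f}".format(np.mean(train_acc), np.std(train_acc)))
print("test accuracy:  mean={:.3f}, std={:.3f}".format(np.mean(test_acc), np.std(test_acc)))
```

Keeping the hyperparameter search inside the training split means the test set is never used for model selection, so the test-set distribution is an honest estimate of generalisation.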

### 2. Simulations <a class="anchor" id="simulations"></a>

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import pandas as pd

from load_dataset import datasets

from devcode.analysis.clustering import cluster_val_metrics
from devcode.analysis.results import process_results
from devcode.models.lssvm import LSSVM
from devcode.models.local_learning import LocalModel
from devcode.utils.evaluation import eval_GLSSVM, eval_LLSSVM
from devcode.utils import scale_feat, dummie2multilabel, load_csv_as_pandas

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.cluster import KMeans

from pathlib import Path
from copy import copy

Dataset:  Features.shape:   # of classes:
vc2c      (310, 6)          2
vc3c      (310, 6)          3
wf24f     (5456, 24)        4
wf4f      (5456, 4)         4
wf2f      (5456, 2)         4
pk        (195, 22)         2


In [3]:
datasets_names = ['pk', 'vc2c', 'vc3c', 'wf2f', 'wf4f', 'wf24f']

#### 2.1 Hyperparameters and initial state <a class="anchor" id="g-lssvm"></a>

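Define the constant hyperparameters and the grid-search values, then load the predefined random states used for the train/test splits.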

In [4]:
# Constant hyperparameters:
test_size = 0.5
scaleType = 'min-max'
n_init    = 50 # number of independent runs

# Hyperparameters grid search:
gammas = np.logspace(-6.0, 6.0, num=7).tolist()
sigmas = np.logspace(-0.5, 3.0, num=5).tolist()

print("gammas = {}".format(gammas))
print("sigmas = {}".format(sigmas))

hps_cases = [
    { "gamma": gamma,
      "sigma": sigma 
    }
    for gamma in gammas
    for sigma in sigmas
]

print("# of hps_cases = {}".format(len(hps_cases)))

# Load vector of random states for train/test split
rnd_states_file = "results/local-results/G-LSSVM - n_init=50 - 2019-08-28 (random states).csv"
random_states   = np.unique(pd.read_csv(rnd_states_file, usecols=['random_state']).values).tolist()

# random_states = np.random.randint(np.iinfo(np.int32).max, size=n_init).tolist()
cases = [
    {
         "dataset_name": dataset_name
        ,"random_state": random_state
    }
    # one case per (data set, random state) pair
    for dataset_name in datasets_names
    for random_state in random_states
]

print(" ")
print("# of data set runs = {}".format(len(cases)))

gammas = [1e-06, 0.0001, 0.01, 1.0, 100.0, 10000.0, 1000000.0]
sigmas = [0.31622776601683794, 2.371373705661655, 17.78279410038923, 133.3521432163324, 1000.0]
# of hps_cases = 35
 
# of data set runs = 300
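
The commented-out `np.random.randint(...)` line in the cell above suggests how such a vector of random states can be drawn. The sketch below, under stated assumptions, draws the seeds, stores them in a CSV file with a `random_state` column, and reloads them with the same `pd.read_csv` / `np.unique` pattern. The file name and the fixed `RandomState(0)` are hypothetical and only make the example repeatable. It is not necessarily how the original file was produced.

```python
import os
import tempfile

import numpy as np
import pandas as pd

n_init = 50  # number of independent runs, as in the cell above

# Fixed generator seed (assumption) so that this sketch is repeatable.
rng = np.random.RandomState(0)
drawn = rng.randint(np.iinfo(np.int32).max, size=n_init)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a temporary directory.
    path = os.path.join(tmp_dir, "random_states.csv")
    pd.DataFrame({"random_state": drawn}).to_csv(path, index=False)

    # Same loading pattern as in the notebook cell.
    random_states = np.unique(
        pd.read_csv(path, usecols=["random_state"]).values
    ).tolist()

print("{} unique random states loaded".format(len(random_states)))
```

Note that `np.unique` sorts the values and drops duplicates, so the reloaded list can be shorter than `n_init` if two draws collide. The 300 runs reported above (6 data sets times 50 states) indicate that no collision occurred in the stored file.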


#### 2.2 Global LSSVM <a class="anchor" id="g-lssvm"></a>

As a baseline, we will use a global model, in this case, a **Global LSSVM**.
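
For reference, a binary LS-SVM classifier with an RBF kernel can be sketched in a few lines of NumPy. In the standard formulation, training reduces to solving the linear system $\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}$ with $\Omega_{ij} = y_i y_j K(x_i, x_j)$. The code below is an illustration under assumptions, not the `devcode.models.lssvm.LSSVM` implementation: the function names are hypothetical, labels are assumed to be in $\{-1, +1\}$, and the multi-class handling needed for the 3- and 4-class data sets is left out.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 * sigma^2))."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def lssvm_fit(X, y, gamma, sigma):
    """Solve the LS-SVM linear system; returns the bias b and multipliers alpha."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma  # gamma is the regularisation parameter
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]


def lssvm_predict(X_train, y_train, alpha, b, X_new, sigma):
    """Predicted labels: sign(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)


# Toy data: two well-separated Gaussian blobs (assumption, for illustration only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)), rng.normal(2.0, 0.5, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

b, alpha = lssvm_fit(X, y, gamma=100.0, sigma=1.0)
y_hat = lssvm_predict(X, y, alpha, b, X, sigma=1.0)
train_accuracy = np.mean(y_hat == y)
print("training accuracy:", train_accuracy)
```

Unlike a standard SVM, every training point receives a (generally non-zero) $\alpha_i$, so the model is not sparse. In exchange, training is a single linear solve rather than a quadratic program.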

In [5]:
global_results_file = "results/local-results/cbic/temp_glssvm_cbic"
# Raw strings keep the LaTeX backslashes from being read as escape sequences.
header = ["dataset_name", "random_state", r"$\gamma$", r"$\sigma$", "eigenvalues", "eigenvalues_dtype", "cm_tr", "cm_ts"]

display(global_results_file)

eval_GLSSVM(datasets, global_results_file, header, cases[0], scaleType, test_size, hps_cases)

'results/local-results/cbic/temp_glssvm_cbic'

PermissionError: [Errno 13] Permission denied: 'results/local-results/cbic/temp_glssvm_cbic'

##### 2.2.1 Process results<a class="anchor" id="g-lssvm"></a>

In [None]:
df_results = load_csv_as_pandas(global_results_file)
process_results(df_results)

#### 2.3 Local LSSVM <a class="anchor" id="l-lssvm"></a>

The same evaluation is repeated below with the `LocalModel` class:

In [6]:
# One (gamma_opt, sigma_opt) column pair per cluster validation metric.
# Raw strings keep the LaTeX backslashes from being read as escape sequences.
temp = []
for metric in cluster_val_metrics:
    temp.append(r"$\gamma_{opt}$ " + "[{}]".format(metric['name']))
    temp.append(r"$\sigma_{opt}$ " + "[{}]".format(metric['name']))
    
display(temp)

header = ["dataset_name", "random_state", "# empty regions", "# homogeneous regions"] + \
    [r"$\gamma_{opt}$ [CV]", r"$\sigma_{opt}$ [CV]"] + \
    temp + \
    ["$k_{opt}$ [CV]"] + \
    ['$k_{opt}$ '+'[{}]'.format(metric['name']) for metric in cluster_val_metrics] + \
    ['cv_score [{}]'.format(metric['name']) for metric in cluster_val_metrics] + \
    ["eigenvalues", "eigenvalues_dtype", "cm_tr", "cm_ts"]

local_results_file = "results/local-results/cbic/temp_llssvm_cbic/results"
temp = eval_LLSSVM(datasets, local_results_file, header, cases[0], scaleType, test_size, hps_cases)

['$\\gamma_{opt}$ [Adjusted Rand Index]',
 '$\\sigma_{opt}$ [Adjusted Rand Index]',
 '$\\gamma_{opt}$ [Adjusted Mutual Information]',
 '$\\sigma_{opt}$ [Adjusted Mutual Information]',
 '$\\gamma_{opt}$ [V-measure]',
 '$\\sigma_{opt}$ [V-measure]',
 '$\\gamma_{opt}$ [Fowlkes-Mallows]',
 '$\\sigma_{opt}$ [Fowlkes-Mallows]',
 '$\\gamma_{opt}$ [Silhouette]',
 '$\\sigma_{opt}$ [Silhouette]',
 '$\\gamma_{opt}$ [Calinski-Harabasz]',
 '$\\sigma_{opt}$ [Calinski-Harabasz]',
 '$\\gamma_{opt}$ [Davies-Bouldin]',
 '$\\sigma_{opt}$ [Davies-Bouldin]',
 '$\\gamma_{opt}$ [Dunn]',
 '$\\sigma_{opt}$ [Dunn]',
 '$\\gamma_{opt}$ [Final Prediction Error]',
 '$\\sigma_{opt}$ [Final Prediction Error]',
 '$\\gamma_{opt}$ [Akaike Information Criteria]',
 '$\\sigma_{opt}$ [Akaike Information Criteria]',
 '$\\gamma_{opt}$ [Bayesian Information Criteria]',
 '$\\sigma_{opt}$ [Bayesian Information Criteria]',
 '$\\gamma_{opt}$ [Minimum Description Length]',
 '$\\sigma_{opt}$ [Minimum Description Length]']

TypeError: fit() got an unexpected keyword argument 'Cluster_params'
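
For readers unfamiliar with local learning, the general idea can be sketched as follows: partition the input space with K-means, fit one classifier per region, and route each new sample to the model of its nearest centroid. This is an assumption-laden illustration, not the `LocalModel` implementation. The function names are hypothetical, `SVC` stands in for the local LSSVM, labels are assumed to be integers, and a region containing a single class (a "homogeneous" region) simply stores that class label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.svm import SVC


def fit_local_models(X, y, k, random_state=0):
    """Partition X into k regions and fit one model per region."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    models = {}
    for region in range(k):
        mask = km.labels_ == region
        classes = np.unique(y[mask])
        if classes.size == 1:
            # Homogeneous region: no classifier needed, store the constant label.
            models[region] = classes[0]
        else:
            models[region] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return km, models


def predict_local(km, models, X_new):
    """Route each sample to the model of its nearest centroid."""
    regions = km.predict(X_new)
    y_pred = np.empty(X_new.shape[0], dtype=int)
    for region, model in models.items():
        mask = regions == region
        if not mask.any():
            continue
        if hasattr(model, "predict"):
            y_pred[mask] = model.predict(X_new[mask])
        else:
            y_pred[mask] = model  # constant label of a homogeneous region
    return y_pred


# Toy data (assumption): four blobs folded into two classes.
X, blob_id = make_blobs(n_samples=300, centers=4, random_state=0)
y = blob_id % 2

k = 3
km, models = fit_local_models(X, y, k)
y_pred = predict_local(km, models, X)
n_homogeneous = sum(not hasattr(m, "predict") for m in models.values())
print("homogeneous regions:", n_homogeneous)
print("training accuracy:", np.mean(y_pred == y))
```

The number of regions `k` plays the role of an extra hyperparameter, which is presumably why the results header above stores a $k_{opt}$ column for cross-validation and for each cluster validation metric.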

##### 2.3.1 Process results<a class="anchor" id="g-lssvm"></a>

In [None]:
df_results = load_csv_as_pandas(local_results_file)
process_results(df_results)