# SVAE Training on Moderate-dimensional Register Data

In this report, we train our SVAE models on the moderate-dimensional dataset, which is one of the two versions of the register dataset used in this thesis. The two versions are:

- High-dimensional Dataset: the dataset from which we removed only features that are fully correlated with another feature or have zero variance.
- Moderate-dimensional Dataset: the dataset from which we removed features using a correlation threshold of 0.9 and a variance threshold of 1%.
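
The preprocessing code itself is not part of this report, so the following is only a minimal sketch of how such a filter could look. The function name `filter_features` and the toy data are hypothetical, and reading "variance threshold of 1%" as a variance of 0.01 is an assumption.

```python
import numpy as np
import pandas as pd


def filter_features(df, corr_threshold=0.9, var_threshold=0.01):
    """Drop low-variance columns, then one column of each highly correlated pair."""
    # Keep only columns whose variance exceeds the threshold.
    variances = df.var()
    kept = df.loc[:, variances > var_threshold]

    # Absolute pairwise correlations; only the upper triangle is inspected
    # so that each pair of columns is checked once.
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    # Drop every column that correlates above the threshold with an earlier one.
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return kept.drop(columns=to_drop)


# Toy data (hypothetical): "b" duplicates "a", and "c" is constant.
toy = pd.DataFrame({
    "a": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    "b": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    "c": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "d": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
})
filtered = filter_features(toy)
print(list(filtered.columns))
```

In this toy example the constant column and the duplicated column are removed, while the two remaining columns are kept because their correlation stays below the threshold.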

Each dataset is trained with 5-fold cross-validation. The hyperparameters of the SVAEs are tuned in two steps.

The hyperparameters of the **first step** are:
- Number of Hidden Layers in Encoder/Decoder 
- Number of Neurons in Hidden Layers of Encoder/Decoder 
- Number of Hidden Layers in Classifier 
- Number of Neurons in Hidden Layers of Classifier 
- Latent Size 

The hyperparameters of the **second step** are:
- alpha 
- beta 
- weight decay 



After finding the best hyperparameters at the end of hyperparameter tuning, we test the performance on the validation test data, which was not included in the validation training data.

### Importing required packages

In [1]:
import sys
import torch
import argparse
from torch.utils.data import DataLoader
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from custom_reg_dataset import RegisterDataset
from datetime import datetime
from timeit import default_timer as timer
import time
from statistics import mean as mean_calc

import os

import pytorch_warmup as warmup

from models.SVAE import SVAE
#from custom_dataset import TerraDataset
from utils.loss_fn import loss_fn_SVAE

from pytorchtools import EarlyStopping

from training_methods import cv_fold_maker, hyperparameter_tuner, model_test,latent_corr_calc

from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
import random
import numpy as np

## 1. Training on Moderate-dimensional Dataset

In this part of the report, we concentrate on training with the moderate-dimensional dataset. Before training starts, we set the device (GPU if available, otherwise CPU) and convert the training data into dataloaders, which are used in the training process of the SVAEs.

In [2]:
seed = 1
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Importing the validation training data and the validation test data prepared for training on the moderate-dimensional dataset.

In [3]:
moddim_valtrain_5f = pd.read_csv("early_fusion_valtrain_5f_d365.txt", sep = "\t")
moddim_valtest = pd.read_csv("early_fusion_valtest_d365.txt", sep = "\t")

Preparing dataloaders for training folds and validation folds to be used in hyperparameter tuning.

In [4]:
num_folds = 5
batch_size = 64

moddim_Dataloader_list, moddim_dataset_val_list, moddims = cv_fold_maker(moddim_valtrain_5f, 
                                                                                   num_folds, batch_size, seed = seed)

The prepared data are moved to the device (GPU) and added to lists, which are iterated over during 5-fold cross-validation.

The folds are used in the following order:
- Training Folds: 2, 3, 4, 5 / Validation Fold: 1
- Training Folds: 1, 3, 4, 5 / Validation Fold: 2
- Training Folds: 1, 2, 4, 5 / Validation Fold: 3
- Training Folds: 1, 2, 3, 5 / Validation Fold: 4
- Training Folds: 1, 2, 3, 4 / Validation Fold: 5
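
The rotation listed above can be sketched in a few lines of plain Python. This only illustrates the ordering; how `cv_fold_maker` actually assigns samples to folds is not shown in this report, and the name `rotation` is introduced here for illustration.

```python
num_folds = 5
fold_ids = list(range(1, num_folds + 1))

# In each round, one fold is held out for validation and the rest are used for training.
rotation = [
    ([f for f in fold_ids if f != val_fold], val_fold)
    for val_fold in fold_ids
]

for train_folds, val_fold in rotation:
    print("Training Folds:", train_folds, "/ Validation Fold:", val_fold)
```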

In [5]:
moddim_X_val_list = []
moddim_y_val_list = []

for dataset in moddim_dataset_val_list:
    X_val = dataset.x
    X_val = X_val.to(device)
    
    y_val = dataset.y
    y_val = y_val.to(device)
    
    moddim_X_val_list.append(X_val)
    moddim_y_val_list.append(y_val)

In the code above, we create lists for the validation fold data and the validation fold labels.

### 1.1 Hyperparameter Tuning Step 1: NN Architecture

In this first step of hyperparameter tuning, we will tune the hyperparameters related to Neural Network Architectures.

The parameters that are held fixed throughout this hyperparameter tuning step are:

In [6]:
batch_size_list = [64]
w_decay_list = [10**-4]
alpha_list = [1]
beta_list = [1]
lr_list= [0.001]
es_thr = 100 

The parameters which will be tuned in this hyperparameter tuning step are:

In [7]:
#encoder number of layers = 1, 3, 5 / encoder number of neurons each layer = 128, 256, 512
encoder_layer_list = [[moddims, 128], [moddims, 256], [moddims, 512], [moddims, 128, 128, 128], \
                      [moddims, 256, 256, 256], [moddims, 512, 512, 512], [moddims, 128, 128, 128, 128, 128] ,\
                     [moddims, 256, 256, 256, 256, 256], [moddims, 512, 512, 512, 512, 512]]
#classifier number of layers = 1, 3, 5 / classifier number of neurons each layer = 128, 256, 512
classifier_layer_list = [[128, 1], [256, 1], [512, 1], [128,128,128, 1], [256,256,256, 1], [512, 512, 512, 1],\
                        [128, 128, 128, 128, 128, 1], [256, 256, 256, 256, 256, 1], [512, 512, 512, 512, 512, 1]]
#number of dimensions in the latent space
latent_size_list = [2, 8, 32, 64, 128]
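
The hand-written architecture lists follow a regular pattern (depths 1, 3, 5 and widths 128, 256, 512), so they could also be generated programmatically. The sketch below reproduces the same lists in the same order; the value of `moddims` is a hypothetical placeholder, since the real value is returned by `cv_fold_maker`.

```python
moddims = 100  # hypothetical placeholder for the number of input features

depths = [1, 3, 5]
widths = [128, 256, 512]

# Encoder specs start with the input dimension, followed by the hidden layers.
encoder_layer_list = [[moddims] + [w] * d for d in depths for w in widths]
# Classifier specs list the hidden layers and end with a single output unit.
classifier_layer_list = [[w] * d + [1] for d in depths for w in widths]
latent_size_list = [2, 8, 32, 64, 128]

# If every combination is evaluated, step 1 covers 9 * 9 * 5 settings.
n_settings = len(encoder_layer_list) * len(classifier_layer_list) * len(latent_size_list)
print(n_settings)
```

Generating the lists this way makes it harder to mistype a layer count when the depths or widths are changed later.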

Now we use the *hyperparameter_tuner* function to train the SVAE models.

In [8]:
moddim_report_step1,moddim_loss_step1_list, moddim_loss_step1_names = hyperparameter_tuner(device, moddims, 
                        num_folds, moddim_Dataloader_list, moddim_X_val_list, moddim_y_val_list, batch_size_list,
                        w_decay_list, alpha_list, beta_list, lr_list,
                        encoder_layer_list, classifier_layer_list, latent_size_list, seed = seed, es_thr = es_thr)

Training start date is: 2023-05-09 12:00:00.112283
./net_weights/SVAE_models_09-05-2023_12-00-00 directory is created
09-05-2023_12-00-00_loss_logs directory is created under H:\Projects\My Thesis\loss_values\
1 settings have been checked and saved to report file
6 settings have been checked and saved to report file
11 settings have been checked and saved to report file
16 settings have been checked and saved to report file
21 settings have been checked and saved to report file
26 settings have been checked and saved to report file
31 settings have been checked and saved to report file
36 settings have been checked and saved to report file
41 settings have been checked and saved to report file
46 settings have been checked and saved to report file
51 settings have been checked and saved to report file
56 settings have been checked and saved to report file
61 settings have been checked and saved to report file
66 settings have been checked and saved to report file
71 settings have been 

#### Hyperparameter Tuning Step 1 Results

#### Mean AUROC

In [9]:
moddim_report_step1.sort_values(by=["CV_avg_val_auroc"], ascending = False)

| Unnamed: 0 | Model_number | Seed | Batch_size | Encoder_num_neurons | Encoder_num_hidden_layers | Clf_num_neurons | Clf_num_hidden_layers | Latent_size | Alpha | Beta | ... | CV3_val_acc | CV4_val_acc | CV5_val_acc | CV1_val_auroc | CV2_val_auroc | CV3_val_auroc | CV4_val_auroc | CV5_val_auroc | CV_avg_val_acc | CV_avg_val_auroc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | 386 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 1 | 1 | ... | 0.559664 | 0.548523 | 0.597315 | 0.592145 | 0.584928 | 0.561317 | 0.582234 | 0.567565 | 0.586066 | 0.577638 |
| 371 | 372 | 1 | 64 | 512 | 5 | 512 | 1 | 8 | 1 | 1 | ... | 0.542857 | 0.545148 | 0.540268 | 0.565559 | 0.592664 | 0.589434 | 0.584457 | 0.552096 | 0.556779 | 0.576842 |
| 375 | 376 | 1 | 64 | 512 | 5 | 128 | 3 | 2 | 1 | 1 | ... | 0.597479 | 0.605907 | 0.545302 | 0.579730 | 0.588451 | 0.568125 | 0.582750 | 0.564865 | 0.586445 | 0.576784 |
| 380 | 381 | 1 | 64 | 512 | 5 | 256 | 3 | 2 | 1 | 1 | ... | 0.588235 | 0.589030 | 0.611577 | 0.570604 | 0.576072 | 0.565526 | 0.588725 | 0.582581 | 0.596834 | 0.576702 |
| 370 | 371 | 1 | 64 | 512 | 5 | 512 | 1 | 2 | 1 | 1 | ... | 0.605042 | 0.591561 | 0.622483 | 0.588084 | 0.573887 | 0.569028 | 0.583299 | 0.568109 | 0.606253 | 0.576481 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 103 | 104 | 1 | 64 | 512 | 1 | 512 | 1 | 64 | 1 | 1 | ... | 0.515126 | 0.559494 | 0.541107 | 0.526765 | 0.538730 | 0.520160 | 0.542355 | 0.546980 | 0.538080 | 0.534998 |
| 61 | 62 | 1 | 64 | 256 | 1 | 128 | 3 | 8 | 1 | 1 | ... | 0.509244 | 0.530802 | 0.531040 | 0.518152 | 0.566891 | 0.521373 | 0.531604 | 0.529682 | 0.520706 | 0.533540 |
| 8 | 9 | 1 | 64 | 128 | 1 | 256 | 1 | 64 | 1 | 1 | ... | 0.542857 | 0.537553 | 0.537752 | 0.535126 | 0.539710 | 0.549390 | 0.525106 | 0.514159 | 0.534168 | 0.532698 |
| 146 | 147 | 1 | 64 | 128 | 3 | 512 | 1 | 8 | 1 | 1 | ... | 0.542017 | 0.547679 | 0.534396 | 0.511313 | 0.542907 | 0.528851 | 0.536483 | 0.536702 | 0.538227 | 0.531251 |


Based on the results, the best settings from the step 1 hyperparameter tuning are:

- Number of Hidden Layers in Encoder/Decoder = 5
- Number of Neurons in Hidden Layers of Encoder/Decoder = 512
- Number of Hidden Layers in Classifier = 3
- Number of Neurons in Hidden Layers of Classifier = 512
- Latent Size = 2
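
The best setting can also be read off programmatically instead of by eye. The sketch below rebuilds a tiny report from two rows of the step 1 table above (only a few columns are kept) and selects the row with the highest mean validation AUROC; the variable names `report` and `best` are introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the step 1 report above, reduced to the relevant columns.
report = pd.DataFrame({
    "Model_number": [372, 386],
    "Encoder_num_neurons": [512, 512],
    "Encoder_num_hidden_layers": [5, 5],
    "Clf_num_neurons": [512, 512],
    "Clf_num_hidden_layers": [1, 3],
    "Latent_size": [8, 2],
    "CV_avg_val_auroc": [0.576842, 0.577638],
})

# idxmax returns the index label of the row with the highest mean validation AUROC.
best = report.loc[report["CV_avg_val_auroc"].idxmax()]

print("Model_number:", int(best["Model_number"]))
print("Clf_num_hidden_layers:", int(best["Clf_num_hidden_layers"]))
print("Latent_size:", int(best["Latent_size"]))
```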


### 1.2 Hyperparameter Tuning Step 2: Loss Hyperparameters

In this second step of hyperparameter tuning, we tune the loss hyperparameters: the weight decay used for regularization and the loss weighting parameters alpha and beta.
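
The implementation of `loss_fn_SVAE` is not shown in this report. As a rough illustration of what loss weighting parameters do, the sketch below combines three loss terms under the assumption that alpha scales the classification term and beta scales the KL term. This assignment is a common convention but is not confirmed by the source, and the function name `weighted_svae_loss` is hypothetical.

```python
def weighted_svae_loss(recon_loss, kl_div, clf_loss, alpha=1.0, beta=1.0):
    """Combine reconstruction, KL, and classification terms into one scalar."""
    # Assumption: alpha scales the supervised term and beta scales the KL term.
    return recon_loss + beta * kl_div + alpha * clf_loss


# With alpha = 10 the classification term dominates the total.
total = weighted_svae_loss(recon_loss=2.0, kl_div=0.5, clf_loss=0.25, alpha=10, beta=1)
print(total)
```

Under this reading, a larger alpha pushes the model toward the prediction task, while a larger beta pulls the latent distribution more strongly toward the prior.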


The parameters that are held fixed throughout this hyperparameter tuning step are:

In [12]:
batch_size_list = [64]
lr_list= [0.001]
es_thr = 100 

# Best encoder/decoder architecture found in the first tuning step (5 hidden layers, 512 neurons each):
encoder_layer_list = [[moddims, 512, 512, 512, 512, 512]]
# Best classifier architecture found in the first tuning step (3 hidden layers, 512 neurons each):
classifier_layer_list = [[512, 512, 512, 1]]

latent_size_list = [2]

The parameters which will be tuned in this hyperparameter tuning step are:

In [13]:
w_decay_list = [1, 10**-2, 10**-4, 10**-6]
alpha_list = [0.1, 1, 10, 100]
beta_list = [0.1, 1, 10, 100]
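
To get a feeling for the size of this search, the grid can be enumerated with `itertools.product`. Whether `hyperparameter_tuner` iterates in this order is not shown here, so the sketch only counts the combinations and the resulting number of model fits under 5-fold cross-validation.

```python
from itertools import product

w_decay_list = [1, 10**-2, 10**-4, 10**-6]
alpha_list = [0.1, 1, 10, 100]
beta_list = [0.1, 1, 10, 100]
num_folds = 5

# Every (weight decay, alpha, beta) combination is one setting.
settings = list(product(w_decay_list, alpha_list, beta_list))
# Each setting is trained once per cross-validation fold.
n_fits = len(settings) * num_folds

print(len(settings), "settings,", n_fits, "model fits")
```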

Now we use the *hyperparameter_tuner* function to train the SVAE models with the architecture found in the first step.


In [14]:
moddim_report_step2,moddim_loss_step2_list, moddim_loss_step2_names = hyperparameter_tuner(device, moddims, 
                        num_folds, moddim_Dataloader_list, moddim_X_val_list, moddim_y_val_list, batch_size_list,
                        w_decay_list, alpha_list, beta_list, lr_list,
                        encoder_layer_list, classifier_layer_list, latent_size_list, seed = seed, es_thr = es_thr)

Training start date is: 2023-05-14 10:17:16.833348
./net_weights/SVAE_models_14-05-2023_10-17-16 directory is created
14-05-2023_10-17-16_loss_logs directory is created under H:\Projects\My Thesis\loss_values\
1 settings have been checked and saved to report file
6 settings have been checked and saved to report file
11 settings have been checked and saved to report file
16 settings have been checked and saved to report file
21 settings have been checked and saved to report file
26 settings have been checked and saved to report file
31 settings have been checked and saved to report file
36 settings have been checked and saved to report file
41 settings have been checked and saved to report file
46 settings have been checked and saved to report file
51 settings have been checked and saved to report file
56 settings have been checked and saved to report file
61 settings have been checked and saved to report file
Training end date is: 2023-05-15 09:52:02.933204
Total Training Time is: 23:3

#### Hyperparameter Tuning Step 2 Results

#### Mean AUROC

In [15]:
moddim_report_step2.sort_values(by=["CV_avg_val_auroc"], ascending = False)

| Unnamed: 0 | Model_number | Seed | Batch_size | Encoder_num_neurons | Encoder_num_hidden_layers | Clf_num_neurons | Clf_num_hidden_layers | Latent_size | Alpha | Beta | ... | CV3_val_acc | CV4_val_acc | CV5_val_acc | CV1_val_auroc | CV2_val_auroc | CV3_val_auroc | CV4_val_auroc | CV5_val_auroc | CV_avg_val_acc | CV_avg_val_auroc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 10 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 10.0 | 1.0 | ... | 0.562185 | 0.556118 | 0.547819 | 0.586805 | 0.589227 | 0.588417 | 0.571800 | 0.573130 | 0.550006 | 0.581876 |
| 10 | 11 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 10.0 | 10.0 | ... | 0.567227 | 0.565401 | 0.536074 | 0.582728 | 0.590380 | 0.587681 | 0.582791 | 0.563522 | 0.555584 | 0.581420 |
| 38 | 39 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 1.0 | 10.0 | ... | 0.599160 | 0.627848 | 0.621644 | 0.564725 | 0.579616 | 0.584365 | 0.587660 | 0.580596 | 0.609641 | 0.579392 |
| 20 | 21 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 1.0 | 0.1 | ... | 0.531092 | 0.591561 | 0.578859 | 0.577849 | 0.592664 | 0.564212 | 0.587058 | 0.567257 | 0.565871 | 0.577808 |
| 50 | 51 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 0.1 | 10.0 | ... | 0.606723 | 0.627004 | 0.626678 | 0.577283 | 0.598545 | 0.568620 | 0.578087 | 0.564303 | 0.612321 | 0.577368 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | 57 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 10.0 | 0.1 | ... | 0.591597 | 0.561181 | 0.593960 | 0.540317 | 0.569025 | 0.538684 | 0.548981 | 0.551751 | 0.582337 | 0.549752 |
| 44 | 45 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 100.0 | 0.1 | ... | 0.569748 | 0.599156 | 0.620805 | 0.544558 | 0.560220 | 0.542075 | 0.543694 | 0.554033 | 0.591099 | 0.548916 |
| 60 | 61 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 100.0 | 0.1 | ... | 0.577311 | 0.594937 | 0.608221 | 0.549796 | 0.554344 | 0.533520 | 0.544057 | 0.560653 | 0.590429 | 0.548474 |
| 30 | 31 | 1 | 64 | 512 | 5 | 512 | 3 | 2 | 100.0 | 10.0 | ... | 0.594118 | 0.583122 | 0.603188 | 0.545299 | 0.564622 | 0.538142 | 0.560902 | 0.532140 | 0.581988 | 0.548221 |


Based on the results, the best settings from the step 2 hyperparameter tuning are:

- alpha = 10
- beta = 1
- weight decay = 1

### 1.3 Moderate-dimensional Data Hyperparameter Tuning Result


In the end, the best SVAE hyperparameter values for the moderate-dimensional dataset are:

- Number of Hidden Layers in Encoder/Decoder = 5
- Number of Neurons in Hidden Layers of Encoder/Decoder = 512
- Number of Hidden Layers in Classifier = 3
- Number of Neurons in Hidden Layers of Classifier = 512
- Latent Size = 2
- alpha = 10
- beta = 1
- weight decay = 1

## 2. Testing on Validation Test Data

Having found the best SVAE hyperparameter values for the moderate-dimensional dataset, we can check the model's performance on the validation test data, which was not included in hyperparameter tuning.

### 2.1 Testing on Validation Test Data for Moderate-dimensional Data

The model is tested with the best hyperparameter values found in Section 1.3:

- Number of Hidden Layers in Encoder/Decoder = 5
- Number of Neurons in Hidden Layers of Encoder/Decoder = 512
- Number of Hidden Layers in Classifier = 3
- Number of Neurons in Hidden Layers of Classifier = 512
- Latent Size = 2
- alpha = 10
- beta = 1
- weight decay = 1

In [19]:
train_df = moddim_valtrain_5f
test_df = moddim_valtest
column_names = test_df.drop(columns = ["persistence_d365","pid"]).columns

In [20]:
batch_size = 64
lr_init = 0.001
es_thr = 100 
es_active = True

In [21]:
#Hyperparameters tuned in step 1
encoder_layers = [moddims, 512, 512, 512, 512, 512]
classifier_layers = [512, 512, 512, 1]
latent_size = 2
#Hyperparameters tuned in step 2
w_decay = 1
alpha = 10
beta = 1

In [22]:
report_df_moddim_data = pd.DataFrame(columns = ["seed", "best_auroc_epoch","best_loss_epoch", "accuracy", "auroc"])

In [23]:
seeds_list = list(range(1,500,10))

In [24]:
now = datetime.now()

# Timestamp format: dd-mm-YYYY_HH-MM-SS
dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")
seeds_corr_list = []

for i, seed in enumerate(seeds_list):
    print("seed is", seed)
    best_auroc_epoch, best_loss_epoch, acc_score, auroc_score, X_test, y_test, eval_z, eval_pred_labels, eval_pred_prob = model_test(device, 
           train_df, test_df, batch_size, w_decay, alpha, 
           beta, lr_init, encoder_layers, classifier_layers,latent_size, 
            epochs = 500, seed = seed, es_active = es_active, es_thr = es_thr, es_patience = 10, 
           datestring = dt_string, shuffling = False)
    
    #Creating new result row
    results = [seed, best_auroc_epoch, best_loss_epoch, acc_score, auroc_score]
    #adding created result row to the report
    report_df_moddim_data.loc[len(report_df_moddim_data)] = results
    #top variables correlated with latent features are calculated
    latent_f_top_vars_list = latent_corr_calc(X_test, column_names, eval_z)
    #adding the top variables calculated with the specified random seed
    seeds_corr_list.append(latent_f_top_vars_list)
    
 
    
current_dir = os.getcwd()
# os.path.join avoids hand-written backslashes and the invalid "\h" escape sequence
report_folder = os.path.join(current_dir, "hyperparameter_reports")
if not os.path.exists(report_folder):
    os.mkdir(report_folder)
    print("hyperparameter_reports directory is created under " + current_dir)
    
report_df_moddim_data.to_csv(os.path.join(report_folder,"moddim_data_valtest_report_"+ dt_string +  ".txt"), sep = "\t", index = False)


seed is 1
./net_weights/SVAE_models_15-05-2023_10-05-40/ directory is created
15-05-2023_10-05-40_loss_logs directory is created under H:\Projects\My Thesis\loss_values\
Accuracy Score on the test data is: 0.6210526315789474
ROC-AUC Score on the test data is: 0.5946135516353435
seed is 11
Accuracy Score on the test data is: 0.6210526315789474
ROC-AUC Score on the test data is: 0.6044669280141436
seed is 21
Accuracy Score on the test data is: 0.6210526315789474
ROC-AUC Score on the test data is: 0.6162900188323917
seed is 31
Accuracy Score on the test data is: 0.6210526315789474
ROC-AUC Score on the test data is: 0.6050626465275377
seed is 41
Accuracy Score on the test data is: 0.6210526315789474
ROC-AUC Score on the test data is: 0.6043612360198317
seed is 51
Accuracy Score on the test data is: 0.6210526315789474
ROC-AUC Score on the test data is: 0.6154156577885391
seed is 61
Accuracy Score on the test data is: 0.6195488721804512
ROC-AUC Score on the test data is: 0.6093575848418463
s

In [25]:
report_df_moddim_data["auroc"].mean()

0.6063466120911642

In [26]:
report_df_moddim_data["auroc"].std()

0.012664601485949263

In [27]:
print(" The AUROC Performance of Best SVAE model on moderate-dimensional dataset over 50 seeds is: ")

print(str(round(report_df_moddim_data["auroc"].mean(),4)) + " +- " + str(round(report_df_moddim_data["auroc"].std(), 4)))

 The AUROC Performance of Best SVAE model on moderate-dimensional dataset over 50 seeds is: 
0.6063 +- 0.0127
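
The summary above uses `Series.std()`, which in pandas is the sample standard deviation (n - 1 in the denominator). The sketch below recomputes mean and spread with the standard library from the seven per-seed AUROC values visible in the output above. Since the remaining 43 seeds are not shown, the result is not expected to match the reported figure over all 50 seeds.

```python
from statistics import mean, pstdev, stdev

# AUROC values of the first seven seeds printed above (the other seeds are not shown).
auroc_scores = [
    0.5946135516353435,
    0.6044669280141436,
    0.6162900188323917,
    0.6050626465275377,
    0.6043612360198317,
    0.6154156577885391,
    0.6093575848418463,
]

avg = mean(auroc_scores)
sample_std = stdev(auroc_scores)       # n - 1 in the denominator, like pandas Series.std()
population_std = pstdev(auroc_scores)  # n in the denominator, like numpy.std() by default

print(f"{avg:.4f} +- {sample_std:.4f}")
```

With only a handful of seeds the two estimators differ noticeably, so it is worth stating which one is reported.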


### Correlation calculation between input variables and latent dimensions
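
The cell in this section sums the per-seed correlation values of each variable and then divides by the total number of seeds. The sketch below shows the same accumulation on hypothetical toy values (the variable names and numbers are made up). It also makes one consequence visible: a variable that is missing for some seed, for example after `dropna()`, is still divided by the total number of seeds, so the missing value counts as zero.

```python
from collections import defaultdict

# Hypothetical per-seed results: variable name -> absolute correlation with one latent dimension.
# "var_b" is missing for the second seed, as it would be after dropna().
per_seed_corr = [
    {"var_a": 0.30, "var_b": 0.10},
    {"var_a": 0.20},
    {"var_a": 0.10, "var_b": 0.20},
]

# Sum the values of each variable over all seeds.
totals = defaultdict(float)
for seed_corr in per_seed_corr:
    for variable, value in seed_corr.items():
        totals[variable] += value

n_seeds = len(per_seed_corr)
# Dividing by the total number of seeds treats a missing value as a contribution of zero.
avg_corr = {variable: total / n_seeds for variable, total in totals.items()}

# Rank variables by their averaged correlation, highest first.
top = sorted(avg_corr.items(), key=lambda item: item[1], reverse=True)
print(top)
```

If missing values should not pull the average down, the sum could instead be divided by the number of seeds in which the variable actually appears.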

In [35]:
current_dir = os.getcwd()
# os.path.join avoids hand-written backslashes and the invalid "\l" escape sequence
latent_folder = os.path.join(current_dir, "latent_variable_correlations")
if not os.path.exists(latent_folder):
    os.mkdir(latent_folder)
    print("latent_variable_correlations directory is created under " + current_dir)
    
    
top_amount = 10
seed_lenght = len(seeds_list)    

latent_total_corr_list = []
#iterating over each latent variable
for f in range(eval_z.shape[1]):
    f_dict = {}
    #iterating over correlation list of each random seed
    for s, seed_corr in enumerate(seeds_corr_list):
        if s == 0:
            f_corr_seed1 = seed_corr[f].droplevel(level = 0).dropna()
            f_corr_seed1 = pd.DataFrame({'Variable':f_corr_seed1.index, 'AVG_abs_corr':f_corr_seed1.values})
            print("Top "+ str(top_amount) + " variables correlated with latent feature " + str(f+1) + " for random seed " + str(seeds_list[s]))
            f_corr_seed1_sorted = f_corr_seed1.sort_values(by = "AVG_abs_corr", ascending = False)[:top_amount]
            print(f_corr_seed1_sorted)
            f_corr_seed1_sorted.to_csv(os.path.join(latent_folder,"moddim_top_" + str(top_amount) + "_correlations_with_latent_variable_" + str(f+1) + "_"  + dt_string +  "_no_shuffling_random_seed" + str(seeds_list[s]) + ".txt"), sep = "\t", index = True)
            
        f_corr = seed_corr[f].droplevel(level = 0).dropna()
        
        for var_num in range(f_corr.size):
            # .iloc makes the positional lookup explicit (the index holds variable names)
            if f_corr.index[var_num] not in f_dict:
                f_dict[f_corr.index[var_num]] = f_corr.iloc[var_num]
            else:
                f_dict[f_corr.index[var_num]] += f_corr.iloc[var_num]
        
    f_total_corr_df = pd.DataFrame.from_dict(f_dict, orient = "index").rename(columns = {0:"AVG_abs_corr"})
    latent_total_corr_list.append(f_total_corr_df)
    
        
        

for i in range(len(latent_total_corr_list)):
    latent_avg_corr = latent_total_corr_list[i] / seed_lenght 
    print("Top "+ str(top_amount) + " variables correlated with latent feature " + str(i+1) + " averaged in " + str(seed_lenght) + " random seeds.")
    latent_avg_corr_sorted = latent_avg_corr.sort_values(by = "AVG_abs_corr",ascending = False)[:top_amount]
    print(latent_avg_corr_sorted)
    
    latent_avg_corr_sorted.to_csv(os.path.join(latent_folder,"moddim_top_" + str(top_amount) + "_correlations_with_latent_variable_" + str(i+1) + "_"  + dt_string +  "_no_shuffling_" + str(seed_lenght) + " seeds" + ".txt"), sep = "\t", index = True)


Top 10 variables correlated with latent feature 1 for random seed 1
          Variable  AVG_abs_corr
0        ICD10_S61      0.124191
1        ICD10_A09      0.117072
2        ICD10_N87      0.106052
3        ICD10_R50      0.104959
4      ATC10_A04AA      0.104132
5        ICD01_Z09      0.102427
6      ATC10_G04BD      0.101400
7      ATC10_G03AA      0.100779
8  prednisolone_m0      0.099080
9      ATC10_J01EA      0.096286
Top 10 variables correlated with latent feature 2 for random seed 1
      Variable  AVG_abs_corr
0  hosptime_10      0.254344
1  ATC10_D02AE      0.220154
2       haq_m0      0.183528
3  ATC10_N05CF      0.179775
4    ICD10_K57      0.177741
5  hosptime_01      0.176591
6    ICD10_I10      0.176181
7  ATC10_B03BA      0.165237
8  ATC01_A06AC      0.163508
9    ICD10_Z71      0.162257
Top 10 variables correlated with latent feature 1 averaged in 50 random seeds.
             AVG_abs_corr
hosptime_10      0.120195
ATC10_C03CA      0.089073
ATC10_B03BA      0.080915