# Preprocessing mixed data with categorical and integer values for VAE

An autoencoder cannot ingest string-valued categorical data directly, so the data must be preprocessed first. The example dataset is prostateSurvival from the R package asaur. It contains the categorical variables grade, stage, ageGroup and status, and the integer-valued variable survTime.

In [17]:
import pandas as pd
import os

#Change working directory
os.chdir('..')

#Load data and print the first rows
X = pd.read_csv('Data/prostateSurvival.csv')
X.head()

Unnamed: 0,grade,stage,ageGroup,survTime,status
0,mode,T1c,80+,18,0
1,mode,T1ab,75-79,23,0
2,poor,T1c,75-79,37,0
3,mode,T2,70-74,27,0
4,mode,T1c,70-74,42,0


The data_processing folder contains the preprocessing functions. First we create a dictionary that maps each variable name to an sklearn OneHotEncoder. As the output shows, every column gets an encoder, including the integer-valued survTime.

In [18]:
from data_processing import *
data_dict = data_dictionary(X)
data_dict

{'grade': OneHotEncoder(handle_unknown='ignore'),
 'stage': OneHotEncoder(handle_unknown='ignore'),
 'ageGroup': OneHotEncoder(handle_unknown='ignore'),
 'survTime': OneHotEncoder(handle_unknown='ignore'),
 'status': OneHotEncoder(handle_unknown='ignore')}
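The source of `data_dictionary` is not shown here, so the following is only a minimal sketch of how such a dictionary could be built. The function name `build_encoder_dict` and the toy data frame are hypothetical, and fitting the encoders inside the function is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def build_encoder_dict(df):
    """Hypothetical stand-in for data_dictionary: one fitted encoder per column."""
    encoders = {}
    for column in df.columns:
        encoder = OneHotEncoder(handle_unknown="ignore")
        # Double brackets keep the 2-D shape that sklearn expects
        encoder.fit(df[[column]])
        encoders[column] = encoder
    return encoders


# Toy data with the same kind of string-valued categories as the dataset
toy = pd.DataFrame({
    "grade": ["mode", "mode", "poor"],
    "stage": ["T1c", "T1ab", "T2"],
})
toy_encoders = build_encoder_dict(toy)

# transform returns a sparse matrix by default, so convert it for printing
grade_onehot = toy_encoders["grade"].transform(toy[["grade"]]).toarray()
print(grade_onehot)
```

With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as a row of zeros instead of raising an error.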

The final preprocessing step creates the variables X_input, X_dict and realisation_counts. X_input concatenates all variables into one matrix, which is fed to the encoder network of the VAE. X_dict is a dictionary whose keys are the variable names and whose values are the preprocessed matrices; the decoder network of the VAE needs it. Finally, realisation_counts holds the number of realisations (distinct categories) of each variable, which is always one for non-categorical variables. It is used to split the decoder output layer so that each split corresponds to exactly one variable.

In [20]:
variable_types = ["cat", "cat", "cat", "int_negBin", "cat"]
X_input, X_dict, realisation_counts = get_inputs_outputs(X, data_dict, variable_types)
print(X_input)
print(X_dict)
print(realisation_counts)

[[1. 0. 0. ... 1. 0. 0.]
 [1. 0. 1. ... 1. 0. 0.]
 [0. 1. 0. ... 1. 0. 0.]
 ...
 [1. 0. 0. ... 1. 0. 0.]
 [1. 0. 0. ... 1. 0. 0.]
 [1. 0. 0. ... 1. 0. 0.]]
{'grade': array([[1., 0.],
       [1., 0.],
       [0., 1.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]]), 'stage': array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]]), 'ageGroup': array([[0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]]), 'survTime': array([[18],
       [23],
       [37],
       ...,
       [ 8],
       [ 6],
       [86]], dtype=int64), 'status': array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])}
[2, 3, 4, 1, 3]
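To illustrate how realisation_counts can drive the splitting, here is a small sketch that cuts a matrix with one column per realisation into per-variable blocks. The matrix `flat_output` is a random placeholder, not real decoder output, and the actual layout of the decoder output layer may differ.

```python
import numpy as np

variable_names = ["grade", "stage", "ageGroup", "survTime", "status"]
realisation_counts = [2, 3, 4, 1, 3]  # values printed above

# Placeholder matrix: 4 rows, one column per realisation (13 in total)
rng = np.random.default_rng(0)
flat_output = rng.random((4, sum(realisation_counts)))

# Column indices where one variable ends and the next begins
split_points = np.cumsum(realisation_counts)[:-1]
blocks = np.split(flat_output, split_points, axis=1)
per_variable = dict(zip(variable_names, blocks))

for name, block in per_variable.items():
    print(name, block.shape)
```

Each block has as many columns as the corresponding entry of realisation_counts, so stacking the blocks side by side restores the original matrix.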


The decoder consists of two parts: one for the categorical variables and one for the other variables. Because the decoder parameterizes the probability distributions we define, and these distributions can have different numbers of parameters, we have to calculate the correct output size for the INT-decoder.

In [21]:
decoder_int_output_size = decoder_int_output_layer_size(variable_types)
decoder_int_output_size

2
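The calculation can be sketched as a sum of parameter counts over the non-categorical variables. The function `sketch_int_output_size` and the table `PARAMS_PER_TYPE` are hypothetical; the value 2 for "int_negBin" is an assumption chosen because a negative binomial distribution has two parameters, which matches the output above.

```python
# Assumed number of distribution parameters per variable type. Only
# "int_negBin" appears in this notebook; other types would need their own entry.
PARAMS_PER_TYPE = {"int_negBin": 2}


def sketch_int_output_size(variable_types):
    """Hypothetical stand-in for decoder_int_output_layer_size."""
    # Categorical variables are handled by the categorical decoder, so skip them
    return sum(PARAMS_PER_TYPE[t] for t in variable_types if t != "cat")


variable_types = ["cat", "cat", "cat", "int_negBin", "cat"]
size = sketch_int_output_size(variable_types)
print(size)
```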

## Summary
The VAE requires:
* X : Original data
* X_input : One-hot encoded data
* X_dict : Dictionary of preprocessed data, keyed by variable name
* variable_types : Probability distribution type for each variable
* realisation_counts : Number of realisations of each variable (one for non-categorical variables)
* decoder_int_output_size : Output size of the INT-decoder