# GINN
This notebook is an example of how to deploy GINN in the wild.
GINN is a missing data imputation algorithm; you can find more information [here](https://arxiv.org/pdf/1905.01907.pdf).

To install this framework, use pip.

In [0]:
!pip install git+https://github.com/spindro/GINN.git

In this example we use the heart dataset and remove feature values completely at random. The dataset contains both categorical and numerical features, so it shows that GINN can handle both at the same time. You only need to specify which columns contain categorical variables and which contain numerical ones.

In [0]:
import csv
import numpy as np
from sklearn import model_selection, preprocessing

from ginn import GINN
from ginn.utils import degrade_dataset, data2onehot

In [0]:
datafile_w = 'heart.csv'

X = np.zeros((303,13),dtype='float')
y = np.zeros((303,1),dtype='int')
with open(datafile_w,'r') as f:
    reader=csv.reader(f)
    for i, row in enumerate(reader):
        data=[float(datum) for datum in row[:-1]]
        X[i]=data
        y[i]=row[-1]
        
cat_cols = [1,2,5,6,8,10,11,12,13]
num_cols = [0,3,4,7,9]
y = np.reshape(y,-1)
num_classes = len(np.unique(y))
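
The two index lists cover 14 columns, although `X` only has 13. Reading ahead, this seems to be because the class label is appended to the features as column 13 before encoding, so it is treated as one more categorical variable. A quick sanity check along these lines (the names `n_features` and `n_total` are ours, introduced only for illustration) confirms that every column is listed exactly once:

```python
# Column layout assumed from the cells in this notebook:
# 13 feature columns (0-12) plus the class label appended as column 13.
cat_cols = [1, 2, 5, 6, 8, 10, 11, 12, 13]
num_cols = [0, 3, 4, 7, 9]
n_features = 13
n_total = n_features + 1  # features plus the label column

all_cols = sorted(cat_cols + num_cols)
# Every column index appears exactly once across the two lists
assert all_cols == list(range(n_total))
# No index is listed as both categorical and numerical
assert set(cat_cols).isdisjoint(num_cols)
```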

We split the dataset into a training set and a test set to show what our framework can do when new data arrives. We induce missingness with a completely-at-random mechanism and remove 20% of the elements from the data matrix of both sets. We also store the matrices indicating whether an element is missing or not.

In [0]:
missingness= 0.2
seed = 42

x_train, x_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, stratify=y
)
cx_train, cx_train_mask = degrade_dataset(x_train, missingness,seed, np.nan)
cx_test,  cx_test_mask  = degrade_dataset(x_test, missingness,seed, np.nan)

cx_tr = np.c_[cx_train, y_train]
cx_te = np.c_[cx_test, y_test]

mask_tr = np.c_[cx_train_mask, np.ones(y_train.shape)]
mask_te = np.c_[cx_test_mask,  np.ones(y_test.shape)]
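
To make the completely-at-random mechanism concrete, here is a minimal NumPy sketch. It is not the implementation of `degrade_dataset`; the helper `mcar_degrade` is hypothetical, and we assume the mask convention suggested by the cell above, where the fully observed label column gets a mask of ones (1 = observed, 0 = removed).

```python
import numpy as np

def mcar_degrade(x, missingness, seed, fill_value=np.nan):
    """Hypothetical helper: hide entries independently with probability `missingness`."""
    rng = np.random.default_rng(seed)
    # 1 marks an observed entry, 0 marks a removed one (assumed convention)
    mask = (rng.random(x.shape) >= missingness).astype(float)
    # Keep observed entries, replace removed ones with the fill value
    degraded = np.where(mask == 1, x, fill_value)
    return degraded, mask

x_demo = np.arange(20, dtype=float).reshape(5, 4)
x_missing, mask = mcar_degrade(x_demo, missingness=0.2, seed=42)
```

Because each entry is dropped independently, the realized fraction of missing values is only approximately 20% on a small matrix.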

Here we preprocess the data by applying a one-hot encoding to the categorical variables.
We get the encoded dataset, three masks that indicate which features are missing and whether they are categorical or numerical, and the new column indices of the categorical variables with their one-hot ranges.

In [0]:
[oh_x, oh_mask, oh_num_mask, oh_cat_mask, oh_cat_cols] = data2onehot(
        np.r_[cx_tr,cx_te], np.r_[mask_tr,mask_te], num_cols, cat_cols
)
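
As an illustration of what happens to a mask under one-hot encoding, the sketch below encodes a single categorical column and repeats its missingness flag over the new columns. This is our assumption about the general idea, not necessarily how `data2onehot` is implemented; the helper `onehot_with_mask` is hypothetical.

```python
import numpy as np

def onehot_with_mask(col, col_mask):
    """Hypothetical helper: one-hot encode one categorical column and expand its mask."""
    observed = col_mask == 1
    # Categories are taken from the observed values only
    categories = np.unique(col[observed])
    # Compare each value against every category
    onehot = (col[:, None] == categories[None, :]).astype(float)
    # Rows with a missing value get an all-zero encoding
    onehot[~observed] = 0.0
    # Every one-hot column inherits the missingness flag of the original feature
    mask = np.repeat(col_mask[:, None], len(categories), axis=1)
    return onehot, mask, categories

col = np.array([0.0, 2.0, np.nan, 1.0, 2.0])
col_mask = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
onehot, mask, categories = onehot_with_mask(col, col_mask)
```

Here one original column becomes three encoded columns, which is why the column indices of the categorical variables change after encoding.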

We split the encoded data back into training and test sets and scale the features with a min-max scaler, which preserves the one-hot encoding.

In [0]:
oh_x_tr = oh_x[:x_train.shape[0],:]
oh_x_te = oh_x[x_train.shape[0]:,:]

# Slice each of the three masks returned by data2onehot
oh_mask_tr = oh_mask[:x_train.shape[0],:]
oh_num_mask_tr = oh_num_mask[:x_train.shape[0],:]
oh_cat_mask_tr = oh_cat_mask[:x_train.shape[0],:]

oh_mask_te = oh_mask[x_train.shape[0]:,:]
oh_num_mask_te = oh_num_mask[x_train.shape[0]:,:]
oh_cat_mask_te = oh_cat_mask[x_train.shape[0]:,:]

scaler_tr = preprocessing.MinMaxScaler()
oh_x_tr = scaler_tr.fit_transform(oh_x_tr)

scaler_te = preprocessing.MinMaxScaler()
oh_x_te = scaler_te.fit_transform(oh_x_te)
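
Why does min-max scaling leave the one-hot columns alone? The sketch below applies the min-max formula `(x - min) / (max - min)` with plain NumPy rather than `MinMaxScaler` itself, on a toy matrix of our own. It assumes each one-hot column contains both a 0 and a 1 among its observed values, so its minimum is 0 and its range is 1.

```python
import numpy as np

# One numerical column followed by a two-column one-hot block (toy data)
x = np.array([
    [63.0, 1.0, 0.0],
    [np.nan, 0.0, 1.0],
    [41.0, 0.0, 1.0],
    [52.0, 1.0, 0.0],
])

# Column-wise minimum and range, ignoring missing entries
col_min = np.nanmin(x, axis=0)
col_range = np.nanmax(x, axis=0) - col_min

# Forward scaling: missing entries stay NaN
x_scaled = (x - col_min) / col_range
# Inverse scaling recovers the original units
x_back = x_scaled * col_range + col_min
```

The numerical column is mapped to [0, 1], the one-hot block is unchanged, and the inverse step is what `inverse_transform` does for us after imputation.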

Now we are ready to impute the missing values on the training set!

In [0]:
imputer = GINN(oh_x_tr,
               oh_mask_tr,
               oh_num_mask_tr,
               oh_cat_mask_tr,
               oh_cat_cols,
               num_cols,
               cat_cols
              )

imputer.fit()
imputed_tr = scaler_tr.inverse_transform(imputer.transform())
### OR ###
# imputed_tr = scaler_tr.inverse_transform(imputer.fit_transform())
# for the one-liners

If new data arrives, you can just reuse the model...
*   Add the new data
*   Impute!



In [0]:
imputer.add_data(oh_x_te,oh_mask_te,oh_num_mask_te,oh_cat_mask_te)

imputed_te = imputer.transform()
imputed_te = scaler_te.inverse_transform(imputed_te[x_train.shape[0]:])
### OR ###
# imputed_te = scaler_te.inverse_transform(imputer.fit_transform()[x_train.shape[0]:])
# for the one-liners

... or fine-tune the model on this new evidence.



In [0]:
imputer.fit(fine_tune=True)

imputed_te_ft = imputer.transform()
imputed_te_ft = scaler_te.inverse_transform(imputed_te_ft[x_train.shape[0]:])
### OR ###
# imputed_te_ft = scaler_te.inverse_transform(imputer.fit_transform()[x_train.shape[0]:])
# for the one-liners

Now use your imputed dataset as you wish!
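
Since this notebook removed the values itself, one possible use is to measure how close the imputed values are to the original ones. The sketch below computes the RMSE only on the removed entries. The helper `imputation_rmse` is hypothetical, the arrays are toy data, and we assume the true matrix, the imputed matrix and the mask share the same column layout with 0 marking a removed entry.

```python
import numpy as np

def imputation_rmse(x_true, x_imputed, mask):
    """Hypothetical helper: RMSE computed only on entries that were removed (mask == 0)."""
    missing = mask == 0
    diff = x_true[missing] - x_imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2)))

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
x_imputed = np.array([[1.0, 2.5], [2.5, 4.0]])

rmse = imputation_rmse(x_true, x_imputed, mask)
```

For categorical columns an accuracy-style score on the removed entries would be more natural than RMSE.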