# Latent Space Stability
---
This notebook explores the stability of the Latent Space in the HighDimSynthesizer for various datasets.

Below we use the **synthesized.insight** module to work with a dataset's latent space. Three utility functions fetch information about the latent space of a dataset. Each one listed below returns a more compact summary than the one above it.
- `get_latent_space()`: returns the entire dataset encoded into the latent space.
    - The columns in the returned dataframe are l_0, ..., l_N-1, m_0, ..., m_N-1, s_0, ..., s_N-1, where N is the number of latent dimensions.
    - m_i is the encoded mean for dimension i in each row,
    - s_i is the encoded stddev for dimension i in each row,
    - and l_i is a sample from the encoded distribution for dimension i in each row.
    - **returns a (num_rows, 3 x num_latent_dimensions) array as a DataFrame.**
- `latent_dimension_usage()`: returns the 'usage' of each dimension for the dataset (typically usage should be between 0 and 1).
    - Note there are two ways to calculate this (see further down for details).
    - **returns a (num_latent_dimensions, 1) array as a DataFrame.**
- `total_latent_space_usage()`
    - **returns a scalar value reflecting the total latent space usage.**
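
To make the column layout described above concrete, here is a minimal sketch that builds a stand-in frame with the same `l_i`, `m_i`, `s_i` columns. The helper `make_toy_latent_space` is hypothetical and is not part of `synthesized`; it also assumes the encoded distribution is Gaussian, so that each sample is drawn as mean plus stddev times standard normal noise.

```python
import numpy as np
import pandas as pd


def make_toy_latent_space(num_rows: int, num_dims: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for the frame layout described above."""
    rng = np.random.default_rng(seed)
    mean = rng.normal(size=(num_rows, num_dims))                # m_i: encoded mean per row
    stddev = rng.uniform(0.1, 1.0, size=(num_rows, num_dims))   # s_i: encoded stddev per row
    # Assumption: Gaussian encoding, so l_i = m_i + s_i * eps with eps ~ N(0, 1).
    sample = mean + stddev * rng.normal(size=(num_rows, num_dims))
    columns = (
        [f'l_{i}' for i in range(num_dims)]
        + [f'm_{i}' for i in range(num_dims)]
        + [f's_{i}' for i in range(num_dims)]
    )
    return pd.DataFrame(np.hstack([sample, mean, stddev]), columns=columns)


toy_latent = make_toy_latent_space(num_rows=5, num_dims=4)
print(toy_latent.shape)  # (num_rows, 3 x num_latent_dimensions)
```

The resulting frame has one row per input row and three blocks of N columns, in the order samples, means, stddevs.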


In [75]:
import os
import warnings
from typing import List
import logging

warnings.filterwarnings(action='ignore', module='numpy')
warnings.filterwarnings(action='ignore', module='pandas')
warnings.filterwarnings(action='ignore', module='sklearn')
warnings.filterwarnings(action='ignore', module='tensorflow')

import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from synthesized import HighDimSynthesizer
from synthesized.common import ValueFactory
from synthesized.insight.latent import get_latent_space, latent_dimension_usage, total_latent_space_usage
from synthesized.insight.dataset import describe_dataset_values, describe_dataset, classification_score


if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
os.chdir(os.path.split(os.path.split(workbookDir)[0])[0])

pd.options.display.max_rows = 10
pd.options.display.max_columns = 50

%load_ext autoreload
%autoreload 2
%matplotlib inline



## Datasets

In [43]:
atlas = pd.read_csv('data/templates/atlas_higgs_detection.csv')
credit = pd.read_csv('data/templates/credit.csv')
insure = pd.read_csv('data/templates/claim_prediction.csv')
telecom = pd.read_csv('data/templates/telecom-churn.csv')

DATASETS = {'atlas': atlas, 'credit': credit, 'insure': insure, 'telecom': telecom}

In [44]:
pd.concat([describe_dataset(ds).set_index('property').T.rename(index=lambda x: name) for name, ds in DATASETS.items()], axis='index', sort=True).rename_axis('dataset', axis='index')

| dataset | num_CategoricalValue | num_ContinuousValue | num_NanValue | num_SamplingValue | total_columns | total_rows |
|---|---|---|---|---|---|---|
| atlas | 2 | 31 | | | 33 | 100000 |
| credit | 6 | 5 | 1.0 | | 11 | 150000 |
| insure | 5 | 3 | | | 8 | 1338 |
| telecom | 17 | 3 | 1.0 | 1.0 | 21 | 7043 |


## Stability Experiments
---
We consider how stable the latent space usage is as the following variables change:
- dataset
- number of training iterations
- number of rows (same sample)
- number of rows (different samples)
- latent space usage type
    - "stddev": usage in each dimension is measured from the average 'stddev' over the dataset, $\sigma\text{-LSU}_n$.
    - "mean": usage in each dimension is measured from the standard deviation of the 'mean' over the dataset, $\mu\text{-LSU}_n$.
        
$\sigma\text{-LSU}_n = 1-\text{AVE}(\sigma_{n,i})$ 

$\mu\text{-LSU}_n = \sqrt{\text{VAR}(\mu_{n,i})}$

Here $n$ indexes the latent dimension and $i$ indexes the row. The average and the variance are both taken over the rows $i$.

For both measures, a value of 0 means the dimension is unused and a value of 1 means it is used.
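
The two formulas above can be sketched directly with pandas. This is an illustration only, not the implementation of `latent_dimension_usage()`: the function name `toy_dimension_usage` is hypothetical, it assumes the `m_i` / `s_i` column naming described earlier, and it uses the population variance (`ddof=0`), which is an assumption about the definition of $\text{VAR}$.

```python
import pandas as pd


def toy_dimension_usage(df_latent: pd.DataFrame, usage_type: str) -> pd.Series:
    """Hypothetical re-implementation of the two usage formulas."""
    if usage_type == 'stddev':
        # sigma-LSU_n = 1 - AVE_i(sigma_{n,i})
        cols = [c for c in df_latent.columns if c.startswith('s_')]
        usage = 1.0 - df_latent[cols].mean(axis=0)
    elif usage_type == 'mean':
        # mu-LSU_n = sqrt(VAR_i(mu_{n,i})); population variance is an assumption
        cols = [c for c in df_latent.columns if c.startswith('m_')]
        usage = df_latent[cols].std(axis=0, ddof=0)
    else:
        raise ValueError(f'unknown usage_type: {usage_type!r}')
    usage.index = [f'z{c.split("_")[1]}' for c in cols]
    return usage


# Dimension 0 is "used": small stddev, means that vary across rows.
# Dimension 1 is "unused": stddev 1 and mean 0 for every row.
toy = pd.DataFrame({
    'm_0': [-1.0, 1.0, -1.0, 1.0], 'm_1': [0.0, 0.0, 0.0, 0.0],
    's_0': [0.1, 0.1, 0.1, 0.1],   's_1': [1.0, 1.0, 1.0, 1.0],
})
sigma_lsu = toy_dimension_usage(toy, 'stddev')
mu_lsu = toy_dimension_usage(toy, 'mean')
```

In this toy case both measures agree: the first dimension has usage close to 1 and the second has usage 0.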


In [117]:
NUM_ITERATIONS = [5000]
NUM_ROWS = [128, 256, 512, 1024]
REPEATS = 3
LSU_TYPES = ['stddev', 'mean']

#### Experiment 1: Same Subsample Stability

In [None]:
random_seed = 1618033
experiment_1_trials = []

for name, ds in DATASETS.items():
    for num_iter in NUM_ITERATIONS:
        for num_rows in NUM_ROWS:
            for trial in range(REPEATS):
                
                # Fixed random_state: every trial uses the same subsample of rows.
                latent_space = get_latent_space(df=ds.sample(num_rows, random_state=random_seed), num_iterations=num_iter)
                
                for usage_type in LSU_TYPES:
                    lsu = latent_dimension_usage(df_latent=latent_space, usage_type=usage_type)
                    lsu = lsu.drop('dimension', axis='columns').T.rename(columns=lambda x: f'z{x}').reset_index(drop=True)
                    
                    df_params = pd.DataFrame.from_records([{
                        'lsu_type': usage_type, 'dataset': name, 'num_iterations': num_iter, 
                        'num_rows': num_rows, 'trial': trial
                    }])
                    
                    df_trial = pd.concat((df_params, lsu), axis='columns')
                    experiment_1_trials.append(df_trial)

experiment_1_trials = pd.concat(experiment_1_trials, axis='index', ignore_index=True)
experiment_1_data = experiment_1_trials.melt(id_vars=['lsu_type', 'dataset', 'num_iterations', 'num_rows', 'trial'], var_name='latent_dim', value_name='usage')


In [None]:
sns.catplot(
    data=experiment_1_data, x='num_rows', y='usage', hue='latent_dim', row='dataset', col='lsu_type',
    kind='bar', aspect=2.2, legend=None, palette=sns.light_palette((230, 90, 60), input="husl", n_colors=32, reverse=True)
)

In [None]:
sns.relplot(
    data=experiment_1_data, x='num_rows', y='usage', hue='latent_dim', row='dataset', col='lsu_type',
    kind='line', aspect=2.2, legend=None, palette=sns.light_palette("navy", reverse=True, n_colors=32)
)
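
One way to put a number on the stability shown in these plots is to measure how much the usage of each latent dimension varies across repeated trials. The sketch below is an assumption about how that could be summarized, not part of the original analysis: it runs on a small hand-made frame (`toy_data`, hypothetical values) with the same long-format columns as `experiment_1_data`, and the helper `usage_spread` is hypothetical.

```python
import pandas as pd

# Hand-made long-format frame with the same columns as experiment_1_data.
toy_data = pd.DataFrame({
    'lsu_type': ['stddev'] * 6,
    'dataset': ['toy'] * 6,
    'num_iterations': [5000] * 6,
    'num_rows': [128] * 6,
    'trial': [0, 1, 2, 0, 1, 2],
    'latent_dim': ['z0', 'z0', 'z0', 'z1', 'z1', 'z1'],
    'usage': [0.9, 0.9, 0.9, 0.1, 0.5, 0.3],
})


def usage_spread(data: pd.DataFrame) -> pd.DataFrame:
    """Mean and sample standard deviation of usage across trials."""
    keys = ['dataset', 'lsu_type', 'num_iterations', 'num_rows', 'latent_dim']
    return data.groupby(keys)['usage'].agg(['mean', 'std']).reset_index()


spread = usage_spread(toy_data)
print(spread)
```

A small `std` for a dimension means its usage is reproducible across trials; here `z0` is perfectly stable while `z1` varies between trials.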

#### Experiment 2: Random Subsample Stability

In [None]:
experiment_2_trials = []

for name, ds in DATASETS.items():
    for num_iter in NUM_ITERATIONS:
        for num_rows in NUM_ROWS:
            for trial in range(REPEATS):
                
                # random_state=trial: each trial draws a different subsample of rows.
                latent_space = get_latent_space(df=ds.sample(num_rows, random_state=trial), num_iterations=num_iter)
                
                for usage_type in LSU_TYPES:
                    lsu = latent_dimension_usage(df_latent=latent_space, usage_type=usage_type)
                    lsu = lsu.drop('dimension', axis='columns').T.rename(columns=lambda x: f'z{x}').reset_index(drop=True)
                    
                    df_params = pd.DataFrame.from_records([{
                        'lsu_type': usage_type, 'dataset': name, 'num_iterations': num_iter, 
                        'num_rows': num_rows, 'trial': trial
                    }])
                    
                    df_trial = pd.concat((df_params, lsu), axis='columns')
                    experiment_2_trials.append(df_trial)

experiment_2_trials = pd.concat(experiment_2_trials, axis='index', ignore_index=True)
experiment_2_data = experiment_2_trials.melt(id_vars=['lsu_type', 'dataset', 'num_iterations', 'num_rows', 'trial'], var_name='latent_dim', value_name='usage')