# Gumbel-Softmax - New feature
This notebook showcases a new feature introduced in version 0.6, Gumbel-Softmax activations!

**Structure of the notebook:**

1. A quick recap on categorical feature synthesis
2. Softmax and the Gumbel-Softmax activation
3. Synthesized categorical features comparison
    * Previous version
    * New version

## A quick recap on categorical feature synthesis
Before synthesizing we typically preprocess our features. In the case of categorical features, one-hot encodings are frequently used in order to transform discrete features into sparse blocks of 1's and 0's. Converting symbolic inputs like categorical features to sparse arrays allows neural network (NN) models to handle the data similarly to very different feature formats like numerical continuous features.

An example:
* Before one-hot encoding:

<style>
th {
  padding-top: 5px;
  padding-right: 10px;
  padding-bottom: 5px;
  padding-left: 10px;
}
</style>

| ID | Gender | AgeRange |
| :------------: | :-------:  | :-------:  |
| 1 | Male | 20-29 |
| 2 | Female | 10-19 |

* After one-hot encoding:

| ID | Gender_Male | Gender_Female | AgeRange_10-19 | AgeRange_20-29 |
| :------------: | :-------:  | :-------:  | :-------:  | :-------:  |
| 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 |
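
The encoding shown in the two tables above can be reproduced with a few lines of plain Python. This is only a hand-rolled sketch for illustration: the `records` and `categories` names and the `one_hot_encode` helper are made up here and are not the preprocessor used by the library. In practice you would more likely reach for `pandas.get_dummies` or scikit-learn's `OneHotEncoder`.

```python
# Toy records mirroring the "before one-hot encoding" table
records = [
    {"ID": 1, "Gender": "Male", "AgeRange": "20-29"},
    {"ID": 2, "Gender": "Female", "AgeRange": "10-19"},
]

# Category order fixed by hand to match the column order of the encoded table
categories = {
    "Gender": ["Male", "Female"],
    "AgeRange": ["10-19", "20-29"],
}

def one_hot_encode(record, categories):
    """Return a flat dict with one 0/1 column per (feature, category) pair."""
    encoded = {"ID": record["ID"]}
    for feature, values in categories.items():
        for value in values:
            encoded[f"{feature}_{value}"] = int(record[feature] == value)
    return encoded

encoded_records = [one_hot_encode(r, categories) for r in records]
for row in encoded_records:
    print(row)
```

Each categorical feature becomes a block of columns in which exactly one entry is 1, which is the property the generator output should ideally preserve.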

GANs attempt to synthesize these sparse distributions as they appear in real data. However, even though the categorical inputs have a sparse format, NN classifiers learn __[logits](https://en.wikipedia.org/wiki/Logit)__, non-normalized scores, for each class represented in the one-hot encoded input. Without an activation layer that can handle this output, you might get synthetic records that look something like this:

| ID | Gender_Male | Gender_Female | AgeRange_10-19 | AgeRange_20-29 |
| :------------: | :-------:  | :-------:  | :-------:  | :-------:  |
| 1 | 0.867 | 0.622 | -0.155 | 0.855 |
| 2 | 0.032 | 1.045 | 0.901 | -0.122 |

This looks messy: it leaves you with the job of inferring a sensible output (e.g. using the class with the highest activation), and it is also a potential flag that a GAN discriminator can use to identify fake samples.
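
As a rough illustration of that manual inference step, the sketch below takes the two raw records from the table above and keeps only the class with the highest activation in each feature block. The `blocks` layout and the `harden` helper are assumptions made for this example, not part of the library.

```python
# Raw generator outputs copied from the table above (ID column dropped)
raw_outputs = [
    [0.867, 0.622, -0.155, 0.855],
    [0.032, 1.045, 0.901, -0.122],
]

# Column ranges of each one-hot block: Gender -> columns 0-1, AgeRange -> columns 2-3
blocks = [(0, 2), (2, 4)]

def harden(row, blocks):
    """Set the highest activation of each block to 1 and everything else to 0."""
    hardened = [0] * len(row)
    for start, stop in blocks:
        block = row[start:stop]
        winner = start + block.index(max(block))
        hardened[winner] = 1
    return hardened

hardened_outputs = [harden(row, blocks) for row in raw_outputs]
print(hardened_outputs)  # [[1, 0, 0, 1], [0, 1, 1, 0]]
```

Note that this argmax step is not differentiable, so it can only be applied as post-processing. During training the discriminator would still see the raw, non-sparse values.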

Let's see what Gumbel-Softmax is and what it can do to fix the issue!

## Softmax and the Gumbel-Softmax activation
Softmax is a differentiable family of functions that map an array of logits to probabilities, i.e. values that are bounded in the range $[0, 1]$ and sum to 1.
These are often used to turn logits into probability distributions from which we can sample. However, the sampling step itself is not differentiable: the samples are obtained from a random process, so gradients cannot flow back through them to the model's parameters, and they can't help us in gradient descent model learning.
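
A minimal, dependency-free sketch of both steps is shown below: a softmax over some made-up logits, followed by a random draw from the resulting distribution. The `softmax` helper and the example values are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Map logits to probabilities; subtracting the max keeps exp() numerically stable."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
probs = softmax(logits)
print(probs)

# Sampling a class index is a random draw: the result is an integer that
# carries no gradient information about the logits it came from.
random.seed(0)
sampled_class = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(sampled_class)
```

The probabilities depend smoothly on the logits, but the sampled index does not, which is the gap the Gumbel-Softmax is meant to close.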

The Gumbel-Softmax (GS) is a special kind of Softmax function that was introduced in 2016 (fun fact: it was proposed at the same time by two independent teams) __[\[1](https://arxiv.org/abs/1611.00712)__, __[2\]](https://arxiv.org/abs/1611.01144)__. It works as a continuous, differentiable approximation of sampling from a categorical distribution. Instead of using the logits directly, __[Gumbel distribution](https://en.wikipedia.org/wiki/Gumbel_distribution)__ noise is added to them before the softmax operation. The model output then combines a deterministic component, the logits that parameterize the categorical distribution, with a stochastic component, the Gumbel noise, which just helps us sample without adding bias to the process.
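
For reference, this can be written compactly. The symbols are introduced here for illustration: $\pi_i$ are the class probabilities over $k$ classes, $g_i$ are independent Gumbel noise samples and $\tau$ is a temperature parameter.

$$g_i = -\log(-\log(u_i)), \quad u_i \sim \text{Uniform}(0, 1)$$

$$y_i = \frac{\exp\left((\log \pi_i + g_i) / \tau\right)}{\sum_{j=1}^{k} \exp\left((\log \pi_j + g_j) / \tau\right)}, \quad i = 1, \dots, k$$

All the randomness lives in $g_i$, which does not depend on the model, so $y$ remains differentiable with respect to the logits.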

A temperature parameter, usually called $\tau$ or $\lambda$ and defined in $]0, \infty[$, is used to tune this distribution between the true categorical distribution (as the temperature approaches 0) and a uniform distribution (as the temperature grows towards infinity). This parameter is usually kept close to 0.
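
The effect of the temperature can be seen in a small standalone sketch. This is not the library's activation layer: `sample_gumbel`, `gumbel_softmax` and the example probabilities are assumptions made for illustration, written with the standard library only.

```python
import math
import random

def sample_gumbel(rng):
    """Draw standard Gumbel noise via the inverse transform g = -log(-log(u))."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def gumbel_softmax(logits, tau, noise):
    """Softmax over noise-perturbed logits, scaled by the temperature tau."""
    scaled = [(logit + g) / tau for logit, g in zip(logits, noise)]
    shift = max(scaled)  # numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [math.log(p) for p in (0.6, 0.3, 0.1)]  # made-up class probabilities
noise = [sample_gumbel(rng) for _ in logits]      # same noise reused for every tau

for tau in (0.1, 1.0, 10.0):
    y = gumbel_softmax(logits, tau, noise)
    print(tau, [round(v, 3) for v in y])
```

With the noise held fixed, a small temperature pushes the output towards a one-hot vector, while a large one flattens it towards uniform. The winning class stays the same in both cases.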

## Synthesized categorical features comparison
To showcase the new feature we will focus on the visual inspection of the categorical outputs, similar to the examples above.
For this comparison we will leverage the library's DRAGAN implementation on the adult dataset. The available snippets should reproduce the results, but the takeaways can be drawn entirely from the cached results of this notebook.
Since the new feature is already part of our DRAGAN implementation, we will inherit from it and make a very simple override so that we use a generator without the GS activation.

In [1]:
from pmlb import fetch_data

from ydata_synthetic.synthesizers.regular import DRAGAN
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'native-country', 'target']


# DRAGAN training
# Defining the training parameters of DRAGAN

noise_dim = 128
dim = 128
batch_size = 500

log_step = 100
epochs = 5  # For the purpose of this demo the number of epochs does not really matter, we are just comparing output formats
learning_rate = 1e-5
beta_1 = 0.5
beta_2 = 0.9

gan_args = ModelParameters(batch_size=batch_size,
                           lr=learning_rate,
                           betas=(beta_1, beta_2),
                           noise_dim=noise_dim,
                           layers_dim=dim)

train_args = TrainParameters(epochs=epochs,
                             sample_interval=log_step)

n_discriminator = 3
sample_size = 100

In [2]:
# Mimicking the previous DRAGAN implementation
class OldDRAGAN(DRAGAN):
    """The simple override of define_gan below keeps the generator from plugging in the GS activation layer.
    This makes it equivalent to the previous implementation.
    The source code will help you understand how it works."""
    def define_gan(self, col_transform_info=None):
        # Ignore the received column transform info and always pass None to the parent
        super().define_gan(col_transform_info=None)

In [3]:
from tensorflow.random import uniform
from tensorflow.dtypes import float32

# Random noise for sampling both generators
noise = uniform([sample_size, noise_dim], dtype=float32)

print('Previous DRAGAN version synthesis')
old_dragan = OldDRAGAN(gan_args, n_discriminator)
old_dragan.train(data, train_args, num_cols, cat_cols)

old_samples = old_dragan.generator(noise, training=False).numpy()

print('New DRAGAN version synthesis')
new_dragan = DRAGAN(gan_args, n_discriminator)
new_dragan.train(data, train_args, num_cols, cat_cols)

new_samples = new_dragan.sample(sample_size)


Previous DRAGAN version synthesis


 20%|██        | 1/5 [00:10<00:42, 10.64s/it]

Epoch: 0 | disc_loss: -0.36303991079330444 | gen_loss: -0.02544642798602581


 40%|████      | 2/5 [00:20<00:31, 10.42s/it]

Epoch: 1 | disc_loss: -0.5309355854988098 | gen_loss: -0.020024392753839493


 60%|██████    | 3/5 [00:31<00:20, 10.34s/it]

Epoch: 2 | disc_loss: -0.4962470829486847 | gen_loss: -0.04659884423017502


 80%|████████  | 4/5 [00:41<00:10, 10.26s/it]

Epoch: 3 | disc_loss: -0.34895461797714233 | gen_loss: -0.09296654164791107


100%|██████████| 5/5 [00:52<00:00, 10.44s/it]

Epoch: 4 | disc_loss: -0.0004760622978210449 | gen_loss: -0.2121104896068573
New DRAGAN version synthesis



 20%|██        | 1/5 [00:14<00:59, 14.90s/it]

Epoch: 0 | disc_loss: -0.2968001067638397 | gen_loss: -0.16787190735340118


 40%|████      | 2/5 [00:30<00:45, 15.09s/it]

Epoch: 1 | disc_loss: -0.4475342333316803 | gen_loss: -0.08200405538082123


 60%|██████    | 3/5 [00:44<00:29, 14.75s/it]

Epoch: 2 | disc_loss: -0.5292890667915344 | gen_loss: -0.06154463812708855


 80%|████████  | 4/5 [00:58<00:14, 14.64s/it]

Epoch: 3 | disc_loss: -0.6085172891616821 | gen_loss: -0.05449139326810837


100%|██████████| 5/5 [01:13<00:00, 14.64s/it]


Epoch: 4 | disc_loss: -0.632192850112915 | gen_loss: -0.03990979120135307


Synthetic data generation: 100%|██████████| 1/1 [00:00<00:00, 95.68it/s]


In [4]:
# Sample both generators
old_samples = old_dragan.generator(noise, training=False).numpy()
new_samples = new_dragan.generator(noise, training=False).numpy()

In [5]:
from pandas import DataFrame

# Get the input/output data preprocessor map to help us isolate the categorical feats output
preprocessor_map = new_dragan.processor.col_transform_info

# Isolate the categorical features and get the feature names
n_num_feats = len(preprocessor_map.numerical.feat_names_in)
cat_out_names = preprocessor_map.categorical.feat_names_out

# Place the categorical parts of the samples in Pandas DataFrames
old_cat_samples = DataFrame(old_samples[:,n_num_feats:], columns=cat_out_names)
new_cat_samples = DataFrame(new_samples[:,n_num_feats:], columns=cat_out_names)

In [6]:
# Inspect the old categorical outputs of the generator
old_cat_samples.head()

|  | workclass_0 | workclass_1 | workclass_2 | workclass_3 | workclass_4 | workclass_5 | workclass_6 | workclass_7 | workclass_8 | education_0 | ... | native-country_34 | native-country_35 | native-country_36 | native-country_37 | native-country_38 | native-country_39 | native-country_40 | native-country_41 | target_0 | target_1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0.401984 | 0.006191 | 0.624994 | 0.195168 | 1.291041 | 1.11658 | -0.820349 | 1.03181 | -0.309409 | 0.91527 | ... | 0.324346 | -0.971263 | -0.089323 | 0.457682 | -0.646701 | 1.144929 | -1.546669 | -0.374628 | 1.099963 | 1.221397 |
| 1 | -0.048472 | 0.215874 | 0.414385 | 0.116093 | 1.136396 | 0.920146 | -0.329587 | 1.086331 | 0.065789 | 1.351759 | ... | -0.24941 | -0.22508 | -0.025395 | 0.351308 | -0.826858 | 1.02949 | -0.388164 | -0.265167 | 0.755968 | 1.115297 |
| 2 | 0.119381 | 0.338525 | 0.518873 | 0.783457 | 1.573044 | 0.659619 | -0.732085 | 1.006257 | 0.063389 | 1.217871 | ... | -0.082185 | -0.571648 | -0.377966 | 0.536919 | -0.731658 | 0.955397 | -0.601435 | 0.502624 | 0.996215 | 1.403981 |
| 3 | 0.594597 | 0.159835 | 0.392797 | 0.160737 | 1.216151 | 0.614027 | -0.298966 | 0.824552 | -0.886809 | 1.000056 | ... | 0.00713 | -0.033384 | -0.287456 | 0.433346 | -0.68139 | 1.312952 | -1.37649 | 0.169668 | 0.686038 | 1.332423 |
| 4 | 0.244797 | 0.546484 | 0.676641 | 0.050418 | 1.352613 | 0.527226 | -0.750618 | 0.742486 | 0.062396 | 1.068858 | ... | -0.248066 | -0.679454 | -0.210461 | 0.603596 | -0.29603 | 0.873734 | -0.138126 | 0.413228 | 0.388745 | 1.675902 |


In [7]:
# Inspect the new categorical outputs of the generator
new_cat_samples.head()

Unnamed: 0,workclass_0,workclass_1,workclass_2,workclass_3,workclass_4,workclass_5,workclass_6,workclass_7,workclass_8,education_0,...,native-country_34,native-country_35,native-country_36,native-country_37,native-country_38,native-country_39,native-country_40,native-country_41,target_0,target_1
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
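
Because the displayed output elides most columns, visual inspection only covers part of each record. A programmatic check is one way to confirm that every categorical block holds exactly one 1. The sketch below assumes that column names follow the `<feature>_<index>` pattern seen in the output above. The `is_valid_one_hot` helper and the toy rows are hypothetical and are not taken from the notebook's samples.

```python
from collections import defaultdict

def is_valid_one_hot(column_names, row, tol=1e-6):
    """Check that every feature block of `row` holds exactly one 1 and otherwise 0s."""
    blocks = defaultdict(list)
    for name, value in zip(column_names, row):
        feature = name.rsplit("_", 1)[0]  # 'workclass_3' -> 'workclass'
        blocks[feature].append(value)
    for values in blocks.values():
        ones = sum(abs(v - 1.0) <= tol for v in values)
        zeros = sum(abs(v) <= tol for v in values)
        if ones != 1 or ones + zeros != len(values):
            return False
    return True

# Toy column names and rows (not the notebook's samples)
names = ["sex_0", "sex_1", "target_0", "target_1"]
print(is_valid_one_hot(names, [0.0, 1.0, 1.0, 0.0]))      # True
print(is_valid_one_hot(names, [0.756, 1.115, 0.3, 1.4]))  # False
```

A check like this would need to run on the full set of columns, not on the truncated display, since the single 1 of a block may sit in one of the hidden columns.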


Did you notice the difference that the Gumbel-Softmax makes in the output of the generators?
This feature is enabled by default in all the regular generators.

Enjoy the improved categorical generation!