# Gumbel-Softmax - New feature
This notebook showcases a new feature introduced in version 0.6, Gumbel-Softmax activations!

**Structure of the notebook:**

1. A quick recap on categorical feature synthesis
2. Softmax and the Gumbel-Softmax activation
3. Synthesized categorical features (before)
4. Synthesized categorical features (now)

## A quick recap on categorical feature synthesis
Before synthesizing, we typically preprocess our features. For categorical features, one-hot encoding is frequently used to transform discrete values into sparse blocks of 1s and 0s. Converting symbolic inputs to sparse numeric arrays lets neural network (NN) models handle them alongside very different feature types, such as continuous numerical features.

An example:
* Before one-hot encoding:
<style>
th {
  padding-top: 5px;
  padding-right: 10px;
  padding-bottom: 5px;
  padding-left: 10px;
}
</style>

| ID | Gender | AgeRange |
| :------------: | :-------:  | :-------:  |
| 1 | Male | 20-29 |
| 2 | Female | 10-19 |

* After one-hot encoding:

| ID | Gender_Male | Gender_Female | AgeRange_10-19 | AgeRange_20-29 |
| :------------: | :-------:  | :-------:  | :-------:  | :-------:  |
| 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 |
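
The encoding above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the `records` and `categories` structures and the `one_hot_encode` helper are hypothetical names chosen for illustration, not part of any library.

```python
# Toy records mirroring the "before" table above.
records = [
    {"ID": 1, "Gender": "Male", "AgeRange": "20-29"},
    {"ID": 2, "Gender": "Female", "AgeRange": "10-19"},
]

# Category order per feature, chosen to match the column order of the "after" table.
categories = {
    "Gender": ["Male", "Female"],
    "AgeRange": ["10-19", "20-29"],
}


def one_hot_encode(record, categories):
    """Return a dict with one 0/1 column per (feature, category) pair."""
    encoded = {"ID": record["ID"]}
    for feature, values in categories.items():
        for value in values:
            encoded[f"{feature}_{value}"] = int(record[feature] == value)
    return encoded


encoded_records = [one_hot_encode(record, categories) for record in records]
for row in encoded_records:
    print(row)
```

Each feature becomes a block of columns in which exactly one entry is 1, which is the sparse format the synthesizer is expected to reproduce.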

GANs attempt to synthesize these sparse distributions as they appear in real data. However, even though the input categorical features are sparse, NN classifiers learn __[logits](https://en.wikipedia.org/wiki/Logit)__, i.e. unnormalized scores, for each class represented in the one-hot encoded input. Without a final layer to handle this output (by convention, let's call it an activation layer), you might get a float output that looks something like this:

| ID | Gender_Male | Gender_Female | AgeRange_10-19 | AgeRange_20-29 |
| :------------: | :-------:  | :-------:  | :-------:  | :-------:  |
| 1 | 0.867 | 0.622 | -0.155 | 0.855 |
| 2 | 0.032 | 1.045 | 0.901 | -0.122 |

This looks messy: it leaves you with the job of inferring a sensible output (e.g. picking the class with the highest activation in each block), and it gives a GAN discriminator an easy cue for identifying fake samples.
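
To make the manual workaround concrete, here is a minimal sketch of the "pick the class with the highest activation" rule applied to the raw outputs from the table above. The `harden` helper and the `feature_blocks` mapping are hypothetical names used only for this illustration.

```python
# Raw outputs copied from the table above (ID column omitted).
raw_rows = [
    {"Gender_Male": 0.867, "Gender_Female": 0.622, "AgeRange_10-19": -0.155, "AgeRange_20-29": 0.855},
    {"Gender_Male": 0.032, "Gender_Female": 1.045, "AgeRange_10-19": 0.901, "AgeRange_20-29": -0.122},
]

# Which output columns belong to which original categorical feature.
feature_blocks = {
    "Gender": ["Gender_Male", "Gender_Female"],
    "AgeRange": ["AgeRange_10-19", "AgeRange_20-29"],
}


def harden(row, feature_blocks):
    """Set the largest value in each feature block to 1 and the rest to 0."""
    hardened = {}
    for columns in feature_blocks.values():
        winner = max(columns, key=lambda column: row[column])
        for column in columns:
            hardened[column] = int(column == winner)
    return hardened


hardened_rows = [harden(row, feature_blocks) for row in raw_rows]
for row in hardened_rows:
    print(row)
```

This recovers valid one-hot blocks, but only as a post-processing step outside the model, so the discriminator still sees the raw float outputs during training.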

Let's see what Gumbel-Softmax can do about it!

## Softmax and the Gumbel-Softmax activation
Softmax is a family of differentiable functions that map an array of logits to probabilities, i.e. values that are bounded in the range $[0, 1]$ and sum to 1.
These are often used to turn logits into probability distributions from which we can sample. However, these samples can't help us in gradient descent learning: drawing a sample is a random, non-differentiable operation, so gradients cannot flow back through it to the model's parameters.
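
As an illustration, the sketch below applies a plain-Python softmax to the `Gender` logits of the first row in the earlier table and then draws a class index from the resulting probabilities. The `softmax` helper and the variable names are assumptions made for this example; the sampling step uses `random.Random.choices` from the standard library.

```python
import math
import random


def softmax(logits):
    """Map logits to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


gender_logits = [0.867, 0.622]  # Gender_Male, Gender_Female for ID 1
probs = softmax(gender_logits)

# Sampling is a discrete random draw: the returned index carries no gradient
# information about the logits that produced the probabilities.
rng = random.Random(0)
sample_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(probs, sample_index)
```

The softmax itself is differentiable with respect to the logits, but `sample_index` is just an integer, which is the obstacle the Gumbel-Softmax is meant to address.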

The Gumbel-Softmax is a special kind of Softmax function. It works as a continuous, differentiable approximation of sampling from a categorical distribution. Instead of using the logits directly, __[Gumbel distribution](https://en.wikipedia.org/wiki/Gumbel_distribution)__ noise is added to them before the softmax operation, so that our model outputs a combination of a deterministic component, parameterized by the logits (the unnormalized log-probabilities of the categorical distribution), and a stochastic component, the Gumbel noise, which only serves to let us sample without adding bias to the process.
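
A minimal plain-Python sketch of this idea follows. It assumes the standard construction in which Gumbel noise is obtained as $g = -\log(-\log(u))$ with $u$ drawn uniformly from $(0, 1)$, and the output is the softmax of $(\text{logits} + g) / \tau$. The function names `sample_gumbel` and `gumbel_softmax` are hypothetical and are not the library's own implementation.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample via the inverse-CDF transform."""
    u = max(rng.random(), 1e-12)  # random() can return 0.0; avoid log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Softmax of Gumbel-perturbed logits, scaled by temperature tau."""
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    shift = max(noisy)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - shift) for value in noisy]
    total = sum(exps)
    return [value / total for value in exps]


rng = random.Random(0)
gender_logits = [0.867, 0.622]  # deterministic component (model output)
relaxed_sample = gumbel_softmax(gender_logits, tau=0.5, rng=rng)
print(relaxed_sample)
```

The randomness enters only through the additive noise, so the output stays a differentiable function of the logits. Over many draws, the index of the largest entry follows the same categorical distribution as the softmax of the logits.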

A temperature parameter, usually called $\tau$ or $\lambda$ and defined in $]0, \infty[$, is used to tune this distribution between the true categorical distribution (as the temperature approaches 0) and a uniform distribution (as it grows towards infinity). This parameter is usually kept close to 0.
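
The effect of the temperature can be seen with a small deterministic sketch. The `noisy_logits` values below are made up for illustration and stand in for logits that already had one draw of Gumbel noise added; `tempered_softmax` is a hypothetical helper name.

```python
import math


def tempered_softmax(values, tau):
    """Softmax of values divided by the temperature tau."""
    scaled = [value / tau for value in values]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - shift) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up example of logits plus one fixed draw of Gumbel noise.
noisy_logits = [1.2, 0.3, -0.5]

outputs_by_tau = {tau: tempered_softmax(noisy_logits, tau) for tau in (0.1, 1.0, 10.0, 100.0)}
for tau, output in outputs_by_tau.items():
    print(tau, [round(value, 3) for value in output])
```

With a small temperature the output is nearly one-hot, which resembles a real categorical sample; with a large temperature it flattens towards equal values for every class.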