In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np

%matplotlib inline

# Introduction to VAEs

- What are they?
- Motivation
    - discriminative models vs generative models, what problems do generative models solve?
- Pros vs other methods
- Cons vs other methods
- Show result, images, videos

In a high-dimensional input space, the area a model allocates to a class may be much larger than the area occupied by the training examples for that class.

## Our journey for VAEs starts with

# Entropy...?

[MENTION]

What comes into your mind when you hear the word entropy?
*You will expect answers like disorder, thermodynamics, statistical mechanics, etc.*

The word entropy is used in information theory, but not because the concept was carried over from the already known Boltzmann entropy (1870s) or the Gibbs entropy (1878), which generalizes the former.

In the context of information theory it means uncertainty or average information content. In my opinion it shouldn't be called entropy: the name is confusing and hides the real meaning and real-world interpretation of the definition.

Entropy in information theory was derived separately from any previous notion of the concept. It was derived as a measure of the uncertainty, or average information, contained in a series of probable events.

## Once upon a time...

## Claude Shannon

![](images/claude_shannon.jpg)

- Father of Information Theory
- Mathematician, electrical engineer and cryptographer
- 1948 landmark paper: A Mathematical Theory of Communication (how to best encode information), the start of Information Theory.
- Developed the concept of information entropy, a measure of the uncertainty in a message.
- Nice fact: he showed that treating the whitespace as part of the alphabet decreases the uncertainty in written language, which gave this cultural practice a quantifiable effect.

## How can I quantify the uncertainty in a system which consists of a series of events?

or in Shannon's own words

![uncertainty question](images/uncertainty_question.png)

[WRITE]

\begin{equation*}
U(X) = -\sum_{i=1}^N P(x_i)\cdot \log_n[P(x_i)]
\end{equation*}

where $X$ is a random variable with possible values $x_i$ and probability mass function $P(X)$.
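
As a quick numerical companion to the definition, here is a minimal sketch of $U(X)$ for a discrete distribution. The function name `uncertainty` and the example distributions are illustrative choices, not part of the original notes; the convention $0\cdot\log 0 = 0$ is assumed.

```python
import numpy as np

def uncertainty(probs, base=2):
    """Shannon uncertainty U(X) of a discrete distribution, in units set by `base`."""
    p = np.asarray(probs, dtype=float)
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    p = p[p > 0]  # assumed convention: terms with P(x_i) = 0 contribute nothing
    return float(-np.sum(p * np.log(p)) / np.log(base))  # change of base: log_n(x) = ln(x) / ln(n)

fair_coin = uncertainty([0.5, 0.5])    # two equally likely outcomes
loaded_coin = uncertainty([0.9, 0.1])  # one outcome is much more likely
fair_die = uncertainty([1/6] * 6)      # six equally likely outcomes

print(fair_coin, loaded_coin, fair_die)
```

The loaded coin comes out less uncertain than the fair coin, and the fair die more uncertain than either.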

[MENTION/WRITE]

In the case of a continuous random variable the sum turns into an integral and the probability mass function becomes a probability density function.

\begin{equation*}
U(X) = -\int p(x)\cdot \log_n[p(x)]\, dx
\end{equation*}


[MENTION]

We are not calling it entropy... yet. Shannon himself did not call it entropy at first: according to two sources he called it *missing information*$^1$ or *uncertainty*$^2$. It was only during a conversation with the great von Neumann that von Neumann suggested he call it entropy. Von Neumann's arguments were, first, that a very similar mathematical construct was already in use in Boltzmann's statistical mechanics and, second and more important, that nobody really knows what entropy is or understands it very well, so in a debate you will always have the advantage. So Shannon introduced the concept as entropy (Shannon entropy).
 





[ASK QUESTION]

Why is it this form?

[1] Avery, J. (2003) Information theory and evolution, World Scientific.

[2] Tribus, M. and McIrvine, E. C. (1971) Energy and information. Scientific American 225 179–188.
Van Campenhout, J. M. and Cover, T. M. (1981) Maximum entropy and conditional entropy. IEEE
Transactions on Information Theory 27 483–489.

## Why did Shannon write it this way?
Remember: it did not come from the already established entropy concept; the relation was established only after the equation had been proposed.

## Especially, why the fancy, good-looking logarithm?


Well, basically because of three main requirements that Shannon imposed on this concept in order for his information theory to be valid.

Let $p_i$ be the known probabilities of the possible events $x_i$, and $U$ the overall/average uncertainty of the outcome.

1. $U$ should be a continuous function of the $p_i$.
2. If all events have the same probability, then $U$ should be monotonically increasing with the number of events.
3. If an event is broken down into two successive events, the original $U$ should be the weighted sum of the individual values of $U$ ($U$ for each successive event).

Let's see if the definition $U(X) = -\sum_{i=1}^N P(x_i)\cdot \log_n[P(x_i)]$ satisfies all three requirements.

Go through each point, providing the derivation in the link below.

[MENTION]

1. Continuous: a small change in any $p_i$ changes $U$ only slightly (no jumps).
2. Makes sense: the more options there are, the more uncertain you are of the outcome.
3. The third is very difficult to understand without an example; you can use the coin and dice example provided. Drawing the diagram that splits the probabilities is also helpful to understand this.
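
Requirements 2 and 3 can also be checked numerically. The sketch below uses the three-outcome decomposition from Shannon's 1948 paper (probabilities $1/2$, $1/3$, $1/6$), which is not necessarily the coin and dice example in the linked notes. The helper name `U` and the variable names are illustrative.

```python
import numpy as np

def U(probs):
    """Uncertainty in bits of a discrete distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # assumed convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

# Requirement 2: for equally likely events, U grows with the number of events.
uniform_values = [U(np.full(n, 1.0 / n)) for n in range(1, 9)]

# Requirement 3: pick among outcomes with probabilities 1/2, 1/3, 1/6 directly...
direct = U([1/2, 1/3, 1/6])
# ...or first choose between two equally likely branches, and then, only in the
# second branch (reached with probability 1/2), choose with probabilities 2/3 and 1/3.
two_step = U([1/2, 1/2]) + 0.5 * U([2/3, 1/3])

print(uniform_values)
print(direct, two_step)
```

Both routes to the same three outcomes give the same uncertainty, which is exactly what the weighted-sum requirement demands.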

[WRITE]

[uncertainty definition notes](notes/1_uncertainty_definition.pdf)

What Shannon proved is (screenshot taken from Shannon's paper)$^a$

![2_uncertainty_theorem.png](images/uncertainty_theorem.png)

[a] [Shannon 1948, A Mathematical Theory of Communication](sources/shannon_1948.pdf)

[MENTION]

The equation for uncertainty was NOT derived from the concept of entropy; it was derived independently, and I think that if we just call it entropy we lose all the richness of understanding what it really means: "uncertainty".

$K$ is just a positive constant; it only sets the unit of measure and has no deeper meaning.

Now let's look at some particularities of this uncertainty formula

[MENTION]

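It is never negative for a discrete random variable (easy to show: each term $-P(x_i)\log_n P(x_i)$ is $\geq 0$ because $0 \leq P(x_i) \leq 1$).

It depends on the base of the logarithm ($n$): $n$ defines the units in which the uncertainty is measured.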

Common cases:

- $n=2$ -> *bits*
- $n=e$ -> *nats*
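
A small sketch of what changing the base does in practice: the same distribution measured in bits and in nats differs only by the constant factor $\ln 2$. The example distribution is an arbitrary illustrative choice.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # illustrative distribution

u_bits = -np.sum(p * np.log2(p))  # base 2 -> bits
u_nats = -np.sum(p * np.log(p))   # base e -> nats

# Changing the base only rescales the result: U_nats = U_bits * ln(2)
print(u_bits, u_nats, u_bits * np.log(2))
```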

$n=2$ is the commonly adopted unit. It gives two very useful interpretations:

- $U(X)$ is the minimum average length of an optimized compressed encoding to communicate a random outcome $x_i$ coming from X.
- $U(X)$ corresponds to the average minimum number of binary questions needed to figure out a random outcome $x_i$ from X.
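
The encoding interpretation can be illustrated with Huffman coding, which is swapped in here as one concrete optimal prefix code (the original notes do not name a specific code). The helper `huffman_lengths` and the example distribution are assumptions for illustration. When all probabilities are powers of $1/2$, the average code length equals $U(X)$; in general it lies between $U(X)$ and $U(X)+1$.

```python
import heapq
import numpy as np

def huffman_lengths(probs):
    """Return the code length (in bits) that Huffman coding assigns to each symbol."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p_a, _, syms_a = heapq.heappop(heap)  # two least likely subtrees
        p_b, _, syms_b = heapq.heappop(heap)
        for s in syms_a + syms_b:
            lengths[s] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p_a + p_b, counter, syms_a + syms_b))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
avg_length = sum(p * l for p, l in zip(probs, lengths))
entropy_bits = -sum(p * np.log2(p) for p in probs)

print(lengths, avg_length, entropy_bits)
```

The same code lengths can be read as the number of yes/no questions needed to identify each outcome when the questions split the remaining probability as evenly as possible.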

[WRITE]

[properties of uncertainty when using bit units](notes/2_uncertainty_bit.pdf)

Binary event

In [None]:
p = np.linspace(0.001, 0.999, 50)
H = - p * np.log2(p) - (1 - p) * np.log2(1 - p)
plt.plot(p, H)
plt.xlabel('p')
plt.ylabel('Uncertainty (bits)')
plt.title('Uncertainty of a binary event');

[MENTION]

In the case of a binary event the uncertainty is zero if the probability of the event happening is 0 or 1: we are certain of the outcome, so there is no uncertainty. The maximum uncertainty occurs when both outcomes have equal probability. This last statement is true for any number of possible events: the uniform distribution has the largest uncertainty. The same holds for a continuous probability distribution on a bounded interval.
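
As an empirical sanity check (not a proof), the sketch below draws random distributions over four events and compares their uncertainty with that of the uniform distribution. The number of events, the sample size, the seed and the helper name `uncertainty_bits` are arbitrary illustrative choices.

```python
import numpy as np

def uncertainty_bits(p):
    """Uncertainty in bits, computed along the last axis."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log2(1) = 0, so zero-probability terms vanish
    return -np.sum(p * np.log2(safe), axis=-1)

rng = np.random.default_rng(0)
random_dists = rng.dirichlet(np.ones(4), size=1000)  # each row sums to 1

max_random = uncertainty_bits(random_dists).max()
uniform_value = uncertainty_bits(np.full(4, 0.25))   # log2(4) = 2 bits

print(max_random, uniform_value)
```

None of the sampled distributions exceeds the uniform one, in line with the statement above.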

[proof of last statement, do not derive just mention](notes/3_proof_max_uncertainty.pdf)

"Constant K merely amounts to a choice of a unit of measure" but has no useful purpose.

In [None]:
# Make data.
p1 = np.linspace(0.2, 0.46, 50)
p2 = np.linspace(0.2, 0.46, 50)
p1, p2 = np.meshgrid(p1, p2)

In [None]:
# Maximum uncertainty occurs at equal probabilities: for three events, p1 = p2 = p3 = 1/3
fig = plt.figure(figsize=(16,7))
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in recent Matplotlib

H = - p1 * np.log2(p1) - p2 * np.log2(p2) - (1 - p1 - p2) * np.log2(1 - p1 - p2)

ax.view_init(azim=0, elev=90)  # look straight down on the (p1, p2) plane
ax.zaxis.set_major_formatter(FormatStrFormatter(''))  # hide z tick labels; the colorbar shows the values
surf = ax.plot_surface(p1, p2, H, cmap=cm.coolwarm, linewidth=0, antialiased=False)
plt.xlabel('p1'); plt.ylabel('p2'); plt.title('Uncertainty of three events');
clb = plt.colorbar(surf, shrink=0.5, aspect=5); clb.ax.set_title('Uncertainty (bits)');

From now on I will use the term *entropy* to refer to the uncertainty equation shown before. The letter $H$ will be used to denote it.

\begin{equation*}
H(X) = -\sum_{i=1}^N P(x_i)\cdot \log_n[P(x_i)]
\end{equation*}
\begin{equation*}
H(X) = -\int p(x)\cdot \log_n[p(x)]\, dx
\end{equation*}

[MENTION]

Let's calculate an example of the continuous version of $H(X)$

[entropy exponential distribution](notes/4_entropy_exponential.pdf)
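
The linked notes are not reproduced here, but the result can be checked numerically. The sketch below assumes the density $p(x)=\lambda e^{-\lambda x}$ for $x \geq 0$ and compares a trapezoidal approximation of the integral with the standard closed form $H(X) = 1 - \ln\lambda$ (in nats). The rate `lam`, the integration range and the grid size are arbitrary illustrative choices.

```python
import numpy as np

lam = 2.0  # rate parameter, chosen arbitrarily for illustration

# Integrate far enough into the tail that the remaining mass is negligible
x = np.linspace(0.0, 40.0 / lam, 200_001)
pdf = lam * np.exp(-lam * x)
integrand = -pdf * np.log(pdf)  # natural log, so the result is in nats

dx = x[1] - x[0]
h_numeric = np.sum(integrand[:-1] + integrand[1:]) * dx / 2  # trapezoidal rule
h_closed = 1.0 - np.log(lam)

print(h_numeric, h_closed)
```

Unlike the discrete case, the continuous version can be negative: the closed form drops below zero as soon as $\lambda > e$.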

Good, moving on... Hopefully you now have a real grasp of the meaning of entropy in this context.

Forget for a moment that this is the equation for entropy, what does the equation look like?

\begin{equation*}
-\sum_{i=1}^N P(x_i)\cdot \log\big( P(x_i)\big)
\end{equation*}

[MENTION]

Let's take a step back... How do we calculate the average of 3 numbers $A_1$, $A_2$ and $A_3$? It is $\frac{A_1 + A_2 + A_3}{3}$, and for $N$ numbers it is $\frac{1}{N}\sum_{i=1}^N A_i$.

Now what if each number had an associated weight $w_i$? Then we get the weighted average $\frac{1}{\sum_{i=1}^N w_i}\sum_{i=1}^N w_i A_i$. And what is $\frac{w_i}{\sum_{i=1}^N w_i}$? We can call it the probability $p_i$ of drawing the quantity $A_i$. So if in the entropy equation we set $A_i=-\log(p_i)$, we come to the conclusion that the entropy is in reality the weighted average of the quantity $-\log(p_i)$.

And what is this quantity $-\log(p_i)$? Well, as Shannon mentioned, entropy is the average amount of information contained in all events; hence this quantity measures the amount of information gained by knowing that the outcome of a draw from all possible events is $x_i$.

Most explanations start from this concept and then expand to the entropy concept. I think it is erroneous to do it this way, because if that were the case, why use a logarithm to denote information in the first place? The concept of uncertainty needs to be understood first; given that its only possible form involves a logarithm, it then follows to define information as $-\log(p_i)$.

Entropy is a weighted average over the quantity $-\log(p_i)$, which Shannon defines as **information**.

$$I_i=-\log(p_i)$$
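
The "entropy is a weighted average of information" reading can be checked directly with `np.average`, using the probabilities as weights. The example distribution is an arbitrary illustrative choice.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # illustrative distribution

info = -np.log2(p)                        # information of each outcome, in bits
entropy_as_average = np.average(info, weights=p)
entropy_direct = -np.sum(p * np.log2(p))  # the entropy formula as written above

print(info, entropy_as_average, entropy_direct)
```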

How does information behave?

In [None]:
fig = plt.figure(figsize=(16,7))

In [None]:
p = np.linspace(0.01, 0.999, 50)
I = - np.log2(p)
plt.plot(p, I)
plt.xlabel('probability')
plt.ylabel('Information (bits)')
plt.title('Information');

[MENTION]

The more an outcome is expected the lower the information gained by knowing this is the outcome.

This of course implies that the probability distribution of the events is known prior to the outcome. Example: suppose I know there is an 80% probability of rain and 20% of no rain (NL percentages :)). Which outcome would give me more information than I already had? In bits, $I(0.8)\approx 0.32$ and $I(0.2)\approx 2.32$; in nats, $0.22$ and $1.61$.
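
The rain example can be reproduced in a few lines. The variable names are illustrative; the probabilities are the ones quoted above.

```python
import numpy as np

p_rain, p_dry = 0.8, 0.2

info_rain = -np.log2(p_rain)  # information gained if it rains (the expected outcome)
info_dry = -np.log2(p_dry)    # information gained if it stays dry (the surprising outcome)

# Weighting each by its probability gives the uncertainty of the forecast
expected_info = p_rain * info_rain + p_dry * info_dry

print(info_rain, info_dry, expected_info)
```

The surprising outcome carries far more information, yet because it is rare the average stays below the 1 bit of a 50/50 forecast.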

Information is zero if I am totally sure of an outcome. I know 100% that gravity will bring me back if I jump; then I jump and indeed I come back to the floor. Did I gain any information?

If the probability of an event is $1/2$, then the information contained is equal to $1$ bit, given that we use base $2$ of course. How much information do I gain by knowing the outcome of flipping a fair coin?

Perhaps more extreme examples.

## Probability $\uparrow$ then Information $\downarrow$

## KL DIVERGENCE