Information theory quantifies the amount of information present. In information theory, the amount of information is characterized as:

- Predictability:
  - Guaranteed events have zero information.
  - Likely events have little information. (Biased dice have little information.)
  - Random events carry more information. (Fair dice have more information.)
- Independent events add information. Tossing a coin twice and getting two heads carries twice the information of tossing it once and getting one head.

In information theory, chaos carries more information.

Information of an event is defined as:

$$I(x) = - \log(P(x))$$
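
As a quick check of the properties listed above, here is a minimal sketch of the formula. It assumes a base-2 logarithm (so information is measured in bits), and the helper name `information` is our own, not something defined earlier.

```python
import math

def information(p):
    """Self-information -log2(p), in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

certain = information(1.0)          # a guaranteed event
one_head = information(0.5)         # one fair coin toss landing heads
two_heads = information(0.5 * 0.5)  # two independent tosses, both heads
```

A guaranteed event carries zero information, and the two independent heads carry exactly twice the information of one head.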




### Entropy 

In deep learning (DL), we often use the cross entropy or the KL-Divergence as our cost function. These terms occur frequently in research papers. In this section, we look at what they mean.

Entropy measures the expected amount of information in a distribution. In data compression, it represents the minimum number of bits needed, on average, to represent the data. Entropy is defined as:
$$
H(y) = \sum_{i} y_i \log \frac{1}{y_i} = -\sum_{i } y_i \log y_{i}
$$
Suppose we have strings composed only of "a", "b" or "c", occurring with probability 25%, 25% and 50% respectively. Using base-2 logarithms, the entropy is:
$$
\begin{align}
H & = 0.25 \log(\frac{1}{0.25}) + 0.25 \log(\frac{1}{0.25})  + 0.5 \log(\frac{1}{0.5}) \\
H  &= 0.25 \cdot 2 + 0.25 \cdot 2  +  0.5 \cdot 1 \\
& = 1.5 \\
\end{align}
$$
We can use bit 0 to represent 'c', bits 10 for 'a' and bits 11 for 'b'. On average, we need 1.5 bits per character to represent a string.
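
The arithmetic above can be reproduced with a short sketch. The `entropy` helper is our own naming, and we assume base-2 logarithms so that the result is in bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

probs = {"a": 0.25, "b": 0.25, "c": 0.5}
codes = {"a": "10", "b": "11", "c": "0"}  # the encoding described in the text

h = entropy(probs.values())
# Expected code length: probability of each symbol times its number of bits.
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
```

Both `h` and `avg_len` come out to 1.5 bits, so this particular encoding reaches the entropy bound.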

#### Cross entropy

Cross entropy is defined as:
$$
H(y, \hat{y}) = \sum_i y_i \log \frac{1}{\hat{y}_i} = -\sum_i y_i \log \hat{y}_i
$$
Entropy measures the minimum number of bits needed to encode information using the optimal scheme. Cross entropy measures the number of bits needed to encode $$y$$ using a scheme optimized for the wrong distribution $$\hat{y}$$. The cross entropy is always higher than the entropy unless both distributions are the same: you need more bits to encode the information if you use a less optimized scheme.

In our previous example, we classify a picture as either a bus, a truck or an airplane. The output probability has the format: (bus, truck, airplane). The true label probability distribution for a bus is (1, 0, 0) and our model prediction can be (0.88, 0.08, 0.04).

The cross entropy of this example is:
$$
\begin{align}
H(y, \hat{y}) &= -\sum_i y_i \log \hat{y}_i \\
&= - 1 \log{0.88} - 0 \cdot \log{0.08} - 0 \cdot \log{0.04} =   - \log{0.88} \\
\end{align}
$$
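
Here is a minimal sketch of the same computation. It assumes the natural logarithm, which is the usual choice in deep learning code, and the `cross_entropy` helper is our own naming.

```python
import math

def cross_entropy(y, y_hat):
    """Cross entropy H(y, y_hat) in nats; terms with y_i = 0 are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

y = (1, 0, 0)               # true label: bus
y_hat = (0.88, 0.08, 0.04)  # model prediction (bus, truck, airplane)

loss = cross_entropy(y, y_hat)
```

Only the predicted probability of the true class matters here, so `loss` equals `-log(0.88)`, roughly 0.128.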

#### KL Divergence

KL Divergence is defined as:
$$
D_{KL}(y~||~\hat{y}) = \sum_i y_i \log \frac{y_i}{\hat{y}_i}
$$
In machine learning, KL Divergence measures the difference between 2 probability distributions. 

<div class="imgcap">
<img src="images/kl.png" style="border:none;width:80%">
</div>

(Source: Wikipedia)

This makes it a very good cost function for penalizing the difference between the true labels and the predictions made by the model. It is particularly suitable when the true label $$y_i$$ is stochastic rather than deterministic (with probability either 0 or 1).
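
A small sketch of the definition, using hypothetical soft labels (the numbers below are made up for illustration) and the natural logarithm:

```python
import math

def kl_divergence(y, y_hat):
    """D_KL(y || y_hat) in nats; terms with y_i = 0 contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(y, y_hat) if t > 0)

y = (0.7, 0.2, 0.1)      # hypothetical stochastic true labels
y_hat = (0.6, 0.3, 0.1)  # hypothetical model prediction

d_forward = kl_divergence(y, y_hat)
d_reverse = kl_divergence(y_hat, y)
d_same = kl_divergence(y, y)
```

The divergence is zero when the two distributions match and positive otherwise. Note that `d_forward` and `d_reverse` differ: KL divergence is not symmetric, so it is not a distance in the strict sense.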

#### Solving cross entropy = solving KL-divergence

KL divergence is simply cross entropy $$H(y, \hat{y})$$ minus entropy $$H(y) $$ (the extra bits needed to encode the data):
$$
\begin{align}
D_{KL}(y~||~\hat{y}) &= \sum_i y_i \log \frac{y_i}{\hat{y}_i} \\
&= \sum_i y_i \log \frac{1}{\hat{y}_i} - \sum_i y_i \log \frac{1}{y_i}  \\
&= H(y, \hat{y}) - H(y) 
\end{align}
$$
The entropy of the true label does not depend on how we model it, i.e. $$\frac{\partial{H(y)}}{\partial{w}} = 0 $$. Therefore, the optimal solution for the KL-divergence is the same as that of the cross entropy.
$$
\begin{align}
D_{KL}(y~||~\hat{y}) &= H(y, \hat{y}) - H(y) \\
\\
\frac{\partial{D_{KL}(y~||~\hat{y})}} {\partial{w}} &= \frac{\partial{H(y, \hat{y})}}{\partial{w}} - \frac{\partial{H(y)}}{\partial{w}} \\ 
&= \frac{\partial{H(y, \hat{y})}}{\partial{w}} \\
\end{align}
$$
> Optimizing KL-divergence is the same as optimizing cross entropy.
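
We can check this numerically. The sketch below uses a made-up true distribution and three made-up candidate predictions; the helper names are our own.

```python
import math

def entropy(y):
    return -sum(t * math.log(t) for t in y if t > 0)

def cross_entropy(y, y_hat):
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

def kl_divergence(y, y_hat):
    return sum(t * math.log(t / p) for t, p in zip(y, y_hat) if t > 0)

y = (0.5, 0.25, 0.25)  # hypothetical true distribution
candidates = [(0.4, 0.4, 0.2), (0.5, 0.3, 0.2), (0.6, 0.2, 0.2)]

# Cross entropy minus KL divergence should equal H(y) for every candidate.
gaps = [cross_entropy(y, q) - kl_divergence(y, q) for q in candidates]
best_by_ce = min(candidates, key=lambda q: cross_entropy(y, q))
best_by_kl = min(candidates, key=lambda q: kl_divergence(y, q))
```

Every entry of `gaps` equals the entropy of `y`, so both costs rank the candidates identically and pick the same best prediction.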

KL-Divergence is more intuitive in the cost function discussion. In some research, we add constraints to KL-divergence to optimize our model. Nevertheless, cross entropy requires less computation than KL-divergence and is used frequently in deep learning.

For a deterministic process, $$y_i$$ is equal to 1 for the true label and 0 otherwise. Hence, only the term of the true class $$c$$ survives, and the cross entropy can be further simplified as:
$$
\begin{align}
H(y, \hat{y}) &= -\sum_i y_i \log \hat{y}_i \\
&= - \log \hat{y}_c \\
\end{align}
$$
### Choice of cost function

**Deep learning is about knowing your costs. A good cost function builds good models.** San Francisco is about 400 miles from Los Angeles, and the gas for the drive costs about $80. When you order food from a restaurant, they do not deliver to homes more than a few miles away: from their perspective, the cost grows far faster than the distance. So in practice, we can modify our cost function to address special objectives. For example, some cost functions ignore outliers better than others.

### Maximum Likelihood Estimation

We want to build a model with parameters $$\hat\theta$$ that maximize the probability of the observed data, i.e. the model that fits the data best. This is **Maximum Likelihood Estimation (MLE)**:

$$
\begin{split}
\hat\theta & = \arg\max_{\theta} \prod^N_{i=1} p(x_i \vert \theta ) \\
\end{split}
$$

However, products of many terms overflow or underflow easily. Since $$\log(x)$$ is monotonic, optimizing $$\log(f(x))$$ is the same as optimizing $$f(x)$$. We add the negative sign to turn the maximization into a minimization. So instead of maximizing the likelihood directly, we take the $$\log$$ and minimize the **negative log likelihood (NLL)**.

$$
\begin{split}
\hat\theta & = \arg\min_{\theta} - \sum^N_{i=1} \log p(x_i \vert \theta ) \\
\end{split}
$$
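
To see why the log matters in practice, consider this sketch with 200 hypothetical samples, each assigned a probability of 0.01 by the model (the numbers are made up for illustration):

```python
import math

probs = [0.01] * 200  # hypothetical per-sample probabilities p(x_i | theta)

# The raw likelihood is 1e-400, below the smallest positive float: it becomes 0.0.
likelihood = math.prod(probs)

# The negative log likelihood is a sum of moderate numbers and stays finite.
nll = -sum(math.log(p) for p in probs)
```

The product collapses to exactly `0.0`, so it can no longer distinguish between models, while the NLL is about 921 and remains usable for optimization.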

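**Minimizing the NLL is equivalent to minimizing the cross entropy.** Here $$p$$ is the empirical distribution of the data and $$q$$ is the model; grouping identical samples introduces a constant factor $$\frac{1}{N}$$, which does not change the $$\arg\min$$: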

$$
\begin{split}
\hat\theta & = \arg\min_{\theta} - \sum^N_{i=1} \log q(x_i \vert \theta ) \\
&  = \arg\min_{\theta} - \sum_{x \in X} p(x) \log q(x \vert \theta ) \\
& = \arg\min_{\theta} H(p, q) \\ 
\end{split}
$$
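
The equivalence can be checked on a toy dataset. Both the observations and the model probabilities below are hypothetical; $$p$$ is taken to be the empirical distribution of the data.

```python
import math
from collections import Counter

data = list("aabccccb")               # hypothetical observations
q = {"a": 0.3, "b": 0.2, "c": 0.5}    # hypothetical model distribution

n = len(data)
nll = -sum(math.log(q[x]) for x in data)

# Empirical distribution p(x): the fraction of samples equal to x.
p = {x: count / n for x, count in Counter(data).items()}
cross_entropy = -sum(p[x] * math.log(q[x]) for x in p)
```

The NLL equals `n * cross_entropy`. Since `n` is a constant, minimizing one minimizes the other.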

#### Putting it together

We want to build a model that fits our data the best. We start with the maximum likelihood estimation (MLE), which we later change to the negative log likelihood to avoid overflow or underflow. Mathematically, the negative log likelihood and the cross entropy have the same equation. KL divergence provides another perspective in optimizing a model. However, even though they use different formulas, they both end up with the same solution.

> Cross entropy is one common objective function in deep learning.

### Mean square error (MSE)

In a regression problem, $$y = f(x; w)$$. In real life, we are dealing with uncertainty and incomplete information. So we may model the problem as:

$$
\hat{y} = f(x; θ) \\
y \sim \mathcal{N}(y;μ=\hat{y}, σ^2) \\
p(y | x; θ) = \frac{1}{\sigma\sqrt{2\pi}} \exp({\frac{-(y - \hat{y})^{2} } {2\sigma^{2}}}) \\
$$

with $$\sigma$$ pre-defined by users.

Maximizing the log likelihood then becomes minimizing the mean square error:

$$
\begin{split}
J &= \sum^m_{i=1} \log p(y^{(i)} | x^{(i)}; θ) \\
& = \sum^m_{i=1}  \log \left( \frac{1}{\sigma\sqrt{2\pi}}  \exp\left({\frac{-(y^{(i)} - \hat{y^{(i)}})^{2} } {2\sigma^{2}}}\right) \right) \\
& = \sum^m_{i=1} - \log(\sigma\sqrt{2\pi}) - \log \exp({\frac{(y^{(i)} - \hat{y^{(i)}})^{2} } {2\sigma^{2}}}) \\
& = \sum^m_{i=1} - \log(\sigma) - \frac{1}{2} \log( 2\pi) - {\frac{(y^{(i)} - \hat{y^{(i)}})^{2} } {2\sigma^{2}}} \\
& =  - m\log(\sigma) - \frac{m}{2} \log( 2\pi) - \sum^m_{i=1} {\frac{(y^{(i)} - \hat{y^{(i)}})^{2} } {2\sigma^{2}}} \\
& =  - m\log(\sigma) - \frac{m}{2} \log( 2\pi) - \sum^m_{i=1} {\frac{ \| y^{(i)} - \hat{y^{(i)}} \|^{2} } {2\sigma^{2}}} \\
\nabla_θ J & = - \nabla_θ \sum^m_{i=1} {\frac{ \| y^{(i)} - \hat{y^{(i)}} \|^{2} } {2\sigma^{2}}} \\ 
\end{split} 
$$

> Many cost functions used in deep learning, including the MSE, can be derived from the MLE.
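
As an illustration, the sketch below fits a hypothetical one-parameter model $$\hat{y} = w x$$ to made-up data by a simple grid search, once by maximizing the Gaussian log likelihood and once by minimizing the MSE. The data, the grid and the model are all assumptions for this example.

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]  # hypothetical noisy samples of y = 2x
sigma = 1.0                # pre-defined noise level

def log_likelihood(w):
    """Gaussian log likelihood of the data under y_hat = w * x."""
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi))
        - (y - w * x) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

grid = [i / 100 for i in range(100, 301)]  # candidate w from 1.00 to 3.00
w_mle = max(grid, key=log_likelihood)
w_mse = min(grid, key=mse)
```

Both searches select the same `w`, because the log likelihood is a constant minus a positive multiple of the squared error.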

### Maximum A Posteriori (MAP)

MLE maximizes $$ p(y \vert x; θ) $$. 

$$
\begin{split}
θ^* = \arg\max_θ P(y \vert x; θ) & = \arg\max_θ \prod^n_{i=1} P( y^{(i)} \vert x^{(i)}; θ)\\
\end{split} 
$$

Alternatively, we can find the most likely $$θ$$ given $$y$$:

$$
θ^{*}_{MAP} = \arg \max_θ p(θ \vert y) = \arg \max_θ \log p(y \vert θ) + \log p(θ) \\
$$

This follows from Bayes' theorem after taking the log and dropping $$\log p(y)$$, which does not depend on $$θ$$:

$$
\begin{split}
θ^{*}_{MAP} & = \arg \max_θ p(θ \vert y) \\
& = \arg \max_θ \log p(y \vert θ) + \log p(θ) - \log p(y)\\
& = \arg \max_θ \log p(y \vert θ) + \log p(θ) \\
\end{split}
$$

To demonstrate the idea, we use a Gaussian distribution with $$ \mu=0, \sigma^2 = \frac{1}{\lambda}$$ as our prior:

$$
\begin{split}
p(θ) & =  \frac{1}{\sqrt{2 \pi \frac{1}{\lambda}}} e^{-\frac{(θ - 0)^{2}}{2\frac{1}{\lambda}} } \\
\log p(θ) & = - \log {\sqrt{2 \pi \frac{1}{\lambda}}} + \log e^{- \frac{\lambda}{2}θ^2}  \\
& = C^{'} - \frac{\lambda}{2}θ^2 \\
- \sum^d_{j=1} \log p(θ_j) &= C + \frac{\lambda}{2} \| θ \|^2 \quad \text{ L-2 regularization}
\end{split}
$$

Assume the likelihood is also Gaussian distributed:

$$
\begin{split}
p(y^{(i)} \vert θ) & \propto e^{ - \frac{(\hat{y}^{(i)} - y^{(i)})^2}{2 \sigma^2} } \\
- \sum^N_{i=1} \log p(y^{(i)} \vert θ) & = \frac{1}{2 \sigma^2} \sum^N_{i=1} \| \hat{y}^{(i)} - y^{(i)} \|^2 + \text{constant} \\
\end{split}
$$

So for a Gaussian prior and a Gaussian likelihood, the cost function is

$$
\begin{split}
J(θ) & = - \sum^N_{i=1} \log p(y^{(i)} \vert θ) - \log p(θ) \\
& = - \sum^N_{i=1} \log p(y^{(i)} \vert θ) - \sum^d_{j=1} \log p(θ_j) \\
&=  \frac{1}{2 \sigma^2} \sum^N_{i=1} \| \hat{y}^{(i)} - y^{(i)} \|^2 + \frac{\lambda}{2} \| θ \|^2 + \text{constant}
\end{split}
$$

which is the same as the MSE with L2-regularization.
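
To make the effect of the prior concrete, here is a sketch for a hypothetical one-parameter model $$\hat{y} = w x$$ on made-up data. Setting the derivative of the cost to zero gives the closed form $$w = \frac{\sum x y}{\sum x^2 + \lambda \sigma^2}$$ for this special case; the data and the values of $$\sigma$$ and $$\lambda$$ are assumptions.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]  # hypothetical noisy samples of y = 2x
sigma = 1.0                # likelihood noise level
lam = 2.0                  # prior precision, i.e. prior variance 1 / lam

def cost(w):
    """Squared error scaled by 1 / (2 sigma^2) plus the L2 penalty."""
    data_term = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    return data_term + lam / 2 * w ** 2

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

w_mle = sxy / sxx                        # no prior
w_map = sxy / (sxx + lam * sigma ** 2)   # Gaussian prior centered at 0
```

The prior pulls the MAP estimate toward zero compared to the MLE estimate, which is exactly what L2 regularization does.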

If the likelihood is computed from a logistic function (with labels $$y_i \in \{-1, +1\}$$), the corresponding cost function is:

$$
\begin{split}
p(y_i \vert x_i, w) & = \frac{1}{ 1 + e^{- y_i w^T x_i} } \\
J(w) & = - \sum^N_{i=1} \log p(y_i \vert x_i, w) - \sum^d_{j=1} \log p(w_j) - C \\
&= \sum^N_{i=1} \log(1 + e^{- y_i w^T x_i})  + \frac{\lambda}{2} \| w \|^2 + constant
\end{split}
$$
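
A minimal sketch of this cost, dropping the constant. The function name, the toy data and the value of $$\lambda$$ are our own assumptions, and labels are assumed to be -1 or +1.

```python
import math

def logistic_cost(w, xs, ys, lam):
    """Logistic loss summed over samples plus (lam / 2) * ||w||^2."""
    data_term = sum(
        # log1p(z) computes log(1 + z) accurately when z is small
        math.log1p(math.exp(-y * sum(wj * xj for wj, xj in zip(w, x))))
        for x, y in zip(xs, ys)
    )
    return data_term + lam / 2 * sum(wj * wj for wj in w)

xs = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.5), (-2.0, -0.5)]  # hypothetical inputs
ys = [1, 1, -1, -1]                                        # labels in {-1, +1}

cost_zero = logistic_cost((0.0, 0.0), xs, ys, lam=0.1)
cost_good = logistic_cost((1.0, 1.0), xs, ys, lam=0.1)
```

With all weights at zero, every sample contributes `log(2)`. Weights that separate the two classes lower the data term, while the penalty keeps them from growing without bound.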

Like MLE, MAP provides a mechanism to derive the cost function. However, MAP can also model the uncertainty of the parameters through the prior, which turns out to be the regularization term used in deep learning.

### Nash Equilibrium

In game theory, a Nash Equilibrium is reached when no player would change their strategy after considering all possible strategies of the opponents. That is, in a Nash equilibrium, no one would change their decision even after we reveal every player's strategy to everyone. A game can have 0, 1 or multiple Nash Equilibria.

#### The Prisoner's Dilemma

In the prisoner's dilemma, the police arrest 2 suspects but only have evidence to charge them with a lesser crime carrying 1 month of jail time. But if one of them confesses, the other will receive 12 months of jail time and the one who confesses will be released. Yet, if both confess, both will receive 6 months of jail time. The first value in each cell is the jail time Mary gets for each combination of decisions, while the second value is what Peter gets.

<div class="imgcap">
<img src="images/nash.png" style="border:none;width:80%">
</div>

For Mary, if she thinks Peter will keep quiet, her best strategy is to confess and receive no jail time instead of 1 month.

<div class="imgcap">
<img src="images/nash2.png" style="border:none;width:80%">
</div>

On the other hand, if she thinks Peter will confess, her best strategy is also to confess and get 6 months of jail time instead of 12.
<div class="imgcap">
<img src="images/nash3.png" style="border:none;width:80%">
</div>

After considering all possible actions, in either case, Mary's best action is to confess. Similarly, Peter should also confess. Therefore (-6, -6) is the Nash Equilibrium even though (-1, -1) has the least combined jail time. Why is (-1, -1) not a Nash Equilibrium? Because if Mary knows Peter will keep quiet, she can switch to confessing and get a lesser sentence, to which Peter will respond by confessing as well. (This assumes that Peter and Mary cannot coordinate their strategies.)
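
The same reasoning can be automated by checking every combination of actions for a profitable unilateral switch. The payoff table below is reconstructed from the jail times given in the text (as negative months); the function and variable names are our own.

```python
from itertools import product

ACTIONS = ("quiet", "confess")

# PAYOFFS[(mary, peter)] = (Mary's payoff, Peter's payoff), in negative months.
PAYOFFS = {
    ("quiet", "quiet"): (-1, -1),
    ("quiet", "confess"): (-12, 0),
    ("confess", "quiet"): (0, -12),
    ("confess", "confess"): (-6, -6),
}

def is_nash(mary, peter):
    """True if neither player gains by changing only their own action."""
    mary_pay, peter_pay = PAYOFFS[(mary, peter)]
    mary_ok = all(PAYOFFS[(a, peter)][0] <= mary_pay for a in ACTIONS)
    peter_ok = all(PAYOFFS[(mary, a)][1] <= peter_pay for a in ACTIONS)
    return mary_ok and peter_ok

equilibria = [pair for pair in product(ACTIONS, repeat=2) if is_nash(*pair)]
```

The only combination that survives is both confessing. Both keeping quiet fails because either player can improve from -1 to 0 by switching to confess.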

### Jensen-Shannon Divergence

It measures how distinguishable two or more distributions are from each other.

$$
JSD(X || Y) = H(\frac{X + Y}{2}) - \frac{H(X) + H(Y)}{2}
$$
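
A small sketch of the formula for discrete distributions, assuming $$H$$ is the Shannon entropy with a base-2 logarithm. The helper names and the example distributions are our own.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(x * math.log2(1 / x) for x in p if x > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: entropy of the mixture minus mean entropy."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2

same = jsd((0.5, 0.5), (0.5, 0.5))      # identical distributions
disjoint = jsd((1.0, 0.0), (0.0, 1.0))  # no overlap at all
skewed = jsd((0.9, 0.1), (0.5, 0.5))
```

With base-2 logarithms the value ranges from 0 for identical distributions to 1 for distributions with no overlap. Unlike the KL divergence, it is symmetric in its two arguments.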
