%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});


## Exercise 2.15 MLE minimizes KL divergence to the empirical distribution
Let $p_{emp}(x)$ be the empirical distribution, and let $q(x|\theta)$ be some model. Show that $\mathrm{argmin}_q\mathbb{KL}(p_{emp}||q)$
is obtained by $q(x) = q(x; \theta)$, where $\theta$ is the MLE. Hint: use non-negativity of the KL divergence.

### Notes
We will prove that the MLE minimises the KL divergence between the empirical distribution and a parametric model. What does this mean? Suppose we are given some data $\mathcal{D}$ generated by an unknown distribution and are asked to estimate that distribution from the data.

In this situation, there are two approaches we could take: the **nonparametric** and the **parametric**. In the nonparametric approach, we estimate the distribution directly from the data at hand. One such estimate is the empirical distribution $p_{emp}(x)$, defined as

$$
p_{emp}(x) = \sum_{i=1}^N w_i\delta_{x_i}(x)
$$

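where we require $0\le w_i \le 1$ and $\sum_{i=1}^N w_i = 1$, and each $\delta$ term is a point mass at the $i$-th observation $x_i$. With uniform weights $w_i = 1/N$, every observation contributes equally.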
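
As a minimal sketch of this definition for discrete data, assuming uniform weights $w_i = 1/N$: the helper `empirical_distribution` and the toy dataset below are hypothetical and only serve as an illustration.

```python
from collections import Counter

def empirical_distribution(data):
    """Return {value: N_i / N}, assuming uniform weights w_i = 1/N."""
    n = len(data)
    counts = Counter(data)  # N_i: number of occurrences of each distinct value
    return {value: count / n for value, count in counts.items()}

# Hypothetical toy dataset of discrete observations
data = [0, 1, 1, 2, 1, 0]
p_emp = empirical_distribution(data)
print(p_emp)
```

Each distinct value receives mass equal to its relative frequency, so the masses sum to one.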

In the parametric approach, we make some reasonable assumptions about the structure of the data and build a parametric model $q(x|\theta)$ for it. The model encodes our prior knowledge about the process behind the data and dictates the general shape of the distribution (e.g. Gaussian). One of the most popular ways of estimating the parameters is the Maximum Likelihood Estimator (MLE). The MLE is the parameter value $\theta^*$ that solves the following optimization problem:

\begin{equation}
\theta^* = \mathrm{argmax}_\theta\prod_{x\in\mathcal{D}}q(x|\theta)
\end{equation}

The product in (1) is called the likelihood of the dataset and it is a function of $\theta$. It is usually more convenient to optimize the log-likelihood instead. Since the logarithm is strictly increasing, both problems have the same maximizer:

$$
\theta^* = \mathrm{argmax}_{\theta}\prod_{x\in\mathcal{D}}q(x|\theta) = \mathrm{argmax}_\theta\sum_{x\in\mathcal{D}}\log q(x|\theta)
$$
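
The equivalence between maximizing the likelihood and the log-likelihood can be checked numerically. The sketch below assumes a Bernoulli model $q(x|\theta) = \theta^x(1-\theta)^{1-x}$, a hypothetical set of coin flips, and a simple grid search over $\theta$; none of these choices come from the exercise itself.

```python
import numpy as np

# Hypothetical coin flips: 7 ones out of 10
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
thetas = np.linspace(0.01, 0.99, 99)  # candidate parameter values

# Bernoulli model: q(x | theta) is theta if x == 1, else 1 - theta.
# probs[n, k] is the probability of observation n under thetas[k].
probs = np.where(data[:, None] == 1, thetas[None, :], 1.0 - thetas[None, :])

likelihood = probs.prod(axis=0)             # product over the dataset
log_likelihood = np.log(probs).sum(axis=0)  # sum of logs over the dataset

theta_lik = thetas[np.argmax(likelihood)]
theta_loglik = thetas[np.argmax(log_likelihood)]
print(theta_lik, theta_loglik, data.mean())
```

On this grid both objectives select the same $\theta$, which matches the sample mean, the known closed-form MLE for a Bernoulli model.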


Given these two different approaches, the exercise asks us to build a bridge between them. We will find the parametric model $q(x|\theta)$ that is closest to the empirical distribution $p_{emp}(x)$. A natural way to do this is to measure the dissimilarity between the two distributions with the KL divergence and minimize it.

### Solution

\begin{equation}
\mathbb{KL}(p_{emp}||q) = \sum_x p_{emp}(x)\log\left(\frac{p_{emp}(x)}{q(x|\theta)}\right) = \sum_x p_{emp}(x)\log p_{emp}(x) - \sum_x p_{emp}(x)\log q(x|\theta)
\end{equation}

Since $p_{emp}$ is fixed given the dataset $\mathcal{D}$, the first term on the right-hand side of (2) does not depend on $\theta$, and we only need to focus on the second term.
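
The decomposition can be verified on a small example. The probabilities below are hypothetical values chosen for illustration; the point is only that the KL divergence splits into a term independent of $q$ (the negative entropy of $p_{emp}$) and a term that depends on $q$ (the cross-entropy).

```python
import numpy as np

p_emp = np.array([0.5, 0.3, 0.2])  # hypothetical empirical frequencies
q = np.array([0.4, 0.4, 0.2])      # hypothetical model probabilities

kl = np.sum(p_emp * np.log(p_emp / q))

# First term: depends only on p_emp, so it is constant in theta
neg_entropy = np.sum(p_emp * np.log(p_emp))
# Second term (including its minus sign): the only part that depends on q
cross_entropy = -np.sum(p_emp * np.log(q))

print(kl, neg_entropy + cross_entropy)
```

Minimizing the KL divergence over the model is therefore the same as minimizing the cross-entropy term alone.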

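\begin{aligned}
\mathrm{argmin}_\theta\mathbb{KL}(p_{emp}||q(\theta)) & = \mathrm{argmin}_\theta\left(-\sum_x p_{emp}(x)\log q(x|\theta)\right) \\
& = \mathrm{argmax}_\theta\sum_x \log\left(q(x|\theta)^{p_{emp}(x)}\right) \\
& = \mathrm{argmax}_\theta\log\left(\prod_{x} q(x|\theta)^{p_{emp}(x)}\right)
\end{aligned}

Here the sums and products run over the distinct values $x$ that appear in the dataset, i.e. those with $p_{emp}(x) > 0$.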

For the final step, note that with uniform weights $w_i = 1/N$ the empirical distribution assigns $p_{emp}(x_i) = N_i/N$ to each distinct value $x_i$, where $N_i$ is the number of occurrences of $x_i$ in the dataset and $N$ is the size of the dataset.

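Thus, with the products running over the distinct values $x_i$:

\begin{aligned}
\mathrm{argmin}_\theta\mathbb{KL}(p_{emp}||q(\theta)) & = \mathrm{argmax}_\theta \log\left(\prod_{i}q(x_i|\theta)^{N_i/N}\right) \\
& = \mathrm{argmax}_\theta\frac{1}{N}\log\left(\prod_{i}q(x_i|\theta)^{N_i}\right) \\
& = \mathrm{argmax}_\theta\frac{1}{N}\sum_{x\in\mathcal{D}}\log q(x|\theta)
\end{aligned}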

This is the log-likelihood multiplied by the positive constant $1/N$, which does not change the maximizer. Therefore the solution to this problem is the MLE parameter $\theta^*$.
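
The result can be sanity-checked numerically. The sketch below again assumes a Bernoulli model, a hypothetical dataset, and a grid search over $\theta$; it compares the minimizer of the KL divergence with the maximizer of the average log-likelihood.

```python
import numpy as np

# Hypothetical coin flips: 7 ones out of 10
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
thetas = np.linspace(0.01, 0.99, 99)  # candidate parameter values

# Empirical distribution over the values {0, 1}: p_emp(x_i) = N_i / N
p1 = data.mean()
p_emp = np.array([1.0 - p1, p1])

# Model probabilities for every candidate theta: q[k] = (q(0), q(1))
q = np.stack([1.0 - thetas, thetas], axis=1)

# KL(p_emp || q(theta)) for every theta
kl = np.sum(p_emp * np.log(p_emp / q), axis=1)

# Average log-likelihood (1/N) * sum_n log q(x_n | theta) for every theta
probs = np.where(data[:, None] == 1, thetas[None, :], 1.0 - thetas[None, :])
avg_loglik = np.log(probs).mean(axis=0)

theta_kl = thetas[np.argmin(kl)]
theta_mle = thetas[np.argmax(avg_loglik)]
print(theta_kl, theta_mle)
```

The two curves differ only by the constant $\sum p_{emp}\log p_{emp}$ and a sign, so on this grid they are optimized at the same $\theta$.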

### Conclusion
In this exercise, we saw that, among all the models considered $(\theta\in\Theta)$, the one given by the MLE is the closest to the empirical distribution in KL divergence. This makes sense: as the name suggests, the maximum likelihood estimator is the parameter value under which the observed dataset is most probable. Its optimization therefore depends only on the data and on the assumed shape of the distribution (e.g. Gaussian), which is fixed by the model. Unlike Bayesian approaches, it does not use a prior distribution over the parameters. Since both the MLE and $p_{emp}$ are determined so directly by the observed data, it is natural that they are closely related to each other.