In [3]:
# scientific
import numpy as np

# plotting
%matplotlib inline
import matplotlib as mpl
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# font size
mpl.rcParams.update({"font.size": 14})

# python
import random

$$ \LaTeX \text{ macros here }
\newcommand{\X}{\mathcal{X}}
\newcommand{\D}{\mathcal{D}}
\newcommand{\Z}{\mathcal{Z}}
\newcommand{\L}{\mathcal{L}}
$$

# Introduction

In machine learning, **Expectation-Maximization (EM)** is an iterative method for computing maximum likelihood estimates of the parameters in a statistical model, for data in which some variables are unobserved.

This is my attempt to write *the tutorial I wish I had read* when I first started learning about Expectation-Maximization.  The algorithm appears throughout machine learning wherever models have latent variables, so it pays not only to have an intuitive picture of what it does, but also to master the mathematical details needed to derive procedures for new models.

Below, I have tried to derive Expectation-Maximization as clearly and cleanly as possible, neglecting neither intuition nor rigor along the way.

# Background

The derivation of Expectation-Maximization rests on the notions of *convexity* and *information*.  Let us briefly review these topics.

## Convexity

## Information Theory

# Problem Setting

Suppose we observe data $\X$ generated from a model $p$ with parameters $\theta$ in the presence of hidden variables $Z$.  As usual, we wish to compute the maximum likelihood estimate

$$
\hat\theta_{ML}
= \arg\max_\theta \ell(\theta|\X)
= \arg\max_\theta \ln p(\X|\theta)
$$
    
of the parameters given our observed data.  In some cases, we may also seek to *infer* the values $\Z$ of the hidden variables $Z$.

# Evidence Lower Bound

The data log-likelihood $\ell(\theta|\X) = \ln p(\X|\theta)$ of the parameters given the observed data is useful for both inference and parameter estimation.  Working directly with this quantity is often difficult in latent variable models, and so we must resort to other methods.

Our general approach will be to reason about the hidden variables through a proxy distribution $q(z)$, which we use to compute a lower bound on the log-likelihood.  This section is devoted to deriving one such bound, called the **Evidence Lower Bound (ELBO)**.

## Deriving the ELBO

We can expand the data log-likelihood by marginalizing over the hidden variables:

$$
\ell(\theta|\X)
= \ln p(\X|\theta)
= \ln \sum_z p(\X,z|\theta)
$$
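As a minimal sketch of this marginalization, consider a hypothetical two-coin mixture (an assumption made purely for illustration, not a model defined above): the hidden variable $z$ says which coin was flipped, and the observation is the number of heads in a fixed number of flips. The function names and parameter values below are invented for the example. The sum over $z$ is carried out in log space for numerical stability.

```python
import numpy as np
from math import comb

def log_joint(heads, flips, weights, biases):
    """Return ln p(x, z | theta) for every value z of the hidden coin identity."""
    weights = np.asarray(weights, dtype=float)  # p(z | theta), assumed mixing weights
    biases = np.asarray(biases, dtype=float)    # p(heads | z, theta), assumed coin biases
    log_binom = np.log(comb(flips, heads))
    log_lik = log_binom + heads * np.log(biases) + (flips - heads) * np.log1p(-biases)
    return np.log(weights) + log_lik

def log_likelihood(heads, flips, weights, biases):
    """ln p(x | theta) = ln sum_z p(x, z | theta), computed as a log-sum-exp."""
    return np.logaddexp.reduce(log_joint(heads, flips, weights, biases))

# Hypothetical observation: 7 heads in 10 flips from one of two coins.
ll = log_likelihood(heads=7, flips=10, weights=[0.5, 0.5], biases=[0.6, 0.4])
print(f"ln p(x | theta) = {ll:.4f}")
```

With only two hidden states the sum is trivial. The difficulty described above arises when the number of joint configurations of the hidden variables grows too large to enumerate.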

Because the logarithm is concave, Jensen's inequality gives the following bound for any distribution $q(z)$ over the hidden variables (with $q(z) > 0$ wherever $p(\X, z | \theta) > 0$):

$$\begin{align*}
\ell(\theta | \X)
&=   \ln \sum_z p(\X,z | \theta) \\
&=   \ln \sum_z q(z) \frac{p(\X, z | \theta)}{q(z)} \\
&\geq \sum_z q(z) \ln \frac{p(\X,z | \theta)}{q(z)}
\equiv \L(q,\theta)
\end{align*}$$
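The bound can be checked numerically. The sketch below assumes a hypothetical hidden variable with three states and made-up joint probabilities $p(\X, z|\theta)$; it draws many random distributions $q$ and confirms that the ELBO never exceeds the log-likelihood.

```python
import numpy as np

def elbo(q, joint):
    """L(q, theta) = sum_z q(z) ln(p(x, z | theta) / q(z)), using 0 ln 0 = 0."""
    q = np.asarray(q, dtype=float)
    joint = np.asarray(joint, dtype=float)
    support = q > 0  # terms with q(z) = 0 contribute nothing
    return float(np.sum(q[support] * (np.log(joint[support]) - np.log(q[support]))))

# Hypothetical values of p(x, z | theta) for z = 0, 1, 2 at the observed x.
joint = np.array([0.10, 0.25, 0.05])
log_lik = np.log(joint.sum())  # ln p(x | theta)

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    q = rng.dirichlet(np.ones(len(joint)))  # a random distribution over z
    gaps.append(log_lik - elbo(q, joint))

min_gap = min(gaps)
print(f"smallest gap between ln p(x | theta) and the ELBO: {min_gap:.6f}")
```

The gap is nonnegative for every sampled $q$, as Jensen's inequality requires.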

This lower bound $\L(q, \theta)$ is called the **Evidence Lower Bound (ELBO)** and can be rewritten as follows:

$$
\begin{align*}
\ell(\theta | \X)
\geq \L(q, \theta)
&= \sum_z q(z) \ln \frac{p(\X,z | \theta)}{q(z)} \\
&= \sum_z q(z) \ln p(\X, z | \theta) - \sum_z q(z) \ln q(z) \\
&= E_q[ \ln p(\X, Z | \theta) ] - E_q[ \ln q(Z) ] \\
&= E_q[ \ln p(\X, Z | \theta) ] + H(q)
\end{align*}
$$
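A quick numerical check of this rewriting, assuming the same kind of hypothetical three-state joint and an arbitrary choice of $q$: the ratio form of the ELBO should agree with the expected log-joint plus the entropy of $q$.

```python
import numpy as np

# Hypothetical p(x, z | theta) for z = 0, 1, 2 and an arbitrary proxy distribution q.
joint = np.array([0.10, 0.25, 0.05])
q = np.array([0.2, 0.5, 0.3])

# First line of the derivation: sum_z q(z) ln(p(x, z | theta) / q(z)).
elbo_ratio = np.sum(q * np.log(joint / q))

# Last line of the derivation: E_q[ln p(x, Z | theta)] + H(q).
expected_log_joint = np.sum(q * np.log(joint))
entropy = -np.sum(q * np.log(q))
elbo_split = expected_log_joint + entropy

print(np.isclose(elbo_ratio, elbo_split))
```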

## Relationship to Relative Entropy

The first term in the last line above closely resembles a (negative) cross entropy between $q(Z)$ and the joint distribution $p(X,Z|\theta)$ of the observed and hidden variables.  However, the variables $X$ are fixed to our observations $X = \X$, and so $p(\X, Z| \theta)$ is an *unnormalized* distribution over $Z$ (in general, $\sum_z p(\X, z|\theta) = p(\X|\theta) \neq 1$).  This does not set us back too far; in fact, the lower bound $\mathcal{L}(q,\theta)$ differs from a negative Kullback-Leibler divergence only by a term that is constant with respect to $q$:

$$\begin{align*}
D_{KL}(q || p(Z|\X,\theta))
&= H(q, p(Z|\X,\theta)) - H(q) \\
&= E_q[ -\ln p(Z|\X,\theta) ] - H(q) \\
&= E_q[ -\ln p(\X,Z|\theta) ] - E_q[ -\ln p(\X|\theta) ] - H(q) \\
&= E_q[ -\ln p(\X,Z|\theta) ] + \ln p(\X|\theta) - H(q) \\
&= -\mathcal{L}(q,\theta) + \underbrace{\ln p(\X|\theta)}_{\mathrm{const.}}
\end{align*}$$

This yields a second proof of the evidence lower bound, following from the nonnegativity of relative entropy.  In fact, this is the proof given in **[tzikas2008:variational]** and **[murphy:mlapp]**.

$$\begin{equation*}
\ln p(\X | \theta)
= D_{KL}(q || p(Z|\X,\theta)) + \mathcal{L}(q,\theta)
\geq \mathcal{L}(q,\theta)
\end{equation*}$$
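The decomposition of the log-likelihood into a relative-entropy term and the ELBO can also be verified numerically. The sketch below again assumes a hypothetical three-state hidden variable with made-up joint probabilities. It checks that the gap between the log-likelihood and the ELBO equals the Kullback-Leibler divergence from $q$ to the posterior $p(Z|\X,\theta)$, and that the bound is tight when $q$ is the posterior itself.

```python
import numpy as np

def elbo(q, joint):
    """L(q, theta) = sum_z q(z) ln(p(x, z | theta) / q(z)), using 0 ln 0 = 0."""
    q = np.asarray(q, dtype=float)
    joint = np.asarray(joint, dtype=float)
    support = q > 0
    return float(np.sum(q[support] * (np.log(joint[support]) - np.log(q[support]))))

def kl_divergence(q, p):
    """D_KL(q || p) = sum_z q(z) ln(q(z) / p(z)) for discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

# Hypothetical p(x, z | theta) for z = 0, 1, 2 at the observed x.
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()        # p(x | theta)
posterior = joint / evidence  # p(z | x, theta)
log_lik = np.log(evidence)

q = np.array([0.2, 0.5, 0.3])  # an arbitrary proxy distribution
gap = log_lik - elbo(q, joint)

print(np.isclose(gap, kl_divergence(q, posterior)))   # gap equals the KL term
print(np.isclose(elbo(posterior, joint), log_lik))    # bound is tight at the posterior
```

Since the relative entropy vanishes only when the two distributions agree, choosing $q$ equal to the posterior is the one choice that closes the gap for fixed $\theta$.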

# Example:  Coin Flips

# References

- **[Do & Batzoglou 2008]** Do, Chuong B. and Serafim Batzoglou. _What is the Expectation Maximization Algorithm?_ Nature Biotechnology 26.8, August 2008.
- **[Ruzzo]** Ruzzo, Larry.  _The Expectation Maximization Algorithm_. Lecture slides.