# <font color=darkcyan>  Expectation Maximization for latent data models </font>

To estimate unknown parameters $\theta\in\mathbb{R}^m$ of a model with missing data, the Expectation Maximization (EM) algorithm (Dempster et al. 1977) can be used when the joint distribution of the missing data $Z$ and the observed data $X$ is explicit. For all $\theta\in\mathbb{R}^m$, let $p_{\theta}$ be the probability density function of $(Z,X)$, with respect to a given reference measure $\mu$, when the model is parameterized by $\theta$. The EM algorithm iteratively computes an approximation of the maximum likelihood estimator, which maximizes the observed data loglikelihood:

$$
\ell(\theta;X) = \log p_{\theta}(X) =\log \int p_{\theta}(z,X)\mu(\mathrm{d}z)\,.
$$

As this quantity cannot be computed explicitly in general, the EM algorithm approximates the maximum likelihood estimator by iteratively maximizing the expected complete data loglikelihood.
 
Start with an initial value $\theta^{(0)}$ and let $\theta^{(t)}$ be the estimate at the $t$-th iteration for $t\geqslant 0$. The next iteration of EM is decomposed into two steps.

1. E-step. Compute the expectation of the complete data loglikelihood, with respect to the conditional distribution of the missing data given the observed data parameterized by $\theta^{(t)}$:

$$
Q(\theta,\theta^{(t)}) =\mathbb{E}_{\theta^{(t)}}\left[\log p_{\theta}(Z,X)|X \right]\,.
$$

2. M-step. Determine $\theta^{(t+1)}$ by maximizing the function $Q$:

$$
\theta^{(t+1)}\in \mbox{argmax}_\theta Q(\theta,\theta^{(t)})\,.
$$


#### Question 1 
Prove the following crucial property, which motivates the EM algorithm.  
For all $\theta,\theta^{(t)}$,
    
$$
\ell(\theta;X) - \ell(\theta^{(t)};X) \geqslant Q(\theta,\theta^{(t)})-Q(\theta^{(t)},\theta^{(t)})\,.
$$
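
One possible route to this inequality is sketched here, as a hint rather than a full proof. The function $H$ below is notation introduced only for this sketch; it does not appear elsewhere in the notebook. Since $p_\theta(z|X) = p_\theta(z,X)/p_\theta(X)$, taking the conditional expectation given $X$ under $\theta^{(t)}$ of $\log p_\theta(X) = \log p_\theta(Z,X) - \log p_\theta(Z|X)$ gives

$$
\ell(\theta;X) = Q(\theta,\theta^{(t)}) + H(\theta,\theta^{(t)})\,,\qquad H(\theta,\theta^{(t)}) = -\mathbb{E}_{\theta^{(t)}}\left[\log p_{\theta}(Z|X)|X \right]\,.
$$

Subtracting the same identity written at $\theta = \theta^{(t)}$, it remains to check that $H(\theta,\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)}) \geqslant 0$. This difference is a Kullback-Leibler divergence, which is nonnegative by Jensen's inequality:

$$
H(\theta,\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\left[\log \frac{p_{\theta^{(t)}}(Z|X)}{p_{\theta}(Z|X)}\middle|X \right] \geqslant -\log \mathbb{E}_{\theta^{(t)}}\left[\frac{p_{\theta}(Z|X)}{p_{\theta^{(t)}}(Z|X)}\middle|X \right] \geqslant 0\,.
$$

In particular, choosing $\theta = \theta^{(t+1)}$ shows that an EM iteration cannot decrease the observed data loglikelihood.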

In the following, $Z = (Z_1,\ldots,Z_n)$ and $X = (X_1,\ldots,X_n)$ where $\{(Z_i,X_i)\}_{1\leqslant i\leqslant n}$ are i.i.d. in $\{-1,1\} \times \mathbb{R}^d$. For $k\in\{-1,1\}$, write $\pi_k = \mathbb{P}_\theta(Z_1 = k)$. Assume that, conditionally on the event $\{Z_1 = k\}$, $X_1$ has a Gaussian distribution with mean $\mu_k \in\mathbb{R}^d$ and covariance matrix $\Sigma\in \mathbb{R}^{d\times d}$, the same for both values of $k$. In this case, the parameter $\theta=(\pi_1, \mu_1,\mu_{-1}, \Sigma)$ belongs to the set $\Theta= [0,1] \times \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^{d \times d}$. For all $t\geqslant 0$, the parameter $\theta^{(t)}$ is written
$$
\theta^{(t)} = \left(\pi^{(t)}_1, \mu^{(t)}_1,\mu^{(t)}_{-1}, \Sigma^{(t)}\right)\,.
$$

#### Question 2
- Write the complete data loglikelihood.
- Let $\theta^{(t)}$ be the current parameter estimate. Compute $\theta\mapsto Q(\theta,\theta^{(t)})$.
- Compute $\theta^{(t+1)} = \mathrm{argmax}_\theta Q(\theta,\theta^{(t)})$. 
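
As a guide for checking the computations, the expected results are sketched here under the model above (shared covariance matrix $\Sigma$, assumed invertible). The notations $\omega_t^i(k)$ and $\varphi$ are introduced for this sketch only: $\varphi(\cdot\,;\mu,\Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance $\Sigma$, and $\omega_t^i(k) = \mathbb{P}_{\theta^{(t)}}(Z_i=k|X_i)$. In $\log(2\pi)$, $\pi$ is the usual constant and not a mixture weight. The complete data loglikelihood is

$$
\log p_{\theta}(Z,X) = \sum_{i=1}^n\sum_{k\in\{-1,1\}} \mathbb{1}_{Z_i=k}\left[\log \pi_k - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(X_i-\mu_k)^\top\Sigma^{-1}(X_i-\mu_k)\right]\,,
$$

and $Q(\theta,\theta^{(t)})$ is obtained by replacing each indicator $\mathbb{1}_{Z_i=k}$ by its conditional expectation

$$
\omega_t^i(k) = \frac{\pi_k^{(t)}\varphi(X_i;\mu_k^{(t)},\Sigma^{(t)})}{\sum_{j\in\{-1,1\}}\pi_j^{(t)}\varphi(X_i;\mu_j^{(t)},\Sigma^{(t)})}\,.
$$

Setting the gradient of $\theta\mapsto Q(\theta,\theta^{(t)})$ to zero, with the constraint $\pi_1+\pi_{-1}=1$, should then lead to

$$
\pi_k^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \omega_t^i(k)\,,\qquad \mu_k^{(t+1)} = \frac{\sum_{i=1}^n \omega_t^i(k)X_i}{\sum_{i=1}^n \omega_t^i(k)}\,,\qquad \Sigma^{(t+1)} = \frac{1}{n}\sum_{i=1}^n\sum_{k\in\{-1,1\}}\omega_t^i(k)(X_i-\mu_k^{(t+1)})(X_i-\mu_k^{(t+1)})^\top\,.
$$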

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal

#### <font color=darkorange>  Simulated data</font>

#### Question 3
- Sample data from a mixture of two Gaussian distributions and display the associated histogram
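
A minimal sketch of one way to sample such data is given here. It makes two assumptions that are not stated in the notebook: `sigma1` and `sigma2` are read as variances (to match the `variance` argument of `Gaussian_pdf`), and the random generator is seeded with an arbitrary value for reproducibility. The histogram is computed with `np.histogram` so that the block runs without a display.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption), for reproducibility

n_samples = 100
mu1, sigma1 = -2, 1.5  # sigma1 and sigma2 are treated as variances (assumption)
mu2, sigma2 = 3, 1
pi1 = 0.3

# latent labels: True (first component) with probability pi1
z = rng.random(n_samples) < pi1

# each observation is drawn from the component selected by its label
x = np.where(
    z,
    rng.normal(mu1, np.sqrt(sigma1), n_samples),
    rng.normal(mu2, np.sqrt(sigma2), n_samples),
)

# normalized histogram: bar areas sum to 1, comparable to a density
counts, edges = np.histogram(x, bins=20, density=True)
```

To display the histogram, `plt.hist(x, bins=20, density=True)` followed by `plt.show()` can be used with the `matplotlib.pyplot` import of the notebook.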

In [None]:
# number of observations
n_samples = 100

# means and variances to be estimated
mu1, sigma1 = -2, 1.5
mu2, sigma2 = 3, 1

# prior probability of the first cluster/group
pi1 = 0.3

# To be completed

In [None]:
def Gaussian_pdf(x, mean, variance):
    """Density of the univariate Gaussian N(mean, variance) evaluated at x."""
    z = np.exp(-(x - mean)**2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)
    return z

In [None]:
# visualize the training data

# To be completed

#### <font color=darkorange>  EM algorithm</font>

In [None]:
# plot estimated density - plot the density of a mixture of two Gaussian distributions

# To be completed

#### Question 5
 - Compute $\omega_t^i = \mathbb{P}_{\theta^{(t)}}(Z_i=1|X_i)$.
 - Write the loop of the EM algorithm.
 - Run the algorithm and display the loglikelihood and the estimates along iterations.
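
The loop can be sketched as follows for univariate data. This is one possible implementation, not a reference solution. It follows the parametrization of the code cells (one variance per component) rather than the shared covariance of the model above. The function name `em_two_gaussians`, the seed and the way the data are simulated are assumptions made for this example; the initial values are those of the cells below.

```python
import numpy as np

def gaussian_pdf(x, mean, variance):
    """Density of N(mean, variance) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

def em_two_gaussians(x, pi1, mu1, mu2, var1, var2, n_it=100):
    """Run n_it EM iterations; return the final estimates and the loglikelihood trace."""
    loglik = np.zeros(n_it)
    for t in range(n_it):
        # E-step: w[i] is the posterior probability that x[i] comes from component 1
        p1 = pi1 * gaussian_pdf(x, mu1, var1)
        p2 = (1 - pi1) * gaussian_pdf(x, mu2, var2)
        loglik[t] = np.sum(np.log(p1 + p2))  # loglikelihood at the current estimate
        w = p1 / (p1 + p2)
        # M-step: weighted empirical proportions, means and variances
        pi1 = np.mean(w)
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
        var1 = np.sum(w * (x - mu1) ** 2) / np.sum(w)
        var2 = np.sum((1 - w) * (x - mu2) ** 2) / np.sum(1 - w)
    return (pi1, mu1, mu2, var1, var2), loglik

# simulated data: second argument of rng.normal is a standard deviation
rng = np.random.default_rng(0)  # arbitrary seed (assumption)
n = 100
z = rng.random(n) < 0.3
x = np.where(z, rng.normal(-2, np.sqrt(1.5), n), rng.normal(3, 1.0, n))

theta_hat, loglik = em_two_gaussians(
    x, pi1=0.5, mu1=0.0, mu2=1.0, var1=0.5, var2=2.0, n_it=100
)
```

By the property of Question 1, the values stored in `loglik` should be nondecreasing along iterations, which gives a simple sanity check of the implementation. Storing the estimates at each iteration, as in the arrays `pi1_est`, `mu1_est`, ... below, only requires writing them at index `t + 1` after the M-step.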


In [None]:
n_clust = 2
weights = np.ones((n_clust)) / n_clust
means = [0, 1]
variances = [0.5, 2]

In [None]:
# number of iterations of the EM algorithm
n_it = 100

In [None]:
pi1_est = np.zeros(n_it+1)
mu1_est = np.zeros(n_it+1)
mu2_est = np.zeros(n_it+1)
sigma1_est = np.zeros(n_it+1)
sigma2_est = np.zeros(n_it+1)

pi1_est[0] = weights[0]
mu1_est[0] = means[0]
mu2_est[0] = means[1]
sigma1_est[0] = variances[0]
sigma2_est[0] = variances[1]

In [None]:
# To be completed

In [None]:
# plot loglikelihood along iterations

In [None]:
# plot parameters along iterations

#### Question 6

- Explore the sensitivity of the algorithm with respect to i) the initial estimates, ii) the number of iterations, iii) the number of observations and iv) the number of clusters. 

In [1]:
# To be completed