# Model-Based Statistical Learning

***class 5***

<hr>

## 1 - Gaussian Mixture Model (GMM)

### Review

$$p(x;\theta) = \sum_{k=1}^K\pi_k\,\phi(x; \mu_k, \Sigma_k)$$
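
As a minimal sketch of the density above, the following evaluates a mixture in one dimension, where each $\Sigma_k$ reduces to a variance $\sigma_k^2$. The function names and the parameter values are assumptions chosen for illustration only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, weights, means, sigmas):
    """Mixture density: sum over k of pi_k * phi(x; mu_k, sigma_k)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

# Hypothetical two-component mixture (values are illustrative assumptions).
weights = [0.3, 0.7]   # pi_k, must sum to 1
means = [-2.0, 1.0]    # mu_k
sigmas = [0.5, 1.5]    # standard deviations sigma_k

density_at_zero = gmm_pdf(0.0, weights, means, sigmas)
```

Because the weights sum to 1 and each component integrates to 1, the mixture is itself a valid density.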

The GMM is probably the most popular mixture model for two main reasons:

1. **wrong reason**: it is easy to implement and its computation is the simplest
2. **right reason**: even though it is a simple model, it is flexible enough to fit/approximate a large number of cases.

The GMM may fit data that does not appear *linear* (e.g. a cloud of points in the shape of a circle).

The issue is finding the right $K$:

> By varying $K$, a GMM can be fitted to virtually any data distribution. The limiting factor is choosing $K$ appropriately.

### Model inference

Model inference in the GMM case is not easy because of the specific form of the log-likelihood:

\begin{align}
\log\mathcal{L}(x;\theta) &= \log\Big(\prod_{i=1}^n\sum_{k=1}^K\pi_k\phi(x_i; \mu_k, \Sigma_k)\Big)\\
&= \sum_{i=1}^n\log\Big(\sum_{k=1}^K\pi_k\phi(x_i; \mu_k, \Sigma_k)\Big)
\end{align}

This has no closed-form maximizer: the logarithm acts on a sum over components, so the parameters of the different components cannot be separated.
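
Although it cannot be maximized in closed form, the log-likelihood is straightforward to evaluate. The sketch below does so in one dimension, using the log-sum-exp trick (an implementation choice assumed here, not something the notes prescribe) so that points far from every component do not underflow to $\log(0)$. The data and parameter values are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gmm_log_likelihood(data, weights, means, sigmas):
    """Sum over observations of log( sum_k pi_k * phi(x_i; mu_k, sigma_k) )."""
    total = 0.0
    for x in data:
        terms = [math.log(w) + log_normal_pdf(x, m, s)
                 for w, m, s in zip(weights, means, sigmas)]
        total += log_sum_exp(terms)
    return total

# Hypothetical data and parameters, for illustration only.
data = [-2.1, -1.9, 0.8, 1.2, 3.0]
ll = gmm_log_likelihood(data, [0.3, 0.7], [-2.0, 1.0], [0.5, 1.5])
```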

### The Expectation-Maximization (EM) algorithm

To date, the EM algorithm is still considered more efficient than the alternative optimizers for this problem. It is the **classical solution**.

<u>First driving idea:</u>

We first revisit the model by introducing a **latent variable** $z_i \in \{0, 1\}^K$ (one per observation, with exactly one non-zero entry) to encode the class memberships:

$$z_{ik}=1 \text{ if } x_i \text{ belongs to the cluster } k \text{, $0$ otherwise}$$

\begin{align}
z|\pi &\sim Multinomial(1;\pi) \text{, i.e $p(z=k) = \pi_k$}\\
x|z=k &\sim \mathcal{N}(x;\mu_k, \Sigma_k)
\end{align}
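
The two-stage generative model above can be simulated directly: draw the label $z$ first, then draw $x$ from the selected component. This is a one-dimensional sketch; `sample_gmm` and the parameter values are assumptions made for illustration.

```python
import random

def sample_gmm(n, weights, means, sigmas, seed=0):
    """Draw n points: first z ~ Multinomial(1; pi), then x | z=k ~ N(mu_k, sigma_k^2)."""
    rng = random.Random(seed)
    labels, points = [], []
    for _ in range(n):
        # Latent label: component k is chosen with probability pi_k.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        labels.append(k)
        # Observation drawn from the chosen Gaussian component.
        points.append(rng.gauss(means[k], sigmas[k]))
    return labels, points

# Hypothetical parameters, for illustration only.
labels, points = sample_gmm(2000, [0.3, 0.7], [-2.0, 1.0], [0.5, 1.5])
share_of_component_0 = labels.count(0) / len(labels)  # should be close to pi_0 = 0.3
```

Discarding `labels` and keeping only `points` gives a sample from the mixture, which is exactly the situation faced at inference time.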

If we marginalize over $z$ (a sum, since $z$ is discrete), we recover the Gaussian mixture $p(x;\theta) = \sum_{k=1}^K\pi_k\phi(x; \mu_k, \Sigma_k)$. This allows us to write the likelihood of the pair $(x, z)$ (called the **complete likelihood**) as:

\begin{align}
\log\mathcal{L}(x, z; \theta) &= \sum^n_{i=1}\big[ \log p(z_i|x_i;\theta) + \log p(x_i;\theta)\big]\\
&= \log\mathcal{L}(x; \theta) + \sum^n_{i=1}\log p(z_i|x_i;\theta)\\
\log\mathcal{L}(x;\theta) &= \log\mathcal{L}(x, z; \theta) - \sum^n_{i=1}\log p(z_i|x_i;\theta)
\end{align}

**Note:** $\log\mathcal{L}(x, z; \theta)$ is a **lower bound** of $\log\mathcal{L}(x;\theta)$: each $p(z_i|x_i;\theta)$ is a probability, so $\log p(z_i|x_i;\theta) \le 0$. Evaluating this bound requires $z$, which is precisely what we are looking for. If $z$ were known, however, we could maximize the lower bound instead of the log-likelihood.

**Note:** The EM Algorithm works for any mixture model, and any model with a latent variable.

<u>Second driving idea:</u>

The spirit of the EM algorithm is to alternate between two steps:

1. **<u>Expectation</u> step:** Given the current value of $\theta$, denoted $\theta^*$, we compute the expectation over $z$ of the lower bound $\ell(x, z; \theta) = \log\mathcal{L}(x, z; \theta)$: $$\mathbb{E}[\ell(x, z; \theta)|\theta^*]=Q(\theta|\theta^*)$$

Note: $Q$ is a function of $\theta$ that depends on a previous value $\theta^*$.

2. **<u>Maximization</u> step:** $Q(\theta|\theta^*)$ is maximized over $\theta$, and the maximizer becomes the new value of $\theta^*$.

<hr>

> <u>Theorem (Dempster, Laird, Rubin (theorem proposition), 1977; Wu (correct proof), 1983):</u>
>
> **The sequence of parameters $(\theta^*)_q$ generated by the EM algorithm converges towards a <u>local</u> maximum of the log-likelihood $\log\mathcal{L}(x;\theta)$.**

<hr>

<u>Graphical representation:</u>

The result **depends on the initialization**. For this reason, the algorithm is run from several random initializations $\theta^0$, and the solution reaching the highest (local) maximum of the likelihood is kept. In practice, each run is also stopped as soon as a plateau of the likelihood is detected:

![EMconvergence](images/EMcoverge.png)

The central quantity to compute in the E-step is:

\begin{align}
Q(\theta|\theta^*) &= \mathbb{E}[\ell(x, z;\theta)|\theta^*]\\
\ell(x, z;\theta) &= \sum^n_{i=1} \sum^K_{k=1} z_{ik} \log(\pi_k\phi(x_i; \mu_k, \Sigma_k))
\end{align}

and therefore $Q(\theta|\theta^*) = \sum_i \sum_k \mathbb{E}[z_{ik}|\theta^*] \log(\pi_k\phi(x_i; \mu_k,\Sigma_k))$.

So the E-step for the GMM reduces to the computation of:

$$\gamma_{ik} = \mathbb{E}[z_{ik}|x_i,\theta^*] = P(z_{ik}=1|x_i,\theta^*) \overset{Bayes}{\propto} P(z_{ik} =1|\theta^*)\,p(x_i|z_{ik}=1,\theta^*) = \pi_k^*\,\phi(x_i;\mu_k^*,\Sigma_k^*)$$

- **E-step**: 

> $\gamma_{ik} = \frac{\pi^*_k\phi(x_i;\mu_k^*,\Sigma_k^*)}{\sum_{l=1}^K\pi^*_l\phi(x_i;\mu_l^*,\Sigma_l^*)} \quad \forall i, \forall k$

- **M-step**: 

> Maximize over $\pi_k$, $\mu_k$, $\Sigma_k$ the function $Q(\theta|\theta^*) = \sum_i\sum_k\gamma_{ik}\log(\pi_k\phi(x_i;\mu_k,\Sigma_k))$, where $\phi(x_i;\mu_k,\Sigma_k) = \frac{1}{|\Sigma_k|^{1/2}(2\pi)^{d/2}}\exp(-\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}_k(x_i-\mu_k))$
> and $d$ is the dimensionality of $x_i\in \mathbb{R}^d$.

The update equations for $\pi_k$, $\mu_k$, and $\Sigma_k$ are obtained by taking the partial derivatives of $Q(\theta|\theta^*)$ with respect to $\pi_k$, $\mu_k$, and $\Sigma_k$ respectively, and setting them to 0.

$$\frac{\delta}{\delta\mu_k}Q(\theta|\theta^*)= 0 \Leftrightarrow \mu_k^* = \frac{1}{n_k}\sum_{i=1}^n\gamma_{ik}x_i \text{  where  } n_k = \sum_{i=1}^n\gamma_{ik}$$

$$\frac{\delta}{\delta\Sigma_k}Q(\theta|\theta^*)= 0 \Leftrightarrow \Sigma_k^* = \frac{1}{n_k}\sum_{i=1}^n\gamma_{ik}(x_i-\mu_k^*)(x_i-\mu_k^*)^T$$

$$\frac{\delta}{\delta\pi_k}Q(\theta|\theta^*) \overset{\text{under the constraint $\sum_k\pi_k=1$}}{=} 0 \Leftrightarrow \pi_k^* = \frac{n_k}{n}$$
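
The E-step and the three update equations can be put together in a short loop. The following is a minimal one-dimensional sketch, not a reference implementation: the quantile-based initialization, the variance floor, and the tolerance are assumptions made here for reproducibility and numerical safety, and the synthetic data are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, K, max_iter=200, tol=1e-8):
    n = len(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # Initialization (assumption of this sketch): equal weights, means spread
    # over the quantiles of the data, common variance.
    ordered = sorted(data)
    weights = [1.0 / K] * K
    mu = [ordered[(2 * k + 1) * n // (2 * K)] for k in range(K)]
    var = [var_all] * K
    history = []  # log-likelihood at the start of each iteration
    for _ in range(max_iter):
        # E-step: responsibilities gamma_ik, normalized over k.
        gamma, loglik = [], 0.0
        for x in data:
            w = [weights[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
            s = sum(w)
            loglik += math.log(s)
            gamma.append([wk / s for wk in w])
        history.append(loglik)
        # Stop when the log-likelihood reaches a plateau.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: closed-form updates of mu_k, Sigma_k (a variance here), pi_k.
        for k in range(K):
            n_k = sum(g[k] for g in gamma)
            mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / n_k
            var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / n_k
            var[k] = max(var[k], 1e-6)  # floor to avoid a collapsing component
            weights[k] = n_k / n
    return weights, mu, var, history

# Hypothetical synthetic data: two well-separated clusters.
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(150)] + [rng.gauss(3.0, 1.0) for _ in range(350)]
weights, mu, var, history = em_gmm_1d(data, K=2)
```

The `history` list makes the convergence behaviour visible: the log-likelihood should not decrease from one iteration to the next.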

<hr>

<u>Computation of the derivative $\frac{\delta}{\delta\mu_k}Q(\theta|\theta^*)$:</u>

\begin{align}
\frac{\delta}{\delta\mu_k}Q(\theta|\theta^*) &= \frac{\delta}{\delta\mu_k}\mathbb{E}[\ell(x, z;\theta)|\theta^*]\\
 &= \frac{\delta}{\delta\mu_k}\sum_i \sum_j \mathbb{E}[z_{ij}|\theta^*] \log(\pi_j\phi(x_i; \mu_j,\Sigma_j))\\
 &= \frac{\delta}{\delta\mu_k}\sum_i\sum_j\gamma_{ij}\log(\pi_j\phi(x_i;\mu_j,\Sigma_j))\\
 &= \frac{\delta}{\delta\mu_k}\sum_i\sum_j\gamma_{ij}\log\big(\frac{\pi_j}{|\Sigma_j|^{1/2}(2\pi)^{d/2}}\exp(-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j))\big) \\
 &= \frac{\delta}{\delta\mu_k}\sum_i\sum_j\gamma_{ij}\big(\log\big(\frac{\pi_j}{|\Sigma_j|^{1/2}(2\pi)^{d/2}}\big)-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\big)\\
 &= \frac{\delta}{\delta\mu_k}\sum_i\sum_j\gamma_{ij}\big(-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\big)\quad\text{(the first term does not depend on $\mu_k$)}\\
 &= \sum_i\gamma_{ik}\frac{\delta}{\delta\mu_k}\big(-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\big)\quad\text{(the terms with $j \neq k$ are constant in $\mu_k$)}
\end{align}

We set $u=x_i-\mu_k$ and $A = \Sigma_k^{-1}$. Using the identity $\frac{\delta}{\delta u}\big(u^TAu\big) = (A + A^T)u$ and the chain rule with $\frac{\delta u}{\delta\mu_k} = -I$, we get:

\begin{align}
\frac{\delta}{\delta\mu_k}\big(u^TAu\big) &= -(A + A^T)u\\
 &= -\big(\Sigma_k^{-1} + (\Sigma_k^{-1})^T\big)(x_i-\mu_k)\\
 &= -\big(\Sigma_k^{-1} + (\Sigma_k^T)^{-1}\big)(x_i-\mu_k)\\
 &= -2\Sigma_k^{-1}(x_i-\mu_k)\quad\text{(since $\Sigma_k$ is symmetric)}
\end{align}

Given this result, we retrieve:

\begin{align}
\frac{\delta}{\delta\mu_k}Q(\theta|\theta^*) &= \sum_i\gamma_{ik}\big(-\frac{1}{2}\big)\big(-2\Sigma_k^{-1}(x_i-\mu_k)\big) = \sum_i\gamma_{ik}\Sigma_k^{-1}(x_i-\mu_k)
\end{align}

We are looking for:

\begin{align}
\frac{\delta}{\delta\mu_k}Q(\theta|\theta^*) &= 0 \\
\sum_i\gamma_{ik}\Sigma_k^{-1}(x_i-\mu_k) &= 0\\
\sum_i\gamma_{ik}(x_i-\mu_k) &= 0\quad\text{(multiplying on the left by $\Sigma_k$)}\\
\Leftrightarrow \mu_k^* &= \frac{\sum_i\gamma_{ik}x_i}{\sum_i\gamma_{ik}}
\end{align}

As such we find:

\begin{align}
\mu_k^* &= \frac{\sum_i\gamma_{ik}x_i}{n_k} \\
\end{align}
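
A derivation of this kind can be sanity-checked numerically. The sketch below works in one dimension (so $\Sigma_k$ is a scalar variance), compares the analytic derivative of the $\mu_k$-dependent part of $Q$ with a central finite difference, and verifies that the weighted mean cancels the derivative. The data, the responsibilities, and the function names are hypothetical.

```python
def q_mu(mu, data, gamma, var):
    """Part of Q that depends on mu_k: sum_i gamma_ik * (-(x_i - mu)^2 / (2 var))."""
    return sum(g * (-0.5 * (x - mu) ** 2 / var) for g, x in zip(gamma, data))

def grad_q_mu(mu, data, gamma, var):
    """Analytic derivative with respect to mu: sum_i gamma_ik * (x_i - mu) / var."""
    return sum(g * (x - mu) / var for g, x in zip(gamma, data))

# Hypothetical observations and responsibilities for one component k.
data = [0.5, 1.0, 2.5, 4.0]
gamma = [0.9, 0.8, 0.3, 0.1]
var = 1.7

# Central finite difference at an arbitrary point mu0.
mu0, h = 0.2, 1e-6
numeric = (q_mu(mu0 + h, data, gamma, var) - q_mu(mu0 - h, data, gamma, var)) / (2 * h)
analytic = grad_q_mu(mu0, data, gamma, var)

# The closed-form update: responsibility-weighted mean of the observations.
mu_star = sum(g * x for g, x in zip(gamma, data)) / sum(gamma)
grad_at_optimum = grad_q_mu(mu_star, data, gamma, var)  # should be numerically zero
```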

<hr>

<u>Computation of the derivative $\frac{\delta}{\delta\Sigma_k}Q(\theta|\theta^*)$:</u>

\begin{align}
\frac{\delta}{\delta\Sigma_k}Q(\theta|\theta^*) &= \frac{\delta}{\delta\Sigma_k}\mathbb{E}[\ell(x, z;\theta)|\theta^*]\\
 &= \frac{\delta}{\delta\Sigma_k}\sum_i \sum_j \mathbb{E}[z_{ij}|\theta^*] \log(\pi_j\phi(x_i; \mu_j,\Sigma_j))\\
 &= \frac{\delta}{\delta\Sigma_k}\sum_i\sum_j\gamma_{ij}\log(\pi_j\phi(x_i;\mu_j,\Sigma_j))\\
 &= \frac{\delta}{\delta\Sigma_k}\sum_i\sum_j\gamma_{ij}\log\big(\frac{\pi_j}{|\Sigma_j|^{1/2}(2\pi)^{d/2}}\exp(-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j))\big) \\
 &= \frac{\delta}{\delta\Sigma_k}\sum_i\sum_j\gamma_{ij}\big(\log\big(\frac{\pi_j}{|\Sigma_j|^{1/2}(2\pi)^{d/2}}\big)-\frac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)\big)\\
 &= \sum_i\gamma_{ik}\frac{\delta}{\delta\Sigma_k}\big(\log\big(\frac{\pi_k}{|\Sigma_k|^{1/2}(2\pi)^{d/2}}\big)-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\big)\quad\text{(the terms with $j \neq k$ are constant in $\Sigma_k$)}\\
 &= \sum_i\gamma_{ik}\frac{\delta}{\delta\Sigma_k}\big(\log(\pi_k)-\log(|\Sigma_k|^{1/2})-\log((2\pi)^{d/2})\big) - \sum_i\gamma_{ik}\frac{\delta}{\delta\Sigma_k}\big(\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\big)\\
 &= -\frac{1}{2}\big(\sum_i\gamma_{ik}\frac{\delta}{\delta\Sigma_k}\log(|\Sigma_k|) + \sum_i\gamma_{ik}\frac{\delta}{\delta\Sigma_k}\big((x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\big)\big)\quad\text{(we drop the terms that do not depend on $\Sigma_k$)}
\end{align}

We know that:

\begin{align}
\frac{\delta}{\delta\Sigma_k}log(|\Sigma_k|) &= ((\Sigma_k)^{-1})^T\\
\frac{\delta}{\delta X}a^TX^{-1}a &= -(X^{-1})^Taa^T(X^{-1})^T
\end{align}

As such:

\begin{align}
\frac{\delta}{\delta\Sigma_k}Q(\theta|\theta^*) &= -\frac{1}{2}\big(\sum_i\gamma_{ik} ((\Sigma_k)^{-1})^T - \sum_i\gamma_{ik}((\Sigma_k)^{-1})^T(x_i-\mu_k)(x_i-\mu_k)^T((\Sigma_k)^{-1})^T\big) \\
 &= -\frac{1}{2}((\Sigma_k)^{-1})^T\big(\sum_i\gamma_{ik}I - \sum_i\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T((\Sigma_k)^{-1})^T\big)
\end{align}

We are looking for:

\begin{align}
 \frac{\delta}{\delta\Sigma_k}Q(\theta|\theta^*) &= 0 \\
 -\frac{1}{2}((\Sigma_k)^{-1})^T\big(\sum_i\gamma_{ik}I - \sum_i\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T((\Sigma_k)^{-1})^T\big) &= 0 \\
 \sum_i\gamma_{ik}I &= \sum_i\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T((\Sigma_k)^{-1})^T\\
 I &= \frac{\sum_i\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_i\gamma_{ik}}\Sigma_k^{-1}\quad\text{(as $\Sigma_k^T=\Sigma_k$)}\\
 \Sigma_k &= \frac{\sum_i\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_i\gamma_{ik}}
\end{align}

As such, we find:

\begin{align}
 \Sigma_k &= \frac{\sum_i\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T}{n_k}
\end{align}
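
With column vectors $x_i \in \mathbb{R}^d$, the covariance update is a $d \times d$ matrix built from responsibility-weighted outer products of the centered observations. The following pure-Python sketch computes it for $d=2$. The points, the responsibilities, and the helper name `weighted_covariance` are assumptions made for illustration.

```python
def weighted_covariance(points, gamma, mu):
    """Return (1 / n_k) * sum_i gamma_ik * (x_i - mu)(x_i - mu)^T as a d x d nested list."""
    d = len(mu)
    n_k = sum(gamma)
    cov = [[0.0] * d for _ in range(d)]
    for g, x in zip(gamma, points):
        diff = [x[a] - mu[a] for a in range(d)]
        # Accumulate the outer product diff * diff^T, weighted by gamma_ik.
        for a in range(d):
            for b in range(d):
                cov[a][b] += g * diff[a] * diff[b]
    return [[c / n_k for c in row] for row in cov]

# Hypothetical observations and responsibilities for one component k.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
gamma = [0.9, 0.6, 0.4, 0.1]

n_k = sum(gamma)
mu = [sum(g * x[a] for g, x in zip(gamma, points)) / n_k for a in range(2)]
cov = weighted_covariance(points, gamma, mu)
```

The result is symmetric by construction, which matches the symmetry of $\Sigma_k$ used in the derivation.
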
<hr>

## 2 - Model Selection, how to choose $K$?

We need a quantity that measures the adequacy of the model to the data. This quantity is the **likelihood**.

### Model selection theory

Because the maximized likelihood does not decrease when $K$ grows, model selection criteria penalize the likelihood with a quantity that favors models with a low number of groups:

$$MSCriteria = log\mathcal{L}(x;\hat{\theta}) - pen(K)$$

The work of model selection is to find the right $pen(K)$.

### Popular criteria

- **AIC**: $log(\mathcal{L}(x;\hat{\theta})) - \eta(M)$

- **BIC**: $log(\mathcal{L}(x;\hat{\theta})) - \frac{1}{2}\eta(M)log(n)$

where $\eta(M)$ is the number of free scalar parameters in the model $M$.

In practice, $\eta(GMM)$ is easy to compute:

$$\eta(GMM\text{ with $K$ groups})=\text{ nb of }\pi_k + \text{ nb of }\mu_k + \text{ nb of }\Sigma_k$$

Since the $\pi_k$ must sum to 1 and each $\Sigma_k$ is symmetric, there are

$$(K-1) + Kd + K\frac{d(d+1)}{2}$$

free parameters.

As such:

\begin{align}
BIC(GMM)&=\log(\mathcal{L}(x;\hat{\theta})) - \frac{(K-1)+Kd + K\frac{d(d+1)}{2}}{2} \log(n)
\end{align}
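
The parameter count and the two criteria translate directly into code. In this sketch the fitted log-likelihood values are made-up numbers standing in for the output of EM runs at each $K$; the function names are also assumptions. With the sign convention used above, the preferred model is the one with the largest criterion.

```python
import math

def gmm_free_parameters(K, d):
    """(K - 1) mixing proportions + K*d mean entries + K*d*(d+1)/2 covariance entries."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def aic(log_likelihood, K, d):
    """AIC as defined above: log-likelihood minus the number of free parameters."""
    return log_likelihood - gmm_free_parameters(K, d)

def bic(log_likelihood, K, d, n):
    """BIC as defined above: log-likelihood minus (eta / 2) * log(n)."""
    return log_likelihood - 0.5 * gmm_free_parameters(K, d) * math.log(n)

# Hypothetical maximized log-likelihoods for K = 1..4 (illustrative values only).
fitted = {1: -1250.0, 2: -1100.0, 3: -1092.0, 4: -1090.0}
n, d = 500, 2

scores = {K: bic(ll, K, d, n) for K, ll in fitted.items()}
best_K = max(scores, key=scores.get)
```

In this made-up example the gain in likelihood beyond $K=2$ is too small to offset the penalty, so the BIC selects $K=2$.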