In [1]:
import numpy as np
import pandas as pd
import math

## Part I:  Gaussian Mixtures

### Objective
Implement the EM algorithm **from scratch** for a $p$-dimensional Gaussian mixture model 
with $G$ components: 
$$
\sum_{k=1}^G p_k \cdot \textsf{N}(x; \mu_k, \Sigma).
$$
### Requirements

Your implementation should consist of **four** functions. 

- **`Estep`** function: This function should return an $n$-by-$G$ matrix, where the $(i,k)$th entry represents the conditional probability $P(Z_i = k \mid x_i)$. Here $i$ ranges from $1$ to $n$ and $k$ ranges from $1$ to $G$.

- **`Mstep`** function: This function should return the updated parameters for the Gaussian mixture model.

- **`loglik`** function: This function computes the log-likelihood of the data given the parameters.

- **`myEM`** function (main function): Inside this function, you can call the `Estep`, `Mstep`, and `loglik` functions. The function should take the following inputs and return the estimated parameters and log-likelihood:     

  - **Input**: 
    - data: The dataset.
    - $G$: The number of components.
    - Initial parameters.
    - `itmax`: The number of iterations.
  - **Output**: 
    - `prob`: A $G$-dimensional probability vector $(p_1, \dots, p_G)$
    - `mean`: A $p$-by-$G$ matrix with the $k$-th column being $\mu_k$, the $p$-dimensional mean for the $k$-th Gaussian component. 
    - `Sigma`: A $p$-by-$p$ covariance matrix $\Sigma$ shared by all $G$ components.
    - `loglik`: A number equal to $\sum_{i=1}^n \log \Big [ \sum_{k=1}^G p_k \cdot \textsf{N}(x_i; \mu_k, \Sigma) \Big ].$
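
For reference, the standard EM updates for this shared-covariance mixture can be written in the notation above. The symbol $\gamma_{ik}$ is shorthand introduced here (it does not appear in the assignment text) for $P(Z_i = k \mid x_i)$. The first expression is the E-step; the remaining three are the M-step, with $\Sigma$ computed from the freshly updated means:

$$
\gamma_{ik} = \frac{p_k \cdot \textsf{N}(x_i; \mu_k, \Sigma)}{\sum_{j=1}^G p_j \cdot \textsf{N}(x_i; \mu_j, \Sigma)}, \qquad
p_k = \frac{1}{n} \sum_{i=1}^n \gamma_{ik}, \qquad
\mu_k = \frac{\sum_{i=1}^n \gamma_{ik} \, x_i}{\sum_{i=1}^n \gamma_{ik}}, \qquad
\Sigma = \frac{1}{n} \sum_{k=1}^G \sum_{i=1}^n \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^t.
$$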

**Implementation Guidelines:**

  - Avoid explicit loops over the sample size $n$.
  - You are allowed to use loops over the number of components $G$, although you can avoid all loops. 
  - You are not allowed to use pre-existing functions or packages for evaluating normal densities.
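
One way to satisfy these guidelines is to evaluate the Gaussian density by hand on the log scale. The sketch below is illustrative only: the helper names `log_weighted_density` and `log_sum_exp` are hypothetical, it uses a Cholesky factor rather than any particular decomposition required by the assignment, and it runs on synthetic data. It loops over neither $n$ nor $G$.

```python
import numpy as np

def log_weighted_density(X, prob, mu, Sigma):
    """Return the n-by-G matrix of log(p_k * N(x_i; mu_k, Sigma))."""
    n, p = X.shape
    L = np.linalg.cholesky(Sigma)              # Sigma = L @ L.T
    # Whiten data and means by solving L z = x, so that z'z = x' Sigma^{-1} x.
    Xw = np.linalg.solve(L, X.T).T             # n by p
    Mw = np.linalg.solve(L, mu).T              # G by p
    # Squared Mahalanobis distances via broadcasting: n by G.
    d2 = ((Xw[:, None, :] - Mw[None, :, :]) ** 2).sum(axis=2)
    log_det = 2.0 * np.log(np.diag(L)).sum()   # log|Sigma|
    return np.log(prob) - 0.5 * (d2 + p * np.log(2 * np.pi) + log_det)

def log_sum_exp(A):
    """Row-wise log(sum(exp(A))) computed without overflow."""
    m = A.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(A - m).sum(axis=1, keepdims=True))).ravel()

# Synthetic inputs, chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
prob = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0], [0.0, -1.0]])       # p by G, one column per component
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])

logw = log_weighted_density(X, prob, mu, Sigma)
resp = np.exp(logw - log_sum_exp(logw)[:, None])   # E-step probabilities
loglik_value = log_sum_exp(logw).sum()             # mixture log-likelihood
```

Working on the log scale is a design choice rather than a requirement: it avoids underflow when a point lies far from every component mean.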

In [2]:
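def Estep(data, G, prob, mu, Sigma):
    """Return the n-by-G matrix of posterior probabilities P(Z_i = k | x_i)."""
    ndim = data.shape[1]  # derive the dimension locally instead of using a global
    g = np.zeros((len(data), G))
    # Whiten with the SVD Sigma = U diag(D) U^T so that the Mahalanobis
    # distance becomes a plain squared Euclidean distance.
    U, D, _ = np.linalg.svd(Sigma)
    D_tilde = np.diag(1 / np.sqrt(D))
    x_tilde = D_tilde.dot(U.T.dot(data.T)).T  # n by ndim
    mu_tilde = D_tilde.dot(U.T.dot(mu)).T  # G by ndim
    for k in range(G):
        diff = x_tilde - mu_tilde[k]
        g[:, k] = np.exp(-0.5 * np.sum(diff * diff, axis=1)) * prob[k]
    # Normalizing constant shared by all components (it cancels in the next line).
    g /= np.sqrt((2 * math.pi) ** ndim * np.linalg.det(Sigma))
    g /= np.sum(g, axis=1, keepdims=True)
    return g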

In [3]:
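def Mstep(data, g):
    """Update (prob, mu, Sigma) from the n-by-G matrix g returned by Estep."""
    n, G = g.shape  # derive sizes locally instead of using globals
    ndim = data.shape[1]
    prob = np.mean(g, axis=0)
    mu = np.zeros((ndim, G))
    Sigma = np.zeros((ndim, ndim))
    for k in range(G):
        # Weighted mean of the data for component k.
        mu[:, k] = np.sum(np.multiply(g[:, k:k+1], data), axis=0) / np.sum(g[:, k])
        # Accumulate the weighted scatter around mu_k into the shared covariance.
        temp_data = data - mu[:, k]
        temp_data1 = np.multiply(g[:, k:k+1], temp_data)
        Sigma += temp_data.T.dot(temp_data1)
    Sigma /= n
    return prob, mu, Sigma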

In [4]:
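def loglik(data, prob, mu, Sigma):
    """Return the log-likelihood of the data under the mixture parameters."""
    G = len(prob)  # derive sizes locally instead of using globals
    ndim = data.shape[1]
    g = np.zeros((len(data), G))
    # Same whitening as in Estep: Sigma = U diag(D) U^T.
    U, D, _ = np.linalg.svd(Sigma)
    D_tilde = np.diag(1 / np.sqrt(D))
    x_tilde = D_tilde.dot(U.T.dot(data.T)).T  # n by ndim
    mu_tilde = D_tilde.dot(U.T.dot(mu)).T  # G by ndim
    for k in range(G):
        diff = x_tilde - mu_tilde[k]
        g[:, k] = np.exp(-0.5 * np.sum(diff * diff, axis=1)) * prob[k]
    g /= np.sqrt((2 * math.pi) ** ndim * np.linalg.det(Sigma))
    # Sum over components inside the log, then over samples.
    llh = np.sum(np.log(np.sum(g, axis=1)), axis=0)
    return llh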

In [5]:
def myEM(data, G, prob, mu, Sigma, itmax=20):
    """Run itmax EM iterations; return the estimates and the final log-likelihood."""
    for _ in range(itmax):
        g = Estep(data, G, prob, mu, Sigma)
        prob, mu, Sigma = Mstep(data, g)
    return prob, mu, Sigma, loglik(data, prob, mu, Sigma)
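
A useful sanity check for any EM implementation is that the log-likelihood should never decrease from one iteration to the next. The following self-contained sketch records it at every iteration. It is not the notebook's `myEM`: the names `_log_weighted` and `em_with_trace` are hypothetical, the density is computed with a Cholesky factor, and the data are synthetic.

```python
import numpy as np

def _log_weighted(X, prob, mu, Sigma):
    # n-by-G matrix of log(p_k * N(x_i; mu_k, Sigma)) with a shared Sigma.
    p = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X.T).T
    Mw = np.linalg.solve(L, mu).T
    d2 = ((Xw[:, None, :] - Mw[None, :, :]) ** 2).sum(axis=2)
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return np.log(prob) - 0.5 * (d2 + p * np.log(2 * np.pi) + log_det)

def em_with_trace(X, prob, mu, Sigma, itmax=20):
    """Hypothetical EM variant that also returns the log-likelihood per iteration."""
    n = X.shape[0]
    trace = []
    for _ in range(itmax):
        logw = _log_weighted(X, prob, mu, Sigma)
        m = logw.max(axis=1, keepdims=True)
        log_mix = m + np.log(np.exp(logw - m).sum(axis=1, keepdims=True))
        trace.append(log_mix.sum())        # log-likelihood at the current parameters
        g = np.exp(logw - log_mix)         # E-step: n by G
        prob = g.mean(axis=0)              # M-step: mixing weights
        mu = (X.T @ g) / g.sum(axis=0)     # M-step: p by G means
        Sigma = np.zeros_like(Sigma)
        for k in range(mu.shape[1]):       # loop over components only
            R = X - mu[:, k]
            Sigma += (R * g[:, [k]]).T @ R
        Sigma /= n
    return prob, mu, Sigma, np.array(trace)

# Synthetic two-cluster data, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([4, 4], 1.0, size=(40, 2))])
prob0 = np.array([0.5, 0.5])
mu0 = np.array([[0.5, 3.0], [0.5, 3.0]])   # p by G starting means
Sigma0 = np.cov(X.T)

prob, mu, Sigma, trace = em_with_trace(X, prob0, mu0, Sigma0)
print(np.all(np.diff(trace) >= -1e-8))     # monotone up to rounding error
```

If the check fails, the usual suspects are an M-step that uses stale means when updating $\Sigma$ or a missing normalizing constant in the density.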

### Testing

Test your code with the provided dataset, [faithful.dat](https://liangfgithub.github.io/Data/faithful.dat), with both $G=2$ and $G=3$. 


In [6]:
data = pd.read_table("faithful.dat", sep=r"\s+", index_col=0)  # raw string avoids an invalid-escape warning

**For the case when $G=2$**, set your initial values as follows:

- $p_1 = 10/n$, $p_2 = 1 - p_1$.
- $\mu_1$ =  the mean of the first 10 samples; $\mu_2$ = the mean of the remaining samples.
- Calculate $\Sigma$ as  
$$
\frac{1}{n} \Big [ \sum_{i=1}^{10} (x_i- \mu_1)(x_i- \mu_1)^t + \sum_{i=11}^n (x_i- \mu_2)(x_i- \mu_2)^t \Big].
$$
Here each $x_i - \mu_k$ is a 2-by-1 vector, so the resulting $\Sigma$ matrix is a 2-by-2 matrix. 

Run your EM implementation with **20** iterations. 

In [7]:
n = len(data)
G = 2
ndim = 2
prob = np.array([10/n, 1-10/n])
mu = np.array([np.mean(data[:10], axis=0), np.mean(data[10:], axis=0)])
mu = mu.T
Sigma = np.zeros((2,2))
temp_data1 = np.array(data[:10]-mu[:, 0])
temp_data2 = np.array(data[10:]-mu[:, 1])
Sigma[0, 0] = np.sum(np.multiply(temp_data1[:,0], temp_data1[:,0])) + np.sum(np.multiply(temp_data2[:,0], temp_data2[:,0]))
Sigma[0, 1] = Sigma[1, 0] = np.sum(np.multiply(temp_data1[:,0],temp_data1[:,1]))+np.sum(np.multiply(temp_data2[:,0],temp_data2[:,1]))
Sigma[1, 1] = np.sum(np.multiply(temp_data1[:,1], temp_data1[:,1])) + np.sum(np.multiply(temp_data2[:,1], temp_data2[:,1]))
Sigma /= n
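
The element-by-element construction of `Sigma` above can also be written with matrix products. The sketch below is one possible compact form, not the notebook's own code: `init_params` and its `sizes` argument are hypothetical names, and the demo uses random data in place of `faithful.dat`.

```python
import numpy as np

def init_params(X, sizes):
    """Hypothetical helper: initial (prob, mu, Sigma) from consecutive row blocks.

    sizes lists the sizes of the first G-1 blocks; the last block takes
    all remaining rows.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    bounds = np.concatenate(([0], np.cumsum(sizes), [n]))
    prob = np.diff(bounds) / n
    mu = np.zeros((p, len(prob)))
    Sigma = np.zeros((p, p))
    for k in range(len(prob)):             # loop over components only
        block = X[bounds[k]:bounds[k + 1]]
        mu[:, k] = block.mean(axis=0)
        R = block - mu[:, k]
        Sigma += R.T @ R                   # scatter of block k around its mean
    return prob, mu, Sigma / n

# Demo on random data; with the real dataset one would pass `data` instead.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
prob_demo, mu_demo, Sigma_demo = init_params(X_demo, [10, 20])
```

Under these assumptions, `init_params(data, [10])` would correspond to the $G=2$ initialization and `init_params(data, [10, 20])` to the $G=3$ one.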

In [8]:
new_prob, new_mu, new_Sigma, llh = myEM(data, G, prob, mu, Sigma, itmax=20)

In [9]:
print("prob")
print(new_prob)
print("mean")
print(new_mu)
print("Sigma")
print(new_Sigma)
print("loglik")
print(llh)

prob
[0.04297883 0.95702117]
mean
[[ 3.49564188  3.48743016]
 [76.79789154 70.63205853]]
Sigma
           eruptions     waiting
eruptions   1.297936   13.924336
waiting    13.924336  182.580092
loglik
-1289.5693549424107


**For the case when $G=3$**, set your initial values as follows:


- $p_1 = 10/n$, $p_2 = 20/n$, $p_3= 1 - p_1 - p_2$
- $\mu_1 = \frac{1}{10} \sum_{i=1}^{10} x_i$,  the mean of the first 10 samples; $\mu_2 = \frac{1}{20} \sum_{i=11}^{30} x_i$, the mean of next 20 samples; and $\mu_3$ = the mean of the remaining samples. 
- Calculate $\Sigma$ as 
$$
\frac{1}{n} \Big [ \sum_{i=1}^{10} (x_i- \mu_1)(x_i- \mu_1)^t + \sum_{i=11}^{30} (x_i- \mu_2)(x_i- \mu_2)^t + \sum_{i=31}^n (x_i- \mu_3)(x_i- \mu_3)^t \Big].
$$


Run your EM implementation with **20** iterations. 

In [10]:
n = len(data)
G = 3
ndim = 2
prob = np.array([10/n, 20/n, 1-10/n-20/n])
mu = np.array([np.mean(data[:10], axis=0), np.mean(data[10:30], axis=0),
np.mean(data[30:], axis=0)])
mu = mu.T
Sigma = np.zeros((ndim,ndim))
temp_data1 = np.array(data[:10]-mu[:, 0])
temp_data2 = np.array(data[10:30]-mu[:, 1])
temp_data3 = np.array(data[30:]-mu[:, 2])
Sigma[0, 0] = np.sum(np.multiply(temp_data1[:,0], temp_data1[:,0])) + np.sum(np.multiply(temp_data2[:,0], temp_data2[:,0])) + np.sum(np.multiply(temp_data3[:,0], temp_data3[:,0]))
Sigma[0, 1] = Sigma[1, 0] = np.sum(np.multiply(temp_data1[:,0],temp_data1[:,1]))+np.sum(np.multiply(temp_data2[:,0],temp_data2[:,1]))+np.sum(np.multiply(temp_data3[:,0],temp_data3[:,1]))
Sigma[1, 1] = np.sum(np.multiply(temp_data1[:,1], temp_data1[:,1])) + np.sum(np.multiply(temp_data2[:,1], temp_data2[:,1])) + np.sum(np.multiply(temp_data3[:,1], temp_data3[:,1]))
Sigma /= n

In [11]:
new_prob, new_mu, new_Sigma, llh = myEM(data, G, prob, mu, Sigma, itmax=20)

In [12]:
print("prob")
print(new_prob)
print("mean")
print(new_mu)
print("Sigma")
print(new_Sigma)
print("loglik")
print(llh)

prob
[0.04363422 0.07718656 0.87917922]
mean
[[ 3.51006918  2.81616674  3.54564083]
 [77.10563811 63.35752634 71.25084801]]
Sigma
           eruptions     waiting
eruptions   1.260158   13.511538
waiting    13.511538  177.964191
loglik
-1289.350958862738
