# Mixture model experiment: MCMC and EM
In this notebook we run an experiment to verify that the number of gradient queries required by the optimization approach (expectation-maximization, EM) grows exponentially with the dimension parameter $d$, whereas for the sampling approach (a Langevin MCMC algorithm) we observe a linear relation.

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib notebook
import math

import numpy as np
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from src.plot import plot_points_in_ball, plot_gmm_initializations_2d
from src.mixture import GaussianMixture
from src.optimizer import em, ula

## Data generation and parameter initialization

In this section we visualize how data points are generated in the experiment and how the mean parameters are initialized. We will do this for the 2-dimensional setting.

We follow the settings explained on page 39, appendix E, in the paper *Sampling can be faster than optimization*. 

For $d=2$ we have $n=2^d=4$ data points and $M=\lfloor \log_2 d \rfloor = 1$ Gaussian component, and thus a single mean parameter. The region that contains the data is the ball (a disk in two dimensions) centered at the origin with radius $R=2M=2$.
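To see how these quantities change with $d$, the formulas quoted above can be tabulated directly. This is only a sketch of the arithmetic; the helper name `experiment_settings` is made up for illustration and is not part of `src`.

```python
def experiment_settings(d: int) -> dict:
    """Settings quoted in the text: n = 2^d, M = floor(log2(d)), R = 2M."""
    if d < 2:
        raise ValueError("d must be at least 2 so that M >= 1")
    n = 2 ** d               # number of data points
    M = d.bit_length() - 1   # exact floor(log2(d)) for a positive integer d
    R = 2 * M                # radius of the ball containing the data
    return {"d": d, "n": n, "M": M, "R": R}


# Tabulate the settings for the dimensions used in this notebook (2..8)
for d in range(2, 9):
    print(experiment_settings(d))
```

Note that $n$ grows exponentially in $d$ while $M$ and $R$ grow only logarithmically.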

The data points are generated such that the following properties are achieved:

1. The clusters need to be adequately separated.
2. The number of clusters needs to be small ($M=\lfloor \log_2 d \rfloor$ in the paper).

Observe that the data points have either the x or y coordinate set to 0 since the number of non-zero coefficients for each sampled point is $M=1$.
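Both properties can be checked numerically on any array of points. The sketch below uses four made-up points for $d=2$, $M=1$; they are an illustrative assumption, not the data produced by `GaussianMixture`.

```python
import numpy as np

# Made-up illustrative points for d = 2, M = 1 (NOT the notebook's actual data)
points = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
M = 1

# Sparsity: number of non-zero coordinates per point
nonzeros_per_point = np.count_nonzero(points, axis=1)

# Separation: smallest pairwise Euclidean distance, ignoring the zero diagonal
diffs = points[:, None, :] - points[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)
off_diagonal = ~np.eye(len(points), dtype=bool)
min_separation = dists[off_diagonal].min()

print("non-zeros per point:", nonzeros_per_point)
print("minimum separation:", min_separation)
```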

In the following figure we visualize the generated points along with the two parameter initialization techniques for $d=2$. The mean of the Gaussian is initialized either:

1. from the data
2. uniformly at random from a ball of radius $R$
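The two initialization schemes can be sketched as follows. The function names and the toy `data` array are assumptions made for illustration; the notebook's own logic lives in `src.mixture` and may differ. Uniform sampling in the ball uses the standard construction of a random direction scaled by $R \cdot U^{1/d}$.

```python
import numpy as np


def init_from_data(data, n_components, rng):
    """Pick distinct data points as the initial means."""
    idx = rng.choice(len(data), size=n_components, replace=False)
    return data[idx]


def init_uniform_in_ball(n_components, dim, radius, rng):
    """Draw initial means uniformly from the ball of the given radius."""
    # Random directions: normalized Gaussian vectors are uniform on the sphere
    directions = rng.standard_normal((n_components, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii: radius * U^(1/dim) makes the points uniform in volume
    radii = radius * rng.random((n_components, 1)) ** (1.0 / dim)
    return directions * radii


rng = np.random.default_rng(0)
# Made-up data for d = 2 (illustrative only)
data = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
mu_from_data = init_from_data(data, n_components=1, rng=rng)
mu_from_ball = init_uniform_in_ball(n_components=1, dim=2, radius=2.0, rng=rng)
```

Scaling the radius by $U^{1/d}$ rather than $U$ matters: without it, the samples would concentrate near the origin.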

In [None]:
plot_gmm_initializations_2d()

For verification purposes, the following figures show a sample of $1000$ points from the ball of radius $R=2$ in 2- and 3-dimensional space.

In [None]:
plot_points_in_ball(radius=2, dim=2)

In [None]:
plot_points_in_ball(radius=2, dim=3)

## Experiment

Now we begin the experiment. The setup is simple:

1. Iterate over the dimension parameter $d$.
2. Generate the dataset for that value of $d$.
3. Estimate the parameters of the distribution using expectation-maximization (EM).
4. Estimate the parameters of the distribution using the unadjusted Langevin algorithm (ULA).
5. Record the number of gradient queries required by both approaches.
6. Create a line plot of the gradient queries required for convergence.
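For the Langevin step, the code in this notebook imports `ula` from `src.optimizer`, i.e. the unadjusted variant. As a reference point, here is a minimal sketch of the unadjusted Langevin update with a gradient-query counter. The function `ula_sketch`, its signature, and the toy Gaussian target are assumptions for illustration; this is not the `src.optimizer.ula` implementation.

```python
import numpy as np


def ula_sketch(grad_U, x0, step_size, n_steps, rng):
    """Unadjusted Langevin: x <- x - step * grad U(x) + sqrt(2 * step) * noise."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    gradient_queries = 0
    for k in range(n_steps):
        g = grad_U(x)
        gradient_queries += 1  # one gradient query per step
        noise = rng.standard_normal(x.shape)
        x = x - step_size * g + np.sqrt(2.0 * step_size) * noise
        samples[k] = x
    return samples, gradient_queries


# Toy target (assumption): Gaussian with mean `mu` and identity covariance,
# so U(x) = 0.5 * ||x - mu||^2 and grad U(x) = x - mu
mu = np.array([1.0, -1.0])
samples, n_queries = ula_sketch(
    grad_U=lambda x: x - mu,
    x0=np.zeros(2),
    step_size=0.05,
    n_steps=20000,
    rng=np.random.default_rng(0),
)
estimated_mean = samples.mean(axis=0)
```

Because there is no accept/reject step, the chain is biased for any fixed step size; the Metropolis-adjusted variant (MALA) removes that bias at the cost of an extra acceptance test per step.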

In [None]:
MAX_ITERATIONS = 5000
MAX_ESTIMATION_ITERATIONS = 1000000
EM_ESTIMATE_THRESHOLD = 1e-8
EM_ERROR_THRESHOLD = 1e-6
ULA_PARAM_ESTIMATE_THRESHOLD = 1e-5
ULA_OBJECTIVE_ESTIMATE_THRESHOLD = 1e-8
FROM_DATA = True
NB_TRIALS = 10
NB_TRIALS_ESTIMATE = 20
NB_EXPS_ULA = 1
MAX_D = 8
ERROR_GRADIENT_ESTIMATE = 1e-4
GAMMA = None

dims = range(2, MAX_D+1)

models = {
    m.__name__: {
        'constructor': m,
        'df': None
    }
    for m in [
        GaussianMixture
    ]
}

for name, model_dict in models.items():
    # Record results
    model_results = []

    
    # Default iterations to start with
    iterations_em = 2
    iterations_ula = 2
    
    # Run for rest of dimensions
    for d in tqdm(dims):
        # Create finite mixture model problem
        model = model_dict['constructor'](d)
        
        # Run EM
        if iterations_em < MAX_ITERATIONS:
            
            # Estimate underlying set of parameters
            em_true_params = None
            em_true_params_found = False
            em_true_params_iterations = iterations_em*1000
            while not em_true_params_found:
                
                # Assume we have enough iterations until proven otherwise
                em_true_params_found = True
                
                # Run through trials
                em_estimate_params = np.zeros((NB_TRIALS_ESTIMATE, em_true_params_iterations, 
                                               model.params.shape[0], model.params.shape[1]))
                em_estimate_objective = np.zeros((NB_TRIALS_ESTIMATE, em_true_params_iterations))
                for i in range(NB_TRIALS_ESTIMATE):
                    
                    # Exit if we failed for given amount of iterations already
                    if not em_true_params_found:
                        break
                    
                    # Reset parameters
                    model.reset()
                    
                    # Run EM
                    em_param_iterates, em_objective_iterates = em(model, em_true_params_iterations)
                    em_estimate_params[i] = em_param_iterates
                    em_estimate_objective[i] = em_objective_iterates
                    
                    # Make sure we don't differ too much from previous estimates
                    for j in range(i+1):
                        
                        # If differ by more than threshold, we reset and increase amount of iterations
                        delta = np.linalg.norm(em_estimate_params[j][-1] - em_estimate_params[i][-1], 1)
                        if delta > EM_ESTIMATE_THRESHOLD:
                            em_true_params_iterations *= 10
                            em_true_params_found = False
                            print(f'Delta: {delta}')
                            print(f'Iterations now: {em_true_params_iterations}')
                            break
                        
                # Just choose any of the NB_TRIALS_ESTIMATE trials (here the last one)
                em_true_params = em_estimate_params[-1]
                em_true_params_objective = em_estimate_objective[-1]
                
            # Repeat for the specified amount of trials
            for _ in range(NB_TRIALS):
            
                # Record how many iterations were required
                model.reset()
                iterations_em = len(em(model, MAX_ITERATIONS, em_true_params_objective[-1], EM_ERROR_THRESHOLD)[0])
                model_results.append({'Dimensions': d, 'Algorithm': 'EM', 'Gradient queries': iterations_em})
                
        # Run ULA (disabled as we couldn't get it to converge in the end)
        if False and iterations_ula < MAX_ITERATIONS:
            
            # Estimate underlying set of parameters
            ula_expected_params = None
            ula_expected_objective = None
            ula_expected_params_found = False
            ula_expected_params_iterations = iterations_ula*1000
            if ula_expected_params_iterations > MAX_ESTIMATION_ITERATIONS:
                raise TimeoutError
            while not ula_expected_params_found:
                
                # Assume we have enough iterations until proven otherwise
                ula_expected_params_found = True
                
                # Run through trials
                ula_estimate_params = np.zeros((NB_TRIALS_ESTIMATE,
                                               model.params.shape[0], model.params.shape[1]))
                ula_estimate_objective = np.zeros(NB_TRIALS_ESTIMATE)
                for i in range(NB_TRIALS_ESTIMATE):
                    
                    # Reset parameters
                    model.reset(FROM_DATA)
                    
                    # Run ULA
                    ula_param_iterates, ula_objective_iterates = ula(model, 
                                                                     ula_expected_params_iterations,
                                                                     NB_EXPS_ULA,
                                                                     ERROR_GRADIENT_ESTIMATE,
                                                                     gamma=GAMMA)
                    
                    # Compute expectation on parameters: average over iterates
                    # (assumes axis 0 indexes iterations, as for the EM iterates)
                    ula_estimate_params[i] = np.mean(ula_param_iterates, axis=0)
                    ula_estimate_objective[i] = np.mean(ula_objective_iterates)
                    
                    # Make sure we don't differ too much from previous estimates
                    for j in range(i+1):
                        
                        # If differ by more than threshold, we reset and increase amount of iterations
                        delta_params = np.linalg.norm(ula_estimate_params[j] - ula_estimate_params[i], 1)
                        delta_objective = abs(ula_estimate_objective[j] - ula_estimate_objective[i])
                        if delta_params > ULA_PARAM_ESTIMATE_THRESHOLD or delta_objective > ULA_OBJECTIVE_ESTIMATE_THRESHOLD:
                            ula_expected_params_iterations *= 10
                            print(f'Delta params: {delta_params}')
                            print(f'Delta objective: {delta_objective}')
                            print(f'Iterations now: {ula_expected_params_iterations}')
                            ula_expected_params_found = False
                            break
                        
                # Just choose any of the NB_TRIALS_ESTIMATE trials (here the last one)
                ula_expected_params = ula_estimate_params[-1]
                ula_expected_objective = ula_estimate_objective[-1]
                
            # Repeat for the specified amount of trials
            for _ in range(NB_TRIALS):
            
                # Record how many iterations were required
                model.reset(FROM_DATA)
                iterations_ula = len(ula(model, MAX_ITERATIONS, NB_EXPS_ULA, ERROR_GRADIENT_ESTIMATE, 
                                        exp_mu=ula_expected_params, exp_U=ula_expected_objective, 
                                         timeout=MAX_ITERATIONS, gamma=GAMMA)[0])
                model_results.append({'Dimensions': d, 'Algorithm': 'ULA', 'Gradient queries': iterations_ula})
                
                
    # Save results
    model_dict['df'] = pd.DataFrame(model_results)

In [None]:
# Look the results up by model name instead of relying on the leaked loop variable
models['GaussianMixture']['df']

## Analysis
Now that we have all the necessary data we can examine our results to see if they make sense in the context of the paper we are referencing.

We begin by generating figures for the individual mixture problems:

In [None]:
# Gaussian
plt.figure()
sns.lineplot(data=models['GaussianMixture']['df'], x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Gaussian Mixture Model')
plt.show()
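One way to judge whether a curve is closer to exponential or linear growth is to compare how well a straight line fits the raw query counts versus their logarithm. The sketch below illustrates this on synthetic numbers that are made up purely for the demonstration and are not results from this experiment; `linear_fit_r2` is a hypothetical helper.

```python
import numpy as np


def linear_fit_r2(x, y):
    """R^2 of a least-squares straight line through (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


dims = np.arange(2, 9)
# Synthetic query counts (made up for illustration, NOT experimental results)
exponential_like = 3.0 * 2.0 ** dims
linear_like = 5.0 * dims + 1.0

for label, queries in [("exponential-like", exponential_like),
                       ("linear-like", linear_like)]:
    r2_raw = linear_fit_r2(dims, queries)          # good fit => linear growth
    r2_log = linear_fit_r2(dims, np.log(queries))  # good fit => exponential growth
    print(f"{label}: R^2 raw = {r2_raw:.4f}, R^2 log = {r2_log:.4f}")
```

To apply the same check to the recorded results, the per-dimension averages could be obtained with `df.groupby('Dimensions')['Gradient queries'].mean()` on the DataFrame stored in `models['GaussianMixture']['df']`.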