# Mixture model experiment
In this notebook we run an experiment to verify that the number of gradient queries grows exponentially with the parameter $d$ for stochastic gradient descent, whereas it grows linearly for the Metropolis-adjusted Langevin algorithm (MALA).

In [None]:
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

## Generating the data
We begin by generating the datasets used in our evaluation. Because we try several distributions, each one must be parameterized so that the following properties hold:
1. The clusters must be adequately separated.
2. The number of clusters in each dataset must be small ($M=\log_2 d$ in the paper).

A possible extension is to violate these properties deliberately and observe the resulting failure mode.

In our experiment we use the following distributions (we always let $N=2^d$):
1. Gaussian ($\sigma = 1/\sqrt{d}$, $M = \log_2 d$)
2. Dirichlet
3. Exponential
4. Student's T (included since not log-concave for all parameters)
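
The remark about the Student's t distribution can be checked directly. As a short worked derivation, take the standard one-dimensional density with $\nu$ degrees of freedom (the symbol $\nu$ is our notation, not taken from the paper):

$$
\log p(x) = -\frac{\nu+1}{2}\log\left(1+\frac{x^2}{\nu}\right) + \text{const},
\qquad
\frac{d^2}{dx^2}\log p(x) = -\frac{(\nu+1)(\nu - x^2)}{(\nu + x^2)^2}.
$$

The second derivative is positive whenever $|x| > \sqrt{\nu}$, so the log-density is convex in the tails and the density is not log-concave for any finite $\nu$.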

To run the experiment we need a new problem of each kind for every value of $d$, i.e. four datasets per value of $d$ ($4d$ datasets in total when sweeping from $1$ to $d$).
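
The notebook does not show how the datasets are built, so the following is only a minimal sketch of what the Gaussian case could look like under the parameters listed above ($N = 2^d$, $\sigma = 1/\sqrt{d}$, $M = \log_2 d$). The function name `make_gaussian_mixture`, the `separation` factor, the placement of the centres on coordinate axes, and the dictionary name `gaussian_sketch` are all assumptions made for illustration.

```python
import math

import numpy as np


def make_gaussian_mixture(d, separation=4.0, seed=0):
    """Sample N = 2**d points in R^d from M spherical Gaussian clusters.

    M = floor(log2 d), clamped to at least 1 (log2(1) = 0 would give no clusters).
    """
    rng = np.random.default_rng(seed)
    n_samples = 2 ** d
    n_components = max(1, int(math.log2(d)))
    sigma = 1.0 / math.sqrt(d)

    # Assumption: centres sit on scaled coordinate axes, so any two centres
    # are separation * sigma * sqrt(2) apart.
    centers = separation * sigma * np.eye(d)[:n_components]

    # Draw a cluster label for each point, then add isotropic Gaussian noise.
    labels = rng.integers(n_components, size=n_samples)
    points = centers[labels] + sigma * rng.standard_normal((n_samples, d))
    return points, labels


# One dataset per value of d, indexed by d like the lookups in the experiment loop.
gaussian_sketch = {d: make_gaussian_mixture(d)[0] for d in range(1, 10)}
```

Lowering `separation` is one way to probe the failure mode when the clusters are no longer adequately separated.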

## Experiment
Now we begin the experiment. The setup is simple:
1. Iterate over parameter $d$.
2. Gather our distribution datasets for that parameter $d$.
3. Estimate the parameters of the distribution using expectation-maximization.
4. Estimate the parameters of the distribution using MALA.
5. Save the number of gradient queries required by both approaches.
6. Create a line plot of the gradient queries required for convergence.
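
Steps 4 and 5 rely on counting gradient queries during sampling. The helper `MALA_until_convergence` used below is not shown in this notebook, so the following is only a hedged sketch of a plain MALA loop that counts gradient evaluations. The name `mala_sketch`, its arguments, and the fixed number of steps (instead of a convergence criterion) are assumptions.

```python
import numpy as np


def mala_sketch(log_density, grad_log_density, x0, step, n_steps, seed=0):
    """Run n_steps of MALA and return (samples, number of gradient queries)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    grad_x = grad_log_density(x)
    queries = 1  # one gradient query for the starting point
    samples = np.empty((n_steps, x.size))

    for t in range(n_steps):
        # Langevin proposal: gradient drift plus Gaussian noise.
        y = x + step * grad_x + np.sqrt(2 * step) * rng.standard_normal(x.shape)
        grad_y = grad_log_density(y)
        queries += 1  # one gradient query per proposal

        # Log proposal densities q(y | x) and q(x | y), up to a shared constant.
        log_q_forward = -np.sum((y - x - step * grad_x) ** 2) / (4 * step)
        log_q_backward = -np.sum((x - y - step * grad_y) ** 2) / (4 * step)

        # Metropolis-Hastings correction.
        log_alpha = (log_density(y) - log_density(x)
                     + log_q_backward - log_q_forward)
        if np.log(rng.uniform()) < log_alpha:
            x, grad_x = y, grad_y
        samples[t] = x

    return samples, queries


# Usage on a standard Gaussian target in three dimensions.
samples, queries = mala_sketch(
    log_density=lambda x: -0.5 * np.dot(x, x),
    grad_log_density=lambda x: -x,
    x0=np.zeros(3),
    step=0.1,
    n_steps=200,
)
```

Caching the gradient at the current point means each iteration costs exactly one new gradient query, which keeps the count easy to interpret.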

In [None]:
ds = list(range(1, 10))
iterations = len(ds)

# Gradient-query counts, one entry per value of d
gauss_gq_EM = np.zeros(iterations)
gauss_gq_MALA = np.zeros(iterations)
dirichlet_gq_EM = np.zeros(iterations)
dirichlet_gq_MALA = np.zeros(iterations)
exponential_gq_EM = np.zeros(iterations)
exponential_gq_MALA = np.zeros(iterations)
students_gq_EM = np.zeros(iterations)
students_gq_MALA = np.zeros(iterations)

# Iterate over d and gather the datasets.
# Note: gaussian, dirichlet, exponential, students, EM_until_convergence and
# MALA_until_convergence must already be defined at this point.
for i, d in enumerate(tqdm(ds)):

    # Number of mixture components, M = log2(d) rounded down.
    # Clamped to at least 1, since log2(1) = 0 would give an empty mixture.
    components = max(1, int(math.log2(d)))

    # Load the datasets for this d (named so that the loop variable d is not overwritten)
    g_data = gaussian[d]
    d_data = dirichlet[d]
    e_data = exponential[d]
    s_data = students[d]

    # Run EM
    gauss_gq_EM[i] = EM_until_convergence(g_data, k=components)
    dirichlet_gq_EM[i] = EM_until_convergence(d_data, k=components)
    exponential_gq_EM[i] = EM_until_convergence(e_data, k=components)
    students_gq_EM[i] = EM_until_convergence(s_data, k=components)

    # Run MALA
    gauss_gq_MALA[i] = MALA_until_convergence(g_data, k=components)
    dirichlet_gq_MALA[i] = MALA_until_convergence(d_data, k=components)
    exponential_gq_MALA[i] = MALA_until_convergence(e_data, k=components)
    students_gq_MALA[i] = MALA_until_convergence(s_data, k=components)

## Analysis
Now that we have all the necessary data, we can examine whether our results are consistent with the paper we are referencing.

We begin by generating figures for the individual mixture problems:

In [None]:
# Gaussian
gdf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((gauss_gq_EM, gauss_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=gdf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Gaussian Mixture Model')
plt.show()

In [None]:
# Dirichlet
ddf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((dirichlet_gq_EM, dirichlet_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=ddf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Dirichlet Mixture Model')
plt.show()

In [None]:
# Exponential
edf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((exponential_gq_EM, exponential_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=edf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Exponential Mixture Model')
plt.show()

In [None]:
# Student's T
tdf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((students_gq_EM, students_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=tdf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Student\'s T Mixture Model')
plt.show()

TODO: write up the analysis of the individual figures.

Let us now also generate a large figure containing the data for all datasets:

In [None]:
adf = pd.DataFrame({'Dimensions': ds*8, 
                    'Gradient queries': np.concatenate((gauss_gq_EM, gauss_gq_MALA,
                                                        dirichlet_gq_EM, dirichlet_gq_MALA,
                                                        exponential_gq_EM, exponential_gq_MALA, 
                                                        students_gq_EM, students_gq_MALA)), 
                    'Combination': ['Gaussian (EM)']*iterations + ['Gaussian (MALA)']*iterations +
                                   ['Dirichlet (EM)']*iterations + ['Dirichlet (MALA)']*iterations +
                                   ['Exponential (EM)']*iterations + ['Exponential (MALA)']*iterations +
                                   ['Student\'s T (EM)']*iterations + ['Student\'s T (MALA)']*iterations})
sns.lineplot(data=adf, x='Dimensions', y='Gradient queries', hue='Combination').set_title('All Mixture Models')
plt.show()
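
Beyond inspecting the plots by eye, the exponential-versus-linear claim from the introduction could be checked numerically. The following is a minimal sketch, assuming all gradient-query counts are strictly positive; the function name `growth_fit_errors` is hypothetical, and the usage example runs on synthetic numbers rather than on results from this notebook.

```python
import numpy as np


def growth_fit_errors(ds, queries):
    """Return (linear_rss, exponential_rss) for gradient-query counts.

    linear model:       queries ~ a * d + b
    exponential model:  log(queries) ~ a * d + b
    Both residual sums of squares are measured on the original scale.
    """
    ds = np.asarray(ds, dtype=float)
    queries = np.asarray(queries, dtype=float)  # must be strictly positive

    lin_coef = np.polyfit(ds, queries, 1)
    lin_rss = np.sum((np.polyval(lin_coef, ds) - queries) ** 2)

    exp_coef = np.polyfit(ds, np.log(queries), 1)
    exp_rss = np.sum((np.exp(np.polyval(exp_coef, ds)) - queries) ** 2)

    return lin_rss, exp_rss


# Synthetic illustration only: one exponential and one linear series.
demo_ds = np.arange(1, 10)
exp_series_errors = growth_fit_errors(demo_ds, 3.0 * 2.0 ** demo_ds)
lin_series_errors = growth_fit_errors(demo_ds, 5.0 * demo_ds + 2.0)
```

A smaller exponential error for one algorithm's series and a smaller linear error for the other's would support the expected behaviour; with only nine values of $d$ this is a rough check rather than a proof.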