# Mixture model experiment
In this notebook we run an experiment to check whether the number of gradient queries required by stochastic gradient descent grows exponentially with the parameter $d$, while the number required by the Metropolis-adjusted Langevin algorithm (MALA) grows only linearly.

In [None]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from matplotlib.patches import Circle
import matplotlib.pyplot as plt

## Generating the data
We begin by generating the datasets used in our evaluation. Because we try several distributions, each one must be parameterized so that the following properties hold:
1. The clusters need to be adequately separated.
2. The number of clusters in our dataset needs to be small ($M=\log_2 d$ in the paper).

As a follow-up, we could deliberately violate these properties to see how the algorithms fail.
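
The separation property can be checked numerically before running an experiment. The sketch below is only an illustration: the helper names `min_center_separation` and `is_well_separated` are hypothetical, and the threshold `factor` is an assumed tuning knob, not a value taken from the paper.

```python
import numpy as np


def min_center_separation(centers: np.ndarray) -> float:
    """Smallest pairwise Euclidean distance between cluster centers."""
    # Pairwise differences have shape (M, M, d); norms have shape (M, M).
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Ignore the zero distance of every center to itself.
    np.fill_diagonal(dists, np.inf)
    return float(dists.min())


def is_well_separated(centers: np.ndarray, sigma: float, factor: float = 4.0) -> bool:
    """Assumed criterion: every pair of centers is at least factor * sigma apart."""
    return min_center_separation(centers) >= factor * sigma


centers = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 10.0]])
print(min_center_separation(centers))
print(is_well_separated(centers, sigma=1.0))
```

With a single center the minimum distance is infinite, so the check passes trivially. This matters for small $d$, where $\log_2 d$ rounds down to one component.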

In our experiment we use the following distributions (we always let $N=2^d$):
1. Gaussian ($\sigma = 1/\sqrt{d}$, $M=\log_2 d$)
2. Dirichlet
3. Exponential
4. Student's T (included because it is not log-concave for all parameters)

To study how the cost scales, we generate a fresh problem for each value of $d$. With four distributions, this means four datasets for every value of $d$ we test.
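
To get a feeling for how quickly these problems grow, the following sketch tabulates the quantities stated above for the Gaussian case. The helper name `problem_size` is hypothetical, and rounding $\log_2 d$ down to an integer is an assumption about how non-powers of two are handled.

```python
import math


def problem_size(d: int):
    """Return (M, N, sigma) for dimension d under the Gaussian parameterization."""
    num_components = int(math.log2(d))  # M = log2(d), rounded down (assumption)
    num_points = 2 ** d                 # N = 2^d
    sigma = 1 / math.sqrt(d)            # sigma = 1 / sqrt(d)
    return num_components, num_points, sigma


for d in range(2, 10):
    M, N, sigma = problem_size(d)
    print(f"d={d}: M={M}, N={N}, sigma={sigma:.3f}")
```

Because $N$ doubles with every increment of $d$, the range of dimensions has to stay small for the datasets to fit in memory.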

## Experiment
Now we begin the experiment. The setup is simple:
1. Iterate over the parameter $d$.
2. Gather our distribution datasets for that value of $d$.
3. Estimate the parameters of the distribution using expectation-maximization (EM).
4. Estimate the parameters of the distribution using MALA.
5. Save the number of gradient queries required by both approaches.
6. Create a line plot of the gradient queries required for convergence.
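
To make step 5 concrete, here is a minimal sketch of a generic MALA loop that counts its gradient queries. It is not the implementation used in this notebook: the function name `mala_sketch`, its signature, and the standard Gaussian target are all assumptions chosen for illustration. Each iteration proposes a Langevin step and accepts it with the Metropolis-Hastings probability, at the cost of one new gradient evaluation per proposal.

```python
import numpy as np


def mala_sketch(log_density, grad_log_density, x0, step_size, num_steps, rng):
    """Run MALA and return (samples, number of gradient queries)."""
    x = np.asarray(x0, dtype=float)
    grad_x = grad_log_density(x)
    num_gradient_queries = 1
    samples = np.empty((num_steps, x.size))
    for t in range(num_steps):
        # Langevin proposal: gradient drift plus Gaussian noise.
        noise = rng.standard_normal(x.shape)
        proposal = x + step_size * grad_x + np.sqrt(2 * step_size) * noise
        grad_prop = grad_log_density(proposal)
        num_gradient_queries += 1
        # Log densities of the forward and backward proposal kernels.
        log_q_fwd = -np.sum((proposal - x - step_size * grad_x) ** 2) / (4 * step_size)
        log_q_bwd = -np.sum((x - proposal - step_size * grad_prop) ** 2) / (4 * step_size)
        log_alpha = log_density(proposal) - log_density(x) + log_q_bwd - log_q_fwd
        # Metropolis-Hastings accept/reject step.
        if np.log(rng.uniform()) < log_alpha:
            x, grad_x = proposal, grad_prop
        samples[t] = x
    return samples, num_gradient_queries


def std_normal_log_density(x):
    # Standard Gaussian target, up to an additive constant.
    return -0.5 * np.sum(x ** 2)


def std_normal_grad(x):
    return -x


rng = np.random.default_rng(0)
samples, queries = mala_sketch(std_normal_log_density, std_normal_grad,
                               np.zeros(3), step_size=0.1, num_steps=200, rng=rng)
print(samples.shape, queries)
```

The gradient at the current point is cached, so the sketch uses one query per iteration plus one for initialization. A different counting convention would change the recorded numbers by a constant factor only.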

In [None]:
# https://stackoverflow.com/a/54544972/8238129
# Generate "num_points" random points in "dimension" that have uniform
# probability over the unit ball scaled by "radius" (length of points
# are in range [0, "radius"]).
def random_ball(num_points: int, dimension: int, radius: float = 1.0):
    # First generate random directions by normalizing the length of a
    # vector of random-normal values (these distribute evenly on ball).
    random_directions = np.random.normal(size=(dimension, num_points))
    random_directions /= np.linalg.norm(random_directions, axis=0)
    # Second generate a random radius with probability proportional to
    # the surface area of a ball with a given radius.
    random_radii = np.random.random(num_points) ** (1 / dimension)
    # Return the list of random (direction & length) points.
    return radius * (random_directions * random_radii).T

In [None]:
def plot_points_in_ball(num_points: int = 1000, radius: int = 1, dim: int = 2):
    if dim < 2 or dim > 3:
        raise ValueError('Invalid dimension')

    points = random_ball(num_points, dim, radius=radius)

    subplot_kw = {}

    if dim == 3:
        subplot_kw = {'projection': '3d'}
        sns.set(style='darkgrid')

    fig, ax = plt.subplots(subplot_kw=subplot_kw)

    if dim == 2:
        ax.set_aspect('equal')
        patch = Circle((0, 0), radius, fill=False, ls='-', lw=0.25)
        ax.add_patch(patch)

    ax.scatter(*np.split(points, dim, axis=1), marker='.')
    plt.show()

In [None]:
%matplotlib notebook
plot_points_in_ball(dim=2)

In [None]:
class Mixture():
    def __init__(self, d: int):
        # dimension
        self.d = d
        # number of mixture components, M = floor(log2(d))
        self.M = int(math.log2(d))
        
    def pdf(self, x: np.ndarray):
        raise NotImplementedError

        
class GaussianMixture(Mixture):
    """Class for the Gaussian Mixture Model experiment"""
    def __init__(self, d: int, init_from_data: bool = True):
        super().__init__(d)
        # radius containing the data
        self.R = 2 * self.M
    
        # variance
        self.var = 1 / d
        
        # covariance matrix (isotropic and uniform)
        # self.cov = np.diag(np.repeat(self.var, d))
        
        # normalization constant for each Gaussian is the same
        # self.Z = math.sqrt(((2 * math.pi) ** d) * np.linalg.det(self.cov))
        self.Z = math.sqrt(((2 * math.pi * self.var) ** d))
        # lambda_i's
        self.lambda_ = self.Z * self.var / 1000

        self.points = self.sample()
        self.mu = self.init_params(init_from_data)


    def sample(self) -> np.ndarray:
        """Create synthetic dataset with sparse entries for GMM experiment"""
        rng = np.random.default_rng()
        d = self.d
        # number of data points
        N = 2 ** d

        # number of nonzero entries of each point
        num_nonzero = self.M

        # create ndarray of permuted indices of each data point
        idx = np.array([
            rng.permutation(i) for i in np.tile(np.arange(d), (N, 1))
        ])

        # M nonzero entries, selected uniformly at random
        idx_nonzero = idx[:, :num_nonzero]

        # initialize points array with zeros
        points = np.zeros(idx.shape)

        # all nonzero entries follow a uniform distribution on [-1, 1]
        for i, indices in enumerate(idx_nonzero):
            points[i, indices] = rng.uniform(low=-1, high=1, size=num_nonzero)
        return points


    def init_params(self, from_data: bool) -> np.ndarray:
        if from_data:
            # initialize cluster centers from distinct data points
            rng = np.random.default_rng()
            return rng.choice(self.points, size=self.M, replace=False)
        
        # initialize in ball of radius R
        return random_ball(num_points=self.M, dimension=self.d, radius=self.R)

        
    def pdf_constant_mixture(self):
        # Computes the constant mixture
        # Assume data are distributed in a bounded region
        # and take this constant mixture to describe that observation
        # Z_0 == Z ?
        return (np.linalg.norm(self.points, axis=1) <= self.R).astype(int) / self.Z
        
    def pdf(self) -> np.ndarray:
        # Mixture density evaluated at every data point, shape (N,)
        norm_diff = np.linalg.norm(self.points[:, None, :] - self.mu[None, :, :], axis=2)
        exponent = -0.5 * norm_diff ** 2 / self.var
        const_mix = (1 - self.M * self.lambda_) * self.pdf_constant_mixture()
        # lambda_ / Z == var / 1000; sum over the M components only
        return (self.var / 1000) * np.exp(exponent).sum(axis=1) + const_mix
        
    def prior(self):
        # Computes the prior term
        sqrt_M_times_R = math.sqrt(self.M) * self.R
        # Frobenius norm of means matrix
        fro = np.linalg.norm(self.mu)
        return np.exp(-self.M * (fro - sqrt_M_times_R) ** 2 * int(fro >= sqrt_M_times_R))
         
        
    def objective(self):
        return -np.log(self.prior()) - np.sum(np.log(self.pdf()))

        

In [None]:
m = GaussianMixture(4)

In [None]:
m.objective()

In [None]:
def em(model):
    """Expectation Maximization Algorithm"""
    raise NotImplementedError

def mala(model):
    """Metropolis-adjusted Langevin algorithm"""
    raise NotImplementedError
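
One possible starting point for the `em` stub is sketched below. It is a simplification and not the method of the paper: it assumes equal mixture weights and a known, shared isotropic variance, updates only the means, and ignores both the constant mixture component and the prior used in `GaussianMixture`. The name `em_means_sketch` and its parameters are hypothetical.

```python
import numpy as np


def em_means_sketch(points, init_means, var, num_iters=50):
    """EM updates for the means of an isotropic Gaussian mixture."""
    means = np.array(init_means, dtype=float)
    for _ in range(num_iters):
        # E-step: responsibilities of each component for each point, shape (N, M).
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_resp = -0.5 * sq_dist / var
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean becomes the responsibility-weighted average.
        weights = resp.sum(axis=0)
        means = (resp.T @ points) / np.maximum(weights, 1e-12)[:, None]
    return means


rng = np.random.default_rng(0)
true_means = np.array([[-5.0, 0.0], [5.0, 0.0]])
points = np.vstack([rng.normal(mu, 1.0, size=(200, 2)) for mu in true_means])
estimate = em_means_sketch(points, np.array([[-1.0, 0.0], [1.0, 0.0]]), var=1.0)
print(estimate)
```

EM does not evaluate gradients explicitly, so comparing it with MALA requires a convention for what counts as one query, for example one query per EM iteration. That convention is an open design choice here.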

In [None]:
dims = range(2, 10)

models = [
    GaussianMixture
]

results_df = pd.DataFrame(
    np.nan,
    index=dims,
    columns=pd.MultiIndex.from_product((models, ['em', 'mala']))
)

In [None]:
# Iterate over the dimensions and record the gradient queries per algorithm
for d in tqdm(dims):
    for constructor in models:
        # The model generates its own dataset and sets M = log2(d) internally
        model = constructor(d)

        # Run EM
        results_df.loc[d, (constructor, 'em')] = em(model)

        # Run MALA
        results_df.loc[d, (constructor, 'mala')] = mala(model)

## Analysis
Now that we have all the necessary data, we can check whether our results agree with the paper we are referencing.

We begin by generating figures for the individual mixture problems:

In [None]:
for constructor in models:
    # Show the EM and MALA columns recorded for this mixture model
    print(results_df[constructor])

In [None]:
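# Gaussian
# Assumption: the query counts come from `results_df` filled in above
ds = list(dims)
iterations = len(ds)
gaussian_gq_EM = results_df[(GaussianMixture, 'em')].to_numpy()
gaussian_gq_MALA = results_df[(GaussianMixture, 'mala')].to_numpy()
gdf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((gaussian_gq_EM, gaussian_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=gdf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Gaussian Mixture Model')
plt.show()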

In [None]:
# Dirichlet
ddf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((dirichlet_gq_EM, dirichlet_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=ddf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Dirichlet Mixture Model')
plt.show()

In [None]:
# Exponential
edf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((exponential_gq_EM, exponential_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=edf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Exponential Mixture Model')
plt.show()

In [None]:
# Student's T
tdf = pd.DataFrame({'Dimensions': ds + ds, 
                    'Gradient queries': np.concatenate((students_gq_EM, students_gq_MALA)), 
                    'Algorithm': ['EM']*iterations + ['MALA']*iterations})
sns.lineplot(data=tdf, x='Dimensions', y='Gradient queries', hue='Algorithm').set_title('Student\'s T Mixture Model')
plt.show()

TODO: discuss the individual figures once the experiment results are available.

Let us now also generate a large figure containing the data for all datasets:

In [None]:
adf = pd.DataFrame({'Dimensions': ds*8, 
                    'Gradient queries': np.concatenate((gaussian_gq_EM, gaussian_gq_MALA,
                                                        dirichlet_gq_EM, dirichlet_gq_MALA,
                                                        exponential_gq_EM, exponential_gq_MALA, 
                                                        students_gq_EM, students_gq_MALA)), 
                    'Combination': ['Gaussian (EM)']*iterations + ['Gaussian (MALA)']*iterations +
                                   ['Dirichlet (EM)']*iterations + ['Dirichlet (MALA)']*iterations +
                                   ['Exponential (EM)']*iterations + ['Exponential (MALA)']*iterations +
                                   ['Student\'s T (EM)']*iterations + ['Student\'s T (MALA)']*iterations})
sns.lineplot(data=adf, x='Dimensions', y='Gradient queries', hue='Combination').set_title('All Mixture Models')
plt.show()