***

### K-Means Properties

### Victor Agaba, Cheryl Chen, Garrett Lee, Evan Li

***

In [174]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

In [175]:
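# generate data given mu, sigma2 and n
def generate_data(mu, sigma2=1, n=3000, seed=308):
    """Draw n points from N(mu, sigma2) (label 1) and n from N(-mu, sigma2) (label 0)."""
    np.random.seed(seed)
    sigma = np.sqrt(sigma2)  # np.random.normal expects a standard deviation, not a variance
    data1 = np.column_stack((np.random.normal(mu, sigma, n),
                             np.ones(n)))
    data2 = np.column_stack((np.random.normal(-mu, sigma, n),
                             np.zeros(n)))
    data = np.vstack((data1, data2))  # np.row_stack is a deprecated alias of np.vstack
    np.random.shuffle(data)
    return data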

In [178]:
def compare_models(mus):
    """Fit kmeans and a 2-component GMM for each mu; tabulate estimated means and variances."""
    rows = []

    for mu in mus:
        data = generate_data(mu)
        x = data[:, 0].reshape(-1, 1)  # feature column; data[:, 1] holds the true labels

        # fit kmeans
        kmeans = KMeans(n_clusters=2, random_state=308, n_init=30)
        fit_k = kmeans.fit(x)
        means_k = fit_k.cluster_centers_.squeeze()
        vars_k = np.array([np.var(x[fit_k.labels_ == i])
                           for i in range(2)])
        # order the clusters so the one with the smaller mean comes first
        if means_k[0] > means_k[1]:
            means_k = means_k[::-1]
            vars_k = vars_k[::-1]

        # fit a gmm
        gmm = GaussianMixture(n_components=2, random_state=308, n_init=30)
        fit_g = gmm.fit(x)
        means_g = fit_g.means_.squeeze()
        covs_g = fit_g.covariances_.squeeze()
        # same ordering convention as for kmeans
        if means_g[0] > means_g[1]:
            means_g = means_g[::-1]
            covs_g = covs_g[::-1]

        # add results to table
        rows.append(np.concatenate(([mu], means_k, means_g, vars_k, covs_g)))

    # compile results into a dataframe
    table = np.round(np.array(rows), 3)
    df = pd.DataFrame(table, columns=['mu', 'kmeans_mu1', 'kmeans_mu2', 'gmm_mu1', 'gmm_mu2',
                                      'kmeans_var1', 'kmeans_var2', 'gmm_var1', 'gmm_var2'])

    return df
    

In [179]:
mus = [0.5, 1, 2]
df = compare_models(mus)
df

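|   | mu | kmeans_mu1 | kmeans_mu2 | gmm_mu1 | gmm_mu2 | kmeans_var1 | kmeans_var2 | gmm_var1 | gmm_var2 |
|---|-----|--------|-------|--------|-------|-------|-------|-------|-------|
| 0 | 0.5 | -0.936 | 0.886 | -0.734 | 0.682 | 0.463 | 0.46 | 0.8 | 0.78 |
| 1 | 1.0 | -1.189 | 1.17 | -1.073 | 1.014 | 0.663 | 0.658 | 0.954 | 0.972 |
| 2 | 2.0 | -2.013 | 2.05 | -1.987 | 2.042 | 0.969 | 0.923 | 1.052 | 0.976 |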


### Discussion:

We see from the estimated means and variances that:
1. When the true means are close together, kmeans is biased: it pushes its estimated
   means further apart than the true values (about -0.94 and 0.89 for mu = 0.5). It assigns
   each observation to the closest mean and ignores the overlap between the
   class-conditional distributions. The resulting clusters have skewed distributions
   because the overlapping tails are truncated at every assignment step.

2. The variances are underestimated by kmeans when the true means are close together
   (about 0.46 against a true value of 1 for mu = 0.5). This follows from the same
   truncation: kmeans minimizes the within-cluster variance, which is equivalent to
   maximizing the between-cluster variance. The resulting clusters are
   consequently tight and biased towards being disjoint.

3. The Gaussian mixture model gives more accurate means and variances
   because it maximizes the likelihood of the observed sample,
   even if that implies overlapping Gaussians. This allows the estimated means to be closer
   together and the estimated variances to be larger.

4. When the true means are further apart, kmeans performs about as well as the GMM because
   the underlying distributions overlap very little, so truncating the tails
   introduces much less bias in the kmeans estimates.
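
The truncation argument in points 1, 2 and 4 can be checked against a closed-form calculation. The sketch below is not part of the original analysis. It assumes unit variance, equal mixture weights, a kmeans decision boundary exactly at 0, and the large-sample limit. Under those assumptions the mean of the positive cluster is mu(2Φ(mu) − 1) + 2φ(mu), and by symmetry E[x² | x > 0] = 1 + mu², which gives the cluster variance. The helper names are hypothetical.

```python
import math

def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kmeans_limit(mu):
    """Mean and variance of the cluster {x > 0} for the mixture
    0.5 * N(mu, 1) + 0.5 * N(-mu, 1).

    Assumption: the kmeans boundary sits exactly at 0 (large-sample limit).
    """
    mean = mu * (2.0 * std_normal_cdf(mu) - 1.0) + 2.0 * std_normal_pdf(mu)
    var = 1.0 + mu ** 2 - mean ** 2  # E[x^2 | x > 0] = 1 + mu^2 by symmetry
    return mean, var

for mu in (0.5, 1, 2):
    mean, var = kmeans_limit(mu)
    print(f"mu={mu}: limiting mean={mean:.3f}, limiting variance={var:.3f}")
```

Under these assumptions the limiting means come out near 0.896, 1.167 and 2.017, and the limiting variances near 0.448, 0.639 and 0.932. These are in line with the kmeans columns of the table, and the gap to the true values (mu and 1) shrinks as mu grows.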