# Using Gaussian Mixture Models to Explore Differences in Clustering

### I decided to approach this problem with unsupervised learning.
K-means tends to find roughly spherical clusters, which is one of its common pitfalls. Given the number of dimensions in this data, it seemed intuitive that spherical clusters would be the least likely shape, so I tried Gaussian mixture models, which can fit other cluster shapes.
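
To illustrate the cluster-shape issue, here is a minimal sketch on synthetic data (not the proteome data). The blob generator settings, the transformation matrix, and all variable names are assumptions chosen for the illustration. It stretches three blobs with a linear map and compares how well K-means and a full-covariance GMM recover the known labels, using the adjusted Rand index (1.0 means perfect agreement).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic data: three blobs stretched by a linear map so they are not spherical
X_blobs, y_true = make_blobs(n_samples=1500, centers=3, random_state=170)
transformation = np.array([[0.60834549, -0.63667341],
                           [-0.40887718, 0.85253229]])
X_aniso = np.dot(X_blobs, transformation)

# K-means assigns each point to the nearest centroid (Euclidean distance)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)

# A 'full' GMM learns one unconstrained covariance matrix per component
gmm = GaussianMixture(n_components=3, covariance_type='full',
                      n_init=5, random_state=0).fit(X_aniso)
gmm_labels = gmm.predict(X_aniso)

# Agreement with the known labels of the synthetic data
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_gmm = adjusted_rand_score(y_true, gmm_labels)
print("K-means ARI:", ari_kmeans)
print("GMM ARI:    ", ari_gmm)
```

On real data there are no known labels to compare against, which is why the rest of this notebook relies on BIC instead.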
These are the steps in my approach:
* Researched Gaussian mixture models
* Researched comparable metrics
* Combined several approaches to the data cleaning, modeling and metric scoring
Note that only the combination of these analytical methods is my own.
I must give credit to:
   * Kajot for the data and the first part of the script for data processing and cleaning
   * Kam Sen and Prabhath Nanisetty for their Q&A on stats.stackexchange.com http://stats.stackexchange.com/questions/90769/using-bic-to-estimate-the-number-of-k-in-kmeans
   * scikit-learn's Gaussian mixture model example
  

## Use Bayesian Information Criterion to Compare K-means and Gaussian Mixture Model

Gaussian mixture models and K-means are usually evaluated with different metrics when choosing the best number of clusters. I used BIC for both forms of clustering so that I could compare the two approaches directly.
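
For reference, BIC trades goodness of fit against model complexity: BIC = p * ln(n) - 2 * ln(L), where p is the number of free parameters, n the number of samples and L the maximized likelihood. With this sign convention, which scikit-learn's `GaussianMixture.bic` follows, lower is better. The sketch below recomputes the value by hand on synthetic data; `X_demo` and the other names are made up for the illustration, and the parameter count shown applies to the `'full'` covariance type only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic stand-in data (not the proteome data): two Gaussian blobs in 3-D
X_demo = np.vstack([rng.randn(100, 3), rng.randn(100, 3) + 5.0])
n_samples, d = X_demo.shape

k = 2
gmm = GaussianMixture(n_components=k, covariance_type='full',
                      random_state=0).fit(X_demo)

# Free parameters of a 'full' GMM:
# k symmetric covariance matrices, k mean vectors, and k - 1 mixing weights
n_params = k * d * (d + 1) / 2 + k * d + (k - 1)

# score(X) is the mean per-sample log-likelihood, so scale it to the total
log_likelihood = gmm.score(X_demo) * n_samples

bic_by_hand = n_params * np.log(n_samples) - 2 * log_likelihood
print(bic_by_hand, gmm.bic(X_demo))
```

The two printed numbers should agree, which confirms how the penalty term grows with both the number of components and the covariance type.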

## Results

* The optimal number of clusters using the BIC score and GMM was 5, with the "full" covariance type.

* The optimal number of clusters using BIC score and K-means was 4.

* The GMM provided a lower BIC score than K-means.

## Next Steps

* Calculate the silhouette score and inertia for GMM and compare them to K-means
* Try different imputation methods to see whether that makes a difference
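
The first next step could look roughly like the sketch below. It runs on synthetic data, and `X_demo`, `within_cluster_ss` and `sil_scores` are hypothetical names, not part of the analysis in this notebook. A GMM has no `inertia_` attribute, so the helper computes the within-cluster sum of squares from the hard label assignments, which is the quantity K-means reports as inertia.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic stand-in for the imputed matrix (assumed shape: 80 samples x 10 features)
X_demo = np.vstack([rng.randn(40, 10), rng.randn(40, 10) + 4.0])


def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))


sil_scores = {}
wss_scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    gmm = GaussianMixture(n_components=k, covariance_type='diag',
                          random_state=0).fit(X_demo)
    gmm_labels = gmm.predict(X_demo)
    # silhouette_score needs at least two distinct labels
    gmm_sil = (silhouette_score(X_demo, gmm_labels)
               if len(np.unique(gmm_labels)) > 1 else float('nan'))
    sil_scores[k] = (silhouette_score(X_demo, km.labels_), gmm_sil)
    wss_scores[k] = (km.inertia_, within_cluster_ss(X_demo, gmm_labels))

for k in sorted(sil_scores):
    print(k, sil_scores[k], wss_scores[k])
```

Both metrics are distance-based and therefore favor compact, round clusters, so they may rank a GMM with elongated components lower even when it fits the data well.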

## Thoughts

I can't access the paper behind the paywall so I cannot use their methods or compare their methods to mine on how to find the optimal number of clusters. My results seem to point to the possibility of a larger number of clusters.

In [None]:
import pandas as pd
from sklearn import metrics
from sklearn import mixture
import re
from sklearn.impute import SimpleImputer
from numpy import random
import seaborn as sb
import numpy as np
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import cluster
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler

%matplotlib inline

### Set path to the data set
dataset_path = "77_cancer_proteomes_CPTAC_itraq.csv"
clinical_info = "clinical_data_breast_cancer.csv"
pam50_proteins = "PAM50_proteins.csv"

## Load data
data = pd.read_csv(dataset_path,header=0,index_col=0)
clinical = pd.read_csv(clinical_info,header=0,index_col=0)## holds clinical information about each patient/sample
pam50 = pd.read_csv(pam50_proteins,header=0)

## Drop unused information columns
data.drop(['gene_symbol','gene_name'],axis=1,inplace=True)

## Change the protein data sample names to a format matching the clinical data set
data.rename(columns=lambda x: "TCGA-%s" % (re.split('[_|-|.]',x)[0]) if re.search("TCGA",x) else x,inplace=True)

## Transpose data for the clustering algorithm since we want to divide patient samples, not proteins
data = data.transpose()
data3 = data.copy()

## Drop clinical entries for samples not in our protein data set
clinical = clinical.loc[[x for x in clinical.index.tolist() if x in data.index],:]

## Add clinical meta data to our protein data set, note: all numerical features for analysis start with NP_ or XP_
merged = data.merge(clinical,left_index=True,right_index=True)

## Change name to make it look nicer in the code!
processed = merged

## Numerical data for the algorithm, NP_xx/XP_xx are protein identifiers from RefSeq database
processed_numerical = processed.loc[:,[x for x in processed.columns if re.search("NP_|XP_",x)]]

## Select only the PAM50 proteins - known panel of genes used for breast cancer subtype prediction
## (.ix was removed from pandas; .loc accepts the same boolean column mask)
processed_numerical_p50 = processed_numerical.loc[:,processed_numerical.columns.isin(pam50['RefSeqProteinID'])]

## Impute missing values with each sample's median (maybe another method would work better?)
## Imputer was removed from scikit-learn; SimpleImputer fills column-wise,
## so transpose before and after to keep the original row-wise (axis=1) behaviour
imputer = SimpleImputer(strategy='median')
processed_numerical_p50 = imputer.fit_transform(processed_numerical_p50.values.T).T

In [None]:
X = processed_numerical_p50

In [None]:
lowest_bic = np.inf
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

In [None]:


def compute_bic(kmeans, X):
    """
    Computes the BIC metric for a fitted K-means model

    Parameters:
    -----------------------------------------
    kmeans:  fitted KMeans clustering object from scikit-learn

    X     :  multidimensional np array of data points

    Returns:
    -----------------------------------------
    BIC value (log-likelihood minus the complexity penalty, so higher is better)
    """
    # assign centers and labels
    centers = kmeans.cluster_centers_
    labels = kmeans.labels_
    # number of clusters
    m = kmeans.n_clusters
    # size of the clusters
    n = np.bincount(labels)
    # size of data set
    N, d = X.shape

    # compute the pooled variance over all clusters beforehand
    cl_var = (1.0 / (N - m) / d) * sum(
        np.sum(distance.cdist(X[labels == i], [centers[i]], 'euclidean')**2)
        for i in range(m))

    # penalty for model complexity
    const_term = 0.5 * m * np.log(N) * (d + 1)

    BIC = np.sum([n[i] * np.log(n[i]) -
                  n[i] * np.log(N) -
                  ((n[i] * d) / 2) * np.log(2 * np.pi * cl_var) -
                  ((n[i] - 1) * d / 2) for i in range(m)]) - const_term

    return BIC


ks = range(1,10)

# fit K-means once for each k in ks and keep the fitted models
kmeans_models = [cluster.KMeans(n_clusters=k, init="k-means++").fit(X) for k in ks]
# now compute the BIC for each fitted model
BIC = [compute_bic(km, X) for km in kmeans_models]
# additional list to match score with number of clusters
BIC2 = list(zip(BIC, ks))

In [None]:
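# compute_bic returns a score where higher is better; flip the sign so that,
# as in the GMM plot, the lowest bar marks the best model
bic2 = [-b for b in BIC]
n_components_range = range(1,10)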
spl = plt.subplot(2, 1, 1)
plt.xticks(n_components_range)
plt.ylim([min(bic2) * 1.01 - .01 * max(bic2), max(bic2)])
plt.title('BIC score per model K-Means')
plt.bar(n_components_range,bic2, width=.4, color = 'darkorange')
xpos = bic2.index(min(bic2)) + 1
plt.text(xpos, min(bic2) * 0.97 + .03 * max(bic2), '*', fontsize=14)
plt.xlabel('Number of components')