# 26-6-Gaussian Mixture Models (GMM) approach to clustering

So far in this module, we have reviewed algorithms that assign each observation to exactly one cluster. Algorithms of this type are called hard clustering algorithms. Another type assigns each observation to several clusters, each with an associated probability. This is called soft clustering. In this checkpoint, we present a soft clustering algorithm called Gaussian Mixture Models (GMM), which belongs to the general class of probabilistic clustering algorithms.

The main advantages of GMM are as follows:

* It's a soft clustering algorithm, so we can assess the confidence of each cluster assignment by inspecting the membership probabilities.
* It makes weaker assumptions about cluster geometry than k-means. K-means favors roughly spherical clusters of similar size, whereas GMM can fit elliptical clusters with different sizes and orientations.
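To make the first advantage concrete, here is a minimal sketch of reading soft assignments from scikit-learn's `GaussianMixture`. The synthetic two-blob data and the names `X_demo`, `gmm_demo`, `probs`, and `confidence` are assumptions made for illustration; they are not part of the heart disease assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two overlapping 2-D blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(100, 2)),
])

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

# Soft assignment: one row per observation, one column per component.
# Each row sums to 1.
probs = gmm_demo.predict_proba(X_demo)

# Hard assignment: the most probable component for each observation.
labels = gmm_demo.predict(X_demo)

# A simple confidence measure: the largest membership probability per row.
confidence = probs.max(axis=1)

print(probs[:3].round(3))
print("Least confident assignment:", confidence.min().round(3))
```

Observations that lie between the two blobs should get probabilities closer to 0.5, which is the kind of information a hard clustering algorithm does not expose.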

The assumption that our data is generated by a mix of normal distributions may sound too strong. But recall the Central Limit Theorem: as the sample size grows, the distribution of the sample mean approaches a normal distribution, no matter the original distribution of the population (provided its variance is finite). This is one reason why normal distributions describe many real-world measurements reasonably well. Counting on this, GMM searches for the means and the variances (covariances, when there is more than one feature) of the Gaussian (normal) distributions.

#### Assumptions of GMM

There are two important assumptions that GMM makes:

* The first is that there are k distributions that generate the data. In effect, this is equivalent to saying that there are exactly k clusters in the data.
* The second is that all of these k distributions are Gaussian. GMM doesn't put constraints on the parameters of these Gaussians; instead, it estimates them such that the likelihood of the data being generated by these k Gaussians is maximized.
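The two assumptions above can be written compactly. The notation here is introduced for illustration and does not come from the checkpoint: $\pi_j$ is the mixing weight of component $j$, $\mu_j$ and $\Sigma_j$ are its mean and covariance, and $x_1, \dots, x_n$ are the observations. The data is assumed to come from the mixture density

$$
p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j), \qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1
$$

and the parameters are chosen to maximize the log-likelihood

$$
\log L = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)
$$

Under this model, the probability that observation $x_i$ belongs to component $j$ (the soft assignment) is

$$
\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}
$$

scikit-learn's `GaussianMixture` fits these parameters with the expectation-maximization (EM) algorithm, which only guarantees a local maximum of the likelihood, so results can depend on the initialization.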

GMM can be computationally expensive, and on very high-dimensional datasets it may take a long time to converge. When we have a very high-dimensional dataset, we may consider applying a dimensionality reduction technique first and then fitting GMM on the reduced data.
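As a rough sketch of that workflow, the example below reduces a synthetic 50-feature dataset with PCA before fitting GMM. The data, the choice of PCA as the reduction technique, the choice of 5 components, and the names `X_wide`, `X_reduced`, and `gmm_reduced` are all assumptions made for illustration; a suitable number of components depends on the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data (an assumption for illustration):
# 300 observations, 50 features, with the first half shifted in 5 features.
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(300, 50))
X_wide[:150, :5] += 4.0

# Standardize, then project onto the first 5 principal components.
X_wide_std = StandardScaler().fit_transform(X_wide)
pca = PCA(n_components=5, random_state=0)
X_reduced = pca.fit_transform(X_wide_std)

# Fit GMM on the reduced data instead of the original 50 features.
gmm_reduced = GaussianMixture(n_components=2, random_state=0)
labels_reduced = gmm_reduced.fit_predict(X_reduced)

print(X_wide_std.shape, "->", X_reduced.shape)
```

With the default `covariance_type="full"`, each component estimates a full covariance matrix, so the number of parameters grows quadratically with the number of features. Reducing the dimension first keeps that cost down.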

## Assignment

* Apply GMM to the heart disease data by setting n_components=2. 
* Get the ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints.
* Which algorithm performs better?

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn import metrics
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings("ignore")

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'

In [3]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
heartdisease_df = pd.read_sql_query('select * from heartdisease',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [4]:
# Define the features and the outcome
X = heartdisease_df.iloc[:, :13]
y = heartdisease_df.iloc[:, 13]

# Replace missing values (marked by ?) with a 0
X = X.replace(to_replace='?', value=0)

# Binarize y so that 1 means heart disease diagnosis and 0 means no diagnosis.
y = np.where(y > 0, 1, 0)

# Standardize the data.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [7]:
# Define the Gaussian mixture model
gmm_cluster = GaussianMixture(n_components=2, random_state=123)

# Fit model
clusters = gmm_cluster.fit_predict(X_std)

print("ARI score: {}".format(
    metrics.adjusted_rand_score(y, clusters)))

print("Silhouette score: {}".format(
    metrics.silhouette_score(X_std, clusters, metric='euclidean')))

ARI score: 0.18389186035089963
Silhouette score: 0.13628813153331445


### GMM scores lower than both k-means and hierarchical clustering in terms of ARI and silhouette scores.

### 2. The scikit-learn implementation of GMM has a parameter called covariance_type, which determines the type of covariance parameters to use. You can specify one of four types:
- full: This is the default. Each component has its own general covariance matrix.
- tied: All components share the same general covariance matrix.
- diag: Each component has its own diagonal covariance matrix.
- spherical: Each component has its own single variance.

Try all of these. Which one performs better in terms of ARI and silhouette scores?
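To see what the four options mean in practice, the sketch below fits each one on synthetic 3-feature data and prints the shape of the fitted `covariances_` attribute. The data and the names `X_demo` and `shapes` are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two blobs, 3 features.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=4.0, scale=1.0, size=(100, 3)),
])

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(
        n_components=2, covariance_type=cov_type, random_state=0).fit(X_demo)
    # The shape of covariances_ shows how many covariance parameters are fit.
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])

# full      -> (2, 3, 3): one 3x3 matrix per component
# tied      -> (3, 3):    a single 3x3 matrix shared by all components
# diag      -> (2, 3):    one variance per feature, per component
# spherical -> (2,):      one variance per component
```

The more restrictive types estimate fewer parameters, which can help when the dataset is small relative to the number of features.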

In [8]:
# Fit a GMM for each covariance type and report both scores
for covariance_type in ["full", "tied", "diag", "spherical"]:
    gmm_cluster = GaussianMixture(
        n_components=2, random_state=123, covariance_type=covariance_type)

    # Fit the model and get the cluster assignments
    clusters = gmm_cluster.fit_predict(X_std)

    print("ARI score with covariance_type={}: {}".format(
        covariance_type, metrics.adjusted_rand_score(y, clusters)))

    print("Silhouette score with covariance_type={}: {}".format(
        covariance_type,
        metrics.silhouette_score(X_std, clusters, metric='euclidean')))
    print("------------------------------------------------------")

ARI score with covariance_type=full: 0.18389186035089963
Silhouette score with covariance_type=full: 0.13628813153331445
------------------------------------------------------
ARI score with covariance_type=tied: 0.18389186035089963
Silhouette score with covariance_type=tied: 0.13628813153331445
------------------------------------------------------
ARI score with covariance_type=diag: 0.18389186035089963
Silhouette score with covariance_type=diag: 0.13628813153331445
------------------------------------------------------
ARI score with covariance_type=spherical: 0.20765243525722465
Silhouette score with covariance_type=spherical: 0.12468753110276873
------------------------------------------------------


### The spherical covariance type achieves a higher ARI score than the others but a lower silhouette score. The other three covariance types produce identical scores.