1. Apply GMM to the heart disease data by setting n_components=2. Get ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?

2. The scikit-learn GMM implementation has a parameter called covariance_type, which determines the type of covariance parameters to use. There are four types you can specify:

 - full: This is the default. Each component has its own general covariance matrix.
 - tied: All components share the same general covariance matrix.
 - diag: Each component has its own diagonal covariance matrix.
 - spherical: Each component has its own single variance.

Try all of these. Which one performs better in terms of ARI and silhouette scores?

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import datasets, metrics

In [10]:
from sqlalchemy import create_engine
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
heartdisease_df = pd.read_sql_query('select * from heartdisease',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [11]:
# Define the features and the outcome
X = heartdisease_df.iloc[:, :13]
y = heartdisease_df.iloc[:, 13]

# Replace missing values (marked by ?) with a 0
X = X.replace(to_replace='?', value=0)

# Binarize y so that 1 means heart disease diagnosis and 0 means no diagnosis
# (ARI is unaffected by which class is labeled 0 or 1)
y = np.where(y > 0, 1, 0)

# Standardizing the features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
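
Replacing `?` with 0 treats a missing value as a real measurement of zero. One possible alternative, shown here only as a minimal sketch on a made-up toy frame (the column names and values are hypothetical stand-ins, not the actual heart disease table), is to convert `?` to NaN and impute each column's median:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for features that contain '?'
toy = pd.DataFrame({
    "ca": ["0.0", "?", "2.0", "1.0"],
    "thal": ["3.0", "7.0", "?", "3.0"],
})

# Coerce every column to numeric; anything unparseable ('?') becomes NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Fill each NaN with the median of its own column
imputed = SimpleImputer(strategy="median").fit_transform(numeric)
print(imputed)
```

Whether median imputation is preferable to a constant 0 depends on the data; the scores reported in this notebook were obtained with the 0 replacement.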

1. Apply GMM to the heart disease data by setting n_components=2. Get ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the assignments of the previous checkpoints. Which algorithm performs better?

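K-means performs best on both metrics: its ARI (0.438) and silhouette score (0.175) are the highest of the three. Hierarchical clustering comes second (ARI 0.294, silhouette 0.148), and GMM is last (ARI 0.184, silhouette 0.136). All three silhouette scores are low, which suggests that none of the solutions produces well-separated clusters.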

In [12]:
# Defining the GMM, agglomerative, and k-means models
gmm_cluster = GaussianMixture(n_components=2, random_state=123)
# Note: in scikit-learn >= 1.2 the `affinity` parameter is named `metric`
agg_cluster = AgglomerativeClustering(linkage='average', 
                                      affinity='cosine',
                                      n_clusters=2)
k_cluster = KMeans(n_clusters=2, random_state=123)

# Fit model
gmm_clusters = gmm_cluster.fit_predict(X_std)
agg_clusters = agg_cluster.fit_predict(X_std)
k_clusters = k_cluster.fit_predict(X_std)

In [13]:
print('ARI score for GMM algorithm: ', metrics.adjusted_rand_score(y, gmm_clusters))
print('Silhouette score for GMM algorithm: ', metrics.silhouette_score(X_std, gmm_clusters, metric='euclidean'))
print('\n')
print('ARI score for k-means algorithm: ', metrics.adjusted_rand_score(y, k_clusters))
print('Silhouette score for k-means algorithm: ', metrics.silhouette_score(X_std, k_clusters, metric='euclidean'))
print('\n')
print('ARI score for hierarchical algorithm: ', metrics.adjusted_rand_score(y, agg_clusters))
print('Silhouette score for hierarchical algorithm: ', metrics.silhouette_score(X_std, agg_clusters, metric='euclidean'))
print('\n')

ARI score for GMM algorithm:  0.18389186035089963
Silhouette score for GMM algorithm:  0.13628813153331445


ARI score for k-means algorithm:  0.4380857727169879
Silhouette score for k-means algorithm:  0.17530682286260937


ARI score for hierarchical algorithm:  0.2940490133353465
Silhouette score for hierarchical algorithm:  0.14837359969689895
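
ARI compares partitions rather than label names, so it does not matter which cluster an algorithm happens to call 0 or 1, or how the binary outcome is coded. A minimal sketch with made-up toy labels (not the heart disease data) illustrates this:

```python
from sklearn import metrics

# Toy ground truth and the same labels with 0 and 1 swapped
truth = [0, 0, 0, 1, 1, 1]
flipped_truth = [1 - t for t in truth]

# A perfect clustering whose cluster names are the reverse of the truth
swapped_names = [1, 1, 1, 0, 0, 0]
ari_swapped = metrics.adjusted_rand_score(truth, swapped_names)

# An imperfect clustering scored against both codings of the truth
imperfect = [0, 0, 1, 1, 1, 1]
ari_a = metrics.adjusted_rand_score(truth, imperfect)
ari_b = metrics.adjusted_rand_score(flipped_truth, imperfect)

print(ari_swapped)   # perfect agreement despite swapped names
print(ari_a, ari_b)  # same score for both codings
```

This is why the ARI values above can be compared across algorithms even though each one numbers its clusters arbitrarily.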




2. The scikit-learn GMM implementation has a parameter called covariance_type, which determines the type of covariance parameters to use. There are four types you can specify:

 - full: This is the default. Each component has its own general covariance matrix.
 - tied: All components share the same general covariance matrix.
 - diag: Each component has its own diagonal covariance matrix.
 - spherical: Each component has its own single variance.

Try all of these. Which one performs better in terms of ARI and silhouette scores?

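With this data and random seed, the full, tied, and diag models produce identical ARI (0.184) and silhouette (0.136) scores. The model with spherical covariance has a higher ARI (0.208) but a lower silhouette score (0.125), so no covariance type is better on both metrics.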
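
The four covariance_type options differ in how many covariance parameters each component is allowed to have. One way to see this is to inspect the shape of the fitted `covariances_` attribute. The sketch below uses synthetic 3-dimensional data (`X_demo` is an assumption made for illustration, not the heart disease features):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: two Gaussian blobs in 3 dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0, 1, size=(100, 3)),
    rng.normal(4, 1, size=(100, 3)),
])

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X_demo)
    # Record how the covariance parameters are stored for this type
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
```

With 2 components and 3 features, full stores one 3x3 matrix per component, tied stores a single shared 3x3 matrix, diag stores 3 variances per component, and spherical stores one variance per component.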

In [15]:
# Defining a GMM for each covariance type
full_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='full')
tied_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='tied')
diag_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='diag')
spherical_cluster = GaussianMixture(n_components=2, random_state=123, covariance_type='spherical')

# Fit model
full_clusters = full_cluster.fit_predict(X_std)
tied_clusters = tied_cluster.fit_predict(X_std)
diag_clusters = diag_cluster.fit_predict(X_std)
spherical_clusters = spherical_cluster.fit_predict(X_std)

In [16]:
print('ARI score for full covariance: ', metrics.adjusted_rand_score(y, full_clusters))
print('Silhouette score for full covariance: ', metrics.silhouette_score(X_std, full_clusters, metric='euclidean'))
print('\n')
print('ARI score for tied covariance: ', metrics.adjusted_rand_score(y, tied_clusters))
print('Silhouette score for tied covariance: ', metrics.silhouette_score(X_std, tied_clusters, metric='euclidean'))
print('\n')
print('ARI score for diag covariance: ', metrics.adjusted_rand_score(y, diag_clusters))
print('Silhouette score for diag covariance: ', metrics.silhouette_score(X_std, diag_clusters, metric='euclidean'))
print('\n')
print('ARI score for spherical covariance: ', metrics.adjusted_rand_score(y, spherical_clusters))
print('Silhouette score for spherical covariance: ', metrics.silhouette_score(X_std, spherical_clusters, metric='euclidean'))


ARI score for full covariance:  0.18389186035089963
Silhouette score for full covariance:  0.13628813153331445


ARI score for tied covariance:  0.18389186035089963
Silhouette score for tied covariance:  0.13628813153331445


ARI score for diag covariance:  0.18389186035089963
Silhouette score for diag covariance:  0.13628813153331445


ARI score for spherical covariance:  0.20765243525722465
Silhouette score for spherical covariance:  0.12468753110276873
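
Identical scores for several covariance types suggest that those models may have assigned every observation to the same cluster. One way to check this directly is to compute the ARI between each pair of solutions, where a value of 1.0 means the two partitions are the same. A minimal sketch on synthetic data (`X_demo` and the chosen blob centers are assumptions for illustration; on the real data, `X_std` would take the place of `X_demo`):

```python
from itertools import combinations

from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: two well-separated 2-D blobs
X_demo, _ = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                       cluster_std=0.5, random_state=0)

# Fit one GMM per covariance type and keep its cluster assignments
assignments = {
    cov_type: GaussianMixture(n_components=2, covariance_type=cov_type,
                              random_state=0).fit_predict(X_demo)
    for cov_type in ["full", "tied", "diag", "spherical"]
}

# Pairwise ARI between solutions: 1.0 means identical partitions
agreement = {
    (a, b): adjusted_rand_score(assignments[a], assignments[b])
    for a, b in combinations(assignments, 2)
}
for pair, score in agreement.items():
    print(pair, round(score, 3))
```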
