In this assignment, you'll continue working with the [heart disease dataset](http://archive.ics.uci.edu/ml/datasets/Heart+Disease) from the UC Irvine Machine Learning Repository.

Load the dataset from Thinkful's database. To connect to the database, use these credentials:
```
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'
```
The dataset needs some preprocessing. So, before working with the dataset, apply the following code:
```
# Define the features and the outcome
X = heartdisease_df.iloc[:, :13]
y = heartdisease_df.iloc[:, 13]

# Replace missing values (marked by `?`) with a `0`
X = X.replace(to_replace='?', value=0)

# Binarize y so that `1` means heart disease diagnosis and `0` means no diagnosis
y = np.where(y > 0, 1, 0)
```
Here, `X` will represent your features and `y` will hold the labels. If `y` is equal to `1`, that indicates that the corresponding patient has heart disease. And if `y` is equal to `0`, then the patient doesn't have heart disease.
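
As a quick illustration of what the preprocessing does, here is a minimal sketch on a made-up toy frame. The names `toy_X`, `toy_y`, and `toy_y_bin` are hypothetical and the values are invented for illustration; the final cast to `float` is an optional addition that is not part of the assignment's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real features and diagnosis column
toy_X = pd.DataFrame(
    {"ca": ["0.0", "?", "2.0"], "thal": ["3.0", "7.0", "?"]}, dtype=object
)
toy_y = pd.Series([0, 3, 1])

# Replace the `?` placeholders with 0, as in the assignment's code,
# then cast to float (optional extra step, an assumption of this sketch)
toy_X = toy_X.replace(to_replace="?", value=0).astype(float)

# Any diagnosis value above 0 collapses to 1
toy_y_bin = np.where(toy_y > 0, 1, 0)

print(toy_X)
print(toy_y_bin)
```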

To complete this assignment, submit a link to a Jupyter Notebook containing your solutions to the tasks below. You can also take a look at these [example solutions](https://github.com/Thinkful-Ed/data-201-resources/blob/master/clustering_module_solutions/6.solution_gmm.ipynb).

1. Apply GMM to the heart disease dataset by setting `n_components=2`. Get ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the previous checkpoint assignments. Which algorithm performs best?
1. The scikit-learn implementation of GMM has a parameter called `covariance_type`. This parameter determines the type of covariance parameters to use. There are four types that you can specify:
 1. `full`: This is the default. Each component has its own general covariance matrix.
 1. `tied`: All components share the same general covariance matrix.
 1. `diag`: Each component has its own diagonal covariance matrix.
 1. `spherical`: Each component has its own single variance.

 Try all of these. Which one performs best in terms of ARI and silhouette scores?
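
One way to see what the four options mean is to look at the shape of the fitted `covariances_` attribute. The sketch below uses synthetic data (two Gaussian blobs with three features), not the heart disease dataset; `X_demo` and `shapes` are names chosen for this illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: 200 samples, 3 features
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=4.0, scale=1.0, size=(100, 3)),
])

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
    gmm.fit(X_demo)
    # The array shape reflects how many covariance parameters each type estimates
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
```

With two components and three features, `full` stores one 3x3 matrix per component, `tied` stores a single shared 3x3 matrix, `diag` stores three variances per component, and `spherical` stores one variance per component.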

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn import datasets, metrics
from sqlalchemy import create_engine


postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'heartdisease'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
heartdisease_df = pd.read_sql_query('select * from heartdisease', con = engine)
engine.dispose()

# Define the features and the outcome
X = heartdisease_df.iloc[:, :13]
y = heartdisease_df.iloc[:, 13]

# Replace missing values (marked by `?`) with a `0`
X = X.replace(to_replace='?', value=0)

# Binarize y so that `1` means heart disease diagnosis and `0` means no diagnosis
y = np.where(y > 0, 1, 0)

# Apply GMM to the heart disease dataset by setting `n_components=2`. Get ARI and silhouette scores for your solution and compare them with those of the k-means and hierarchical clustering solutions that you implemented in the previous checkpoint assignments. Which algorithm performs best?

In [2]:
X_std = StandardScaler().fit_transform(X)

gmm_cluster = GaussianMixture(n_components = 2)
clusters = gmm_cluster.fit_predict(X_std)

print("Adjusted Rand Index of the GMM solution: {}"
     .format(metrics.adjusted_rand_score(y, clusters)))
print("The silhouette score of the GMM solution: {}"
     .format(metrics.silhouette_score(X_std, clusters, metric = 'euclidean')))

Adjusted Rand Index of the GMM solution: 0.18389186035089963
The silhouette score of the GMM solution: 0.13628813153331445


The two-cluster k-means ARI was 0.43 and the silhouette score was 0.17. For hierarchical clustering, the ARI was 0.14 and the silhouette score was 0.14. The GMM run above scored 0.18 on ARI and 0.14 on silhouette, so based on both scores, k-means was the best solution for this dataset.
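
The k-means and hierarchical scores quoted here come from earlier checkpoints. A side-by-side comparison could be sketched as follows, assuming synthetic blobs in place of the heart disease features and hypothetical model settings (`n_init=10`, default Ward linkage, fixed `random_state`) that may differ from what the earlier checkpoints used.

```python
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized features and binary labels
X_demo, y_demo = make_blobs(n_samples=300, centers=2, n_features=5, random_state=0)
X_demo_std = StandardScaler().fit_transform(X_demo)

# Hypothetical settings; the earlier checkpoints may have used others
models = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=2),
    "GMM": GaussianMixture(n_components=2, random_state=0),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo_std)
    # Store (ARI against the true labels, silhouette on the features)
    results[name] = (
        metrics.adjusted_rand_score(y_demo, labels),
        metrics.silhouette_score(X_demo_std, labels, metric="euclidean"),
    )
    print("{}: ARI = {:.2f}, silhouette = {:.2f}".format(name, *results[name]))
```

Swapping `X_demo_std` and `y_demo` for `X_std` and `y` would run the same comparison on the heart disease data.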

# 1. The scikit-learn implementation of GMM has a parameter called `covariance_type`. This parameter determines the type of covariance parameters to use. There are four types that you can specify: full, tied, diag, spherical. Try all of these. Which one performs best in terms of ARI and silhouette scores?

In [4]:
gmm_full = GaussianMixture(n_components = 2, covariance_type = 'full')
full_cluster = gmm_full.fit_predict(X_std)
gmm_tied = GaussianMixture(n_components = 2, covariance_type = 'tied')
tied_cluster = gmm_tied.fit_predict(X_std)
gmm_diag = GaussianMixture(n_components = 2, covariance_type = 'diag')
diag_cluster = gmm_diag.fit_predict(X_std)
gmm_sphe = GaussianMixture(n_components = 2, covariance_type = 'spherical')
sphe_cluster = gmm_sphe.fit_predict(X_std)

print("Adjusted Rand Index of the GMM solution with covariance_type = full: {}"
     .format(metrics.adjusted_rand_score(y, full_cluster)))
print("The silhouette score of the GMM solution with covariance_type = full: {}"
     .format(metrics.silhouette_score(X_std, full_cluster, metric = 'euclidean')))
print("\nAdjusted Rand Index of the GMM solution with covariance_type = tied: {}"
     .format(metrics.adjusted_rand_score(y, tied_cluster)))
print("The silhouette score of the GMM solution with covariance_type = tied: {}"
     .format(metrics.silhouette_score(X_std, tied_cluster, metric = 'euclidean')))
print("\nAdjusted Rand Index of the GMM solution with covariance_type = diag: {}"
     .format(metrics.adjusted_rand_score(y, diag_cluster)))
print("The silhouette score of the GMM solution with covariance_type = diag: {}"
     .format(metrics.silhouette_score(X_std, diag_cluster, metric = 'euclidean')))
print("\nAdjusted Rand Index of the GMM solution with covariance_type = spherical: {}"
     .format(metrics.adjusted_rand_score(y, sphe_cluster)))
print("The silhouette score of the GMM solution with covariance_type = spherical: {}"
     .format(metrics.silhouette_score(X_std, sphe_cluster, metric = 'euclidean')))

Adjusted Rand Index of the GMM solution with covariance_type = full: 0.4207322145049338
The silhouette score of the GMM solution with covariance_type = full: 0.16118591340148433

Adjusted Rand Index of the GMM solution with covariance_type = tied: 0.18389186035089963
The silhouette score of the GMM solution with covariance_type = tied: 0.13628813153331445

Adjusted Rand Index of the GMM solution with covariance_type = diag: 0.18389186035089963
The silhouette score of the GMM solution with covariance_type = diag: 0.13628813153331445

Adjusted Rand Index of the GMM solution with covariance_type = spherical: 0.20765243525722468
The silhouette score of the GMM solution with covariance_type = spherical: 0.12468753110276873


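Of the four covariance types, `full` performs best on both metrics in this run (ARI 0.42, silhouette 0.16). The `tied` and `diag` runs produce identical scores (ARI 0.18, silhouette 0.14), and `spherical` has a slightly higher ARI (0.21) but the lowest silhouette score (0.12). Note that the default model in the earlier cell also uses `full` but scored an ARI of only 0.18: no `random_state` is set, so results can vary between runs.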
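
One way to check how stable such a ranking is would be to repeat each fit over several seeds and look at the spread of the ARI. The sketch below does this on synthetic, deliberately overlapping blobs rather than the heart disease data; the seed range, `cluster_std=3.0`, and the names `X_demo_std`, `y_demo`, and `summary` are assumptions made for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some overlap between the two groups
X_demo, y_demo = make_blobs(
    n_samples=300, centers=2, n_features=5, cluster_std=3.0, random_state=0
)
X_demo_std = StandardScaler().fit_transform(X_demo)

summary = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    aris = []
    for seed in range(5):
        # Fixing random_state makes each individual fit reproducible
        gmm = GaussianMixture(
            n_components=2, covariance_type=cov_type, random_state=seed
        )
        labels = gmm.fit_predict(X_demo_std)
        aris.append(metrics.adjusted_rand_score(y_demo, labels))
    # Mean and standard deviation of the ARI across seeds
    summary[cov_type] = (float(np.mean(aris)), float(np.std(aris)))
    print("{}: mean ARI = {:.3f}, std = {:.3f}".format(cov_type, *summary[cov_type]))
```

A large standard deviation for a covariance type suggests that a single-run comparison may not be reliable for it.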