This notebook, by [felipe.alonso@urjc.es](mailto:felipe.alonso@urjc.es)

In this notebook we will analyze clustering methods on the Pima Indian Diabetes dataset.

# Table of Contents

0. [Preliminaries](#preliminaries)
1. [K-means](#k_means) 
2. [Hierarchical clustering](#hierarchical)
3. [Project Ideas](#ideas)

---
<a id='preliminaries'></a>
# 0. Preliminaries

## Import libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# your code here
# ... add as many libraries as you want

from src.utils import plot_scatter, plot_silhouette

## Load dataset

In this lab exercise you are using the Pima Indian Diabetes data. Your hypothesis is that **there might be groups of patients with similar behavior** and you want to get some insights about them.

In [None]:
from src.ddbb import load_pima_indian

X, y = load_pima_indian('./data/pima_indian_diabetes.csv')
feat_names = X.columns

---
<a id='k_means'></a>
# 1. K-means

In [None]:
from sklearn.preprocessing import StandardScaler

X1 = X[['bmi','glucose']].values
X1 = StandardScaler().fit_transform(X1)

In [None]:
X1.shape

In [None]:
from sklearn.cluster import KMeans

# build the clustering model
k = 2
kmeans = KMeans(n_clusters = k).fit(X1)

# Centroids 
centroids = kmeans.cluster_centers_

# Labels
cluster_labels = kmeans.labels_

# do the plotting
plot_scatter(X1,'k = ' + str(k), cluster_labels, centroids)
plt.show()

What if we color the observations by the target variable `y` instead of the cluster labels?

In [None]:
# do the plotting
plot_scatter(X1,'k = ' + str(k), y, centroids)
plt.show()

Careful here: the purpose is to group our observations, not to classify them. There might be subgroups within our observations that share the same outcome, and subgroups that mix different outcomes.
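
One way to look at how clusters relate to the outcome is a contingency table of cluster label against outcome. The sketch below is only an illustration on synthetic data: `X_demo`, `y_demo` and `labels_demo` are made-up names standing in for `X1`, `y` and the K-means labels, and the choice of 3 clusters is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the notebook data: 300 points with a binary outcome
X_demo, y_demo = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)

# Cluster without looking at the outcome (3 clusters is an arbitrary choice)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Rows: cluster label, columns: outcome; each cell counts observations
table = pd.crosstab(pd.Series(labels_demo, name='cluster_label'),
                    pd.Series(y_demo, name='outcome'))
print(table)
```

A cluster whose row is dominated by one outcome behaves like a class; a row split across both outcomes is a subgroup that the target variable alone would not reveal.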

### How many clusters are there?

In [None]:
K = range(1,15)

inertia = []
for n_clusters in K:  # separate loop variable, so k is not overwritten
    kmeans = KMeans(n_clusters=n_clusters).fit(X1)
    inertia.append(kmeans.inertia_)
    
plt.plot(K,inertia,'.-')
plt.xlabel('# of clusters')
plt.ylabel('Inertia')
plt.show()
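
For reference, the inertia plotted above is the sum of squared distances from each observation to its assigned centroid. The following minimal sketch recomputes it by hand on synthetic data (`X_demo` and `km` are illustrative names, not part of this notebook) and compares it with the `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used only to illustrate the definition
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Centroid assigned to each observation, then the sum of squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)
```

Because adding clusters can only bring points closer to some centroid, the inertia typically keeps decreasing as k grows, which is why we look for an elbow rather than a minimum.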

#### Use the silhouette analysis

In [None]:
k = 7  # keep k consistent with the number of clusters actually fitted
kmeans = KMeans(n_clusters=k).fit(X1)
plot_silhouette(X1,k,kmeans.labels_,kmeans.cluster_centers_)
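
As a complement to the silhouette plot, the average silhouette score can be compared across several values of k. This is a sketch on synthetic, well-separated data (`X_demo`, `scores` and `best_k` are illustrative names); on real data the maximum is usually far less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[-10, -10], [0, 10], [10, -10]],
                       cluster_std=1.0, random_state=0)

# The silhouette score needs at least 2 clusters
scores = {}
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, labels)

# Number of clusters with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```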

### Let's analyze our observations depending on the cluster label

In [None]:
df = X.copy()
df['cluster_label'] = kmeans.labels_  # labels from the latest K-means fit (7 clusters)
df.head()

In [None]:
df[df.cluster_label==3]


In [None]:
df[df.cluster_label==0]


<div class = "alert alert-info">
<b>Note:</b> You can use either <b>cluster_labels</b> or <b>outcome</b> in the above representation
</div>
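
Rather than inspecting one cluster at a time, the clusters can also be summarized side by side. The sketch below assumes a DataFrame with a `cluster_label` column like `df` above, but builds its own synthetic `df_demo` so that it runs on its own; the values are made up, and only the column names `bmi` and `glucose` come from this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features of the dataset
values, _ = make_blobs(n_samples=200, centers=3, random_state=0)
df_demo = pd.DataFrame(values, columns=['bmi', 'glucose'])

# Cluster on the scaled features, keep the labels next to the raw values
scaled = StandardScaler().fit_transform(df_demo.values)
df_demo['cluster_label'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# Per-cluster mean and standard deviation of each feature, plus cluster sizes
summary = df_demo.groupby('cluster_label').agg(['mean', 'std'])
sizes = df_demo['cluster_label'].value_counts().sort_index()
print(summary)
print(sizes)
```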

### PCA & K-means

Two options here:
    
1. K-means + PCA representation
2. PCA dimensionality reduction + K-means
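
The difference between the two options is where K-means runs: on all scaled features (PCA only for plotting) or on the PCA-reduced data. A minimal sketch on synthetic data, with illustrative names (`X_demo`, `labels_full`, `labels_pca`) and the adjusted Rand index as one possible way to compare the two partitions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic 8-feature data, scaled as in the notebook
X_demo, _ = make_blobs(n_samples=300, n_features=8, centers=3, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)
X_demo_pca = PCA(n_components=2).fit_transform(X_demo_scaled)

# Option 1: cluster in the full feature space (PCA would only be used to plot)
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo_scaled)

# Option 2: cluster in the 2-component PCA space
labels_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo_pca)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
agreement = adjusted_rand_score(labels_full, labels_pca)
print(agreement)
```

If the two partitions disagree noticeably, the discarded components carry structure that K-means was using in the full space.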

#### Option 1:

In [None]:
from sklearn.decomposition import PCA

# scaling
X_scaled = StandardScaler().fit_transform(X)

# Full PCA: keeps every component so we can inspect the explained variance
pca = PCA().fit(X_scaled)
# 2-component projection used for the scatter plots
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Data visualization
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
plt.plot(np.cumsum(pca.explained_variance_ratio_),'.-')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

plt.subplot(2,2,2)
plt.bar(range(pca.n_components_), pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('explained variance ratio')
plt.xticks(range(pca.n_components_))

plt.subplot(2,2,3)
plt.scatter(X_pca[:,0],X_pca[:,1], c=y)
plt.xlabel('$x_1$ (PCA)',fontsize=16)
plt.ylabel('$x_2$ (PCA)',fontsize=16)

plt.subplot(2,2,4)
plt.scatter(X_pca[:,0],X_pca[:,1], c=cluster_labels)
plt.xlabel('$x_1$ (PCA)',fontsize=16)
plt.ylabel('$x_2$ (PCA)',fontsize=16)

plt.show()

#### Option 2:

In [None]:
X_pca = PCA(n_components=2).fit_transform(X_scaled)
kmeans = KMeans(n_clusters = k).fit(X_pca)

# Centroids 
centroids = kmeans.cluster_centers_

# Labels
cluster_labels = kmeans.labels_

# do the plotting
plot_scatter(X_pca,'k = ' + str(k), cluster_labels, centroids)
plt.show()

---
<a id='hierarchical'></a>
# 2. Hierarchical clustering

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X1, 'ward')
dendrogram(Z)
plt.show()
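
The dendrogram can also be cut into a fixed number of flat clusters directly from the linkage matrix. The sketch below uses `scipy.cluster.hierarchy.fcluster` on synthetic data; `X_demo`, `Z_demo` and the choice of 4 clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data and its Ward linkage matrix (one row per merge)
X_demo, _ = make_blobs(n_samples=100, centers=4, random_state=0)
Z_demo = linkage(X_demo, 'ward')

# Cut the tree so that at most 4 flat clusters remain (labels start at 1)
flat_labels = fcluster(Z_demo, t=4, criterion='maxclust')
print(np.unique(flat_labels, return_counts=True))
```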

In [None]:
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=4).fit(X1)
plot_scatter(X1,'Hierarchical clustering', agg.labels_)
plt.show()

---
<a id='ideas'></a>
# 3. Project Ideas


Here are some ideas that you might want to consider for your project:

- Apply the k-means algorithm to your dataset. Was it helpful? Did you gain any insight? Comment on the number of clusters you used.

- What if you used Hierarchical clustering? Any differences? 


In all of the above, justify your decisions.