## Clustering

In [90]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

&nbsp;

In [91]:
df = pd.read_csv('clinical_records_dataset.csv')

X = df.drop(['DEATH_EVENT', 'time'], axis=1)
y_true = df['DEATH_EVENT']

&nbsp;

### Function for MinMax feature normalization
The input `x` is the raw data in a 2-D array of shape `(number of data points, number of features)`.

The output `x_norm` is the normalized data of the input `x` with the same shape as the input.

This function will be used for normalizing data before using DBSCAN for clustering.


In [92]:
def feature_norm(x):
    # x is a 2-D array of shape (number of data points, number of features)
    eps = np.finfo(float).eps
    x_norm = x - np.expand_dims(x.min(0), axis=0)
    x_norm = x_norm / (np.expand_dims((x.max(0) - x.min(0)), axis=0) + eps)
    
    return x_norm
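
As a quick sanity check of the min-max formula, the sketch below applies the same computation to a small made-up array. The name `min_max_scale` and the toy values are assumptions for illustration; the function mirrors `feature_norm` but uses `keepdims=True` instead of `np.expand_dims`.

```python
import numpy as np

def min_max_scale(x):
    """Column-wise min-max scaling, mirroring feature_norm (illustrative name)."""
    x = np.asarray(x, dtype=float)
    eps = np.finfo(float).eps  # guards against division by zero
    col_min = x.min(axis=0, keepdims=True)
    col_range = x.max(axis=0, keepdims=True) - col_min
    return (x - col_min) / (col_range + eps)

# Made-up data: three points, three features; the last feature is constant
toy = np.array([[1.0, 200.0, 5.0],
                [2.0, 400.0, 5.0],
                [3.0, 800.0, 5.0]])
scaled = min_max_scale(toy)
print(scaled.round(3))
```

Each varying column is mapped onto the range 0 to 1, and the constant third column maps to 0 instead of producing a division-by-zero warning, which is the purpose of the `eps` term.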

&nbsp;

# Task 1: Function for computing purity
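`compute_purity` returns the purity of a clustering: the fraction of data points that belong to the majority true class of their assigned cluster.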

The indices of the clusters in `y_true` and `y_pred` start from 0 in `compute_purity`, e.g., [1, 1, 0, 0, 2, 2, 2].

`y_true` is the array of true class indices of all data points, `len(y_true)=number of data points`.

`y_pred` is the array of cluster indices of all data points, `len(y_pred)=number of data points`.
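
In symbols, purity can be written as follows. The notation is introduced here for illustration and does not appear in the assignment: $N$ is the number of data points, $c_k$ is the set of points assigned to cluster $k$, and $t_j$ is the set of points whose true class is $j$.

$$
\text{purity} = \frac{1}{N} \sum_{k} \max_{j} \, \lvert c_k \cap t_j \rvert
$$

For each cluster, only the count of its most frequent true class contributes to the sum.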

In [None]:
def compute_purity(y_true, y_pred):
    # Purity: fraction of points that carry the majority true class of their cluster
    # y_true is the array of true class indices of all data points, len(y_true)=number of data points
    # y_pred is the array of cluster indices of all data points, len(y_pred)=number of data points

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    majorities = []

    uniqueClusters = pd.unique(y_pred)

    for cluster in uniqueClusters:
        # get y_pred indexes whose value = cluster
        cluster_idx = np.where(y_pred == cluster)[0]

        counts = {}
        for i in cluster_idx:
            counts[y_true[i]] = counts.setdefault(y_true[i], 0) + 1

        majorities.append(max(counts.values()))
    
    purity = sum(majorities)/len(y_true)
    return purity
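
To make the computation concrete, here is a minimal worked example on made-up labels. `purity_sketch` is a hypothetical stand-alone helper written for this illustration, not the graded function; it should agree with `compute_purity` on the same inputs.

```python
from collections import Counter, defaultdict

def purity_sketch(y_true, y_pred):
    """Fraction of points that carry the majority true label of their cluster."""
    members = defaultdict(list)
    for t, c in zip(y_true, y_pred):
        members[c].append(t)  # group true labels by predicted cluster
    majority_total = sum(max(Counter(labels).values())
                         for labels in members.values())
    return majority_total / len(y_true)

# Made-up labels for illustration only
toy_pred = [1, 1, 0, 0, 2, 2, 2]
toy_true = [0, 0, 0, 1, 1, 1, 0]

# Cluster 1 -> true labels [0, 0]    -> majority count 2
# Cluster 0 -> true labels [0, 1]    -> majority count 1
# Cluster 2 -> true labels [1, 1, 0] -> majority count 2
toy_purity = purity_sketch(toy_true, toy_pred)
print(round(toy_purity, 4))  # (2 + 1 + 2) / 7
```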

&nbsp;

# Task 2

In [94]:
kmeans = KMeans(n_clusters=2, random_state=42)
y_pred = kmeans.fit_predict(X)  # fits the model once and returns the cluster labels

uniqueClusters = np.unique(y_pred)

for cluster in uniqueClusters:
    dataCount = len(np.where(y_pred == cluster)[0])
    print(f"Cluster {cluster}:")
    print(f"- count: {dataCount}")
    print(f"- % of data points: {dataCount/len(y_pred) : .2f}")
    print("\r")

purity = compute_purity(y_true, y_pred)
print(f"Clustering Result Purity: {purity: .4f}")

for cluster in uniqueClusters:
    cluster_idx = np.where(y_pred == cluster)[0]
    true = y_true[cluster_idx]
    pred = y_pred[cluster_idx]

    purity_of_cluster = compute_purity(true, pred)
    print(f"Purity of Cluster {cluster}: {purity_of_cluster: .2f}")



Cluster 0:
- count: 234
- % of data points:  0.78

Cluster 1:
- count: 65
- % of data points:  0.22

Clustering Result Purity:  0.6789
Purity of Cluster 0:  0.69
Purity of Cluster 1:  0.63


78% of the data points were assigned to Cluster 0 and 22% of the data points were assigned to Cluster 1. 

The purity of the clustering result is 68%. 

The purity of Cluster 0 is 69% and the purity of Cluster 1 is 63%, so Cluster 0 has the higher purity.
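
The overall purity is the size-weighted mean of the per-cluster purities, which follows directly from the definition of purity. The short check below uses only the counts and the rounded per-cluster values printed above, so it can reproduce the overall figure only approximately.

```python
# Cluster sizes and rounded per-cluster purities as printed in the output above
sizes = {0: 234, 1: 65}
cluster_purity = {0: 0.69, 1: 0.63}

n_total = sum(sizes.values())
# Weight each cluster's purity by its share of the data points
weighted = sum(sizes[c] * cluster_purity[c] for c in sizes) / n_total
print(n_total, round(weighted, 3))
```

The result is about 0.677, which matches the reported 0.6789 up to the rounding of the per-cluster values.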

&nbsp;

# Task 3

In [149]:
k_arr = [2, 10, 30, 50, 100]
results = []

for k in k_arr:
    purity_arr = []
    silhouette_score_arr = []
    for i in range(10):
        kmeans = KMeans(n_clusters=k)
        y_pred = kmeans.fit_predict(X)

        purity_k = compute_purity(y_true, y_pred)
        purity_arr.append(purity_k)

        score = silhouette_score(X, y_pred, metric='euclidean')
        silhouette_score_arr.append(score)

    results.append([k, np.average(purity_arr), np.average(silhouette_score_arr)])

pd.DataFrame(results, columns=["k", "Purity", "Silhouette Coefficient"])

|   | k | Purity | Silhouette Coefficient |
|---|---|---|---|
| 0 | 2 | 0.67893 | 0.575425 |
| 1 | 10 | 0.684281 | 0.574478 |
| 2 | 30 | 0.700669 | 0.552328 |
| 3 | 50 | 0.723746 | 0.561044 |
| 4 | 100 | 0.765886 | 0.509231 |


The best purity (0.766) was obtained when k = 100. The best silhouette coefficient (0.575) was obtained when k = 2, with k = 10 nearly tied at 0.574.

Purity measures the agreement between the assigned clusters and the ground truth labels of the data points: it is the fraction of all data points that carry the majority label of their cluster. Purity ranges between 0 and 1. Higher values indicate closer agreement, and a purity of 1 means that every cluster contains data points from a single ground truth class.

Typically, it is easier to achieve a higher purity when k is large.

With fewer clusters (a lower value of k), each cluster contains more data points, and those points are more likely to come from different classes. Every point outside the majority class of its cluster lowers the purity.

With more clusters, each cluster contains fewer data points, so it is more likely to be dominated by a single class, which raises the purity. Smaller clusters can also follow local structure in the data more closely than larger ones. In the extreme case where every data point is assigned to its own cluster, the purity is exactly 1, even though such a clustering says nothing useful about the data. For this reason purity alone is not a good criterion for choosing k: in the table above, purity rises steadily with k while the silhouette coefficient tends to fall.
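
The two extremes of this argument can be checked on made-up labels. The helper `purity_from_pairs` and the toy data are assumptions for illustration only. With a single cluster, purity equals the share of the majority class; with one cluster per point, purity is 1.

```python
from collections import Counter

def purity_from_pairs(y_true, y_pred):
    """Purity computed from (cluster, class) pair counts (illustrative helper)."""
    pair_counts = Counter(zip(y_pred, y_true))  # (cluster, class) -> count
    best = {}
    for (cluster, _), n in pair_counts.items():
        best[cluster] = max(best.get(cluster, 0), n)  # majority count per cluster
    return sum(best.values()) / len(y_true)

# Made-up ground truth: 7 points of class 0 and 3 points of class 1
toy_true = [0] * 7 + [1] * 3

one_cluster = purity_from_pairs(toy_true, [0] * 10)        # k = 1
singletons = purity_from_pairs(toy_true, list(range(10)))  # k = number of points

print(one_cluster, singletons)
```

Any other clustering of these ten points falls between the two values, so a purity figure is easier to interpret next to the majority-class share of the dataset.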

&nbsp;

# Task 4

In [96]:
X_norm = feature_norm(X)

eps_arr = [0.3, 0.5, 0.7]

dbs_results = []

for eps in eps_arr:    
    clustering = DBSCAN(eps=eps, min_samples=5, metric='euclidean')
    y_pred = clustering.fit_predict(X_norm)

    # number of clusters and noisy points
    n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
    n_noise = list(y_pred).count(-1)
    
    purity_dbs = compute_purity(y_true, y_pred)

    dbs_results.append([eps, n_clusters, n_noise, purity_dbs])


pd.DataFrame(dbs_results, columns=["eps", "Number of Clusters", "Number of Anomalies", "Purity"])

|   | eps | Number of Clusters | Number of Anomalies | Purity |
|---|---|---|---|---|
| 0 | 0.3 | 18 | 146 | 0.688963 |
| 1 | 0.5 | 22 | 21 | 0.688963 |
| 2 | 0.7 | 22 | 13 | 0.695652 |


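The best purity (0.6957) was obtained when eps = 0.7. In this run the noise points (label -1) are passed to `compute_purity` unchanged, so they are counted as one additional cluster.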

In [97]:
X_norm = feature_norm(X)

eps_arr = [0.3, 0.5, 0.7]

dbs_results = []

for eps in eps_arr:    
    clustering = DBSCAN(eps=eps, min_samples=5, metric='euclidean')
    y_pred = clustering.fit_predict(X_norm)

    # number of clusters and noisy points
    n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
    n_noise = list(y_pred).count(-1)
    
    # remove noise/anomalies (DBSCAN labels them -1)
    keep = y_pred != -1
    true = np.array(y_true)[keep]
    y_pred = y_pred[keep]

    purity_dbs = compute_purity(true, y_pred)

    dbs_results.append([eps, n_clusters, n_noise, purity_dbs])


pd.DataFrame(dbs_results, columns=["eps", "Number of Clusters", "Number of Anomalies", "Purity"])

|   | eps | Number of Clusters | Number of Anomalies | Purity |
|---|---|---|---|---|
| 0 | 0.3 | 18 | 146 | 0.777778 |
| 1 | 0.5 | 22 | 21 | 0.701439 |
| 2 | 0.7 | 22 | 13 | 0.702797 |


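If the anomalies are removed before computing purity, the best purity (0.7778) is obtained when eps = 0.3. At that setting, however, 146 of the 299 data points are labelled as noise, so the purity is computed on only about half of the data.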
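
Purity over the non-noise points is easier to judge next to the share of points that were kept. The sketch below, on made-up labels, computes purity with noise counted as its own cluster, purity with noise removed, and the coverage (fraction of points retained). The helper `purity_np` and all values are assumptions for illustration.

```python
import numpy as np

def purity_np(y_true, y_pred):
    """Purity via per-cluster majority counts (illustrative helper)."""
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()  # size of the majority class in cluster c
    return total / len(y_true)

# Made-up DBSCAN-style labels: -1 marks noise, other values are cluster ids
labels = np.array([0, 0, -1, 1, 1, -1, -1, 0])
y = np.array([0, 0, 1, 1, 1, 0, 1, 1])

keep = labels != -1     # boolean mask of non-noise points
coverage = keep.mean()  # fraction of points assigned to a cluster

purity_with_noise = purity_np(y, labels)                 # noise treated as a cluster
purity_without_noise = purity_np(y[keep], labels[keep])  # noise removed first

print(coverage, purity_with_noise, purity_without_noise)
```

Reporting coverage alongside purity makes it visible when a high purity comes mostly from discarding hard-to-cluster points.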