# Unsupervised Learning

- Using the UCI Mushroom dataset, use k-means and a suitable cluster evaluation metric to determine the optimal number of clusters in the dataset. Note that this may not necessarily be two (edible versus not-edible).
- Plot this metric while increasing the number of clusters, e.g., $k = 2, \dots, 30$ (see [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html#sphx-glr-auto-examples-cluster-plot-adjusted-for-chance-measures-py) for an example).
- Visualise the data using the number of clusters and a suitable projection or low-dimensional embedding.

Let's first import everything we need and load the dataset.

In [39]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans
from sklearn import decomposition
from sklearn import metrics

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math

df = pd.read_csv("../Data/mushroom.data")

Then let's split the data into X and y, and one-hot encode the categorical columns.

In [40]:
X, y = df.drop('edibility', axis='columns'), df['edibility']
X, y = pd.get_dummies(X), pd.get_dummies(y)
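

As a rough illustration of what `pd.get_dummies` does to categorical columns, here is a minimal sketch on a toy frame. The column names and category codes below are made up for the example and are not taken from the mushroom data.

```python
import pandas as pd

# Toy frame with hypothetical categorical columns (not the real dataset).
toy = pd.DataFrame({
    "cap_shape": ["x", "b", "x"],
    "odor": ["n", "a", "n"],
})

# get_dummies creates one indicator column per (column, category) pair.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded.shape)
```

Each row ends up with exactly one active indicator per original column, which is why the encoded matrix is much wider than the raw table.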

Next, we use PCA to project the one-hot encoded features onto their first three principal components.

In [41]:
pca = decomposition.PCA(n_components=3)
pca.fit(X)
Xpca = pca.transform(X)
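

Before relying on a three-component projection, it can help to check how much variance those components retain. The sketch below shows the idea on synthetic binary data (an assumption, standing in for the encoded features); `X_demo` and `pca_demo` are illustrative names only.

```python
import numpy as np
from sklearn import decomposition

# Synthetic 0/1 matrix standing in for one-hot encoded features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10)).astype(float)

# Fit PCA and project onto the first three principal components.
pca_demo = decomposition.PCA(n_components=3)
X_demo_pca = pca_demo.fit_transform(X_demo)

# Fraction of the total variance captured by each retained component.
ratios = pca_demo.explained_variance_ratio_
print(X_demo_pca.shape)
print(ratios, ratios.sum())
```

If the summed ratio is small, clusters found in the projected space may not reflect structure in the full feature space.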

### Using Silhouette score:

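The silhouette coefficient (SC) is a simple quality measure for a hard clustering such as k-means. It ranges from -1 to 1, and a higher value indicates more compact, better-separated clusters.

For each sample it is composed of two distances:
* a - Mean distance between the sample and all other points in the same cluster
* b - Mean distance between the sample and all points in the *nearest other* cluster

$sc = \frac{b - a}{\max(a, b)}$

The score reported for a clustering is the mean of this value over all samples.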
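
To make the definition concrete, the following sketch computes the coefficient by hand for one sample of a tiny two-cluster example and compares it with scikit-learn's `silhouette_samples` and `silhouette_score`. The four points and their labels are invented for illustration.

```python
import numpy as np
from sklearn import metrics

# Four hand-picked points forming two obvious clusters (illustrative data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_demo = np.array([0, 0, 1, 1])

# a: mean distance from sample 0 to the other points in its own cluster.
a = np.linalg.norm(points[0] - points[1])
# b: mean distance from sample 0 to the points of the nearest other cluster.
b = np.mean([np.linalg.norm(points[0] - p) for p in points[2:]])
s0 = (b - a) / max(a, b)

# scikit-learn's per-sample values and their mean (the overall score).
s_all = metrics.silhouette_samples(points, labels_demo, metric='euclidean')
sc_demo = metrics.silhouette_score(points, labels_demo, metric='euclidean')

print(s0, s_all[0], sc_demo)
```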

In [None]:
min_k, max_k = 2, 33
row, col = 8, 4  # 8 x 4 grid: one subplot per value of k

fig, axs = plt.subplots(row, col, figsize=(16, 16))
sc = []

for n, k in enumerate(range(min_k, max_k + 1)):
    # Fixed seed so repeated runs give the same clustering
    kmeans = KMeans(n_clusters=k, random_state=0)
    labels = kmeans.fit_predict(Xpca)
    centers = kmeans.cluster_centers_
    sc.append(metrics.silhouette_score(Xpca, labels, metric='euclidean'))

    # Plot the first two principal components, coloured by cluster label
    subfigure = axs[n // col, n % col]
    subfigure.scatter(Xpca[:, 0], Xpca[:, 1], c=labels, s=15, cmap='plasma')
    subfigure.scatter(centers[:, 0], centers[:, 1], c='black', s=70, alpha=0.6)
    subfigure.set_title("k=%d, sc=%f" % (k, sc[-1]))
    subfigure.axis('off')

Finally, let's plot the silhouette score against k and find the value of k that gives the highest score.

In [None]:
fig, ax = plt.subplots(figsize=(18, 6))
ax.set(ylim=(0, 1))
ax.plot(range(min_k, max_k + 1), sc)
ax.set_xticks(range(min_k, max_k + 1))
ax.set_xlabel("k")
ax.set_ylabel("Silhouette score")
plt.show()

# Map the position of the best score back to its k (the list starts at min_k)
best_k = min_k + int(np.argmax(sc))
print("The optimal value of k is: k=%d, where sc=%f" % (best_k, max(sc)))
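

As a sanity check of the selection procedure itself, the sketch below runs the same sweep on synthetic data with a known number of well-separated groups. The blob centres, spread, and range of k are assumptions chosen for illustration; on data like this the highest silhouette score should coincide with the true number of groups.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three tight, well-separated blobs (synthetic data, not the mushroom set).
centers_demo = [[0, 0], [10, 10], [-10, 10]]
X_blobs, _ = make_blobs(n_samples=300, centers=centers_demo,
                        cluster_std=0.5, random_state=0)

# Silhouette score for each candidate k.
ks = list(range(2, 7))
scores = []
for k in ks:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    scores.append(metrics.silhouette_score(X_blobs, labels_k, metric='euclidean'))

# Translate the index of the best score back into a value of k.
best_k_demo = ks[int(np.argmax(scores))]
print("best k = %d, sc = %f" % (best_k_demo, max(scores)))
```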