# Unsupervised Learning
- K-Means clustering

K-Means is one of the fundamental techniques in unsupervised learning. It is used to find out whether there is any hidden pattern in unlabelled data. Remember that in unsupervised learning we only have the input features X, without a target value y to predict. The k-Means algorithm groups the data points into clusters so that points within the same cluster are similar to each other.
The same idea can help identify anomalies, that is, data points that do not resemble the rest of the dataset (for example, points that lie far from every centroid). The objective of the algorithm is to find a given number of `centroids` that minimises the total squared distance between each data point and its nearest `centroid`. The number of `centroids` is not learned: we have to choose it ourselves and pass it to the algorithm. Given that number, the algorithm alternates between assigning each data point to its nearest `centroid` and updating the `centroids`, until the assignment stops changing. The result is a local optimum that can depend on the initial `centroids`.
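
The assign-and-update loop described above can be sketched in a few lines of NumPy. This is only a toy illustration, not scikit-learn's implementation: the function name `lloyd_kmeans`, the random initialisation, and the two synthetic blobs are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Toy k-means (Lloyd's algorithm). Returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to nearest centroid).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, inertia

# Two well-separated synthetic blobs (hypothetical data, for illustration only).
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                   rng.normal(5.0, 0.3, size=(50, 2))])
toy_centroids, toy_labels, toy_inertia = lloyd_kmeans(X_toy, k=2)
print(toy_centroids.round(2), round(toy_inertia, 2))
```

On data this well separated, the two centroids should end up near the blob centres. On harder data, different seeds can give different local optima, which is why library implementations typically run several initialisations.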

<center><img src='./assets/knn.png' width="800"></center>

The two applications we are going to explore in this lesson are clustering the iris dataset and image segmentation.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D

from utils import *
from sklearn.cluster import KMeans
from sklearn import datasets

In [None]:
from sklearn.datasets import load_iris

data = load_iris()
X_iris = data.data
y_iris = data.target
data.target_names

The plots below contrast the labelled view of the flowers (classification) with the unlabelled view (clustering), using the petal length and petal width.

In [None]:
plt.figure(figsize=(9, 3.5))

plt.subplot(121)
plt.title("Classification")
plt.plot(X_iris[y_iris==0, 2], X_iris[y_iris==0, 3], "yo", label="Iris setosa")
plt.plot(X_iris[y_iris==1, 2], X_iris[y_iris==1, 3], "bs", label="Iris versicolor")
plt.plot(X_iris[y_iris==2, 2], X_iris[y_iris==2, 3], "g^", label="Iris virginica")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(fontsize=12)


plt.subplot(122)
plt.title("Clustering")
plt.scatter(X_iris[:, 2], X_iris[:, 3], c="k", marker=".")
plt.xlabel("Petal length", fontsize=14)
plt.tick_params(labelleft=False)


plt.show()

In [None]:
print(X_iris.shape, y_iris.shape)

Let's create a `KMeans` model to cluster the iris dataset:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# Apply k-means to the dataset / create the KMeans clustering model
# TODO:
# Your code goes here!



The number of clusters is set by the `n_clusters` value we pass when creating the `KMeans` object.
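
As a hedged illustration of how `n_clusters` is used, here is one possible sketch. The choice of 3 clusters, the `random_state`, and the `_demo` variable names are assumptions made for this example, so that it does not overwrite the `kmeans` and `y_kmeans` variables you create in the exercise cells.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_demo = load_iris().data  # 150 flowers, 4 features

# n_clusters=3 is an assumption here (we happen to know iris has 3 species).
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42)

# fit_predict learns the centroids and returns one cluster index per sample.
labels_demo = kmeans_demo.fit_predict(X_demo)

print(labels_demo[:10])                    # cluster indices, not species labels
print(kmeans_demo.cluster_centers_.shape)  # (3, 4): one centroid per cluster
```

The cluster indices are arbitrary: cluster 0 does not have to correspond to species 0.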

In [None]:
# What is the meaning of y_kmeans?
# TODO:
# Your code goes here!


In [None]:
# Find the cluster centers
# TODO:
# Your code goes here!


In [None]:
# How good is this clustering? (Performance metrics)
# TODO:
# Your code goes here!
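
One possible way to approach the question above, as a sketch rather than the expected answer: scikit-learn exposes the inertia of a fitted model, and `sklearn.metrics` offers the silhouette score (no labels needed) and the adjusted Rand index (compares against the true species, which we only have because iris is a labelled toy dataset). The `_m` variable names are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

iris_m = load_iris()
X_m, y_m = iris_m.data, iris_m.target
km_m = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_m)

# Inertia: sum of squared distances of samples to their nearest centroid (lower is tighter).
inertia_m = km_m.inertia_
# Silhouette score: between -1 and 1, higher means better separated clusters.
silhouette_m = silhouette_score(X_m, km_m.labels_)
# Adjusted Rand index: agreement with the true species, invariant to label permutation.
ari_m = adjusted_rand_score(y_m, km_m.labels_)

print(f"inertia={inertia_m:.2f}  silhouette={silhouette_m:.3f}  ARI={ari_m:.3f}")
```

Inertia always decreases as `n_clusters` grows, so it is mostly useful for comparing models with the same number of clusters or for spotting an "elbow" across several values.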



In [None]:
# Visualising the clusters on the first two features (sepal length, sepal width).
# Cluster indices are arbitrary, so they do not necessarily match the species order.
plt.scatter(X_iris[y_kmeans == 0, 0], X_iris[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 0')
plt.scatter(X_iris[y_kmeans == 1, 0], X_iris[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 1')
plt.scatter(X_iris[y_kmeans == 2, 0], X_iris[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 2')

#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')

plt.legend()

Any observation on the above figure?

Let's visualise the data points and clusters in 3D. That is the best we can do, even though our iris dataset has 4 features.

In [None]:
plot_k_clusters(X_iris, kmeans, 3)

In [None]:
# Plot the ground truth
fig = plt.figure(figsize=(12, 8))
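# Create the 3D axes through the figure: recent Matplotlib releases no longer
# attach an Axes3D(fig) instance to the figure automatically
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=48, azim=134)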

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X_iris[y_iris == label, 3].mean(),
              X_iris[y_iris == label, 0].mean(),
              X_iris[y_iris == label, 2].mean() + 2, name,
              horizontalalignment='center',
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
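# (np.int and np.float were removed from NumPy; use the built-in int and float.
# A separate variable keeps y_iris intact if this cell is run more than once.)
y_colors = np.choose(y_iris.astype(int), [1, 2, 0]).astype(float)
ax.scatter(X_iris[:, 3], X_iris[:, 0], X_iris[:, 2], c=y_colors, edgecolor='k', s=70)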

# Hide the tick labels (the w_xaxis/w_yaxis/w_zaxis aliases are gone in newer Matplotlib)
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12

plt.show()

#### Image Segmentation with K-Means:
- One application of clustering is image segmentation. Here, similar pixels in an image are grouped into the same cluster and each pixel is then painted with the color of its cluster. Segmentation is commonly used, for example, on the camera images of a self-driving car to detect pedestrians, street signs or other objects. More advanced image segmentation can be done with convolutional neural networks, which are popular in deep learning.

In [None]:
import os
from matplotlib.image import imread 

# Load a prepared image. Feel free to use your own image instead:
# put it inside the images folder in the same directory as this notebook.
image = imread(os.path.join("images","awww.png"))
image.shape

This particular image has 4 channels (RGBA); the 4th channel is alpha, the transparency value. Other images may have a single channel (grayscale) or 3 channels (RGB).
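
Because the channel count varies between images, the reshape before clustering can be written without hard-coding it. The helper below is a small sketch; the name `to_pixel_matrix` is hypothetical and not part of this lesson's `utils` module.

```python
import numpy as np

def to_pixel_matrix(image):
    """Flatten an image into a 2D array with one row per pixel and one column per channel."""
    image = np.asarray(image)
    if image.ndim == 2:
        # Grayscale: (height, width) -> (n_pixels, 1)
        return image.reshape(-1, 1)
    if image.ndim == 3:
        # Color: (height, width, channels) -> (n_pixels, channels)
        return image.reshape(-1, image.shape[-1])
    raise ValueError(f"Expected a 2D or 3D image array, got {image.ndim} dimensions")

# Quick check on dummy arrays standing in for grayscale, RGB and RGBA images.
gray_pixels = to_pixel_matrix(np.zeros((4, 5)))
rgb_pixels = to_pixel_matrix(np.zeros((4, 5, 3)))
rgba_pixels = to_pixel_matrix(np.zeros((4, 5, 4)))
print(gray_pixels.shape, rgb_pixels.shape, rgba_pixels.shape)
```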

In [None]:
plt.imshow(image)

In [None]:
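# Flatten the image to one row per pixel with 4 columns (R, G, B, A)
X = image.reshape(-1, 4)
# Cluster the pixel colors into 2 groups
kmeans = KMeans(n_clusters=2).fit(X)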

In [None]:
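# Replace every pixel with the color of its cluster centroid
segmented_image = kmeans.cluster_centers_[kmeans.labels_]
# Restore the original (height, width, channels) layout
segmented_image = segmented_image.reshape(image.shape)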

In [None]:
plt.imshow(segmented_image)
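
The segmentation steps above can be wrapped in a small helper to try different numbers of clusters. This is only a sketch: `segment_image` is a hypothetical name, and a synthetic two-color image is used so that the example does not depend on `awww.png` being available.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_image(image, n_clusters, seed=42):
    """Return a copy of a (height, width, channels) image where every pixel
    is replaced by the color of its k-means cluster centroid."""
    pixels = image.reshape(-1, image.shape[-1])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(pixels)
    return km.cluster_centers_[km.labels_].reshape(image.shape)

# Synthetic 20x20 RGB image: reddish left half, bluish right half, plus a little noise.
rng = np.random.default_rng(0)
demo_image = np.zeros((20, 20, 3))
demo_image[:, :10] = [0.9, 0.1, 0.1]
demo_image[:, 10:] = [0.1, 0.1, 0.9]
demo_image = np.clip(demo_image + rng.normal(0, 0.03, demo_image.shape), 0, 1)

segmented_demo = segment_image(demo_image, n_clusters=2)

# Count how many distinct colors remain after segmentation.
n_colors = len(np.unique(segmented_demo.reshape(-1, 3), axis=0))
print(segmented_demo.shape, n_colors)
```

With a real image, calling the helper with a few values of `n_clusters` and showing each result with `plt.imshow` makes it easy to see how the level of detail changes.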