<a href="https://colab.research.google.com/github/wwillis125/Image-Classification/blob/master/OHAB_IntrotoClustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
<center>
<img src="https://drive.google.com/uc?export=view&id=1i1_Glu8nhZPHh9S5PSq16JIBmligAA_x"  alt="drawing" height="100"/>
<br><br>
</center>

# ONE HOUR AT BOOTCAMP: Intro to Clustering
### Instructor: [Roberto Reif](https://www.linkedin.com/in/robertoreif/)
---



# K-Means Clustering Algorithm

__Purpose:__
This lecture introduces **K-means**, an unsupervised clustering algorithm. We will learn how to run it using sklearn and apply it to cluster the colors in an image.

__At the end of this lecture you will be able to:__
> 1. Understand what **K-means** is and how it works
> 2. Run a K-means algorithm in Python

### K-Means Overview

**K-means** is one of the most basic clustering algorithms. It groups data points by finding cluster centers that minimize the sum of squared distances between each data point and its cluster center.
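To make the idea concrete, here is a minimal NumPy sketch of the classic iterative procedure (often called Lloyd's algorithm): assign every point to its nearest center, move each center to the mean of its points, and repeat. The function name `lloyd_kmeans` and its parameters are hypothetical, chosen for illustration. This is not how sklearn implements `KMeans`, which adds smarter initialization and several optimizations.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy K-means for illustration only (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # final assignment and sum of squared errors for the returned centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse

# two obvious groups of two points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels_demo, centers_demo, sse_demo = lloyd_kmeans(X_demo, k=2)
print(labels_demo, sse_demo)
```

The returned `sse` is the quantity the algorithm tries to make small. It only decreases or stays the same from one iteration to the next, which is why the loop eventually stops.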

In [0]:
#######################
#       imports       #
#######################
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from PIL import Image

import urllib.request
import io

In [0]:
plt.rcParams['figure.figsize'] = [6,6]
plt.rcParams.update({'font.size': 22})

Below we define a helper function that displays data in two dimensions.

In [0]:
# helper function that displays data in 2 dimensions and highlights the clusters
def display_cluster(X,km=None,num_clusters=0):
    color = 'brgcmyk'
    alpha = 0.5
    s = 20
    if num_clusters == 0:
        plt.scatter(X[:,0],X[:,1],c = color[0],alpha = alpha,s = s)
    else:
        for i in range(num_clusters):
            plt.scatter(X[km.labels_==i,0],X[km.labels_==i,1],c = color[i],alpha = alpha,s=s)
            plt.scatter(km.cluster_centers_[i][0],km.cluster_centers_[i][1],c = color[i], marker = 'x', s = 100)

Let's briefly explore the `KMeans` documentation. We will use two arguments: `n_clusters` and `random_state`. The other parameters are beyond the scope of this lecture, and you are encouraged to read through them on your own.


In [0]:
KMeans?

### Cluster starting points
Let's start by creating a simple dataset.

In [0]:
angle = np.linspace(0,2*np.pi,20, endpoint = False)
X = np.append([np.cos(angle)],[np.sin(angle)],0).transpose()
display_cluster(X)

Let's now group this data into three clusters. We will use two different random states to initialize the algorithm.

Clustering with a random state of 42:

In [0]:
k = 3
km = KMeans(n_clusters=k,random_state=42) 
km.fit(X)
display_cluster(X,km,k)

We clustered with a random state of 42! Why 42? We use 42 because it is [The Answer to the Ultimate Question of Life, the Universe, and Everything](https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42))

Clustering with a random state of 41:

In [0]:
km = KMeans(n_clusters=k,random_state=41) 
km.fit(X)
display_cluster(X,km,k)

## Question:

Why are the clusters different when we run K-means with two different random states?



*It's because the starting points of the cluster centers have an impact on where the final clusters lie. The starting point of the clusters is controlled by the random state.*
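The sketch below makes this dependence on the starting points visible. It assumes a single run per seed (`n_init=1`) with purely random starting centers (`init='random'`), and rebuilds the same 20 points on the unit circle. The variable names are illustrative. The inertia values and cluster sizes it prints may or may not differ between seeds, depending on your sklearn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# same 20 points on the unit circle as in the cells above
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X_circle = np.column_stack([np.cos(angle), np.sin(angle)])

results = {}
for seed in range(5):
    # one run per seed, starting from randomly chosen centers
    km_seed = KMeans(n_clusters=3, init='random', n_init=1,
                     random_state=seed).fit(X_circle)
    sizes = np.bincount(km_seed.labels_, minlength=3)
    results[seed] = (km_seed.inertia_, sizes)

for seed, (inertia_val, sizes) in results.items():
    print(f'seed={seed}  inertia={inertia_val:.3f}  cluster sizes={sizes}')
```

A common way to reduce this sensitivity is to raise `n_init`, so that `KMeans` runs several initializations and keeps the run with the lowest inertia.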

### Determining the optimum number of clusters

Let's create a synthetic dataset with two features that visibly contains a few clusters, and then try to group them.

In [0]:
n_samples = 1000
n_bins = 4  
centers = [(-3, -3), (0, 0), (3, 3), (6, 6)]
X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)
display_cluster(X)
pd.DataFrame(X[:5,:],columns = ['x','y'])

How many clusters do you observe?

Let's run K-means with seven clusters.

In [0]:
k = 7
km = KMeans(n_clusters=k,random_state=42)
km.fit(X)
display_cluster(X,km,k)

K-means clustering is one of the simplest clustering algorithms. Two of its limitations are that the result depends on the starting points of the cluster centers, and that the number of clusters needs to be defined beforehand.

Re-run the code above with a different random state!

Now let's re-run the algorithm with four clusters.

In [0]:
k = 4
km = KMeans(n_clusters=k,random_state=42)
km.fit(X)
display_cluster(X,km,k)

Should we use four or seven clusters?  

- In this case it may be visually obvious that four clusters is better than seven
- This is because we can easily view the data in two dimensional space
- However, real world data usually has more than two dimensions
- A dataset with a higher dimensional space is hard to visualize
- A way of solving this is to plot the **inertia** 

**inertia**: sum of squared distances of samples to their closest cluster center

We can extract the inertia of the k-means algorithm using the *inertia_* attribute

In [0]:
km.inertia_
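As a sanity check on the definition above, the inertia can be recomputed by hand from `labels_` and `cluster_centers_`. The sketch below uses a small toy dataset of its own (the names `X_toy` and `km_toy` are illustrative) so that it runs independently of the cells above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# look up the center assigned to each sample, then sum the squared distances
assigned_centers = km_toy.cluster_centers_[km_toy.labels_]
manual_inertia = ((X_toy - assigned_centers) ** 2).sum()

print(manual_inertia, km_toy.inertia_)
```

The two numbers should agree up to floating-point rounding.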

### Problem 1:

Write code that calculates the inertia for 1 to 10 clusters, and plot the inertia as a function of the number of clusters.

In [0]:
### Write code here




Where does the elbow of the curve occur?

What do you think the inertia would be if the number of clusters were equal to the number of data points?

### Clustering Colors from an Image

Let's start by loading an image.

In [0]:
# load the image
URL = 'https://bit.ly/2W2EMEF'
with urllib.request.urlopen(URL) as url:
    f = io.BytesIO(url.read())
img = np.array(Image.open(f))

img = img[150:600,100:750,:] # crops the image

plt.imshow(img)
plt.axis('off');
print(img.shape)

The image above has 450 pixels in height and 650 pixels in width.  Each pixel has 3 values that represent how much red, green and blue it has. Below you can play with different combinations of RGB to create different colors. In total, you can create $256^3 = 16,777,216$ unique colors.

In [0]:
# assign values for the RGB.  Each value should be between 0 and 255
R = 35
G = 95
B = 131

# displays the color
plt.imshow([[np.array([R,G,B]).astype('uint8')]])
plt.axis('off');
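The color count quoted above can be checked directly, and it is useful to compare it with the number of distinct colors an image actually uses. The tiny 2 x 3 image below is made up for illustration. On a real image you would pass the flattened pixel array instead.

```python
import numpy as np

# 256 possible levels for each of the 3 channels
n_possible = 256 ** 3
print(n_possible)  # 16777216

# hypothetical 2 x 3 image that uses only 3 distinct colors
tiny = np.array([[[255, 0, 0], [0, 255, 0], [255, 0, 0]],
                 [[0, 0, 255], [0, 255, 0], [255, 0, 0]]], dtype=np.uint8)

# one row per pixel, then count the unique rows
n_used = len(np.unique(tiny.reshape(-1, 3), axis=0))
print(n_used)  # 3
```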

We can observe the amount of red, green and blue in the image.

In [0]:
title = ['Red','Green','Blue']
plt.figure(figsize=[20,10])

for i in range(3):
  channel = img.copy()*0
  channel[:,:,i] = img[:,:,i].copy()
  plt.subplot(1,3,i+1)
  plt.imshow(channel,vmin=0,vmax=255)
  plt.title(title[i])
  plt.axis('off');

First we will reshape the image into a table with one pixel per row and one column for each of the red, green and blue channels.

In [0]:
img_flat = img.reshape(img.shape[0]*img.shape[1],3)
df = pd.DataFrame(img_flat[:,:],columns=['Red','Green','Blue'])
df.head()
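To see what this reshape does to the pixel order, here is a small sketch on a made-up 2 x 3 image. Pixels are laid out row by row, so the pixel at row `r` and column `c` lands in table row `r * width + c`, and reshaping back to the original shape recovers the image exactly.

```python
import numpy as np

# hypothetical 2 x 3 RGB image; consecutive values make each pixel easy to track
tiny = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

flat = tiny.reshape(-1, 3)            # one row per pixel: shape (6, 3)
restored = flat.reshape(tiny.shape)   # back to height x width x channels

print(flat.shape)
print(np.array_equal(flat[1 * 3 + 1], tiny[1, 1]))  # pixel (1, 1) is table row 4
print(np.array_equal(restored, tiny))
```

Passing `-1` lets NumPy infer the number of rows, which is equivalent to writing `img.shape[0]*img.shape[1]` explicitly.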

Since there are 450x650 pixels we get 292,500 rows! 

In [0]:
img_flat.shape

Let's run K-means with 4 clusters.

In [0]:
k = 4
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(img_flat);

Now let's replace each row with its closest cluster center.

In [0]:
img_flat2 = img_flat.copy() # makes an independent copy so the original array is not modified

# loops for each cluster center
for i in np.unique(kmeans.labels_):
    img_flat2[kmeans.labels_==i,:] = kmeans.cluster_centers_[i]
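The loop above can also be written as a single indexing operation: treat the cluster centers as a small color palette and look up each pixel's color by its label. The sketch below uses random pixels as a stand-in for `img_flat` (the names `pixels`, `km_px` and `palette` are illustrative) and checks that both versions give the same result.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# random stand-in for img_flat: 500 pixels with 3 channels each
pixels = rng.integers(0, 256, size=(500, 3)).astype(np.uint8)

km_px = KMeans(n_clusters=4, n_init=10, random_state=42).fit(pixels)

# loop version, as in the cell above
quant_loop = pixels.copy()
for i in np.unique(km_px.labels_):
    quant_loop[km_px.labels_ == i, :] = km_px.cluster_centers_[i]

# vectorized version: cast the centers to uint8 once, then index by label
palette = km_px.cluster_centers_.astype(np.uint8)
quant_vec = palette[km_px.labels_]

print(np.array_equal(quant_loop, quant_vec))
```

In both versions the floating-point centers are truncated when they are stored as `uint8`, so each quantized channel value may be up to one level below the exact center.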

We now need to reshape the data from 292,500 x 3 to 450 x 650 x 3.

In [0]:
img2 = img_flat2.reshape(img.shape)

plt.figure(figsize=[10,10])
plt.subplot(1,2,1)
plt.imshow(img)
plt.axis('off')
plt.title('Original');

plt.subplot(1,2,2)
plt.imshow(img2)
plt.axis('off')
plt.title(f'k = {k}');

### Problem 2:
Write a function that receives the image and number of clusters (k), and returns (1) the image quantized into k colors, and (2) the inertia.

In [0]:
def image_cluster(img,k):
    ### Write code here
    
    
    return img2, kmeans.inertia_

### Problem 3:

Call the function for k being 1 through 11, and draw an inertia curve.

What is the optimum number of clusters?

In [0]:
### Write code here




Often, the elbow method does not work as expected. There are alternatives such as the [silhouette coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).
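As a rough sketch of how the silhouette coefficient can be used, the code below scores several values of k on a toy dataset with three well-separated blobs and picks the k with the highest mean score. The blob centers and variable names are made up for illustration. On less clearly separated data the maximum may be far less pronounced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# hypothetical toy data: three well-separated blobs
centers_toy = [(-10, -10), (0, 10), (10, -10)]
X_sil, _ = make_blobs(n_samples=300, centers=centers_toy,
                      cluster_std=1.0, random_state=42)

scores = {}
for k_try in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels_try = KMeans(n_clusters=k_try, n_init=10,
                        random_state=42).fit_predict(X_sil)
    scores[k_try] = silhouette_score(X_sil, labels_try)

best_k = max(scores, key=scores.get)
for k_try, score in scores.items():
    print(f'k={k_try}  silhouette={score:.3f}')
print('best k:', best_k)
```

Unlike inertia, which keeps decreasing as k grows, the silhouette score ranges from -1 to 1 and tends to peak when clusters are compact and well separated. Computing it compares pairs of samples, so on something as large as the full image you would likely need to subsample, for example with the `sample_size` argument of `silhouette_score`.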

**NOTE:** normalizing the features can also affect the way the clusters are created.
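The effect of normalization can be illustrated with a made-up dataset in which the informative features have a small numeric range and an uninformative feature has a large one. The sketch below assumes `StandardScaler` as the normalization step, which is one common choice among several. The data and the `agreement` helper are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)  # hypothetical ground-truth groups

# two informative small-scale features plus one large-scale noise feature
X_raw = np.column_stack([
    4.0 * group + rng.normal(0, 0.2, 200),
    4.0 * group + rng.normal(0, 0.2, 200),
    rng.normal(0, 1000, 200),
])
X_scaled = StandardScaler().fit_transform(X_raw)

def agreement(labels, truth):
    """Fraction of matching labels, ignoring which cluster is called 0 or 1."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

agree = {}
for name, data in [('raw', X_raw), ('scaled', X_scaled)]:
    labels_n = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(data)
    agree[name] = agreement(labels_n, group)
    print(name, agree[name])
```

Because K-means relies on Euclidean distance, the feature with the largest numeric range dominates the raw clustering. After scaling, every feature contributes on a comparable footing. For the image example all three channels already share the 0 to 255 range, so scaling matters less there.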

### Problem 4:
Plot in a grid all the images for the different k values.

In [0]:
### Write code here




## Solutions to the problems

### Problem 1:

In [0]:
inertia = []
list_num_clusters = list(range(1,11))
for k in list_num_clusters:
    km = KMeans(n_clusters=k, random_state=42)  # fixed seed so the curve is reproducible
    km.fit(X)
    inertia.append(km.inertia_)
    
plt.plot(list_num_clusters,inertia,'b')
plt.scatter(list_num_clusters,inertia,color ='b')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia');

### Problem 2:

In [0]:
def image_cluster(img,k):
    img_flat = img.reshape(img.shape[0]*img.shape[1],3)
    kmeans = KMeans(n_clusters=k, random_state=0).fit(img_flat)
    img_flat2 = img_flat.copy()

    # loops for each cluster center
    for i in np.unique(kmeans.labels_):
        img_flat2[kmeans.labels_==i,:] = kmeans.cluster_centers_[i]
        
    img2 = img_flat2.reshape(img.shape)

    return img2, kmeans.inertia_

### Problem 3:

In [0]:
k_vals = list(range(1,12,1))
img_list = []
inertia = []
for k in k_vals:
    print(k)
    img2, ine = image_cluster(img,k)
    img_list.append(img2)
    inertia.append(ine)  

In [0]:
plt.plot(k_vals,inertia, 'b')
plt.scatter(k_vals,inertia, color = 'b')
plt.xlabel('k')
plt.ylabel('Inertia');

### Problem 4:

In [0]:
plt.figure(figsize=[20,10])
plt.subplot(3,4,1)
plt.imshow(img)
plt.title('Original')
plt.axis('off');
for i in range(len(k_vals)):
    plt.subplot(3,4,i+2)
    plt.imshow(img_list[i])
    plt.title('k = '+ str(k_vals[i]))
    plt.axis('off');