# K-Means clustering

K-means clustering partitions $n$ observations into $k$ clusters so that each observation belongs to the cluster with the nearest centroid.
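
Stated as an optimization problem, k-means looks for the partition that minimizes the within-cluster sum of squared distances. The symbols $C_j$ (the set of observations assigned to cluster $j$), $x_i$ (an observation), and $\mu_j$ (the centroid of cluster $j$) are introduced here for illustration and do not appear elsewhere in this notebook:

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$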

Algorithm:
- Choose the number of clusters $k$ and randomly initialize $k$ cluster centers
- Assign each point to its closest cluster center
- Move each cluster center to the mean of the points assigned to it
- Repeat the assignment and update steps until the centers stop moving (convergence)
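
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all made up for this example, and initial centers are drawn from the data points themselves as one simple choice of random initialization.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its points
        # (a center that lost all its points is left where it is)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# two well-separated toy groups around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
toy_centers, toy_labels = kmeans_sketch(toy, k=2)
print(toy_centers)
```

The result depends on the random initialization, which is why library implementations usually run several initializations and keep the best one.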

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_circles

In [None]:
# create some synthetic data in clusters
n_samples = 800  
n_clusters = 5
cluster_std = 0.5

centers = [
    [1, 0.7],
    [1.5, 2.5],
    [0, -2],
    [-1.2, -0.3],
    [-0.5,2]
]

X, y = make_blobs(n_samples=n_samples, centers=centers, cluster_std=cluster_std, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20);

In [None]:
# but for unsupervised learning we don't start with any labels (we don't have y)
# so we don't know which cluster a point came from, or even how many clusters there should be
plt.scatter(X[:, 0], X[:, 1], s=20);

Let's use k-means to analyze this data!

`Scikit-learn` provides a built-in `KMeans` estimator for k-means clustering.

In [None]:
from sklearn.cluster import KMeans

In [None]:
# n_clusters must be chosen up front; 7 is a guess (the data was generated with 5 centers)
kmeans = KMeans(n_clusters=7, random_state=0)
kmeans.fit(X)

In [None]:
kmeans

In [None]:
kmeans.cluster_centers_

In [None]:
# plot the data with the fitted cluster centers marked as red crosses
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100);

In [None]:
# use the cluster centers to predict the cluster for each point
y_predicted = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_predicted, s=20);

In [None]:
# actual labels
plt.scatter(X[:, 0], X[:, 1], c=y, s=20);
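
Because the number of clusters is usually unknown, one common heuristic is to fit k-means for several values of $k$ and compare the inertia (the within-cluster sum of squared distances, exposed by scikit-learn as `inertia_`). The sketch below regenerates blobs like the ones above under the name `X_demo`, which is an assumption of this example. The point where the inertia stops dropping sharply (the "elbow") is only a rough guide and is not always clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# regenerate blobs like the ones above (5 true centers)
centers = [[1, 0.7], [1.5, 2.5], [0, -2], [-1.2, -0.3], [-0.5, 2]]
X_demo, _ = make_blobs(n_samples=800, centers=centers, cluster_std=0.5, random_state=42)

# fit k-means for k = 1..9 and record the inertia of each fit
inertias = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```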

## k-Means for Color Compression

One interesting application of clustering is color compression in images. For example, imagine you have an RGB image with potentially 256\*256\*256 (>16 million) colors. In most images, a large number of these colors go unused, and many pixels have similar or even identical colors. K-means clustering can exploit this by replacing each pixel's color with the nearest of a small number of cluster centers.

In [None]:
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);

The image itself is stored in a three-dimensional array of size (height, width, RGB), containing red/green/blue contributions as integers from 0 to 255.

In [None]:
china.shape

One way we can view this set of pixels is as a cloud of points in a three-dimensional color space, with the pixels being the rows and the columns being Red, Green, and Blue. We will reshape the data to `[n_samples, n_features]` and rescale the colors so that they lie between 0 and 1. 

In [None]:
data = china / 255.0 # use 0...1 scale
data = data.reshape(-1, 3)
data.shape
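
To see what the reshape does, here is a small sketch on a made-up 2x2 image (the array `tiny` is hypothetical and stands in for `china`). Each pixel becomes one row with three color features, and reshaping back to the original shape recovers the image.

```python
import numpy as np

# hypothetical 2x2 RGB image: red and green on top, blue and white below
tiny = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# rescale to 0..1 and flatten to one row per pixel
pixels = (tiny / 255.0).reshape(-1, 3)

# undo both steps to get the original image back
restored = (pixels.reshape(tiny.shape) * 255).round().astype(np.uint8)

print(pixels.shape)
print(np.array_equal(restored, tiny))
```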

In [None]:
# it is difficult to identify clusters just from a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0], data[:, 1], data[:, 2], s=1, alpha=0.005);

Now let's reduce these potentially 16 million colors to just 16 colors by running k-means clustering in the pixel color space. The cell below fits the standard `KMeans` estimator on every pixel and times it. For larger images, scikit-learn's `MiniBatchKMeans`, which operates on subsets of the data, typically computes a comparable result much more quickly:

In [None]:
n_colors = 16

from time import time
print("Fitting model on the full image (k-means)")
t0 = time()
kmeans = KMeans(n_clusters=n_colors, random_state=0)  # fixed seed for reproducible colors
kmeans.fit(data)
print(f"done in {time() - t0:0.3f}s.")
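
If fitting on every pixel is too slow, `MiniBatchKMeans` can be swapped in with the same `fit`/`predict` interface. The sketch below is a minimal illustration on synthetic data: `fake_pixels` is a random stand-in for `data`, and the `batch_size` and `n_init` values are arbitrary choices made for this example rather than tuned settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# stand-in for `data`: random RGB values in [0, 1], one row per pixel (an assumption)
rng = np.random.default_rng(0)
fake_pixels = rng.random((20_000, 3))

# each update step uses a random batch of 1024 pixels instead of all of them
mbk = MiniBatchKMeans(n_clusters=16, batch_size=1024, n_init=3, random_state=0)
mbk.fit(fake_pixels)

palette = mbk.cluster_centers_      # (16, 3) array of representative colors
indices = mbk.predict(fake_pixels)  # palette index for every pixel
print(palette.shape, indices.shape)
```

The mini-batch variant trades a small amount of clustering quality for speed, so the resulting palette may differ slightly from the one produced by the full `KMeans` fit.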

In [None]:
# view the 16 RGB colors on the original 0..255 scale (data was divided by 255)
(kmeans.cluster_centers_ * 255).round().astype(int)

In [None]:
# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(data)
print(f"done in {time() - t0:0.3f}s.")


In [None]:
new_colors = kmeans.cluster_centers_[labels]
china_recolored = new_colors.reshape(china.shape)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('k-means Image', size=16);

Some detail is certainly lost in the rightmost panel, but the overall image is still easily recognizable.
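
To get a rough sense of what the reduction buys, the sketch below counts uncompressed bits only: 24 bits per pixel for the original, versus a 4-bit palette index per pixel plus the 16-entry color table. The helper `palette_size_bits` is hypothetical, the image size is assumed to be the 427 x 640 reported by `china.shape`, and real file formats such as JPEG or PNG apply further compression that this ignores.

```python
import math

def palette_size_bits(height, width, n_colors, bits_per_channel=8):
    """Bits needed to store a color palette plus one palette index per pixel."""
    index_bits = math.ceil(math.log2(n_colors))      # 16 colors -> 4 bits per pixel
    palette_bits = n_colors * 3 * bits_per_channel   # the color table itself
    return height * width * index_bits + palette_bits

height, width = 427, 640  # assumed: the shape reported by china.shape
raw_bits = height * width * 3 * 8
compressed_bits = palette_size_bits(height, width, n_colors=16)
print(f"raw: {raw_bits} bits, palette: {compressed_bits} bits, "
      f"ratio: {raw_bits / compressed_bits:.2f}")
```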

In [None]:
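# for comparison: naive uniform quantization, rounding each channel to a multiple of 128
china2 = (china / 128.0).round() * 128.0
# rounding can produce 256, so clip to the valid 0..255 range (3 levels per channel)
china2 = np.clip(china2, 0, 255).astype(np.uint8)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china2)
ax[1].set_title('Quantized Image', size=16);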