< [Distance and Similarity](../ica06/Distance_and_Similarity.ipynb) | Contents (TODO) |  [Neural Networks](../ica08/Neural_Networks.ipynb) >

<a href="https://colab.research.google.com/github/stephenbaek/bigdata/blob/master/in-class-assignments/ica07/Cluster_Analysis.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

# k-means Clustering and Lloyd's Algorithm

Clustering algorithms are a category of unsupervised learning algorithms that seek a good grouping of data points. One of the most widely used is *k-means clustering*. Finding the globally optimal k-means clustering is an NP-hard problem in general, so no efficient exact algorithm is known. Luckily, a method called *Lloyd's algorithm* converges to a local minimum of the k-means objective (the within-cluster sum of squared distances), though not necessarily the global minimum, and it is quite useful in many cases.
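
For reference, the objective that k-means tries to minimize can be written as below. The notation is introduced here for illustration and is not used elsewhere in this notebook: $x_i$ is the $i$-th data point, $\mu_j$ is the centroid of cluster $j$, and $c_i \in \{1, \dots, k\}$ is the cluster assigned to $x_i$.

$$
J(c, \mu) = \sum_{i=1}^{N} \lVert x_i - \mu_{c_i} \rVert^2
$$

Lloyd's algorithm alternates between two steps. With the centroids $\mu$ held fixed, $J$ is minimized by assigning each point to its nearest centroid. With the assignments $c$ held fixed, $J$ is minimized by moving each centroid to the mean of its assigned points. Neither step can increase $J$, which is why the iteration settles down, but the point where it settles depends on the initial centroids.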

To begin, we generate some simulated data samples using the `make_blobs` function available in Scikit-Learn.

In [0]:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

N = 1000

x, y = make_blobs(n_samples=N, centers=3, cluster_std=0.5, random_state=0)
plt.scatter(x[:, 0], x[:, 1]);

As the name indicates, the k-means clustering algorithm seeks to find a '*mean*' or '*centroid*' for each of the k clusters. Here, we define three centroids that are randomly initialized.

In [0]:
import numpy as np

K = 3    # user-defined parameter k

centroids = np.random.uniform(-3, 3, size=(K, 2))
plt.scatter(x[:, 0], x[:, 1]);
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=300, alpha=0.5);
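
Drawing centroids uniformly from a fixed box can place a centroid far from every data point. A common alternative is to pick k distinct data points as the starting centroids. The sketch below assumes that strategy; the helper name `init_centroids_from_data`, its `seed` parameter, and the small `demo_points` array are illustrative and not part of this notebook.

```python
import numpy as np

def init_centroids_from_data(points, k, seed=0):
    """Pick k distinct rows of `points` as initial centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sample k different row indices without replacement.
    idx = rng.choice(len(points), size=k, replace=False)
    # astype returns a new array, so later in-place updates leave the data untouched.
    return points[idx].astype(float)

# Small stand-in data set; in the notebook you would pass `x` and `K` instead.
demo_points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
demo_centroids = init_centroids_from_data(demo_points, k=3)
print(demo_centroids)
```

Because every starting centroid coincides with a data point, each cluster has at least one member after the first membership update, provided the chosen points are not exact duplicates of one another.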

With the centroids initialized as above, we can now evaluate the cluster membership of each data point: a point belongs to the cluster whose centroid is nearest to it.

In [0]:
y_pred = -np.ones(N)
for i in range(N):
  d = np.zeros(K)
  for j in range(K):
    d[j] = np.sqrt(np.sum((x[i] - centroids[j])**2))
  y_pred[i] = np.argmin(d)

  
plt.scatter(x[:, 0], x[:, 1], c = y_pred);
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=300, alpha=0.5);
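
The double loop above can also be expressed with NumPy broadcasting, which is usually much faster for large N. The following is a minimal sketch of that idea, not the required solution to the assignment further down; the function name `nearest_centroid` and the `demo_*` arrays are made up for illustration.

```python
import numpy as np

def nearest_centroid(points, centers):
    """Return, for each point, the index of the closest center (hypothetical helper)."""
    # diff[i, j] is the vector from center j to point i; shape (N, K, D).
    diff = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Squared Euclidean distances, shape (N, K).
    sq_dist = np.sum(diff ** 2, axis=2)
    # Index of the smallest distance in each row, shape (N,).
    return np.argmin(sq_dist, axis=1)

demo_points = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [3.9, 4.2]])
demo_centers = np.array([[0.0, 0.0], [4.0, 4.0]])
demo_labels = nearest_centroid(demo_points, demo_centers)
print(demo_labels)  # [0 0 1 1]
```

The square root is omitted because it does not change which distance is the smallest. Note that this version returns integer labels, whereas the loop above stores them in a float array.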

Now, based on the cluster membership, we update the position of each centroid: it moves from its randomly initialized position to the actual center (the mean) of the points assigned to it. The code below does the update:

In [0]:
for i in range(K):
  centroids[i] = [0, 0]
  
for i in range(N):
  centroids[ int(y_pred[i]) ] += x[i]

for i in range(K):
  centroids[i] /= np.sum(y_pred == i)

plt.scatter(x[:, 0], x[:, 1], c = y_pred);
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=300, alpha=0.5);
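
One caveat about the update above: if no point is assigned to cluster `i`, then `np.sum(y_pred == i)` is zero and the division produces NaN coordinates (with a runtime warning). The sketch below shows one possible way to guard against that, assuming we simply leave an empty cluster's centroid where it was. That policy, the function name `mean_per_cluster`, and the `demo_*` arrays are illustrative choices rather than part of this notebook.

```python
import numpy as np

def mean_per_cluster(points, labels, old_centers):
    """Move each center to the mean of its assigned points (hypothetical helper).

    A center with no assigned points is left unchanged, which is one
    possible policy among several.
    """
    new_centers = old_centers.astype(float)  # astype returns a copy
    for j in range(len(old_centers)):
        members = points[labels == j]        # rows assigned to cluster j
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    return new_centers

demo_points = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0]])
demo_labels = np.array([0, 0, 1])            # no point is assigned to cluster 2
demo_centers = np.array([[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]])
demo_new = mean_per_cluster(demo_points, demo_labels, demo_centers)
print(demo_new)
```

In this example the first two centroids move to (1, 0) and (4, 4), while the third stays at (9, 9) because its cluster is empty. Other reasonable policies include re-initializing an empty cluster's centroid at a randomly chosen data point.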

Moving the centroids changes which centroid is nearest to some of the points, so the group membership changes as well. We therefore copy the group membership code from a few cells above and reuse it below. Notice the updated membership.

In [0]:
y_pred = -np.ones(N)
for i in range(N):
  d = np.zeros(K)
  for j in range(K):
    d[j] = np.sqrt(np.sum((x[i] - centroids[j])**2))
  y_pred[i] = np.argmin(d)

  
plt.scatter(x[:, 0], x[:, 1], c = y_pred);
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=300, alpha=0.5);

Again, the change in group membership requires an update of the centroid locations. As before, we copy and paste exactly the same code we used earlier.

In [0]:
for i in range(K):
  centroids[i] = [0, 0]
  
for i in range(N):
  centroids[ int(y_pred[i]) ] += x[i]

for i in range(K):
  centroids[i] /= np.sum(y_pred == i)

plt.scatter(x[:, 0], x[:, 1], c = y_pred);
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=300, alpha=0.5);

You may now notice that the clusters are being updated and that the algorithm begins to group the data correctly. The k-means clustering algorithm (Lloyd's algorithm) is simply a repetition of the membership update and the centroid update, back and forth. We may therefore benefit from modularizing the above code cells into functions:

In [0]:
# Assignment: Implement functions to modularize the above steps of Lloyd's algorithm.
def update_membership(points, centers):
  # YOUR CODE HERE
  return clusters

def update_centroids(points, clusters):
  # YOUR CODE HERE
  return centers

def plot_clusters(points, clusters, centers):
  plt.scatter(points[:, 0], points[:, 1], c = clusters);
  plt.scatter(centers[:, 0], centers[:, 1], c='black', s=300, alpha=0.5);


Now, with the functions defined above, we can run the cell below multiple times (just hit the play button again and again) to carry out Lloyd's algorithm step by step. Each time you run it, see how the clusters update.

In [0]:
y_pred = update_membership(x, centroids)
centroids = update_centroids(x, y_pred)
plot_clusters(x, y_pred, centroids)
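
Besides looking at the plot, it can help to track a single number from one run to the next. The sketch below computes the within-cluster sum of squared distances, the quantity that k-means tries to minimize. The function name `within_cluster_ss` and the `demo_*` arrays are hypothetical and used only for illustration.

```python
import numpy as np

def within_cluster_ss(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    # Look up the center assigned to every point; shape (N, D).
    # astype(int) also accepts float labels such as the notebook's y_pred.
    assigned = centers[labels.astype(int)]
    return float(np.sum((points - assigned) ** 2))

demo_points = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0]])
demo_labels = np.array([0, 0, 1])
demo_centers = np.array([[1.0, 0.0], [4.0, 4.0]])
demo_cost = within_cluster_ss(demo_points, demo_labels, demo_centers)
print(demo_cost)  # 2.0
```

In this notebook, one could call `within_cluster_ss(x, y_pred, centroids)` after each run of the cell above. With correctly implemented update functions, the value should never increase from one run to the next.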

Finally, we need just one more component: a criterion for deciding when to terminate the iteration.

**Assignment** Search online for the convergence criteria of Lloyd's algorithm. Implement a function named `kmeans(points, k)` that repeatedly calls the `update_membership` and `update_centroids` functions above. In the implementation, let the function determine when convergence has been achieved and terminate.

