# Section 09: K-Means Clustering

## 1

Load the `wine` dataset from `sklearn`. Scale it with `StandardScaler` and store the scaled data. Make a scatterplot of `color_intensity` against `alcohol` with points colored by class.
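
As a starting point, the loading and scaling steps might look like the sketch below. The variable names (`X_scaled`, `y_wine`, `alcohol`, `color_intensity`) are illustrative assumptions, not names the exercise requires, and the plotting is left to you.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the wine data: 13 numeric features and a 3-class target.
wine = load_wine()
X_raw, y_wine = wine.data, wine.target
feature_names = list(wine.feature_names)

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_raw)

# Pull out the two columns used in the scatterplot (hypothetical variable names).
alcohol = X_scaled[:, feature_names.index("alcohol")]
color_intensity = X_scaled[:, feature_names.index("color_intensity")]

print(X_scaled.shape)
```

The two extracted columns can then be passed to a scatterplot call, with `y_wine` supplying the colors.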

## 2

Run K-Means Clustering for each number of clusters from 1 to 20 and make an elbow plot to determine the number of clusters to use. Does the number you would select from the elbow plot match the number of classes in the original dataset?
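
The quantity usually plotted in an elbow plot is the within-cluster sum of squared errors, which `sklearn` exposes as the fitted model's `inertia_` attribute. The sketch below collects it for a range of cluster counts on synthetic stand-in data (an assumption made here so the answer for the wine data is not given away); the names `X_demo`, `ks`, and `sse` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption); swap in your scaled wine data.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)

ks = range(1, 11)
sse = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the sum of squared distances of samples to their nearest centroid.
    sse.append(model.inertia_)

for k, value in zip(ks, sse):
    print(k, round(value, 1))
```

Plotting `sse` against `ks` gives the elbow plot; the "elbow" is the point after which adding clusters reduces the SSE only slightly.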

## 3

Use the number of clusters you determined in the previous question to run `sklearn`'s K-Means Clustering with `random_state=2`. Then do the following:

- Print the SSE for the K-Means Model.
- Plot the predictions in a scatterplot. 
- Plot the center of each K-means cluster (read the documentation to understand what attribute to use). 
- Overlay the points whose predicted cluster differs from their actual class, using a different marker shape.
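
One subtlety in the last item: K-Means cluster indices are arbitrary, so cluster `0` need not correspond to class `0`. A minimal sketch of one way to handle this, assuming a simple majority-vote mapping (the helper `align_labels` is hypothetical, not part of `sklearn`), is:

```python
import numpy as np

def align_labels(y_true, y_pred):
    """Relabel each cluster with the most common true class among its members."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    aligned = np.empty_like(y_true)
    for cluster in np.unique(y_pred):
        members = y_pred == cluster
        classes, counts = np.unique(y_true[members], return_counts=True)
        aligned[members] = classes[np.argmax(counts)]
    return aligned

# Tiny made-up example: the clustering is nearly right but uses different indices.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 1, 0, 0, 0, 1, 1, 1])

aligned = align_labels(y_true, y_pred)
mismatch = aligned != y_true  # boolean mask of discrepant points

print(aligned)
print(np.flatnonzero(mismatch))
```

The boolean mask `mismatch` can be used to index the data and draw only the discrepant points with a second scatter call. Majority voting can map two clusters to the same class; that is acceptable for a quick visual check but is a limitation of this sketch.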

## 4

For this question, we use a toy dataset generated with `datasets.make_blobs`. The goal is to implement the K-Means algorithm manually and compare the result with the one obtained from `sklearn.cluster.KMeans`.

Fill in the K-Means function below, then run the following cell to see how your function compares to `sklearn`'s. Write your function as follows:

1. Initialize an empty dictionary with k keys to store cluster data.
2. For each datapoint, calculate its Euclidean distance to each centroid. Assign each point to the cluster whose centroid is nearest.
3. When all points have been assigned, recalculate the positions of the k centroids by taking the component-wise mean of the assigned datapoints. For example, if centroid C1 has points $((1,2),(2,3),(1,3))$ assigned to it, the updated position of C1 would be $\big(\frac{1+2+1}{3}, \frac{2+3+3}{3}\big) \approx (1.33, 2.67)$.
4. Repeat steps 2 and 3 until a stopping criterion is met: the centroids move by less than `tol` (measured by the Frobenius norm) or `max_iter` iterations have been run.
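
To make steps 2 and 3 concrete, the sketch below runs a single assignment-and-update pass in NumPy on a made-up set of five points, three of which are the points from the C1 example above. It is not the full function: there is no loop, no cluster dictionary, and no handling of empty clusters, and the names `points`, `labels`, and `shift` are illustrative.

```python
import numpy as np

# Five made-up points; the first three are the C1 example from step 3.
points = np.array([[1.0, 2.0], [2.0, 3.0], [1.0, 3.0], [8.0, 8.0], [9.0, 9.0]])
centroids = np.array([[1.0, 1.0], [10.0, 10.0]])

# Step 2: Euclidean distance from every point to every centroid, shape (n_points, k),
# then assign each point to the index of its nearest centroid.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Step 3: each new centroid is the component-wise mean of its assigned points.
new_centroids = np.array(
    [points[labels == j].mean(axis=0) for j in range(len(centroids))]
)

# Stopping check: Frobenius norm of the change in the centroid matrix.
shift = np.linalg.norm(new_centroids - centroids)

print(labels)
print(new_centroids)
print(shift)
```

In the full algorithm this pass would be repeated, replacing `centroids` with `new_centroids` each time, until `shift` falls below `tol` or `max_iter` is reached.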

In [1]:
# X: Data.
# k: Number of means to fit.
# init_centroids: Initial cluster centers.
# tol: Tolerance for difference between old and new cluster-center estimates as measured by Frobenius norm.
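# max_iter: Maximum number of iterations.
# Returns:
#   clusters: dict mapping each cluster index 0..k-1 to a list of its points (each point a Python list),
#             as expected by the comparison cell below.
#   centroids: array of shape (k, n_features) with the final cluster centers.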
def K_Means_Algorithm(X, k, init_centroids, tol, max_iter):
    
    ### YOUR CODE HERE.

    return clusters, centroids

In [None]:
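import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler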

# Initialize parameters, load data.
k = 4
init_centroids  = np.array([[-1.5, -1.0], [0.5, -1.0], [0.0, 1.0], [1.0, 1.0]])
n_samples = 100
X, y = datasets.make_blobs(n_samples=n_samples, centers=4, cluster_std=1.5, random_state = 99)
scaler = StandardScaler()
X = scaler.fit_transform(X)
colors = ListedColormap(['red', 'purple', 'green', 'blue'])

# Apply your K-Means Clustering
clusters, centroids = K_Means_Algorithm(X, k, init_centroids, tol=1e-4, max_iter=300)
# Recover a label for each point by finding which cluster's point list contains it.
y_pred = np.zeros(n_samples, dtype=int)
for i in range(n_samples):
    for j in range(k):
        if X[i,:].tolist() in clusters[j]:
            y_pred[i] = j

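# Apply sklearn's K-Means Clustering. n_init=1 because explicit initial centroids are given;
# algorithm='lloyd' is the classic algorithm (it was named 'full' before scikit-learn 1.1).
kmeans = KMeans(verbose=1, n_clusters=k, init=init_centroids, n_init=1, tol=1e-4, max_iter=300, algorithm='lloyd').fit(X)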

# Visualize 
plt.figure(figsize = (16,4))
ax1  = plt.subplot(1,3,1)
ax1.scatter(X[:,0],X[:,1], c = y, cmap=ListedColormap(['black', 'gray', 'pink', 'magenta']))
ax1.set_title('Actual Data')

ax2  = plt.subplot(1,3,2)
ax2.scatter(X[:,0],X[:,1], c=y_pred, cmap=colors)
ax2.scatter(centroids[:,0],centroids[:,1],c='k',marker='o',s=300,alpha=0.5,edgecolor='none')
ax2.set_title('Manual K-Means')

ax3  = plt.subplot(1,3,3)
ax3.scatter(X[:,0],X[:,1], c = kmeans.labels_, cmap=colors)
C = kmeans.cluster_centers_
ax3.scatter(C[:,0],C[:,1],c='k',marker='o',s=300,alpha=0.5,edgecolor='none')
ax3.set_title('Sklearn K-Means')

plt.show()

print('My centroids:\n', centroids)
print('sklearn\'s centroids:\n', C)