## PCA and K-means clustering with MNIST data

In today's in-class exercise, we'll project MNIST data onto a few principal components and check whether K-means clustering can group images of the same digit together. Let's first load the MNIST data as before.

### Add MNIST dataset to the notebook
* First of all, click on "+ Add data" in the upper right corner, search for 'mnist npz', then click on the first one.
* Run the first cell, which gives you the path of the MNIST dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Tip: If you want to look up what you did in a previous exercise or lab, expand the menu on the left-hand side and click on 'Your Work'.

In [None]:
# Load the mnist dataset
mnist = np.load('/kaggle/input/mnist-numpy/mnist.npz')
#print what's inside the npz file
print(mnist.files)

For the In-class exercise we'll focus on training data, and we'll use only 0's and 1's as before. Please pick 0's and 1's from the MNIST data as before.
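
As a reminder of how boolean-mask selection works, here is a minimal sketch on a made-up toy array. The names `labels`, `images`, and `mask` are illustrative only and are not part of the notebook.

```python
import numpy as np

# Toy stand-ins for mnist['y_train'] and mnist['x_train'] (made-up values)
labels = np.array([5, 0, 4, 1, 9, 1, 0])
images = np.arange(7 * 2 * 2).reshape(7, 2, 2)  # seven fake 2x2 "images"

# Boolean mask: True where the digit is 0 or 1
mask = labels <= 1

# Indexing with the mask keeps only the entries where mask is True
selected_images = images[mask]
selected_labels = labels[mask]

print(mask)
print(selected_labels)
print(selected_images.shape)
```

The same mask is applied to both arrays so that each selected image stays paired with its label.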

In [None]:
# Choose 0's and 1's for x_train and y_train
# Remember to normalize x_train by 255 (make the range from 0 to 1 for each pixel)
x_train = mnist['x_train'][mnist['y_train']<=1]/255.
y_train = mnist['y_train'][mnist['y_train']<=1]
print('x_train shape:', x_train.shape)

Now, instead of trying to construct features (like the average pixel density) from the input, we simply treat the numbers associated with each pixel as a feature, i.e. each image has 28x28 = 784 features. It's a lot, no? So we will use only the first 3000 samples as the input for the rest of the exercise.

In the next cell, take the first 3000 samples from x_train and y_train, and use **reshape** to transform the shape into (3000, 784) for x_train.

[Reference for np.reshape](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)
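
To illustrate what the reshape does, here is a small sketch on fake data (three blank 28x28 "images", not MNIST). The variable names are ours. It also shows that passing `-1` lets NumPy infer the flattened size.

```python
import numpy as np

# Three fake 28x28 images filled with zeros (toy data, not MNIST)
batch = np.zeros((3, 28, 28))
batch[0, 2, 5] = 1.0  # mark one pixel in the first image: row 2, column 5

# Explicit target shape, as asked for in the exercise
flat_explicit = np.reshape(batch, (3, 784))

# Using -1 lets NumPy infer the second dimension (28*28 = 784)
flat_inferred = batch.reshape(len(batch), -1)

print(flat_explicit.shape, flat_inferred.shape)

# With the default row-major order, pixel (row, col) lands at index row*28 + col
print(flat_explicit[0, 2 * 28 + 5])
```

Each row of the flattened array is one image, so the number of samples is unchanged.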

In [None]:
# Take the first 3000 samples
# For each sample, reshape the 28*28 image into a 1D array of 28*28=784 entries
x_train = np.reshape(x_train[:3000],(3000,784))
y_train = y_train[:3000]
print('x_train shape after reshape:', x_train.shape)

Now we use PCA to reduce the dimension of the feature space from 784 to 2, and see how well we can separate 0's and 1's.
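
To make "reducing the dimension" concrete, here is a rough NumPy sketch of the idea behind PCA on made-up data: center the features, then project onto the directions of largest variance obtained from an SVD. This is an illustration under those assumptions, not a replacement for scikit-learn's `PCA`; the random toy data and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples with 5 features, built so that most of the
# variance lies along 2 hidden directions plus a little noise
hidden = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + 0.1 * rng.normal(size=(200, 5))

# Step 1: center each feature at zero
X_centered = X - X.mean(axis=0)

# Step 2: SVD; the rows of Vt are the principal directions,
# ordered from largest to smallest singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: keep the first 2 directions and project the data onto them
projected = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each kept component
explained_ratio = (S[:2] ** 2) / np.sum(S ** 2)

print(projected.shape)
print(explained_ratio)
```

In this toy case two components capture nearly all of the variance because the data were constructed that way; for real images the captured fraction is typically smaller.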

In [None]:
from sklearn.decomposition import PCA

# Try PCA with 2-component (reduction of dimension)
pca = PCA(n_components=2)
PCA_components = pca.fit_transform(x_train)

# Plot the output from the 2-component PCA
fig = plt.figure(figsize=(10,6), dpi=80)  
plt.scatter(PCA_components[:,0][y_train==0], PCA_components[:,1][y_train==0], alpha=.5, color='y', label='True 0')
plt.scatter(PCA_components[:,0][y_train==1], PCA_components[:,1][y_train==1], alpha=.5, color='g', label='True 1')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend()
plt.show()

Looks like PCA has done a good job separating 0's and 1's!

In the following, we pretend that we do not know the data come from 0's and 1's (imagine you are given the plot above with the same color for all points). We'll apply K-means and see if it can identify the correct two clusters. Note that the code below fits K-means on the full 784-feature `x_train` rather than on the two PCA components; the PCA plane is used only to visualize the result.

The output of K-means is a 1D array called **labels**, with one number for each sample. Samples with the same label are identified as belonging to the same cluster.
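
As a quick illustration of what the labels look like, here is a sketch on two made-up, well-separated 2D blobs (toy data and variable names are ours, not the MNIST samples). It also shows that the cluster numbers themselves carry no meaning: which blob gets label 0 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated toy blobs in 2D, 50 points each (made-up data)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
toy_labels = toy_kmeans.labels_

print(toy_labels.shape)       # one label per sample
print(np.unique(toy_labels))  # the cluster numbers that appear

# All points of the first blob share one label, but whether that
# label is 0 or 1 is arbitrary
print(toy_labels[:50].min() == toy_labels[:50].max())
```

Because the numbering is arbitrary, comparing K-means labels directly with true digit labels requires first checking which cluster corresponds to which digit.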

In [None]:
from sklearn.cluster import KMeans

# Try K-means with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(x_train)
kmean_output = kmeans.labels_
print('KMeans labels shape:', kmean_output.shape)
# Take a look at the first five entries of labels
print('KMeans labels look like:', kmean_output[0:5])

In the next cell, please plot the output from K-means in the plane of the PCA components, separating samples with different labels (use two different colors for the two labels). What do you find?

In [None]:
# Plot the output clusters by KMeans
fig = plt.figure(figsize=(10,6), dpi=80)  
plt.scatter(PCA_components[:,0][kmean_output==0], PCA_components[:,1][kmean_output==0], alpha=.5, color='y', label='Cluster 0')
plt.scatter(PCA_components[:,0][kmean_output==1], PCA_components[:,1][kmean_output==1], alpha=.5, color='g', label='Cluster 1')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend()
plt.show()

Looks like we *almost* get the same clusters as before! Well, almost.

Please make a plot highlighting which points are grouped into the 'wrong' cluster. Does K-means make such mistakes 'as expected'?
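
One possible way to find the 'wrong' points is sketched below on made-up arrays (the names `true_digits` and `cluster_ids` are hypothetical stand-ins for `y_train` and the K-means labels). Since K-means cluster numbers are arbitrary, the sketch first checks whether the numbering is swapped relative to the digits.

```python
import numpy as np

# Toy stand-ins for y_train and the K-means labels (made-up values)
true_digits = np.array([0, 0, 0, 1, 1, 1, 1, 0])
cluster_ids = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # cluster 1 mostly holds the 0's here

# If fewer than half the labels agree, the cluster numbering is
# swapped relative to the digits, so flip 0 <-> 1
agreement = np.mean(cluster_ids == true_digits)
aligned_ids = cluster_ids if agreement >= 0.5 else 1 - cluster_ids

# Boolean mask of samples that ended up in the 'wrong' cluster
is_wrong = aligned_ids != true_digits

print('agreement before aligning:', agreement)
print('number of wrong points:', is_wrong.sum())
print('indices of wrong points:', np.flatnonzero(is_wrong))
```

The flip `1 - cluster_ids` only works for two clusters; with more clusters a proper matching between clusters and classes would be needed.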

In [None]:
# Make a plot highlighting points that are grouped to the 'wrong' cluster by K-means using any method you like
# K-means cluster numbers are arbitrary, so cluster 0 need not correspond to digit 0.
# If fewer than half of the labels agree with y_train, flip 0 <-> 1 before comparing.
aligned = kmean_output if np.mean(kmean_output == y_train) >= 0.5 else 1 - kmean_output
wrong = aligned != y_train

fig = plt.figure(figsize=(10,6), dpi=80)
plt.scatter(PCA_components[:,0][~wrong], PCA_components[:,1][~wrong], alpha=.1, color='black', label='right')
plt.scatter(PCA_components[:,0][wrong], PCA_components[:,1][wrong], alpha=.5, color='r', label='wrong')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend()
plt.show()

Finally, we'll show how to make an 'elbow plot' to find the optimal k (number of clusters). We'd expect to see the turning point at k=2. Will that be the case?

The metric we'll use is the attribute **inertia_** of a fitted Scikit-learn K-means model, which is the sum of squared distances of samples to their closest cluster center.
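
To make the definition concrete, the sketch below recomputes the inertia by hand on made-up 2D data and compares it with the value reported by the fitted model. The toy data and variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))  # made-up 2D data

toy_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up the center assigned to each sample, then sum the squared
# distances between every sample and its assigned center
assigned_centers = toy_model.cluster_centers_[toy_model.labels_]
manual_inertia = np.sum((points - assigned_centers) ** 2)

print(toy_model.inertia_, manual_inertia)
```

Inertia can only decrease as k grows, which is why the elbow plot looks for the point where the decrease slows down rather than for a minimum.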

In [None]:
# Elbow plot
ks = range(1, 6)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k, random_state=0)  # fixed seed so the plot is reproducible
    # Fit model to samples
    model.fit(x_train)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

    
fig = plt.figure(figsize=(6,6), dpi=80)
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
