In [None]:
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA, KernelPCA
import sklearn
import matplotlib.pyplot as plt

# Introduction

In this notebook we are going to learn about denoising images using PCA. We begin with a simple example of noisy linear observations. We then introduce the Fashion-MNIST data-set, where we will take some images and artificially add noise. You must then submit a set of exercises related to denoising the images using PCA and Kernel PCA.

To understand how PCA can help us denoise, consider the following plot:

In [None]:
# original data
x = np.linspace(0, 1, 101)
y = 2 * x

# add noise to data
y_noisy = y + np.random.normal(0, 0.2, size = x.shape)

fig, ax = plt.subplots(nrows = 1, ncols = 1)
fig.set_figheight(5)
fig.set_figwidth(10)

ax.scatter(x, y, color = 'r', label = 'original data')
ax.scatter(x, y_noisy, label = 'noisy data')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()

The plot above shows simple linear data, corrupted by a small amount of Gaussian noise. Can you see why a 1-dimensional representation of the data can help in getting rid of the noise? If we project every noisy data point onto the line $y = 2x$, we remove the part of the noise that is perpendicular to the line, which moves the points back towards our original data. This is the idea behind PCA denoising. This is a simple example, so we are going to explore a more interesting one:
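
As a minimal sketch of this idea, the cell below fits a one-component PCA to the noisy points and reconstructs them from that single component. The fixed seed and the names `rng`, `clean`, `noisy`, `pca_line` and `denoised` are illustrative assumptions and not part of the original notebook. Note that PCA estimates the line from the noisy data itself and keeps any noise that lies along the line, so the reconstruction should end up closer to the original data without matching it exactly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed (assumption) so the run is repeatable

# same toy data as above: points on y = 2x with Gaussian noise added to y
x = np.linspace(0, 1, 101)
y = 2 * x
y_noisy = y + rng.normal(0, 0.2, size=x.shape)

clean = np.column_stack([x, y])
noisy = np.column_stack([x, y_noisy])

# keep a single principal component, then map back to 2 dimensions
pca_line = PCA(n_components=1)
denoised = pca_line.inverse_transform(pca_line.fit_transform(noisy))

# mean squared distance to the clean points, before and after
mse_noisy = np.mean(np.sum((noisy - clean) ** 2, axis=1))
mse_denoised = np.mean(np.sum((denoised - clean) ** 2, axis=1))
print(f"noisy: {mse_noisy:.4f}, denoised: {mse_denoised:.4f}")
```

All reconstructed points lie on one straight line, the first principal direction, which is why the perpendicular part of the noise disappears.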

# Denoising Fashion-MNIST

We will now try to denoise data from the Fashion-MNIST data-set. This data-set is analogous to the MNIST handwritten digits set, but it is made up of pictures of different types of clothing. It has 10 different labels: t-shirt (0), trouser (1), pullover (2), dress (3), coat (4), sandal (5), shirt (6), sneaker (7), bag (8), ankle boot (9). The code below loads 5022 images from the data-set and splits them into a training set and a test set of 20 images. We will then visualize the test set.

In [None]:
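# load the CSV; row 0 is skipped in every slice below
# (it is assumed to be the CSV header, which genfromtxt reads as NaN)
X = np.genfromtxt('fashion-mnist_train.csv', delimiter=',')

# the first column holds the labels, the remaining 784 columns hold the pixels
Y = X[1:, 0]
# the first 20 images form the test set, the rest form the training set
X_train = X[21:, 1:]
X_test = X[1:21, 1:]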

In [None]:
def plot_images(X, title):
    # plot up to 20 flattened 28x28 images on a 5x4 grid
    fig, axs = plt.subplots(nrows=5, ncols=4, figsize=(8, 8))
    for img, ax in zip(X, axs.ravel()):
        ax.imshow(img.reshape((28, 28)), cmap="Greys")
        ax.axis("off")
    fig.suptitle(title, fontsize=12)

plot_images(X_test, 'Fashion-MNIST')

We will artificially add noise to the images and display them:

In [None]:
X_noisy = X_test + np.random.normal(loc = 0, scale = 40, size = X_test.shape)

plot_images(X_noisy, 'Noisy Images')

To denoise, we will use the training set to *learn* a low-dimensional space that represents our data, which can be used to remove noise from *similar* images. We can see an example here: https://scikit-learn.org/stable/auto_examples/applications/plot_digits_denoising.html where a data-set of uncorrupted digits is used to remove the noise from corrupted digits.

Note that we treat each image as a flat vector with 784 features. Ideally we want to take more advantage of the known structure in the images. For example, when flattening the image, we lose a lot of spatial structure (a pixel is closely related to all those around it). As an example, the following paper: https://www.researchgate.net/publication/267228169_PCA_based_image_denoising adds filters to the process to achieve much better results.
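
To make the loss of spatial structure concrete, here is a small sketch of where neighbouring pixels end up after flattening. It assumes row-major flattening, which is NumPy's default and matches the `reshape((28, 28))` call in `plot_images`. The names `side`, `image`, `flat`, `idx`, `right` and `below` are illustrative only.

```python
import numpy as np

side = 28
# a 28x28 "image" in which every pixel stores its own flat index
image = np.arange(side * side).reshape(side, side)
flat = image.ravel()

row, col = 10, 10
idx = row * side + col         # flat position of pixel (row, col)
right = image[row, col + 1]    # horizontal neighbour
below = image[row + 1, col]    # vertical neighbour

print(idx, right - idx, below - idx)
```

The horizontal neighbour stays adjacent in the flat vector, but the vertical neighbour lands 28 positions away. PCA has no built-in knowledge that these two features are next to each other in the image.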

Other examples of exploiting image structure will be introduced in Module 21 (Convolutional Neural Networks).

# Exercises

The following code will train a PCA model on the training set. It then uses the learned principal components to project the noisy data into a lower dimensional representation, and reconstructs the images from it. We hope this will denoise the images. However, as you will see, the result will not be very good: the noise will be gone, but the images will not closely resemble the originals. Try increasing the number of principal components to fix this! Play around with it until you get a good denoising result. Answer the following questions:

1. What is the behavior of the reconstructed images as you include more components?

2. What is the best number of components for denoising this particular test set?

In [None]:
# fit a PCA model
number_of_principal_components = 10
pca = PCA(n_components = number_of_principal_components)
pca.fit(X_train)

# project the noisy data onto the principal components, then map it back to the original 784 dimensions
X_reconstructed_pca = pca.inverse_transform(pca.transform(X_noisy))

# plot the denoised images
plot_images(X_reconstructed_pca, 'Reconstructed using PCA')
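
Judging the reconstructions by eye can be complemented with a number. The sketch below is one possible way to do that, not a prescribed method: it measures the mean squared error between the denoised images and the clean ones. The helper `reconstruction_mse` is hypothetical, and the synthetic low-rank arrays `demo_train`, `demo_clean` and `demo_noisy` are stand-ins so that the cell runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_mse(X_train, X_clean, X_noisy, n_components):
    """Fit PCA on X_train, denoise X_noisy, return the MSE against X_clean."""
    pca = PCA(n_components=n_components)
    pca.fit(X_train)
    X_denoised = pca.inverse_transform(pca.transform(X_noisy))
    return np.mean((X_denoised - X_clean) ** 2)

# synthetic stand-in data (assumption): 50 features driven by 5 latent factors
rng = np.random.default_rng(0)
latent_dim, n_features = 5, 50
mixing = rng.normal(size=(latent_dim, n_features))
demo_train = rng.normal(size=(500, latent_dim)) @ mixing
demo_clean = rng.normal(size=(20, latent_dim)) @ mixing
demo_noisy = demo_clean + rng.normal(scale=1.0, size=demo_clean.shape)

# error of the noisy data itself, as a baseline
noisy_mse = np.mean((demo_noisy - demo_clean) ** 2)

# error after denoising with different numbers of components
scores = {k: reconstruction_mse(demo_train, demo_clean, demo_noisy, k)
          for k in (1, 5, 20)}
print(noisy_mse, scores)
```

With the notebook's data, the corresponding call would be `reconstruction_mse(X_train, X_test, X_noisy, number_of_principal_components)`. Too few components discard part of the signal, while too many let the noise back in.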

3. Repeat the exercise above but using Kernel PCA instead. Are you able to achieve better results?

In [None]:
# code here

4. Finally, we note that the shoes are not always reconstructed well. Assuming you had labelled images, can you think of a way of improving the denoising process?