# Image Compression with PCA

Here we use an exploratory data analysis technique called PCA (Principal Component Analysis), which finds the "components" that explain the most variance in the data. Concretely, if you plot k-dimensional data in a k-dimensional space, PCA finds k orthogonal directions, ordered by how much of the variation between data points each one captures.
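
To make "directions of greatest variance" concrete, here is a minimal NumPy sketch on synthetic 2-D data. The data, the seed, and the names `X`, `components`, and `explained_ratio` are illustrative assumptions and not part of the exercise. The sketch obtains the components from an SVD of the centered data, which is one standard way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D points stretched along the diagonal (hypothetical data).
t = rng.normal(size=500)
X = np.column_stack((t, t + 0.1 * rng.normal(size=500)))

# PCA via SVD: center the data, then take the right singular vectors.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each component.
explained_ratio = singular_values**2 / np.sum(singular_values**2)

print(components[0])    # first component: roughly along the diagonal, up to sign
print(explained_ratio)  # the first entry should be close to 1
```

Because the points lie almost on a line, a single component captures nearly all of the variance, which is the situation in which dropping the remaining components loses very little.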

For a version of this exercise in R, see the following: https://www.r-bloggers.com/image-compression-with-principal-component-analysis/

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_sample_image

Let's load up an image.

In [None]:
original = load_sample_image("china.jpg")
plt.imshow(original)
plt.show()
original[0:2]

The data comes as a three-dimensional array: the first axis is the height (rows of pixels), the second is the width (columns), and the third holds the RGB color channels. We preprocess the data so that it can be fed into the PCA model.

In [None]:
# Normalize the data to be in [0,1] - also cast to float.
img = np.array(original, dtype=np.float64) / 255

# split into three channels
img_r = img[:,:,0]
img_g = img[:,:,1]
img_b = img[:,:,2]
img[:2], img_r[:2]

Let's run PCA to find the principal components. These are the vectors that define a lower-dimensional space into which we can project each row of pixels. By keeping fewer components we need to store less information about the picture, but we also lose some image quality.
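
The storage trade-off can be estimated with a little arithmetic. The sketch below is an illustration under stated assumptions: the helper `pca_storage_ratio` is hypothetical, and the 427 by 640 channel shape is assumed for the sample image. It counts the values needed to rebuild one channel (the per-row scores, the component vectors, and the per-column mean) and compares that with the full pixel grid.

```python
def pca_storage_ratio(n_rows, n_cols, n_components):
    """Fraction of the original per-channel values kept after PCA compression."""
    original_values = n_rows * n_cols
    # Scores for each row, the component vectors, and the per-column mean.
    compressed_values = n_rows * n_components + n_components * n_cols + n_cols
    return compressed_values / original_values

# Assumed shape of one channel of the sample image: 427 rows by 640 columns.
ratio = pca_storage_ratio(427, 640, 30)
print(f"{ratio:.3f}")
```

With 30 components this comes to roughly 12% of the original values per channel. Note that this describes what would need to be stored; the cells that follow reconstruct the image at full size before saving it.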

In [None]:
# declare the model and fit it
comps = 30
pca_r = PCA(n_components=comps)
pca_g = PCA(n_components=comps)
pca_b = PCA(n_components=comps)
pca_r.fit(img_r)
pca_g.fit(img_g)
pca_b.fit(img_b)
pca_r.components_
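
One way to judge whether a given number of components is enough is to look at the cumulative explained variance. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on a synthetic matrix; the matrix, the seed, and the name `channel` are assumptions standing in for a real color channel.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for one color channel: low-rank structure plus a little noise.
channel = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 60))
channel += 0.01 * rng.normal(size=channel.shape)

pca = PCA(n_components=10).fit(channel)

# Cumulative share of variance explained by the first 1, 2, ..., 10 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```

In this notebook the same check could be applied to a fitted model, for example `np.cumsum(pca_r.explained_variance_ratio_)` for the red channel.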

In [None]:
# project each channel into the reduced space, then map it back to pixel space
img_comp_r = pca_r.inverse_transform(pca_r.transform(img_r))
img_comp_g = pca_g.inverse_transform(pca_g.transform(img_g))
img_comp_b = pca_b.inverse_transform(pca_b.transform(img_b))

img_comp = np.dstack((img_comp_r, img_comp_g, img_comp_b))
# the reconstruction can fall slightly outside [0, 1], so clip it
img_comp = np.clip(img_comp, 0, 1)
img_comp[:2]
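
The transform and inverse-transform round trip above can be sketched in plain NumPy to show how the reconstruction error shrinks as more components are kept. This is an illustrative sketch and not scikit-learn's implementation: the function `reconstruct` and the random test matrix are assumptions made for the example.

```python
import numpy as np

def reconstruct(channel, n_components):
    """Project rows onto the top principal components and map them back."""
    mean = channel.mean(axis=0)
    centered = channel - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]
    scores = centered @ top.T   # analogous to transform
    return scores @ top + mean  # analogous to inverse_transform

rng = np.random.default_rng(0)
channel = rng.random((40, 60))  # hypothetical data in [0, 1]

# Mean squared reconstruction error for an increasing number of components.
errors = {k: np.mean((channel - reconstruct(channel, k)) ** 2) for k in (5, 20, 40)}
for k, err in errors.items():
    print(k, err)
```

The error falls as the number of components grows and is numerically zero once all components are kept, which mirrors the quality versus size trade-off controlled by `comps`.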

In [None]:
plt.imshow(img_comp)
plt.show()

In [None]:
# scipy.misc.imsave was removed in SciPy 1.2, so save with matplotlib instead
plt.imsave('original.jpg', original)
plt.imsave('compressed.jpg', img_comp)