# CSD 3: PCA

0. For this Case Study assignment you should have the `ebay_boys_girls_shirts` folder in your current folder. It holds the four CSV files describing the train and test shirt images, and the `boys` and `girls` image folders. This is what we did in CSD 1. **If you already have the data in your current folder, you don't need to run this again!**

In [None]:
import requests
import tarfile

url = "http://www.tau.ac.il/~saharon/DScourse/ebay_boys_girls_shirts.tar.gz"
r = requests.get(url)

with open("ebay_boys_girls_shirts.tar", "wb") as file:
    file.write(r.content)

with tarfile.open("ebay_boys_girls_shirts.tar") as tar:
    tar.extractall('.')

1. We would like to perform PCA on our girls and boys shirts images dataset. An image is a classic candidate for dimensionality reduction methods since even a colored 100x100 image has 30,000 features! (why?)

In CSD2 we learned how to read a random sample of images into a 4D numpy array. We defined the following functions:

In [None]:
import sys
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage import transform, color, img_as_ubyte

%matplotlib inline

def get_file_list(df, folder, n_sample = None, seed = None):
    if n_sample is None:
        file_ids_list = df.file_id.values
    else:
        file_ids_list = df.sample(n = n_sample, random_state = seed).file_id.values
    files_list = [folder + '/' + str(file_id) + '.jpg' for file_id in file_ids_list]
    return files_list

def read_image_and_resize(f, w = 100, h = 100):
    img = plt.imread(f)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        img = transform.resize(img, (w, h), mode='constant')
        img = img_as_ubyte(img)
    img = color.gray2rgb(img)
    img = img[np.newaxis, :, :, :3]
    # compare against the requested size instead of a hard-coded 100x100
    if img.shape != (1, w, h, 3):
        raise ValueError(f + str(img.shape))
    return img

def read_images_4d_array(files_list):
    images_list = [read_image_and_resize(file) for file in files_list]
    images_array = np.concatenate(images_list)
    return images_array

Complete the `get_images_matrix` function **using the above functions** to unite what we did into a single function. It receives `csv_file`, the path to the metadata CSV file; `folder`, the name of the images folder; and `n`, the sample size. See below for how this function is used to get `x_boys_train` and `x_girls_train`.

In [None]:
def get_images_matrix(csv_file, folder, n = None, seed = 1976):
    df = ### YOUR CODE HERE ###
    files_list = ### YOUR CODE HERE ###
    images = ### YOUR CODE HERE ###
    return images, files_list

In [None]:
folder = 'ebay_boys_girls_shirts/'
x_boys_train, boys_files_list = get_images_matrix(folder + 'boys_train.csv', folder + 'boys', 2000)
x_girls_train, girls_files_list = get_images_matrix(folder + 'girls_train.csv', folder + 'girls', 2000)

print(x_boys_train.shape)
print(x_girls_train.shape)

2. Can you calculate the size of each of our 4D numpy arrays? Verify with this helper function:

In [None]:
def numpy_array_size_in_bytes(a):
    print(a.size * a.itemsize)

numpy_array_size_in_bytes(x_boys_train)
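As a quick sanity check of the helper's logic, here is a minimal sketch on a small toy array (the array `toy` and its shape are made up for illustration, not taken from the shirts data). It compares the manual `size * itemsize` product with NumPy's built-in `nbytes` attribute:

```python
import numpy as np

# Hypothetical toy array: 5 RGB images of 4x4 pixels, stored as unsigned bytes
toy = np.zeros((5, 4, 4, 3), dtype=np.uint8)

n_elements = toy.size              # total number of elements in the array
bytes_per_element = toy.itemsize   # bytes used by a single element of this dtype
manual_bytes = n_elements * bytes_per_element

# NumPy exposes the same quantity directly as `nbytes`
print(manual_bytes, toy.nbytes)
```

Note that `itemsize` depends on the array's `dtype`, so the same shape can occupy very different amounts of memory.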

3. But we can't use our 4D arrays with PCA just yet. We need to:

(a) Reshape them as 2D arrays having N rows (images) X P columns (pixels)

In [None]:
def get_all_pixels(x):
    return x.reshape(-1, np.prod(x.shape[1:]))

x_boys_train_all = get_all_pixels(x_boys_train)
x_girls_train_all = get_all_pixels(x_girls_train)

print(x_boys_train_all.shape)
print(x_girls_train_all.shape)
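To see what this reshape does to a single image, here is a small sketch on a made-up batch (`toy` is a hypothetical array, not part of the assignment data). It applies the same reshape as `get_all_pixels` and checks that each row is one flattened image and that nothing is lost:

```python
import numpy as np

# Hypothetical toy batch: 3 images, 2x2 pixels, 3 color channels
toy = np.arange(3 * 2 * 2 * 3).reshape(3, 2, 2, 3)

# Same reshape as get_all_pixels: keep the first axis, flatten the rest
flat = toy.reshape(-1, np.prod(toy.shape[1:]))
print(flat.shape)

# Row i of the flat matrix is image i with its pixels laid out in C order
assert np.array_equal(flat[1], toy[1].ravel())

# The reshape is lossless, so the 4D batch can be recovered
restored = flat.reshape(toy.shape)
```

Being able to go back to the 4D shape is handy later if you want to view a row of the 2D matrix as an image again.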

(b) Stack them one on top of the other, to have a giant `x_train` 2D numpy array, of size [4000, 30000].

Do that. Remember the docs, SO and good old Google (though once you get experience you're expected to know this yourself!).

In [None]:
x_train = ### YOUR CODE HERE ###

print(x_train.shape)

4. As in class, we first center the `x_train` matrix:

In [None]:
x_train_centered = x_train - x_train.mean(axis = 0)

5. Can you calculate the `x_train_centered` matrix size in bytes? Use the `numpy_array_size_in_bytes` function to help you.

In [None]:
numpy_array_size_in_bytes(x_train_centered)

Does this result surprise you? Make sure you understand this calculation.

6. How would you show that `x_train_centered` is indeed centered?

In [None]:
### YOUR CODE HERE ###

7. As in class, we import `PCA` from [sklearn](https://scikit-learn.org/stable/) and fit it to the data, asking for, say, the first 10 PCs:

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 10)

pca.fit(x_train_centered)

8. How would you get the `W` matrix of weights? (see class or the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html))

In [None]:
W = ### YOUR CODE HERE ###
print(W.shape)

9. We multiply `x_train_centered` by the transpose of `W` (check the shape of `W` printed above to see why) to get the reduced `x_train_reduced`:

In [None]:
x_train_reduced = np.matmul(x_train_centered, W.T)

print(x_train_reduced.shape)
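The projection step can also be illustrated without sklearn. The sketch below is an assumption-laden toy version: it uses random data, computes the principal directions with a plain NumPy SVD (not sklearn's implementation, so component signs may differ), and stores one direction per row in a hypothetical `W_toy`, which is the layout implied by the `W.T` in the cell above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6))        # toy data: 50 samples, 6 features
x_centered = x - x.mean(axis=0)     # center each feature

# Rows of vt are the principal directions, ordered by decreasing variance
u, s, vt = np.linalg.svd(x_centered, full_matrices=False)

k = 2
W_toy = vt[:k]                      # shape (k, n_features), one direction per row

# Project every sample onto the first k directions
scores = x_centered @ W_toy.T       # shape (n_samples, k)
print(scores.shape)
```

Each column of `scores` holds the samples' scores on one PC; the first column has the largest variance, and different columns are uncorrelated.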

10. What's the "sklearn" way of performing the last two stages?

In [None]:
x_train_reduced = ### YOUR CODE HERE ###

print(x_train_reduced.shape)

11. Let's compare the distributions of first-PC scores for boys' and girls' shirt images:

In [None]:
plt.hist(x_train_reduced[:2000, 0], alpha=0.5, label='boys', color = 'blue')
plt.hist(x_train_reduced[2000:, 0], alpha=0.5, label='girls', color = 'pink')
plt.legend(loc='upper right')
plt.show()

The two distributions are not separated as dramatically as we would have liked for classification later on. Try this with other PCs. In our experience the 2nd PC captures "boys-like" vs. "girls-like" differences better.

12. Plot the images as points in a scatterplot (`plt.scatter`) of their score in the first PC vs. their score in the second PC. Use different colors for boys shirts images and for girls shirts images. You should get something like the below: 

In [None]:
### YOUR CODE HERE ###

13. Since we're dealing with images, the best way to understand a given PC's "subject", what it is "talking about", is simply to view the images that have a high or low score on this PC.

Which 16 images have the highest score for PC1?

In [None]:
highest_score_ids = np.argpartition(x_train_reduced[:, 0], -16)[-16:]
print(highest_score_ids)

Which 16 images have the lowest score for PC1?

In [None]:
lowest_score_ids = np.argpartition(x_train_reduced[:, 0], 16)[:16]
print(lowest_score_ids)
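`np.argpartition` returns the indices of the k largest (or smallest) values, but in no particular order within the group. A small sketch on made-up scores (the `scores` array here is hypothetical) shows this, along with one possible way to order the selected indices if you want the grid sorted by score:

```python
import numpy as np

scores = np.array([0.3, -1.2, 4.5, 2.2, -0.7, 3.1, 0.0, 1.8])  # toy PC scores
k = 3

top_k = np.argpartition(scores, -k)[-k:]    # indices of the k largest, any order
bottom_k = np.argpartition(scores, k)[:k]   # indices of the k smallest, any order

# Sort each group by score if the order matters for display
top_k_sorted = top_k[np.argsort(scores[top_k])[::-1]]      # highest first
bottom_k_sorted = bottom_k[np.argsort(scores[bottom_k])]   # lowest first
print(top_k_sorted, bottom_k_sorted)
```

For 16 out of 4,000 images the difference is negligible, but `argpartition` avoids fully sorting the scores, which matters more on large arrays.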

We got the indices of the highest and lowest shirts images for PC1. How do we connect them back to the actual files so we can show them? If you recall when we sampled the images we also got the `boys_files_list` and `girls_files_list`.

In [None]:
all_files_list = np.array(boys_files_list + girls_files_list)

So the highest and lowest score files are:

In [None]:
highest_score_files = all_files_list[highest_score_ids]
lowest_score_files = all_files_list[lowest_score_ids]

highest_score_files

And now we can use our `read_images_4d_array` and `merge_images` functions from CSD2 to read these images and present them on a grid:

In [None]:
def merge_images(image_batch, size = (20, 20)):
    # size is (rows, cols) of the output grid
    h, w = image_batch.shape[1], image_batch.shape[2]
    c = image_batch.shape[3]
    img = np.zeros((h*size[0], w*size[1], c))
    for idx, im in enumerate(image_batch):
        i = idx % size[1]   # grid column of this image
        j = idx // size[1]  # grid row of this image
        img[j*h:j*h+h, i*w:i*w+w, :] = im/255  # scale to [0, 1] for imshow
    return img

highest_images = read_images_4d_array(highest_score_files)
lowest_images = read_images_4d_array(lowest_score_files)

highest_images_merged = merge_images(highest_images, size = [4, 4])
lowest_images_merged = merge_images(lowest_images, size = [4, 4])
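To make the grid layout concrete, here is a sketch of a reshape-based alternative to the loop in `merge_images`. The function `tile_images` is hypothetical (it is not part of CSD2) and, unlike `merge_images`, it assumes the batch fills the grid exactly. Images are placed row by row, left to right:

```python
import numpy as np

def tile_images(batch, rows, cols):
    """Hypothetical reshape-based alternative to merge_images."""
    n, h, w, c = batch.shape
    assert n == rows * cols, "batch must fill the grid exactly"
    # (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows*h, cols*w, c)
    grid = batch.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c) / 255

# Four constant-valued 2x2 toy "images" with pixel values 0, 10, 20, 30
batch = np.stack([np.full((2, 2, 3), 10 * i, dtype=np.uint8) for i in range(4)])
grid = tile_images(batch, 2, 2)
print(grid.shape)
```

In this toy grid image 1 lands in the top-right block and image 2 in the bottom-left block, the same placement the `idx % size[1]` and `idx // size[1]` arithmetic produces.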

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1, 2, 1)
plt.title('Highest PC1 Score Shirts')
plt.axis('off')
plt.imshow(highest_images_merged)

plt.subplot(1, 2, 2)
plt.title('Lowest PC1 Score Shirts')
plt.axis('off')
plt.imshow(lowest_images_merged)

Well now it is quite clear what PC1 is all about...

14. Combine all of the above into a function called `plot_highest_lowest_on_PC`, which accepts a PC index (0 to 9) and plots, side by side, two 4x4 grids: the 16 shirt images with the highest scores on this PC and the 16 with the lowest.

In [None]:
def plot_highest_lowest_on_PC(pc):
    ### YOUR CODE HERE ###

In [None]:
plot_highest_lowest_on_PC(2)

# submit results

In [None]:
ans = {}
ans['HW'] = 'CSD3'
ans['id_number'] = #### your id here ####

#### Please answer the questions below with one word, as a lowercase string. Make sure your answers contain no typos or misspellings.
Q1) What is the dominant color for shirts with a high PC1 score?<br>
Q2) What is the dominant color for shirts with a low PC1 score?<br>
Q3) What is the dominant color for shirts with a high PC3 score?<br>
Q4) What is the dominant color for shirts with a low PC3 score?

In [None]:
ans['Q1'] = #### your answer here ####
ans['Q2'] = #### your answer here ####
ans['Q3'] = #### your answer here ####
ans['Q4'] = #### your answer here ####

# finish!

To submit your HW, please run this last code block and follow the instructions. <br>
This code will create a CSV file in the current directory of the Azure Notebooks project. <br>
Please download it and submit it through Moodle.

In [None]:
import pandas as pd
df_ans = pd.DataFrame.from_dict(ans, orient='index')
if df_ans.shape[0] == 6:
    df_ans.to_csv('{}_{}.csv'.format(ans['HW'],str(ans['id_number'])))
    print("OK!")
else:
    print("seems like you missed a question, make sure you have run all the code blocks")