## Task 1: Users to Movies


The following example is from http://web.stanford.edu/class/cs246/slides/06-dim_red.pdf

Consider the following data matrix, $X$:

<img src="img.png" height="300" width="300" align='center'>

- Here, each row corresponds to the ratings submitted by a single user, on a scale of $1$ to $5$, for each of the movies. If a user hasn't rated a movie, the rating is marked by a zero.
- By visual inspection, we see that the movies are either **sci-fi** or **romance**.
- The individual movies that we start with can be considered 5 different dimensions, whereas grouping them into two genres (sci-fi or romance) may be seen as a compressed, 2-dimensional representation of our data.
- So the natural question is: is it possible to obtain a compressed representation of our data matrix that highlights this distinction in our data?


In [2]:
# Import necessary libs:

# 1. import numpy as np
# 2. import pyplot as plt from matplotlib
# 3. import Axes3D from mpl_toolkits.mplot3d
# 4. import proj3d from mpl_toolkits.mplot3d
# 5. import FancyArrowPatch from matplotlib.patches

In [2]:
# Create the dataset using the np.array() with the data provided above
X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
# Store the number of users for further usages.
# Store the number of movies for further usages.


In [3]:
# Plot the data set:

# 1. Create three arrays: users, movie, and reviews, to represent the data matrix,
#     so that users[0], movie[0] and reviews[0] represent the review of the first user on the first movie.
# Tip: use np.array() and the flatten() function.

# 2. Set the figure size to (13,13) by using the function plt.figure().

# 3. Add a subplot on a 1x1 grid by using the function add_subplot() on the figure object:
#     set the first positional argument to 111 and projection to '3d'.

# 4. Set the font size of the legend to 10 by using plt.rcParams with 'legend.fontsize' as the key.

# 5. Plot the dataset using plot() for the sci-fi movies: set x to be the user list, y to be the movie list and z to be the reviews.
#     Also set a reasonable color and a legend label.

# 6. Plot the dataset using plot() for the romance movies, following the previous instruction.

# 7. Set the legend to a proper position using ax.legend(loc=?)

# 8. Set labels for the x and y axes with a proper font size using plt.xlabel(...) and plt.ylabel(...)

# 9. Set the title of this fig using plt.title()

# 10. Set the ticks for the x axis and y axis by using plt.xticks()/plt.yticks()

# 11. plot and present the fig using plt.show()
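Step 1 of the plotting cell is the part that tends to cause confusion, so here is a minimal sketch of one way to build the three flat arrays. It assumes the first three columns of `X` are the sci-fi movies and the last two are romance (an assumption read off the block structure of the matrix); the names `user_idx`, `movie_idx`, `scifi` and `romance` are illustrative only.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
n_users, n_movies = X.shape

# np.indices returns, for every cell of X, its row index and its column index.
user_idx, movie_idx = np.indices(X.shape)

users = user_idx.flatten()    # row (user) index of each rating
movies = movie_idx.flatten()  # column (movie) index of each rating
reviews = X.flatten()         # reviews[i] is the rating of users[i] for movies[i]

# Assumed genre split: columns 0-2 are sci-fi, columns 3-4 are romance.
scifi = movies < 3
romance = ~scifi
```

The boolean masks can then select the points for each genre, for example `users[scifi]`, `movies[scifi]`, `reviews[scifi]` as the x, y and z arguments of a 3D plot call.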


In [4]:
# Data Preprocessing:

# 1. Calculate the mean of the data set
# 2. Subtract the mean from the data set
# 3. Store the new centered data set
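As a rough illustration of the preprocessing steps, the sketch below centers each column (movie) of `X` by subtracting that column's mean. Reading "the mean of the data set" as the per-column mean is an assumption; the names `col_mean` and `X_centered` are illustrative.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Mean rating of every movie (one value per column).
col_mean = X.mean(axis=0)

# Broadcasting subtracts the matching column mean from every row.
X_centered = X - col_mean
```

After this step every column of `X_centered` averages to zero, which is the form the covariance formula $\frac{1}{N}X^TX$ expects.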

## Solution1: Implementing PCA using Singular Value Decomposition (SVD)

We start with the simplest and most straightforward strategy first - **Singular Value Decomposition**. <br>

From matrix theory, we know that every matrix can be decomposed into a product of 3 matrices (image is from Tim Roughgarden):

$$ X = U S  V^T$$

In class we proved that we can use the SVD to factorize $X^TX=(USV^T)^T(USV^T)=VS^2V^T$.
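For reference, the intermediate steps of that factorization can be written out as follows, assuming the columns of $U$ are orthonormal ($U^TU=I$) and $S$ is a square diagonal matrix (so $S^TS=S^2$), as in the reduced SVD:

$$ X^TX = (USV^T)^T(USV^T) = V S^T U^T U S V^T = V S^T S V^T = V S^2 V^T $$

Comparing this with an eigendecomposition $X^TX = VDV^T$ suggests reading $D = S^2$: the eigenvalues of $X^TX$ are the squared singular values of $X$.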

The principal components of the matrix $A=X^TX$ lie in the rows of the matrix $V^T$. Therefore, by selecting the first $k$ columns of $V$, we end up selecting the vectors $v_1, v_2, ..., v_k$.


In [5]:
# Calculate the U, S, V^T:
# 1. Use the singular value decomposition from numpy.
# 2. np.linalg.svd()
# 3. Store the u,s,v^T values
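A minimal sketch of the decomposition step is shown below. Passing `full_matrices=False` (the reduced SVD) is a choice made here for compactness, not something the exercise prescribes; with the default `full_matrices=True`, `u` would be $7 \times 7$ instead. `X_rebuilt` is an illustrative name.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Reduced SVD: u is (7, 5), s is a 1-D array of 5 singular values, vT is (5, 5).
u, s, vT = np.linalg.svd(X, full_matrices=False)

# s holds only the diagonal of S, so rebuild S with np.diag before multiplying.
X_rebuilt = u @ np.diag(s) @ vT
```

Note that NumPy returns $V^T$ directly, so the principal directions are the rows of `vT`, and the singular values in `s` come sorted from largest to smallest.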

In [6]:
print("U.shape, S.shape, V.T.shape ->", u.shape, s.shape, vT.shape, end="\n\n")

print("U =",np.around(u, decimals=3), sep="\n", end="\n\n")

print("S =",np.around(s, decimals=3), sep="\n", end="\n\n")

print("V.T =",np.around(vT, decimals=3), sep="\n", end="\n\n")

In [7]:
# Plot the squared singular values (the diagonal of the D matrix).
# 1. Calculate the diagonal of D using s: D is s*s (element-wise square)
# 2. Set the fig size to (15,5)
# 3. Add the line chart using plt.plot( ?? ,'bo-')
# 4. Add a proper title, ticks, and axis labels
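Alongside the plot, it can help to look at the same information numerically. The sketch below computes the diagonal of $D$ and the cumulative share of the total that the first $k$ values account for. The 90% threshold used to pick `k` is a hypothetical rule of thumb chosen for illustration, not a value given in the exercise.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Singular values only, sorted from largest to smallest.
s = np.linalg.svd(X, compute_uv=False)

# Diagonal of D: element-wise square of the singular values.
D = s ** 2

# Fraction of the total "energy" captured by the first 1, 2, ... values.
energy = np.cumsum(D) / D.sum()

# Smallest k whose cumulative share reaches the (assumed) 90% threshold.
threshold = 0.90
k = int(np.searchsorted(energy, threshold) + 1)
```

For this matrix the first two values dominate, so the threshold selects `k = 2`, matching the two genres.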


In [8]:
# Obtaining our compressed data representation:
# 1. From the fig above, determine the smallest k such that the first k singular values are enough to represent the data set
# 2. Obtain the first k rows of v^T and store them
# 3. Calculate the compressed data using np.matmul(), X, and the stored first k rows of v^T (transposed so the shapes match)
# 4. Print the compressed value of X
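The projection step could look roughly like the sketch below, assuming `k = 2` was chosen from the plot. `V_k` and `X_compressed` are illustrative names. Since `vT[:k]` has shape `(k, 5)`, it is transposed before the product so that `X` of shape `(7, 5)` maps to `(7, k)`.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

u, s, vT = np.linalg.svd(X, full_matrices=False)

k = 2                      # assumed choice of retained components
V_k = vT[:k].T             # first k columns of V, shape (5, k)

# Coordinates of every user in the k-dimensional "genre" space.
X_compressed = np.matmul(X, V_k)   # shape (7, k)

# Equivalent form: X V_k = U_k S_k, because V has orthonormal columns.
X_compressed_alt = u[:, :k] * s[:k]
```

Each row of `X_compressed` now describes a user with two numbers instead of five.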

Let's visualize what just happened.

In [9]:
# Visualize what just happened:
# 1. Set the fig size to (15,5)
# 2. Create a proper title, axis labels, and legend
# 3. Plot the data


## Solution2: Directly computing  V and D 

Now we compute $V$ (a.k.a. the eigenvectors) and the diagonal elements of $D$ (a.k.a. the eigenvalues) from $A=X^TX=V D V^T$.

The covariance matrix of a mean-centered data matrix $X$ with $N$ rows can be computed as $\frac{1}{N}X^TX$. <br>
If $X$ is our data matrix comprising $d$ features, then $X^TX$ is a $(d \times d)$ symmetric matrix wherein the entry at location **ij** is the dot product of **feature i** with **feature j** (columns $i$ and $j$ of $X$).
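The two statements above can be checked numerically. This is a small sketch with illustrative names (`XtX`, `cov_from_product`, `cov_numpy`): it confirms that one entry of $X^TX$ is the dot product of two columns, and that $\frac{1}{N}X^TX$ agrees with NumPy's covariance only once the columns are centered.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])
N = X.shape[0]

XtX = X.T @ X                      # (5, 5) symmetric matrix

# Entry (i, j) equals the dot product of column i with column j.
i, j = 1, 3
entry = XtX[i, j]
dot = X[:, i] @ X[:, j]

# Covariance: center the columns first, then divide by N.
X_centered = X - X.mean(axis=0)
cov_from_product = X_centered.T @ X_centered / N

# bias=True makes np.cov normalize by N instead of N - 1.
cov_numpy = np.cov(X, rowvar=False, bias=True)
```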

In [10]:
# Alternative implementation:
# Directly computing V and D from X and X^T
# 1. Compute XTX using np.matmul() and store it.
# 2. Apply np.linalg.eig() to calculate the eigenvectors and eigenvalues

In [13]:
print("V (Eigen-vectors) = ")
print(np.around(, decimals=3))
print()
print("diagonal elements of D (Eigen-values) = ")
print(np.around(, decimals=3)) 
print()
print("sqrt(Eigen-values) = ")
print(np.around(np.sqrt(np.abs()), decimals=3))

Notice the following:
1. The **square roots of the eigenvalues** of the matrix $X^TX$ correspond exactly to the **singular values** of the data matrix $X$.
2. The **eigenvectors** of $X^TX$ are the same (up to sign) as the column vectors in the matrix $V$ when we performed SVD on $X$.

Therefore, the same principal components of our data matrix $X$ may be extracted via SVD or from $X$'s covariance matrix.
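Both observations can be verified in a few lines. The sketch below uses `np.linalg.eigh` rather than `np.linalg.eig`; this is a substitution made here because `eigh` is designed for symmetric matrices and always returns real values. The eigenvectors are compared only for the three non-zero eigenvalues, since eigenvectors belonging to the repeated zero eigenvalue are not unique.

```python
import numpy as np

X = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]])

# Route 1: SVD of X.
_, s, vT = np.linalg.svd(X, full_matrices=False)

# Route 2: eigendecomposition of X^T X (eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]          # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Clip tiny negative round-off before taking the square root.
sqrt_eigvals = np.sqrt(np.clip(eigvals, 0, None))

# |cosine| between matching directions; 1.0 means equal up to sign.
alignment = [abs(eigvecs[:, i] @ vT[i]) for i in range(3)]
```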


## Task 2: Human Faces 

Each image is a 62x47 pixel array. The images are read into a matrix called fea. Each row of the matrix fea represents one image (example). The features (columns) are the pixel values. Each example is represented by a vector of real numbers of length 2914, listing the pixels from left to right, row by row, from top to bottom.
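The "left to right, row by row" layout is NumPy's default row-major order, so flattening and reshaping are inverse operations. The sketch below uses a synthetic array as a stand-in for a real face (the values are not image data) to show where pixel `(r, c)` ends up in the flat vector.

```python
import numpy as np

h, w = 62, 47                       # image height and width in pixels

# Synthetic stand-in for one image: every pixel holds a distinct value.
image = np.arange(h * w, dtype=float).reshape(h, w)

# Flatten row by row into a feature vector of length h * w.
row = image.flatten()

# Reshape restores the original 2-D layout.
restored = row.reshape(h, w)

# Pixel (r, c) of the image sits at flat index r * w + c.
r, c = 3, 5
flat_index = r * w + c
```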

In [None]:
# Import libs:
# 1. numpy as np
# 2. matplotlib.pyplot as plt
# 3. pandas
# 4. fetch_lfw_people from sklearn.datasets

In [None]:
# Data set:
# 1. Load the dataset using fetch_lfw_people() with min_faces_per_person set to 70
#     for details of min_faces_per_person, please refer to https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html
# 2. Store the number of images and their height and width using lfw_people.images.shape
# 3. Calculate the number of pixels
# 4. Store the pixel values using lfw_people.data


In [None]:
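def plt_face(x):
    # Show one flattened face vector x as an h-by-w grayscale image.
    # h and w are the image height and width stored when loading the data set.
    global h,w
    plt.imshow(x.reshape((h, w)), cmap=plt.cm.gray)
    plt.xticks([])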

In [14]:

# Use the function we provided above, plot some faces:
# 1. Define the fig size to (10,20)
# 2. Use plt_face()
# 3. plt.show()

In [16]:
# Find the mean picture:
# 1. Calculate the mean of the image data
# 2. Subtract the mean from all the images
# 3. Plot the mean face using plt_face()

In [17]:
# Find eigenvectors and eigenvalues:
# 1. Calculate the covariance matrix of the zero_mean data
# 2. Use np.linalg.eig() to compute the eigenvalues and eigenvectors
# 3. Find the top 5 features
# 4. Calculate the new values based on the top 5 features.
# 5. Store the new values.
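The steps above can be sketched on a small synthetic matrix so that it runs without downloading the faces. Everything about the data here is an assumption: `fea` is random noise with 40 examples and 12 features, not the LFW pixels. The sketch also swaps in `np.linalg.eigh` for `np.linalg.eig`, since the covariance matrix is symmetric and `eigh` keeps the results real. `V5` and `fea_pca` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
fea = rng.normal(size=(40, 12))     # synthetic stand-in for the image matrix

# 1. Center the data and form the covariance matrix (features x features).
mean = fea.mean(axis=0)
zero_mean = fea - mean
cov = zero_mean.T @ zero_mean / zero_mean.shape[0]

# 2. Eigendecomposition; columns of eigvec are the eigenvectors.
eigval, eigvec = np.linalg.eigh(cov)

# 3. Indices of the 5 largest eigenvalues, largest first.
top5 = np.argsort(eigval)[::-1][:5]

# 4. Project the centered data onto the 5 leading eigenvectors.
V5 = eigvec[:, top5]                # shape (12, 5)
fea_pca = zero_mean @ V5            # shape (40, 5)
```

A useful sanity check is that the variance of each projected column equals the matching eigenvalue.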

In [18]:
print("Top 5 Vector:")
print()
print(eigvec.real.tolist())
print()
print("Top 5 EigVal:")
print()
print(eigval.real)
print()
print("Associated 5 attributes in fourth image")
print("Indexing by",top5)
print()
print(fea[3][top5])

In [19]:
print("Top 5 EigVal:")
print()
print(eigval.real)

In [None]:
# Projection of the fourth face onto the first 5 principal components

In [20]:
print("The projection of fourth image")
print(??)

In [21]:
# project back to the image space where d=5
# X’= X_pca * VT  + X_mean 
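The back-projection formula can be tried out on synthetic data before applying it to the faces. In the sketch below, `fea` is random noise (an assumption, not the LFW images), `reconstruct` is a hypothetical helper, and `np.linalg.eigh` is used in place of `np.linalg.eig` because the covariance matrix is symmetric. It shows the reconstruction error shrinking as more components are kept.

```python
import numpy as np

rng = np.random.default_rng(1)
fea = rng.normal(size=(30, 8))      # synthetic stand-in: 30 examples, 8 features

X_mean = fea.mean(axis=0)
zero_mean = fea - X_mean
eigval, eigvec = np.linalg.eigh(zero_mean.T @ zero_mean / len(fea))
order = np.argsort(eigval)[::-1]    # eigenvalue indices, largest first


def reconstruct(k):
    """Project onto the top-k eigenvectors, then map back to feature space."""
    V = eigvec[:, order[:k]]        # (d, k)
    X_pca = zero_mean @ V           # (n, k) compressed coordinates
    return X_pca @ V.T + X_mean     # (n, d): X' = X_pca * V^T + X_mean


# Reconstruction error for an increasing number of components.
errors = [np.linalg.norm(fea - reconstruct(k)) for k in (1, 4, 8)]
```

With all 8 components the data is recovered up to round-off; with fewer, the error reflects the discarded directions. The same pattern is what makes the d=50 face reconstruction look sharper than the d=5 one.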


In [22]:
# Project back to images where d=50
# 1. Find the top 50 eigenvectors and eigenvalues
# 2. Store the top 50 eigenvectors
# 3. Store the top 50 eigenvalues
# 4. Compute the new features using the top 50 eigenvectors and eigenvalues.
# 5. Plot the face