### Singular Value Decomposition

So far in this lesson, you have been introduced to Singular Value Decomposition (SVD). In this notebook, you will get hands-on practice with the technique.

Let's get started by importing our libraries and setting up the data we will use throughout this notebook.

`1.` Run the cell below to create the **user_movie_subset** dataframe. You will use this dataframe for the first part of this notebook.
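
To see what the `groupby(...).unstack()` pattern in the cell below produces, here is a minimal sketch on a made-up reviews table. The `toy_*` names, the ids, and all of the ratings are hypothetical and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical toy reviews; the real data comes from reviews_clean.csv
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 30, 20],
    'rating':   [8, 6, 9, 7, 5],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_user_by_movie = (
    toy_reviews.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
)
print(toy_user_by_movie)

# Keep only the users who rated every movie in the chosen subset
toy_subset = toy_user_by_movie[[10, 20]].dropna(axis=0)
print(toy_subset)
```

Dropping rows with missing values matters here because `np.linalg.svd` cannot factor a matrix that contains NaN entries.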

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import svd_tests as t
%matplotlib inline

# Read in the datasets
movies = pd.read_csv('../data/movies_clean.csv')
reviews = pd.read_csv('../data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

# Create user-by-item matrix
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

user_movie_subset = user_by_movie[[75314,  68646, 99685]].dropna(axis=0)
print(user_movie_subset)

`2.` Now that you have the **user_movie_subset** matrix, use it to match each key in the dictionary below to the correct value. Use the cells below the dictionary for any work you need.

In [None]:
# match each letter to the best statement in the dictionary below - each will be used at most once
a = 6
b = 68646
c = 'The Godfather'
d = 'Goodfellas'
e = 3298
f = 30685
g = 3

sol_1_dict = {
    'the number of users in the user_movie_subset': #enter a letter,
    'the number of movies in the user_movie_subset': #enter a letter,
    'the user_id with the highest average ratings given': #enter a letter,
    'the movie_id with the highest average ratings received': #enter a letter,
    'the name of the movie that received the highest average rating': #enter a letter,
}


#test dictionary here
t.test1(sol_1_dict)

In [None]:
# Cell for work


# user with the highest average rating


# movie with highest average rating


# list of movie names

    
# users by movies


Now that you have more context about the matrix, we are going to perform Singular Value Decomposition on it. To get started, let's review the dimensions of each of the matrices we will get back. SVD splits the **user_movie_subset** matrix into the product of three matrices:

$$ U \Sigma V^T $$


`3.` Using what you learned in the previous parts of this lesson, provide the dimensions of each of the matrices specified above using the dictionary below.

In [None]:
# match each letter in the dictionary below - a letter may appear more than once.
a = 'a number that you can choose as the number of latent features to keep'
b = 'the number of users'
c = 'the number of movies'
d = 'the sum of the number of users and movies'
e = 'the product of the number of users and movies'

sol_2_dict = {
    'the number of rows in the U matrix': #enter a letter, 
    'the number of columns in the U matrix': #enter a letter, 
    'the number of rows in the V transpose matrix': #enter a letter, 
    'the number of columns in the V transpose matrix': #enter a letter
}

#test dictionary here
t.test2(sol_2_dict)

Now let's verify the above dimensions by performing SVD on our user-movie matrix.

`4.` The code below performs SVD using NumPy.  You can read more about this function in the [documentation here](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.svd.html).  What do you notice about the shapes of the returned objects?  Can you directly take the dot product of the three objects to get back the user-movie matrix?

In [None]:
u, s, vt = np.linalg.svd(user_movie_subset)
s.shape, u.shape, vt.shape

In [None]:
# Run this cell for our thoughts on the questions posed above
t.question4thoughts()

`5.` Use the thoughts from the above question to create **u**, **s**, and **vt** with three latent features (the maximum for this matrix).  When all three matrices are correct, run the test below to confirm that their dot product reproduces the original user-movie matrix.  The matrices should have the following dimensions:

$$ U_{n \times k} $$

$$ \Sigma_{k \times k} $$

$$ V^T_{k \times m} $$

where:

1. n is the number of users
2. k is the number of latent features to keep
3. m is the number of movies
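
As a rough illustration of how $k$ changes these shapes, the sketch below runs `np.linalg.svd` on a made-up 5 x 4 matrix and keeps $k = 2$ latent features. The matrix, the `toy` and `*_k` names, and the choice of $k$ are assumptions for illustration only. With fewer than the maximum number of latent features, the product only approximates the original matrix.

```python
import numpy as np

# Hypothetical ratings: 5 users (rows) by 4 movies (columns)
rng = np.random.default_rng(0)
toy = rng.integers(1, 11, size=(5, 4)).astype(float)

# Default full_matrices=True gives u: (5, 5), s: (4,), vt: (4, 4)
u, s, vt = np.linalg.svd(toy)

k = 2                  # number of latent features to keep
u_k = u[:, :k]         # n x k
s_k = np.diag(s[:k])   # k x k, singular values on the diagonal
vt_k = vt[:k, :]       # k x m

approx = u_k @ s_k @ vt_k
print(u_k.shape, s_k.shape, vt_k.shape)
print('reconstruction error:', np.linalg.norm(toy - approx))

# Keeping all 4 singular values reproduces the matrix up to float error
full = u[:, :4] @ np.diag(s) @ vt
print(np.allclose(full, toy))
```

Note that `s` comes back as a one-dimensional array of singular values, so it has to be placed on the diagonal of a square matrix before the three pieces can be multiplied.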


In [None]:
# Change the dimensions of u, s, and vt as necessary to use three latent features
# update the shape of u and store in u_new
u_new = #implement your code here

# update the shape of s and store in s_new
s_new = #implement your code here

# Because we are using 3 latent features and there are only 3 movies, 
# vt and vt_new are the same
vt_new = #implement your code here

In [None]:
# Check your matrices against the solution
assert u_new.shape == (6, 3), "Oops!  The shape of the u matrix doesn't look right.  It should be 6 x 3."
assert s_new.shape == (3, 3), "Oops!  The shape of the sigma matrix doesn't look right.  It should be 3 x 3."
assert vt_new.shape == (3, 3), "Oops!  The shape of the v transpose matrix doesn't look right.  It should be 3 x 3."
assert np.allclose(np.dot(np.dot(u_new, s_new), vt_new), user_movie_subset), "Oops!  Something went wrong with the dot product.  Your result didn't reproduce the original user-movie matrix."
print("That's right! The dimensions of u should be 6 x 3, and both v transpose and sigma should be 3 x 3.  The dot product of the three matrices now equals the original user-movie matrix!")

`6.` Scikit-learn also provides a convenient implementation of SVD.  The documentation for this implementation can be found [in the scikit-learn documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).  Below, we have imported this class and created the binary (1-0) user-movie matrix used earlier in this lesson.  Use `fit_transform` on the `user_by_movie` matrix to obtain

$$ U \Sigma V^T $$

with 200 components and 5 iterations.
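
The sketch below shows how the three factors map onto a fitted `TruncatedSVD` object, using a small made-up binary matrix rather than `user_by_movie`. The `toy` names, the matrix size, and `n_components=3` are assumptions for illustration. In scikit-learn, `fit_transform` returns the product $U \Sigma$ rather than $U$ alone, `singular_values_` holds the diagonal of $\Sigma$, and `components_` holds $V^T$.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical binary interaction matrix: 20 users by 8 movies
rng = np.random.default_rng(0)
toy = (rng.random((20, 8)) > 0.5).astype(float)

toy_svd = TruncatedSVD(n_components=3, n_iter=5, random_state=42)

u_sigma = toy_svd.fit_transform(toy)        # (20, 3), equals U times Sigma
sigma = np.diag(toy_svd.singular_values_)   # (3, 3)
vt_toy = toy_svd.components_                # (3, 8)

# Recover U by dividing each column by its singular value
u_toy = u_sigma / toy_svd.singular_values_

# Rank-3 approximation of the toy matrix
toy_pred = u_toy @ sigma @ vt_toy
print(u_toy.shape, sigma.shape, vt_toy.shape, toy_pred.shape)
```

Because only three components are kept, `toy_pred` approximates `toy` rather than reproducing it exactly.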

In [None]:
from sklearn.decomposition import TruncatedSVD

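# Convert ratings to a 1-0 matrix: 1 if the user rated the movie, 0 otherwise
# (a missing rating is NaN, and NaN > 0 evaluates to False)
user_by_movie = (user_by_movie > 0).astype(int)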

svd = TruncatedSVD(n_components=200, n_iter=5, random_state=42)

u = # code for u matrix
vt = # code for vt
s = # code for sigma
print('u', u.shape)
print('s', s.shape)
print('vt', vt.shape)

`7.` How much variability can be explained by each of the 200 components?  How much of the variability can be explained in total by the 200 components?
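
One possible way to inspect this, sketched on a made-up binary matrix, is the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. The `toy` names, the matrix size, and `n_components=4` are assumptions for illustration, and the numbers this prints say nothing about the real `user_by_movie` data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical binary interaction matrix: 30 users by 10 movies
rng = np.random.default_rng(1)
toy = (rng.random((30, 10)) > 0.6).astype(float)

toy_svd = TruncatedSVD(n_components=4, n_iter=5, random_state=42).fit(toy)

# Fraction of the total variance captured by each component
per_component = toy_svd.explained_variance_ratio_
print(per_component)

# Fraction captured by all kept components together
total = per_component.sum()
print(total)
```

Since `TruncatedSVD` does not center the data, the ratios are not guaranteed to be in decreasing order, so it is worth looking at the whole array rather than only the first few values.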

`8.` Create your prediction matrix and verify it is the shape you would expect.