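### Singular Value Decomposition

So far, you have learned a bit about singular value decomposition (SVD). In this notebook, you will put those skills into practice.

First, import the libraries and set up the data that will be used throughout this notebook.

`1.` Run the cell below to create the **user_movie_subset** DataFrame. You will use this DataFrame in the first part of this notebook.

**Note: building this matrix (the `unstack` step) can take around 10 minutes.**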

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import svd_tests as t
%matplotlib inline

# Read in the datasets
movies = pd.read_csv('data/movies_clean.csv')
reviews = pd.read_csv('data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

# Create user-by-item matrix
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

user_movie_subset = user_by_movie[[73486, 75314,  68646, 99685]].dropna(axis=0)
print(user_movie_subset)
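
To make the `groupby` / `unstack` / `dropna` steps above easier to follow, here is a minimal sketch on a tiny made-up table. The names `toy_reviews`, `toy_user_by_movie`, and `toy_subset` and all of the ratings are hypothetical; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical toy reviews; the real data comes from data/reviews_clean.csv
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 20, 10],
    'rating':   [8, 6, 9, 7, 5],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_user_by_movie = (
    toy_reviews.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
)

# Keep only the users who rated every selected movie
toy_subset = toy_user_by_movie[[10, 20]].dropna(axis=0)

print(toy_user_by_movie)
print(toy_subset)
```

User 3 never rated movie 20, so that row contains a NaN and is dropped from `toy_subset`. This is why **user_movie_subset** has no missing values.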

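`2.` Now that you have the **user_movie_subset** matrix, use it to match each key in the dictionary below with the correct value. Use the cell below the dictionary for your work.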

In [None]:
# match each letter to the best statement in the dictionary below - each will be used at most once
a = 20
b = 68646
c = 'The Godfather'
d = 'Goodfellas'
e = 265
f = 30685
g = 4

sol_1_dict = {
    'the number of users in the user_movie_subset': # letter here,
    'the number of movies in the user_movie_subset': # letter here,
    'the user_id with the highest average ratings given': # letter here,
    'the movie_id with the highest average ratings received': # letter here,
    'the name of the movie that received the highest average rating': # letter here
}


#test dictionary here
t.test1(sol_1_dict)

In [None]:
# Cell for work
# user with the highest average rating

# movie with highest average rating

# list of movie names
   
# users by movies
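
One possible way to approach the work cell, sketched on a hypothetical 3 x 2 matrix rather than the real data. The `toy` and `toy_movies` tables, the titles, and the `movie` column name are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical user-by-movie matrix (rows: user_id, columns: movie_id)
toy = pd.DataFrame(
    {10: [8, 9, 5], 20: [6, 7, 10]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Hypothetical lookup table from movie_id to title
toy_movies = pd.DataFrame({'movie_id': [10, 20], 'movie': ['Movie A', 'Movie B']})

n_users, n_movies = toy.shape            # users by movies
top_user = toy.mean(axis=1).idxmax()     # row means give each user's average
top_movie = toy.mean(axis=0).idxmax()    # column means give each movie's average
top_name = toy_movies.loc[toy_movies['movie_id'] == top_movie, 'movie'].iloc[0]

print(n_users, n_movies, top_user, top_movie, top_name)
```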


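Now that you have a rough picture of the matrix we are going to decompose, let's perform the decomposition. First, recall the dimensions of each matrix we expect to get back. We will split the **user_movie_subset** matrix into three matrices: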

$$ U \Sigma V^T $$


`3.` Using what you learned earlier in this lesson, indicate the dimensions of each of the matrices above in the dictionary below.

In [None]:
# match each letter in the dictionary below - a letter may appear more than once.
a = 'a number that you can choose as the number of latent features to keep'
b = 'the number of users'
c = 'the number of movies'
d = 'the sum of the number of users and movies'
e = 'the product of the number of users and movies'

sol_2_dict = {
    'the number of rows in the U matrix': # letter here, 
    'the number of columns in the U matrix': # letter here, 
    'the number of rows in the V transpose matrix': # letter here, 
    'the number of columns in the V transpose matrix': # letter here
}

#test dictionary here
t.test2(sol_2_dict)

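Now perform SVD on the user-movie matrix and verify the dimensions above.

`4.` Below is the code to perform SVD with NumPy. To learn more about this function, see [the documentation](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.svd.html). What do you notice about the shapes of these matrices? If you take the dot product of the three objects returned, do you directly get back the user-movie matrix?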

In [None]:
u, s, vt = np.linalg.svd(user_movie_subset) # perform svd here on user_movie_subset
s.shape, u.shape, vt.shape

In [None]:
# Run this cell for our thoughts on the questions posted above
t.question4thoughts()
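
As a small illustration of the shape question, the sketch below runs `np.linalg.svd` on a made-up 6 x 3 matrix (not the notebook's data). With the default `full_matrices=True`, `u` is square and `s` comes back as a 1-D array of singular values, so the three objects cannot be multiplied directly. The names `A`, `u_k`, `s_k`, and `vt_k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(1, 11, size=(6, 3)).astype(float)  # 6 "users" x 3 "movies"

u, s, vt = np.linalg.svd(A)            # full_matrices=True by default
print(u.shape, s.shape, vt.shape)      # (6, 6) (3,) (3, 3)

# Keep k latent features: trim u, turn s into a diagonal matrix, trim vt
k = 3
u_k = u[:, :k]
s_k = np.diag(s[:k])
vt_k = vt[:k, :]

A_rebuilt = u_k @ s_k @ vt_k
print(np.allclose(A_rebuilt, A))
```

Passing `full_matrices=False` returns the trimmed `u` directly. Slicing is shown here because it extends naturally to smaller `k`.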

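`5.` Using the ideas from the previous question, create U, S, and V transpose matrices with 4 latent features. Once all three are created correctly, run the tests below to show that the dot product of the three matrices recreates the original user-movie matrix. The matrices should have the following dimensions:

$$ U_{n \times k} $$

$$\Sigma_{k \times k} $$

$$V^T_{k \times m} $$

where:

1. n is the number of users
2. k is the number of latent features (4 in this case)
3. m is the number of movies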

In [None]:
# Change the dimensions of u, s, and vt as necessary to use four latent features
# update the shape of u and store in u_new
u_new = # change the shape of u here

# update the shape of s and store in s_new
s_new = #change the shape of s as necessary 

# Because we are using 4 latent features and there are only 4 movies, 
# vt and vt_new are the same
vt_new = # change the shape of vt as necessary

In [None]:
# Check your matrices against the solution
assert u_new.shape == (20, 4), "Oops!  The shape of the u matrix doesn't look right. It should be 20 by 4."
assert s_new.shape == (4, 4), "Oops!  The shape of the sigma matrix doesn't look right.  It should be 4 x 4."
assert vt_new.shape == (4, 4), "Oops! The shape of the v transpose matrix doesn't look right.  It should be 4 x 4."
assert np.allclose(np.dot(np.dot(u_new, s_new), vt_new), user_movie_subset), "Oops!  Something went wrong with the dot product.  Your result didn't reproduce the original movie_user matrix."
print("That's right! The dimensions of u should be 20 x 4, and both v transpose and sigma should be 4 x 4.  The dot product of the three matrices now equals the original user-movie matrix!")

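The $\Sigma$ matrix tells us how much of the original variability in the user-movie matrix each latent feature captures. The total variability to be explained is the sum of the squared diagonal elements. The variability explained by the first component is the square of the first diagonal value, and the variability explained by the second component is the square of the second diagonal value.

`6.` Using the information above, can you determine how much of the variability in the original user-movie matrix is explained by the first two components alone? Try it in the cell below, then check your answer against the solution in the following cell.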
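
A quick worked example of this arithmetic, using made-up singular values rather than the ones from this notebook (`s_toy` is hypothetical):

```python
import numpy as np

# Hypothetical singular values, largest first
s_toy = np.array([10.0, 5.0, 2.0, 1.0])

total = np.sum(s_toy ** 2)                    # 100 + 25 + 4 + 1 = 130
explained_first_two = np.sum(s_toy[:2] ** 2)  # 100 + 25 = 125
pct = 100 * explained_first_two / total       # about 96.15

print(total, explained_first_two, round(pct, 2))
```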

In [None]:
total_var = # Total variability here
var_exp_comp1_and_comp2 = # Variability Explained by the first two components here
perc_exp = # Percent of variability explained by the first two components here

# Run the below to print your results
print("The total variance in the original matrix is {}.".format(total_var))
print("The percentage of variability captured by the first two components is {}%.".format(perc_exp))

In [None]:
assert np.round(perc_exp, 2) == 98.55, "Oops!  That doesn't look quite right.  You should have total variability as the sum of all the squared elements in the sigma matrix.  Then just the sum of the squared first two elements is the amount explained by the first two latent features.  Try again."
print("Yup!  That all looks good!")

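`7.` As in question 5, change the shapes of the U, $\Sigma$, and V transpose matrices. This time, recreate the user-movie matrix using only the first two components instead of all four. Once the matrices are set up, run the tests and check your matrices against the solution. The matrices should have the following dimensions:

$$ U_{n \times k} $$

$$\Sigma_{k \times k} $$

$$V^T_{k \times m} $$

where:

1. n is the number of users
2. k is the number of latent features (2 in this case)
3. m is the number of movies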

In [None]:
# Change the dimensions of u, s, and vt as necessary to use two latent features
# update the shape of u and store in u_2
u_2 = # change the shape of u here

# update the shape of s and store in s_2
s_2 = #change the shape of s as necessary 

# Because we are keeping only 2 latent features, 
# vt_2 has fewer rows than vt
vt_2 = # change the shape of vt as necessary

In [None]:
# Check that your matrices are the correct shapes
assert u_2.shape == (20, 2), "Oops!  The shape of the u matrix doesn't look right. It should be 20 by 2."
assert s_2.shape == (2, 2), "Oops!  The shape of the sigma matrix doesn't look right.  It should be 2 x 2."
assert vt_2.shape == (2, 4), "Oops! The shape of the v transpose matrix doesn't look right.  It should be 2 x 4."
print("That's right! The dimensions of u should be 20 x 2, sigma should be 2 x 2, and v transpose should be 2 x 4. \n\n The question is now that we don't have all of the latent features, how well can we really re-create the original user-movie matrix?")

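`8.` With all 4 latent features, we could reproduce the user-movie matrix exactly. With only 2 latent features, we can measure how well we reproduce the original matrix by looking at the sum of squared errors between the ratings produced by the dot product and the actual ratings. Compute the sum of squared errors using only two latent features, and check your answer against the solution in the cell below.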

In [None]:
# Compute the dot product
pred_ratings = # store the result of the dot product here

# Compute the squared error for each predicted vs. actual rating
sum_square_errs = # compute the sum of squared differences from each prediction to each actual value here

In [None]:
# Check against the solution
assert np.round(sum_square_errs, 2) == 85.34, "Oops!  That doesn't look quite right.  You should return a single number for the whole matrix."
print("That looks right!  Nice job!")
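
The sum of squared errors ties back to question 6: for a rank-k truncation of an SVD, the sum of squared reconstruction errors equals the sum of the squared singular values that were dropped. The sketch below checks this on a made-up 8 x 4 matrix; `A`, `A_k`, `sse`, and `discarded` are hypothetical names, not part of the notebook's solution.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(1, 11, size=(8, 4)).astype(float)  # hypothetical ratings

u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]  # rank-k reconstruction

sse = np.sum((A - A_k) ** 2)       # sum of squared errors over every entry
discarded = np.sum(s[k:] ** 2)     # variability carried by the dropped features

print(sse, discarded, np.isclose(sse, discarded))
```

In other words, the share of variability *not* explained by the first k components is exactly the error you pay when reconstructing with only those components.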

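At this point you may be wondering why we would ever choose a k that does not return the full user-movie matrix with all of the original ratings. Good question. One reason is computational: we certainly want to reduce the dimensionality of the data. But that is not the main reason we want k to be smaller than the number of movies and the number of users.

Let's take a step back for a moment. In the example we just looked at, the matrix was very clean: every user-movie combination had a rating. **There were no missing values.** But as we know from the previous lesson, a real user-movie matrix has many missing values.

Here is a matrix similar to the one we just performed SVD on: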

<img src="imgs/nice_ex.png" width="400" height="400">

In reality:

<img src="imgs/real_ex.png" width="400" height="400">


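So, if we keep all k latent features, the latent features with the smaller values in the $\Sigma$ matrix are likely explaining variability that comes from noise rather than signal. Furthermore, if we use these "noisy" latent features to reconstruct the original user-movie matrix, the resulting ratings may be worse than those obtained using only the latent features associated with signal.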
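
The noise argument can be illustrated with a synthetic experiment. This is only a sketch under strong assumptions: the "true" ratings are a made-up rank-2 matrix, the noise is additive Gaussian, and every entry is observed. None of that is claimed about the real movie data.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_movies, true_k = 50, 20, 2

# Hypothetical low-rank "signal" plus small random noise
signal = rng.normal(size=(n_users, true_k)) @ rng.normal(size=(true_k, n_movies))
noisy = signal + 0.1 * rng.normal(size=signal.shape)

u, s, vt = np.linalg.svd(noisy, full_matrices=False)

def rebuild(k):
    """Reconstruct the noisy matrix from its first k latent features."""
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Compare each reconstruction against the noise-free signal
err_all = np.sum((rebuild(n_movies) - signal) ** 2)  # keeps the noisy features
err_k = np.sum((rebuild(true_k) - signal) ** 2)      # keeps only the top 2

print(err_all, err_k)
```

Keeping every latent feature reproduces the noisy matrix exactly, noise included, while the truncated version discards most of the noise and lands closer to the underlying signal.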

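`9.` Let's make the data a little more realistic by performing SVD on a matrix with missing values. The cell below sets the first entry of the matrix to NaN, so the first user no longer has a rating for all four movies. Try performing SVD on the new matrix. What happens?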

In [None]:
# This line adds one nan value as the very first entry in our matrix
user_movie_subset.iloc[0, 0] = np.nan # no changes to this line

# Try svd with this new matrix
u, s, vt = # Compute SVD on the new matrix with the single nan value

**Write your answer here.**


```python

```
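
For comparison with your own answer, here is a hedged sketch of the same experiment on a small made-up matrix. On typical NumPy builds `np.linalg.svd` raises `LinAlgError` when the input contains a NaN, but the exact behavior depends on the underlying LAPACK library, so the code also handles the case where it returns NaN-filled results instead.

```python
import numpy as np

A = np.arange(1.0, 13.0).reshape(4, 3)  # hypothetical 4 x 3 matrix
A[0, 0] = np.nan                        # a single missing value

try:
    u, s, vt = np.linalg.svd(A)
    outcome = 'returned'
    has_nan = bool(np.isnan(s).any())   # check whether the result is usable
except np.linalg.LinAlgError as err:
    outcome = 'raised'
    has_nan = None
    print('SVD failed:', err)

print(outcome, has_nan)
```

Either way, a single missing value is enough to make the plain SVD unusable, which is a real problem for user-movie matrices that are mostly empty.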