# <img align="left" src="./images/film_strip_vertical.png"     style=" width:40px;  " > Practice lab: Deep Learning for Content-Based Filtering

In this exercise, you will implement content-based filtering using a neural network to build a recommender system for movies. 

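# Outline <img align="left" src="./images/film_reel.png"     style=" width:40px;  " >
- [ 1 - Packages](#1)
- [ 2 - Movie ratings dataset](#2)
  - [ 2.1 Content-based filtering with a neural network](#2.1)
  - [ 2.2 Preparing the training data](#2.2)
- [ 3 - Neural Network for content-based filtering](#3)
  - [ 3.1 Predictions](#3.1)
    - [ Exercise 1](#ex01)
- [ 4 - Congratulations!](#4)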


<a name="1"></a>
## 1 - Packages <img align="left" src="./images/movie_camera.png"     style=" width:40px;  ">
We will use familiar packages: NumPy, TensorFlow and helpful routines from [scikit-learn](https://scikit-learn.org/stable/). We will also use [tabulate](https://pypi.org/project/tabulate/) to neatly print tables and [Pandas](https://pandas.pydata.org/) to organize tabular data.

In [1]:
import numpy as np
import numpy.ma as ma
from numpy import genfromtxt
from collections import defaultdict
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
from recsysNN_utils import *
pd.set_option("display.precision", 1)

<a name="2"></a>
## 2 - Movie ratings dataset <img align="left" src="./images/film_rating.png" style=" width:40px;" >
The data set is derived from the [MovieLens ml-latest-small](https://grouplens.org/datasets/movielens/latest/) dataset. 

[F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>]

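The original dataset has 9000 movies rated by 600 users, with ratings on a scale of 0.5 to 5 in 0.5 step increments. The dataset has been reduced in size to focus on movies released since 2000 and on popular genres. The reduced dataset has $n_u = 395$ users and $n_m= 694$ movies. For each movie, the dataset provides a movie title, release date, and one or more genres. For example, "Toy Story 3" was released in 2010 and has several genres: "Adventure|Animation|Children|Comedy|Fantasy|IMAX". The dataset contains little information about users other than their ratings. It is used to create training vectors for the neural networks described below. 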

<a name="2.1"></a>
### 2.1 Content-based filtering with a neural network

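In the collaborative filtering lab, you generated two vectors, a user vector and an item/movie vector, whose dot product would predict a rating. The vectors were derived solely from the ratings.   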
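
As a reminder of what "dot product predicts a rating" means, here is a minimal sketch. The names `w_user` and `x_movie` and their values are made up for illustration and do not come from the earlier lab.

```python
import numpy as np

# Hypothetical learned vectors for one user and one movie (illustrative values only).
w_user = np.array([0.5, 1.0, -0.5])
x_movie = np.array([2.0, 1.0, 0.0])

# The predicted rating is the dot product of the two vectors.
predicted_rating = float(np.dot(w_user, x_movie))
print(predicted_rating)  # 2.0
```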

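Content-based filtering also generates a user and a movie feature vector, but recognizes that there may be other information available about the user and/or movie that could improve the prediction. The additional information is provided to a neural network, which then generates the user and movie vectors as shown below.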
<figure>
    <center> <img src="./images/RecSysNN.png"   style="width:500px;height:280px;" ></center>
</figure>
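The movie content provided to the network is a combination of the original data and some "engineered features". Recall the feature engineering discussion and lab from Course 1, Week 2, Lab 4. The original features are the year the movie was released and the movie's genres presented as a one-hot vector. There are 14 genres. The engineered feature is an average rating derived from the user ratings. A movie with multiple genres has one training vector per genre. 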
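
The "one training vector per genre" layout can be sketched as follows. `item_rows` is a hypothetical helper, not part of `recsysNN_utils`, and skipping genres outside the 14 listed (such as IMAX) is an assumption about how the data was prepared.

```python
import numpy as np

# Genre order as shown in the tables of this lab (14 genres).
GENRES = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
          "Documentary", "Drama", "Fantasy", "Horror", "Mystery", "Romance",
          "Sci-Fi", "Thriller"]

def item_rows(movie_id, year, ave_rating, genre_string):
    """Hypothetical helper: one row per listed genre, each with a one-hot genre block."""
    rows = []
    for genre in genre_string.split("|"):
        if genre not in GENRES:      # assumption: genres outside the 14 are skipped
            continue
        one_hot = [1.0 if g == genre else 0.0 for g in GENRES]
        rows.append([movie_id, year, ave_rating] + one_hot)
    return np.array(rows)

rows = item_rows(6874, 2003, 4.0, "Action|Crime|Thriller")
print(rows.shape)  # (3, 17): movie id, year, average rating, 14 genre columns
```

Each of the three rows carries the same id, year and average rating and differs only in which genre column is set to 1.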

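The user content is composed of engineered features only. A per-genre average rating is computed for each user. Additionally, a user id, rating count and rating average are available, but they are not included in the training or prediction content. They are useful for interpreting the data.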
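
A per-genre average for one user might be computed along these lines. The ratings and the shortened genre list are hypothetical, and filling unrated genres with 0.0 is an assumption based on the zeros that appear in the user table shown later in this lab.

```python
import numpy as np

# Hypothetical ratings by one user: (rating, genres of the rated movie).
ratings = [(4.0, ["Action", "Crime"]),
           (3.0, ["Action"]),
           (5.0, ["Comedy"])]
genres = ["Action", "Comedy", "Crime", "Drama"]  # shortened genre list for illustration

totals = {g: 0.0 for g in genres}
counts = {g: 0 for g in genres}
for rating, movie_genres in ratings:
    for g in movie_genres:
        totals[g] += rating
        counts[g] += 1

# Assumption: genres the user never rated get 0.0.
genre_ave = np.array([totals[g] / counts[g] if counts[g] else 0.0 for g in genres])
print(genre_ave)
```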

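The training set consists of all the ratings made by the users in the data set. The user and movie/item vectors are presented to the above network together as a training set. The user vector is the same for all the movies rated by that user. 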

Below, let's load and display some of the data.

In [2]:
# Load Data, set configuration variables
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()

num_user_features = user_train.shape[1] - 3  # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1] - 1  # remove movie id at train time
uvs = 3  # user genre vector start
ivs = 3  # item genre vector start
u_s = 3  # start of columns to use in training, user
i_s = 1  # start of columns to use in training, items
scaledata = True  # applies the standard scaler to data if true
print(f"Number of training vectors: {len(item_train)}")

Number of training vectors: 58187


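Some of the user and item/movie features are not used in training. Below, the features in brackets "[]", such as "user id", "rating count" and "rating ave", are not included when the model is trained and used. Note that the user vector is the same for all the movies rated.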

In [3]:
pprint_train(user_train, user_features, uvs,  u_s, maxcount=5)

[user id],[rating count],[rating ave],Act ion,Adve nture,Anim ation,Chil dren,Com edy,Crime,Docum entary,Drama,Fan tasy,Hor ror,Mys tery,Rom ance,Sci -Fi,Thri ller
2,16,4.1,3.9,5.0,0.0,0.0,4.0,4.2,4.0,4.0,0.0,3.0,4.0,0.0,4.2,3.9
2,16,4.1,3.9,5.0,0.0,0.0,4.0,4.2,4.0,4.0,0.0,3.0,4.0,0.0,4.2,3.9
2,16,4.1,3.9,5.0,0.0,0.0,4.0,4.2,4.0,4.0,0.0,3.0,4.0,0.0,4.2,3.9
2,16,4.1,3.9,5.0,0.0,0.0,4.0,4.2,4.0,4.0,0.0,3.0,4.0,0.0,4.2,3.9
2,16,4.1,3.9,5.0,0.0,0.0,4.0,4.2,4.0,4.0,0.0,3.0,4.0,0.0,4.2,3.9


In [4]:
pprint_train(item_train, item_features, ivs, i_s, maxcount=5, user=False)

[movie id],year,ave rating,Act ion,Adve nture,Anim ation,Chil dren,Com edy,Crime,Docum entary,Drama,Fan tasy,Hor ror,Mys tery,Rom ance,Sci -Fi,Thri ller
6874,2003,4.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6874,2003,4.0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
6874,2003,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
8798,2004,3.8,1,0,0,0,0,0,0,0,0,0,0,0,0,0
8798,2004,3.8,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [5]:
print(f"y_train[:5]: {y_train[:5]}")

y_train[:5]: [4.  4.  4.  3.5 3.5]


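Above, we can see that movie 6874 is an action movie released in 2003. User 2 rates action movies 3.9 on average. Movie 6874 is also listed in the Crime and Thriller genres. MovieLens users gave the movie an average rating of 4. A training example consists of a row from each of the two tables and a rating from y_train.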
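
The bracketed columns are dropped by slicing from the start indices `u_s` and `i_s` defined in the configuration cell. A small sketch with placeholder rows (the values are arbitrary; only the column layout mirrors the lab):

```python
import numpy as np

u_s, i_s = 3, 1  # same start columns as in the lab's configuration cell

# Placeholder rows with the lab's layout: 3 bookkeeping columns + 14 genre averages
# for users, and 1 id column + year + average rating + 14 one-hot genres for items.
user_row = np.arange(17, dtype=float).reshape(1, -1)
item_row = np.arange(17, dtype=float).reshape(1, -1)

user_input = user_row[:, u_s:]   # drops [user id], [rating count], [rating ave]
item_input = item_row[:, i_s:]   # drops [movie id]
print(user_input.shape, item_input.shape)  # (1, 14) (1, 16)
```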

<a name="2.2"></a>
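### 2.2 Preparing the training data
Recall that in Course 1, Week 2, you explored feature scaling as a means of improving convergence. We'll scale the input features using the [scikit learn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). This was used in Course 1, Week 2, Lab 5. Below, inverse_transform is also shown to reproduce the original inputs.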

In [6]:
# scale training data
if scaledata:
    item_train_save = item_train
    user_train_save = user_train

    scalerItem = StandardScaler()
    scalerItem.fit(item_train)
    item_train = scalerItem.transform(item_train)

    scalerUser = StandardScaler()
    scalerUser.fit(user_train)
    user_train = scalerUser.transform(user_train)

    print(np.allclose(item_train_save, scalerItem.inverse_transform(item_train)))
    print(np.allclose(user_train_save, scalerUser.inverse_transform(user_train)))

True
True
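
For intuition, the transform and its inverse can be sketched in plain NumPy. This is an illustration of z-score scaling with made-up rows, not a replacement for `StandardScaler`.

```python
import numpy as np

X = np.array([[2003.0, 4.0],
              [2004.0, 3.8],
              [2010.0, 3.0]])   # hypothetical (year, average rating) rows

mean = X.mean(axis=0)
std = X.std(axis=0)             # population standard deviation (ddof=0)
std[std == 0.0] = 1.0           # avoid dividing by zero for constant columns

X_scaled = (X - mean) / std     # forward transform
X_back = X_scaled * std + mean  # inverse transform

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X, X_back))                   # True
```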


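To allow us to evaluate the results, we will split the data into training and test sets as discussed in Course 2, Week 3. Here we will use [sklearn train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split and shuffle the data. Note that setting the initial random state to the same value ensures that item, user, and y are shuffled identically.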

In [7]:
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test       = train_test_split(y_train,    train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test  data shape: {item_test.shape}")

movie/item training data shape: (46549, 17)
movie/item test  data shape: (11638, 17)
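
Why a shared seed keeps the three arrays aligned can be illustrated with NumPy alone. This sketch does not reproduce the internals of `train_test_split`; it only shows that two generators seeded identically produce the same permutation for arrays of the same length.

```python
import numpy as np

n = 10
items = np.arange(n) * 10      # hypothetical item rows: row k holds 10*k
users = np.arange(n) * 100     # hypothetical user rows: row k holds 100*k

# Same seed -> same permutation, so row k of one array still pairs with row k of the other.
perm_items = np.random.RandomState(1).permutation(n)
perm_users = np.random.RandomState(1).permutation(n)

items_shuffled = items[perm_items]
users_shuffled = users[perm_users]
print(np.array_equal(items_shuffled * 10, users_shuffled))  # True
```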


The scaled, shuffled data now has a mean of approximately zero.

In [8]:
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

[user id],[rating count],[rating ave],Act ion,Adve nture,Anim ation,Chil dren,Com edy,Crime,Docum entary,Drama,Fan tasy,Hor ror,Mys tery,Rom ance,Sci -Fi,Thri ller
1,0,0.6,0.7,0.6,0.6,0.7,0.7,0.5,0.7,0.2,0.3,0.3,0.5,0.5,0.8,0.5
0,0,1.6,1.5,1.7,0.9,1.0,1.4,0.8,-1.2,1.2,1.2,1.6,0.9,1.4,1.2,1.0
0,0,0.8,0.6,0.7,0.5,0.6,0.6,0.3,-1.2,0.7,0.8,0.9,0.6,0.2,0.6,0.6
1,0,-0.1,0.2,-0.1,0.3,0.7,0.3,0.2,1.0,-0.5,-0.7,-2.1,0.5,0.7,0.3,0.0
-1,0,-1.3,-0.8,-0.8,0.1,-0.1,-1.1,-0.9,-1.2,-1.5,-0.6,-0.5,-0.6,-0.9,-0.4,-0.9


Scale the target ratings with a [scikit learn MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) so that they lie between -1 and 1. We use scikit-learn because it provides an inverse_transform.

In [9]:
scaler = MinMaxScaler((-1, 1))
scaler.fit(y_train.reshape(-1, 1))
ynorm_train = scaler.transform(y_train.reshape(-1, 1))
ynorm_test = scaler.transform(y_test.reshape(-1, 1))
print(ynorm_train.shape, ynorm_test.shape)

(46549, 1) (11638, 1)
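
The min-max mapping onto the range -1 to 1, and its inverse, can be written out directly. The ratings below are hypothetical; the formula is the usual min-max rescaling.

```python
import numpy as np

y = np.array([0.5, 2.0, 3.5, 5.0])   # hypothetical ratings
lo, hi = -1.0, 1.0                   # target range, as in MinMaxScaler((-1, 1))

y_min, y_max = y.min(), y.max()
y_norm = (y - y_min) / (y_max - y_min) * (hi - lo) + lo        # forward transform
y_back = (y_norm - lo) / (hi - lo) * (y_max - y_min) + y_min   # inverse transform

print(y_norm)   # smallest rating maps to -1, largest to 1
print(y_back)   # recovers the original ratings
```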


<a name="3"></a>
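## 3 - Neural Network for content-based filtering
Now, let's construct a neural network as described in the figure above. It will have two networks that are combined by a dot product. You will construct the two networks. In this example, they will be identical. Note that these networks do not need to be the same. If the user content were substantially larger than the movie content, you might elect to increase the complexity of the user network relative to the movie network. In this case, the content is similar, so the networks are the same.

- Use a Keras sequential model
    - The first layer is a dense layer with 256 units and a relu activation.
    - The second layer is a dense layer with 128 units and a relu activation.
    - The third layer is a dense layer with `num_outputs` units and a linear or no activation.   
    
The remainder of the network will be provided. The provided code does not use the Keras sequential model but instead uses the Keras [functional API](https://keras.io/guides/functional_api/). This format allows for more flexibility in how components are interconnected.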
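
The provided part of the network normalizes each output vector to unit length and then takes a dot product. A NumPy sketch of that computation, with made-up vectors standing in for the outputs of the two networks, shows that the result is a cosine similarity, which always lies between -1 and 1. That is consistent with scaling the target ratings to the same range.

```python
import numpy as np

def l2_normalize(v, axis=1, eps=1e-12):
    """NumPy sketch of row-wise L2 normalization."""
    norm = np.sqrt(np.maximum(np.sum(v * v, axis=axis, keepdims=True), eps))
    return v / norm

# Hypothetical outputs of the user and item networks for a batch of 2 examples.
vu = np.array([[3.0, 4.0], [1.0, 0.0]])
vm = np.array([[6.0, 8.0], [0.0, 2.0]])

vu_n = l2_normalize(vu)
vm_n = l2_normalize(vm)

# Row-wise dot product of unit vectors = cosine similarity.
pred = np.sum(vu_n * vm_n, axis=1, keepdims=True)
print(pred)  # first row is 1 (parallel vectors), second row is 0 (orthogonal vectors)
```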


In [10]:
# GRADED_CELL
# UNQ_C1

num_outputs = 32
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###   
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs, activation='linear'),
    ### END CODE HERE ###  
])

item_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs, activation='linear'),
    ### END CODE HERE ###  
])

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 14)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 16)]         0                                            
__________________________________________________________________________________________________
sequential (Sequential)         (None, 32)           40864       input_1[0][0]                    
__________________________________________________________________________________________________
sequential_1 (Sequential)       (None, 32)           41376       input_2[0][0]                    
______________________________________________________________________________________________
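
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper names below are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def tower_params(n_features, widths=(256, 128, 32)):
    """Total parameters of a 256-128-32 dense stack with n_features inputs."""
    total, n_in = 0, n_features
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total

print(tower_params(14))  # 40864, the user network (14 input features)
print(tower_params(16))  # 41376, the item network (16 input features)
```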

In [11]:
# Public tests
from public_tests import *
test_tower(user_NN)
test_tower(item_NN)

All tests passed!
All tests passed!


<details>
  <summary><font size="3" color="darkgreen"><b>Click for hints</b></font></summary>
    
  You can create a dense layer with a relu activation as shown.
    
```python     
user_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
  tf.keras.layers.Dense(256, activation='relu'),

    
    ### END CODE HERE ###  
])

item_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
  tf.keras.layers.Dense(256, activation='relu'),

    
    ### END CODE HERE ###  
])
```    
<details>
    <summary><font size="2" color="darkblue"><b> Click for solution</b></font></summary>
    
```python 
user_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_outputs),
    ### END CODE HERE ###  
])

item_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###     
  tf.keras.layers.Dense(256, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_outputs),
    ### END CODE HERE ###  
])
```
</details>
</details>

    


We will use a mean squared error loss and an Adam optimizer.

In [12]:
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
              loss=cost_fn)
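
For reference, the mean squared error is the average of the squared differences between targets and predictions. A tiny NumPy sketch with hypothetical values:

```python
import numpy as np

y_true = np.array([1.0, 0.0, -1.0])   # hypothetical scaled targets
y_pred = np.array([0.5, 0.0, -0.5])   # hypothetical model outputs

mse = np.mean(np.square(y_true - y_pred))
print(mse)  # (0.25 + 0 + 0.25) / 3
```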

In [13]:
tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], ynorm_train, epochs=30)

Train on 46549 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7fba9d2b5110>

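Evaluate the model to determine the loss on the test data. It is comparable to the training loss, indicating that the model has not substantially overfit the training data.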

In [14]:
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], ynorm_test)



0.10449595100221243
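
The loss above is measured on targets scaled to the range -1 to 1. To read it in rating units, take the square root and undo the scaling factor. The sketch assumes the training ratings span the full 0.5 to 5.0 range, which the lab does not state explicitly.

```python
import math

test_mse_scaled = 0.10449595100221243   # value reported by model.evaluate above

# Assumption: the training ratings span 0.5 to 5.0, so MinMaxScaler((-1, 1))
# maps a width of 4.5 rating points onto a width of 2.0.
rating_range = 5.0 - 0.5
scaled_range = 1.0 - (-1.0)

rmse_scaled = math.sqrt(test_mse_scaled)
rmse_rating = rmse_scaled * rating_range / scaled_range
print(round(rmse_rating, 2))  # 0.73
```

Under that assumption, the typical prediction error is roughly three quarters of a rating point.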

<a name="3.1"></a>
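### 3.1 Predictions
Below, you'll use your model to make predictions in a number of circumstances. 
#### Predictions for a new user
First, we'll create a new user and have the model suggest movies for that user. After you have tried this with the example user content, feel free to change the user content to match your own preferences and see what the model suggests. Note that ratings are between 0.5 and 5.0, inclusive, in half-step increments.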

In [15]:
new_user_id = 5000
new_rating_ave = 1.0
new_action = 1.0
new_adventure = 1
new_animation = 1
new_childrens = 1
new_comedy = 5
new_crime = 1
new_documentary = 1
new_drama = 1
new_fantasy = 1
new_horror = 1
new_mystery = 1
new_romance = 5
new_scifi = 5
new_thriller = 1
new_rating_count = 3

user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_action, new_adventure, new_animation, new_childrens,
                      new_comedy, new_crime, new_documentary,
                      new_drama, new_fantasy, new_horror, new_mystery,
                      new_romance, new_scifi, new_thriller]])


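Let's look at the top-rated movies for the new user. Recall that the user vector gives its highest ratings to the Comedy, Romance and Sci-Fi genres.
Below, we'll use a set of movie/item vectors, `item_vecs`, that has a vector for each movie in the training/test set. It is matched with the user vector above, and the scaled vectors are used to predict the new user's ratings for all the movies.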

In [16]:
# generate and replicate the user vector to match the number of movies in the data set.
user_vecs = gen_user_vecs(user_vec,len(item_vecs))

# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs,  item_vecs, model, u_s, i_s, 
                                                                       scaler, scalerUser, scalerItem, scaledata=scaledata)

print_pred_movies(sorted_ypu, sorted_user, sorted_items, movie_dict, maxcount = 10)

y_p,movie id,rating ave,title,genres
4.86762,64969,3.61765,Yes Man (2008),Comedy
4.86692,69122,3.63158,"Hangover, The (2009)",Comedy|Crime
4.86477,63131,3.625,Role Models (2008),Comedy
4.85853,60756,3.55357,Step Brothers (2008),Comedy
4.85785,68135,3.55,17 Again (2009),Comedy|Drama
4.85178,78209,3.55,Get Him to the Greek (2010),Comedy
4.85138,8622,3.48649,Fahrenheit 9/11 (2004),Documentary
4.8505,67087,3.52941,"I Love You, Man (2009)",Comedy
4.85043,69784,3.65,Brüno (Bruno) (2009),Comedy
4.84934,89864,3.63158,50/50 (2011),Comedy|Drama
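
Replicating one user vector so that it pairs with every item row can be sketched with `np.tile`. The implementation of `gen_user_vecs` in `recsysNN_utils` is not shown in this lab, so this is an assumption about its behaviour, using a shortened hypothetical vector.

```python
import numpy as np

user_vec = np.array([[5000, 3, 1.0, 1.0, 5.0]])   # shortened hypothetical user vector
num_items = 4                                      # stands in for len(item_vecs)

user_vecs = np.tile(user_vec, (num_items, 1))      # one identical row per item
print(user_vecs.shape)  # (4, 5)
```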


If you do create a user above, it is worth noting that the network was trained to predict a user rating given a user vector that includes a **set** of user genre ratings.  Simply providing a maximum rating for a single genre and minimum ratings for the rest may not be meaningful to the network if there were no users with similar sets of ratings.

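#### Predictions for an existing user
Let's look at the predictions for "user 36", one of the users in the data set. We can compare the predicted ratings with the user's actual ratings. Note that movies with multiple genres show up multiple times in the training data. For example, "The Time Machine" has three genres: Adventure, Action, Sci-Fi.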

In [17]:
uid =  36 
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, scalerUser.inverse_transform(user_train), item_vecs, user_to_genre)

# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs, item_vecs, model, u_s, i_s, scaler, 
                                                                      scalerUser, scalerItem, scaledata=scaledata)
sorted_y = y_vecs[sorted_index]

#print sorted predictions
print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_items, item_features, ivs, uvs, movie_dict, maxcount = 10)

y_p,y,user,user genre ave,movie rating ave,title,genres
3.1,3.0,36,3.0,2.86,"Time Machine, The (2002)",Adventure
3.0,3.0,36,3.0,2.86,"Time Machine, The (2002)",Action
2.8,3.0,36,3.0,2.86,"Time Machine, The (2002)",Sci-Fi
2.3,1.0,36,1.0,4.0,"Beautiful Mind, A (2001)",Romance
2.2,1.0,36,1.5,4.0,"Beautiful Mind, A (2001)",Drama
1.6,1.5,36,1.75,3.52,Road to Perdition (2002),Crime
1.6,2.0,36,1.75,3.52,Gangs of New York (2002),Crime
1.5,1.5,36,1.5,3.52,Road to Perdition (2002),Drama
1.5,2.0,36,1.5,3.52,Gangs of New York (2002),Drama


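#### Finding similar items
The neural network above produces two feature vectors, a user feature vector $v_u$ and a movie feature vector $v_m$. These are 32-entry vectors whose values are difficult to interpret. However, similar items will have similar vectors. This information can be used to make recommendations. For example, if a user has rated "Toy Story 3" highly, one could recommend similar movies by selecting movies with similar movie feature vectors.

A similarity measure is the squared distance between the two vectors $\mathbf{v_m^{(k)}}$ and $\mathbf{v_m^{(i)}}$: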
$$\left\Vert \mathbf{v_m^{(k)}} - \mathbf{v_m^{(i)}}  \right\Vert^2 = \sum_{l=1}^{n}(v_{m_l}^{(k)} - v_{m_l}^{(i)})^2\tag{1}$$

<a name="ex01"></a>
### Exercise 1

Write a function to compute the squared distance.

In [20]:
# GRADED_FUNCTION: sq_dist
# UNQ_C2
def sq_dist(a,b):
    """
    Returns the squared distance between two vectors
    Args:
      a (ndarray (n,)): vector with n features
      b (ndarray (n,)): vector with n features
    Returns:
      d (float) : squared distance
    """
    ### START CODE HERE ###     
    d = np.sum(np.square(a - b))  # element-wise difference, squared, then summed
    ### END CODE HERE ###     
    return d

In [21]:
# Public tests
test_sq_dist(sq_dist)

All tests passed!


In [22]:
a1 = np.array([1.0, 2.0, 3.0]); b1 = np.array([1.0, 2.0, 3.0])
a2 = np.array([1.1, 2.1, 3.1]); b2 = np.array([1.0, 2.0, 3.0])
a3 = np.array([0, 1, 0]);       b3 = np.array([1, 0, 0])
print(f"squared distance between a1 and b1: {sq_dist(a1, b1)}")
print(f"squared distance between a2 and b2: {sq_dist(a2, b2)}")
print(f"squared distance between a3 and b3: {sq_dist(a3, b3)}")

squared distance between a1 and b1: 0.0
squared distance between a2 and b2: 0.030000000000000054
squared distance between a3 and b3: 2


<details>
  <summary><font size="3" color="darkgreen"><b>Click for hints</b></font></summary>
    
  While a summation is often an indication a for loop should be used, here the subtraction can be element-wise in one statement. Further, you can use np.square to square, element-wise, the result of the subtraction. np.sum can be used to sum the squared elements.
    
</details>

    


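A matrix of distances between movies can be computed once when the model is trained and then reused for new recommendations without retraining. The first step, once a model is trained, is to obtain the movie feature vector, $v_m$, for each of the movies. To do this, we will use the trained `item_NN` and build a small model that lets us run the movie vectors through it to generate $v_m$.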

In [23]:
input_item_m = tf.keras.layers.Input(shape=(num_item_features,))   # input layer
vm_m = item_NN(input_item_m)                                       # use the trained item_NN
vm_m = tf.linalg.l2_normalize(vm_m, axis=1)                        # incorporate normalization as was done in the original model
model_m = tf.keras.Model(input_item_m, vm_m)                       # maps scaled item features to v_m
model_m.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 16)]         0                                            
__________________________________________________________________________________________________
sequential_1 (Sequential)       (None, 32)           41376       input_3[0][0]                    
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_2/Squa [(None, 32)]         0           sequential_1[1][0]               
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_2/Sum  [(None, 1)]          0           tf_op_layer_l2_normalize_2/Square
____________________________________________________________________________________________

Once you have a movie model, you can create a set of movie feature vectors by using the model to predict using a set of item/movie vectors as input. `item_vecs` is a set of all of the movie vectors. Recall that the same movie will appear as a separate vector for each of its genres. It must be scaled to use with the trained model. The result of the prediction is a 32 entry feature vector for each movie.

In [24]:
scaled_item_vecs = scalerItem.transform(item_vecs)
vms = model_m.predict(scaled_item_vecs[:,i_s:])
print(f"size of all predicted movie feature vectors: {vms.shape}")

size of all predicted movie feature vectors: (1883, 32)
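
Because `model_m` ends with an L2 normalization, each predicted vector has unit length. For unit vectors the squared distance reduces to $\Vert a - b \Vert^2 = 2 - 2\,a \cdot b$, so a small distance is the same thing as a high cosine similarity. A quick numerical check with random stand-in vectors (not the real `vms`):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=(5, 32))                        # hypothetical raw network outputs
v = v / np.linalg.norm(v, axis=1, keepdims=True)    # unit-length rows, like the l2_normalize step

a, b = v[0], v[1]
sq = np.sum(np.square(a - b))
print(np.isclose(sq, 2.0 - 2.0 * np.dot(a, b)))  # True
```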


Let's now compute a matrix of the squared distance between each movie feature vector and all other movie feature vectors:
<figure>
    <center> <img src="./images/distmatrix.PNG"   style="width:400px;height:225px;" ></center>
</figure>

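We can then find the closest movie by finding the minimum along each row. We will make use of [numpy masked arrays](https://numpy.org/doc/1.21/user/tutorial-ma.html) to avoid selecting the same movie. The masked values along the diagonal won't be included in the computation.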

In [25]:
count = 50
dim = len(vms)
dist = np.zeros((dim,dim))

for i in range(dim):
    for j in range(dim):
        dist[i,j] = sq_dist(vms[i, :], vms[j, :])
        
m_dist = ma.masked_array(dist, mask=np.identity(dist.shape[0]))  # mask the diagonal

disp = [["movie1", "genres", "movie2", "genres"]]
for i in range(count):
    min_idx = np.argmin(m_dist[i])
    movie1_id = int(item_vecs[i,0])
    movie2_id = int(item_vecs[min_idx,0])
    genre1,_  = get_item_genre(item_vecs[i,:], ivs, item_features)
    genre2,_  = get_item_genre(item_vecs[min_idx,:], ivs, item_features)

    disp.append( [movie_dict[movie1_id]['title'], genre1,
                  movie_dict[movie2_id]['title'], genre2]
               )
table = tabulate.tabulate(disp, tablefmt='html', headers="firstrow", floatfmt=[".1f", ".1f", ".0f", ".2f", ".2f"])
table

movie1,genres,movie2,genres
Save the Last Dance (2001),Drama,John Q (2002),Drama
Save the Last Dance (2001),Romance,Saving Silverman (Evil Woman) (2001),Romance
"Wedding Planner, The (2001)",Comedy,National Lampoon's Van Wilder (2002),Comedy
"Wedding Planner, The (2001)",Romance,Mr. Deeds (2002),Romance
Hannibal (2001),Horror,Final Destination 2 (2003),Horror
Hannibal (2001),Thriller,"Sum of All Fears, The (2002)",Thriller
Saving Silverman (Evil Woman) (2001),Comedy,Cats & Dogs (2001),Comedy
Saving Silverman (Evil Woman) (2001),Romance,Save the Last Dance (2001),Romance
Down to Earth (2001),Comedy,Joe Dirt (2001),Comedy
Down to Earth (2001),Fantasy,"Haunted Mansion, The (2003)",Fantasy


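The results show that the model generally suggests a movie from the same genre: in every row shown above, the suggested movie shares the listed genre with the query movie.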

<a name="4"></a>
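## 4 - Congratulations! <img align="left" src="./images/film_award.png" style=" width:40px;">
You have completed a content-based recommender system.    

This structure is the basis of many commercial recommender systems. The user content can be greatly expanded to incorporate more information about the user if it is available. Items are not limited to movies. The same approach can be used to recommend any kind of item, such as books or cars, or items that are similar to an item in your "shopping cart".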