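# How to Make Recommendations with Recommender Libraries and Algorithms

**Common recommendation algorithms include collaborative filtering based on the SVD family and second-order interaction models such as FM and DeepFM.**

This notebook implements these basic methods as far as possible and evaluates them with RMSE.

This notebook is adapted from the following notebooks:
**[morrisb](https://www.kaggle.com/morrisb/how-to-recommend-anything-deep-recommender)**, **[siavrez](https://www.kaggle.com/siavrez/deepfm-model)**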
***
+ [1. Loading Libraries](#1)<br>
+ [2. Loading the Item Files](#2)<br>
+ [3. Loading the User Files](#3)<br>
+ [4. Filtering Sparse Users and Items](#4)<br>
+ [5. Creating the Training and Test Sets](#5)<br>
+ [6. Converting User-Ratings to a User-Item-Rating Matrix](#6)<br>
+ [7. Recommendation Engines](#7)<br>
 + [7.1. Mean Rating](#7.1)<br>
 + [7.2. Weighted Mean Rating](#7.2)<br>
 + [7.3. Cosine User-User Similarity](#7.3)<br>
 + [7.4. Matrix Factorization With Keras And Gradient Descent](#7.4)<br>
 + [7.5. Deep Learning With Keras](#7.6)<br>
+ [8. Exploring Python Libraries](#8)<br>
 + [8.1. Surprise Library](#8.1)<br>
 + [8.2. LightFM Library](#8.2)<br>
 + [8.3. DeepCTR Library](#8.3)<br>
+ [9. Conclusion](#9)<br>
***
## <a id=1>1. Loading Libraries</a>

In [1]:
!pip install deepctr


In [2]:
# To store the data
import pandas as pd

# To do linear algebra
import numpy as np

# To create plots
import matplotlib.pyplot as plt

# # To create interactive plots
# from plotly.offline import init_notebook_mode, plot, iplot, download_plotlyjs
# import plotly as py
# import plotly.graph_objs as go
# # init_notebook_mode(connected=True)
# To work with files
import os

# To shift lists
from collections import deque

# To compute similarities between vectors
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# To use recommender systems
import surprise as sp
from surprise.model_selection import cross_validate

# To create deep learning models
from keras.layers import Input, Embedding, Reshape, Dot, Concatenate, Dense, Dropout
from keras.models import Model

# To create sparse matrices
from scipy.sparse import coo_matrix

# To use LightFM
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# To use DeepCTR
from deepctr.inputs import SparseFeat, DenseFeat, get_feature_names
from deepctr.models import DeepFM, xDeepFM, DCN, DIN, DSIN, DIEN

# To stack sparse matrices
from scipy.sparse import vstack

Using TensorFlow backend.


## <a id=2>2. Loading the Item Files</a>

In [3]:
# Load the netflix-prize-data dataset
os.listdir('../input/netflix-prize-data/')
# qualifying.txt: the file of predictions to submit
# MovieID1:
# CustomerID11,Date11
# CustomerID12,Date12
# ->
# MovieID1:
# Rating11
# Rating12

# probe.txt: similar to qualifying.txt, but without the Date column

# movie_titles.csv: movie information, in the format MovieId, YearOfRelease, Title
# combined_data_1/2/3/4.txt: training set; "MovieID:" header lines followed by rows in the format CustomerID (user), Rating, Date

['qualifying.txt',
 'movie_titles.csv',
 'combined_data_4.txt',
 'combined_data_2.txt',
 'README',
 'combined_data_1.txt',
 'combined_data_3.txt',
 'probe.txt']

In [4]:
movie_netflix = pd.read_csv('../input/netflix-prize-data/movie_titles.csv', 
                           encoding = 'ISO-8859-1', 
                           header = None, 
                           names = ['Id', 'Year', 'Name']).set_index('Id')

print('Shape Movie-Titles:\t{} \n Contains {} items'.format(movie_netflix.shape, movie_netflix.shape[0]))
movie_netflix.sample(5)

Shape Movie-Titles:	(17770, 2) 
 Contains 17770 items


| Id | Year | Name |
|---|---|---|
| 16623 | 1976.0 | Bugsy Malone |
| 5950 | 2004.0 | Evel Knievel |
| 12808 | 1997.0 | Velocity Trap |
| 1719 | 2004.0 | The Life Aquatic with Steve Zissou |
| 379 | 1996.0 | Crash Dive |


In [5]:
# Load the-movies-dataset
# os.listdir('../input/the-movies-dataset')
# movies_metadata.csv: movie metadata, 24 features per movie
# keywords.csv: id-keywords, the keyword information for each movie
# credits.csv: id-cast-crew, the cast and crew for each movie
# links.csv: id-imdbid-tmdbid, the different identifiers that different movie platforms use for the same movie
# ratings_small.csv: rating data, userId-movieId-rating-timestamp

In [6]:
# The low_memory=False keyword
# With low_memory=False, pandas reads the whole CSV at once and infers each column's dtype in a single pass, which avoids the mixed-types problem within a column.
# The drawback is memory use: if the CSV file is too large, this can run out of memory.
# movie_metadata = pd.read_csv('../input/the-movies-dataset/movies_metadata.csv', low_memory=False)[['original_title', 'id', 'release_date', 'vote_count']].set_index('id')
# # Remove samples with 10 or fewer votes
# movie_metadata = movie_metadata[movie_metadata['vote_count']>10].drop('vote_count', axis=1)

# print('Shape Movie-Metadata:\t{}\n Contains {} items'.format(movie_metadata.shape, movie_metadata.shape[0]))
# movie_metadata.sample(5)

In [7]:
# Load the movielens-20m dataset
# os.listdir('../input/movielens-20m-dataset/')
# tag.csv: userId-movieId-tag-timestamp
# rating.csv: userId-movieId-rating-timestamp
# movie.csv: movieId-title-genres
# link.csv: movieId-imdbId-tmdbId
# genome_scores.csv: movieId-tagId-relevance
# genome_tags.csv: tagId-tag

In [8]:
# movie_movielens = pd.read_csv('../input/movielens-20m-dataset/movie.csv').set_index('movieId')
# print('Shape MovieLens-movie:\t{}\n Contains {} items'.format(movie_movielens.shape, movie_movielens.shape[0]))
# movie_movielens.head(5)

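## <a id=3>3. Loading the User Files</a>
Each user record resembles a transaction in association-rule mining. The columns of every User-Item-Rating table are unified as userId-itemId-rating (the item column is named movieId in the code).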
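The `combined_data_*.txt` files interleave `MovieID:` header lines with `CustomerID,Rating,Date` rows, which is why the loading cell has to recover the movie id for every rating. As a rough sketch of that idea on a much smaller scale, the pure-Python generator below walks such lines one by one. The function name `parse_netflix_lines` and the sample rows are made up for illustration; the notebook itself uses a pandas-based approach.

```python
from typing import Iterable, Iterator, Tuple


def parse_netflix_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, float, str]]:
    """Yield (movieId, userId, rating, date) tuples.

    Hypothetical helper: 'MovieID:' lines set the current movie,
    every other non-empty line is 'CustomerID,Rating,Date'.
    """
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(':'):
            # Header row: remember the movie id and drop the trailing ':'
            movie_id = int(line[:-1])
            continue
        if movie_id is None:
            raise ValueError('rating row found before any "MovieID:" header')
        user_id, rating, date = line.split(',')
        yield movie_id, int(user_id), float(rating), date


# Toy input in the same layout (values are invented, not taken from the dataset)
sample_lines = ["1:", "101,3,2005-01-01", "102,5,2005-01-02", "2:", "101,4,2005-02-01"]
rows = list(parse_netflix_lines(sample_lines))
```

A streaming parser like this keeps memory use low, while the pandas version in the notebook trades memory for vectorized speed on the full file.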

In [9]:
# Load single data-file 
# combined_data_1 = pd.read_csv('../input/netflix-prize-data/combined_data_1.txt', header=None, names=['User', 'Rating', 'Date'], usecols=[0, 1, 2])
# combined_data_2 = pd.read_csv('../input/netflix-prize-data/combined_data_2.txt', header=None, names=['User', 'Rating', 'Date'], usecols=[0, 1, 2])
# combined_data_3 = pd.read_csv('../input/netflix-prize-data/combined_data_3.txt', header=None, names=['User', 'Rating', 'Date'], usecols=[0, 1, 2])
# combined_data_4 = pd.read_csv('../input/netflix-prize-data/combined_data_4.txt', header=None, names=['User', 'Rating', 'Date'], usecols=[0, 1, 2])
# df_raw = pd.concat([combined_data_1, combined_data_2, combined_data_3, combined_data_4], axis=0).reset_index()
# Given that netflix-prize-data contains ... (only combined_data_1.txt is loaded here)
df_raw = pd.read_csv('../input/netflix-prize-data/combined_data_1.txt', header=None, names=['userId', 'rating', 'Date'], usecols=[0, 1, 2])
print('Shape Raw Data:\t{}'.format(df_raw.shape))

# Find empty rows to slice dataframe for each movie
# Approach: find the indices of the rows with a missing rating (the "MovieID:" header rows), then slice the ratings between consecutive indices
tmp_movies = df_raw[df_raw['rating'].isna()]['userId'].reset_index()
movie_indices = [[index, int(movie[:-1])] for index, movie in tmp_movies.values] # drop ':'

# Shift the movie_indices by one to get start and endpoints of all movies
shifted_movie_indices = deque(movie_indices)
shifted_movie_indices.rotate(-1)  # the first element becomes the last element


# Gather all dataframes
user_data = []

# Iterate over all movies
for [df_id_1, movie_id], [df_id_2, next_movie_id] in zip(movie_indices, shifted_movie_indices):
    
    # Check if it is the last movie in the file
    if df_id_1<df_id_2:
        tmp_df = df_raw.loc[df_id_1+1:df_id_2-1].copy()
    else:
        tmp_df = df_raw.loc[df_id_1+1:].copy()
        
    # Create movie_id column
    tmp_df['movieId'] = movie_id
    
    # Append dataframe to list
    user_data.append(tmp_df)

# Combine all dataframes
netflix_prize_User = pd.concat(user_data)
del user_data, df_raw, tmp_movies, tmp_df, shifted_movie_indices, movie_indices, df_id_1, movie_id, df_id_2, next_movie_id
print('Shape User-Ratings:\t{}'.format(netflix_prize_User.shape))
netflix_prize_User.sample(5)

Shape Raw Data:	(24058263, 3)
Shape User-Ratings:	(24053764, 4)


| | userId | rating | Date | movieId |
|---|---|---|---|---|
| 9926947 | 211138 | 3.0 | 2005-07-26 | 1925 |
| 14404112 | 377942 | 4.0 | 2005-07-07 | 2782 |
| 14557200 | 765929 | 4.0 | 2003-12-26 | 2800 |
| 21580856 | 2460896 | 4.0 | 2005-04-06 | 4056 |
| 6409512 | 1996259 | 4.0 | 2003-10-02 | 1255 |


In [10]:
# movie_dataset_User = pd.read_csv('../input/the-movies-dataset/ratings.csv', low_memory=False)
# print('Shape User-Ratings:\t{}'.format(movie_dataset_User.shape))
# movie_dataset_User.head(5)

In [11]:
# movielens_movie_User = pd.read_csv('../input/movielens-20m-dataset/rating.csv')
# print('Shape MovieLens-movie:\t{}'.format(movielens_movie_User.shape))
# movielens_movie_User.head(5)

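## <a id=4>4. Filtering Sparse Users and Items</a>
Users who interact little with the rating system (those who rated only a few items) are filtered out, and so are items that were rated by only a few users. This is mainly to make testing easier; in a real production environment, sparse users and items should receive special handling, for example with an LR model or a deep model.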

In [12]:
def filter_user_item(user_item_rating, min_nb_item_ratings=300, min_nb_user_ratings=200):
    filter_items = (user_item_rating['movieId'].value_counts() > min_nb_item_ratings)
    filter_items = filter_items[filter_items].index.tolist()
    
    filter_users = (user_item_rating['userId'].value_counts() > min_nb_user_ratings)
    filter_users = filter_users[filter_users].index.tolist()
    filter_ret = user_item_rating[(user_item_rating['movieId'].isin(filter_items)) & (user_item_rating['userId'].isin(filter_users))]
    print('Shape User-Ratings unfiltered:\t{}'.format(user_item_rating.shape))
    print('Shape User-Ratings filtered:\t{}'.format(filter_ret.shape))
    return filter_ret

In [13]:
# netflix_prize_User
filtered_netflix_prize_User = filter_user_item(netflix_prize_User)
# filtered_movie_dataset_User = filter_user_item(movie_dataset_User)
# filtered_movielens_movie_User = filter_user_item(movielens_movie_User)

Shape User-Ratings unfiltered:	(24053764, 4)
Shape User-Ratings filtered:	(6168476, 4)


In [14]:
del netflix_prize_User#, movie_dataset_User, movielens_movie_User

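## <a id=5>5. Creating the Training and Test Sets</a>
The data is split into a training set and a test set so that model performance can be validated with a recommender-system evaluation metric. Since the rating is treated as a continuous value, RMSE (root mean square error) is used:
$$RMSE=\sqrt{\frac{\sum_{i=1}^{N} (y_i-z_i)^2}{N}}$$
where $y_i$ is the true rating, $z_i$ is the predicted rating, and $N$ is the number of test ratings.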
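To make the metric concrete, here is a minimal pure-Python sketch of the RMSE formula. The helper name `rmse` and the toy ratings are assumptions for illustration; the notebook itself computes the same quantity with `np.sqrt(mean_squared_error(...))`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError('inputs must be non-empty and of equal length')
    squared_errors = [(y - z) ** 2 for y, z in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy ratings (invented): errors are 1, 0, 1 and -2, so the mean squared error is 6 / 4 = 1.5
toy_true = [4.0, 3.0, 5.0, 1.0]
toy_pred = [3.0, 3.0, 4.0, 3.0]
toy_rmse = rmse(toy_true, toy_pred)
```

Because the errors are squared before averaging, the single miss of 2 stars contributes four times as much as each miss of 1 star.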

In [15]:
def get_train_test(filtered_user_item, test_size=0.5):
    X_train, X_test, _, _ = train_test_split(filtered_user_item.reset_index(), filtered_user_item['movieId'].values, test_size=test_size, random_state=2020, stratify=filtered_user_item['movieId'].values)
    return X_train, X_test

In [16]:
# train_data1, test_data1 = get_train_test(filtered_movie_dataset_User)
# movieId1 = train_data1.movieId
# userId1 = train_data1.userId
# train_data2, test_data2 = get_train_test(filtered_movielens_movie_User)
# movieId2 = train_data2.movieId
# userId2 = train_data2.userId
train_data3, test_data3 = get_train_test(filtered_netflix_prize_User)
movieId3 = train_data3.movieId
userId3 = train_data3.userId
# del filtered_movie_dataset_User, filtered_movielens_movie_User, filtered_netflix_prize_User
# del filtered_netflix_prize_User

## <a id=6>6. Converting User-Ratings to a User-Item-Rating Matrix</a>
The pivot produces a DataFrame with userId as the index and itemId (movieId) as the columns; each cell holds the corresponding rating.

In [17]:
def get_user_item_rating_mat(data):
    return data.pivot_table(index='userId', columns='movieId', values='rating')

In [18]:
# train_data1 = get_user_item_rating_mat(train_data1)
# train_data2 = get_user_item_rating_mat(train_data2)
matrix_train_data3 = get_user_item_rating_mat(train_data3)
# train_data1.sample(4), train_data2.sample(4), train_data3.sample(4)
matrix_train_data3.head(5)

userId/movieId,1,3,5,6,8,12,16,17,18,19,23,24,25,26,28,29,30,32,33,35,36,38,39,40,44,45,46,47,48,50,52,55,56,57,58,68,70,71,73,74,...,4440,4441,4442,4444,4447,4448,4449,4450,4451,4452,4454,4456,4459,4460,4461,4463,4465,4467,4468,4470,4472,4473,4474,4476,4478,4479,4482,4483,4484,4485,4488,4489,4490,4491,4492,4493,4495,4496,4497,4499
1000079,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,
1000192,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,3.0,,,,,,...,,,,,,,,,,,,,,,,,5.0,,,,,,,,,5.0,,,,,,,,,,,,,,
1000301,,,,,,,,,4.0,,,,,,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,,4.0,,,,,,,,,
1000387,,,,,,,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,4.0,2.0,5.0,,,,,...,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,1.0,,,,,,,,,,
1000410,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,,,,,,,3.0,,


In [19]:
# train_data1.to_csv('train_data1.csv', index=False, header=None)
# train_data2.to_csv('train_data2.csv', index=False, header=None)
# train_data3.to_csv('train_data3.csv', index=False, header=None)

In [20]:
# del train_data1, train_data2, train_data3
# del train_data3

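As the output above shows, the user-item-rating matrix contains a large number of NaN values, which are not valid input for PureSVD. To use the PureSVD algorithm, the NaN values in the matrix must be filled first.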
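As a hedged illustration of that point, the sketch below fills the missing entries and then applies a truncated SVD with NumPy. Filling with the item (column) mean is only one possible choice and is an assumption here, as are the function name `pure_svd_predict` and the toy matrix; the notebook does not run PureSVD itself.

```python
import numpy as np


def pure_svd_predict(ratings, rank):
    """Fill NaN with column means, then return the rank-`rank` SVD approximation."""
    ratings = np.asarray(ratings, dtype=float)
    # Assumption: impute each missing rating with the mean of the observed ratings for that item
    item_means = np.nanmean(ratings, axis=0)
    filled = np.where(np.isnan(ratings), item_means, ratings)
    # Thin SVD of the dense, imputed matrix
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    # Keep only the `rank` largest singular values
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


# Toy 4-user x 3-item matrix (invented values), NaN marks a missing rating
toy = np.array([[5.0, np.nan, 1.0],
                [4.0, 2.0, np.nan],
                [np.nan, 1.0, 5.0],
                [1.0, 1.0, 4.0]])
approx = pure_svd_predict(toy, rank=2)
```

The entries of `approx` at the NaN positions serve as the predicted ratings. Note that a dense imputed matrix of the size used in this notebook would be large, so this approach is mainly practical on filtered or sampled data.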

## <a id=7>7. Recommendation Engines</a>
### <a id=7.1>7.1. Mean Rating</a>
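Each movie's mean rating is used as the final prediction. This makes the ratings biased: widely watched movies (columns with few NaN values) are affected and tend to end up with lower means, which in turn skews the results toward movies that were rated by only a few users.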

In [21]:
def mean_rating(train, test):
    # axis=0: apply the method down each column (across the row index)
    # axis=1: apply the method along each row (across the column labels)
    ratings_mean = train.mean(axis=0).rename('rating_mean')
    df_pred = test.set_index('movieId').join(ratings_mean)[['rating', 'rating_mean']]
#     df_pred.fillna(df_pred.mean(), inplace=True)
    rmse = np.sqrt(mean_squared_error(y_true=df_pred['rating'], y_pred=df_pred['rating_mean']))
    print("mean rating's rmse is {}".format(rmse))
    return rmse  # return the score so that the caller's assignment is not None

In [22]:
# train_data3 = pd.read_csv('./train_data3.csv',header=None)
# train_data3.head(5)

In [23]:
# train_data3.index = userId3
# train_data3.columns = movieId3 

In [24]:
# train_data3 = pd.read_csv('./train_data3.csv', header=None, index_col=userId3.values, names=movieId3.values)
# mean_rating_data1 = mean_rating(train_data3, test_data3)
# del train_data3

In [25]:
# train_data1 = pd.read_csv('./train_data1.csv')
# mean_rating_data1 = mean_rating(train_data1, test_data1)
# del train_data1
# train_data2 = pd.read_csv('./train_data2.csv')
# mean_rating_data2 = mean_rating(train_data2, test_data2)
# del train_data2
# train_data3 = pd.read_csv('./train_data3.csv')
# mean_rating_data3 = mean_rating(train_data3, test_data3)
# del train_data3
mean_rating_data3 = mean_rating(matrix_train_data3, test_data3)

mean rating's rmse is 1.0125986596206626


### <a id=7.2>7.2. [Weighted Mean Rating](https://www.quora.com/How-does-IMDbs-rating-system-work)</a>
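Using the Bayesian estimate, the weighted rating is computed as follows:
$$(WR) = \frac{v}{v+m} \times R + \frac{m}{v+m} \times C$$
where $R$ is the movie's mean rating, $v$ is the number of votes for the movie, $m$ is the minimum number of votes required for the Top 250 (currently 25000 in IMDb's formula), and $C$ is the mean vote across the whole dataset (currently 7.0 in IMDb's formula).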

In [26]:
def weighted_mean_rating(train, test, m=1000):
    C = train.stack().mean()  # a single float: the mean of all observed ratings
    """
    Layout of the stacked data:
    userId1:
    movieId11, rating
    movieId12, rating
    userId2:
    movieId21, rating
    movieId22, rating
    """
    R = train.mean(axis=0).values # array with one entry per movie: the movie's mean rating
    v = train.count().values # array with one entry per movie: the number of users who rated it
    weighted_score = (v/ (v+m) *R) + (m/ (v+m) *C)
    df_prediction = test.set_index('movieId').join(pd.DataFrame(weighted_score, index=train.columns, columns=['prediction']))[['rating', 'prediction']]
    y_true = df_prediction['rating']
    y_pred = df_prediction['prediction']
    rmse = np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
    print('weighted mean rating"s rmse is {}'.format(rmse))
    return rmse

In [27]:
weighted_mean_rating_data3 = weighted_mean_rating(matrix_train_data3, test_data3, 50)

weighted mean rating"s rmse is 1.0133315324948378


Here $m$ is a hyperparameter: tuning $m$ changes the balance between the global mean rating and each movie's own mean rating.
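The effect of $m$ is easiest to see on two movies with the same raw mean but very different vote counts. The sketch below is a toy illustration: the helper `weighted_rating` and all numbers are made up, and only the formula itself comes from this section.

```python
def weighted_rating(R, v, C, m):
    """Bayesian weighted rating: shrink the movie mean R toward the global mean C."""
    return v / (v + m) * R + m / (v + m) * C


# Two invented movies that both have a raw mean of 5.0
few_votes = weighted_rating(R=5.0, v=10, C=3.5, m=50)      # only 10 votes
many_votes = weighted_rating(R=5.0, v=10000, C=3.5, m=50)  # 10000 votes

# With m = 0 no shrinkage happens and the movie mean is returned unchanged
no_shrinkage = weighted_rating(R=5.0, v=10, C=3.5, m=0)
```

With these numbers the sparsely rated movie is pulled down to 3.75, while the heavily rated one stays close to 5.0. A larger $m$ strengthens the pull toward $C$ for every movie.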

### <a id=7.3>7.3. Cosine User-User Similarity</a>
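Cosine similarity is computed between user vectors, and the similarities are then used as weights to average the neighbours' ratings of the target movie:
$$score_{ui}=\frac{\sum_{j} cosine_{uj} \cdot rating_{ji}}{\sum_{j} cosine_{uj}}$$
where $j$ runs over the top-n users most similar to user $u$ and $rating_{ji}$ is user $j$'s rating of movie $i$. Points to note: 1) the scaling factor (the denominator); 2) compared with the previous two methods, the prediction is finer-grained, down to the individual userId; 3) the top-n cut-off of the similarity ranking is a hyperparameter.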
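The similarity-weighted average can be sketched in a few lines of plain Python. Everything here is a toy assumption (the helper names `cosine` and `predict_from_neighbours`, and the three invented neighbours); the notebook works on the full mean-imputed matrix with `sklearn`'s `cosine_similarity`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_from_neighbours(target, neighbours, top_n):
    """neighbours: list of (rating_vector, rating_of_target_movie) pairs."""
    scored = [(cosine(target, vec), rating) for vec, rating in neighbours]
    # Keep the top_n most similar users
    scored = sorted(scored, reverse=True)[:top_n]
    weight_sum = sum(sim for sim, _ in scored)
    if weight_sum == 0:
        return None
    # Weighted average; the denominator is the scaling factor
    return sum(sim * rating for sim, rating in scored) / weight_sum


target_user = [5.0, 4.0, 1.0]
toy_neighbours = [([5.0, 5.0, 1.0], 5.0),   # very similar taste, rated the movie 5
                  ([1.0, 1.0, 5.0], 1.0),   # opposite taste, rated the movie 1
                  ([4.0, 4.0, 2.0], 4.0)]   # similar taste, rated the movie 4
pred_top2 = predict_from_neighbours(target_user, toy_neighbours, top_n=2)
pred_top3 = predict_from_neighbours(target_user, toy_neighbours, top_n=3)
```

Raising `top_n` from 2 to 3 lets the dissimilar user contribute, which lowers the prediction. This is the sensitivity to the top-n hyperparameter mentioned above.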

In [28]:
def cosine_u2u_similarity(train, test, n_recommendation=100):
    train_imputed = train.T.fillna(train.mean(axis=1)).T  # fill NaN with each user's mean rating
    similarity = cosine_similarity(train_imputed.values)  # cosine similarity between users
    similarity -= np.eye(similarity.shape[0]) # remove each user's similarity with itself
    
    prediction = []
    userId_idx_mapping = {userId:idx for idx, userId in enumerate(train_imputed.index)}
    for userId in test.userId.unique():
        similarity_user_index = np.argsort(similarity[userId_idx_mapping[userId]])[::-1]
        similarity_user_score = np.sort(similarity[userId_idx_mapping[userId]])[::-1]
        for movieId in test[test.userId == userId].movieId.values:
            
            score = (train_imputed.iloc[similarity_user_index[:n_recommendation]][movieId] * similarity_user_score[:n_recommendation]).values.sum() / similarity_user_score[:n_recommendation].sum()
            prediction.append([userId, movieId, score])
    
    # Create prediction DataFrame
    df_pred = pd.DataFrame(prediction, columns=['userId', 'movieId', 'prediction']).set_index(['userId', 'movieId'])
    df_pred = test.set_index(['userId', 'movieId']).join(df_pred)


    # Get labels and predictions
    y_true = df_pred['rating'].values
    y_pred = df_pred['prediction'].values

    # Compute RMSE
    rmse = np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
    print("consine_u2u_similarity's rmse is {}".format(rmse))
    return rmse

In [29]:
cosine_u2u_similarity_data3 = cosine_u2u_similarity(matrix_train_data3, test_data3, n_recommendation=100)

consine_u2u_similarity's rmse is 1.4789047365862287


### <a id=7.4>7.4. Matrix Factorization With Keras And Gradient Descent</a>
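Because the user-item-rating matrix is high-dimensional and sparse, userId and movieId can each be represented by an embedding, and a Dot operation on the two embeddings is then used to fit the entries of the user-item-rating matrix.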
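One way to write down what this model fits, as a sketch: $p_u$ and $q_i$ are symbols introduced here for the embedding vectors of user $u$ and movie $i$ (the notebook does not name them), and the objective assumes a plain mean-squared-error loss over the training ratings with no bias terms and no regularization.

$$\hat{r}_{ui} = p_u^T q_i, \qquad \min_{P, Q} \frac{1}{|R_{train}|}\sum_{r_{ui} \in R_{train}} (r_{ui} - p_u^T q_i)^2$$

Compared with the SVD model described in section 8.1, this formulation has no global mean $\mu$ and no biases $b_u$, $b_i$, so the embeddings alone must account for the overall rating level.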

In [30]:
def matrix_factorization_dot(train, test, embedding_size=50):
    userId_idx_mapping = {userId:idx for idx, userId in enumerate(train.userId.unique())}
    movieId_idx_mapping = {movieId:idx for idx, movieId in enumerate(train.movieId.unique())}
    # As with reset_index, map the ids to contiguous 0-based indices, which the Embedding layers require and which simplifies batching
    train_user_data = train.userId.map(userId_idx_mapping)
    train_movie_data = train.movieId.map(movieId_idx_mapping)
    
    test_user_data = test.userId.map(userId_idx_mapping)
    test_movie_data = test.movieId.map(movieId_idx_mapping)
    
    nb_users = len(userId_idx_mapping)
    nb_movies = len(movieId_idx_mapping)
    
    
    # Build the model
    # Define the inputs and their shapes
    userId_input = Input(shape=[1], name='user')
    movieId_input = Input(shape=[1], name='movie')
    # Create the embedding layers
    user_embedding = Embedding(
        output_dim=embedding_size,
        input_dim=nb_users,
        input_length=1,
        name='user_embedding'
    )(userId_input)
    
    movie_embedding = Embedding(
        output_dim=embedding_size,
        input_dim=nb_movies,
        input_length=1,
        name='movie_embedding'
    )(movieId_input)
    # Reshape the embedding layers
    user_vector = Reshape([embedding_size])(user_embedding)
    movie_vector = Reshape([embedding_size])(movie_embedding)

    # Compute dot-product of reshaped embedding layers as prediction
    y = Dot(1, normalize=False)([user_vector, movie_vector])

    # Setup model
    model = Model(inputs=[userId_input, movieId_input], outputs=y)
    model.compile(loss='mse', optimizer='adam')


    # Fit model
    model.fit([train_user_data, train_movie_data],
              train.rating,
              batch_size=256, 
              epochs=10,
              validation_split=0.4,
              shuffle=True)

    # Test model
    y_pred = model.predict([test_user_data, test_movie_data])
    y_true = test.rating.values

    #  Compute RMSE
    rmse = np.sqrt(mean_squared_error(y_pred=y_pred, y_true=y_true))
    print('\n\nTesting Result With Keras Matrix-Factorization: {:.4f} RMSE'.format(rmse))
    return rmse

In [31]:
matrix_factorization_dot_train3 = matrix_factorization_dot(train_data3, test_data3)

Train on 1850542 samples, validate on 1233696 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10

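### <a id=7.6>7.5. Deep Learning With Keras</a>
This model adds depth: a DNN is used to fit the values of the user-item-rating matrix. Only one hidden fully connected (Dense) layer is added here, and the concatenation of the user and movie embeddings is used as its input.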

In [32]:
def matrix_factorization_dnn(train, test, nb_user_embedding=20, nb_movie_embedding=40):
    userId_idx_mapping = {userId:idx for idx, userId in enumerate(train.userId.unique())}
    movieId_idx_mapping = {movieId:idx for idx, movieId in enumerate(train.movieId.unique())}
    
    # Create correctly mapped train- & testset
    train_user_data = train.userId.map(userId_idx_mapping)
    train_movie_data = train.movieId.map(movieId_idx_mapping)

    test_user_data = test.userId.map(userId_idx_mapping)
    test_movie_data = test.movieId.map(movieId_idx_mapping)
    
    nb_users = len(userId_idx_mapping)
    nb_movies = len(movieId_idx_mapping)
    ##### Create model
    # Set input layers
    userId_input = Input(shape=[1], name='user')
    movieId_input = Input(shape=[1], name='movie')

  
    
    # Create embedding layers for users and movies
    user_embedding = Embedding(output_dim=nb_user_embedding, 
                               input_dim=nb_users,
                               input_length=1, 
                               name='user_embedding')(userId_input)
    movie_embedding = Embedding(output_dim=nb_movie_embedding, 
                                input_dim=nb_movies,
                                input_length=1, 
                                name='item_embedding')(movieId_input)

    # Reshape the embedding layers
    user_vector = Reshape([nb_user_embedding])(user_embedding)
    movie_vector = Reshape([nb_movie_embedding])(movie_embedding)

    # Concatenate the reshaped embedding layers
    concat = Concatenate()([user_vector, movie_vector])

    # Combine with dense layers
    dense = Dense(256)(concat)
    y = Dense(1)(dense)

    # Setup model
    model = Model(inputs=[userId_input, movieId_input], outputs=y)
    model.compile(loss='mse', optimizer='adam')


    # Fit model
    model.fit([train_user_data, train_movie_data],
              train.rating,
              batch_size=256, 
              epochs=5,
              validation_split=0.5,
              shuffle=True)

    # Test model
    y_pred = model.predict([test_user_data, test_movie_data])
    y_true = test.rating.values

    #  Compute RMSE
    rmse = np.sqrt(mean_squared_error(y_pred=y_pred, y_true=y_true))
    print('\n\nTesting Result With Keras Deep Learning: {:.4f} RMSE'.format(rmse))
    return rmse

In [33]:
matrix_factorization_dnn_train3 = matrix_factorization_dnn(train_data3, test_data3)

Train on 1542119 samples, validate on 1542119 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Testing Result With Keras Deep Learning: 0.9134 RMSE


## <a id=8>8. Exploring Python Libraries</a>
### <a id=8.1>8.1. Surprise Library</a>
[Surprise](http://surpriselib.com/) is a library built for recommender systems, with many built-in algorithms.

[SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD): the predicted rating is $\bar{r}_{ui}=\mu+b_u+b_i+q_i^Tp_u$. If user $u$ is unknown, the bias $b_u$ and the factors $p_u$ are assumed to be zero; the same applies to $b_i$ and $q_i$ for an unknown item $i$. The objective to minimize is the regularized squared error:
$$\sum_{r_{ui}\in R_{train}}(r_{ui}-\bar{r}_{ui})^2 + \lambda(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2)$$
The parameters are learned with stochastic gradient descent:
$$b_u \gets b_u + \alpha(e_{ui} - \lambda b_u)$$
$$b_i \gets b_i + \alpha(e_{ui} - \lambda b_i)$$
$$p_u \gets p_u + \alpha(e_{ui} \cdot q_i - \lambda p_u)$$
$$q_i \gets q_i + \alpha (e_{ui} \cdot p_u - \lambda q_i)$$
where $e_{ui}=r_{ui}-\bar{r}_{ui}$, $\alpha$ is the learning rate, and $\lambda$ is the regularization strength.
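A single stochastic gradient step of the SVD model can be traced by hand. The sketch below applies the four update rules to one observed rating; the function name `sgd_step`, the values of $\alpha$ and $\lambda$, and the toy parameters are illustrative assumptions, not Surprise's internal implementation.

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, alpha=0.005, lam=0.02):
    """Apply one SGD update for a single rating and return the error and new parameters."""
    # Current prediction: mu + b_u + b_i + q_i . p_u
    prediction = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    e_ui = r_ui - prediction
    # All updates use the old parameter values on the right-hand side
    new_b_u = b_u + alpha * (e_ui - lam * b_u)
    new_b_i = b_i + alpha * (e_ui - lam * b_i)
    new_p_u = [p + alpha * (e_ui * q - lam * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + alpha * (e_ui * p - lam * q) for p, q in zip(p_u, q_i)]
    return e_ui, new_b_u, new_b_i, new_p_u, new_q_i


# Toy state (invented): global mean 3.5, zero biases, two latent factors
error_before, b_u, b_i, p_u, q_i = sgd_step(
    r_ui=5.0, mu=3.5, b_u=0.0, b_i=0.0, p_u=[0.1, 0.2], q_i=[0.3, -0.1])

# Running the step again on the same rating shows that the error has shrunk
error_after, *_ = sgd_step(r_ui=5.0, mu=3.5, b_u=b_u, b_i=b_i, p_u=p_u, q_i=q_i)
```

In a full training loop this step would be repeated over all training ratings for several epochs.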

[SVD++](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp): the predicted rating is $\bar{r}_{ui}=\mu + b_u + b_i + q_i^T(p_u + |I_u|^{-\frac{1}{2}}\sum_{j \in I_u}y_j)$, where the $y_j$ are a set of implicit factors that capture the fact that user $u$ rated item $j$, regardless of the rating value.

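[Slope One](https://surprise.readthedocs.io/en/stable/slope_one.html): the predicted rating is $\bar{r}_{ui}=\mu_u + \frac{1}{|R_i(u)|} \sum_{j \in R_i(u)} dev(i,j)$, where $R_i(u)$ is the set of relevant items, i.e. the items $j$ rated by user $u$ that share at least one common rater with item $i$, and $dev(i,j)$ is the average difference between the ratings of $i$ and $j$: $dev(i,j) = \frac{1}{|U_{ij}|} \sum_{u \in U_{ij}} (r_{ui}-r_{uj})$, with $U_{ij}$ the set of users who rated both items.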
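The Slope One prediction can be checked by hand on a tiny example. The sketch below follows the formula literally with nested dictionaries; the function name `slope_one_predict` and the three toy users are assumptions for illustration and say nothing about how Surprise implements it internally.

```python
def slope_one_predict(ratings, user, item):
    """ratings: dict mapping user -> {item: rating}. Returns mu_u + mean of dev(item, j)."""
    user_ratings = ratings[user]
    mu_u = sum(user_ratings.values()) / len(user_ratings)
    deviations = []
    for j in user_ratings:
        if j == item:
            continue
        # U_ij: users who rated both the target item and item j
        common = [u for u, r in ratings.items() if item in r and j in r]
        if not common:
            continue  # j is not a relevant item
        dev = sum(ratings[u][item] - ratings[u][j] for u in common) / len(common)
        deviations.append(dev)
    if not deviations:
        return mu_u  # fall back to the user's mean when no relevant item exists
    return mu_u + sum(deviations) / len(deviations)


# Toy data (invented): user 'c' has not rated item 'x'
toy_ratings = {
    'a': {'x': 5.0, 'y': 3.0},
    'b': {'x': 4.0, 'y': 2.0, 'z': 4.0},
    'c': {'y': 2.0, 'z': 5.0},
}
pred_c_x = slope_one_predict(toy_ratings, 'c', 'x')
```

Here $dev(x, y) = 2$ (from users a and b) and $dev(x, z) = 0$ (from user b), so the prediction is the mean of user c, 3.5, plus the average deviation of 1.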

[NMF](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF): the predicted rating is $\bar{r}_{ui}=q_i^Tp_u$, with the user and item factors kept non-negative. It is also optimized with stochastic gradient descent; at each step, factor $f$ of user $u$ and of item $i$ is updated as follows:
$$p_{uf} \gets p_{uf} \cdot \frac{\sum_{i \in I_u}q_{if} \cdot r_{ui}}{\sum_{i \in I_u} q_{if} \cdot \bar{r}_{ui} + \lambda_u |I_u| p_{uf}}$$
$$q_{if} \gets q_{if} \cdot \frac{\sum_{u \in U_i}p_{uf} \cdot r_{ui}}{\sum_{u \in U_i}p_{uf} \cdot \bar{r}_{ui} + \lambda_i |U_i| q_{if}}$$
where $\lambda_u$ and $\lambda_i$ are regularization hyperparameters. This algorithm depends heavily on the initial values.

[NormalPredictor](https://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.random_pred.NormalPredictor): the predicted rating is a random value drawn from a normal distribution $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$ whose parameters are estimated from the training ratings:
$$\begin{split}\hat{\mu} &= \frac{1}{|R_{train}|} \sum_{r_{ui} \in R_{train}} r_{ui}\\ \hat{\sigma} &= \sqrt{\sum_{r_{ui} \in R_{train}} \frac{(r_{ui} - \hat{\mu})^2}{|R_{train}|}}\end{split}$$
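The two estimates are just the mean and the (population) standard deviation of the training ratings. The sketch below computes them and draws a few random predictions; the helper names and the clipping to a 1 to 5 scale are assumptions made for this toy example.

```python
import math
import random


def fit_normal(train_ratings):
    """Estimate mu and sigma of the training ratings (dividing by N, as in the formula)."""
    n = len(train_ratings)
    mu = sum(train_ratings) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in train_ratings) / n)
    return mu, sigma


def predict_normal(mu, sigma, rng, low=1.0, high=5.0):
    """Draw one rating from N(mu, sigma^2); clipping to [low, high] is an assumption."""
    return min(high, max(low, rng.gauss(mu, sigma)))


# Toy training ratings (invented): mean 3, variance 10 / 5 = 2
mu_hat, sigma_hat = fit_normal([1.0, 2.0, 3.0, 4.0, 5.0])
rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [predict_normal(mu_hat, sigma_hat, rng) for _ in range(5)]
```

Because the prediction ignores both the user and the item, this algorithm mainly serves as a random baseline for the others.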
 
[KNNBasic](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBasic): the rating is predicted from the ratings of the $k$ nearest neighbours, weighted by their similarity.
$$\hat{r}_{ui} = \frac{
\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot r_{vi}}
{\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}$$

[KNNWithMeans](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) extends KNNBasic by taking each user's mean rating into account.
$$\hat{r}_{ui} = \mu_u + \frac{ \sum\limits_{v \in N^k_i(u)}
\text{sim}(u, v) \cdot (r_{vi} - \mu_v)} {\sum\limits_{v \in
N^k_i(u)} \text{sim}(u, v)}$$

[KNNWithZScore](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithZScore): compared with KNNWithMeans, it additionally applies $z$-score normalization to each user's ratings.
$$
\hat{r}_{ui} = \mu_u + \sigma_u \frac{ \sum\limits_{v \in N^k_i(u)}
\text{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v} {\sum\limits_{v
\in N^k_i(u)} \text{sim}(u, v)}
$$

[KNNBaseline](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline) takes a baseline rating $b_{ui}$ into account instead of the user mean.
$$
\hat{r}_{ui} = b_{ui} + \frac{ \sum\limits_{v \in N^k_i(u)}
\text{sim}(u, v) \cdot (r_{vi} - b_{vi})} {\sum\limits_{v \in
N^k_i(u)} \text{sim}(u, v)}
$$

[BaselineOnly](https://surprise.readthedocs.io/en/stable/basic_algorithms.html) predicts the baseline estimate $\bar{r}_{ui}=b_{ui} =\mu + b_u + b_i$; when user $u$ is unknown, $b_u$ is assumed to be zero.

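[CoClustering](https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering): users and items are assigned to clusters and co-clusters, and the prediction is
$$
\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i})
$$
where $\overline{C_{ui}}$ is the average rating of co-cluster $C_{ui}$, $\overline{C_u}$ is the average rating of $u$'s cluster, and $\overline{C_i}$ is the average rating of $i$'s cluster.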

In [34]:
def surprise_library(data):
    # Load dataset into surprise specific data-structure
    sampled_data = sp.Dataset.load_from_df(data[['userId', 'movieId', 'rating']].sample(20000), sp.Reader())

    benchmark = []
    # Iterate over all algorithms
    for algorithm in [sp.SVD(), sp.SVDpp(), sp.SlopeOne(), sp.NMF(), sp.NormalPredictor(), sp.KNNBaseline(), sp.KNNBasic(), sp.KNNWithMeans(), sp.KNNWithZScore(), sp.BaselineOnly(), sp.CoClustering()]:
        # Perform cross validation
        results = cross_validate(algorithm, sampled_data, measures=['RMSE'], cv=3, verbose=False)

        # Get results & append algorithm name
        tmp = pd.DataFrame.from_dict(results).mean(axis=0)
        tmp = tmp.append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))

        # Store data
        benchmark.append(tmp)
    return benchmark

In [35]:
surprise_train3 = surprise_library(filtered_netflix_prize_User)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


In [36]:
surprise_train3

[test_rmse      1.03665
 fit_time       1.64775
 test_time    0.0843676
 Algorithm          SVD
 dtype: object, test_rmse    1.03153
 fit_time      3.0279
 test_time    0.13135
 Algorithm      SVDpp
 dtype: object, test_rmse       1.1954
 fit_time      0.294815
 test_time    0.0919317
 Algorithm     SlopeOne
 dtype: object, test_rmse      1.20121
 fit_time       3.48961
 test_time    0.0812257
 Algorithm          NMF
 dtype: object, test_rmse            1.47137
 fit_time           0.0353295
 test_time          0.0874324
 Algorithm    NormalPredictor
 dtype: object, test_rmse        1.04083
 fit_time         4.82785
 test_time       0.197138
 Algorithm    KNNBaseline
 dtype: object, test_rmse     1.08784
 fit_time      4.69928
 test_time     0.19174
 Algorithm    KNNBasic
 dtype: object, test_rmse         1.19282
 fit_time          4.77909
 test_time         0.20083
 Algorithm    KNNWithMeans
 dtype: object, test_rmse           1.1967
 fit_time           5.20028
 test_time         0.194

### <a id=8.2>8.2. Lightfm Library</a>
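[LightFM](https://github.com/lyst/lightfm) focuses on matrix factorization for both explicit and implicit feedback. It can also use metadata such as item features to build hybrid models that combine content-based and collaborative recommendation, which reduces the cold-start problem to some extent.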

In [37]:
def lightfm_library(train, test):
    # Create user- & movie-id mapping
    user_id_mapping = {id:i for i, id in enumerate(train['userId'].unique())}
    movie_id_mapping = {id:i for i, id in enumerate(train['movieId'].unique())}
    
    # Create correctly mapped train- & testset
    train_user_data = train['userId'].map(user_id_mapping)
    train_movie_data = train['movieId'].map(movie_id_mapping)

    test_user_data = test['userId'].map(user_id_mapping)
    test_movie_data = test['movieId'].map(movie_id_mapping)


    # Create sparse matrix from ratings
    shape = (len(user_id_mapping), len(movie_id_mapping))
    train_matrix = coo_matrix((train['rating'].values, (train_user_data.astype(int), train_movie_data.astype(int))), shape=shape)
    test_matrix = coo_matrix((test['rating'].values, (test_user_data.astype(int), test_movie_data.astype(int))), shape=shape)


    # Instantiate and train the model
    model = LightFM(loss='warp', no_components=20)
    model.fit(train_matrix, epochs=10, num_threads=2)


    # Evaluate the trained model
    k = 20
    precision_score = precision_at_k(model, test_matrix, k=k).mean()
#     print('Train precision at k={}:\t{:.4f}'.format(k, precision_at_k(model, train_matrix, k=k).mean()))
    print('Test precision at k={}:\t\t{:.4f}'.format(k, precision_score))
    return precision_score

In [38]:
lightfm_train3 = lightfm_library(train_data3, test_data3)

Test precision at k=20:		0.4259
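The LightFM score above is a precision at k, not an RMSE, so it is on a different scale from the other results in this notebook. As a rough sketch of what the metric measures for a single user, the toy function below counts how many of the k highest-scored items appear among that user's test interactions. The name `toy_precision_at_k` and the data are made up; LightFM's own `precision_at_k` works on sparse matrices and returns one value per user, which the notebook then averages.

```python
def toy_precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scored items that are in the user's relevant set."""
    # Rank items by predicted score, best first
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Invented scores for one user and the items that user interacted with in the test set
toy_scores = {'a': 0.9, 'b': 0.8, 'c': 0.3, 'd': 0.1}
toy_relevant = {'a', 'c', 'd'}
p_at_2 = toy_precision_at_k(toy_scores, toy_relevant, k=2)  # top-2 is a, b: one hit
p_at_4 = toy_precision_at_k(toy_scores, toy_relevant, k=4)  # all four items: three hits
```

Higher is better for precision at k, whereas lower is better for RMSE, so the two kinds of scores should not be compared directly.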


### <a id=8.3>8.3. DeepCTR</a>
[DeepCTR](https://github.com/shenweichen/DeepCTR) is a deep-learning-based library for CTR (click-through rate) prediction.

In [39]:
## DeepFM
def deepfm_algo(data):

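    sparse_features = ["movieId", "userId"]
    target = ['rating']
    data = data.copy()  # work on a copy so the caller's DataFrame is not modified
    for feat in sparse_features:
        # Encode the raw ids as contiguous integers for the embedding layers
        lbe = LabelEncoder()
        data[feat] = lbe.fit_transform(data[feat])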
    
    fixlen_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=4)
                              for feat in sparse_features]
    
    linear_feature_columns = fixlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
    
    train, test = train_test_split(data, test_size=0.5)
    train_model_input = {name:train[name].values for name in feature_names}
    test_model_input = {name:test[name].values for name in feature_names}

    # 4.Define Model,train,predict and evaluate
    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
    model.compile("adam", "mse", metrics=['mse'], )
    
    history = model.fit(train_model_input, train[target].values,
                        batch_size=256, epochs=5, verbose=2, validation_split=0.5, )
    pred_ans = model.predict(test_model_input, batch_size=256)
    rmse = np.sqrt(mean_squared_error(test[target].values, pred_ans))
    print("test RMSE", rmse)
    return rmse

In [40]:
deepfm_algor_train3 = deepfm_algo(filtered_netflix_prize_User)

Train on 1542119 samples, validate on 1542119 samples
Epoch 1/5
 - 39s - loss: 0.9381 - mean_squared_error: 0.9357 - val_loss: 0.8314 - val_mean_squared_error: 0.8267
Epoch 2/5
 - 37s - loss: 0.8004 - mean_squared_error: 0.7932 - val_loss: 0.7888 - val_mean_squared_error: 0.7794
Epoch 3/5
 - 37s - loss: 0.7684 - mean_squared_error: 0.7570 - val_loss: 0.7824 - val_mean_squared_error: 0.7692
Epoch 4/5
 - 40s - loss: 0.7572 - mean_squared_error: 0.7423 - val_loss: 0.7819 - val_mean_squared_error: 0.7656
Epoch 5/5
 - 42s - loss: 0.7503 - mean_squared_error: 0.7326 - val_loss: 0.7826 - val_mean_squared_error: 0.7638
test RMSE 0.8748718957924188


In [41]:
# surprise_train3 is a list of pandas Series, so collect it into a DataFrame before selecting columns
surprise_df = pd.DataFrame(surprise_train3)
# Note: lightfm_train3 is a precision@k score, not an RMSE, so its bar is not directly comparable with the others
ret_rmse = [mean_rating_data3, weighted_mean_rating_data3, cosine_u2u_similarity_data3, matrix_factorization_dot_train3, matrix_factorization_dnn_train3, lightfm_train3, deepfm_algor_train3] + surprise_df['test_rmse'].tolist()
ret_rmse_name = ['mean_rating', 'weighted', 'cosine_u2u_similarity', 'mf_dot', 'mf_dnn', 'lightfm', 'deepfm'] + surprise_df['Algorithm'].tolist()
figure, ax = plt.subplots(figsize=(16,4))
print(ret_rmse)
plt.bar(range(len(ret_rmse)), ret_rmse, tick_label=ret_rmse_name)
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
plt.title('RMSE of each recommender algorithm on the dataset')
plt.show()

## <a id=9>9. Conclusion</a>
Other **python recommender libraries** are:
+ [implicit](https://github.com/benfred/implicit)
+ [spotlight](https://github.com/maciejkula/spotlight)
+ [turicreate](https://github.com/apple/turicreate/blob/master/README.md)
+ [mrec](https://github.com/Mendeley/mrec)
+ [recsys](https://github.com/ocelma/python-recsys)
+ [crab](http://muricoca.github.io/crab/)