## Introduction

[Surprise (Simple Python RecommendatIon System Engine)](http://surpriselib.com/) is a member of the scikit family. It is simple to use and supports several families of recommendation algorithms:
* [Baseline algorithms](http://surprise.readthedocs.io/en/stable/basic_algorithms.html)
* [Neighborhood methods (collaborative filtering)](http://surprise.readthedocs.io/en/stable/knn_inspired.html)
* [Matrix factorization-based methods (SVD, PMF, SVD++, NMF)](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)

| Algorithm class | Description |
| ------------- |:-----|
|[random_pred.NormalPredictor](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.random_pred.NormalPredictor)|Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.|
|[baseline_only.BaselineOnly](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.baseline_only.BaselineOnly)|Algorithm predicting the baseline estimate for given user and item.|
|[knns.KNNBasic](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBasic)|A basic collaborative filtering algorithm.|
|[knns.KNNWithMeans](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans)|A basic collaborative filtering algorithm, taking into account the mean ratings of each user.|
|[knns.KNNBaseline](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline)|A basic collaborative filtering algorithm taking into account a baseline rating.|
|[matrix_factorization.SVD](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)|The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.|
|[matrix_factorization.SVDpp](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp)|The SVD++ algorithm, an extension of SVD taking into account implicit ratings.|
|[matrix_factorization.NMF](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF)|A collaborative filtering algorithm based on Non-negative Matrix Factorization.|
|[slope_one.SlopeOne](http://surprise.readthedocs.io/en/stable/slope_one.html#surprise.prediction_algorithms.slope_one.SlopeOne)|A simple yet accurate collaborative filtering algorithm.|
|[co_clustering.CoClustering](http://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering)|A collaborative filtering algorithm based on co-clustering.|


The neighborhood-based (collaborative filtering) methods can be configured with different similarity measures:

| Similarity measure | Description |
| ------------- |:-----|
|[cosine](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.cosine)|Compute the cosine similarity between all pairs of users (or items).|
|[msd](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.msd)|Compute the Mean Squared Difference similarity between all pairs of users (or items).|
|[pearson](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson)|Compute the Pearson correlation coefficient between all pairs of users (or items).|
|[pearson_baseline](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson_baseline)|Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.|
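
To make two of these measures concrete, here is a minimal pure-Python sketch of the cosine and MSD similarities between two users, computed only over the items both have rated. The function names and the toy ratings are made up for illustration. The MSD similarity is taken as 1 / (MSD + 1), following the Surprise documentation; this is a sketch and not the library's own implementation.

```python
import math


def common_ratings(ratings_u, ratings_v):
    """Return paired ratings for the items rated by both users."""
    shared = sorted(set(ratings_u) & set(ratings_v))
    return [(ratings_u[i], ratings_v[i]) for i in shared]


def cosine_sim(ratings_u, ratings_v):
    """Cosine of the angle between the two users' common rating vectors."""
    pairs = common_ratings(ratings_u, ratings_v)
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def msd_sim(ratings_u, ratings_v):
    """Similarity derived from the Mean Squared Difference: 1 / (msd + 1)."""
    pairs = common_ratings(ratings_u, ratings_v)
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)


# Hypothetical users, each mapping item id -> rating.
alice = {'item1': 4.0, 'item2': 3.0, 'item3': 5.0}
bob = {'item1': 4.0, 'item2': 3.0, 'item4': 1.0}
carol = {'item1': 2.0, 'item2': 5.0}

print(cosine_sim(alice, bob))
print(msd_sim(alice, bob))
print(msd_sim(alice, carol))
```

Alice and Bob agree on every item they share, so both similarities reach their maximum for that pair, while the disagreement between Alice and Carol lowers the MSD similarity.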


In statistics, the Pearson correlation coefficient, usually denoted r or ρ, measures the linear correlation between two variables X and Y. Its value lies in the range [-1, +1], and it is defined as follows:

![](http://static.zybuluo.com/zhuanxu/cvpzb7hwie2enybr9c4vs6c1/image_1c1cj7mreidu9lq151lver7n09.png)
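
As a small illustration, the coefficient can be computed directly from its usual textbook definition (covariance divided by the product of the standard deviations). The sketch below assumes that standard definition; the function name `pearson` and the sample data are ours, not part of Surprise.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length >= 2')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))  # perfectly linear, decreasing
```

Returning 0 for a constant sequence is a convention chosen here for simplicity; strictly speaking the coefficient is undefined in that case.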


Several accuracy measures are supported:

| Accuracy measure | Description |
| ------------- |:-----|
|[rmse](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.rmse)|Compute RMSE (Root Mean Squared Error).|
|[mae](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.mae)|Compute MAE (Mean Absolute Error).|
|[fcp](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.fcp)|Compute FCP (Fraction of Concordant Pairs).|
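
For reference, RMSE and MAE reduce to a few lines of arithmetic. The sketch below works on a made-up list of (true rating, estimated rating) pairs; the helper names are illustrative and this is not Surprise's `accuracy` module.

```python
import math


def rmse(pairs):
    """Root Mean Squared Error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean Absolute Error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical (true, estimated) ratings.
predictions = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]

print(rmse(predictions))
print(mae(predictions))
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.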

Some usage examples follow.

## Usage examples

1. Dataset description

We use the MovieLens dataset here. Each line has the format `user item rating timestamp`. We first load the dataset:

In [2]:
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [17]:
data.raw_ratings[:5]

[('742', '294', 3.0, '881005590'),
 ('36', '883', 5.0, '882157581'),
 ('264', '203', 2.0, '886122508'),
 ('862', '568', 3.0, '879304799'),
 ('458', '969', 4.0, '886395899')]

2. Matrix factorization

In [6]:
from surprise import SVD
from surprise import evaluate, print_perf

In [7]:
# k-fold cross-validation (k=3)
data.split(n_folds=3)
# Try SVD matrix factorization
algo = SVD()
# Evaluate it on the dataset
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
# Print the results
print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9466
MAE:  0.7475
------------
Fold 2
RMSE: 0.9433
MAE:  0.7450
------------
Fold 3
RMSE: 0.9470
MAE:  0.7472
------------
------------
Mean RMSE: 0.9456
Mean MAE : 0.7466
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
RMSE    0.9466  0.9433  0.9470  0.9456  
MAE     0.7475  0.7450  0.7472  0.7466  


Next, we show how to load your own dataset:

In [12]:
import os
from surprise import Reader

# Path to the data file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# Tell the reader how each line of the file is formatted
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data
data = Dataset.load_from_file(file_path, reader=reader)
# Manually split into 5 folds (for cross-validation)
data.split(n_folds=5)
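
The idea behind splitting into 5 folds is that every rating is used for testing exactly once and for training in the other four rounds. The following is a minimal pure-Python sketch of that idea; `k_fold_indices` is a hypothetical helper and does not reproduce how Surprise assigns ratings to folds.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx); each sample is in exactly one test fold."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError('n_folds must be between 2 and n_samples')
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        # Every n_folds-th shuffled index, starting at position k.
        test = indices[k::n_folds]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    print(len(train), len(test))
```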

## Hyperparameter tuning

We first use the traditional grid search approach:

In [14]:
from surprise import GridSearch

# Define the grid of parameters to search over
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
# Use grid search with cross-validation
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
# Find the best parameters on the dataset
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
grid_search.evaluate(data)

[{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
------------
Parameters combination 1 of 8
params:  {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}
------------
Mean RMSE: 0.9969
Mean FCP : 0.6849
------------
------------
Parameters combination 2 of 8
params:  {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}
------------
Mean RMSE: 1.0035
Mean FCP : 0.6859
------------
------------
Parameters combination 3 of 8
params:  {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}
------------
Mean RMSE: 0.9740
Mean FCP : 0.6941
------------
------------
Parameters combination 4 of 8
params:  {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}
-----
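
The 8 parameter combinations listed in the output above are simply the Cartesian product of the value lists in `param_grid`. A short sketch with `itertools.product` shows how such a grid expands; `expand_grid` is an illustrative helper, not a Surprise function.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}


def expand_grid(grid):
    """Return one dict per combination of the values in the grid."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = expand_grid(param_grid)
print(len(combos))  # 2 * 2 * 2 = 8
print(combos[0])
```

Each combination is then evaluated with cross-validation, which is why the cost of grid search grows multiplicatively with every added parameter.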

In [15]:
# Print the results of the tuning
# Best RMSE score
print(grid_search.best_score['RMSE'])

# Parameters that gave the best RMSE score
print(grid_search.best_params['RMSE'])

# Best FCP score
print(grid_search.best_score['FCP'])

# Parameters that gave the best FCP score
print(grid_search.best_params['FCP'])

0.963907801579
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
0.699019179878
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}


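We can also use the hyperparameter tuning library [hyperopt](http://hyperopt.github.io/hyperopt/); its advantage is that it uses the TPE (Tree-structured Parzen Estimator) method.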

## Evaluating different recommendation algorithms

In [18]:
import os
from surprise import Reader

# Path to the data file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# Tell the reader how each line of the file is formatted
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data
data = Dataset.load_from_file(file_path, reader=reader)

# Manually split into 5 folds (for cross-validation)
data.split(n_folds=5)

In [20]:
### Using NormalPredictor
from surprise import NormalPredictor, evaluate
algo = NormalPredictor()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm NormalPredictor.

------------
Fold 1
RMSE: 1.5148
MAE:  1.2168
------------
Fold 2
RMSE: 1.5167
MAE:  1.2192
------------
Fold 3
RMSE: 1.5193
MAE:  1.2173
------------
Fold 4
RMSE: 1.5233
MAE:  1.2210
------------
Fold 5
RMSE: 1.5132
MAE:  1.2120
------------
------------
Mean RMSE: 1.5174
Mean MAE : 1.2173
------------
------------
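
NormalPredictor scores poorly because it ignores the user and the item entirely: it only draws a random rating from a normal distribution fitted to the training ratings. A minimal sketch of that idea follows. The helper names are ours, the five training ratings are the ones shown in the `raw_ratings` sample earlier, and clipping to a 1-5 scale is an assumption made for this illustration.

```python
import math
import random


def fit_normal(ratings):
    """Estimate the mean and standard deviation of the training ratings."""
    n = len(ratings)
    mu = sum(ratings) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ratings) / n)
    return mu, sigma


def predict_random(mu, sigma, rng, low=1.0, high=5.0):
    """Draw a rating from N(mu, sigma^2), clipped to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))


train_ratings = [3.0, 5.0, 2.0, 3.0, 4.0]
mu, sigma = fit_normal(train_ratings)

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = [predict_random(mu, sigma, rng) for _ in range(5)]
print(mu, sigma)
print(samples)
```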


In [23]:
### Using BaselineOnly
from surprise import BaselineOnly, evaluate
algo = BaselineOnly()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'], verbose=1)

Evaluating RMSE, MAE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.9371
MAE:  0.7410
------------
Fold 2
Estimating biases using als...
RMSE: 0.9488
MAE:  0.7500
------------
Fold 3
Estimating biases using als...
RMSE: 0.9392
MAE:  0.7468
------------
Fold 4
Estimating biases using als...
RMSE: 0.9487
MAE:  0.7546
------------
Fold 5
Estimating biases using als...
RMSE: 0.9461
MAE:  0.7491
------------
------------
Mean RMSE: 0.9440
Mean MAE : 0.7483
------------
------------
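
The "Estimating biases using als" lines in the output refer to fitting a user bias and an item bias around the global mean, so that the baseline estimate is the global mean plus the user bias plus the item bias. Below is a small pure-Python sketch of an alternating least squares style update for these biases. The regularization values and the epoch count are what we believe to be Surprise's defaults (`reg_u=15`, `reg_i=10`, `n_epochs=10`), but treat them, the function names and the toy ratings as assumptions.

```python
from collections import defaultdict


def fit_baselines(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Fit the global mean and user/item biases from (user, item, rating)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append((item, r))
        by_item[item].append((user, r))

    b_u = {user: 0.0 for user in by_user}
    b_i = {item: 0.0 for item in by_item}
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed, then the reverse.
        for item, rated in by_item.items():
            dev = sum(r - mu - b_u[user] for user, r in rated)
            b_i[item] = dev / (reg_i + len(rated))
        for user, rated in by_user.items():
            dev = sum(r - mu - b_i[item] for item, r in rated)
            b_u[user] = dev / (reg_u + len(rated))
    return mu, b_u, b_i


def predict_baseline(mu, b_u, b_i, user, item):
    """Baseline estimate; unknown users or items contribute a zero bias."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


toy = [('u1', 'i1', 5.0), ('u1', 'i2', 5.0),
       ('u2', 'i1', 1.0), ('u2', 'i2', 1.0)]
mu, b_u, b_i = fit_baselines(toy)
print(predict_baseline(mu, b_u, b_i, 'u1', 'i1'))
print(predict_baseline(mu, b_u, b_i, 'new_user', 'new_item'))
```

The regularization terms shrink the biases toward zero, which matters most for users and items with few ratings.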


In [None]:
# ur[uid].append((iid, r)) # ratings given by each user, as (item, rating) pairs
# ir[iid].append((uid, r)) # ratings received by each item, as (user, rating) pairs

# n_users = len(ur)  # number of users
# n_items = len(ir)  # number of items
# n_ratings = len(raw_trainset)
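
The commented lines above describe how a training set indexes ratings by user (`ur`) and by item (`ir`). A small sketch of building these two structures from the raw ratings shown earlier might look as follows. It keeps the raw string ids for simplicity, whereas Surprise maps them to internal integer ids, so this is an illustration only.

```python
from collections import defaultdict

# The first five raw ratings printed earlier: (user, item, rating, timestamp).
raw_ratings = [
    ('742', '294', 3.0, '881005590'),
    ('36', '883', 5.0, '882157581'),
    ('264', '203', 2.0, '886122508'),
    ('862', '568', 3.0, '879304799'),
    ('458', '969', 4.0, '886395899'),
]

ur = defaultdict(list)  # user id -> list of (item id, rating)
ir = defaultdict(list)  # item id -> list of (user id, rating)
for uid, iid, r, _timestamp in raw_ratings:
    ur[uid].append((iid, r))
    ir[iid].append((uid, r))

n_users = len(ur)
n_items = len(ir)
n_ratings = len(raw_ratings)
print(n_users, n_items, n_ratings)
```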

In [31]:
### Using basic collaborative filtering
from surprise import KNNBasic, evaluate
algo = KNNBasic()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9734
MAE:  0.7667
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9849
MAE:  0.7742
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9701
MAE:  0.7698
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9871
MAE:  0.7822
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9776
MAE:  0.7709
------------
------------
Mean RMSE: 0.9786
Mean MAE : 0.7728
------------
------------
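
Conceptually, a basic user-based KNN prediction is a similarity-weighted average of the ratings that the most similar users gave to the target item. The sketch below illustrates that formula only. `knn_predict` and its inputs are hypothetical, and `k=40` is what we understand to be the library default.

```python
import heapq


def knn_predict(neighbors, k=40):
    """Weighted average over the k most similar neighbors.

    neighbors: list of (similarity, rating) pairs, one per user who
    rated the target item.
    """
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    # Only neighbors with a positive similarity contribute.
    positive = [(sim, r) for sim, r in top_k if sim > 0]
    total_sim = sum(sim for sim, _ in positive)
    if total_sim == 0:
        return None  # no usable neighbor, prediction is impossible
    return sum(sim * r for sim, r in positive) / total_sim


neighbors = [(0.9, 5.0), (0.1, 1.0), (0.5, 3.0)]
print(knn_predict(neighbors, k=2))  # uses the 0.9 and 0.5 neighbors
```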


In [32]:
### Using collaborative filtering with means
from surprise import KNNWithMeans, evaluate
algo = KNNWithMeans()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm KNNWithMeans.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9441
MAE:  0.7425
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9565
MAE:  0.7522
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9447
MAE:  0.7468
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9565
MAE:  0.7551
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9501
MAE:  0.7468
------------
------------
Mean RMSE: 0.9504
Mean MAE : 0.7487
------------
------------
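
The improvement over KNNBasic comes from correcting for each user's rating habits: neighbors contribute their deviation from their own mean rating, and the result is added to the target user's mean. A minimal sketch of that mean-centered prediction, with made-up inputs and a hypothetical function name:

```python
def knn_with_means_predict(user_mean, neighbors, k=40):
    """Mean-centered weighted average over the k most similar neighbors.

    neighbors: list of (similarity, rating, neighbor_mean) triples,
    one per user who rated the target item.
    """
    top_k = sorted(neighbors, key=lambda t: t[0], reverse=True)[:k]
    positive = [(s, r, m) for s, r, m in top_k if s > 0]
    total_sim = sum(s for s, _, _ in positive)
    if total_sim == 0:
        return user_mean  # fall back to the user's own mean
    offset = sum(s * (r - m) for s, r, m in positive) / total_sim
    return user_mean + offset


# A neighbor who rated the item one point above their own average.
print(knn_with_means_predict(3.0, [(0.5, 5.0, 4.0)]))
```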


In [34]:
### Using baseline collaborative filtering
from surprise import KNNBaseline, evaluate
algo = KNNBaseline()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm KNNBaseline.

------------
Fold 1
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9243
MAE:  0.7266
------------
Fold 2
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9364
MAE:  0.7345
------------
Fold 3
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9240
MAE:  0.7305
------------
Fold 4
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9345
MAE:  0.7383
------------
Fold 5
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9309
MAE:  0.7319
------------
------------
Mean RMSE: 0.9300
Mean MAE : 0.7324
------------
------------


In [35]:
### Using SVD
from surprise import SVD, evaluate
algo = SVD()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9303
MAE:  0.7311
------------
Fold 2
RMSE: 0.9404
MAE:  0.7392
------------
Fold 3
RMSE: 0.9337
MAE:  0.7367
------------
Fold 4
RMSE: 0.9410
MAE:  0.7446
------------
Fold 5
RMSE: 0.9367
MAE:  0.7378
------------
------------
Mean RMSE: 0.9364
Mean MAE : 0.7379
------------
------------
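
To connect the `n_epochs`, `lr_all` and `reg_all` parameters tuned earlier with what SVD does, here is a sketch of a single stochastic gradient descent step for one rating, assuming the usual biased matrix factorization model in which the prediction is the global mean plus the user bias, the item bias and the dot product of the user and item factors. The function names and starting values are illustrative; the default learning rate (0.005) and regularization (0.02) are what we believe the library uses.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Global mean + user bias + item bias + dot product of the factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One regularized SGD update for a single observed rating r."""
    err = r - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


mu, r = 3.5, 5.0
b_u, b_i, p_u, q_i = 0.0, 0.0, [0.1, 0.1], [0.1, 0.1]

before = abs(r - predict(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = sgd_step(r, mu, b_u, b_i, p_u, q_i)
after = abs(r - predict(mu, b_u, b_i, p_u, q_i))
print(before, after)
```

One epoch repeats this step over every training rating, so `n_epochs` controls how many passes are made, `lr_all` the step size, and `reg_all` how strongly the parameters are pulled toward zero.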


In [36]:
### Using SVD++
from surprise import SVDpp, evaluate
algo = SVDpp()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm SVDpp.

------------
Fold 1
RMSE: 0.9173
MAE:  0.7182
------------
Fold 2
RMSE: 0.9274
MAE:  0.7249
------------
Fold 3
RMSE: 0.9139
MAE:  0.7194
------------
Fold 4
RMSE: 0.9244
MAE:  0.7277
------------
Fold 5
RMSE: 0.9235
MAE:  0.7216
------------
------------
Mean RMSE: 0.9213
Mean MAE : 0.7224
------------
------------


In [37]:
### Using NMF
from surprise import NMF
algo = NMF()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)

Evaluating RMSE, MAE of algorithm NMF.

------------
Fold 1
RMSE: 0.9604
MAE:  0.7523
------------
Fold 2
RMSE: 0.9718
MAE:  0.7607
------------
Fold 3
RMSE: 0.9591
MAE:  0.7555
------------
Fold 4
RMSE: 0.9708
MAE:  0.7669
------------
Fold 5
RMSE: 0.9640
MAE:  0.7588
------------
------------
Mean RMSE: 0.9652
Mean MAE : 0.7589
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9604  0.9718  0.9591  0.9708  0.9640  0.9652  
MAE     0.7523  0.7607  0.7555  0.7669  0.7588  0.7589  


In [None]:
### Using SlopeOne
from surprise import SlopeOne
algo = SlopeOne()
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)
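
No output is shown for this cell, so as a rough guide to what Slope One computes, here is a pure-Python sketch: the prediction is the user's mean rating plus the average deviation between the target item and each item the user has rated, where each deviation is averaged over the users who rated both items. The function name and the toy ratings are made up, and unlike the library this sketch does not clip the result to the rating scale.

```python
def slope_one_predict(ratings, user, item):
    """Slope One estimate for (user, item).

    ratings: dict mapping user id -> dict of item id -> rating.
    """
    user_ratings = ratings[user]
    user_mean = sum(user_ratings.values()) / len(user_ratings)

    deviations = []
    for other_item in user_ratings:
        if other_item == item:
            continue
        # Rating differences from users who rated both items.
        diffs = [r[item] - r[other_item]
                 for r in ratings.values()
                 if item in r and other_item in r]
        if diffs:
            deviations.append(sum(diffs) / len(diffs))

    if not deviations:
        return user_mean  # nothing to compare against
    return user_mean + sum(deviations) / len(deviations)


toy = {
    'a': {'x': 5.0, 'y': 3.0},
    'b': {'x': 4.0, 'y': 2.0},
    'c': {'y': 2.0},
}
# Users a and b both rate x two points above y, and c gave y a 2.
print(slope_one_predict(toy, 'c', 'x'))
```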