## Recommender systems: collaborative filtering (a statistics-based approach)

### This section uses the scikit-surprise library

- Install it with `pip install scikit-surprise` (a C/C++ build environment is required)

In [1]:
from surprise import KNNBasic, SVD
# KNNBasic: the most basic collaborative filtering algorithm (user-based or item-based).
# SVD: an algorithm based on matrix factorization.
from surprise import Dataset  # built-in practice datasets; here, basic movie rating data (URLs below)
from surprise import evaluate, print_perf
# Note: evaluate(), print_perf() and data.split() belong to the older Surprise API;
# newer releases provide surprise.model_selection.cross_validate instead.
# http://surprise.readthedocs.io/en/stable/index.html
# http://files.grouplens.org/datasets/movielens/ml-100k-README.txt

# Load the movielens-100k dataset (download it if needed),
# and split it into 3 folds for cross-validation.
data = Dataset.load_builtin('ml-100k')  # load the built-in dataset
data.split(n_folds=3)  # number of cross-validation folds

# We'll use the famous KNNBasic algorithm.
algo = KNNBasic()

# Evaluate performances of our algorithm on the dataset.
# Unlike the fit()-style workflow used earlier, evaluate() takes three arguments:
# the algorithm, the data, and the accuracy measures to compute.
# Here the measures are RMSE (root mean squared error) and MAE (mean absolute error).
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9876
MAE:  0.7807
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9871
MAE:  0.7796
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9902
MAE:  0.7818
------------
------------
Mean RMSE: 0.9883
Mean MAE : 0.7807
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
MAE     0.7807  0.7796  0.7818  0.7807  
RMSE    0.9876  0.9871  0.9902  0.9883  
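
For reference, the two measures in the table above can be computed by hand. The following is a minimal sketch in plain Python; the helper names `rmse` and `mae` and the toy ratings are illustrative assumptions, not part of Surprise or of MovieLens.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))


def mae(true_ratings, predicted_ratings):
    """Mean absolute error between two equal-length rating lists."""
    absolute = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute) / len(absolute)


# Made-up ratings, for illustration only (not taken from MovieLens).
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, predicted_ratings))  # 0.75
print(mae(true_ratings, predicted_ratings))   # 0.625
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does, which is why RMSE is the larger of the two numbers in every fold.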


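## Recommender systems: solving by matrix factorization (latent factor model), a model-based approach

- The model is fitted iteratively, so a few hyperparameters have to be passed in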
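
To make the iterative fitting concrete, here is a minimal sketch of matrix factorization trained with stochastic gradient descent. It is a simplified illustration, not Surprise's `SVD` implementation (which also models biases). The function `train_mf`, the toy ratings, and the default values are assumptions; only the parameter names `n_epochs`, `lr_all` and `reg_all` mirror those used in this section.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr_all=0.01, reg_all=0.02, seed=0):
    """Fit user and item factor vectors with plain SGD.

    ratings: list of (user, item, rating) tuples.
    Returns two dicts: user -> factor vector, item -> factor vector.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random initial factors.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(p[u], q[i]))
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                # Gradient step on the squared error plus L2 regularization.
                p[u][f] += lr_all * (err * qf - reg_all * pf)
                q[i][f] += lr_all * (err * pf - reg_all * qf)
    return p, q


def train_rmse(ratings, p, q):
    """RMSE of the factor model on the ratings it was fitted to."""
    squared = [(r - sum(a * b for a, b in zip(p[u], q[i]))) ** 2
               for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings, for illustration only.
toy_ratings = [
    ("alice", "m1", 5), ("alice", "m2", 3),
    ("bob", "m1", 4), ("bob", "m3", 1),
    ("carol", "m2", 2), ("carol", "m3", 5),
    ("dave", "m1", 3), ("dave", "m3", 4),
]

before = train_rmse(toy_ratings, *train_mf(toy_ratings, n_epochs=0))
after = train_rmse(toy_ratings, *train_mf(toy_ratings))
print(before, after)  # the training error should drop after fitting
```

The number of epochs, the learning rate and the regularization strength all change the result, which is why they are the hyperparameters worth searching over.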

In [2]:
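from surprise import GridSearch
# Note: GridSearch is part of the older Surprise API; newer releases provide
# surprise.model_selection.GridSearchCV instead.

# Three hyperparameters are searched: number of epochs, learning rate,
# and regularization strength. Two values each gives 2 x 2 x 2 = 8 combinations.
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
# SVD is the matrix factorization algorithm; FCP is the fraction of concordant pairs.
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

grid_search.evaluate(data)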

------------
Parameters combination 1 of 8
params:  {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.4}
------------
Mean RMSE: 0.9972
Mean FCP : 0.6843
------------
------------
Parameters combination 2 of 8
params:  {'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.4}
------------
Mean RMSE: 0.9734
Mean FCP : 0.6946
------------
------------
Parameters combination 3 of 8
params:  {'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.4}
------------
Mean RMSE: 0.9777
Mean FCP : 0.6926
------------
------------
Parameters combination 4 of 8
params:  {'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.4}
------------
Mean RMSE: 0.9635
Mean FCP : 0.6987
------------
------------
Parameters combination 5 of 8
params:  {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.6}
------------
Mean RMSE: 1.0029
Mean FCP : 0.6875
------------
------------
Parameters combination 6 of 8
params:  {'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.6}
------------
Mean RMSE: 0.9820
Mean FCP : 0.6953
------------
------------
Paramet... [output truncated]
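
The 8 combinations in the log come from taking every value of each parameter against every value of the others. A minimal sketch of that expansion with `itertools.product` follows; it reuses the `param_grid` from this section, but the enumeration order is an assumption and need not match the order Surprise uses.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Fix an order for the parameter names, then take the Cartesian product
# of their value lists.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print(len(combinations))  # 8
for combination in combinations:
    print(combination)
```

Each extra parameter value multiplies the number of models to train, so the cost of a grid search grows quickly with the size of the grid.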

**GridSearch stores the best scores and the parameter combinations that produced them**

In [3]:
# best RMSE score
print(grid_search.best_score['RMSE'])

# combination of parameters that gave the best RMSE score
print(grid_search.best_params['RMSE'])


# best FCP score
print(grid_search.best_score['FCP'])


# combination of parameters that gave the best FCP score
print(grid_search.best_params['FCP'])


0.963501988854
{'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.4}
0.699084153002
{'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.6}
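
The best entries can also be recovered by hand from the per-combination scores. The sketch below uses the mean scores reported in this section's results table (rounded to six decimals); the `results` list and the variable names are illustrative assumptions, not Surprise attributes.

```python
# (params, mean scores) pairs copied from the grid search results.
results = [
    ({'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.4}, {'RMSE': 0.997160, 'FCP': 0.684266}),
    ({'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.4}, {'RMSE': 0.973383, 'FCP': 0.694552}),
    ({'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.4}, {'RMSE': 0.977697, 'FCP': 0.692616}),
    ({'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.4}, {'RMSE': 0.963502, 'FCP': 0.698722}),
    ({'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.6}, {'RMSE': 1.002855, 'FCP': 0.687482}),
    ({'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.6}, {'RMSE': 0.982047, 'FCP': 0.695337}),
    ({'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.6}, {'RMSE': 0.985981, 'FCP': 0.694338}),
    ({'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.6}, {'RMSE': 0.973282, 'FCP': 0.699084}),
]

# RMSE is an error, so lower is better; FCP is an accuracy, so higher is better.
best_rmse_params, best_rmse_scores = min(results, key=lambda r: r[1]['RMSE'])
best_fcp_params, best_fcp_scores = max(results, key=lambda r: r[1]['FCP'])

print(best_rmse_scores['RMSE'], best_rmse_params)
print(best_fcp_scores['FCP'], best_fcp_params)
```

Note that the two measures pick different winners here: the lowest RMSE comes with `reg_all=0.4`, while the highest FCP comes with `reg_all=0.6`.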


In [5]:
import pandas as pd  

# pd.DataFrame.from_dict accepts a dict, which makes the results easy to inspect as a table.
results_df = pd.DataFrame.from_dict(grid_search.cv_results)
results_df

| | FCP | RMSE | lr_all | n_epochs | params | scores |
|---|---|---|---|---|---|---|
| 0 | 0.684266 | 0.99716 | 0.002 | 5 | {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.4} | {'RMSE': 0.997160189649, 'FCP': 0.684266412476} |
| 1 | 0.694552 | 0.973383 | 0.005 | 5 | {'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.4} | {'RMSE': 0.973383132387, 'FCP': 0.694551932996} |
| 2 | 0.692616 | 0.977697 | 0.002 | 10 | {'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.4} | {'RMSE': 0.977696629511, 'FCP': 0.692615513155} |
| 3 | 0.698722 | 0.963502 | 0.005 | 10 | {'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.4} | {'RMSE': 0.963501988854, 'FCP': 0.698721750945} |
| 4 | 0.687482 | 1.002855 | 0.002 | 5 | {'lr_all': 0.002, 'n_epochs': 5, 'reg_all': 0.6} | {'RMSE': 1.00285516237, 'FCP': 0.687481665759} |
| 5 | 0.695337 | 0.982047 | 0.005 | 5 | {'lr_all': 0.005, 'n_epochs': 5, 'reg_all': 0.6} | {'RMSE': 0.98204676013, 'FCP': 0.695337489535} |
| 6 | 0.694338 | 0.985981 | 0.002 | 10 | {'lr_all': 0.002, 'n_epochs': 10, 'reg_all': 0.6} | {'RMSE': 0.985980855401, 'FCP': 0.694337564062} |
| 7 | 0.699084 | 0.973282 | 0.005 | 10 | {'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.6} | {'RMSE': 0.973281870802, 'FCP': 0.699084153002} |


### The model is built; now use it to recommend something!

In [6]:
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
import os
import io

from surprise import KNNBaseline
from surprise import Dataset


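def read_item_names():
    """Read u.item and return two dicts: raw id -> title and title -> raw id."""

    file_name = './ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid

# 1. Load the data.
data = Dataset.load_builtin('ml-100k')
# 2. The raw data is one rating record per line; build the full trainset from it (a sparse rating matrix).
trainset = data.build_full_trainset()
# 3. Choose the similarity measure: pearson_baseline here, computed between items (user_based=False).
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
# Note: train() is the older Surprise API; newer releases call this method fit().
algo.train(trainset)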

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
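
To give a feel for what an item-item similarity measures, here is a minimal sketch of the plain Pearson correlation between two items over the users who rated both. It is not the `pearson_baseline` measure used above, which centers ratings on baseline estimates instead of means and applies shrinkage. The function name and the toy ratings are illustrative assumptions.

```python
import math


def pearson(ratings_i, ratings_j):
    """Pearson correlation between two items.

    ratings_i, ratings_j: dicts mapping user -> rating.
    Only users who rated both items are taken into account.
    """
    common = sorted(set(ratings_i) & set(ratings_j))
    if len(common) < 2:
        return 0.0
    mean_i = sum(ratings_i[u] for u in common) / len(common)
    mean_j = sum(ratings_j[u] for u in common) / len(common)
    numerator = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j)
                    for u in common)
    denominator = (
        math.sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common))
        * math.sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    )
    return numerator / denominator if denominator else 0.0


# Made-up ratings, for illustration only.
item_a = {'u1': 5, 'u2': 3, 'u3': 1}
item_b = {'u1': 4, 'u2': 3, 'u3': 2, 'u4': 5}
item_c = {'u1': 1, 'u2': 3, 'u3': 5}

print(pearson(item_a, item_b))  # close to 1: the same users like both items
print(pearson(item_a, item_c))  # close to -1: opposite tastes
```

With `user_based=False`, similarities of this kind are computed between pairs of items rather than pairs of users.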


In [11]:
rid_to_name, name_to_rid = read_item_names()

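toy_story_raw_id = name_to_rid['Now and Then (1995)']
# The algorithm works with ids rather than titles, so the title is first converted to its raw id.
# (The variable names keep the toy_story prefix, but the movie queried here is 'Now and Then (1995)'.)

toy_story_raw_id  # raw id, as it appears in the data file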

'1053'

In [12]:
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
toy_story_inner_id  # inner id, the index actually used in the computation (in the matrix)

961
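
The difference between raw and inner ids can be illustrated with a small sketch. It assumes that inner ids are consecutive integers handed out in order of first appearance; Surprise's actual bookkeeping may differ. The function `build_id_maps` and the sample ids are hypothetical.

```python
def build_id_maps(raw_ids):
    """Map raw ids (strings from the data file) to consecutive integer inner ids."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            # The next free integer becomes the inner id.
            raw_to_inner[raw] = len(raw_to_inner)
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Hypothetical raw item ids in the order they appear in a ratings file.
raw_item_ids = ['242', '302', '377', '242', '51']
raw_to_inner, inner_to_raw = build_id_maps(raw_item_ids)

print(raw_to_inner['377'])  # 2
print(inner_to_raw[3])      # 51
```

Inner ids can be used directly as row and column indices of the similarity matrix, which is why the neighbor lookup works on inner ids and the results are converted back to raw ids afterwards.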

In [13]:
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
# Inner ids of the 10 most similar movies.
toy_story_neighbors

[291, 82, 366, 528, 179, 101, 556, 310, 431, 543]
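
Conceptually, a neighbor lookup picks the k items with the highest similarity to the query item, excluding the item itself. The following is a minimal sketch of that idea over a small hand-made similarity matrix; the function `nearest_neighbors` and the matrix values are assumptions for illustration, not Surprise internals.

```python
import heapq


def nearest_neighbors(sim, item, k):
    """Return the inner ids of the k items most similar to `item`.

    sim: square similarity matrix as a list of lists, indexed by inner id.
    """
    others = [(other, sim[item][other])
              for other in range(len(sim)) if other != item]
    top = heapq.nlargest(k, others, key=lambda pair: pair[1])
    return [other for other, _ in top]


# Made-up symmetric similarity matrix for four items.
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
]

print(nearest_neighbors(sim, 0, 2))  # [2, 3]
```

The result is a list of inner ids ordered from most to least similar, the same shape as the list printed above.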

In [14]:
# Map the inner ids back to raw ids, then the raw ids to movie titles.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Now and Then (1995) are:')
for movie in toy_story_neighbors:
    print(movie)


The 10 nearest neighbors of Now and Then (1995) are:
While You Were Sleeping (1995)
Batman (1989)
Dave (1993)
Mrs. Doubtfire (1993)
Groundhog Day (1993)
Raiders of the Lost Ark (1981)
Maverick (1994)
French Kiss (1995)
Stand by Me (1986)
Net, The (1995)
