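#### Understanding the Surprise baseline algorithm
The `BaselineOnly` algorithm in the Surprise package predicts a rating from a baseline (the global mean rating) plus a per-user bias and a per-item bias.  
Unlike matrix-factorization models, which approximate the large sparse rating matrix by the inner product of learned user and item factor matrices, `BaselineOnly` learns no latent factors: it fills in a missing rating as the global mean plus the user and item biases, and fits those biases by minimizing the regularized squared error on the known ratings.  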
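The baseline estimate can be illustrated with a small sketch. The helper name `baseline_estimate`, the bias values, and the 0.5 to 5.0 rating scale are assumptions made for illustration; they are not taken from Surprise or from the dataset.

```python
def baseline_estimate(mu, b_u, b_i, low=0.5, high=5.0):
    """Global mean + user bias + item bias, clipped to the (assumed) rating scale."""
    est = mu + b_u + b_i
    return min(high, max(low, est))


mu = 3.5    # hypothetical global mean rating
b_u = -0.3  # hypothetical user bias: this user rates lower than average
b_i = 0.6   # hypothetical item bias: this movie is rated higher than average
est = baseline_estimate(mu, b_u, b_i)
print(round(est, 2))
```

A strict user and a well-liked movie partly cancel out here, so the estimate lands slightly above the global mean.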

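`BaselineOnly` accepts a `bsl_options` dict whose `method` is either `'als'` or `'sgd'`.  
With `'als'`, the biases are estimated by alternating least squares: hold the user biases fixed and solve for the item biases, then hold the item biases fixed and solve for the user biases, repeating for `n_epochs` passes.  
With `'sgd'`, the biases are estimated by stochastic gradient descent, which minimizes the same loss function step by step.  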
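The two optimizers can be sketched in plain Python on a toy dataset. This is a minimal illustration of the usual baseline update rules, not Surprise's actual implementation: the function names `fit_als`, `fit_sgd` and `sse`, the toy ratings, and the SGD learning rate and regularization values are all assumptions. The ALS defaults mirror the `bsl_options` used in this notebook.

```python
from collections import defaultdict

# Toy (user, item, rating) triples, made up for illustration.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0),
    ("u2", "i1", 4.0), ("u2", "i3", 2.0),
    ("u3", "i2", 4.0), ("u3", "i3", 1.0),
]
mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating


def fit_als(ratings, mu, n_epochs=5, reg_u=12, reg_i=5):
    """Alternate between solving for item biases and for user biases."""
    bu, bi = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Hold user biases fixed; each item bias has a closed-form solution.
        num, cnt = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            num[i] += r - mu - bu[u]
            cnt[i] += 1
        for i in num:
            bi[i] = num[i] / (reg_i + cnt[i])
        # Hold item biases fixed; each user bias has a closed-form solution.
        num, cnt = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            num[u] += r - mu - bi[i]
            cnt[u] += 1
        for u in num:
            bu[u] = num[u] / (reg_u + cnt[u])
    return dict(bu), dict(bi)


def fit_sgd(ratings, mu, n_epochs=5, lr=0.005, reg=0.02):
    """Take one gradient step on the regularized squared error per rating."""
    bu, bi = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
    return dict(bu), dict(bi)


def sse(ratings, mu, bu, bi):
    """Sum of squared errors of mu + b_u + b_i on the known ratings."""
    return sum((r - (mu + bu.get(u, 0.0) + bi.get(i, 0.0))) ** 2
               for u, i, r in ratings)


als_bu, als_bi = fit_als(ratings, mu)
sgd_bu, sgd_bi = fit_sgd(ratings, mu)
print("no biases:", round(sse(ratings, mu, {}, {}), 3))
print("ALS      :", round(sse(ratings, mu, als_bu, als_bi), 3))
print("SGD      :", round(sse(ratings, mu, sgd_bu, sgd_bi), 3))
```

Both fits reduce the squared error relative to predicting the global mean for everything. ALS jumps straight to the regularized optimum for one set of biases at a time, while SGD moves in small steps whose size depends on the learning rate.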

In [6]:
import pandas as pd
import os

path = os.getcwd()

from surprise import Dataset
from surprise import Reader
from surprise import BaselineOnly, KNNBasic, NormalPredictor
from surprise import accuracy
from surprise.model_selection import KFold

In [15]:
# Take a look at the data
# Note: the 'ansi' encoding is only available on Windows
movies = pd.read_csv(path+'/movies.csv', encoding='ansi')
rating = pd.read_csv(path+'/ratings.csv', encoding='ansi')

In [16]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [17]:
rating[rating['userId'] == 196]

Unnamed: 0,userId,movieId,rating,timestamp
24860,196,592,4.0,937944680
24861,196,709,3.0,937945099
24862,196,1033,3.0,937944772
24863,196,1036,4.0,937944562
24864,196,1088,5.0,937945099
...,...,...,...,...
24945,196,2787,5.0,937944772
24946,196,2794,3.0,937945378
24947,196,2797,5.0,937944098
24948,196,2805,3.0,937943509


In [9]:
rating = pd.merge(rating, movies, on='movieId', how='left')

In [2]:
# Load the local dataset
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file('./ratings.csv', reader=reader)
train_set = data.build_full_trainset()

In [19]:
# ALS optimizer
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5}

algo = BaselineOnly(bsl_options=bsl_options)

kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)


# try predictions
uid = str(196)
iid = str(592)
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

Estimating biases using als...
RMSE: 0.8643
Estimating biases using als...
RMSE: 0.8649
Estimating biases using als...
RMSE: 0.8622
user: 196        item: 592        r_ui = 4.00   est = 3.74   {'was_impossible': False}
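The RMSE values reported for each fold are root mean squared errors between the true and the estimated ratings. A minimal sketch of that metric follows; the helper `rmse` and the rating pairs are hypothetical and only illustrate the formula, they are not results from this notebook.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


# Hypothetical (true, estimated) rating pairs, made up for illustration.
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
score = rmse(pairs)
print(round(score, 4))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.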


In [14]:
# SGD optimizer
bsl_options = {'method': 'sgd','n_epochs': 5}
algo = BaselineOnly(bsl_options=bsl_options)
#algo = NormalPredictor()

# Define a K-fold cross-validation iterator with K=3
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)

uid = str(196)
iid = str(592)
# Print the predicted rating of user uid for item iid
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

Estimating biases using sgd...
RMSE: 0.8731
Estimating biases using sgd...
RMSE: 0.8758
Estimating biases using sgd...
RMSE: 0.8745
user: 196        item: 302        r_ui = 4.00   est = 3.90   {'was_impossible': False}


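#### Understanding Slope One
Slope One is an item-based collaborative filtering algorithm. It fills in the sparse rating matrix using the average rating differences between pairs of items that were rated by the same users. Compared with other algorithms it is simpler and cheaper to compute, while its recommendation accuracy remains relatively high.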
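The idea can be sketched on a toy example. The code below implements the basic unweighted Slope One scheme in plain Python; it is an illustration under made-up data, and the helper names `average_deviations` and `predict` are hypothetical. Surprise's `SlopeOne` uses a closely related formulation and may differ in details.

```python
from collections import defaultdict

# Toy ratings, made up for illustration: user -> {item: rating}
ratings = {
    "john": {"A": 5.0, "B": 3.0, "C": 2.0},
    "mark": {"A": 3.0, "B": 4.0},
    "lucy": {"B": 2.0, "C": 5.0},
}


def average_deviations(ratings):
    """dev[i][j] = mean of (r_ui - r_uj) over users who rated both i and j."""
    total = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    total[i][j] += r_i - r_j
                    count[i][j] += 1
    return {i: {j: total[i][j] / count[i][j] for j in total[i]} for i in total}


def predict(ratings, dev, user, item):
    """Average of (r_uj + dev[item][j]) over the user's items co-rated with item."""
    estimates = [
        r_j + dev[item][j]
        for j, r_j in ratings[user].items()
        if j in dev.get(item, {})
    ]
    if not estimates:
        return None  # no co-rated item, so no prediction is possible
    return sum(estimates) / len(estimates)


dev = average_deviations(ratings)
pred = predict(ratings, dev, "lucy", "A")
print(pred)
```

Here item A is rated on average 0.5 higher than B and 3.0 higher than C, so Lucy's ratings of B (2.0) and C (5.0) give the estimates 2.5 and 8.0, whose mean is 5.25. The raw value can exceed the rating scale, which is why a library would normally clip it.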

In [4]:
from surprise import SVD
from surprise import Dataset
# from surprise.model_selection import cross_validate
# from surprise import evaluate, print_perf
from surprise import Reader
from surprise import BaselineOnly, KNNBasic, KNNBaseline, SlopeOne
from surprise import accuracy
from surprise.model_selection import KFold
# import pandas as pd
# import io


# Read item (movie) names
'''
def read_item_names():
    file_name = ('./movies.csv') 
    data = pd.read_csv('./movies.csv')
    rid_to_name = {}
    name_to_rid = {}
    for i in range(len(data['movieId'])):
        rid_to_name[data['movieId'][i]] = data['title'][i]
        name_to_rid[data['title'][i]] = data['movieId'][i]

    return rid_to_name, name_to_rid 
'''


# Load the data
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file('./ratings.csv', reader=reader)
train_set = data.build_full_trainset()

algo = SlopeOne()
# algo = SVD()

# Run 3-fold cross-validation and print the RMSE
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    pred = algo.test(testset)
    accuracy.rmse(pred, verbose=True)

# algo = SVD()  # use the SVD algorithm
# algo = SlopeOne()  # use the SlopeOne algorithm
# algo.fit(train_set)

# Predict the rating for a given user and item
# uid = str(196)
# iid = str(302)
# pred = algo.predict(uid, iid, r_ui=4, verbose=True)

RMSE: 0.8679
RMSE: 0.8682
RMSE: 0.8685


In [5]:
# Predict the rating for a given user and item
uid = str(196)
iid = str(302)
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 4.65   {'was_impossible': False}
