Trying out various recommendation algorithms with Surprise.

First, build recommendations from a custom dataset.  
https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset

In [2]:
import pandas as pd
import surprise

In [5]:
# data = surprise.Dataset.load_builtin('ml-100k')

Build (user_id, item_id, rating) triples.

In [8]:
data_df = pd.read_csv("~/.surprise_data/ml-100k/ml-100k/u.data", sep='\t', header=None)
data_df.columns = ['user_id', 'item_id', 'rating', 'timestamp']

display(data_df.shape)
data_df.head(3)

(100000, 4)

   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116


In [10]:
reader = surprise.Reader(rating_scale=(1, 5))
data = surprise.Dataset.load_from_df(data_df.drop(columns=['timestamp']), reader)

In [12]:
train, test = surprise.model_selection.train_test_split(data, test_size=0.25)
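
The split above is random, so the RMSE values reported below will change from run to run. Surprise's `train_test_split` documents a `random_state` parameter for fixing the split. The sketch below is not Surprise's implementation; it is a plain-Python illustration (with a hypothetical helper `split_ratings` and made-up toy ratings) of what a seeded 75/25 hold-out split does.

```python
import random

def split_ratings(ratings, test_size=0.25, seed=0):
    """Shuffle (user, item, rating) triples and hold out a test fraction.

    Hypothetical helper for illustration only; the seed makes the split repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 10 users x 10 items with made-up ratings in 1..5.
ratings = [(u, i, 1 + (u + i) % 5) for u in range(10) for i in range(10)]
train_part, test_part = split_ratings(ratings)
print(len(train_part), len(test_part))
```

Calling `split_ratings` twice with the same seed returns the same partition, which is what makes results comparable across runs.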

Now make some recommendations.  
Baseline estimates can be computed with either ALS or SGD; unless specified otherwise, ALS is used.
https://surprise.readthedocs.io/en/stable/prediction_algorithms.html#baselines-estimates-configuration
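
To make the ALS baseline concrete, here is a minimal plain-Python sketch of alternating updates for the user and item biases around the global mean. It is not Surprise's code: the function name `als_baselines` and the toy ratings are hypothetical, and the regularization values and epoch count are assumptions chosen to resemble the defaults listed on the linked page.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate mu, b_u and b_i so that r_ui is approximated by mu + b_u + b_i.

    Each epoch solves for the item biases with user biases fixed,
    then for the user biases with item biases fixed.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i

# Toy (user, item, rating) triples, made up for illustration.
toy = [("a", "x", 5), ("a", "y", 4), ("b", "x", 3), ("b", "y", 2), ("c", "x", 4)]
mu, b_u, b_i = als_baselines(toy)
print(mu + b_u["a"] + b_i["x"])  # baseline estimate for user "a" on item "x"
```

The regularization terms in the denominators shrink the biases of users and items with few ratings toward zero.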

See the page below for the available similarity measures. Unless specified otherwise, MSD (mean squared difference) is used.
https://surprise.readthedocs.io/en/stable/prediction_algorithms.html#similarity-measure-configuration
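
As an illustration of the MSD measure, the sketch below computes the mean squared difference over the items two users have both rated and converts it to a similarity as 1 / (msd + 1), following the definition on the linked page. The function name and the toy ratings are hypothetical, and returning 0 when there are no common items is an assumption of this sketch.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD similarity between two users given {item: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no overlap means no similarity
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# Made-up ratings: the two users share items m1 and m2.
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 3, "m4": 2}
print(msd_similarity(alice, bob))
```

Identical ratings on the common items give a similarity of 1, and larger disagreements push the value toward 0.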

The package ships many prediction algorithms, so let's try several of them.  
https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

## KNN

In [33]:
# knn-basic
from surprise import KNNBasic

model = KNNBasic(k=10, min_k=1)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9970


0.9970334424786582

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.278719037032611, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.6871815781621486, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=3.185112434443205, details={'actual_k': 10, 'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.278719037032611, details={'actual_k': 10, 'was_impossible': False})
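
The `est` value above is, per the Surprise documentation for `KNNBasic`, a similarity-weighted average of the ratings that the `k` most similar neighbours gave to the item. A minimal sketch of that aggregation follows, assuming the similarities are already computed; the function name and inputs are hypothetical, and returning `None` when fewer than `min_k` neighbours are usable is a simplification of this sketch.

```python
import heapq

def knn_basic_estimate(sims_to_target, neighbor_ratings, k=10, min_k=1):
    """Weighted average of neighbours' ratings for one item.

    sims_to_target:   {neighbor: similarity to the target user}
    neighbor_ratings: {neighbor: rating that neighbor gave to the item}
    """
    candidates = [
        (sims_to_target[v], r)
        for v, r in neighbor_ratings.items()
        if v in sims_to_target
    ]
    # Keep the k most similar neighbours, ignoring non-positive similarities.
    top = heapq.nlargest(k, candidates, key=lambda t: t[0])
    used = [(s, r) for s, r in top if s > 0]
    total = sum(s for s, _ in used)
    if len(used) < min_k or total == 0:
        return None  # simplification: caller decides on a fallback
    return sum(s * r for s, r in used) / total

sims = {"v1": 0.8, "v2": 0.5, "v3": 0.1}
item_ratings = {"v1": 4, "v2": 2, "v3": 5}
print(knn_basic_estimate(sims, item_ratings, k=2))
```

With `k=2` only `v1` and `v2` contribute, so the least similar neighbour's rating of 5 is ignored.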

In [32]:
# knn-with-means
from surprise import KNNWithMeans

model = KNNWithMeans(k=10, min_k=1)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9829


0.9828628061662306

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.1688408524492853, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.8427338235927704, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=3.1177282711223415, details={'actual_k': 10, 'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.1688408524492853, details={'actual_k': 10, 'was_impossible': False})
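
`KNNWithMeans` differs from `KNNBasic` in that it averages each neighbour's deviation from that neighbour's own mean rating and adds the result to the target user's mean. The sketch below illustrates that formula as described in the Surprise documentation; the function name and the toy numbers are hypothetical.

```python
def knn_with_means_estimate(user_mean, neighbors, k=10):
    """Mean-centred KNN estimate.

    neighbors: list of (similarity, rating_for_item, neighbor_mean_rating).
    """
    top = sorted(neighbors, key=lambda t: t[0], reverse=True)[:k]
    top = [t for t in top if t[0] > 0]
    total = sum(s for s, _, _ in top)
    if total == 0:
        return user_mean  # no usable neighbours: fall back to the user's mean
    offset = sum(s * (r - m) for s, r, m in top) / total
    return user_mean + offset

# Neighbour 1 rated the item above their own mean, neighbour 2 below theirs.
neighbors = [(0.8, 4, 3.0), (0.5, 2, 3.5)]
print(knn_with_means_estimate(3.5, neighbors))
```

Centring on each user's mean compensates for users who systematically rate high or low.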

In [31]:
from surprise import KNNWithZScore

model = KNNWithZScore(k=10, min_k=1)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9895


0.9895299458370242

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.0909100375599565, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.7574135243284634, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=3.063007370952015, details={'actual_k': 10, 'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.0909100375599565, details={'actual_k': 10, 'was_impossible': False})

In [30]:
from surprise import KNNBaseline

model = KNNBaseline(k=10, min_k=1)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9577


0.9576563091609651

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.1689622719087995, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.748796861454316, details={'actual_k': 10, 'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=3.0278242898934873, details={'actual_k': 10, 'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.1689622719087995, details={'actual_k': 10, 'was_impossible': False})

## Matrix Factorization

In [29]:
from surprise import SVD

model = SVD(n_factors=20, n_epochs=20)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

RMSE: 0.9401


0.9401387432312075

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.3281851669930713, details={'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.9968902502531702, details={'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=2.844650721979961, details={'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.3281851669930713, details={'was_impossible': False})
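
For intuition about what `SVD` fits, the Surprise documentation describes the prediction as mu + b_u + b_i + q_i · p_u, trained by stochastic gradient descent on the regularized squared error. The following is a small plain-Python sketch of those updates, not Surprise's implementation; the function names, toy ratings, and hyperparameters (2 factors, 300 epochs, learning rate 0.01) are assumptions chosen so the toy problem converges quickly.

```python
import math
import random

def train_svd_sgd(ratings, n_factors=2, n_epochs=300, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD; return (mu, predict)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random initial factors.
    p = {u: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]  # use the old values for both updates
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, predict

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

toy = [("a", "x", 5), ("a", "y", 4), ("b", "x", 3),
       ("b", "y", 2), ("c", "x", 4), ("c", "z", 1)]
mu, predict = train_svd_sgd(toy)
rmse_mean = rmse([(r, mu) for _, _, r in toy])
rmse_fit = rmse([(r, predict(u, i)) for u, i, r in toy])
print(rmse_mean, rmse_fit)
```

The comparison here is on the training data only, so it shows that the updates reduce the error, not that the model generalizes.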

In [34]:
from surprise import SVDpp

model = SVDpp(n_factors=20, n_epochs=20)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

RMSE: 0.9226


0.922615620486141

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.2735526509591653, details={'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.955180601387324, details={'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=3.0996444853988425, details={'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.2735526509591653, details={'was_impossible': False})

In [35]:
from surprise import NMF

model = NMF(n_factors=20, n_epochs=20)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

RMSE: 1.1052


1.1052331739323384

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.862925277607394, details={'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.8468461331986687, details={'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=3.168556792212676, details={'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.862925277607394, details={'was_impossible': False})

## Simple Collaborative Filtering

In [36]:
from surprise import SlopeOne

model = SlopeOne()
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

RMSE: 0.9474


0.9474056158487151

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.346051394500792, details={'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.738100757521191, details={'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=2.8599386582729753, details={'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.346051394500792, details={'was_impossible': False})

## Co-clustering Collaborative Filtering

In [37]:
from surprise import CoClustering

model = CoClustering(n_cltr_u=5, n_cltr_i=10)
model.fit(train)

preds = model.test(test)
display(surprise.accuracy.rmse(preds))
display(preds[:3])

# (uid, iid)
model.predict(899, 414)

RMSE: 0.9923


0.9923024156568413

[Prediction(uid=899, iid=414, r_ui=2.0, est=3.6671790923516627, details={'was_impossible': False}),
 Prediction(uid=721, iid=299, r_ui=3.0, est=2.8545283238470986, details={'was_impossible': False}),
 Prediction(uid=796, iid=826, r_ui=2.0, est=2.3026791199076917, details={'was_impossible': False})]

Prediction(uid=899, iid=414, r_ui=None, est=3.6671790923516627, details={'was_impossible': False})

Comparing all algorithms on the same train/test split, SVDpp had the lowest RMSE (0.9226), making it the best of those tried here.
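
To see the ranking at a glance, the RMSE values printed by the cells above can be collected and sorted. The numbers below are copied from this notebook's outputs (rounded to four decimals); only the dictionary and variable names are introduced here for illustration.

```python
# RMSE on the held-out test set, as reported by each cell above.
rmse_by_algo = {
    "KNNBasic": 0.9970,
    "KNNWithMeans": 0.9829,
    "KNNWithZScore": 0.9895,
    "KNNBaseline": 0.9577,
    "SVD": 0.9401,
    "SVDpp": 0.9226,
    "NMF": 1.1052,
    "SlopeOne": 0.9474,
    "CoClustering": 0.9923,
}

# Lower RMSE is better, so sort ascending.
ranking = sorted(rmse_by_algo.items(), key=lambda kv: kv[1])
for name, score in ranking:
    print(f"{name:<14}{score:.4f}")

best_name, best_rmse = ranking[0]
```

These figures come from a single random split with mostly hand-picked hyperparameters, so small gaps between algorithms should be read with caution.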

In [42]:
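# `model` is still the last fitted algorithm (CoClustering), not SVDpp.
# `.est` extracts the estimated rating from the Prediction object.
model.predict(899, 414).est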

3.6671790923516627