### Data used

- u.data: the list of ratings given by users
- u.item: data for each movie, mapping each id to its title

In [1]:
import pandas as pd
from datetime import datetime

In [2]:
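# u.data is tab-separated with no header row; the timestamp column holds Unix seconds,
# which datetime.fromtimestamp converts to the local time of the machine running the notebook
df_data = pd.read_csv("data/u.data", sep="\t", header=None,
                      names=["user_id", "movie_id", "rating", "timestamp"],
                      parse_dates=[3], date_parser=lambda x: datetime.fromtimestamp(int(x)))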

df_item = pd.read_csv("data/u.item", sep="|", header=None,
                     names=["movie_id", "movie_title", "release_date"],
                     usecols=[0, 1, 2], parse_dates=[2], index_col=0)
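
Because `datetime.fromtimestamp` without a `tz` argument uses the local timezone, the parsed timestamps depend on where the notebook runs. As a minimal sketch, the conversion can be made explicit with the standard library. The timestamp value below is an illustrative assumption, not a value read from the file.

```python
from datetime import datetime, timedelta, timezone

# Illustrative Unix timestamp in seconds (assumed sample value).
ts = 881250949

# Interpret the value as UTC explicitly instead of relying on the local timezone.
utc_time = datetime.fromtimestamp(ts, tz=timezone.utc)

# Convert to a fixed UTC+9 offset (Japan Standard Time) for comparison.
jst = timezone(timedelta(hours=9))
jst_time = utc_time.astimezone(jst)

print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))
print(jst_time.strftime("%Y-%m-%d %H:%M:%S"))
```

Passing an explicit timezone keeps the result reproducible across machines, at the cost of producing timezone-aware values.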

In [3]:
df_data.head()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 1997-12-05 00:55:49 |
| 1 | 186 | 302 | 3 | 1998-04-05 04:22:22 |
| 2 | 22 | 377 | 1 | 1997-11-07 16:18:36 |
| 3 | 244 | 51 | 2 | 1997-11-27 14:02:03 |
| 4 | 166 | 346 | 1 | 1998-02-02 14:33:16 |


In [4]:
df_item.head()

| movie_id | movie_title | release_date |
|---|---|---|
| 1 | Toy Story (1995) | 1995-01-01 |
| 2 | GoldenEye (1995) | 1995-01-01 |
| 3 | Four Rooms (1995) | 1995-01-01 |
| 4 | Get Shorty (1995) | 1995-01-01 |
| 5 | Copycat (1995) | 1995-01-01 |


### Preparing the data

The data must be reshaped into the format that Surprise expects.

In [5]:
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

In [11]:
# Keep only "which user gave which movie what score" and load it into Surprise
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(df_data[["user_id", "movie_id", "rating"]], reader)

### Training SVD

In [12]:
from surprise import SVD

In [14]:
model = SVD(random_state=1)
model.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1257a5190>

Extract the users who have rated the fewest movies

In [15]:
aggregated = df_data[["user_id", "movie_id"]].groupby("user_id").count().rename(columns = {"movie_id" : "count"})
aggregated.min()

count    20
dtype: int64

In [17]:
aggregated[aggregated["count"] == 20]

| user_id | count |
|---|---|
| 19 | 20 |
| 34 | 20 |
| 36 | 20 |
| 93 | 20 |
| 143 | 20 |
| 147 | 20 |
| 166 | 20 |
| 202 | 20 |
| 242 | 20 |
| 300 | 20 |


In [18]:
uid = 364
df_data[df_data["user_id"] == uid].merge(df_item, "left", on="movie_id")

| | user_id | movie_id | rating | timestamp | movie_title | release_date |
|---|---|---|---|---|---|---|
| 0 | 364 | 690 | 4 | 1997-10-04 11:15:09 | Seven Years in Tibet (1997) | 1997-01-01 |
| 1 | 364 | 1048 | 5 | 1997-10-04 11:19:45 | She's the One (1996) | 1996-08-23 |
| 2 | 364 | 321 | 2 | 1997-10-04 11:17:58 | Mother (1996) | 1996-12-25 |
| 3 | 364 | 289 | 3 | 1997-10-04 11:17:12 | Evita (1996) | 1996-12-25 |
| 4 | 364 | 288 | 4 | 1997-10-04 11:17:12 | Scream (1996) | 1996-12-20 |
| 5 | 364 | 269 | 4 | 1997-10-04 11:15:09 | Full Monty, The (1997) | 1997-01-01 |
| 6 | 364 | 875 | 3 | 1997-10-04 11:19:45 | She's So Lovely (1997) | 1997-08-22 |
| 7 | 364 | 988 | 2 | 1997-10-04 11:19:21 | Beautician and the Beast, The (1997) | 1997-02-07 |
| 8 | 364 | 294 | 5 | 1997-10-04 11:17:12 | Liar Liar (1997) | 1997-03-21 |
| 9 | 364 | 262 | 3 | 1997-10-04 11:17:12 | In the Company of Men (1997) | 1997-08-01 |


In [22]:
# Predict how uid=364 would rate movie_id = 1 if they watched it
iid = 1
model.predict(uid=uid, iid=iid).est

4.1793513287571535
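
For reference, Surprise documents the SVD estimate as `mu + b_u + b_i + q_i · p_u` (global mean, user bias, item bias, and the dot product of the item and user factor vectors), clipped to the rating scale by default. The sketch below reproduces that arithmetic in plain Python. All parameter values are hypothetical numbers chosen for illustration, not values taken from the trained model.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(1, 5)):
    """Return mu + b_u + b_i + dot(q_i, p_u), clipped to rating_scale."""
    dot = sum(q * p for q, p in zip(q_i, p_u))
    raw = mu + b_u + b_i + dot
    low, high = rating_scale
    return min(high, max(low, raw))


# Hypothetical parameters (assumed for illustration only).
mu = 3.5            # global mean rating
b_u = 0.2           # user bias
b_i = 0.4           # item bias
p_u = [0.1, -0.3]   # user factors
q_i = [0.5, 0.2]    # item factors

est = svd_estimate(mu, b_u, b_i, p_u, q_i)
print(est)
```

The clipping step is why an estimate never falls outside the 1 to 5 range given to `Reader`.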

In [48]:
def get_ranking(user_id, model, candidates):
    pred = [(i, model.predict(user_id, i).est) for i in candidates]
    return sorted(pred, key=lambda x : x[1], reverse=True)

def show_result(res):
    for movie, score in res:
        print("{:4d} {:70s} {:f}".format(movie, df_item.loc[movie]["movie_title"], score))
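
`get_ranking` sorts every candidate, although only the top 30 are displayed afterwards. When the candidate list is large, `heapq.nlargest` is one way to keep only the best N. The sketch below is an alternative, not the notebook's method: `toy_scores` and `get_top_n` are hypothetical names, and the dictionary stands in for `model.predict(...).est` so that the example runs without a trained model.

```python
import heapq

# Hypothetical predicted scores keyed by movie id (stand-in for model.predict(...).est).
toy_scores = {1: 3.9, 50: 4.4, 64: 4.5, 169: 4.55, 318: 4.6}


def get_top_n(score_of, candidates, n):
    """Return the n (candidate, score) pairs with the highest scores, best first."""
    scored = ((c, score_of(c)) for c in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


top3 = get_top_n(toy_scores.get, toy_scores.keys(), 3)
print(top3)
```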

In [49]:
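# Show the list of recommended movies (in ranking order)
# Movies already rated by the target user (uid) are excluded from the candidates
watched = set(df_data[df_data["user_id"] == uid]["movie_id"])
r = get_ranking(uid, model, set(df_item.index) - watched)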

In [50]:
show_result(r[:30])

 408 Close Shave, A (1995)                                                  4.602473
 318 Schindler's List (1993)                                                4.549611
 169 Wrong Trousers, The (1993)                                             4.525021
 483 Casablanca (1942)                                                      4.515352
  64 Shawshank Redemption, The (1994)                                       4.508791
 603 Rear Window (1954)                                                     4.477425
 114 Wallace & Gromit: The Best of Aardman Animation (1996)                 4.453611
  12 Usual Suspects, The (1995)                                             4.444211
 178 12 Angry Men (1957)                                                    4.434682
  50 Star Wars (1977)                                                       4.404653
 272 Good Will Hunting (1997)                                               4.363911
 285 Secrets & Lies (1996)                                       

#### Adding data

By artificially adding a user, we compute which movies would be recommended to someone who likes "Star Wars". As preparation, we first display the maximum user ID.

In [30]:
df_data["user_id"].max()

943

In [31]:
df_item[df_item["movie_title"].str.contains("Star Wars")]

| movie_id | movie_title | release_date |
|---|---|---|
| 50 | Star Wars (1977) | 1977-01-01 |


The maximum user ID was 943, so we artificially add a user with ID 944 and assume that this user rated "Star Wars" (movie ID = 50) as 5.

In [32]:
uid = 944
iid = 50

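# DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
new_row = pd.DataFrame([{"user_id": uid, "movie_id": iid, "rating": 5}])
df_data2 = pd.concat([df_data, new_row], ignore_index=True).convert_dtypes()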

In [34]:
df_data2.tail()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 99996 | 716 | 204 | 5 | 1997-11-18 04:39:03 |
| 99997 | 276 | 1090 | 1 | 1997-09-21 07:49:55 |
| 99998 | 13 | 225 | 2 | 1997-12-18 07:52:36 |
| 99999 | 12 | 203 | 3 | 1997-11-20 02:13:03 |
| 100000 | 944 | 50 | 5 | NaT |


Compute and display the recommendations

In [38]:
model = SVD(random_state=1)
data2 = Dataset.load_from_df(df_data2[["user_id", "movie_id", "rating"]], reader)
model.fit(data2.build_full_trainset())

r = get_ranking(uid, model, [x for x in df_item.index if x != iid])

show_result(r[:30])

 169 Wrong Trousers, The (1993)                                             4.685172
 483 Casablanca (1942)                                                      4.644552
 318 Schindler's List (1993)                                                4.626595
 603 Rear Window (1954)                                                     4.623174
 408 Close Shave, A (1995)                                                  4.591420
 165 Jean de Florette (1986)                                                4.587939
 480 North by Northwest (1959)                                              4.548979
 515 Boot, Das (1981)                                                       4.539462
 127 Godfather, The (1972)                                                  4.488410
 657 Manchurian Candidate, The (1962)                                       4.483740
 272 Good Will Hunting (1997)                                               4.468293
 427 To Kill a Mockingbird (1962)                                

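### Evaluating the algorithms

Models compared: KNNBasic, NMF, SVD  
Metric: RMSE (Root Mean Squared Error)

NMF and SVD are both matrix factorization algorithms, while KNNBasic is neighborhood-based.
- NMF (Non-negative Matrix Factorization): factorizes the rating matrix into matrices with non-negative entries
- SVD (Singular Value Decomposition): an algorithm designed to give good results on rating prediction
- KNNBasic: based on the k-nearest neighbors method, which focuses on how close rating vectors are to each other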
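
To make "closeness" concrete: Surprise's `KNNBasic` uses the mean squared difference (MSD) similarity by default, documented as `1 / (msd + 1)` where `msd` is the mean squared rating difference over the items both users rated. The following is a minimal sketch of that measure in plain Python. The function name `msd_similarity` and the toy rating dictionaries are assumptions made for illustration, and returning 0 when two users share no items is how this sketch handles that case.

```python
def msd_similarity(ratings_u, ratings_v):
    """Similarity in (0, 1] from the mean squared difference over co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        # No co-rated items: treat the users as not similar (assumption of this sketch).
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


# Toy ratings keyed by movie id (hypothetical values).
user_a = {1: 5, 2: 3, 3: 4}
user_b = {1: 4, 2: 3, 4: 2}
user_c = {5: 1}

sim_ab = msd_similarity(user_a, user_b)
print(sim_ab)
```

Identical ratings on the shared items give a similarity of 1, and larger disagreements push the value toward 0.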

In [39]:
from surprise.model_selection import train_test_split
from surprise import accuracy

from surprise.prediction_algorithms.knns import KNNBasic
from surprise.prediction_algorithms.matrix_factorization import NMF

In [40]:
trainset, testset = train_test_split(data, test_size = 0.2)

In [42]:
algorithms = [KNNBasic, NMF, SVD]
algo_names = ["KNNBasic", "NMF", "SVD"]

for algo, name in zip(algorithms, algo_names):
    model = algo()
    model.fit(trainset)
    predictions = model.test(testset)
    print(name)
    print(accuracy.rmse(predictions, verbose=False))

Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic
0.9786231543157083
NMF
0.9593563239030428
SVD
0.9326836956463491


SVD gives the best result, with the lowest RMSE of the three models.
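
For reference, RMSE is the square root of the mean of the squared differences between true and estimated ratings, so lower values are better. The sketch below computes it by hand on a few made-up (true, estimated) pairs. The `rmse` helper and the toy values are assumptions for illustration and are unrelated to the results above.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one prediction is required")
    mse = sum((true - est) ** 2 for true, est in pairs) / len(pairs)
    return math.sqrt(mse)


# Hypothetical (true, estimated) ratings.
toy_predictions = [(4, 3.5), (2, 2.5), (5, 4.0), (3, 3.0)]

value = rmse(toy_predictions)
print(value)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.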