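# 1. Collaborative Filtering Review
- Recommender system: provides the purchase information best suited to each customer, based on personalized information such as the customer's preferences, interests, and purchase history
- Approaches
  - Item-based collaborative filtering: recommends based on similarity between items
  - User-based collaborative filtering: recommends based on similarity between users
    - Generally reported to perform worse than the other approaches
  - Matrix factorization collaborative filtering: assumes latent factors exist and recovers them through matrix factorization
    - Typically uses SVD (singular value decomposition)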
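
The latent-factor idea can be sketched with a plain SVD on a small, fully observed matrix. The ratings below are hypothetical, and `k = 2` is an arbitrary choice for illustration. Library implementations such as Surprise's `SVD` fit the factors on observed ratings only, so this is an illustration of the idea rather than of that implementation.

```python
import numpy as np

# Toy user x item rating matrix (hypothetical values, no missing entries).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Thin SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k strongest latent factors.
k = 2
user_factors = U[:, :k] * s[:k]   # one k-dimensional vector per user
item_factors = Vt[:k, :]          # one k-dimensional vector per item

# Rank-k reconstruction of the rating matrix.
R_hat = user_factors @ item_factors
print(np.round(R_hat, 2))
```

Each predicted rating is the dot product of a user vector and an item vector, which is what "latent factors" means in this setting.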

- Input: user-item matrix (preference matrix)
  - Typically ratings on a 5-point scale, compared with cosine similarity
  - For binary data, Jaccard similarity is used
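
A minimal sketch of the two similarity measures in pure Python. The function names `cosine_sim` and `jaccard_sim` and the sample vectors are hypothetical, chosen only for illustration. Returning 0 for empty or all-zero vectors is a convention assumed here.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_a * norm_b)

def jaccard_sim(a, b):
    """|intersection| / |union| of the positions marked 1 in two binary vectors."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, x in enumerate(b) if x}
    union = set_a | set_b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(set_a & set_b) / len(union)

# Ratings on a 5-point scale (0 = not rated in this toy example).
ratings_a = [5, 3, 0, 1]
ratings_b = [4, 0, 0, 1]
print(cosine_sim(ratings_a, ratings_b))

# Binary "liked / not liked" vectors.
liked_a = [1, 1, 0, 1]
liked_b = [1, 0, 0, 1]
print(jaccard_sim(liked_a, liked_b))
```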

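- Limitations
  - Recommendations come from the rating matrix alone, so context and content are not taken into account
    - More recently, deep-learning-based recommender systems have emerged that incorporate item text and user context information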

In [2]:
import pandas as pd
df = pd.read_csv('movies.csv')

In [3]:
df.head()

Unnamed: 0,userId,title,rating,timestamp
0,1,American Pie,4.0,1260759139
1,4,American Pie,4.0,949896114
2,15,American Pie,4.0,1052896867
3,30,American Pie,2.0,994439964
4,34,American Pie,4.0,973747765


In [4]:
df = df.pivot_table('rating', index = 'userId', columns = 'title')
df.head()

userId,10 Things I Hate About You,12 Angry Men,1408,15 Minutes,16 Blocks,"20,000 Leagues Under the Sea",2001: A Space Odyssey,2046,21 Grams,25th Hour,...,Willy Wonka & the Chocolate Factory,World Trade Center,X-Men Origins: Wolverine,Y Tu Mamá También,You Only Live Twice,"You, Me and Dupree",Young Frankenstein,Zodiac,eXistenZ,xXx
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,3.0,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,3.0,,,,,...,,,5.0,,,,5.0,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [5]:
df.shape

(670, 856)

- Users rate over different ranges, so each user's ratings are rescaled (centered on the user's mean and divided by the user's rating range)

In [6]:
import numpy as np
df_scaled = df.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)
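
To see what this per-user scaling does, here is a minimal sketch on a hypothetical two-user table. It uses the pandas methods `mean`, `max`, and `min`, which skip missing ratings, in place of the NumPy wrappers in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame(
    {'A': [1.0, 4.0], 'B': [3.0, 5.0], 'C': [5.0, np.nan]},
    index=pd.Index([1, 2], name='userId'),
)

# Center each user's ratings on that user's mean, then divide by that user's range.
toy_scaled = toy.apply(lambda x: (x - x.mean()) / (x.max() - x.min()), axis=1)
print(toy_scaled)
```

User 1 rates from 1 to 5 and user 2 only from 4 to 5, yet after scaling both users' lowest rating maps to -0.5 and their highest to 0.5. A user whose ratings are all identical has a range of 0, so that row becomes NaN under this formula.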

## Item-based collaborative filtering
- Similarity is measured on an item × user matrix

In [8]:
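# Transpose to item x user and fill missing ratings with 0.
# Caution: this cell is not idempotent. Running it a second time transposes the
# matrix back to user x item, which is the orientation in the output recorded below.
df_scaled = df_scaled.transpose().fillna(0)
df_scaled.head(5)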

userId,10 Things I Hate About You,12 Angry Men,1408,15 Minutes,16 Blocks,"20,000 Leagues Under the Sea",2001: A Space Odyssey,2046,21 Grams,25th Hour,...,Willy Wonka & the Chocolate Factory,World Trade Center,X-Men Origins: Wolverine,Y Tu Mamá También,You Only Live Twice,"You, Me and Dupree",Young Frankenstein,Zodiac,eXistenZ,xXx
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,-0.097222,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,-0.324468,0.0,0.0,0.0,0.0,...,0.0,0.0,0.175532,0.0,0.0,0.0,0.175532,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Movie similarity is evaluated from how similarly users rated the movies

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
movie_sim = cosine_similarity(df_scaled, df_scaled)
print(movie_sim.shape)

(670, 670)


In [10]:
movie_sim

array([[ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.04975587, ..., -0.03629348,
        -0.20534444,  0.03750097],
       [ 0.        ,  0.04975587,  1.        , ...,  0.28561068,
        -0.04124805,  0.08952754],
       ...,
       [ 0.        , -0.03629348,  0.28561068, ...,  1.        ,
         0.        ,  0.05218536],
       [ 0.        , -0.20534444, -0.04124805, ...,  0.        ,
         1.        , -0.0271319 ],
       [ 0.        ,  0.03750097,  0.08952754, ...,  0.05218536,
        -0.0271319 ,  1.        ]])

In [11]:
movie_sim = pd.DataFrame(movie_sim, index = df_scaled.index, columns = df_scaled.index)

In [12]:
movie_sim['Snow White and the Seven Dwarfs'].sort_values(ascending=False)[:10]

KeyError: ignored
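
The `KeyError` above is consistent with the `(670, 670)` shape printed earlier. With 670 users and 856 movies, a 670 × 670 matrix holds user-to-user similarities, so its labels are `userId` values rather than titles. A title lookup needs the movies on the rows before the similarity is computed, which would give an 856 × 856 matrix here. A minimal sketch on a hypothetical toy matrix, using plain NumPy for the cosine step instead of scikit-learn:

```python
import numpy as np
import pandas as pd

# Hypothetical scaled ratings: rows are users, columns are movie titles.
user_item = pd.DataFrame(
    [[0.5, 0.4, -0.5],
     [0.3, 0.5, -0.4],
     [-0.5, -0.3, 0.5]],
    index=pd.Index([1, 2, 3], name='userId'),
    columns=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Put movies on the rows so each row is one movie's rating vector.
item_user = user_item.T

# Cosine similarity between rows: normalize each row, then take dot products.
# (No row in this toy example is all zeros, so the division is safe.)
X = item_user.to_numpy()
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = pd.DataFrame(X_unit @ X_unit.T, index=item_user.index, columns=item_user.index)

# The similarity matrix is labeled by title, so a title lookup works.
print(sim['Movie A'].sort_values(ascending=False))
```

Checking that the similarity matrix has one row per movie (not per user) before the lookup is a quick way to catch this orientation problem.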

# 2. Surprise

In [14]:
!pip install surprise



In [15]:
df = pd.read_csv('movies.csv')
df

Unnamed: 0,userId,title,rating,timestamp
0,1,American Pie,4.0,1260759139
1,4,American Pie,4.0,949896114
2,15,American Pie,4.0,1052896867
3,30,American Pie,2.0,994439964
4,34,American Pie,4.0,973747765
...,...,...,...,...
18566,652,Lion of the Desert,5.0,1440269953
18567,652,Mud,5.0,1439586954
18568,659,K-PAX,4.0,836137550
18569,659,Starship Troopers,3.0,834694187


In [16]:
from surprise import Reader, Dataset, KNNBasic, SVD
from surprise.model_selection import train_test_split, cross_validate
import pandas as pd

# Prepare the data to be used in Surprise
reader = Reader(rating_scale=(0,5))
data = Dataset.load_from_df(df[['userId', 'title', 'rating']], reader=reader)


In [17]:
data

<surprise.dataset.DatasetAutoFolds at 0x7949ddcb7a00>

In [31]:
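# Exercise: build recommendations using only the rows of df with rating >= 3.
# Split train/test 7:3 and compare the test RMSE of SVD, IBCF, and UBCF.
from surprise import accuracy

df1 = df[df.rating >= 3]
data1 = Dataset.load_from_df(df1[['userId', 'title', 'rating']], reader=reader)
trainset1, testset1 = train_test_split(data1, test_size=0.3)

# User-based CF (UBCF)
sim_options = {
    'name': 'cosine',
    'user_based': True
}

algo_u = KNNBasic(sim_options=sim_options)

# Item-based CF (IBCF)
sim_options = {
    'name': 'cosine',
    'user_based': False
}

algo_i = KNNBasic(sim_options=sim_options)

# Matrix factorization (SVD)
algo_s = SVD()

# Fit each model on the same split and report its RMSE on the test set.
for name, model in [('UBCF', algo_u), ('IBCF', algo_i), ('SVD', algo_s)]:
    model.fit(trainset1)
    predictions = model.test(testset1)
    print(name)
    accuracy.rmse(predictions)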

In [18]:
sim_options = {
    'name': 'cosine',
    'user_based': True
}
algo = KNNBasic(sim_options=sim_options)

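# Split the data into a trainset and a testset, then fit on the trainset.
trainset, testset = train_test_split(data)
algo.fit(trainset)

# Predict. Item ids in this dataset are movie titles, so the raw item id 5 is
# unknown to the model and the estimate falls back to the global mean rating.
# verbose=True already prints the prediction, so print() shows it a second time.
print(algo.predict(1, 5, r_ui=None, verbose=True))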

Computing the cosine similarity matrix...
Done computing similarity matrix.
user: 1          item: 5          r_ui = None   est = 3.54   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
user: 1          item: 5          r_ui = None   est = 3.54   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


In [19]:
prediction = algo.test(testset)
prediction[:5]

[Prediction(uid=561, iid='Monsoon Wedding', r_ui=4.0, est=3.7248032037506023, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=547, iid='Escape from New York', r_ui=5.0, est=3.9030025102146686, details={'actual_k': 35, 'was_impossible': False}),
 Prediction(uid=311, iid='The Terminal', r_ui=3.0, est=3.6817995622078805, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=602, iid='Reservoir Dogs', r_ui=4.0, est=3.239011593861146, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=412, iid='Silent Hill', r_ui=4.0, est=3.6248513221932432, details={'actual_k': 40, 'was_impossible': False})]
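
For intuition about the `est` values above: Surprise's documentation describes the `KNNBasic` estimate as a similarity-weighted average of the ratings that the nearest neighbors gave to the item, and `actual_k` is the number of neighbors actually used. A minimal sketch of that weighted average follows. The function name `knn_basic_estimate` and the (similarity, rating) pairs are hypothetical, and this is not Surprise's internal code.

```python
def knn_basic_estimate(neighbors, k=40):
    """Similarity-weighted mean of the ratings of the k most similar neighbors.

    neighbors: list of (similarity, rating) pairs for users who rated the item.
    """
    # Keep the k neighbors with the highest similarity.
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    weight_sum = sum(sim for sim, _ in top_k)
    if weight_sum == 0:
        raise ValueError('no neighbor with non-zero similarity')
    return sum(sim * rating for sim, rating in top_k) / weight_sum

# Hypothetical (similarity, rating) pairs for one target user and item.
neighbors = [(0.9, 4.0), (0.5, 3.0), (0.1, 5.0)]
print(knn_basic_estimate(neighbors, k=2))
```

With `k=2` only the two most similar neighbors contribute, so the weakly similar neighbor's rating of 5.0 is ignored.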

- Algorithms supported by Surprise
  - Random: random recommendations
  - Baseline: ALS (Alternating Least Squares), SGD (Stochastic Gradient Descent)
  - Matrix factorization: SVD, SVD++
  - KNNs
  - Slope One, Co-clustering

  - Reference: https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

In [20]:
from surprise import BaselineOnly
from surprise import KNNWithMeans, KNNBasic, KNNBaseline
from surprise import SVD
from surprise import SVDpp

from surprise import accuracy
from surprise import Dataset
from surprise.model_selection import train_test_split
import matplotlib.pyplot as plt


In [21]:
trainset, testset = train_test_split(data, test_size = 0.25)

In [22]:
recom = KNNBasic()  #CF
recom.fit(trainset)
predictions = recom.test(testset)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9436


0.9435626609948521

In [23]:
recom = SVD()  # uses matrix factorization (SVD)
recom.fit(trainset)
predictions = recom.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8965


0.8965387192033527
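
`accuracy.rmse` reports the root mean squared error between the true ratings (`r_ui`) and the estimates (`est`), so lower is better. On the split above, SVD's 0.8965 is lower than KNNBasic's 0.9436. A minimal sketch of the formula on hypothetical (true, estimated) pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (r_ui, est) pairs.
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5)]
print(rmse(pairs))
```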

- Run several recommendation algorithms on ml-100k and compare them

In [24]:
# MovieLens 100K
data = Dataset.load_builtin(name=u'ml-100k', prompt = False)
trainset, testset = train_test_split(data, test_size = 0.25)

Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [25]:
algorithms = [KNNBasic, SVD]

algos = []   # algorithm names
rmses = []   # test RMSE per algorithm

for algo_class in algorithms:
  algo = algo_class()
  algos.append(algo_class.__name__)
  algo.fit(trainset)
  predictions = algo.test(testset)
  rmses.append(accuracy.rmse(predictions))

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9832
RMSE: 0.9387
