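# 1. Collaborative Filtering
- Recommender system: provides each customer with the most suitable purchase information, based on personalized data such as preferences, interests, and purchase history
- Approaches
  - Item-based collaborative filtering: recommends based on similarity between items
  - User-based collaborative filtering: recommends based on similarity between users
    - Known to perform worse than the other approaches
  - Matrix factorization collaborative filtering: assumes latent factors exist and finds them by factorizing the matrix
    - Typically uses SVD (singular value decomposition)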
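A minimal sketch of the latent-factor idea, assuming a small, fully observed toy matrix (the values are made up for illustration). It keeps only the `k` largest singular values and rebuilds the matrix from them. Real rating matrices are mostly empty, so libraries usually fit the factors on the observed ratings only rather than running a plain SVD.

```python
import numpy as np

# Toy user x item rating matrix (hypothetical values, fully observed for simplicity).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def low_rank_approx(matrix, k):
    """Reconstruct `matrix` from its k largest singular values (k latent factors)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

R_hat = low_rank_approx(R, k=2)  # two latent factors
print(np.round(R_hat, 2))
```

With `k` equal to the full rank the matrix is recovered exactly; smaller `k` trades reconstruction error for a more compact description of users and items.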

- Input: user-item matrix (preference matrix)
  - Typically ratings on a 5-point scale, compared with cosine similarity
  - For binary data, Jaccard similarity is used
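The two similarity measures can be sketched in plain Python. The vectors below are hypothetical; `cosine_sim` and `jaccard_sim` are illustrative helper names, not library functions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_sim(a, b):
    """Size of the intersection over size of the union of the non-zero positions."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

ratings_a = [5, 3, 0, 1]   # 5-point ratings, 0 = not rated
ratings_b = [4, 0, 0, 1]
liked_a = [1, 1, 0, 1]     # binary "interacted or not" data
liked_b = [1, 0, 0, 1]

print(cosine_sim(ratings_a, ratings_b))
print(jaccard_sim(liked_a, liked_b))
```

Cosine similarity uses the rating magnitudes, while Jaccard only looks at which positions are set, which is why it suits binary data.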

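- Limitations
  - Recommendations come from the plain matrix alone, so context and content are not taken into account
    - More recently, deep-learning-based recommender systems have emerged that incorporate item text and user context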

In [1]:
import pandas as pd
df = pd.read_csv('movies.csv')

In [2]:
df.head()

,userId,title,rating,timestamp
0,1,American Pie,4.0,1260759139
1,4,American Pie,4.0,949896114
2,15,American Pie,4.0,1052896867
3,30,American Pie,2.0,994439964
4,34,American Pie,4.0,973747765


In [3]:
df = df.pivot_table('rating', index = 'userId', columns = 'title')
df.head()

userId,10 Things I Hate About You,12 Angry Men,1408,15 Minutes,16 Blocks,"20,000 Leagues Under the Sea",2001: A Space Odyssey,2046,21 Grams,25th Hour,...,Willy Wonka & the Chocolate Factory,World Trade Center,X-Men Origins: Wolverine,Y Tu Mamá También,You Only Live Twice,"You, Me and Dupree",Young Frankenstein,Zodiac,eXistenZ,xXx
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,3.0,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,3.0,,,,,...,,,5.0,,,,5.0,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [4]:
df.shape

(670, 856)

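- Users rate on different ranges, so the ratings are rescaled to adjust for this. Note that `MinMaxScaler` scales each column independently, so on this user × title matrix it rescales each movie's ratings to [0, 1] rather than each user's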

In [5]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns = df.columns, index = df.index)
df_scaled

userId,10 Things I Hate About You,12 Angry Men,1408,15 Minutes,16 Blocks,"20,000 Leagues Under the Sea",2001: A Space Odyssey,2046,21 Grams,25th Hour,...,Willy Wonka & the Chocolate Factory,World Trade Center,X-Men Origins: Wolverine,Y Tu Mamá También,You Only Live Twice,"You, Me and Dupree",Young Frankenstein,Zodiac,eXistenZ,xXx
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,0.5,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,0.555556,,,,,...,,,1.0,,,,1.0,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,,,,,,,,,,,...,,,,,,,,,,
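If the goal is to normalize each user's own rating range, the scaling has to run along rows instead of columns. A minimal sketch with pandas, assuming a toy matrix with made-up values; `scale_per_user` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix (hypothetical values); NaN marks "not rated".
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 3.0, np.nan],
        "Movie B": [3.0, 2.0, 4.0],
        "Movie C": [1.0, np.nan, 2.0],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

def scale_per_user(matrix):
    """Min-max scale each row (user) to [0, 1]; NaN entries stay NaN."""
    row_min = matrix.min(axis=1)   # NaN is skipped by default
    row_max = matrix.max(axis=1)
    # A user whose ratings are all equal has zero range; mark it NaN to avoid dividing by zero.
    span = (row_max - row_min).replace(0, np.nan)
    return matrix.sub(row_min, axis=0).div(span, axis=0)

scaled = scale_per_user(ratings)
print(scaled)
```

An equivalent route is to fit `MinMaxScaler` on the transposed matrix and transpose the result back.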


## Item-based collaborative filtering
- Similarity is measured on an item × user matrix

In [6]:
df_scaled = df_scaled.transpose().fillna(0)
df_scaled.head(5)

title,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
10 Things I Hate About You,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15 Minutes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16 Blocks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Movie similarity is evaluated from how similarly users rated the movies

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
movie_sim = cosine_similarity(df_scaled, df_scaled)
print(movie_sim.shape)

(856, 856)


In [8]:
movie_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.08998054,
        0.07767356],
       [0.        , 0.        , 0.        , ..., 0.08998054, 1.        ,
        0.0226314 ],
       [0.        , 0.        , 0.        , ..., 0.07767356, 0.0226314 ,
        1.        ]])

In [9]:
movie_sim = pd.DataFrame(movie_sim, index = df_scaled.index, columns = df_scaled.index)

In [10]:
movie_sim['Sin City'].sort_values(ascending=False)[:10]

title,Sin City
Sin City,1.0
The Sentinel,0.692308
Freddy vs. Jason,0.667124
Austin Powers in Goldmember,0.24082
The Mummy Returns,0.217571
"Monsters, Inc.",0.208739
Saw IV,0.200063
Dances with Wolves,0.186198
Apocalypse Now,0.176147
"Good Morning, Vietnam",0.168964

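The lookup above can be wrapped in a small helper that drops the queried title, which otherwise appears first in its own list. This is a sketch on a toy similarity matrix with hypothetical titles and values; `most_similar` is an illustrative name.

```python
import pandas as pd

# Toy item-item similarity matrix (hypothetical titles and values).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = pd.DataFrame(
    [
        [1.0, 0.7, 0.1, 0.3],
        [0.7, 1.0, 0.2, 0.0],
        [0.1, 0.2, 1.0, 0.5],
        [0.3, 0.0, 0.5, 1.0],
    ],
    index=titles,
    columns=titles,
)

def most_similar(sim_df, title, n=10):
    """Return the n titles most similar to `title`, excluding the title itself."""
    return sim_df[title].drop(labels=title).sort_values(ascending=False).head(n)

print(most_similar(sim, "Movie A", n=2))
```

In the notebook the same function could be called as `most_similar(movie_sim, 'Sin City')`, assuming `movie_sim` is the DataFrame built in the earlier cell.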

# 2. Surprise

In [25]:
!pip install scikit-surprise



In [26]:
df = pd.read_csv('movies.csv')
df

,userId,title,rating,timestamp
0,1,American Pie,4.0,1260759139
1,4,American Pie,4.0,949896114
2,15,American Pie,4.0,1052896867
3,30,American Pie,2.0,994439964
4,34,American Pie,4.0,973747765
...,...,...,...,...
18566,652,Lion of the Desert,5.0,1440269953
18567,652,Mud,5.0,1439586954
18568,659,K-PAX,4.0,836137550
18569,659,Starship Troopers,3.0,834694187


In [27]:
from surprise import Reader, Dataset, KNNBasic, SVD
from surprise.model_selection import train_test_split, cross_validate
import pandas as pd

# Prepare the data for Surprise: the columns must be in (user, item, rating) order
reader = Reader(rating_scale=(0,5))
data = Dataset.load_from_df(df[['userId', 'title', 'rating']], reader=reader)


In [28]:
data.raw_ratings
data.df

,userId,title,rating
0,1,American Pie,4.0
1,4,American Pie,4.0
2,15,American Pie,4.0
3,30,American Pie,2.0
4,34,American Pie,4.0
...,...,...,...
18566,652,Lion of the Desert,5.0
18567,652,Mud,5.0
18568,659,K-PAX,4.0
18569,659,Starship Troopers,3.0


In [29]:
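sim_options = {
    'name': 'cosine',     # similarity measure
    'user_based': True    # user-user similarity; False would give item-item
}
algo = KNNBasic(sim_options=sim_options)

# Split into train and test sets, then fit on the trainset.
trainset, testset = train_test_split(data)
algo.fit(trainset)

# Predict. Item ids in this dataset are titles (strings), so the integer 5 is an
# unknown item: the output reports was_impossible=True and est falls back to the
# trainset's global mean. verbose=True already prints the prediction, so wrapping
# the call in print() shows it twice.
print(algo.predict(1, 5, r_ui=None, verbose=True))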

Computing the cosine similarity matrix...
Done computing similarity matrix.
user: 1          item: 5          r_ui = None   est = 3.54   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
user: 1          item: 5          r_ui = None   est = 3.54   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


In [30]:
prediction = algo.test(testset)
prediction[:5]

[Prediction(uid=635, iid='Scarface', r_ui=3.5, est=4.27509034743391, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=550, iid='L.A. Confidential', r_ui=5.0, est=3.612066845316744, details={'actual_k': 14, 'was_impossible': False}),
 Prediction(uid=598, iid='Terminator 3: Rise of the Machines', r_ui=5.0, est=4.0375, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid=551, iid='The Lost World: Jurassic Park', r_ui=3.0, est=2.626892580838845, details={'actual_k': 8, 'was_impossible': False}),
 Prediction(uid=646, iid='Mulholland Drive', r_ui=5.0, est=3.2059671243338728, details={'actual_k': 7, 'was_impossible': False})]
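The `actual_k` field reports how many neighbors contributed to each estimate. A simplified sketch of the basic KNN prediction rule (a similarity-weighted average of the neighbors' ratings), written in plain Python for illustration; it is not Surprise's own code, and the neighbor values are hypothetical.

```python
def knn_predict(neighbors, k=40):
    """Similarity-weighted average over the k most similar neighbors.

    `neighbors` is a list of (similarity, rating) pairs for users who rated the item.
    Returns (estimate, actual_k), or (None, 0) when no neighbor has positive similarity.
    """
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    used = [(s, r) for s, r in top_k if s > 0]   # only positive similarities contribute
    weight_sum = sum(s for s, _ in used)
    if weight_sum == 0:
        return None, 0
    estimate = sum(s * r for s, r in used) / weight_sum
    return estimate, len(used)

neighbors = [(0.9, 4.0), (0.5, 3.0), (0.1, 5.0)]
print(knn_predict(neighbors, k=2))
```

When fewer than `k` users have rated the item, fewer neighbors are available, which is why `actual_k` in the output above is often below 40.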

- Algorithms supported by Surprise
 - Random (`NormalPredictor`): random predictions
 - Baseline (`BaselineOnly`): fitted with ALS (Alternating Least Squares) or SGD (Stochastic Gradient Descent)
 - Matrix factorization: SVD, SVD++
 - KNNs
 - Slope One, Co-clustering

 - https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html
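For reference, the prediction rules behind two of these families, as described in the Surprise documentation linked above. Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$, $q_i$ are the latent factor vectors of user $u$ and item $i$.

$$
\hat{r}_{ui}^{\text{baseline}} = \mu + b_u + b_i
\qquad
\hat{r}_{ui}^{\text{SVD}} = \mu + b_u + b_i + q_i^{\top} p_u
$$

The SVD model is therefore the baseline estimate plus an interaction term between the user's and the item's latent factors.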

In [31]:
from surprise import BaselineOnly, Dataset, SVD, SVDpp, accuracy
from surprise import KNNBaseline, KNNBasic, KNNWithMeans
from surprise.model_selection import train_test_split
import matplotlib.pyplot as plt


In [32]:
trainset, testset = train_test_split(data, test_size = 0.25)

In [33]:
recom = KNNBasic()  # neighborhood-based CF (KNN)
recom.fit(trainset)
predictions = recom.test(testset)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9660


0.9660342076959553
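RMSE is the square root of the mean squared difference between the actual rating (`r_ui`) and the estimate (`est`). A minimal sketch in plain Python on made-up pairs, with MAE alongside for comparison; the helper names are illustrative.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return math.sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, estimated) rating pairs."""
    return sum(abs(actual - est) for actual, est in pairs) / len(pairs)

# Hypothetical (r_ui, est) pairs.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
print(rmse(pairs), mae(pairs))
```

RMSE penalizes large misses more heavily than MAE, so the two can rank algorithms differently.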

In [34]:
recom = SVD()  # uses matrix factorization (SVD)
recom.fit(trainset)
predictions = recom.test(testset)
accuracy.rmse(predictions)

RMSE: 0.9227


0.9227256700213065

- Run several recommendation algorithms on ml-100k and compare them

In [35]:
# MovieLens 100K
data = Dataset.load_builtin(name='ml-100k', prompt=False)  # other built-in datasets: ml-1m, jester
trainset, testset = train_test_split(data, test_size = 0.25)

In [36]:
algorithms = [KNNBasic, SVD]

algos = []   # algorithm names
rmses = []   # test RMSE per algorithm

for algo_cls in algorithms:
    algo = algo_cls()
    algos.append(algo_cls.__name__)
    algo.fit(trainset)
    predictions = algo.test(testset)
    rmses.append(accuracy.rmse(predictions))

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9841
RMSE: 0.9436
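The collected names and scores can be put side by side in a small table. The sketch below uses the RMSE values printed above; a rerun will give slightly different numbers because `train_test_split` shuffles randomly. The variable names are chosen for this example only.

```python
import pandas as pd

# Values copied from the printed output of the comparison loop.
algo_names = ["KNNBasic", "SVD"]
rmse_values = [0.9841, 0.9436]

# Sort so the best (lowest RMSE) algorithm comes first.
results = pd.DataFrame({"algorithm": algo_names, "rmse": rmse_values}).sort_values("rmse")
best = results.iloc[0]["algorithm"]

print(results.to_string(index=False))
print("lowest RMSE:", best)
```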


- Cross-validation

In [37]:
cross_validate( SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9359  0.9327  0.9416  0.9306  0.9384  0.9358  0.0039  
MAE (testset)     0.7387  0.7361  0.7429  0.7339  0.7374  0.7378  0.0030  
Fit time          2.00    1.89    1.39    1.39    1.39    1.61    0.27    
Test time         0.18    0.30    0.12    0.12    0.12    0.17    0.07    


{'test_rmse': array([0.93587088, 0.93266376, 0.94157984, 0.93060883, 0.93837622]),
 'test_mae': array([0.7386503 , 0.73606601, 0.74287378, 0.73387285, 0.73737316]),
 'fit_time': (1.9993460178375244,
  1.8940515518188477,
  1.3909261226654053,
  1.3902513980865479,
  1.389066457748413),
 'test_time': (0.1810469627380371,
  0.30067944526672363,
  0.11614584922790527,
  0.12479853630065918,
  0.11665487289428711)}
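The returned dictionary holds one score per fold, so the mean and standard deviation shown in the table can be recomputed from it. A minimal sketch with NumPy, using the per-fold values copied from the output above; `cv_results` is a stand-in name for the dictionary returned by `cross_validate`.

```python
import numpy as np

# Per-fold scores copied from the cross_validate output.
cv_results = {
    "test_rmse": np.array([0.93587088, 0.93266376, 0.94157984, 0.93060883, 0.93837622]),
    "test_mae": np.array([0.7386503, 0.73606601, 0.74287378, 0.73387285, 0.73737316]),
}

# Mean and standard deviation across the five folds for each measure.
summary = {name: (scores.mean(), scores.std()) for name, scores in cv_results.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")
```

Reporting the spread across folds alongside the mean shows how much of a difference between two algorithms could be due to the particular split.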