# Collaborative Filtering

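* A method that automatically predicts a user's interests from preference information collected from many users
    * Other products bought by users who bought this product
    * Other movies watched by users who liked this movie
* Recommends items preferred by users whose tastes are similar to user A's
* Achieves strong performance without using any item attributes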

## Types of CF

### Neighborhood-based CF
* User-based
* Item-based

### Model-based CF
* Non-parametric(KNN, SVD)
* Matrix Factorization
* Deep Learning

### Hybrid CF
* Combined with content-based filtering

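## CF Logic
* The goal is to predict user u's rating (or purchase, etc.) for item i
    * Build a user-item matrix from the given data
    * Choose a similarity measure and compute similarities between users or between items
    * Use the known ratings/purchases and the similarities to predict the empty cells (ratings/purchases) of the matrix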
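
The three steps above can be sketched on a tiny, made-up ratings table. The data, the choice of cosine similarity with missing ratings treated as 0, and the names `matrix`, `user_sim`, and `predicted` are illustrative assumptions, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (not taken from any real dataset)
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 20, 30],
    "rating":  [5.0, 3.0, 4.0, 2.0, 4.0, 1.0],
})

# Step 1: user-item matrix; NaN marks a rating that is not known
matrix = ratings.pivot(index="userId", columns="movieId", values="rating")

# Step 2: cosine similarity between users (missing ratings treated as 0)
filled = matrix.fillna(0).to_numpy()
norms = np.linalg.norm(filled, axis=1, keepdims=True)
user_sim = (filled @ filled.T) / (norms @ norms.T)

# Step 3: fill one empty cell with a similarity-weighted average
target_user, target_item = 1, 30
raters = matrix[target_item].dropna()  # users who rated the target item
sims = pd.Series(user_sim[matrix.index.get_loc(target_user)], index=matrix.index)
sims = sims.loc[raters.index]          # similarities to those users only
predicted = float((sims * raters).sum() / sims.sum())

print(round(predicted, 3))
```

User 1 is more similar to user 2 (who gave movie 30 a 2.0) than to user 3 (who gave it a 1.0), so the prediction lands between the two ratings, closer to 2.0.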

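## Characteristics of CF
* Simple to implement and easy to understand
* Scales poorly as the number of items or users keeps growing (Scalability)
* Performance degrades on sparse data (Sparsity)
* Requires less computation than Model-based CF
* Relatively stable even when new users or items are added
* Can recommend new content (users or items)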

### Sparsity
* A user-item matrix built from the given data has mostly empty cells (a sparse matrix)
* As a rule of thumb, apply CF only when sparsity does not exceed 99.5%
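
Sparsity can be checked before building anything else. The helper below is a minimal sketch; the function name `sparsity` and the default column names are assumptions, and it expects at most one row per user-item pair.

```python
import pandas as pd

def sparsity(ratings: pd.DataFrame,
             user_col: str = "userId",
             item_col: str = "movieId") -> float:
    """Fraction of empty cells in the user-item matrix.

    Assumes each (user, item) pair appears at most once in `ratings`.
    """
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    return 1.0 - len(ratings) / (n_users * n_items)

# Hypothetical toy data: 6 known ratings in a 3 x 3 matrix
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 20, 30],
    "rating":  [5.0, 3.0, 4.0, 2.0, 4.0, 1.0],
})
toy_sparsity = sparsity(toy)
print(f"{toy_sparsity:.1%}")
```

For the MovieLens data used in the tutorial below (100,836 ratings from 610 users on 9,724 movies), the same formula gives roughly 98.3%, which is under the 99.5% guideline.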

## Neighborhood-based CF(Memory-based)

### User-based
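* How similar are the items that two users prefer?
* Compute similarities between users, then recommend items preferred by the users most similar to me

* Easy to implement; performance improves as the number of similar neighbors K increases
* Produces more diverse recommendations than Item-based (Diversity)
* More vulnerable to Sparsity and Cold Start
* Performs well with Pearson similarity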
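
As an illustration of the Pearson similarity mentioned above, here is a minimal sketch that compares two users only on the items both have rated. Centering on the mean of the co-rated items is one common variant (others use each user's overall mean), and the function name `pearson_sim` is an assumption.

```python
import numpy as np

def pearson_sim(a, b) -> float:
    """Pearson correlation between two users over co-rated items.

    `a` and `b` are rating vectors of equal length; NaN means "not rated".
    Returns 0.0 when fewer than two items are co-rated or variance is zero.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = ~np.isnan(a) & ~np.isnan(b)   # items rated by both users
    if mask.sum() < 2:
        return 0.0
    x = a[mask] - a[mask].mean()
    y = b[mask] - b[mask].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

# Hypothetical users: u2 rates one point lower than u1 but with the same ordering
u1 = [5.0, 3.0, 4.0, np.nan]
u2 = [4.0, 2.0, 3.0, 5.0]
u3 = [1.0, 5.0, np.nan, 2.0]

print(pearson_sim(u1, u2), pearson_sim(u1, u3))
```

Because the ratings are mean-centered, `u1` and `u2` come out as perfectly correlated even though their absolute ratings differ, which is the property that makes Pearson attractive for user-based CF.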

* Ways to compute the rating
    1. Average: the mean of other users' ratings (similarity is not considered)
    2. Weighted Average: the mean of the ratings, using user-to-user similarity as the weight
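
The difference between the two options shows up clearly on a small example. The ratings and similarity values below are made up for illustration.

```python
import numpy as np

# Hypothetical ratings that three neighbors gave to the same item
neighbor_ratings = np.array([5.0, 3.0, 1.0])
# Hypothetical similarity of each neighbor to the target user
similarities = np.array([0.9, 0.5, 0.1])

# 1. Average: every neighbor counts equally
simple_avg = neighbor_ratings.mean()

# 2. Weighted Average: more similar neighbors count more
weighted_avg = np.average(neighbor_ratings, weights=similarities)

print(simple_avg, round(weighted_avg, 3))
```

The plain average is 3.0, while the weighted average is pulled toward 5.0 because the most similar neighbor gave the highest rating.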

* The problem with absolute ratings
    * Each user applies a different standard when rating
    * Some users rate everything high, while others rate everything low

* Use deviations instead
    * Do not use the absolute rating a user gave an item
    * Instead, use the deviation: how far above or below the user's average rating it is
        * If a user whose average is 2.5 gives a 5, that is a very high rating
        * For a user who rates every item 5, the items are hard to compare with each other
    * predicted rating = user's average rating + predicted deviation
        * $dev(u,i) = r(u,i) - \bar{r_u} \text{ (for a known rating)}$

* Weighted Average with deviation
    * $$\hat{r}(u,i) = \bar{r_u} + \frac{\sum_{u'\in\omega_i}sim(u,u')(r(u',i)-\bar{r_{u'}})}{\sum_{u'\in\omega_i}sim(u,u')}$$
    * Here $\omega_i$ is the set of users who have rated item i
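
A minimal sketch of the weighted average with deviation, following the formula term by term. The function name `predict_with_deviation`, the NaN convention for missing ratings, and the toy data are assumptions; it also assumes non-negative similarities, since the denominator is a plain sum.

```python
import numpy as np

def predict_with_deviation(ratings, sims, user, item):
    """Predict r(user, item) as the user's mean plus a weighted mean deviation.

    ratings: 2D array (users x items), NaN where the rating is unknown
    sims:    2D array (users x users) of non-negative similarities
    """
    user_means = np.nanmean(ratings, axis=1)            # mean rating of each user
    raters = np.where(~np.isnan(ratings[:, item]))[0]   # omega_i: users who rated item
    raters = raters[raters != user]
    weights = sims[user, raters]
    if weights.sum() == 0:
        return float(user_means[user])                  # no usable neighbor
    deviations = ratings[raters, item] - user_means[raters]
    return float(user_means[user] + weights @ deviations / weights.sum())

# Hypothetical data: user 0 has not rated item 2
toy_ratings = np.array([
    [2.0, 3.0, np.nan],   # mean 2.5
    [4.0, 5.0, 5.0],      # generous rater, deviation on item 2 is +1/3
    [1.0, 2.0, 3.0],      # strict rater, deviation on item 2 is +1
])
toy_sims = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.2],
    [0.5, 0.2, 1.0],
])
pred = predict_with_deviation(toy_ratings, toy_sims, user=0, item=2)
print(round(pred, 3))
```

Both neighbors rated the item above their own averages, so the prediction for user 0 is above that user's mean of 2.5, even though the neighbors' absolute ratings (5 and 3) are on very different scales.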

* K-Nearest Neighbor CF
    * Using every user increases computation and can even reduce performance
    * Predict the rating using only the K users in $\omega_i$ who are most similar to user u (KNN)
        * K = 25 to 50 is common, but it is a hyperparameter that needs tuning
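
The neighbor selection can be sketched in a few lines of NumPy. The helper names `top_k_raters` and `knn_predict` and the toy values are assumptions; for brevity the sketch uses the plain weighted average rather than the deviation form.

```python
import numpy as np

def top_k_raters(item_ratings, sims_to_user, k):
    """Indices of the k users who rated the item and are most similar to the target."""
    item_ratings = np.asarray(item_ratings, dtype=float)
    sims_to_user = np.asarray(sims_to_user, dtype=float)
    raters = np.where(~np.isnan(item_ratings))[0]        # users with a known rating
    order = np.argsort(sims_to_user[raters])[::-1][:k]   # most similar first
    return raters[order]

def knn_predict(item_ratings, sims_to_user, k):
    """Similarity-weighted average over the top-k raters (None if weights sum to 0)."""
    item_ratings = np.asarray(item_ratings, dtype=float)
    sims_to_user = np.asarray(sims_to_user, dtype=float)
    top = top_k_raters(item_ratings, sims_to_user, k)
    weights = sims_to_user[top]
    if weights.sum() == 0:
        return None
    return float(weights @ item_ratings[top] / weights.sum())

# Hypothetical data: user 0 is the target and has not rated the item
toy_item_ratings = [np.nan, 2.0, 4.0, 3.0, 5.0]
toy_sims = [1.0, 0.2, 0.8, 0.5, 0.9]

print(top_k_raters(toy_item_ratings, toy_sims, k=2))
print(knn_predict(toy_item_ratings, toy_sims, k=2))
print(knn_predict(toy_item_ratings, toy_sims, k=4))
```

With `k=2` only the two closest raters contribute; with `k=4` the weakly similar users pull the prediction down, which is the effect the K hyperparameter controls.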

### Item-based
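* How similarly have two items been rated by users?
* Compute similarities between items, then recommend items closely related to the ones the user prefers

* Usually outperforms User-based CF in production services
* Item-to-item similarity is more robust
    * Item neighborhoods change far less than user neighborhoods
* Recommendations are much easier to explain
    * "Recommended because it is similar to other items you liked in the past"
* Performs well with Cosine similarity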
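
The tutorial below only implements User-based CF, so here is a minimal Item-based sketch for comparison. The toy matrix `R`, the helper names `item_cosine` and `predict_item_based`, and the convention that 0 means "not rated" are all assumptions.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = not rated
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
    [0.0, 2.0, 4.0],
])

def item_cosine(R):
    """Cosine similarity between item columns (assumes every item has a rating)."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    return (R.T @ R) / (norms.T @ norms)

def predict_item_based(R, item_sim, user, item):
    """Weighted average of the user's own ratings, weighted by item similarity."""
    rated = np.where(R[user] > 0)[0]       # items this user has rated
    rated = rated[rated != item]
    weights = item_sim[item, rated]
    if weights.sum() == 0:
        return float(R[R > 0].mean())      # fall back to the global mean
    return float(weights @ R[user, rated] / weights.sum())

item_sim = item_cosine(R)
pred = predict_item_based(R, item_sim, user=0, item=2)
print(np.round(item_sim, 3))
print(round(pred, 3))
```

The prediction for a user is built only from that user's own past ratings, which is why an Item-based recommendation can be explained as "similar to items you liked before".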

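## Limitations of CF

* Cold Start problem
    * Recommendation quality drops when there is not enough data
    * No recommendation is possible for new users or items with no data at all
* Computational efficiency
    * The number of similarity computations grows with the number of users and items
    * More users and items make predictions more accurate but also make them slower
* Limited long-tail recommendations
    * CF results are usually dominated by a small number of items that many users like
    * Niche items in the long tail are rarely recommended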

## Tutorial

In [20]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split

In [2]:
path = '/Users/yeomyungro/Documents/github/recommendation/'

In [3]:
rating = pd.read_csv(path+'data/ml-latest-small/ratings.csv', encoding='utf-8')
tag = pd.read_csv(path+'data/ml-latest-small/tags.csv', encoding='utf-8')
movie = pd.read_csv(path+'data/ml-latest-small/movies.csv', encoding='utf-8')

In [4]:
rating.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [5]:
user_list = list(rating['userId'].unique())
movie_list = list(rating['movieId'].unique())

In [6]:
print('# of users : ', len(user_list))
print('# of movies : ', len(movie_list))

# of users :  610
# of movies :  9724


In [7]:
train_df, test_df = train_test_split(rating, test_size=0.2, random_state=1990)
print(train_df.shape)
print(test_df.shape)

(80668, 4)
(20168, 4)


In [8]:
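# Keyword arguments are required by recent pandas versions; missing ratings become 0
u_i_mt = train_df.pivot(index='userId', columns='movieId', values='rating').fillna(0)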
u_i_mt

| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 190219 | 190221 | 193565 | 193567 | 193573 | 193579 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 0.0 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 607 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 608 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 609 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### User-based CF

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

In [10]:
user_sim = cosine_similarity(u_i_mt, u_i_mt)

In [11]:
type(user_sim)

numpy.ndarray

In [12]:
user_sim.shape

(610, 610)

In [13]:
user_sim_df = pd.DataFrame(data=user_sim, columns=u_i_mt.index, index=u_i_mt.index)
user_sim_df.head()

| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 0.035851 | 0.072229 | 0.163819 | 0.132198 | 0.141923 | 0.127616 | 0.083753 | 0.06638 | 0.020214 | ... | 0.053327 | 0.174144 | 0.188579 | 0.07795 | 0.135921 | 0.130039 | 0.215591 | 0.213103 | 0.085611 | 0.127099 |
| 2 | 0.035851 | 1.0 | 0.0 | 0.004723 | 0.021234 | 0.033896 | 0.030002 | 0.038496 | 0.0 | 0.033198 | ... | 0.189049 | 0.022387 | 0.015564 | 0.0 | 0.0 | 0.037262 | 0.0 | 0.059655 | 0.033706 | 0.092516 |
| 3 | 0.072229 | 0.0 | 1.0 | 0.002627 | 0.005905 | 0.004848 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.006072 | 0.003735 | 0.028521 | 0.0 | 0.012927 | 0.014702 | 0.002897 | 0.023233 | 0.0 | 0.026953 |
| 4 | 0.163819 | 0.004723 | 0.002627 | 1.0 | 0.078251 | 0.081348 | 0.066388 | 0.02313 | 0.013402 | 0.019947 | ... | 0.062687 | 0.117292 | 0.261713 | 0.045639 | 0.047881 | 0.150434 | 0.108491 | 0.104666 | 0.036454 | 0.096423 |
| 5 | 0.132198 | 0.021234 | 0.005905 | 0.078251 | 1.0 | 0.262141 | 0.130235 | 0.372298 | 0.0 | 0.037666 | ... | 0.083023 | 0.333826 | 0.106464 | 0.20919 | 0.114114 | 0.076921 | 0.158528 | 0.099468 | 0.225357 | 0.039868 |


In [14]:
user_sim_df.shape

(610, 610)

#### Average Rating

In [15]:
# Global mean rating of the train data
global_average = train_df['rating'].mean()

In [16]:
# Average rating prediction over the whole test set

result = []

for _, row in tqdm(test_df.iterrows(), total=len(test_df)):

    user_id = row['userId']
    movie_id = row['movieId']
    true_rating = row['rating']  # renamed so it does not shadow the `rating` DataFrame

    if movie_id in u_i_mt.columns and user_id in u_i_mt.index:

        # Mean of the train ratings given to this movie (similarity is ignored)
        u_i_mt_temp = u_i_mt[u_i_mt[movie_id] > 0][[movie_id]]
        rating_sum = u_i_mt_temp[movie_id].sum()
        n_users = len(u_i_mt_temp)
        predicted_rating = rating_sum / n_users

        result.append([user_id, movie_id, true_rating, predicted_rating])

    else:
        # Fall back to the global mean for users or movies unseen in the train data
        result.append([user_id, movie_id, true_rating, global_average])


result_df = pd.DataFrame(result, columns=['userId', 'movieId', 'rating', 'predictedRating'])

In [17]:
result_df

| | userId | movieId | rating | predictedRating |
|---|---|---|---|---|
| 0 | 64.0 | 2291.0 | 4.0 | 3.656716 |
| 1 | 414.0 | 185.0 | 2.0 | 3.000000 |
| 2 | 60.0 | 1242.0 | 4.0 | 4.111111 |
| 3 | 230.0 | 1682.0 | 2.5 | 3.779412 |
| 4 | 431.0 | 3948.0 | 3.0 | 3.450000 |
| ... | ... | ... | ... | ... |
| 20163 | 480.0 | 196.0 | 1.0 | 2.694444 |
| 20164 | 387.0 | 4881.0 | 3.5 | 3.769231 |
| 20165 | 318.0 | 1273.0 | 4.0 | 4.187500 |
| 20166 | 474.0 | 832.0 | 2.0 | 3.344828 |


In [18]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [46]:
# Evaluate performance

# Compute RMSE
avg_mse = mean_squared_error(y_true = result_df['rating'].values, y_pred = result_df['predictedRating'].values)
avg_rmse = np.sqrt(avg_mse)

# Compute MAE
avg_mae = mean_absolute_error(y_true = result_df['rating'].values, y_pred = result_df['predictedRating'].values)

print("rmse", avg_rmse)
print("mae", avg_mae)

rmse 0.9837487535894798
mae 0.758933100235046


####  Weighted Average

In [25]:
row = test_df.iloc[0, :]

user_id = row['userId']
movie_id = row['movieId']
true_rating = row['rating']  # renamed so it does not shadow the `rating` DataFrame

In [26]:
movie_id

2291.0

In [28]:
movie_ratings = u_i_mt[u_i_mt[movie_id]>0][movie_id]
user_ids = movie_ratings.index
user_ids

Int64Index([  1,  18,  33,  47,  51,  57,  66,  80,  91, 100, 103, 114, 122,
            153, 177, 182, 187, 200, 202, 212, 226, 230, 232, 246, 274, 287,
            288, 292, 301, 307, 311, 313, 330, 332, 357, 362, 365, 368, 381,
            387, 391, 392, 401, 414, 424, 425, 428, 434, 448, 450, 474, 477,
            480, 483, 489, 509, 534, 560, 561, 590, 593, 597, 599, 600, 603,
            605, 606],
           dtype='int64', name='userId')

In [32]:
sim = user_sim_df[user_ids].loc[user_id]

In [35]:
sim

userId
1      0.248962
18     0.257794
33     0.186385
47     0.135247
51     0.220862
         ...   
599    0.308934
600    0.322104
603    0.233522
605    0.147796
606    0.251187
Name: 64, Length: 67, dtype: float64

In [37]:
np.inner(movie_ratings.values, sim)

61.287749701938225

In [40]:
sim.sum()

16.682378942802867

In [41]:
result = []

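for _, row in tqdm(test_df.iterrows(), total=len(test_df)):

    user_id = row['userId']
    movie_id = row['movieId']
    true_rating = row['rating']  # renamed so it does not shadow the `rating` DataFrame

    if movie_id in u_i_mt.columns and user_id in u_i_mt.index:

        # Ratings from users who rated this movie in the train data
        movie_ratings = u_i_mt[u_i_mt[movie_id]>0][movie_id]
        user_ids = movie_ratings.index

        # Similarity between the target user and each of those users
        sim = user_sim_df[user_ids].loc[user_id]

        # Similarity-weighted sum of ratings and the normalizing sum of weights
        rating_sum = np.inner(movie_ratings.values, sim)
        sim_sum = sim.sum()

        # Rows whose similarities sum to zero are skipped, so no prediction is recorded for them
        if sim_sum == 0:
            continue
        predicted_rating = rating_sum / sim_sum

        result.append([user_id, movie_id, true_rating, predicted_rating])
    else:
        # Fall back to the global mean for users or movies unseen in the train data
        result.append([user_id, movie_id, true_rating, global_average])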
    
weighted_average_result_df = pd.DataFrame(result, columns=['userId', 'movieId', 'rating', 'predictedRating'])

In [47]:
# Evaluate performance

# Compute RMSE
w_mse = mean_squared_error(y_true = weighted_average_result_df['rating'].values,
                         y_pred = weighted_average_result_df['predictedRating'].values)
w_rmse = np.sqrt(w_mse)

# Compute MAE
w_mae = mean_absolute_error(y_true = weighted_average_result_df['rating'].values,
                          y_pred = weighted_average_result_df['predictedRating'].values)

print("rmse", w_rmse)
print("mae", w_mae)

rmse 0.9779688128133676
mae 0.7541945124419374


#### K-Nearest Neighbor CF

In [45]:
# Weighted average rating over the whole test set, using only the k most similar users

k = 20

result = []

for _, row in tqdm(test_df.iterrows(), total=len(test_df)):

    user_id = row['userId']
    movie_id = row['movieId']
    true_rating = row['rating']  # renamed so it does not shadow the `rating` DataFrame

    if movie_id in u_i_mt.columns and user_id in u_i_mt.index:

        # Get the other users who rated this movie
        movie_ratings = u_i_mt[u_i_mt[movie_id] > 0][movie_id]
        user_ids = movie_ratings.index

        # Among those users, keep only the k users most similar to the target user
        sim_k = user_sim_df[user_ids].loc[user_id].sort_values(ascending=False)[:k]
        movie_ratings_k = movie_ratings[sim_k.index]

        # Predict the target user's rating from the similarities and ratings of those k users
        rating_sum = np.inner(movie_ratings_k.values, sim_k)
        sim_sum = sim_k.sum()

        # Fall back to the global mean when the similarities sum to zero
        if sim_sum == 0:
            result.append([user_id, movie_id, true_rating, global_average])
            continue

        predicted_rating = rating_sum / sim_sum

        result.append([user_id, movie_id, true_rating, predicted_rating])

    else:
        # Fall back to the global mean for users or movies unseen in the train data
        result.append([user_id, movie_id, true_rating, global_average])
    
k_weighted_average_result_df = pd.DataFrame(result, columns=['userId', 'movieId', 'rating', 'predictedRating'])

In [48]:
# Evaluate performance

# Compute RMSE
k_mse = mean_squared_error(y_true = k_weighted_average_result_df['rating'].values,
                         y_pred = k_weighted_average_result_df['predictedRating'].values)
k_rmse = np.sqrt(k_mse)

# Compute MAE
k_mae = mean_absolute_error(y_true = k_weighted_average_result_df['rating'].values,
                          y_pred = k_weighted_average_result_df['predictedRating'].values)

print("rmse", k_rmse)
print("mae", k_mae)

rmse 0.9775970407807314
mae 0.7531149381016463


In [54]:
comp_result = pd.DataFrame({'type':['Average', 'Weighted', 'KNN'],
                            'rmse':[avg_rmse, w_rmse, k_rmse], 
                            'mae':[avg_mae, w_mae, k_mae]})
comp_result

| | type | rmse | mae |
|---|---|---|---|
| 0 | Average | 0.983749 | 0.758933 |
| 1 | Weighted | 0.977969 | 0.754195 |
| 2 | KNN | 0.977597 | 0.753115 |
