**Lab 6 : Collaborative Filtering**
- Instuctor : Kijung Shin
- Teaching Assistants : Hyunju Lee(main), Deukryeol Yoon, Shinhwan Kang 
- 본 실습에서는 추천 시스템의 여러가지 방법 중 Collaborative Filtering을 구현해본다.

In [1]:
import numpy as np 
import pandas as pd
from sklearn.metrics import mean_squared_error

In [2]:
import os, sys 
from google.colab import drive 

### 해당 코드 실행 시 colab에서 실행중인 폴더의 /content/drive/My Drive가 구글 드라이브에 연결됨

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.getcwd()

'/content'

**Dataset Loading and Preprocessing**
- 100,000개의 평점으로 구성된 MovieLens 데이터셋 사용

In [3]:
# 데이터셋 불러오기(MovieLens 100k)
df_ratings = pd.read_csv('drive/MyDrive/data/others/ratings.csv')

# 평점 데이터셋 형태 확인
print("### Rating Dataset Format ###", end='\n\n')
print(df_ratings.head(), end='\n\n\n')
df_ratings.drop(['timestamp'], axis=1, inplace=True)


df_movies = pd.read_csv('drive/MyDrive/data/others/movies.csv')

# 영화 데이터셋 형태 확인
print("### Movie Dataset Format ###", end = '\n\n')
print("Columns of Movie Dataset : ",df_movies.columns, end = '\n\n')
print(df_movies.head())

### Rating Dataset Format ###

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


### Movie Dataset Format ###

Columns of Movie Dataset :  Index(['movieId', 'title', 'genres'], dtype='object')

   movieId  ...                                       genres
0        1  ...  Adventure|Animation|Children|Comedy|Fantasy
1        2  ...                   Adventure|Children|Fantasy
2        3  ...                               Comedy|Romance
3        4  ...                         Comedy|Drama|Romance
4        5  ...                                       Comedy

[5 rows x 3 columns]


**Add Your OWN Ratings**

movie.csv 파일에서 시청했던 영화 제목와 id를 확인하고, 여러분의 rating을 추가해 보세요.

In [4]:
### Add Your Own Data ### 
# 모든 영화를 같은 평점을 주지 않도록 주의 #

###################################### Example #################################################
# User 800 is a HUGE fan of Harry Potter
rows = []                               # row = [user_id, movie_id, rating]
user_id = 800
rows.append([user_id, 1, 4])        # movie     1: Toy Story(1995)
rows.append([user_id, 4896, 4])     # movie  4896: Harry Potter and the Socerer's Stone 
rows.append([user_id, 5816, 5])     # movie  5896: Harry Potter and the Chamber of Secrets
rows.append([user_id, 69844, 5])    # movie 69844: Harry Potter and the Half-Blood Prince(2009)
rows.append([user_id, 12, 1])       # movie    12: Dracula: Dead and Loving It(1995)
rows.append([user_id, 177, 1])      # movie   177: Lord of Illusions(1995)
##################################################################################################
########################### Add Your Own Ratings using 'movie.csv' data #########################
# my_rows = []
# my_id = 2021
# rows.append([user_id, ,])       # Fill your movie id and rating     
# rows.append([user_id, ,])       # 여러분이 평가할 영화의 id와 점수를 입력하세요.
# rows.append([user_id, ,])

##################################################################################################
for row in rows:
    df_ratings = df_ratings.append(pd.Series(row, index=df_ratings.columns), ignore_index=True)
print(df_ratings)

        userId  movieId  rating
0            1        1     4.0
1            1        3     4.0
2            1        6     4.0
3            1       47     5.0
4            1       50     5.0
...        ...      ...     ...
100837     800     4896     4.0
100838     800     5816     5.0
100839     800    69844     5.0
100840     800       12     1.0
100841     800      177     1.0

[100842 rows x 3 columns]


In [5]:
# Dataset의 User, Movie 수 확인
# 평점 데이터의 userId column에 대한 unique한 값들의 개수
n_users = df_ratings.userId.unique().shape[0]
n_items = df_ratings.movieId.unique().shape[0]
print("num users: {}, num items:{}".format(n_users, n_items))

num users: 611, num items:9724


In [6]:
# 데이터 전처리 
# user id, movie id의 범위를 (0 ~ 사용자 수 -1), (0 ~ 영화 수 -1) 사이로 맞춰줌.

# 후속 단계에서의 편리성을 위해
# 데이터에서 사용된 식별자: 아이디 -> 새로 부여한 식별자: 인덱스

user_dict = dict()      # {user_id : user_idx}, user_id : original data에서 부여된 user의 id, user_idx : 새로 부여할 user의 id
movie_dict = dict()     # {movie_id: movie_idx}, movie_id : original data에서 부여된 movie의 id, movie_idx: 새로 부여할 movie의 id
user_idx = 0
movie_idx = 0
ratings = np.zeros((n_users, n_items))
for row in df_ratings.itertuples(index=False):
    user_id, movie_id, _ = row
    if user_id not in user_dict:  # user_id가 처음 나타남
        user_dict[user_id] = user_idx  # 새로운 index mapping 해줌
        user_idx += 1
    if movie_id not in movie_dict:  # movie_id 처음 나타남
        movie_dict[movie_id] = movie_idx  # 새로운 index로 mapping
        movie_idx += 1
    ratings[user_dict[user_id], movie_dict[movie_id]] = row[2]
user_idx_to_id = {v: k for k, v in user_dict.items()}

movie_idx_to_name=dict()
movie_idx_to_genre=dict()
for row in df_movies.itertuples(index=False):
    movie_id, movie_name, movie_genre = row
    if movie_id not in movie_dict:              # 어떤 영화가 rating data에 없는 경우 skip
        continue
    movie_idx_to_name[movie_dict[movie_id]] = movie_name 
    movie_idx_to_genre[movie_dict[movie_id]] = movie_genre

**Collaborative Filtering 사용 함수**




In [7]:
######################################################################################################################################
# Training Set과 Test Set을 분리해 주는 함수
######################################################################################################################################
def train_test_split(ratings):
    test = np.zeros_like(ratings)
    train = ratings.copy()
    for x in range(ratings.shape[0]):  # 각 사용자 x 순회
        nonzero_idx = ratings[x, :].nonzero()[0]  # 사용자가 몇 개의 영화에 평점 매겼는지 계산
        test_ratings = np.random.choice(nonzero_idx, 
                                        size=int(len(nonzero_idx)/5),  # 평점의 수의 1/5만큼 랜덤하게 뽑는다
                                        replace=False)
        train[x, test_ratings] = 0.  # 학습 데이터에서 지운다
        test[x, test_ratings] = ratings[x, test_ratings]  # 평점 데이터에서 가져와서 평가 데이터에 저장
        
    assert(np.all((train * test) == 0))     # train set과 test set이 완전히 분리되었는지 확인

    # 5 : 1 비율로 나눈다
    return train, test


######################################################################################################################################
# Pearson 상관계수를 계산하기 위해 평균 값을 빼줌.
# (유저별로 평점을 주는 기준이 다를 수 있으므로, 유저 별 평균 평점 값을 실제 평점 값에서 빼준다)
######################################################################################################################################
# 사용자간 취향의 유사도 계산 - 상관 계수 사용
# 상관 계수에는 (각 평점 - 평균 평점) 형태의 정규화된 평점이 많이 사용된다
def subtract_mean(ratings):

    mean_subtracted_ratings = np.zeros_like(ratings)
    
    # i는 각 사용자
    for i in range(ratings.shape[0]):
        nonzero_idx = ratings[i].nonzero()[0]                       # rating 값이 존재하는(0이 아닌) index 추출
        
        # 평점의 합계
        sum_ratings = np.sum(ratings[i])        
        
        # 각 사용자가 매긴 평점의 수
        num_nonzero = len(nonzero_idx)

        # 사용자 i의 평점 평균
        avg_rating = sum_ratings / num_nonzero                      # rating 값들의 평균값 계산
        
        # 평점 하나도 없는 것에 대한 예외 처리
        if num_nonzero == 0: 
            print("No Rating: ", i)                    
            avg_rating = 0
        
        # 정규화된 평균 평점: (원본 평점 - 평균 평점) 
        mean_subtracted_ratings[i, nonzero_idx] = ratings[i, nonzero_idx] - avg_rating 
                                                                    # 원 rating matrix에서 평균 값을 빼줌

    return mean_subtracted_ratings

######################################################################################################################################
# 두 rating의 Pearson Correlation을 값으로 갖는 similarity matrix를 생성하여 return해주는 함수
######################################################################################################################################
# 사용자간 취향의 유사도 계산
def collaborative_filtering(ratings):
    similarity = np.zeros((ratings.shape[0], ratings.shape[0]))                             # user-user collaborative filtering : (num_user, num_user)
    num_r, num_c = ratings.shape  # num_users, num_items
    ############################ Fill in Your Code ###############################
    for i in range (num_r):  # 사용자 i
        for j in range(i+1, num_r):  # 사용자 j
            sum_i = 0
            sum_j = 0
            dot_product = 0
            for k in range(num_c):  # 함께 본 영화 k 순회
                if ratings[i,k] !=0 and ratings[j,k] != 0:  # 함께 봤는지 판단
                    sum_i += ratings[i,k]**2
                    sum_j += ratings[j,k]**2
                    dot_product += ratings[i,k] * ratings[j,k]
                
            if dot_product!=0 :  # sim(x, y) 계산
                similarity[i,j] = dot_product / sqrt(sum_i) / sqrt(sum_j)
                similarity[j,i] = similarity[i,j]  # (i, j)와 (j, i)는 같다
            print("i:{}, j:{}".format(i,j))
    ################################################################################
    return similarity

######################################################################################################################################
# collaborative filtering을 통해 구한 similarity matrix와 주어진 rating matrix를 사용하여 rating을 예측하는 함수
# 주어진 유저(영화)와의 Pearson Correlation이 양수인 유저(영화) 중, 
# 본인을 제외한 top k개의 rating을 similarity에 따라 weighted sum해주어 점수를 예측
######################################################################################################################################
# 유사도를 사용한 가중 평균을 통해, 사용자-영화 쌍 각각에 대해 점수를 추정
def predict(ratings, similarity, k=10):

    pred = np.zeros(ratings.shape)

    ############################ Fill in Your Code ###############################

    # 사용자 u의 영화 i에 대한 평점 예측
    for u in range(ratings.shape[0]):  # 사용자 순회
        for i in range(ratings.shape[1]):  # 영화 순회
            watched_i = ratings[:,i].nonzero()[0]  # 해당 영화을 본 사용자들의 목록                                 # 영화 i를 본 user들을 추출
            if u in watched_i:
                watched_i = np.setdiff1d(watched_i, u)                             # 본인은 제외
            
            similarity_u = similarity[u, watched_i]                                # 영화 i를 본 user들의 유사도를
            similar_idx = np.argsort(similarity_u)[::-1]                           # 높은 순으로 정렬
            similar_idx = similar_idx[:k]                                          # 유사도가 가장 높은 k개의 index만 추출     
            similar_idx = np.where(similarity_u[similar_idx] > 0)[0]               # 양수값을 갖는 유사도만 사용

            # k명에 대한 유사도 합 - 분모 계산
            sum_similarity = np.sum(similarity[u, similar_idx])                    # 0/0 = nan 문제 피하기 위해
            if sum_similarity == 0:
                sum_similarity = 1

            pred[u, i] = np.sum(similarity[u, similar_idx].reshape([-1, 1]) * ratings[similar_idx, i]) / sum_similarity

  
    return pred

######################################################################################################################################
# Test Score와 Predicted Score의 Mean Squared Error를 계산
######################################################################################################################################
# 평가 데이터와 추정한 점수를 비교, MSE를 계산
def get_mse(pred, actual):
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return mean_squared_error(pred, actual)



######################################################################################################################################
# 특정 user와 유사한 영화를 추천
######################################################################################################################################
def recommend(watched_rating, pred, user_id, user_dict, movie_idx_to_name, movie_idx_to_genre):
    movies_in_order = np.argsort(pred[user_dict[user_id]])[::-1]
    watched_movie = watched_rating[user_dict[user_id]].nonzero()[0]
    cnt = 0
    ##################################### Fill in Your Code ##########################################################################
    for movie in movies_in_order:
        if pred[user_dict[user_id], movie] == 0:
            if cnt== 0:
                print("### Cannot Recommend a Movie : All Input Ratings Have Same Value ###")
            break
        if movie in watched_movie: continue
        cnt += 1 
        print("### Top {} Movie for User {} : {} \t Genre: {} ###".format(cnt, user_id, movie_idx_to_name[movie], movie_idx_to_genre[movie]))
        if cnt == 5: break
    #####################################################################################################################################

**Collaborative Filtering to our Data**

제공된 데이터에 collaborative filtering을 적용해 봅니다.

In [8]:
train_ratings, test_ratings = train_test_split(ratings)

In [9]:
# 유저별로 평점을 주는 기준이 다를 수 있으므로, 유저 별 평균 평점 값을 실제 평점 값에서 빼준다
mean_subtracted_ratings = subtract_mean(train_ratings)

In [None]:
### It may take a long time ### 

similarity = collaborative_filtering(mean_subtracted_ratings) 

[1;30;43m스트리밍 출력 내용이 길어서 마지막 5000줄이 삭제되었습니다.[0m
i:510, j:561
i:510, j:562
i:510, j:563
i:510, j:564
i:510, j:565
i:510, j:566
i:510, j:567
i:510, j:568
i:510, j:569
i:510, j:570
i:510, j:571
i:510, j:572
i:510, j:573
i:510, j:574
i:510, j:575
i:510, j:576
i:510, j:577
i:510, j:578
i:510, j:579
i:510, j:580
i:510, j:581
i:510, j:582
i:510, j:583
i:510, j:584
i:510, j:585
i:510, j:586
i:510, j:587
i:510, j:588
i:510, j:589
i:510, j:590
i:510, j:591
i:510, j:592
i:510, j:593
i:510, j:594
i:510, j:595
i:510, j:596
i:510, j:597
i:510, j:598
i:510, j:599
i:510, j:600
i:510, j:601
i:510, j:602
i:510, j:603
i:510, j:604
i:510, j:605
i:510, j:606
i:510, j:607
i:510, j:608
i:510, j:609
i:510, j:610
i:511, j:512
i:511, j:513
i:511, j:514
i:511, j:515
i:511, j:516
i:511, j:517
i:511, j:518
i:511, j:519
i:511, j:520
i:511, j:521
i:511, j:522
i:511, j:523
i:511, j:524
i:511, j:525
i:511, j:526
i:511, j:527
i:511, j:528
i:511, j:529
i:511, j:530
i:511, j:531
i:511, j:532
i:511, j:533
i:511, j:534
i

In [None]:
### It may take a long time ### 

predicted_ratings = predict(train_ratings, similarity)

In [None]:
### Fill your own user id and test your result! ### 
user_id = 800
recommend(train_ratings, predicted_ratings, user_id, user_dict, movie_idx_to_name, movie_idx_to_genre)


### Top 1 Movie for User 800 : Silence of the Lambs, The (1991) 	 Genre: Crime|Horror|Thriller ###
### Top 2 Movie for User 800 : Fugitive, The (1993) 	 Genre: Thriller ###
### Top 3 Movie for User 800 : Beauty and the Beast (1991) 	 Genre: Animation|Children|Fantasy|Musical|Romance|IMAX ###
### Top 4 Movie for User 800 : Fargo (1996) 	 Genre: Comedy|Crime|Drama|Thriller ###
### Top 5 Movie for User 800 : Pulp Fiction (1994) 	 Genre: Comedy|Crime|Drama|Thriller ###


In [None]:
# Grid Search of Hyperparameters
########### 굉장히 오래 걸릴 수 있습니다 #############
hyperparams = [2, 5, 10, 20, 30]
min_mse = 9999
for k in hyperparams:
    predicted_ratings = predict(train_ratings, similarity, k=k)
    mse = get_mse(predicted_ratings, test_ratings)
    print("### When k is {} : MSE = {}".format(k, mse))
    if mse < min_mse:
        min_mse = mse
        best_k = k
        best_predicted_ratings = predicted_ratings

print("### Best k Value : ", best_k)


### When k is 2 : MSE = 11.378842585627602
### When k is 5 : MSE = 11.47807281480367
### When k is 10 : MSE = 16.911865001755178
### When k is 20 : MSE = 67.25348528158062
### When k is 30 : MSE = 150.6101750162981
### Best k Value :  2


In [None]:
## Fill your own user id ### 
user_id = 800
recommend(train_ratings, best_predicted_ratings, user_id, user_dict, movie_idx_to_name, movie_idx_to_genre)

### Top 1 Movie for User 800 : Gladiator (2000) 	 Genre: Action|Adventure|Drama ###
### Top 2 Movie for User 800 : Tommy Boy (1995) 	 Genre: Comedy ###
### Top 3 Movie for User 800 : Star Wars: Episode V - The Empire Strikes Back (1980) 	 Genre: Action|Adventure|Sci-Fi ###
### Top 4 Movie for User 800 : Clockwork Orange, A (1971) 	 Genre: Crime|Drama|Sci-Fi|Thriller ###
### Top 5 Movie for User 800 : Duck Soup (1933) 	 Genre: Comedy|Musical|War ###


In [None]:
### Test your own result ### 
user_id = my_id
recommend(train_ratings, best_predicted_ratings, user_id, user_dict, movie_idx_to_name, movie_idx_to_genre)

NameError: ignored