**Lab 8 : Latent Factor based Recommendation System**
- Instuctor : Kijung Shin
- Teaching Assistants : Hyunju Lee(main), Deukryeol Yoon, Shinhwan Kang 
- 본 실습에서는 추천 시스템에서 가장 흔히 쓰이는 surprise library를 사용하여 Latent Factor 기반 추천 시스템을 구현해본다.

In [1]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
[?25l  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
[K     |████████████████████████████████| 11.8MB 349kB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1615294 sha256=deb38875924420b8460b87e316df7c906cad097255310fd28a0a3790c84a5532
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [4]:
import numpy as np 
import pandas as pd
from surprise import SVD  # 잠재 인수 모형
from surprise.model_selection import train_test_split
from surprise.dataset import DatasetAutoFolds
from surprise.model_selection import cross_validate
from surprise import Dataset, Reader
from surprise import accuracy

In [3]:
import os, sys 
from google.colab import drive 

### 해당 코드 실행 시 colab에서 실행중인 폴더의 /content/drive/My Drive가 구글 드라이브에 연결됨

drive.mount('/content/drive')

Mounted at /content/drive


**Dataset Loading**

- 여기서는 100,000개의 평점으로 구성된 MovieLens 데이터셋 사용

In [5]:
#### 데이터셋 불러오기(MovieLens 10k) ####
df_ratings = pd.read_csv('drive/MyDrive/data/others/ratings.csv')

#### 평점 데이터셋 형태 확인#### 
# surprise library의 Reader 사용 시 반드시 사용자-아이템-평점 순으로 정보가 들어가 있어야 함

print("### Rating Dataset Format ###", end='\n\n')
print(df_ratings.head(), end='\n\n\n')
df_ratings.drop(['timestamp'], axis=1, inplace=True)
print("### Rating Dataset - Timestamp Removed ###", end='\n\n')
print(df_ratings)


### Rating Dataset Format ###

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


### Rating Dataset - Timestamp Removed ###

        userId  movieId  rating
0            1        1     4.0
1            1        3     4.0
2            1        6     4.0
3            1       47     5.0
4            1       50     5.0
...        ...      ...     ...
100831     610   166534     4.0
100832     610   168248     5.0
100833     610   168250     5.0
100834     610   168252     5.0
100835     610   170875     3.0

[100836 rows x 3 columns]


In [6]:
df_movies = pd.read_csv('drive/MyDrive/data/others/movies.csv')

#### 영화 데이터셋 형태 확인 ####
print("### Movie Dataset Format ###", end = '\n\n')
print("Columns of Movie Dataset : ",df_movies.columns, end = '\n\n')
print(df_movies.head())

### Movie Dataset Format ###

Columns of Movie Dataset :  Index(['movieId', 'title', 'genres'], dtype='object')

   movieId  ...                                       genres
0        1  ...  Adventure|Animation|Children|Comedy|Fantasy
1        2  ...                   Adventure|Children|Fantasy
2        3  ...                               Comedy|Romance
3        4  ...                         Comedy|Drama|Romance
4        5  ...                                       Comedy

[5 rows x 3 columns]


In [7]:
#### Dataset의 User, Movie 수 확인 ####
n_users = df_ratings.userId.unique().shape[0]  # userID column에 고유한 값 몇개 있는지 확인
n_items = df_ratings.movieId.unique().shape[0]
print("num users: {}, num items:{}".format(n_users, n_items))

num users: 610, num items:9724


In [8]:
### Add Your Own Data ### 

###################################### Example 1#################################################
# User 800 is a HUGE fan of Musical Movies
rows = []                               # row = [user_id, movie_id, rating]
user_id = 800
rows.append([user_id, 73, 5])        # movie    73: Miserables, Les (1995)
rows.append([user_id, 107780, 5])     # movie  107780: Cats(1998) 
rows.append([user_id, 588, 5])     # movie  588: Aladin(1992)
rows.append([user_id, 60397, 5])    # movie 69397: Mamma Mia!(2008)
rows.append([user_id, 99149, 5])    # movie 99149: Miserables, Les (2012)
rows.append([user_id, 138186, 1])    # movie 138186: Sorrow(2015)
rows.append([user_id, 1997, 1])    # movie 1997: Scream 2 (1991)

##################################################################################################

###################################### Example 2#################################################
# User 900 is a HUGE fan of Animation Movies
rows = []                               # row = [user_id, movie_id, rating]
user_id = 900
rows.append([user_id, 1022, 5])        # movie    1022: Cinderella(1950)
rows.append([user_id, 594, 5])     # movie  594: Snow White and the Seven Dwarfs(1937) 
rows.append([user_id, 106696, 5])     # movie  106696: Frozen(2013)
rows.append([user_id, 166461, 5])    # movie 166461: Moana(2016)
rows.append([user_id, 595, 5])    # movie 595: Beauty and the Beast (1991)
rows.append([user_id, 138168, 1])    # movie 138168: Sorrow(2015)
rows.append([user_id, 1997, 1])    # movie 1997: Scream 2 (1991)

##################################################################################################


########################### Add Your Own Ratings using 'movie.csv' data #########################
# my_rows = []
# my_id = 2021
# rows.append([user_id, ,])       # Fill your movie id and rating     
# rows.append([user_id, ,])       # 여러분이 평가할 영화의 id와 점수를 입력하세요.
# rows.append([user_id, ,])
# rows.append([user_id, ,])
# rows.append([user_id, ,])

##################################################################################################
for row in rows:
    df_ratings = df_ratings.append(pd.Series(row, index=df_ratings.columns), ignore_index=True)
print(df_ratings)

        userId  movieId  rating
0            1        1     4.0
1            1        3     4.0
2            1        6     4.0
3            1       47     5.0
4            1       50     5.0
...        ...      ...     ...
100838     900   106696     5.0
100839     900   166461     5.0
100840     900      595     5.0
100841     900   138168     1.0
100842     900     1997     1.0

[100843 rows x 3 columns]


In [9]:
#### Dataset의 User, Movie 수 확인 ####
n_users = df_ratings.userId.unique().shape[0]
n_items = df_ratings.movieId.unique().shape[0]
print("num users: {}, num items:{}".format(n_users, n_items))

num users: 611, num items:9725


In [10]:
#### Get Movid Name from Movie ID - 영화 ID로부터 영화 제목 얻기 ###

movie_set = set()     
ratings = np.zeros((n_users, n_items))
for (_, movie_id, _) in df_ratings.itertuples(index=False):
    movie_set.add(movie_id)

movie_id_to_name=dict()
movie_id_to_genre=dict()

for (movie_id, movie_name, movie_genre) in df_movies.itertuples(index=False):
    if movie_id not in movie_set:              # 어떤 영화가 rating data에 없는 경우 skip
        continue
    movie_id_to_name[movie_id] = movie_name 
    movie_id_to_genre[movie_id] = movie_genre

    


- 훈련 데이터와 평가 데이터 분리

In [11]:
#### pandas dataframe을 surprise dataset 형태로 바꿔준 후, train set과 test set을 split 해준다 ####
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df_ratings[['userId','movieId','rating']], reader=reader)

train, test = train_test_split(data, test_size=0.2, shuffle=True)

print(type(data))
print(type(train))

##################################################################################
## Grid Search를 위해 surprise.trainset 형태의 데이터를 surprise.dataset으로 변경해준다
iterator = train.all_ratings()
train_df = pd.DataFrame(columns=['userId', 'movieId', 'rating'])
i = 0
for (uid, iid, rating) in iterator:
    train_df.loc[i] = [train.to_raw_uid(int(uid)), train.to_raw_iid(iid), rating]
    i = i+1

train_data = Dataset.load_from_df(train_df, reader=reader)

print(type(train))
print(type(train_data))
##################################################################################


<class 'surprise.dataset.DatasetAutoFolds'>
<class 'surprise.trainset.Trainset'>
<class 'surprise.trainset.Trainset'>
<class 'surprise.dataset.DatasetAutoFolds'>


**모델 설정 및 학습**
- 하이퍼파리미터 탐색한 뒤, 잠재 인수 모형 학습

In [12]:
### Hyperparameter Grid Search ### 

from surprise.model_selection import GridSearchCV
param_grid = {'n_factors': [10,15,20,30,50,100]} # 임베딩 공간의 차원

####### Fill in Your Code ##########
# (추천 시스템 모형, 탐색 범위, 탐색 기준, cross validation 데이터 분할)
grid = GridSearchCV(SVD, param_grid, measures = ['rmse', 'mae'], cv=4)
grid.fit(train_data)
#####################################


print(grid.best_score['rmse'])
print(grid.best_params['rmse'])

0.879623243955389
{'n_factors': 15}


In [13]:
### Use the Hyperparameter with best performance ###

print(grid.best_params)
################ Fill in Your Code #################

# 학습한 잠재 인수 모형
algorithm = SVD(grid.best_params['rmse']['n_factors'])
algorithm.fit(train)
####################################################

{'rmse': {'n_factors': 15}, 'mae': {'n_factors': 15}}


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f6bc06c9dd0>

**모델 예측**

In [14]:
##### algorithm prediction #####


prediction = algorithm.test(test)
for p in prediction[:5]:            # prediction 결과값 5개 미리보기
    print(p)                        # r_ui : 실제 rating 값, est: 예측된 rating 값
    

user: 599        item: 27722      r_ui = 3.50   est = 2.66   {'was_impossible': False}
user: 599        item: 7285       r_ui = 3.00   est = 2.90   {'was_impossible': False}
user: 555        item: 65         r_ui = 3.00   est = 2.87   {'was_impossible': False}
user: 596        item: 1210       r_ui = 4.00   est = 3.89   {'was_impossible': False}
user: 237        item: 7980       r_ui = 3.50   est = 3.39   {'was_impossible': False}


In [15]:
#### 특정 user, 특정 item에 대한 prediction 값 ###
uid = 800
iid = 8368
prediction_user_item = algorithm.predict(uid, iid)
print(prediction_user_item)     

user: 800        item: 8368       r_ui = None   est = 4.10   {'was_impossible': False}


In [16]:
##############################################################
##### 해당 user가 아직 보지 않은 영화를 return해주는 함수#####
##############################################################
def get_unseen_movies(data, user_id):

    watched_movies = set()
    total_movies = set()
    ########### Fill in Your Code #################
    for (uid, iid, rating) in data.all_ratings():
        
        total_movies.add(iid)
        if uid == user_id:
            watched_movies.add(iid)
    
    unseen_movies = total_movies - watched_movies
    ##################################################
    return unseen_movies
    # return total_movies

- 시청하지 않은 영화 중에 추정 평점이 높은 것들을 추정 평점 역순으로 추천한다

In [17]:
################################################################################
############# 특정 user에게 top k개의 영상을 추천해주는 함수 ###################
################################################################################
def recommend(train, algorithm, user_id, top_k=10):
    ################ Fill in Your Code ########################################
    # 보지 않은 영화 목록 가져온다
    unseen_movies = get_unseen_movies(train, user_id)
    
    # 점수 추정
    prediction = [algorithm.predict(user_id, movie_id) for movie_id in unseen_movies]

    # 점수 역순 정렬
    prediction.sort(key=lambda x:x.est, reverse=True)  

    ###########################################################################
    for _, movie, _, pred, _ in prediction[:top_k]:
        print("movid id: {}, movie genre: {},predicted rating: {}".format(movie_id_to_name[movie], movie_id_to_genre[movie], pred))


In [18]:
#########################################
####### 800번 유저의 추천 결과 ##########
#########################################

recommend(train, algorithm, user_id=800, top_k=20)


movid id: Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964), movie genre: Comedy|War,predicted rating: 4.379268343299398
movid id: Cool Hand Luke (1967), movie genre: Drama,predicted rating: 4.366337910559869
movid id: Shawshank Redemption, The (1994), movie genre: Crime|Drama,predicted rating: 4.357400944375781
movid id: Great Escape, The (1963), movie genre: Action|Adventure|Drama|War,predicted rating: 4.338102501836261
movid id: Lawrence of Arabia (1962), movie genre: Adventure|Drama|War,predicted rating: 4.338069613240958
movid id: Rear Window (1954), movie genre: Mystery|Thriller,predicted rating: 4.3159192373770505
movid id: Sunset Blvd. (a.k.a. Sunset Boulevard) (1950), movie genre: Drama|Film-Noir|Romance,predicted rating: 4.314247235102333
movid id: Godfather, The (1972), movie genre: Crime|Drama,predicted rating: 4.312106926751112
movid id: Usual Suspects, The (1995), movie genre: Crime|Mystery|Thriller,predicted rating: 4.307639521902191
movid id: C

In [19]:
#########################################
####### 900번 유저의 추천 결과 ##########
#########################################

recommend(train, algorithm, user_id=900, top_k=20)


movid id: Shawshank Redemption, The (1994), movie genre: Crime|Drama,predicted rating: 4.780200505264372
movid id: Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964), movie genre: Comedy|War,predicted rating: 4.771367032257561
movid id: Lawrence of Arabia (1962), movie genre: Adventure|Drama|War,predicted rating: 4.746860741892152
movid id: Godfather, The (1972), movie genre: Crime|Drama,predicted rating: 4.728139408387854
movid id: Cool Hand Luke (1967), movie genre: Drama,predicted rating: 4.711957023184131
movid id: Usual Suspects, The (1995), movie genre: Crime|Mystery|Thriller,predicted rating: 4.688826831802954
movid id: Great Escape, The (1963), movie genre: Action|Adventure|Drama|War,predicted rating: 4.684900006788807
movid id: Casablanca (1942), movie genre: Drama|Romance,predicted rating: 4.678470198784165
movid id: Lord of the Rings: The Return of the King, The (2003), movie genre: Action|Adventure|Drama|Fantasy,predicted rating: 4.660734997903226
m