### [E14] Movie Recommendation

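#### Table of Contents

1. Data Preparation and Preprocessing
2. Data Analysis
3. Adding My Favorite Movies
4. Building the CSR Matrix
5. Model Design and Training
6. My Preferences as Predicted by the Trained Model
7. Recommending Movies Similar to My Favorites
8. Recommending Movies I Might Like
9. Retrospective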

## 1. Data Preparation and Preprocessing

In [1]:
import pandas as pd
import os

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


- Keep only the ratings of 3 or higher
- Of the 1,000,209 ratings, 836,478 are 3 or higher (about 83.63%)

In [2]:
ratings = ratings[ratings['rating'] >= 3].copy()  # copy so the later in-place rename does not act on a slice
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


- Rename the rating column to count

In [3]:
ratings.rename(columns={'rating':'count'}, inplace=True)
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

In [4]:
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


- Join the ratings dataframe with the movies dataframe

In [5]:
ratings = ratings.join(movies.set_index('movie_id'), on='movie_id')
ratings.head()

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy


- Drop the timestamp and genre columns, which are not needed

In [6]:
ratings = ratings.drop(columns=['timestamp', 'genre'])
ratings.head()

Unnamed: 0,user_id,movie_id,count,title
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975)
1,1,661,3,James and the Giant Peach (1996)
2,1,914,3,My Fair Lady (1964)
3,1,3408,4,Erin Brockovich (2000)
4,1,2355,5,"Bug's Life, A (1998)"


## 2. Data Analysis

- Print the number of unique movie IDs, movie titles, and users in ratings

In [7]:
print('# of movie_id: ', ratings['movie_id'].nunique())
print('# of title   : ', ratings['title'].nunique())
print('# of user_id : ', ratings['user_id'].nunique())

# of movie_id:  3628
# of title   :  3628
# of user_id :  6039
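
Equal counts for `movie_id` and `title` suggest, but do not prove, that the two columns map one-to-one. Since titles are used as the movie key later on, a stricter check may be useful. The sketch below runs on a hypothetical toy frame that deliberately reuses one title.

```python
import pandas as pd

# Hypothetical toy frame; title 'B' is deliberately shared by two movie_ids.
toy = pd.DataFrame({
    'movie_id': [1, 1, 2, 3],
    'title': ['A', 'A', 'B', 'B'],
})

# Each movie_id should have exactly one title, and each title one movie_id.
titles_per_id = toy.groupby('movie_id')['title'].nunique()
ids_per_title = toy.groupby('title')['movie_id'].nunique()

is_one_to_one = bool((titles_per_id == 1).all() and (ids_per_title == 1).all())
print('one-to-one:', is_one_to_one)
print(ids_per_title[ids_per_title > 1])  # titles that map to several IDs
```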


- Print the 30 movies rated by the most users

- These are treated as the most popular movies

In [8]:
movie_count = ratings.groupby('title')['user_id'].count()
movie_count.sort_values(ascending=False).head(30)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

In [9]:
user_count = ratings.groupby('user_id')['movie_id'].count()
user_count.describe()

count    6039.000000
mean      138.512668
std       156.241599
min         1.000000
25%        38.000000
50%        81.000000
75%       177.000000
max      1968.000000
Name: movie_id, dtype: float64

## 3. Adding My Favorite Movies

- Search movie titles by keyword


In [10]:
movies[movies['title'].str.lower().str.contains('men in', regex=False)]

Unnamed: 0,movie_id,title,genre
516,520,Robin Hood: Men in Tights (1993),Comedy
1539,1580,Men in Black (1997),Action|Adventure|Comedy|Sci-Fi
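
The keyword search can be wrapped in a small helper so the lookup is easy to repeat. `find_movies` is a hypothetical name, and the toy frame below stands in for `movies`.

```python
import pandas as pd

def find_movies(movies_df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return the rows whose title contains keyword, ignoring case."""
    mask = movies_df['title'].str.contains(keyword, case=False, regex=False)
    return movies_df[mask]

# Hypothetical toy frame with the same columns as movies.
toy_movies = pd.DataFrame({
    'movie_id': [1, 520, 1580],
    'title': ['Toy Story (1995)',
              'Robin Hood: Men in Tights (1993)',
              'Men in Black (1997)'],
    'genre': ['Animation', 'Comedy', 'Action'],
})

found = find_movies(toy_movies, 'MEN IN')
print(found)
```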


- Pick five movies I like and build lists of their IDs and titles

- user_id = jinhyun; assume each movie was watched 5 times (count = 5)

In [11]:
my_favorite_id = [1240, 2571, 1580 , 648, 3623]

my_favorite_title = []
for i in my_favorite_id:
    my_favorite_title.extend(list(movies[movies['movie_id'] == i]['title']))

my_movielist = pd.DataFrame({'user_id': ['jinhyun']*5, 'movie_id': my_favorite_id, 'count': [5]*5, 'title': my_favorite_title})
my_movielist

Unnamed: 0,user_id,movie_id,count,title
0,jinhyun,1240,5,"Terminator, The (1984)"
1,jinhyun,2571,5,"Matrix, The (1999)"
2,jinhyun,1580,5,Men in Black (1997)
3,jinhyun,648,5,Mission: Impossible (1996)
4,jinhyun,3623,5,Mission: Impossible 2 (2000)


- Append the new rows (my_movielist) to the existing ratings dataframe

In [12]:
if not ratings.isin({'user_id':['jinhyun']})['user_id'].any():
    ratings = pd.concat([ratings, my_movielist], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

ratings.tail(10)

Unnamed: 0,user_id,movie_id,count,title
836473,6040,1090,3,Platoon (1986)
836474,6040,1094,5,"Crying Game, The (1992)"
836475,6040,562,5,Welcome to the Dollhouse (1995)
836476,6040,1096,4,Sophie's Choice (1982)
836477,6040,1097,4,E.T. the Extra-Terrestrial (1982)
836478,jinhyun,1240,5,"Terminator, The (1984)"
836479,jinhyun,2571,5,"Matrix, The (1999)"
836480,jinhyun,1580,5,Men in Black (1997)
836481,jinhyun,648,5,Mission: Impossible (1996)
836482,jinhyun,3623,5,Mission: Impossible 2 (2000)


## 4. Building the CSR Matrix

- user_unique: the unique user IDs

- movie_unique: the unique movie titles

In [13]:
user_unique = ratings['user_id'].unique()
movie_unique = ratings['title'].unique()

user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [14]:
temp_user_data = ratings['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(ratings):
    print('user_id column indexing OK!!')
    ratings['user_id'] = temp_user_data
else:
    print('user_id column indexing Fail!!')

temp_movie_data = ratings['title'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(ratings):
    print('title column indexing OK!!')
    ratings['title'] = temp_movie_data
else:
    print('title column indexing Fail!!')

ratings

user_id column indexing OK!!
title column indexing OK!!


Unnamed: 0,user_id,movie_id,count,title
0,0,1193,5,0
1,0,661,3,1
2,0,914,3,2
3,0,3408,4,3
4,0,2355,5,4
...,...,...,...,...
836478,6039,1240,5,200
836479,6039,2571,5,124
836480,6039,1580,5,175
836481,6039,648,5,58
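
The indexing step replaces each raw ID with its position in the list of unique values. A minimal sketch on hypothetical IDs (the `toy_` names are made up for illustration) shows the forward and reverse mappings round-tripping:

```python
# Hypothetical raw IDs; integers and a string are mixed, as in the notebook.
raw_user_ids = [1, 1, 2, 6040, 'jinhyun']

# dict.fromkeys drops duplicates while keeping first-seen order,
# which matches the order returned by Series.unique().
unique_ids = list(dict.fromkeys(raw_user_ids))

toy_user_to_idx = {raw: idx for idx, raw in enumerate(unique_ids)}
toy_idx_to_user = {idx: raw for raw, idx in toy_user_to_idx.items()}

encoded = [toy_user_to_idx[u] for u in raw_user_ids]
decoded = [toy_idx_to_user[i] for i in encoded]

print(encoded)  # [0, 0, 1, 2, 3]
print(decoded == raw_user_ids)  # True
```

Because the new user is appended last, it receives the highest index, which is why jinhyun maps to 6039 in the table above.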


- Build the CSR matrix

In [15]:
from scipy.sparse import csr_matrix

num_user = ratings['user_id'].nunique()
num_movie = ratings['title'].nunique()

csr_data = csr_matrix((ratings['count'], (ratings.user_id, ratings.title)), shape= (num_user, num_movie))
csr_data

<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
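
To make the CSR layout concrete, the sketch below builds the three CSR arrays by hand from a few hypothetical (user, movie, count) triples, then computes the density of the 6040 x 3628 matrix reported above. `to_csr` is an illustrative helper, not a SciPy function.

```python
def to_csr(triples, num_rows):
    """Build CSR arrays (data, indices, indptr) from (row, col, value) triples."""
    ordered = sorted(triples)  # sort by row, then by column
    data = [value for _, _, value in ordered]
    indices = [col for _, col, _ in ordered]
    # Count the entries per row, then turn the counts into running offsets.
    indptr = [0] * (num_rows + 1)
    for row, _, _ in ordered:
        indptr[row + 1] += 1
    for row in range(num_rows):
        indptr[row + 1] += indptr[row]
    return data, indices, indptr

# Hypothetical toy interactions: (user index, movie index, count).
triples = [(0, 0, 5), (0, 2, 3), (2, 1, 4)]
data, indices, indptr = to_csr(triples, num_rows=3)

# The entries of row r are data[indptr[r]:indptr[r + 1]].
print(data, indices, indptr)  # [5, 3, 4] [0, 2, 1] [0, 2, 2, 3]

# Density of the real matrix: stored elements / (rows * columns).
density = 836483 / (6040 * 3628)
print(f'density: {density:.2%}')
```

Fewer than 4% of the user-movie pairs are filled in, which is why a sparse format is used instead of a dense array.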

## 5. Model Design and Training

In [16]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

- Declare the implicit AlternatingLeastSquares model

In [17]:
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)
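
For intuition about what alternating least squares does, here is a simplified NumPy sketch. It is plain, unweighted ALS on a small dense matrix, not the confidence-weighted implicit-feedback variant that the implicit library implements. The function name `als_sketch` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def als_sketch(R, factors=2, regularization=0.01, iterations=15, seed=0):
    """Plain ALS: find X (users) and Y (items) so that R is close to X @ Y.T."""
    rng = np.random.default_rng(seed)
    num_users, num_items = R.shape
    X = rng.normal(scale=0.1, size=(num_users, factors))
    Y = rng.normal(scale=0.1, size=(num_items, factors))
    reg = regularization * np.eye(factors)
    for _ in range(iterations):
        # Hold Y fixed and solve the regularized least-squares problem for X.
        X = np.linalg.solve(Y.T @ Y + reg, Y.T @ R.T).T
        # Hold X fixed and solve the same kind of problem for Y.
        Y = np.linalg.solve(X.T @ X + reg, X.T @ R).T
    return X, Y

# Hypothetical toy matrix: two groups of users with separate tastes.
R = np.array([
    [5, 5, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 5, 5],
], dtype=float)

X, Y = als_sketch(R)
error = float(np.linalg.norm(R - X @ Y.T))
print(f'reconstruction error: {error:.4f}')
```

Each half-step is an ordinary ridge regression, which is what makes the alternation cheap. The `factors`, `regularization`, and `iterations` arguments play the same roles as the parameters of the same names in the model declared above.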

In [18]:
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>

- Train the model

In [19]:
als_model.fit(csr_data_transpose)


## 6. My Preferences as Predicted by the Trained Model

- Use the model to get my user vector and a movie vector

- Look up the vector for The Matrix, one of the movies on my favorites list

In [20]:
jinhyun, matrix = user_to_idx['jinhyun'], movie_to_idx['Matrix, The (1999)']
jinhyun_vector, matrix_vector = als_model.user_factors[jinhyun], als_model.item_factors[matrix]

In [21]:
jinhyun_vector

array([-0.10407385, -0.15996921, -0.04658415,  0.6656077 , -0.41091716,
        0.6441916 , -0.14052422, -0.10523397,  0.25680387,  0.64419585,
       -0.5740881 ,  0.12596968, -0.94963324,  0.25905886,  0.57774293,
       -0.21707761, -0.16874526, -0.2067231 , -0.17854705,  0.43732002,
       -0.0627743 ,  0.6545852 , -0.59835035, -0.6273166 , -0.21616676,
       -1.0186985 ,  0.48447376, -0.39903402, -0.5583906 ,  0.5207662 ,
        0.18024871,  0.54136217,  0.59648305, -0.02677123,  0.45193064,
        0.3367994 ,  0.22542396,  0.6913113 ,  1.057372  , -0.06599249,
        0.11168905, -0.22841829, -0.04319756, -0.5076382 ,  0.5031533 ,
       -0.6764759 ,  0.22234018,  0.0166997 ,  0.5561485 ,  0.5321205 ,
       -0.32288212,  0.00699064,  0.63094515,  0.4442744 , -0.49489233,
        0.21633357, -0.9358439 ,  0.25936046,  0.02060665,  0.06043379,
       -0.7194806 ,  0.811943  , -0.27913883, -0.97378665, -0.38428545,
       -0.08008095,  0.55054   , -0.2591602 ,  0.6764263 ,  0.25

In [22]:
matrix_vector

array([-0.0036088 ,  0.01346618, -0.01313983, -0.00241693,  0.018008  ,
        0.00456053,  0.00368894, -0.00391179,  0.00494958,  0.01707732,
       -0.01592822,  0.02945047, -0.02068741,  0.02652671,  0.04261166,
       -0.01056477,  0.01074309,  0.00128489, -0.00082344,  0.00896771,
       -0.0001897 ,  0.02559703, -0.00421177,  0.01371059,  0.00175585,
       -0.03391613,  0.03233499, -0.01316515, -0.00712932,  0.02017915,
        0.00558355,  0.00990691,  0.01101034, -0.00032511,  0.01146411,
        0.00429229,  0.01443505,  0.03113811,  0.04024335,  0.0088894 ,
        0.01436334,  0.0059607 , -0.00557846, -0.008721  ,  0.02070909,
       -0.00368633,  0.00173254,  0.00531164,  0.00880428,  0.0113194 ,
        0.00225   ,  0.00349749,  0.01410162,  0.01854736,  0.00583536,
        0.03445185, -0.01378641,  0.02214003, -0.00395485, -0.01957959,
        0.00409041, -0.00512295,  0.00786609, -0.00549204, -0.01839095,
        0.01866245,  0.01085191,  0.00394807,  0.02997269,  0.01

- Take the dot product of my vector and the vector for The Matrix

In [23]:
np.dot(jinhyun_vector, matrix_vector)

0.66690004

- Compute the same score for a movie that is not on my favorites list

- The result is much lower (negative)

In [24]:
Platoon = movie_to_idx['Platoon (1986)']
Platoon_vector = als_model.item_factors[Platoon]
np.dot(jinhyun_vector, Platoon_vector)

-0.16732344
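
The two scores above are dot products between a user vector and an item vector: the score is high when the vectors point in similar directions and low or negative otherwise. A toy sketch with hypothetical 3-dimensional vectors (the trained model uses 100 dimensions):

```python
import numpy as np

# Hypothetical factors, chosen by hand for illustration.
user_vector = np.array([0.9, 0.1, -0.2])
liked_item = np.array([0.8, 0.0, -0.1])   # similar direction to the user
other_item = np.array([-0.3, 0.5, 0.4])   # different direction

liked_score = float(np.dot(user_vector, liked_item))
other_score = float(np.dot(user_vector, other_item))

print(f'liked: {liked_score:.2f}, other: {other_score:.2f}')  # liked: 0.74, other: -0.30
```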

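## 7. Recommending Movies Similar to My Favorites

- Using the item vectors, return the 15 movies most similar to the input movie (the input movie itself ranks first with a score of 1.0)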

In [25]:
favorite_movie = 'Mission: Impossible (1996)'
movie_id = movie_to_idx[favorite_movie]
similar_movie = als_model.similar_items(movie_id, N=15)
similar_movie 

[(58, 1.0),
 (138, 0.60378706),
 (118, 0.493976),
 (129, 0.49240384),
 (587, 0.48744965),
 (82, 0.48630193),
 (179, 0.4837196),
 (156, 0.479505),
 (714, 0.45837945),
 (119, 0.4322652),
 (766, 0.42501557),
 (975, 0.42030525),
 (772, 0.4176734),
 (140, 0.4136591),
 (421, 0.4062894)]

- Use the index values to print the movie titles

In [26]:
idx_to_movie = {v:k for k,v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in similar_movie]

['Mission: Impossible (1996)',
 'True Lies (1994)',
 'Con Air (1997)',
 'Conspiracy Theory (1997)',
 'Batman Returns (1992)',
 'Lost World: Jurassic Park, The (1997)',
 'Rock, The (1996)',
 'Mask of Zorro, The (1998)',
 'Waterworld (1995)',
 'Twister (1996)',
 'GoldenEye (1995)',
 'Young Sherlock Holmes (1985)',
 'Rising Sun (1993)',
 'Clear and Present Danger (1994)',
 'Midnight in the Garden of Good and Evil (1997)']
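
The scores above top out at 1.0 for the query movie itself, which is consistent with a cosine-similarity ranking over the item factors. Assuming that is what `similar_items` does in this version of implicit, the idea can be sketched in NumPy as follows. `top_similar` and the toy factors are hypothetical.

```python
import numpy as np

def top_similar(item_factors, item_idx, n=3):
    """Rank items by cosine similarity to item_idx, highest first."""
    norms = np.linalg.norm(item_factors, axis=1)
    scores = item_factors @ item_factors[item_idx] / (norms * norms[item_idx])
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical 2-dimensional item factors.
toy_item_factors = np.array([
    [1.0, 0.0],    # item 0: the query
    [0.9, 0.1],    # item 1: nearly the same direction
    [0.0, 1.0],    # item 2: orthogonal
    [-1.0, 0.0],   # item 3: opposite direction
])

similar = top_similar(toy_item_factors, item_idx=0)
print(similar)
```

The query comes back first, so a caller that wants only other movies can drop the first entry.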

- Wrap the lookup in a function that returns the titles of the movies most similar to the input movie

In [27]:
def get_similar_movie(movie_title: str):
    movie_id = movie_to_idx[movie_title]
    similar_movie = als_model.similar_items(movie_id, N=15)
    similar_movie = [idx_to_movie[i[0]] for i in similar_movie]
    return similar_movie

In [28]:
get_similar_movie('Men in Black (1997)')

['Men in Black (1997)',
 'Jurassic Park (1993)',
 'Terminator 2: Judgment Day (1991)',
 'Total Recall (1990)',
 'Matrix, The (1999)',
 'Independence Day (ID4) (1996)',
 'Fifth Element, The (1997)',
 'Schlafes Bruder (Brother of Sleep) (1995)',
 'Lost World: Jurassic Park, The (1997)',
 'Face/Off (1997)',
 'True Lies (1994)',
 'Rocky Horror Picture Show, The (1975)',
 'Braveheart (1995)',
 'Bewegte Mann, Der (1994)',
 'Mission: Impossible (1996)']

In [29]:
get_similar_movie('Terminator, The (1984)')

['Terminator, The (1984)',
 'Aliens (1986)',
 'Die Hard (1988)',
 'Predator (1987)',
 'Alien (1979)',
 'Terminator 2: Judgment Day (1991)',
 'Matrix, The (1999)',
 'Robocop (1987)',
 'Total Recall (1990)',
 'Blade Runner (1982)',
 'Indiana Jones and the Last Crusade (1989)',
 'Mad Max 2 (a.k.a. The Road Warrior) (1981)',
 'Raiders of the Lost Ark (1981)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Mad Max (1979)']

## 8. Recommending Movies I Might Like

- Using my user vector, return the indices of the movies I am most likely to enjoy

In [30]:
user = user_to_idx['jinhyun']

movie_recommended = als_model.recommend(user, csr_data, N=15, filter_already_liked_items=True)
movie_recommended

[(92, 0.68560356),
 (107, 0.57551026),
 (62, 0.45803493),
 (372, 0.43866682),
 (67, 0.34742075),
 (346, 0.30091265),
 (141, 0.29454082),
 (82, 0.28336498),
 (145, 0.27763194),
 (375, 0.2682832),
 (44, 0.26061863),
 (117, 0.25984251),
 (150, 0.24387723),
 (746, 0.23775703),
 (361, 0.23207167)]

In [31]:
[idx_to_movie[i[0]] for i in movie_recommended]

['Terminator 2: Judgment Day (1991)',
 'Jurassic Park (1993)',
 'Total Recall (1990)',
 'X-Men (2000)',
 'Gladiator (2000)',
 'Patriot, The (2000)',
 'Fugitive, The (1993)',
 'Lost World: Jurassic Park, The (1997)',
 'Fifth Element, The (1997)',
 'Face/Off (1997)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Independence Day (ID4) (1996)',
 'Perfect Storm, The (2000)',
 'Casablanca (1942)']
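
Conceptually, a recommendation like this scores every movie against the user vector, hides the movies the user already has, and keeps the top N. The sketch below illustrates that idea under those assumptions. It is not the implicit library's implementation, and `recommend_sketch` and the toy vectors are hypothetical.

```python
import numpy as np

def recommend_sketch(user_vector, item_factors, liked_items, n=2):
    """Score all items by dot product, mask liked items, return the top n."""
    scores = (item_factors @ user_vector).astype(float)
    liked = np.array(sorted(liked_items), dtype=int)
    scores[liked] = -np.inf  # same role as filter_already_liked_items=True
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

toy_user = np.array([1.0, 0.5])
toy_items = np.array([
    [0.9, 0.2],    # item 0: already liked, so it is masked out
    [0.8, 0.1],    # item 1
    [0.1, 0.9],    # item 2
    [-0.5, 0.0],   # item 3
])

recommended = recommend_sketch(toy_user, toy_items, liked_items={0})
print(recommended)
```

Item 0 has the highest raw score but is excluded, so items 1 and 2 are returned.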

- Check how much each movie on my favorites list contributed to a recommendation

In [32]:
recommended = movie_to_idx['Jurassic Park (1993)']
explain = als_model.explain(user, csr_data, itemid=recommended)

[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('Men in Black (1997)', 0.34406963963052284),
 ('Matrix, The (1999)', 0.14778931259001069),
 ('Mission: Impossible (1996)', 0.0732708621655865),
 ('Terminator, The (1984)', 0.02131331049830546),
 ('Mission: Impossible 2 (2000)', -0.02133515904148016)]
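
The contributions can be summed and expressed as shares to see which favorite drives the recommendation. The values below are copied from the output above. Their sum comes out close to, though not identical to, the score of 0.5755 that `recommend` reported for Jurassic Park.

```python
# Contribution values copied from the explain output above.
contributions = {
    'Men in Black (1997)': 0.34406963963052284,
    'Matrix, The (1999)': 0.14778931259001069,
    'Mission: Impossible (1996)': 0.0732708621655865,
    'Terminator, The (1984)': 0.02131331049830546,
    'Mission: Impossible 2 (2000)': -0.02133515904148016,
}

total = sum(contributions.values())
shares = {title: value / total for title, value in contributions.items()}

for title, share in shares.items():
    print(f'{title}: {share:.1%}')
print(f'total: {total:.4f}')  # total: 0.5651
```

Men in Black accounts for roughly 61% of the total, while Mission: Impossible 2 pulls the score down slightly.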

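## 9. Retrospective

- The ratings and the movie titles came in two separate files. Since both are used frequently, I combined the two dataframes into one with join.

- A movie on my favorites list (The Matrix) scored about 0.67, which is relatively high, while a movie outside the list (Platoon) scored about -0.17, which is much lower. I prefer action movies, so the model seems to reflect my taste well.

- This was probably the most enjoyable node I have worked on so far. I had long been curious about the recommendation systems used by platforms such as YouTube, Netflix, and online advertising, and working through this node answered a fair number of those questions.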