# 13. Finding Other Artists That IU Fans Might Like

## 13-9. Project - MovieLens Movie Recommendation Exercise

#### Checking Library Versions

In [1]:
import numpy as np
import scipy
import implicit



print(np.__version__)
print(scipy.__version__)
print(implicit.__version__)

1.21.4
1.7.1
0.4.8


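We will interpret each star rating as a view count.

We will also assume that a rating below 3 means the user did not like the movie, and exclude those rows.

### 1) Data Preparation and Preprocessing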

In [2]:
import pandas as pd
import os

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'ratings', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [3]:
# Keep only ratings of 3 or higher.
ratings = ratings[ratings['ratings']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [4]:
# Rename the ratings column to counts.
ratings.rename(columns={'ratings':'counts'}, inplace=True)

In [5]:
ratings['counts']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: counts, Length: 836478, dtype: int64

In [6]:
# Load the metadata so that movie titles can be displayed.
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
movies['genre'].unique()

array(["Animation|Children's|Comedy", "Adventure|Children's|Fantasy",
       'Comedy|Romance', 'Comedy|Drama', 'Comedy',
       'Action|Crime|Thriller', "Adventure|Children's", 'Action',
       'Action|Adventure|Thriller', 'Comedy|Drama|Romance',
       'Comedy|Horror', "Animation|Children's", 'Drama',
       'Action|Adventure|Romance', 'Drama|Thriller', 'Drama|Romance',
       'Thriller', 'Action|Comedy|Drama', 'Crime|Drama|Thriller',
       'Drama|Sci-Fi', 'Romance', 'Adventure|Sci-Fi', 'Adventure|Romance',
       "Children's|Comedy|Drama", 'Documentary', 'Drama|War',
       'Action|Crime|Drama', 'Action|Adventure', 'Crime|Thriller',
       "Animation|Children's|Musical|Romance", 'Action|Drama|Thriller',
       "Children's|Comedy", 'Drama|Mystery', 'Sci-Fi|Thriller',
       'Action|Comedy|Crime|Horror|Thriller', 'Drama|Musical',
       'Crime|Drama|Romance', 'Adventure|Drama', 'Action|Thriller',
       "Adventure|Children's|Comedy|Musical", 'Action|Drama|War',
       'Action|Adventur

### 2) Let's Analyze the Data
Number of unique movies in ratings

Number of unique users in ratings

The 30 most popular movies (in order of popularity)

In [8]:
# Number of unique movies in ratings
ratings['movie_id'].nunique()

3628

In [9]:
# Number of unique users in ratings
ratings['user_id'].nunique()

6039

In [10]:
# Merge movies and ratings
movie_df = pd.merge(ratings, movies, on='movie_id')
movie_df

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...
836473,5851,3607,5,957756608,One Little Indian (1973),Comedy|Drama|Western
836474,5854,3026,4,958346883,Slaughterhouse (1987),Horror
836475,5854,690,3,957744257,"Promise, The (Versprechen, Das) (1994)",Romance
836476,5938,2909,4,957273353,"Five Wives, Three Secretaries and Me (1998)",Documentary
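`pd.merge` performs an inner join by default, so a rating whose `movie_id` has no row in `movies` would be dropped without notice. The following is a minimal sketch on made-up toy frames (the names `toy_ratings` and `toy_movies` and all values are illustrative assumptions, not MovieLens data) showing how the `validate` and `indicator` options can surface such mismatches.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 99],   # movie_id 99 has no metadata row
    'counts': [5, 3, 4],
})
toy_movies = pd.DataFrame({
    'movie_id': [10, 20],
    'title': ['Movie A', 'Movie B'],
})

# Default inner join: the rating for movie_id 99 disappears.
# validate='many_to_one' raises if movie_id is duplicated in toy_movies.
inner = pd.merge(toy_ratings, toy_movies, on='movie_id', validate='many_to_one')

# A left join with an indicator column keeps every rating and flags unmatched ones.
checked = pd.merge(toy_ratings, toy_movies, on='movie_id', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(inner), len(unmatched))
```

Comparing the row count before and after the merge is a quick way to confirm that no ratings were lost.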


In [11]:
# movie_df['title'] = movie_df['title'].str.lower() # convert movie titles to lowercase
movie_df.head(10)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama
5,18,1193,4,978156168,One Flew Over the Cuckoo's Nest (1975),Drama
6,19,1193,5,982730936,One Flew Over the Cuckoo's Nest (1975),Drama
7,24,1193,5,978136709,One Flew Over the Cuckoo's Nest (1975),Drama
8,28,1193,3,978125194,One Flew Over the Cuckoo's Nest (1975),Drama
9,33,1193,5,978557765,One Flew Over the Cuckoo's Nest (1975),Drama


In [12]:
# The 30 most popular movies (in order of popularity)
movie_count = movie_df.groupby('title')['user_id'].count()
movie_count.sort_values(ascending=False).head(30)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

In [13]:
# Statistics on how many movies each user has rated
user_count = movie_df.groupby('user_id')['title'].count()
user_count.describe()

count    6039.000000
mean      138.512668
std       156.241599
min         1.000000
25%        38.000000
50%        81.000000
75%       177.000000
max      1968.000000
Name: title, dtype: float64

In [14]:
# Statistics on each user's median rating (the stand-in for play count)
user_median = movie_df.groupby('user_id')['counts'].median()
user_median.describe()

count    6039.000000
mean        4.055970
std         0.432143
min         3.000000
25%         4.000000
50%         4.000000
75%         4.000000
max         5.000000
Name: counts, dtype: float64

### 3) Let's Pick Five Favorite Movies and Add Them to ratings


In [15]:
movie_df[movie_df['title'].str.lower().str.contains('mission')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
59058,2,648,4,978299913,Mission: Impossible (1996),Action|Adventure|Mystery


In [16]:
movie_df[movie_df['title'].str.lower().str.contains('toy story')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
38620,1,1,5,978824268,Toy Story (1995),Animation|Children's|Comedy


In [17]:
movie_df[movie_df['title'].str.lower().str.contains('good will')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
221579,5,1704,3,978244517,Good Will Hunting (1997),Drama


In [18]:
movie_df[movie_df['title'].str.lower().str.contains('truman')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
297546,8,1682,4,978230852,"Truman Show, The (1998)",Drama


In [19]:
movie_df[movie_df['title'].str.lower().str.contains('titanic')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
29123,1,1721,4,978300055,Titanic (1997),Drama|Romance


In [20]:
# movie_df[movie_df['title'].str.lower().str.contains('snow white')].head(1)

In [21]:
# Feel free to swap in your own favorite movies. The titles must match the dataset exactly.

my_favorite = ['Mission: Impossible (1996)' , 'Toy Story (1995)' ,'Good Will Hunting (1997)' ,'Truman Show, The (1998)' ,'Titanic (1997)']

favorite_movie_id = movies[movies['title'].isin(my_favorite)]

my_list = pd.DataFrame({'user_id': ['sun']*5, 'movie_id': favorite_movie_id['movie_id'], 'counts':[5]*5} )

# Add the new user only once, so re-running the cell does not duplicate rows.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
if not ratings.isin({'user_id':['sun']})['user_id'].any():
    ratings = pd.concat([ratings, my_list])

ratings.tail(20)

Unnamed: 0,user_id,movie_id,counts,timestamp
1000192,6040,2019,5,956703977.0
1000193,6040,541,4,956715288.0
1000194,6040,1077,5,964828799.0
1000196,6040,549,4,956704746.0
1000197,6040,2020,3,956715288.0
1000198,6040,2021,3,956716374.0
1000199,6040,2022,5,956716207.0
1000200,6040,2028,5,956704519.0
1000201,6040,1080,4,957717322.0
1000202,6040,1089,4,956704996.0


In [22]:
favorite_movie_id

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
642,648,Mission: Impossible (1996),Action|Adventure|Mystery
1636,1682,"Truman Show, The (1998)",Drama
1656,1704,Good Will Hunting (1997),Drama
1672,1721,Titanic (1997),Drama|Romance
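`movies[movies['title'].isin(my_favorite)]` returns rows in the order they appear in `movies`, not in the order of `my_favorite`, as the output above shows. That is harmless while every favorite gets the same score, but it would misalign scores that differ per movie. Below is a hedged sketch, using a hypothetical `toy_movies` frame and a hypothetical `my_scores` dict, that keeps each title paired with its own score.

```python
import pandas as pd

# Toy metadata (a hypothetical subset, for illustration only)
toy_movies = pd.DataFrame({
    'movie_id': [1, 648, 1721],
    'title': ['Toy Story (1995)', 'Mission: Impossible (1996)', 'Titanic (1997)'],
})

# Hypothetical per-title scores; the dict keeps each score tied to its title.
my_scores = {'Titanic (1997)': 3, 'Toy Story (1995)': 5}

picked = toy_movies[toy_movies['title'].isin(list(my_scores))]
my_rows = pd.DataFrame({
    'user_id': 'sun',                                        # scalar is broadcast to every row
    'movie_id': picked['movie_id'].to_numpy(),
    'counts': picked['title'].map(my_scores).to_numpy(),     # look up the score by title
})
print(my_rows)
```

Mapping scores by title avoids relying on the positional order of two separately built lists.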


In [23]:
movie_df = pd.merge(ratings, movies, on='movie_id')
movie_df

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
0,1,1193,5,9.783008e+08,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,9.782984e+08,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,9.782202e+08,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,9.781993e+08,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,9.781585e+08,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...
836478,5851,3607,5,9.577566e+08,One Little Indian (1973),Comedy|Drama|Western
836479,5854,3026,4,9.583469e+08,Slaughterhouse (1987),Horror
836480,5854,690,3,9.577443e+08,"Promise, The (Versprechen, Das) (1994)",Romance
836481,5938,2909,4,9.572734e+08,"Five Wives, Three Secretaries and Me (1998)",Documentary


In [24]:
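# Keep only the columns we use. .copy() makes movie_df an independent DataFrame,
# so the column assignments further down do not trigger SettingWithCopyWarning.
movie_df = movie_df[['user_id','title', 'counts','movie_id','genre']].copy()
movie_df.sort_index()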

Unnamed: 0,user_id,title,counts,movie_id,genre
0,1,One Flew Over the Cuckoo's Nest (1975),5,1193,Drama
1,2,One Flew Over the Cuckoo's Nest (1975),5,1193,Drama
2,12,One Flew Over the Cuckoo's Nest (1975),4,1193,Drama
3,15,One Flew Over the Cuckoo's Nest (1975),4,1193,Drama
4,17,One Flew Over the Cuckoo's Nest (1975),5,1193,Drama
...,...,...,...,...,...
836478,5851,One Little Indian (1973),5,3607,Comedy|Drama|Western
836479,5854,Slaughterhouse (1987),4,3026,Horror
836480,5854,"Promise, The (Versprechen, Das) (1994)",3,690,Romance
836481,5938,"Five Wives, Three Secretaries and Me (1998)",4,2909,Documentary


In [25]:
# Collect the unique users and movie titles
user_unique = movie_df['user_id'].unique()
movie_unique = movie_df['title'].unique()

# Map each user and each movie title to an integer index
user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [26]:
# Check that the indexing worked.
print(user_to_idx['sun'])    
print(movie_to_idx['Toy Story (1995)'])

5501
40
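The dictionary-based indexing above assigns indices in order of first appearance. As a sketch of an alternative, `pd.factorize` produces the same kind of integer codes in one call. The `toy_users` values below are made up for illustration.

```python
import pandas as pd

# Toy user ids mixing integers and a string, like the 'sun' user above
toy_users = pd.Series([7, 3, 7, 'sun', 3])

# Dictionary approach, as in the cells above
to_idx = {v: k for k, v in enumerate(toy_users.unique())}
mapped = toy_users.map(to_idx)

# pd.factorize returns the integer codes and the unique values together
codes, uniques = pd.factorize(toy_users)

print(mapped.tolist(), codes.tolist(), list(uniques))
```

Both approaches number values by first appearance, so the codes agree. The dictionary has the advantage that individual lookups such as `user_to_idx['sun']` remain available afterwards.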


In [27]:
temp_user_data = movie_df['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(movie_df):   # every row was indexed successfully
    print('user_id column indexing OK!!')
    movie_df['user_id'] = temp_user_data   # replace movie_df['user_id'] with the indexed Series
else:
    print('user_id column indexing Fail!!')


temp_movie_data = movie_df['title'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(movie_df):
    print('title column indexing OK!!')
    movie_df['title'] = temp_movie_data
else:
    print('title column indexing Fail!!')

movie_df

user_id column indexing OK!!
title column indexing OK!!



Unnamed: 0,user_id,title,counts,movie_id,genre
0,0,0,5,1193,Drama
1,1,0,5,1193,Drama
2,2,0,4,1193,Drama
3,3,0,4,1193,Drama
4,4,0,5,1193,Drama
...,...,...,...,...,...
836478,1621,3623,5,3607,Comedy|Drama|Western
836479,3481,3624,4,3026,Horror
836480,3481,3625,3,690,Romance
836481,4159,3626,4,2909,Documentary


In [28]:
movie_df[movie_df['user_id']==5501]

Unnamed: 0,user_id,title,counts,movie_id,genre
30393,5501,27,5,1721,Drama|Romance
40621,5501,40,5,1,Animation|Children's|Comedy
60341,5501,58,5,648,Action|Adventure|Mystery
223078,5501,248,5,1704,Drama
298496,5501,385,5,1682,Drama


### 4) Let's Build the CSR Matrix Ourselves

In [29]:
from scipy.sparse import csr_matrix

num_user = movie_df['user_id'].nunique()
num_movie = movie_df['title'].nunique()

print(num_user)
print(num_movie)

csr_data = csr_matrix((movie_df.counts, (movie_df.user_id, movie_df.title)), shape=(num_user, num_movie)) 
csr_data

6040
3628


<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
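To make the `(data, (row, col))` constructor used above concrete, here is a minimal sketch on made-up triples (the arrays `rows`, `cols`, and `vals` are illustrative assumptions). It also shows how to read off the number of stored entries and the density.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triples (hypothetical): user index, movie index, count
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([5, 3, 4, 5])

# shape must cover every index; here indices run from 0 to 2 on both axes
toy_csr = csr_matrix((vals, (rows, cols)), shape=(3, 3))

print(toy_csr.toarray())
print(toy_csr.nnz)                                            # number of stored entries
print(toy_csr.nnz / (toy_csr.shape[0] * toy_csr.shape[1]))    # density
```

For the matrix above, 836,483 stored entries out of 6040 × 3628 cells is a density of roughly 3.8%, which is why a sparse format is used.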

### 5) Let's Configure and Train an AlternatingLeastSquares Model (als_model)

In [30]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# These settings are recommended by the implicit library. They are unrelated to the learning content.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [31]:
# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=200, 
                                    regularization=0.001, 
                                    use_gpu=False, 
                                    iterations=200, 
                                    dtype=np.float32)

In [32]:
# als_model = AlternatingLeastSquares(factors=300, 
#                                     regularization=0.001, 
#                                     use_gpu=False, 
#                                     iterations=500, 
#                                     dtype=np.float32)

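##### AlternatingLeastSquares Parameters

factors : the dimensionality of the user and item vectors

regularization : the strength of the regularization used to prevent overfitting

use_gpu : whether to train on the GPU

iterations : equivalent to epochs; the number of passes over the data during training


Increasing factors and iterations fits the training data more closely but raises the risk of overfitting.

Reference: https://implicit.readthedocs.io/en/latest/als.html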
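To give a feel for what `factors`, `regularization`, and `iterations` control, here is a small dense NumPy sketch of alternating least squares for implicit feedback. It is an illustration under stated assumptions, not the implicit library's code: the function name `implicit_als`, the toy matrix `R`, and the confidence weighting `1 + alpha * r` (a common formulation for implicit feedback) are all assumptions, and the library's optimized implementation may differ in details.

```python
import numpy as np

def implicit_als(R, factors=2, regularization=0.1, alpha=40.0, iterations=10, seed=0):
    """Toy dense ALS: confidence c = 1 + alpha * r, binary preference p = (r > 0)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))   # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors
    C = 1.0 + alpha * R                                  # confidence (assumed form)
    P = (R > 0).astype(float)                            # binary preference
    reg = regularization * np.eye(factors)

    def solve(fixed, C_rows, P_rows):
        # For each row, solve the regularized weighted least-squares problem
        # (F^T C F + reg) x = F^T C p, with the other side's factors F held fixed.
        out = np.empty((C_rows.shape[0], factors))
        for i in range(C_rows.shape[0]):
            weighted = C_rows[i][:, None] * fixed
            A = fixed.T @ weighted + reg
            b = weighted.T @ P_rows[i]
            out[i] = np.linalg.solve(A, b)
        return out

    def loss(X, Y):
        err = np.sum(C * (P - X @ Y.T) ** 2)
        return float(err + regularization * (np.sum(X ** 2) + np.sum(Y ** 2)))

    history = [loss(X, Y)]
    for _ in range(iterations):
        X = solve(Y, C, P)        # update users with items fixed
        Y = solve(X, C.T, P.T)    # update items with users fixed
        history.append(loss(X, Y))
    return X, Y, history

# Toy user x item counts (made-up values)
R = np.array([[5, 3, 0, 0],
              [4, 0, 0, 1],
              [0, 0, 5, 4],
              [0, 1, 4, 0]], dtype=float)
X, Y, history = implicit_als(R, factors=2, regularization=0.1, iterations=10)
print(history[0], history[-1])
```

Each half-step solves its subproblem exactly, so the training loss never increases from one iteration to the next. That is also why more iterations and more factors keep fitting the training data more closely.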

In [33]:
# In implicit 0.4.8 (the version printed above), fit takes an item x user matrix, so we transpose.
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>
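The output above shows that transposing a CSR matrix yields Compressed Sparse Column storage. If CSR storage is wanted explicitly, `tocsr()` converts it back without changing the values. A minimal sketch on a made-up matrix:

```python
from scipy.sparse import csr_matrix

# Toy 2 x 3 matrix (hypothetical values)
toy = csr_matrix([[5, 0, 3], [0, 4, 0]])

toy_t = toy.T                 # transposing a CSR matrix yields CSC storage
toy_t_csr = toy_t.tocsr()     # same values, stored row by row again

print(toy_t.format, toy_t_csr.format)
```

Whether the conversion matters depends on the consumer. It is shown here only as an option, not as something the training call above requires.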

In [34]:
# Train the model
als_model.fit(csr_data_transpose)

  0%|          | 0/200 [00:00<?, ?it/s]

### 6) Pick One of My Five Favorite Movies and One Other Movie, and Check the Preference the Trained Model Predicts for Each

In [35]:
sun, toy_story = user_to_idx['sun'], movie_to_idx['Toy Story (1995)']
# Use the looked-up indices rather than hardcoded numbers.
sun_vector, toy_story_vector = als_model.user_factors[sun], als_model.item_factors[toy_story]


In [36]:
sun_vector

array([ 0.71708894, -0.38025627,  0.01593096, -0.19383731, -0.44145074,
        0.40300167, -0.03232934, -0.38977554,  0.42519888,  0.09044586,
       -0.31636167, -0.20429493,  0.4276552 ,  0.0440283 ,  0.28277367,
       -0.37310007, -0.8273601 ,  0.38614643,  0.65656435,  0.87323076,
       -0.2322492 ,  0.24666548,  0.26140466, -0.4658931 ,  0.25427473,
        0.07191107, -0.28252313,  0.57615614,  0.22179292,  0.57998943,
       -0.6774413 ,  0.6997919 , -0.81087524, -0.12628086,  0.42993435,
       -0.27245447,  0.7536834 , -0.05997403, -0.34592873, -0.8465699 ,
       -0.93866706, -0.447879  , -0.02329869,  0.0340606 , -0.20889953,
       -0.9633595 , -0.10883541,  0.03083906, -0.08684506, -0.7981731 ,
       -0.5820328 , -0.06130525,  0.03057117, -1.0907747 ,  0.1455328 ,
        0.17914018, -0.06515335, -0.23966128, -0.78478557,  0.20961802,
        0.18800996, -0.2830212 ,  0.09912226, -0.6749385 , -0.12630877,
        0.37909988, -0.10164568,  0.3465204 ,  0.8641781 ,  0.06

In [37]:
toy_story_vector

array([ 3.60299908e-02, -1.14280833e-02,  2.59132776e-02,  1.27144502e-02,
       -4.54341201e-03,  1.32459616e-02,  2.03998131e-03,  8.85181874e-03,
       -1.28680402e-02,  3.33399288e-02,  1.39214527e-02,  9.83679201e-04,
        9.94005532e-05,  1.25219510e-03,  2.00155545e-02, -1.18178884e-02,
       -2.62494362e-03, -1.11503797e-02, -1.45894417e-03,  3.34138125e-02,
        1.30263139e-02,  7.21739652e-03, -2.39697807e-02, -9.94939264e-03,
        1.54907415e-02, -7.53495097e-03, -1.15609169e-03,  1.34449685e-02,
       -1.14352563e-02,  1.33629693e-02, -2.57965364e-02, -1.32293664e-02,
       -2.04179920e-02,  1.08773558e-04,  1.06555130e-02,  6.81631675e-04,
        2.78898422e-03,  2.03152858e-02, -1.07598258e-02, -7.30702595e-04,
       -1.36124026e-02,  1.50598511e-02, -7.00580364e-04,  7.56961200e-03,
        6.53643440e-03, -1.34702623e-02, -1.09802727e-02,  1.40256844e-02,
        3.03087831e-02,  2.31005717e-03, -1.39846783e-02, -1.50781237e-02,
        1.43498005e-02,  

In [38]:
# Take the dot product of the two vectors to get the predicted preference.
# Ideally the result is close to 1.
np.dot(sun_vector, toy_story_vector)

0.6872404

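At about 0.69, the result is closer to 1 than I expected.

I tried adjusting the hyperparameters to push it closer to 1, but settled for this value because of the risk of overfitting.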

In [39]:
# Check the predicted preference for Men in Black (1997)
man_in_black = movie_to_idx['Men in Black (1997)']
man_in_black_vector = als_model.item_factors[man_in_black]

print(np.dot(sun_vector, man_in_black_vector))

0.06852578


At about 0.07, the predicted preference for this movie is low.

### 7) Let's Get Recommendations for Movies Similar to My Favorites

In [40]:
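favorite_movie = 'Toy Story (1995)'
# Caution: this passes the raw MovieLens movie_id (1) to similar_items, but the model
# indexes items by movie_to_idx. The pairs below are therefore neighbors of item index 1,
# not of Toy Story (whose index is 40). get_similar_movie in the next cell uses the correct index.
movie_id = movies[movies['title']==favorite_movie]['movie_id']
similar_movie = als_model.similar_items(movie_id.values[0], N=15)
similar_movie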

[(1, 1.0000001),
 (1513, 0.60211444),
 (2182, 0.58826655),
 (35, 0.5792842),
 (951, 0.567573),
 (1986, 0.5664632),
 (3149, 0.56618685),
 (3029, 0.56460816),
 (2023, 0.55343837),
 (1934, 0.5529043),
 (3412, 0.5526461),
 (3587, 0.55250245),
 (2608, 0.55091846),
 (2572, 0.5505611),
 (2002, 0.55041933)]

In [41]:
idx_to_movie = {v:k for k,v in movie_to_idx.items()}

def get_similar_movie(movie_name: str):
    movie_id = movie_to_idx[movie_name]
    similar_movie = als_model.similar_items(movie_id)
    similar_movie = [idx_to_movie[i[0]] for i in similar_movie]
    return similar_movie

['Mission: Impossible (1996)', 'Toy Story (1995)', 'Good Will Hunting (1997)', 'Truman Show, The (1998)', 'Titanic (1997)']

In [42]:
get_similar_movie('Mission: Impossible (1996)') 

['Mission: Impossible (1996)',
 'True Lies (1994)',
 'Before and After (1996)',
 'Battle of the Sexes, The (1959)',
 'Month by the Lake, A (1995)',
 'Stage Fright (1950)',
 'Smiling Fish and Goat on Fire (1999)',
 'Under Suspicion (2000)',
 'Murder! (1930)',
 'Horseman on the Roof, The (Hussard sur le toit, Le) (1995)']

In [43]:
get_similar_movie('Toy Story (1995)') 

['Toy Story (1995)',
 'Toy Story 2 (1999)',
 "Bug's Life, A (1998)",
 'Aladdin (1992)',
 'Lion King, The (1994)',
 'Babe (1995)',
 'Nobody Loves Me (Keiner liebt mich) (1994)',
 'Soft Toilet Seats (1999)',
 'Splendor (1999)',
 "There's Something About Mary (1998)"]

In [44]:
get_similar_movie('Good Will Hunting (1997)') 

['Good Will Hunting (1997)',
 'Truman Show, The (1998)',
 'Apollo 13 (1995)',
 'Shawshank Redemption, The (1994)',
 'Dead Man Walking (1995)',
 'Puppet Master 5: The Final Chapter (1994)',
 'Jerry Maguire (1996)',
 'Pest, The (1997)',
 'Puppet Master 4 (1993)',
 'Story of G.I. Joe, The (1945)']

In [45]:
get_similar_movie('Truman Show, The (1998)')      

['Truman Show, The (1998)',
 'Good Will Hunting (1997)',
 'Apollo 13 (1995)',
 'Fried Green Tomatoes (1991)',
 'People vs. Larry Flynt, The (1996)',
 "Mr. Holland's Opus (1995)",
 'Scent of a Woman (1992)',
 'Jerry Maguire (1996)',
 'Philadelphia (1993)',
 'Pump Up the Volume (1990)']

In [46]:
get_similar_movie('Titanic (1997)')   

['Titanic (1997)',
 'Nightwatch (1997)',
 'Lady of Burlesque (1943)',
 'Nina Takes a Lover (1994)',
 'Coldblooded (1995)',
 'Tales of Terror (1962)',
 'Funeral, The (1996)',
 'Map of the World, A (1999)',
 'Snow Day (2000)',
 'Suture (1993)']

In [47]:
get_similar_movie('Snow White and the Seven Dwarfs (1937)') 

['Snow White and the Seven Dwarfs (1937)',
 'Dumbo (1941)',
 'Cinderella (1950)',
 'Sleeping Beauty (1959)',
 'Pinocchio (1940)',
 'Peter Pan (1953)',
 'Bambi (1942)',
 'Alice in Wonderland (1951)',
 'Jungle Book, The (1967)',
 'Fantasia (1940)']

Looking at the results, movies of a similar genre come up for the most part.
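The self-similarity of about 1.0 in the earlier `similar_items` output is consistent with a cosine similarity over item factors. Assuming that is the measure (an assumption, not confirmed by this notebook), the idea can be sketched by hand. The helper `most_similar` and the array `toy_factors` are hypothetical.

```python
import numpy as np

def most_similar(item_factors, item_idx, n=3):
    """Return (index, cosine similarity) pairs for the n items closest to item_idx."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-10                      # guard against division by zero
    sims = item_factors @ item_factors[item_idx] / (norms * norms[item_idx])
    order = np.argsort(-sims)[:n]                  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy item factors (made-up values): items 0 and 1 point in nearly the same direction
toy_factors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.0, 1.0],
                        [-1.0, 0.0]])
result = most_similar(toy_factors, 0)
print(result)
```

The query item always ranks first with similarity 1, which is why each list above starts with the movie that was passed in.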

### 8) Let's Get Recommendations for Movies I Am Most Likely to Enjoy

The recommend method of the AlternatingLeastSquares class recommends movies the user is likely to enjoy.
filter_already_liked_items is the argument that excludes items the user has already rated.

In [48]:
user = user_to_idx['sun']
# recommend takes the user x item CSR matrix.
movie_recommended = als_model.recommend(user, csr_data, N=20, filter_already_liked_items=True)
movie_recommended

[(50, 0.37316483),
 (384, 0.37101665),
 (39, 0.35313624),
 (160, 0.24989845),
 (323, 0.18624541),
 (474, 0.185342),
 (641, 0.1830162),
 (18, 0.18208523),
 (317, 0.17956863),
 (482, 0.17513698),
 (374, 0.17433631),
 (533, 0.17371057),
 (4, 0.17092821),
 (128, 0.16973406),
 (38, 0.16782218),
 (330, 0.16535634),
 (124, 0.16476962),
 (157, 0.16277036),
 (752, 0.1623801),
 (1931, 0.160429)]
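The ranking above can be approximated by hand: score every item with the dot product of the user vector and the item vectors, mask the items the user already rated, and keep the top N. The sketch below uses made-up vectors, and `recommend_by_hand` is a hypothetical helper rather than part of implicit.

```python
import numpy as np

def recommend_by_hand(user_vector, item_factors, liked_items, n=2):
    """Score every item by dot product, drop already-liked items, return the top n."""
    scores = (item_factors @ user_vector).astype(float)
    scores[list(liked_items)] = -np.inf            # filter already-liked items
    top = np.argsort(-scores)[:n]                  # highest score first
    return [(int(i), float(scores[i])) for i in top]

toy_user = np.array([1.0, 0.5])
toy_items = np.array([[0.9, 0.1],    # item 0 (already liked)
                      [0.8, 0.2],    # item 1
                      [0.1, 0.9],    # item 2
                      [-0.5, 0.0]])  # item 3
recs = recommend_by_hand(toy_user, toy_items, liked_items={0})
print(recs)
```

Item 0 has the highest raw score but is excluded, which mirrors what `filter_already_liked_items=True` is described as doing.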

In [49]:
[idx_to_movie[i[0]] for i in movie_recommended]

['Toy Story 2 (1999)',
 'Jerry Maguire (1996)',
 'Apollo 13 (1995)',
 'Forrest Gump (1994)',
 'Dead Man Walking (1995)',
 'Chicken Run (2000)',
 "Mr. Holland's Opus (1995)",
 'Awakenings (1990)',
 'Twelve Monkeys (1995)',
 'Mission: Impossible 2 (2000)',
 'Exorcist, The (1973)',
 'Fried Green Tomatoes (1991)',
 "Bug's Life, A (1998)",
 'French Connection, The (1971)',
 'Sixth Sense, The (1999)',
 'Lion King, The (1994)',
 'Matrix, The (1999)',
 'Shawshank Redemption, The (1994)',
 'Wag the Dog (1997)',
 "Miller's Crossing (1990)"]

Checking how much each movie contributed to this recommendation

In [50]:
toy_story = movie_to_idx['Toy Story 2 (1999)']
explain = als_model.explain(user, csr_data, itemid=toy_story)

In [51]:
[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('Toy Story (1995)', 0.37958921699711745),
 ('Mission: Impossible (1996)', 0.028364613478480835),
 ('Titanic (1997)', 0.013372094524969111),
 ('Good Will Hunting (1997)', -0.0239930970804797),
 ('Truman Show, The (1998)', -0.024708706626603103)]

As expected, the biggest contributor to the Toy Story 2 (1999) recommendation was Toy Story (1995).

In [52]:
star_wars = movie_to_idx['Star Wars: Episode IV - A New Hope (1977)']
explain = als_model.explain(user, csr_data, itemid=star_wars)

In [53]:
[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('Titanic (1997)', 0.0498432373380026),
 ('Mission: Impossible (1996)', 0.042102868319266226),
 ('Truman Show, The (1998)', 0.014873838690940215),
 ('Toy Story (1995)', -0.00015347755522957971),
 ('Good Will Hunting (1997)', -0.03781159030677697)]

Surprisingly, Titanic was among the movies that contributed to recommending Star Wars.

# Trying Different Movies ===========================

### 1) Data Preparation and Preprocessing

In [54]:
import pandas as pd
import os

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'ratings', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [55]:
# Keep only ratings of 3 or higher.
ratings = ratings[ratings['ratings']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [56]:
# Rename the ratings column to counts.
ratings.rename(columns={'ratings':'counts'}, inplace=True)

In [57]:
ratings['counts']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: counts, Length: 836478, dtype: int64

In [58]:
# Load the metadata so that movie titles can be displayed.
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [59]:
# movies['genre'].unique()

### 2) Let's Analyze the Data
Number of unique movies in ratings

Number of unique users in ratings

The 30 most popular movies (in order of popularity)

In [60]:
# Number of unique movies in ratings
ratings['movie_id'].nunique()

3628

In [61]:
# Number of unique users in ratings
ratings['user_id'].nunique()

6039

In [62]:
# Merge movies and ratings
movie_df = pd.merge(ratings, movies, on='movie_id')
movie_df

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...
836473,5851,3607,5,957756608,One Little Indian (1973),Comedy|Drama|Western
836474,5854,3026,4,958346883,Slaughterhouse (1987),Horror
836475,5854,690,3,957744257,"Promise, The (Versprechen, Das) (1994)",Romance
836476,5938,2909,4,957273353,"Five Wives, Three Secretaries and Me (1998)",Documentary


In [63]:
# movie_df['title'] = movie_df['title'].str.lower() # convert movie titles to lowercase
# movie_df.head(10)

In [64]:
# The 30 most popular movies (in order of popularity)
movie_count = movie_df.groupby('title')['user_id'].count()
movie_count.sort_values(ascending=False).head(30)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

### 3) Let's Pick Five Favorite Movies and Add Them to ratings

This time, add four movies from the romance genre plus one from a different genre to ratings.

In [65]:
movie_df[movie_df['genre'].str.lower().str.contains('romance')]

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
2123,1,914,3,978301968,My Fair Lady (1964),Musical|Romance
2124,6,914,5,978237767,My Fair Lady (1964),Musical|Romance
2125,10,914,5,978226805,My Fair Lady (1964),Musical|Romance
2126,33,914,5,978108939,My Fair Lady (1964),Musical|Romance
2127,35,914,3,978101982,My Fair Lady (1964),Musical|Romance
...,...,...,...,...,...,...
836391,4169,1071,4,979527174,For the Moment (1994),Romance|War
836392,5281,1071,3,1002834556,For the Moment (1994),Romance|War
836393,5734,1071,5,960828361,For the Moment (1994),Romance|War
836444,4842,3748,4,1010087721,"Match, The (1999)",Comedy|Romance


In [66]:
movie_df[movie_df['title'].str.lower().str.contains('my fair')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
2123,1,914,3,978301968,My Fair Lady (1964),Musical|Romance


In [67]:
movie_df[movie_df['title'].str.lower().str.contains('for the mom')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
836390,3963,1071,3,977982833,For the Moment (1994),Romance|War


In [68]:
movie_df[movie_df['title'].str.lower().str.contains('promise')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
821891,225,1533,4,976832220,"Promise, The (La Promesse) (1996)",Drama


In [69]:
movie_df[movie_df['title'].str.lower().str.contains('match,')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
836444,4842,3748,4,1010087721,"Match, The (1999)",Comedy|Romance


In [70]:
# Add one movie from a different genre

movie_df[movie_df['title'].str.lower().str.contains('slappy and the')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
836466,5420,1843,3,960156505,Slappy and the Stinkers (1998),Children's|Comedy


In [71]:
movie_df[movie_df['title'].str.lower().str.contains('sound')].head(1)

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
15006,1,1035,5,978301753,"Sound of Music, The (1965)",Musical


In [72]:
# Feel free to swap in your own favorite movies. The titles must match the dataset exactly.

my_favorite = ['My Fair Lady (1964)' , 'For the Moment (1994)' ,\
               'Promise, The (La Promesse) (1996)' ,'Match, The (1999)' ,'Slappy and the Stinkers (1998)']

favorite_movie_id = movies[movies['title'].isin(my_favorite)]

my_list = pd.DataFrame({'user_id': ['kim']*5, 'movie_id': favorite_movie_id['movie_id'], 'counts':[5]*5} )

# Add the new user only once, so re-running the cell does not duplicate rows.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
if not ratings.isin({'user_id':['kim']})['user_id'].any():
    ratings = pd.concat([ratings, my_list])

ratings.tail(20)

Unnamed: 0,user_id,movie_id,counts,timestamp
1000192,6040,2019,5,956703977.0
1000193,6040,541,4,956715288.0
1000194,6040,1077,5,964828799.0
1000196,6040,549,4,956704746.0
1000197,6040,2020,3,956715288.0
1000198,6040,2021,3,956716374.0
1000199,6040,2022,5,956716207.0
1000200,6040,2028,5,956704519.0
1000201,6040,1080,4,957717322.0
1000202,6040,1089,4,956704996.0


In [73]:
movies[movies['title'].isin(my_favorite)]

Unnamed: 0,movie_id,title,genre
902,914,My Fair Lady (1964),Musical|Romance
1057,1071,For the Moment (1994),Romance|War
1496,1533,"Promise, The (La Promesse) (1996)",Drama
1774,1843,Slappy and the Stinkers (1998),Children's|Comedy
3679,3748,"Match, The (1999)",Comedy|Romance


In [74]:
favorite_movie_id

Unnamed: 0,movie_id,title,genre
902,914,My Fair Lady (1964),Musical|Romance
1057,1071,For the Moment (1994),Romance|War
1496,1533,"Promise, The (La Promesse) (1996)",Drama
1774,1843,Slappy and the Stinkers (1998),Children's|Comedy
3679,3748,"Match, The (1999)",Comedy|Romance


In [75]:
movie_df = pd.merge(ratings, movies, on='movie_id')
movie_df

Unnamed: 0,user_id,movie_id,counts,timestamp,title,genre
0,1,1193,5,9.783008e+08,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,9.782984e+08,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,9.782202e+08,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,9.781993e+08,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,9.781585e+08,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...
836478,5851,3607,5,9.577566e+08,One Little Indian (1973),Comedy|Drama|Western
836479,5854,3026,4,9.583469e+08,Slaughterhouse (1987),Horror
836480,5854,690,3,9.577443e+08,"Promise, The (Versprechen, Das) (1994)",Romance
836481,5938,2909,4,9.572734e+08,"Five Wives, Three Secretaries and Me (1998)",Documentary


In [76]:
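# Keep only the columns we use. .copy() makes movie_df an independent DataFrame,
# so the column assignments further down do not trigger SettingWithCopyWarning.
movie_df = movie_df[['user_id','title', 'counts','movie_id','genre']].copy()
movie_df.sort_index()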

Unnamed: 0,user_id,title,counts,movie_id,genre
0,1,One Flew Over the Cuckoo's Nest (1975),5,1193,Drama
1,2,One Flew Over the Cuckoo's Nest (1975),5,1193,Drama
2,12,One Flew Over the Cuckoo's Nest (1975),4,1193,Drama
3,15,One Flew Over the Cuckoo's Nest (1975),4,1193,Drama
4,17,One Flew Over the Cuckoo's Nest (1975),5,1193,Drama
...,...,...,...,...,...
836478,5851,One Little Indian (1973),5,3607,Comedy|Drama|Western
836479,5854,Slaughterhouse (1987),4,3026,Horror
836480,5854,"Promise, The (Versprechen, Das) (1994)",3,690,Romance
836481,5938,"Five Wives, Three Secretaries and Me (1998)",4,2909,Documentary


In [77]:
# Collect the unique users and movie titles
user_unique = movie_df['user_id'].unique()
movie_unique = movie_df['title'].unique()

# Map each user and each movie title to an integer index
user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [78]:
# Check that the indexing worked.
print(user_to_idx['kim'])    
print(movie_to_idx['Slappy and the Stinkers (1998)'])
print(movie_to_idx['My Fair Lady (1964)'])

2219
3617
2
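
As a cross-check on the dictionary-based indexing above, `pd.factorize` assigns integer codes in order of first appearance, which is the same order that `unique()` followed by `enumerate` produces. A minimal sketch on made-up data (the `toy_users` series and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for movie_df['user_id']
toy_users = pd.Series(['u10', 'kim', 'u10', 'u42'])

# Dictionary approach, as used in this notebook
toy_user_to_idx = {v: k for k, v in enumerate(toy_users.unique())}
dict_codes = toy_users.map(toy_user_to_idx.get).tolist()

# pd.factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(toy_users)

print(dict_codes)
print(codes.tolist(), list(uniques))
```

Either approach gives a dense 0-based index; the dictionary version has the advantage that the mapping can be reused later to look up a single name.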


In [79]:
temp_user_data = movie_df['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(movie_df):   # if every row was indexed successfully
    print('user_id column indexing OK!!')
    movie_df['user_id'] = temp_user_data   # replace movie_df['user_id'] with the indexed Series
else:
    print('user_id column indexing Fail!!')


temp_movie_data = movie_df['title'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(movie_df):
    print('title column indexing OK!!')
    movie_df['title'] = temp_movie_data
else:
    print('title column indexing Fail!!')

movie_df

user_id column indexing OK!!
title column indexing OK!!


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_df['user_id'] = temp_user_data   # replace movie_df['user_id'] with the indexed Series
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_df['title'] = temp_movie_data


Unnamed: 0,user_id,title,counts,movie_id,genre
0,0,0,5,1193,Drama
1,1,0,5,1193,Drama
2,2,0,4,1193,Drama
3,3,0,4,1193,Drama
4,4,0,5,1193,Drama
...,...,...,...,...,...
836478,1621,3623,5,3607,Comedy|Drama|Western
836479,3482,3624,4,3026,Horror
836480,3482,3625,3,690,Romance
836481,4160,3626,4,2909,Documentary
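
The warning printed in the output above is pandas' SettingWithCopyWarning. It most likely appears because `movie_df` was created by selecting columns from another DataFrame and a column was then assigned on the result. A minimal sketch of one way to avoid it, assuming an explicit `.copy()` at the selection step is acceptable (all `toy_*` names are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for the merged movie_df
toy_df = pd.DataFrame({'user_id': ['a', 'b', 'a'],
                       'counts': [5, 3, 4],
                       'extra': [0, 0, 0]})
toy_map = {'a': 0, 'b': 1}

# .copy() makes the column selection an independent DataFrame,
# so the assignment below clearly targets toy_sub and nothing else.
toy_sub = toy_df[['user_id', 'counts']].copy()
toy_sub['user_id'] = toy_sub['user_id'].map(toy_map.get)

print(toy_sub)
```

The indexing result is the same either way; the copy only removes the ambiguity that pandas warns about.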


In [80]:
movie_df[movie_df['user_id']==2219]

Unnamed: 0,user_id,title,counts,movie_id,genre
2734,2219,2,5,914,Musical|Romance
821915,2219,2733,5,1533,Drama
836396,2219,3566,5,1071,Romance|War
836448,2219,3600,5,3748,Comedy|Romance
836471,2219,3617,5,1843,Children's|Comedy


### 4) Building the CSR matrix ourselves

In [81]:
from scipy.sparse import csr_matrix

num_user = movie_df['user_id'].nunique()
num_movie = movie_df['title'].nunique()

print(num_user)
print(num_movie)

csr_data = csr_matrix((movie_df.counts, (movie_df.user_id, movie_df.title)), shape=(num_user, num_movie)) 
csr_data

6040
3628


<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
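
To make the `csr_matrix((data, (row, col)), shape=...)` call above concrete, here is a small sketch on made-up numbers. It also shows one behaviour worth knowing: when the same (row, col) pair appears more than once, SciPy sums the values. The `toy_*` arrays are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy interactions: (user index, movie index, count)
toy_rows = np.array([0, 0, 1, 0])
toy_cols = np.array([0, 2, 1, 0])   # the pair (0, 0) appears twice
toy_data = np.array([5, 3, 4, 1])

toy_matrix = csr_matrix((toy_data, (toy_rows, toy_cols)), shape=(2, 3))

# Duplicate coordinates are summed: entry (0, 0) becomes 5 + 1 = 6
print(toy_matrix.toarray())
```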

In [82]:
# movie_df.genre

### 5) Building and training the als_model = AlternatingLeastSquares model ourselves

In [83]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# These settings are recommended by the implicit library. They are unrelated to what the model learns.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [85]:
# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=200, 
                                    regularization=0.01, 
                                    use_gpu=False, 
                                    iterations=100, 
                                    dtype=np.float32)

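##### AlternatingLeastSquares parameters

factors : the number of dimensions of the user and item vectors

regularization : how strongly to regularize in order to prevent overfitting

use_gpu : whether to use the GPU

iterations : plays the same role as epochs, i.e. how many times to repeat training over the data


Increasing factors and iterations lets the model fit the training data more closely, but raises the risk of overfitting.

Reference: https://implicit.readthedocs.io/en/latest/als.html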
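
To show where `factors`, `regularization`, and `iterations` enter the computation, here is a dense toy sketch of implicit-feedback ALS. It is not the implicit library's implementation: it assumes the common formulation with binary preference `p = (r > 0)` and confidence `c = 1 + alpha * r`, and the function name, the `alpha` parameter, and the toy matrix are all hypothetical.

```python
import numpy as np

def toy_implicit_als(R, factors=2, regularization=0.01, iterations=5, alpha=1.0, seed=0):
    """Dense toy ALS on a (users x items) count matrix R. Illustration only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)      # binary preference
    C = 1.0 + alpha * R            # confidence (assumed form)
    X = rng.normal(scale=0.1, size=(n_users, factors))   # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors
    reg = regularization * np.eye(factors)
    losses = []
    for _ in range(iterations):
        # Fix item factors, solve a small regularized least-squares problem per user
        for u in range(n_users):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + reg, Y.T @ Cu @ P[u])
        # Fix user factors, solve per item
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + reg, X.T @ Ci @ P[:, i])
        # Weighted squared error plus L2 penalty, recorded after each full sweep
        loss = np.sum(C * (P - X @ Y.T) ** 2) + regularization * (np.sum(X ** 2) + np.sum(Y ** 2))
        losses.append(loss)
    return X, Y, losses

toy_R = np.array([[5, 0, 3, 0],
                  [4, 0, 0, 1],
                  [0, 5, 0, 4]])
toy_X, toy_Y, toy_losses = toy_implicit_als(toy_R)
print(toy_X.shape, toy_Y.shape)
```

Because each half-step solves its sub-problem exactly, the training loss cannot go up from one sweep to the next. That is also why a low training loss on its own says little about overfitting.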

In [86]:
# In this implicit version (0.4.8) the ALS model takes an (item x user) matrix as input, so we transpose.
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>

In [87]:
# Train the model
als_model.fit(csr_data_transpose)

  0%|          | 0/100 [00:00<?, ?it/s]

### 6) Pick one of my five favorite movies and one other movie, and check the preference the trained model predicts for me.

In [88]:
# user_to_idx['kim']

In [89]:
# movie_to_idx['Slappy and the Stinkers (1998)']

In [90]:
kim, kim_movie = user_to_idx['kim'], movie_to_idx['My Fair Lady (1964)']
kim_vector, movie_vector = als_model.user_factors[kim], als_model.item_factors[kim_movie]


In [91]:
kim_vector

array([-0.20605858, -0.06283157,  0.03412718,  0.08855617, -0.03519141,
       -0.28142884,  0.0495424 , -0.03593495, -0.04623179, -0.04344502,
        0.08752853, -0.16924731, -0.0083573 ,  0.06572488,  0.012591  ,
       -0.02700183,  0.11106641, -0.12316197,  0.17232835, -0.081228  ,
       -0.01645732,  0.0386376 , -0.14772244,  0.18051946,  0.21584997,
       -0.10533764, -0.09912565,  0.04822126, -0.15152925, -0.05778204,
        0.00151217,  0.10190024,  0.10197041, -0.16941386, -0.19247784,
        0.07332939, -0.06787756, -0.07265681,  0.01219254, -0.12365996,
       -0.0689201 ,  0.09858631,  0.22968547,  0.04356759,  0.1369473 ,
        0.25303757, -0.02208704, -0.13642153,  0.18216108,  0.04612209,
       -0.01175025, -0.02385994, -0.06611299,  0.04574423,  0.14921741,
        0.2937601 ,  0.08922273,  0.04743386,  0.05490373,  0.00932417,
        0.06482359, -0.12206008,  0.02173217, -0.17360656, -0.0432691 ,
        0.1031834 ,  0.10750628, -0.09254893,  0.10777434, -0.00

In [92]:
movie_vector

array([-2.65011247e-02,  7.58511305e-05,  1.65028535e-02,  9.85199213e-03,
       -1.24442074e-02, -4.66007404e-02,  9.95430071e-03,  3.48455063e-03,
        1.23614809e-02, -1.33292388e-05,  2.41027065e-02, -2.78550182e-02,
        1.49657028e-02,  1.05674518e-03, -5.59552293e-03,  2.12476356e-03,
        2.60917656e-02, -2.29192083e-03,  4.72226739e-02, -1.61901303e-02,
       -1.75444130e-02,  1.59168988e-02, -1.07138073e-02,  3.41061018e-02,
        5.41984923e-02,  6.18128618e-03, -1.63871925e-02,  1.36130163e-02,
       -1.92218814e-02, -2.28691623e-02, -3.09375580e-03,  3.58125009e-02,
        1.52987977e-02, -3.99740972e-02, -4.63870391e-02,  1.85647197e-02,
        6.61277073e-03, -1.05918739e-02, -2.21677823e-03, -2.18565725e-02,
       -3.18605034e-03,  6.40293909e-03,  4.76524904e-02,  8.37355573e-03,
        4.86708470e-02,  6.45036623e-02,  2.87482841e-03, -2.26217620e-02,
        4.35157157e-02,  1.42362434e-02,  4.87814005e-03, -4.91093192e-03,
       -1.94413736e-02,  

In [93]:
# kim_vector = kim_vector.to_numpy()
# movie_vector = movie_vector.to_numpy()

In [94]:
kim_vector.shape, movie_vector.shape

((200,), (200,))

In [95]:
# Take the dot product of the two vectors to get the predicted preference
# Ideally this would come out close to 1
# np.dot(kim_vector, movie_vector.T)
np.dot(kim_vector, movie_vector)

0.44006246

In [96]:
# Check the predicted preference for Sound of Music, The (1965)
sound = movie_to_idx['Sound of Music, The (1965)']
sound_vector = als_model.item_factors[sound]

print(np.dot(kim_vector, sound_vector))

0.22403917


In [97]:
# Check the predicted preference for Toy Story
toy_story = movie_to_idx['Toy Story (1995)']
toy_story_vector = als_model.item_factors[toy_story]

print(np.dot(kim_vector, toy_story_vector))

0.0006694715


In [99]:
# Check the predicted preference for Matrix, The (1999)
mat = movie_to_idx['Matrix, The (1999)']
mat_vector = als_model.item_factors[mat]

print(np.dot(kim_vector, mat_vector))

-0.018160684


Perhaps because many of my picks are romance films, Toy Story and The Matrix received much lower scores.

### 7) Getting recommendations for movies similar to one I like

In [109]:
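favorite_movie = 'My Fair Lady (1964)'
# Caution: this looks up the MovieLens movie_id (914), but similar_items expects the
# model's internal item index (movie_to_idx[favorite_movie], which is 2). The ids
# printed below are therefore neighbours of internal index 914, not of My Fair Lady.
movie_id = movies[movies['title']=='My Fair Lady (1964)']['movie_id']
similar_movie = als_model.similar_items(movie_id.values[0], N=15)
similar_movie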

[(914, 0.9999999),
 (2932, 0.7599311),
 (2795, 0.7207423),
 (2798, 0.7197157),
 (1613, 0.71389264),
 (2152, 0.7026763),
 (2084, 0.6989927),
 (1609, 0.69603825),
 (2635, 0.6907355),
 (1043, 0.6888197),
 (1612, 0.687604),
 (2792, 0.68331474),
 (1614, 0.68091136),
 (1606, 0.6779424),
 (1101, 0.6750742)]

In [110]:
idx_to_movie = {v:k for k,v in movie_to_idx.items()}

def get_similar_movie(movie_name: str):
    movie_id = movie_to_idx[movie_name]
    similar_movie = als_model.similar_items(movie_id)
    similar_movie = [idx_to_movie[i[0]] for i in similar_movie]
    return similar_movie

# Titles to try: 'My Fair Lady (1964)', 'For the Moment (1994)', 'Promise, The (La Promesse) (1996)',
#                'Match, The (1999)', 'Slappy and the Stinkers (1998)'

In [111]:
get_similar_movie('Slappy and the Stinkers (1998)') 

['Slappy and the Stinkers (1998)',
 'Thumbelina (1994)',
 'King and I, The (1999)',
 'Return of Jafar, The (1993)',
 'Rugrats Movie, The (1998)',
 'Theodore Rex (1995)',
 'Swan Princess, The (1994)',
 'Flipper (1996)',
 'Pagemaster, The (1994)',
 "Doug's 1st Movie (1999)"]

In [112]:
get_similar_movie('My Fair Lady (1964)') 

['My Fair Lady (1964)',
 'West Side Story (1961)',
 'King and I, The (1956)',
 'Gigi (1958)',
 'Sound of Music, The (1965)',
 'American in Paris, An (1951)',
 'Slappy and the Stinkers (1998)',
 "Singin' in the Rain (1952)",
 'Oliver! (1968)',
 'South Pacific (1958)']

In [113]:
get_similar_movie('Match, The (1999)') 

['Match, The (1999)',
 'Skipped Parts (2000)',
 'All That Jazz (1979)',
 'Cabaret (1972)',
 'Battleship Potemkin, The (Bronenosets Potyomkin) (1925)',
 'Third Miracle, The (1999)',
 "Antonia's Line (Antonia) (1995)",
 'Dry Cleaning (Nettoyage à sec) (1997)',
 'Soapdish (1991)',
 'Excalibur (1981)']
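
The lists above always start with the query movie itself. This is consistent with `similar_items` ranking items by cosine similarity between item factor vectors, which as far as I understand is what implicit 0.4.x does. A minimal sketch of that idea on made-up 2-dimensional factors (the function name and the toy vectors are hypothetical, and the library's internals may differ in detail):

```python
import numpy as np

def similar_items_sketch(item_factors, item_idx, n=3):
    """Rank items by cosine similarity to item_idx. Illustration only."""
    norms = np.linalg.norm(item_factors, axis=1)
    scores = item_factors @ item_factors[item_idx] / (norms * norms[item_idx])
    best = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical toy item factors (4 items, 2 factors)
toy_item_factors = np.array([[1.0, 0.0],
                             [0.9, 0.1],
                             [0.0, 1.0],
                             [-1.0, 0.0]])

toy_similar = similar_items_sketch(toy_item_factors, 0, n=3)
print(toy_similar)
```

The query item has similarity 1 with itself, so it comes first; skipping the first entry gives the actual neighbours.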

### 8) Getting recommendations for the movies I am most likely to enjoy

The recommend method implemented in the AlternatingLeastSquares class returns movies I am likely to enjoy.
filter_already_liked_items is the argument that excludes items the user has already rated.

In [114]:
user = user_to_idx['kim']
# recommend takes the user x item CSR matrix.
movie_recommended = als_model.recommend(user, csr_data, N=20, filter_already_liked_items=True)
movie_recommended

[(354, 0.2973019),
 (14, 0.22403914),
 (663, 0.18238977),
 (1355, 0.16978788),
 (1569, 0.114909455),
 (45, 0.11430973),
 (11, 0.11055977),
 (161, 0.10839946),
 (566, 0.099158324),
 (189, 0.09638801),
 (463, 0.09624775),
 (539, 0.09578668),
 (236, 0.09531787),
 (367, 0.09175623),
 (357, 0.0887589),
 (1844, 0.088627286),
 (486, 0.08802891),
 (369, 0.08782424),
 (360, 0.08746846),
 (47, 0.08563249)]

In [115]:
[idx_to_movie[i[0]] for i in movie_recommended]

['West Side Story (1961)',
 'Sound of Music, The (1965)',
 "Singin' in the Rain (1952)",
 'King and I, The (1956)',
 'American in Paris, An (1951)',
 'Mary Poppins (1964)',
 'Gigi (1958)',
 'Graduate, The (1967)',
 "Breakfast at Tiffany's (1961)",
 'Breakfast Club, The (1985)',
 'Frequency (2000)',
 'Mystery Science Theater 3000: The Movie (1996)',
 'Speed (1994)',
 'Grease (1978)',
 'Sense and Sensibility (1995)',
 'Oliver! (1968)',
 'Remember the Titans (2000)',
 'Witness (1985)',
 'Dirty Dancing (1987)',
 'To Kill a Mockingbird (1962)']
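
Conceptually, the recommendation scores are the same dot products computed by hand in step 6, taken against every item at once, with already-rated items removed before picking the top N. A minimal sketch under that assumption (the function name and toy vectors are hypothetical; the library's own implementation is more optimized):

```python
import numpy as np

def recommend_sketch(user_vector, item_factors, liked_idx, n=2):
    """Score every item, drop already-liked ones, return the top n. Illustration only."""
    scores = (item_factors @ user_vector).astype(float)
    scores[list(liked_idx)] = -np.inf          # mimic filter_already_liked_items=True
    best = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical toy user and item factors
toy_user = np.array([1.0, 0.5])
toy_items = np.array([[1.0, 0.0],
                      [0.8, 0.4],
                      [0.0, 1.0],
                      [-1.0, 0.0]])

toy_recs = recommend_sketch(toy_user, toy_items, liked_idx=[0], n=2)
print(toy_recs)
```

This matches what the output above shows: Sound of Music appears with a score of about 0.224, the same value as the dot product computed for it in step 6.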

Most of the recommendations were musicals, romances, and similar films.

Checking how much each movie contributed to this recommendation

In [118]:
sound = movie_to_idx['Sound of Music, The (1965)']
explain = als_model.explain(user, csr_data, itemid=sound)

In [119]:
[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('My Fair Lady (1964)', 0.20726437622161747),
 ('Promise, The (La Promesse) (1996)', 0.010868939318620225),
 ('Slappy and the Stinkers (1998)', 0.0023767778471358338),
 ('Match, The (1999)', 0.0016642050414196513),
 ('For the Moment (1994)', 0.0010068677968627046)]

The largest contribution to recommending Sound of Music came from My Fair Lady.
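
Why a predicted score can be split into per-movie contributions at all: in the standard implicit-feedback ALS formulation, a user's vector is a closed-form function of the items they interacted with, so the score for a target item is a sum with one term per liked item. The sketch below checks that identity on made-up numbers. It assumes confidence `c = 1 + count` and is not the implicit library's `explain` code; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_Y = rng.normal(size=(6, 3))              # hypothetical item factors (6 items, 3 factors)
toy_counts = np.array([5, 0, 5, 0, 0, 5])    # one user's counts over the 6 items
lam = 0.01                                   # regularization

c = 1.0 + toy_counts                         # confidence (assumed form)
p = (toy_counts > 0).astype(float)           # binary preference

# Closed-form user vector: x_u = (Y^T C Y + lam*I)^-1 Y^T C p
W = np.linalg.inv(toy_Y.T @ np.diag(c) @ toy_Y + lam * np.eye(3))
x_u = W @ toy_Y.T @ (c * p)

# Score for a target item, split into one contribution per item the user liked
target = 1
contributions = (toy_Y[target] @ W @ toy_Y.T) * c * p
total_score = float(toy_Y[target] @ x_u)

print(contributions, total_score)
```

Items the user never interacted with contribute exactly zero, and the remaining terms add up to the predicted score.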

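# Retrospective

- **What was difficult** in this project

I was told that changing factors and iterations can make the model fit the training data better, and that this can lead to overfitting. It occurred to me that even a score close to 1 might be a sign of overfitting.

Even when the scores were lower than I expected, the movie recommendations were not bad.

So I tried several values, but I could not judge whether the highest score is really the most appropriate one, and I regret stopping at a point that merely looked reasonable.


- **What I learned** or **what is still unclear** while working on the project

I need to look further into whether there is a way to tell if overfitting has occurred.

- **What I tried** in order to meet the rubric criteria

At first I added movies from a variety of genres. On the second attempt I picked movies from similar genres to see whether the model would find similar movies.

- If the rubric criteria were **not met, my guess at the reason**

It bothers me that the similarity scores are not very high, but I could not tell whether the model was overfitting, and I thought the results were not bad.

- **Resolution**

I would like to find out more about how to tell whether overfitting has occurred.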
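
One common way to check for overfitting, which I did not try in this notebook, is to hold out part of the interactions before training and then see whether the model still ranks the held-out movies highly. A minimal sketch of the splitting step only, using SciPy and NumPy (the function name, `test_fraction`, and the toy matrix are hypothetical):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def split_interactions(user_items, test_fraction=0.2, seed=0):
    """Randomly assign each stored interaction to either train or test."""
    coo = coo_matrix(user_items)
    rng = np.random.default_rng(seed)
    is_test = rng.random(coo.nnz) < test_fraction

    def build(mask):
        return csr_matrix((coo.data[mask], (coo.row[mask], coo.col[mask])),
                          shape=coo.shape)

    return build(~is_test), build(is_test)

# Hypothetical toy user x item matrix
toy_user_items = csr_matrix(np.array([[5, 0, 3, 4],
                                      [0, 4, 0, 5],
                                      [3, 5, 4, 0]]))
toy_train, toy_test = split_interactions(toy_user_items, test_fraction=0.3)
print(toy_train.nnz, toy_test.nnz)
```

The model would then be fitted on the train matrix only, with different factors and iterations. If the scores on training interactions keep rising while the held-out movies stop appearing near the top of the recommendations, that is a sign of overfitting.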
