# EX9_Finding Other Artists That IU Fans Might Like

********************
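> + Evaluation criteria
> 1. The CSR matrix was built correctly.
> 2. The MF model trained properly and produced plausible recommendations.
> 3. Finding similar movies and recommending movies to a user both worked as expected.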
**************************

- mkdir -p ~/aiffel/recommendata_iu/data/ml-1m
- ln -s ~/data/ml-1m/* ~/aiffel/recommendata_iu/data/ml-1m

## 1. Data Preparation and Preprocessing

In [44]:
import pandas as pd
import os

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [45]:
ratings = ratings[ratings['rating']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


- Keep only ratings of 3 or higher (83.63% of the data remains).

In [46]:
ratings.rename(columns={'rating':'count'}, inplace=True)
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

- Rename the rating column to count.

In [47]:
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [48]:
ratings = ratings.join(movies.set_index('movie_id'), on='movie_id')
ratings.head()

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy


- Join the ratings dataframe with the movies dataframe.
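
The join step can be tried on a small example. The sketch below uses two hypothetical miniature tables (`toy_ratings` and `toy_movies` are invented for illustration, not taken from ml-1m) to show that `DataFrame.join` performs a left join by default, so a rating whose movie_id has no match keeps its row but gets a missing title.

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and movies tables.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 30],
    "count": [5, 3, 4],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A (1990)", "Movie B (1995)"],
})

# DataFrame.join is a left join by default: every rating row is kept,
# and titles are looked up through the movie_id index of toy_movies.
joined = toy_ratings.join(toy_movies.set_index("movie_id"), on="movie_id")

# movie_id 30 has no entry in toy_movies, so its title is NaN.
missing_titles = int(joined["title"].isna().sum())
print(joined)
print("rows without a title:", missing_titles)
```

Checking for missing titles after the join is a cheap way to confirm that every movie_id in the ratings actually exists in the movies table.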

In [49]:
ratings = ratings.drop(columns=['timestamp', 'genre'])
# Drop the timestamp and genre columns from the table
ratings.head()

Unnamed: 0,user_id,movie_id,count,title
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975)
1,1,661,3,James and the Giant Peach (1996)
2,1,914,3,My Fair Lady (1964)
3,1,3408,4,Erin Brockovich (2000)
4,1,2355,5,"Bug's Life, A (1998)"


## 2. Data Analysis

In [50]:
print('# of movie_id: ', ratings['movie_id'].nunique())
print('# of title   : ', ratings['title'].nunique())
print('# of user_id : ', ratings['user_id'].nunique())

# of movie_id:  3628
# of title   :  3628
# of user_id :  6039


In [51]:
movie_count = ratings.groupby('title')['user_id'].count()
movie_count.sort_values(ascending=False).head(10)

title
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Name: user_id, dtype: int64

In [52]:
user_count = ratings.groupby('user_id')['movie_id'].count()
user_count.describe()

count    6039.000000
mean      138.512668
std       156.241599
min         1.000000
25%        38.000000
50%        81.000000
75%       177.000000
max      1968.000000
Name: movie_id, dtype: float64

- Number of movies watched per user

## 3. Adding My Favorite Movies

In [53]:

movies[movies['title'].str.lower().str.contains('blood', regex=False)]

Unnamed: 0,movie_id,title,genre
390,394,Coldblooded (1995),Action
607,611,Hellraiser: Bloodline (1996),Action|Horror|Sci-Fi
661,667,Bloodsport 2 (1995),Action
831,842,Tales from the Crypt Presents: Bordello of Blo...,Horror
1149,1165,"Bloody Child, The (1996)",Drama|Thriller
1309,1329,Blood For Dracula (Andy Warhol's Dracula) (1974),Horror
1315,1335,Blood Beach (1981),Action|Horror
1330,1351,Blood & Wine (1997),Drama
1911,1980,Friday the 13th Part VII: The New Blood (1988),Horror
2333,2402,Rambo: First Blood Part II (1985),Action|War


In [54]:
my_favorite_id = [2000, 2402, 2102 , 648, 3623]

my_favorite_title = []
for i in my_favorite_id:
    my_favorite_title.extend(list(movies[movies['movie_id'] == i]['title']))

my_movielist = pd.DataFrame({'user_id': ['seok']*5, 'movie_id': my_favorite_id, 'count': [5]*5, 'title': my_favorite_title})
my_movielist

Unnamed: 0,user_id,movie_id,count,title
0,seok,2000,5,Lethal Weapon (1987)
1,seok,2402,5,Rambo: First Blood Part II (1985)
2,seok,2102,5,Steamboat Willie (1940)
3,seok,648,5,Mission: Impossible (1996)
4,seok,3623,5,Mission: Impossible 2 (2000)


- Pick 5 movies and build a table recording that each one was watched 5 times.

In [55]:
if not ratings.isin({'user_id':['seok']})['user_id'].any():
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    ratings = pd.concat([ratings, my_movielist], ignore_index=True)

ratings.tail(10) 

Unnamed: 0,user_id,movie_id,count,title
836473,6040,1090,3,Platoon (1986)
836474,6040,1094,5,"Crying Game, The (1992)"
836475,6040,562,5,Welcome to the Dollhouse (1995)
836476,6040,1096,4,Sophie's Choice (1982)
836477,6040,1097,4,E.T. the Extra-Terrestrial (1982)
836478,seok,2000,5,Lethal Weapon (1987)
836479,seok,2402,5,Rambo: First Blood Part II (1985)
836480,seok,2102,5,Steamboat Willie (1940)
836481,seok,648,5,Mission: Impossible (1996)
836482,seok,3623,5,Mission: Impossible 2 (2000)


- Five rows were added to the table.

## 4. Building the CSR Matrix

In [56]:
user_unique = ratings['user_id'].unique()
movie_unique = ratings['title'].unique()

user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [57]:
temp_user_data = ratings['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(ratings):
    print('user_id column indexing OK!!')
    ratings['user_id'] = temp_user_data
else:
    print('user_id column indexing Fail!!')

temp_movie_data = ratings['title'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(ratings):
    print('title column indexing OK!!')
    ratings['title'] = temp_movie_data
else:
    print('title column indexing Fail!!')

ratings

user_id column indexing OK!!
title column indexing OK!!


Unnamed: 0,user_id,movie_id,count,title
0,0,1193,5,0
1,0,661,3,1
2,0,914,3,2
3,0,3408,4,3
4,0,2355,5,4
...,...,...,...,...
836478,6039,2000,5,648
836479,6039,2402,5,1698
836480,6039,2102,5,2077
836481,6039,648,5,58


In [58]:
from scipy.sparse import csr_matrix

num_user = ratings['user_id'].nunique()
num_movie = ratings['title'].nunique()

csr_data = csr_matrix((ratings['count'], (ratings.user_id, ratings.title)), shape= (num_user, num_movie))
csr_data

<6040x3628 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Row format>

- CSR matrix created.
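
The indexing and matrix construction can be checked on a tiny example. The following is a minimal sketch with a hypothetical three-user table (`toy` and the `toy_*` names are made up for illustration). It follows the same pattern as the cells above: map labels to consecutive integers, then pass `(data, (row, col))` to `csr_matrix`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical interaction table: 3 users, 3 titles, 4 interactions.
toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "c"],
    "title": ["X", "Y", "Y", "Z"],
    "count": [5, 3, 4, 5],
})

# Map raw labels to consecutive integer indices, as the notebook does.
toy_user_to_idx = {v: k for k, v in enumerate(toy["user_id"].unique())}
toy_title_to_idx = {v: k for k, v in enumerate(toy["title"].unique())}

rows = toy["user_id"].map(toy_user_to_idx).to_numpy()
cols = toy["title"].map(toy_title_to_idx).to_numpy()

# Rows are users, columns are titles, values are the counts.
toy_csr = csr_matrix(
    (toy["count"].to_numpy(), (rows, cols)),
    shape=(len(toy_user_to_idx), len(toy_title_to_idx)),
)

dense = toy_csr.toarray()
print(dense)

# Fraction of user-title cells that hold no interaction at all.
sparsity = 1 - toy_csr.nnz / (toy_csr.shape[0] * toy_csr.shape[1])
print(f"sparsity: {sparsity:.2%}")
```

Converting to a dense array is only practical at this toy size. For the real 6040x3628 matrix, the sparse format avoids storing the cells that have no rating.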

## 5. Model Design and Training

In [59]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [60]:
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)

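- Each value needs to be set appropriately.
#### 1. factors : the number of dimensions of the user and item vectors
#### 2. regularization : the strength of the regularization term used to prevent overfitting
#### 3. use_gpu : whether to train on the GPU
#### 4. iterations : the number of training iterations (epochs)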
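
To make the roles of factors, regularization, and iterations more concrete, here is a minimal dense sketch of alternating least squares for implicit feedback. It is not the implementation inside the implicit library; `toy_als`, the `alpha` confidence weight, and the 4x4 `counts` matrix are all assumptions introduced for illustration. Each iteration fixes one side (items or users) and solves a small regularized least-squares problem for every row of the other side.

```python
import numpy as np

def toy_als(counts, factors=2, regularization=0.01, iterations=15, alpha=1.0, seed=0):
    """Dense, unoptimized ALS for implicit feedback (illustration only)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = counts.shape
    preference = (counts > 0).astype(float)   # 1 if there was any interaction
    confidence = 1.0 + alpha * counts         # assumed weighting: more plays, more weight
    user_f = rng.normal(scale=0.1, size=(n_users, factors))
    item_f = rng.normal(scale=0.1, size=(n_items, factors))
    reg = regularization * np.eye(factors)

    def solve(fixed, conf, pref):
        # One regularized, confidence-weighted least-squares solve per row.
        out = np.empty((conf.shape[0], fixed.shape[1]))
        for r in range(conf.shape[0]):
            weighted = fixed * conf[r][:, None]
            lhs = fixed.T @ weighted + reg
            rhs = weighted.T @ pref[r]
            out[r] = np.linalg.solve(lhs, rhs)
        return out

    for _ in range(iterations):
        user_f = solve(item_f, confidence, preference)       # fix items, update users
        item_f = solve(user_f, confidence.T, preference.T)   # fix users, update items
    return user_f, item_f

# Hypothetical data: users 0-1 only watch items 0-1, users 2-3 only items 2-3.
counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
], dtype=float)

user_f, item_f = toy_als(counts)
scores = user_f @ item_f.T
print(np.round(scores, 2))
```

In this sketch, a larger `factors` value gives each vector more room to fit the data, a larger `regularization` value pulls the vectors toward zero, and `iterations` controls how many times the two sides are alternately re-solved.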

In [61]:
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Column format>

In [62]:
als_model.fit(csr_data_transpose)

  0%|          | 0/15 [00:00<?, ?it/s]

## 6. My Preferences as Predicted by the Trained Model

In [67]:
seok, rambo = user_to_idx['seok'], movie_to_idx['Rambo: First Blood Part II (1985)']
seok_vector, Rambo_vector = als_model.user_factors[seok], als_model.item_factors[rambo]

In [68]:
seok_vector

array([ 0.13193284, -0.5888757 ,  0.158662  ,  0.9240484 ,  0.24758372,
        0.2791524 ,  1.0787675 ,  0.1569194 , -0.19162619,  0.3722742 ,
        0.7734157 , -0.23280995,  0.02562603, -0.5031779 ,  0.31240547,
        0.06270296, -0.41679004,  1.1277169 ,  0.38418922,  0.25445375,
       -0.3209979 ,  0.5032384 , -0.01956886, -0.11581302, -0.40351924,
       -0.06918902,  0.12807006, -0.6695047 , -0.16659263,  0.3552883 ,
       -0.5177544 , -0.7855804 , -0.3658844 , -0.7728072 ,  0.17727076,
        0.17681034,  0.06426458, -0.1625024 , -0.9899818 , -0.3516992 ,
        0.05199567, -0.06217567,  0.6443931 , -0.3299193 , -0.1351512 ,
        0.13639453,  0.0643336 , -0.16289799, -0.6386679 ,  0.0991568 ,
       -0.5469778 , -0.1184614 , -0.3584156 , -0.48060656, -0.43233588,
       -0.2721437 ,  0.35081244, -0.6820643 ,  1.3470799 , -0.35761192,
       -0.02492777,  0.03689518,  0.6450365 ,  0.2300903 ,  0.19605851,
        0.20580053,  0.8421535 , -0.39512032,  0.21990344,  1.14

In [69]:
Rambo_vector

array([ 0.01924716,  0.01507034,  0.01953393,  0.0181462 , -0.0100741 ,
        0.01181537,  0.02372582,  0.00935162,  0.01137033,  0.00567765,
        0.01945027,  0.00167606, -0.0007691 ,  0.00062314,  0.0053487 ,
       -0.00575753,  0.00195505, -0.00672377, -0.00530492,  0.01241426,
       -0.01271914,  0.01464199,  0.00632108,  0.00553919,  0.01608408,
        0.00701643, -0.00339277, -0.00527109,  0.01266987,  0.00304394,
       -0.00467921,  0.00549687, -0.01191202,  0.01279883,  0.0049272 ,
        0.00373198, -0.02429988,  0.01573489,  0.00564727,  0.00040444,
        0.00476252,  0.02106658, -0.00201255, -0.0088154 ,  0.01205875,
        0.01190386, -0.01121511, -0.00426914, -0.01266003,  0.01206743,
       -0.00819125, -0.00164507,  0.0151177 , -0.00237229, -0.00768232,
        0.0124587 ,  0.00623317,  0.00035874,  0.01292649,  0.00888687,
       -0.00984593,  0.006693  , -0.01431796,  0.01035476,  0.01241185,
        0.00809824,  0.01303216,  0.0025699 ,  0.01909317,  0.00

In [70]:
np.dot(seok_vector, Rambo_vector)

0.14676905

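- Take the dot product of the two vectors.
- The dot product (about 0.147) is large compared with the value for a movie outside my favorites list.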

In [76]:
Toy = movie_to_idx['Toy Story (1995)']
Toy_vector = als_model.item_factors[Toy]
np.dot(seok_vector, Toy_vector)

0.00027240533

- The dot product between my vector and Toy Story, which is not on my favorites list, is small.
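
The same dot product can be computed for every item at once and then sorted, which is the basic idea behind ranking items for a user. The sketch below uses hypothetical 3-dimensional vectors (`user_vector` and `item_factors` are made-up values, not the trained factors) rather than the 100-dimensional ones from the model.

```python
import numpy as np

# Hypothetical factors: 1 user vector and 4 item vectors in 3 dimensions.
user_vector = np.array([0.9, 0.1, -0.2])
item_factors = np.array([
    [0.8, 0.0, -0.1],   # item 0: points the same way as the user
    [0.0, 0.5, 0.0],    # item 1: mostly unrelated
    [-0.7, 0.0, 0.3],   # item 2: points the opposite way
    [0.4, 0.2, 0.0],    # item 3: partially aligned
])

# One matrix-vector product gives the predicted preference for every item.
scores = item_factors @ user_vector

# Sort item indices from highest to lowest score.
ranking = np.argsort(-scores)

print(np.round(scores, 2))  # [ 0.74  0.05 -0.69  0.38]
print(ranking)              # [0 3 1 2]
```

Items whose vectors point in the same direction as the user vector score highest, which matches the gap between the Rambo and Toy Story values above.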

## 7. Recommending Movies Similar to My Favorite

In [77]:
favorite_movie = 'Rambo: First Blood Part II (1985)'
movie_id = movie_to_idx[favorite_movie]
similar_movie = als_model.similar_items(movie_id, N=15)
similar_movie

[(1698, 1.0),
 (1221, 0.88058317),
 (1749, 0.802472),
 (1182, 0.7886733),
 (1126, 0.78118014),
 (2081, 0.7669827),
 (1229, 0.7585215),
 (1219, 0.75802284),
 (2348, 0.7069544),
 (1704, 0.6968944),
 (1962, 0.68815297),
 (1120, 0.6845095),
 (2224, 0.67797744),
 (1967, 0.6588712),
 (1973, 0.65019596)]

In [78]:
idx_to_movie = {v:k for k,v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in similar_movie]

['Rambo: First Blood Part II (1985)',
 'Rambo III (1988)',
 'Iron Eagle (1986)',
 'Red Dawn (1984)',
 'Rocky III (1982)',
 'Heartbreak Ridge (1986)',
 'Rocky IV (1985)',
 'First Blood (1982)',
 'Death Wish II (1982)',
 'Karate Kid, Part II, The (1986)',
 'Rocky V (1990)',
 'Jaws 2 (1978)',
 'Missing in Action (1984)',
 'Bloodsport (1988)',
 'Missing in Action 2: The Beginning (1985)']

In [79]:
def get_similar_movie(movie_title: str):
    movie_id = movie_to_idx[movie_title]
    similar_movie = als_model.similar_items(movie_id, N=15)
    similar_movie = [idx_to_movie[i[0]] for i in similar_movie]
    return similar_movie

In [80]:
get_similar_movie('Toy Story (1995)')

['Toy Story (1995)',
 'Toy Story 2 (1999)',
 "Bug's Life, A (1998)",
 'Aladdin (1992)',
 'Babe (1995)',
 'Groundhog Day (1993)',
 'Lion King, The (1994)',
 'Beauty and the Beast (1991)',
 'Pleasantville (1998)',
 "There's Something About Mary (1998)",
 'Mulan (1998)',
 'Shakespeare in Love (1998)',
 'Hercules (1997)',
 'Forrest Gump (1994)',
 'Tarzan (1999)']

## 8. Recommending Movies I Might Like

In [81]:
user = user_to_idx['seok']

movie_recommended = als_model.recommend(user, csr_data, N=10, filter_already_liked_items=True)
movie_recommended

[(67, 0.44160095),
 (650, 0.35664895),
 (372, 0.31884336),
 (346, 0.2991693),
 (746, 0.26992375),
 (153, 0.22726667),
 (326, 0.19927481),
 (1537, 0.18801063),
 (1502, 0.17869067),
 (1123, 0.17624354)]

In [82]:
[idx_to_movie[i[0]] for i in movie_recommended]

['Gladiator (2000)',
 'Lethal Weapon 2 (1989)',
 'X-Men (2000)',
 'Patriot, The (2000)',
 'Perfect Storm, The (2000)',
 'Lethal Weapon 3 (1992)',
 'Top Gun (1986)',
 'U-571 (2000)',
 'F/X (1986)',
 'Rocky II (1979)']

In [83]:
recommended = movie_to_idx['X-Men (2000)']
explain = als_model.explain(user, csr_data, itemid=recommended)

[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('Mission: Impossible 2 (2000)', 0.21560275020147274),
 ('Mission: Impossible (1996)', 0.06445791736961595),
 ('Rambo: First Blood Part II (1985)', 0.029391716255160162),
 ('Lethal Weapon (1987)', 0.008467809654899518),
 ('Steamboat Willie (1940)', -0.0031510094022107805)]

- Mission: Impossible 2 contributed the most to the X-Men score, which is why X-Men was recommended to me.

***************
# Retrospective


### Explicit Preference

> Explicit preference: data in which users state their preferences directly. E.g. ratings, movie reviews, subscriptions, block lists
>- ***Hard to collect; user participation is low.***


### Implicit Preference

>E.g. search history, visited pages, purchase history

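**1) There is no negative feedback.**

**2) The data is inherently noisy.**

**3) The numerical value indicates confidence, whereas in explicit data a higher value means a higher preference.**

**4) Evaluating an implicit-feedback recommender system requires careful thought about which method is appropriate.**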

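>With rating data, Mean Squared Error can be used to measure how far the predictions are from the actual values.

>With watch time, click counts, or view history, it is hard to define a ground-truth value. ==> Take repeated feedback into account.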
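
The contrast between the two evaluation settings can be sketched in a few lines. `mean_squared_error` below is the standard formula for explicit ratings. `precision_at_k` is one possible choice for implicit data, assumed here for illustration and not prescribed by the notes above: hide some of a user's interactions, then check how many of the top-k recommendations fall in the hidden set. All sample values are hypothetical.

```python
import numpy as np

def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

def precision_at_k(recommended, held_out, k):
    """Share of the top-k recommended items that appear in the held-out set."""
    held_out = set(held_out)
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in held_out)
    return hits / k

# Explicit case: hypothetical true ratings vs. model predictions.
mse = mean_squared_error([5, 3, 4], [4.5, 3.5, 4.0])

# Implicit case: hypothetical ranked recommendations vs. hidden interactions.
p_at_3 = precision_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=3)

print(round(mse, 4))     # 0.1667
print(round(p_at_3, 4))  # 0.3333
```

The ranking-based measure only needs to know which items the user went on to interact with, so it does not require a numeric ground-truth rating.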


### Interpreting Implicit Data (ambiguous implicit data ==> domain knowledge and intuition are needed)

**1) If a user has listened even once, treat it as a preference.**

**2) Give more weight to artists that were played many times, treating them as more clearly liked.**
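
These two rules correspond to a common formulation in implicit-feedback matrix factorization: a binary preference (played at all or not) plus a confidence weight that grows with the play count. The sketch below assumes the linear form `1 + ALPHA * count`. The `ALPHA` value and the `play_counts` array are hypothetical, and the notes above do not specify which weighting was actually used.

```python
import numpy as np

ALPHA = 40.0  # assumed confidence scaling factor, chosen only for illustration

# Hypothetical play counts of four artists for one user.
play_counts = np.array([0, 1, 3, 20])

# Rule 1: any play at all counts as a preference.
preference = (play_counts > 0).astype(int)

# Rule 2: more plays mean more confidence in that preference.
confidence = 1.0 + ALPHA * play_counts

print(preference)   # [0 1 1 1]
print(confidence)   # [  1.  41. 121. 801.]
```

Unplayed artists keep a small nonzero confidence of 1, so the absence of a play is treated as weak evidence rather than as a definite dislike.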

*****************************
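**The data must first of all be trustworthy, but extracting the data we need also requires clear criteria and careful categorization.
We should also look for ways to raise participation in order to collect more data.**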

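**I reconsidered whether a high value in implicit feedback really means a high preference. The two are surely closely related, but I wonder what would happen when recommending from a new area rather than one where the user usually shows a high preference. (Ads, for example, are sometimes completely out of the blue.)**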

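**Surprisingly, choosing a Rambo movie really did produce recommendations from a similar genre. => Numerical association is an easy and simple way to capture complex human psychology, but I would like to look into whether it carries any further meaning.**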

***Improving performance through vector dot products is fascinating. What is clear is that my studies should focus on the mathematical meaning rather than on simple calculation.***