## Dataset preparation and preprocessing

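- The MovieLens data contains ratings that users gave to movies and comes in several sizes. The MovieLens 1M Dataset is recommended.
- Star ratings are a typical example of explicit feedback, but here they are treated as implicit feedback for the experiment.
- Each star rating is interpreted as a view count.
- Ratings below 3 are assumed to mean the user did not like the movie, so those rows are excluded.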

In [1]:
!pwd

/home/aiffel0049/aiffel/Exploration/7.recommendation


In [2]:
import os
import pandas as pd

rating_file_path='/home/aiffel0049/aiffel/Exploration/7.recommendation/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python')
orginal_data_size = len(ratings)
del ratings['timestamp']
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [3]:
# Keep only ratings of 3 or higher.
ratings = ratings[ratings['rating']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [4]:
# Rename the rating column to count.
ratings.rename(columns={'rating':'count'}, inplace=True)

In [5]:
# Load the movie metadata so that titles can be displayed.
movie_file_path='/home/aiffel0049/aiffel/Exploration/7.recommendation/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


## Data analysis
- Number of unique movies in ratings
- Number of unique users in ratings
- The 30 most popular movies (in order of popularity)

In [6]:
# Number of unique movies
ratings['movie_id'].nunique()

3628

In [7]:
# Number of unique users
ratings['user_id'].nunique()

6039

In [8]:
# Sum the count values per movie_id,
# then sort in descending order.
movies_count = ratings.groupby('movie_id')['count'].sum()
movies_count.sort_values(ascending=False).head(30)

movie_id
2858    14449
260     13178
1196    12648
2028    11348
1210    11303
1198    11179
593     11096
2571    10903
2762    10703
589     10513
608     10465
527     10317
110     10125
318     10085
1270    10081
858      9965
1197     9866
480      9601
1617     9515
2396     9503
1580     9057
296      9053
2997     8964
1265     8727
1097     8696
356      8679
1240     8571
1        8475
2716     8282
1214     8244
Name: count, dtype: int64
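The ranking above lists only `movie_id` values. A minimal sketch of how the titles could be attached, assuming frames shaped like the notebook's `ratings` and `movies`; the `ratings_demo` and `movies_demo` frames below are toy stand-ins with illustrative values, not the real data:

```python
import pandas as pd

# Toy stand-ins for the notebook's `ratings` and `movies` frames (illustrative values only).
ratings_demo = pd.DataFrame({
    'movie_id': [1, 1, 2, 3, 3, 3],
    'count':    [5, 4, 3, 5, 4, 3],
})
movies_demo = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

# Sum counts per movie, sort by popularity, then join the titles on movie_id.
popularity = (
    ratings_demo.groupby('movie_id')['count'].sum()
    .sort_values(ascending=False)
    .reset_index()
    .merge(movies_demo, on='movie_id', how='left')
)
print(popularity[['movie_id', 'title', 'count']])
```

A left merge keeps the popularity order of the left-hand frame, so the titles appear in ranked order.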

In [9]:
for name, genre in zip(movies['title'], movies['genre']):
    if 'Fantasy' in genre:
        print(name)

Jumanji (1995)
Kids of the Round Table (1995)
Indian in the Cupboard, The (1995)
NeverEnding Story III, The (1994)
Heavenly Creatures (1994)
Kid in King Arthur's Court, A (1995)
Star Wars: Episode IV - A New Hope (1977)
Santa Clause, The (1994)
Mask, The (1994)
Pagemaster, The (1994)
Dragonheart (1996)
Space Jam (1996)
Nutty Professor, The (1996)
Hungarian Fairy Tale, A (1987)
Kazaam (1996)
Bogus (1996)
Escape to Witch Mountain (1975)
20,000 Leagues Under the Sea (1954)
Willy Wonka and the Chocolate Factory (1971)
E.T. the Extra-Terrestrial (1982)
Drop Dead Fred (1991)
Warriors of Virtue (1997)
Simple Wish, A (1997)
FairyTale: A True Story (1997)
Flubber (1997)
Star Kid (1997)
Borrowers, The (1997)
Quest for Camelot (1998)
Small Soldiers (1998)
Labyrinth (1986)
Goonies, The (1985)
Absent Minded Professor, The (1961)
Babes in Toyland (1961)
Dune (1984)
Darby O'Gill and the Little People (1959)
Honey, I Shrunk the Kids (1989)
One Magic Christmas (1985)
Peter Pan (1953)
Return to Oz (1985

## Adding five favorite movies to ratings

In [10]:
my_favorite = ['E.T. the Extra-Terrestrial (1982)', 'Toy Story 2 (1999)', 'Jumanji (1995)',
               'Santa Claus: The Movie (1985)','Tron (1982)']
my_favorite_ids = []
for name, ids in zip(movies['title'], movies['movie_id']):
    if name in my_favorite:
        my_favorite_ids.append(ids)

print(my_favorite_ids)
my_movies = pd.DataFrame({'user_id':[6041]*5, 'movie_id': my_favorite_ids, 'count': [4]*5})

if not ratings.isin({'user_id':[6041]})['user_id'].any():  # only if user_id 6041 is not already present
    ratings = pd.concat([ratings, my_movies])                     # add the my_movies rows built above (DataFrame.append was removed in pandas 2.0)

ratings.tail(10)       # check that the rows were added

[2, 1097, 2105, 2399, 3114]


Unnamed: 0,user_id,movie_id,count
1000203,6040,1090,3
1000205,6040,1094,5
1000206,6040,562,5
1000207,6040,1096,4
1000208,6040,1097,4
0,6041,2,4
1,6041,1097,4
2,6041,2105,4
3,6041,2399,4
4,6041,3114,4


In [11]:
user_unique = ratings['user_id'].unique()
movie_unique = ratings['movie_id'].unique()
user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [12]:
print(user_to_idx[6041])
print(movie_to_idx[2105])

6039
704


In [13]:
temp_user_data = ratings['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(ratings):
    print('user_id column indexing OK')
    ratings['user_id'] = temp_user_data
else:
    print('user_id column indexing Fail')

# Index the movie_id column the same way, using movie_to_idx.
temp_movie_data = ratings['movie_id'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(ratings):
    print('movie_id column indexing OK')
    ratings['movie_id'] = temp_movie_data
else:
    print('movie_id column indexing Fail')

ratings

user_id column indexing OK
movie_id column indexing OK


Unnamed: 0,user_id,movie_id,count
0,0,0,5
1,0,1,3
2,0,2,3
3,0,3,4
4,0,4,5
...,...,...,...
0,6039,513,4
1,6039,26,4
2,6039,704,4
3,6039,701,4
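The dictionary-based indexing above assigns each id a consecutive integer in order of first appearance. As a rough sketch of an equivalent shortcut, `pd.factorize` produces the same codes; the `raw_ids` series below is a toy example, not taken from the dataset:

```python
import pandas as pd

# Toy ids for illustration only.
raw_ids = pd.Series([1193, 661, 1193, 914, 661])

# Dictionary approach, mirroring the cells above.
to_idx = {v: k for k, v in enumerate(raw_ids.unique())}
mapped = raw_ids.map(to_idx)

# pd.factorize assigns codes in order of first appearance, so the codes match.
codes, uniques = pd.factorize(raw_ids)

print(mapped.tolist())
print(codes.tolist())
print(list(uniques))
```

The `uniques` array plays the role of the reverse mapping: position `i` holds the original id for code `i`.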


## Building the CSR matrix

In [14]:
from scipy.sparse import csr_matrix

num_user = ratings['user_id'].nunique()
num_movies = ratings['movie_id'].nunique()

csr_data = csr_matrix((ratings['count'], (ratings.user_id, ratings.movie_id)), shape=(num_user, num_movies))
csr_data

<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
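To make the `(data, (row, col))` constructor form easier to read, here is a small sketch on a toy 3x3 matrix with made-up values. The last line applies the same density formula to the sizes reported in the output above (836483 stored elements in a 6040x3628 matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data: each position i describes one (user, movie, count) triple.
counts = np.array([5, 3, 4, 4])
user_idx = np.array([0, 0, 1, 2])
movie_idx = np.array([0, 1, 1, 2])

toy = csr_matrix((counts, (user_idx, movie_idx)), shape=(3, 3))
print(toy.toarray())

# Density = stored elements / total cells.
toy_density = toy.nnz / (toy.shape[0] * toy.shape[1])
notebook_density = 836483 / (6040 * 3628)
print(f'{toy_density:.3f}, {notebook_density:.3f}')
```

By this calculation only about 3.8% of the user-movie cells in the notebook's matrix are filled, which is why a sparse format is used.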

## Building and training the AlternatingLeastSquares model (als_model)

In [15]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# Settings recommended by the implicit library; they are unrelated to the training itself.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [16]:
# Declare the AlternatingLeastSquares model

als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=100, dtype=np.float32)

In [17]:
# The ALS model in the implicit version used here takes an (item x user) matrix as input, so a transpose is needed.
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>
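Note that the transposed matrix above is reported in Compressed Sparse Column format. The notebook's `fit` call works with it as-is; if some other step needed CSR specifically (an assumption, not something this notebook requires), a conversion could look like this toy sketch:

```python
from scipy.sparse import csr_matrix

# Toy matrix for illustration only.
m = csr_matrix([[5, 0, 3], [0, 4, 0]])

t = m.T              # transposing a CSR matrix yields a CSC matrix
t_csr = t.tocsr()    # explicit conversion back to CSR

print(t.format, t_csr.format, t_csr.shape)
```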

In [18]:
als_model.fit(csr_data_transpose)



## Checking the model's predicted preference for one of my five favorite movies and for one other movie

In [19]:
me, my_movie = user_to_idx[6041], movie_to_idx[2399]
my_vector, movie_vector = als_model.user_factors[me], als_model.item_factors[my_movie]

In [20]:
np.dot(my_vector, movie_vector)

0.12847224

In [21]:
other_movie = movie_to_idx[1090]
other_movie_vector = als_model.item_factors[other_movie]
np.dot(my_vector, other_movie_vector)

0.07022386

## Getting movies similar to a favorite movie

In [22]:
similar_movie = als_model.similar_items(my_movie, N=10)
similar_movie

[(701, 0.14016704),
 (1813, 0.074067734),
 (2549, 0.0740507),
 (684, 0.073826395),
 (1993, 0.07341926),
 (1985, 0.07289947),
 (2477, 0.07081206),
 (2689, 0.069730766),
 (3232, 0.0696123),
 (722, 0.06886212)]

In [23]:
idx_to_movie = {v:k for k, v in movie_to_idx.items()}
similar_movie = [(idx_to_movie[i[0]], i[1]) for i in similar_movie]
similar_movie

[(2399, 0.14016704),
 (2057, 0.074067734),
 (2050, 0.0740507),
 (2453, 0.073826395),
 (885, 0.07341926),
 (2162, 0.07289947),
 (3673, 0.07081206),
 (2016, 0.069730766),
 (1426, 0.0696123),
 (2047, 0.06886212)]

In [24]:
similar_info = []
genre_list = []
for movie_idx in similar_movie:
    similar_info.append((*movies.loc[movies['movie_id'] == movie_idx[0]].values[0], movie_idx[1]))
    print('{}, {}, {}, {}'.format(*similar_info[-1]))
    genre_list += similar_info[-1][2].split('|')

from collections import Counter
print(Counter(genre_list))

2399, Santa Claus: The Movie (1985), Adventure|Children's|Fantasy, 0.14016704261302948
2057, Incredible Journey, The (1963), Adventure|Children's, 0.07406773418188095
2050, Herbie Goes Bananas (1980), Adventure|Children's|Comedy, 0.07405070215463638
2453, Boy Who Could Fly, The (1986), Drama|Fantasy, 0.0738263949751854
885, Bogus (1996), Children's|Drama|Fantasy, 0.07341925799846649
2162, NeverEnding Story II: The Next Chapter, The (1990), Adventure|Children's|Fantasy, 0.0728994682431221
3673, Benji the Hunted (1987), Adventure|Children's, 0.07081206142902374
2016, Apple Dumpling Gang Rides Again, The (1979), Children's|Comedy|Western, 0.06973076611757278
1426, Zeus and Roxanne (1997), Children's, 0.0696123018860817
2047, Gnome-Mobile, The (1967), Children's, 0.06886211782693863
Counter({"Children's": 9, 'Adventure': 5, 'Fantasy': 4, 'Comedy': 2, 'Drama': 2, 'Western': 1})


## Getting recommendations for movies I am most likely to enjoy

In [25]:
user = user_to_idx[6041]

movie_recommended = als_model.recommend(user, csr_data, N=10, filter_already_liked_items=True)
movie_recommended

[(40, 0.35363936),
 (4, 0.33799627),
 (531, 0.30815232),
 (499, 0.30187213),
 (561, 0.287902),
 (322, 0.27772212),
 (545, 0.2732821),
 (611, 0.26438951),
 (616, 0.25970992),
 (458, 0.25856438)]

In [26]:
movie_recommended = [(idx_to_movie[i[0]], i[1]) for i in movie_recommended]
movie_recommended

[(1, 0.35363936),
 (2355, 0.33799627),
 (2140, 0.30815232),
 (2054, 0.30187213),
 (2161, 0.287902),
 (34, 0.27772212),
 (317, 0.2732821),
 (1967, 0.26438951),
 (2193, 0.25970992),
 (367, 0.25856438)]

In [27]:
recommended_info = []
genre = list()
for i, tup in enumerate(movie_recommended):
    movie_info = movies.loc[movies['movie_id'] == tup[0]]
    recommended_info.append((movie_info['title'].values[0], movie_info['genre'].values[0], *tup))
    print('{}, {}, {}, {}'.format(*recommended_info[i]))
    genre += recommended_info[i][1].split('|')
print(Counter(genre))

Toy Story (1995), Animation|Children's|Comedy, 1, 0.3536393642425537
Bug's Life, A (1998), Animation|Children's|Comedy, 2355, 0.3379962742328644
Dark Crystal, The (1982), Children's|Fantasy|Sci-Fi, 2140, 0.30815231800079346
Honey, I Shrunk the Kids (1989), Adventure|Children's|Comedy|Fantasy|Sci-Fi, 2054, 0.3018721342086792
NeverEnding Story, The (1984), Adventure|Children's|Fantasy, 2161, 0.28790199756622314
Babe (1995), Children's|Comedy|Drama, 34, 0.2777221202850342
Santa Clause, The (1994), Children's|Comedy|Fantasy, 317, 0.27328211069107056
Labyrinth (1986), Adventure|Children's|Fantasy, 1967, 0.2643895149230957
Willow (1988), Action|Adventure|Fantasy, 2193, 0.2597099244594574
Mask, The (1994), Comedy|Crime|Fantasy, 367, 0.25856438279151917
Counter({"Children's": 8, 'Fantasy': 7, 'Comedy': 6, 'Adventure': 4, 'Animation': 2, 'Sci-Fi': 2, 'Drama': 1, 'Action': 1, 'Crime': 1})


In [28]:
favorite_movie_info = [movies.loc[movies.title == movie_name] for movie_name in my_favorite]
# print(favorite_movie_info)
favorite_movie_info = [(info.movie_id.values[0], info.title.values[0], info.genre.values[0]) for info in favorite_movie_info]
favorite_genre = []
for info in favorite_movie_info:
    print('{}, {}, {}'.format(*info))
    favorite_genre += info[-1].split('|')

print(Counter(favorite_genre))

1097, E.T. the Extra-Terrestrial (1982), Children's|Drama|Fantasy|Sci-Fi
3114, Toy Story 2 (1999), Animation|Children's|Comedy
2, Jumanji (1995), Adventure|Children's|Fantasy
2399, Santa Claus: The Movie (1985), Adventure|Children's|Fantasy
2105, Tron (1982), Action|Adventure|Fantasy|Sci-Fi
Counter({"Children's": 4, 'Fantasy': 4, 'Adventure': 3, 'Sci-Fi': 2, 'Drama': 1, 'Animation': 1, 'Comedy': 1, 'Action': 1})


## Contribution

In [29]:
movie_id = 2355
print(movies.loc[movies['movie_id'] == movie_id].values[0])
movie_idx = movie_to_idx[movie_id]
explain = als_model.explain(user, csr_data, itemid=movie_idx)
explain

[2355 "Bug's Life, A (1998)" "Animation|Children's|Comedy"]


(0.33697338146417277,
 [(50, 0.3064068202811457),
  (704, 0.025424215469956542),
  (26, 0.005258519212720549),
  (513, 0.0022956261132989165),
  (701, -0.002411799612948932)],
 (array([[ 1.05185627e+00,  1.08734323e-01,  3.88522661e-02, ...,
           1.68171934e-02, -5.19482783e-03,  5.10162042e-02],
         [ 1.14372880e-01,  1.03446322e+00,  5.42475165e-02, ...,
           6.50645635e-02,  3.10063123e-02,  1.62765343e-02],
         [ 4.08669996e-02,  6.03416357e-02,  9.79511431e-01, ...,
           2.00388236e-02,  3.62460831e-02,  3.31375918e-02],
         ...,
         [ 1.76892703e-02,  6.91355043e-02,  2.38112339e-02, ...,
           9.67259461e-01,  1.65166249e-02, -7.37702172e-04],
         [-5.46421221e-03,  3.15100337e-02,  3.69836374e-02, ...,
           5.68473777e-02,  9.59319769e-01,  3.76577717e-02],
         [ 5.36617142e-02,  2.23846886e-02,  3.53237067e-02, ...,
           2.67589778e-02,  7.34072745e-02,  9.57503457e-01]]),
  False))

In [30]:
explain_top5 = [(idx_to_movie[i[0]], i[1]) for i in explain[1]]
explain_top5

[(3114, 0.3064068202811457),
 (2105, 0.025424215469956542),
 (1097, 0.005258519212720549),
 (2, 0.0022956261132989165),
 (2399, -0.002411799612948932)]

In [31]:
explain_info = []
explain_genre = []
for i, tup in enumerate(explain_top5):
    movie_info = movies.loc[movies['movie_id'] == tup[0]]
    explain_info.append((movie_info['title'].values[0], movie_info['genre'].values[0], *tup))
    print('{}, {}, {}, {}'.format(*explain_info[i]))
    explain_genre += explain_info[i][1].split('|')
print(Counter(explain_genre))

Toy Story 2 (1999), Animation|Children's|Comedy, 3114, 0.3064068202811457
Tron (1982), Action|Adventure|Fantasy|Sci-Fi, 2105, 0.025424215469956542
E.T. the Extra-Terrestrial (1982), Children's|Drama|Fantasy|Sci-Fi, 1097, 0.005258519212720549
Jumanji (1995), Adventure|Children's|Fantasy, 2, 0.0022956261132989165
Santa Claus: The Movie (1985), Adventure|Children's|Fantasy, 2399, -0.002411799612948932
Counter({"Children's": 4, 'Fantasy': 4, 'Adventure': 3, 'Sci-Fi': 2, 'Animation': 1, 'Comedy': 1, 'Action': 1, 'Drama': 1})


## Analysis results

### My favorite movies

- favorite movie info   
1097, 'E.T. the Extra-Terrestrial (1982)', "Children's|Drama|Fantasy|Sci-Fi"   
3114, 'Toy Story 2 (1999)', "Animation|Children's|Comedy"   
2, 'Jumanji (1995)', "Adventure|Children's|Fantasy"   
2399, 'Santa Claus: The Movie (1985)', "Adventure|Children's|Fantasy"   
2105, 'Tron (1982)', 'Action|Adventure|Fantasy|Sci-Fi'

- movie genre   
"Children's": 4, 'Fantasy': 4, 'Adventure': 3, 'Sci-Fi': 2,   
'Drama': 1, 'Animation': 1, 'Comedy': 1, 'Action': 1

### Similar-movie recommendations
- Movie: Santa Claus: The Movie (1985), Adventure|Children's|Fantasy

- Results   
2399, Santa Claus: The Movie (1985), Adventure|Children's|Fantasy, 0.14016704261302948   
2057, Incredible Journey, The (1963), Adventure|Children's, 0.07406773418188095   
2050, Herbie Goes Bananas (1980), Adventure|Children's|Comedy, 0.07405070215463638   
2453, Boy Who Could Fly, The (1986), Drama|Fantasy, 0.0738263949751854   
885, Bogus (1996), Children's|Drama|Fantasy, 0.07341925799846649   
2162, NeverEnding Story II: The Next Chapter, The (1990), Adventure|Children's|Fantasy, 0.0728994682431221   
3673, Benji the Hunted (1987), Adventure|Children's, 0.07081206142902374   
2016, Apple Dumpling Gang Rides Again, The (1979), Children's|Comedy|Western, 0.06973076611757278   
1426, Zeus and Roxanne (1997), Children's, 0.0696123018860817   
2047, Gnome-Mobile, The (1967), Children's, 0.06886211782693863   


- Genre   
"Children's": 9, 'Adventure': 5, 'Fantasy': 4, 'Comedy': 2, 'Drama': 2, 'Western': 1


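- Genres shared with the target movie: Children's = 9, Adventure = 5, Fantasy = 4 (the counts include the target movie itself)
- Genres the target movie does not have: Comedy = 2, Drama = 2, Western = 1
- Every recommended movie shares at least one genre with the target movie.
- Some of them also carry genres that the target movie does not have.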

### Recommendations I am likely to enjoy

- Results   
Toy Story (1995), Animation|Children's|Comedy, 1, 0.3536393642425537   
Bug's Life, A (1998), Animation|Children's|Comedy, 2355, 0.3379962742328644   
Dark Crystal, The (1982), Children's|Fantasy|Sci-Fi, 2140, 0.30815231800079346   
Honey, I Shrunk the Kids (1989), Adventure|Children's|Comedy|Fantasy|Sci-Fi, 2054, 0.3018721342086792   
NeverEnding Story, The (1984), Adventure|Children's|Fantasy, 2161, 0.28790199756622314   
Babe (1995), Children's|Comedy|Drama, 34, 0.2777221202850342   
Santa Clause, The (1994), Children's|Comedy|Fantasy, 317, 0.27328211069107056   
Labyrinth (1986), Adventure|Children's|Fantasy, 1967, 0.2643895149230957   
Willow (1988), Action|Adventure|Fantasy, 2193, 0.2597099244594574   
Mask, The (1994), Comedy|Crime|Fantasy, 367, 0.25856438279151917   


- Genre   
"Children's": 8, 'Fantasy': 7, 'Comedy': 6, 'Adventure': 4,   
'Animation': 2, 'Sci-Fi': 2, 'Drama': 1, 'Action': 1, 'Crime': 1

- Children's = 8, Fantasy = 7, Comedy = 6, Adventure = 4
- Children's and Fantasy, the two most frequent genres among my favorite movies, each appear in 7 or more of the 10 recommendations.
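The genre tallies above can be re-derived from the recommendation output. The sketch below copies the ten genre strings from that output; the `favorite_top_genres` set is an illustrative name for the two genres discussed above:

```python
from collections import Counter

# Genre strings of the ten recommended movies, copied from the output above.
recommended_genres = [
    "Animation|Children's|Comedy",
    "Animation|Children's|Comedy",
    "Children's|Fantasy|Sci-Fi",
    "Adventure|Children's|Comedy|Fantasy|Sci-Fi",
    "Adventure|Children's|Fantasy",
    "Children's|Comedy|Drama",
    "Children's|Comedy|Fantasy",
    "Adventure|Children's|Fantasy",
    "Action|Adventure|Fantasy",
    "Comedy|Crime|Fantasy",
]
favorite_top_genres = {"Children's", "Fantasy"}

# Count how often each genre occurs across the recommendations.
genre_counts = Counter(g for genres in recommended_genres for g in genres.split('|'))

# Count recommendations that carry at least one of the two top favorite genres.
with_top_genre = sum(
    1 for genres in recommended_genres if favorite_top_genres & set(genres.split('|'))
)

print(genre_counts.most_common(4))
print(with_top_genre, 'of', len(recommended_genres))
```

Counting this way also shows that every one of the ten recommendations carries Children's, Fantasy, or both.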

#### Contribution check

- Movie info   
"Bug's Life, A (1998)" "Animation|Children's|Comedy"


- Contribution info   
Toy Story 2 (1999), Animation|Children's|Comedy, 3114, 0.3064068202811457   
Tron (1982), Action|Adventure|Fantasy|Sci-Fi, 2105, 0.025424215469956542   
E.T. the Extra-Terrestrial (1982), Children's|Drama|Fantasy|Sci-Fi, 1097, 0.005258519212720549   
Jumanji (1995), Adventure|Children's|Fantasy, 2, 0.0022956261132989165   
Santa Claus: The Movie (1985), Adventure|Children's|Fantasy, 2399, -0.002411799612948932   


- Genres of the contributing movies   
"Children's": 4, 'Fantasy': 4, 'Adventure': 3, 'Sci-Fi': 2,   
'Animation': 1, 'Comedy': 1, 'Drama': 1, 'Action': 1

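- A single movie, Toy Story 2, accounts for most of the score behind this recommendation.
- Toy Story 2 has exactly the same genres as A Bug's Life (Animation|Children's|Comedy), so the recommendation appears to be driven mainly by a movie whose genres match 100%.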
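To put a number on "most of the score", the sketch below uses the values printed by the `explain` cell. In this output, the five per-movie contributions add up to the total score, so each movie's share can be read as a fraction of that total; this is an observation about these numbers, not a statement about the library's internals:

```python
import math

# Values copied from the explain output above.
total_score = 0.33697338146417277
contributions = {
    'Toy Story 2 (1999)': 0.3064068202811457,
    'Tron (1982)': 0.025424215469956542,
    'E.T. the Extra-Terrestrial (1982)': 0.005258519212720549,
    'Jumanji (1995)': 0.0022956261132989165,
    'Santa Claus: The Movie (1985)': -0.002411799612948932,
}

# The contributions sum to the total score (up to floating-point rounding).
summed = sum(contributions.values())
print(math.isclose(summed, total_score, rel_tol=1e-9))

# Share of the total score attributed to each favorite movie.
for title, value in contributions.items():
    print(f'{title}: {value / total_score:.1%}')
```

By this measure Toy Story 2 accounts for roughly 91% of the score for A Bug's Life, while the other four favorites contribute a few percent or less.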