<a href="https://colab.research.google.com/github/senasung37/recommendation/blob/main/movie_recommender_surprise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Code source: 이수안 컴퓨터 연구소 <br>
- Link: https://www.youtube.com/watch?v=6TP51jvjLsE&t=698s

#Load libraries & data

In [1]:
# Install the surprise library
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 28.5 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633993 sha256=7674843a9617fc1f1337b6755a8186f50b69800fd49e17b39261dbea60c63caa
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [2]:
# Import libraries
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

In [3]:
# Load the data ('ml-100k' is the name of the built-in dataset)
# Fields: user id, movie id, rating, timestamp
data = Dataset.load_builtin('ml-100k', prompt=False)
data.raw_ratings[:10]

Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


[('196', '242', 3.0, '881250949'),
 ('186', '302', 3.0, '891717742'),
 ('22', '377', 1.0, '878887116'),
 ('244', '51', 2.0, '880606923'),
 ('166', '346', 1.0, '886397596'),
 ('298', '474', 4.0, '884182806'),
 ('115', '265', 2.0, '881171488'),
 ('253', '465', 5.0, '891628467'),
 ('305', '451', 3.0, '886324817'),
 ('6', '86', 3.0, '883603013')]

In [4]:
# Set up the model
model = SVD()

In [5]:
cross_validate(model, data, measures=['rmse', 'mae'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9355  0.9328  0.9342  0.9460  0.9375  0.9372  0.0047  
MAE (testset)     0.7392  0.7344  0.7378  0.7445  0.7389  0.7390  0.0032  
Fit time          4.45    5.89    4.80    4.18    4.16    4.69    0.64    
Test time         0.17    0.20    0.25    0.20    0.16    0.20    0.03    


{'fit_time': (4.447268724441528,
  5.886365652084351,
  4.798278570175171,
  4.18296217918396,
  4.156388282775879),
 'test_mae': array([0.73921893, 0.73442842, 0.73782755, 0.74451771, 0.73892017]),
 'test_rmse': array([0.93546513, 0.93279791, 0.93417196, 0.94602469, 0.93745674]),
 'test_time': (0.16616225242614746,
  0.19656610488891602,
  0.25396156311035156,
  0.1988530158996582,
  0.15985894203186035)}
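
For reference, the two metrics reported above can be computed by hand. The sketch below uses made-up ratings and predictions (they are not taken from ml-100k) and hypothetical helper names `rmse` and `mae`. It only illustrates the definitions of the metrics and makes no claim about how surprise computes them internally.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error: large errors are penalized more heavily."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(true_ratings, predicted_ratings):
    """Mean absolute error: average size of the error, in rating points."""
    absolute_errors = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute_errors) / len(absolute_errors)

# Illustrative values only (assumed, not real model output)
true_ratings = [3.0, 1.0, 4.0, 5.0]
predicted_ratings = [2.5, 2.0, 4.0, 3.5]

print(f"RMSE: {rmse(true_ratings, predicted_ratings):.4f}")
print(f"MAE:  {mae(true_ratings, predicted_ratings):.4f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which matches the pattern in the table above.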

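#Content-based filtering
Recommends items similar to the ones a user already likes (example: recommending movies watched by users whose taste is similar to mine)
- Pros: scales easily to a large number of users; can recommend niche items that few other users are interested in
- Cons: input features must be hand-designed, which requires domain knowledge; recommendations are limited to the user's existing interests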
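
The role of hand-designed features can be sketched with a toy example. Everything here is hypothetical: the movie titles, the three genre features, and the helper names `cosine_similarity` and `recommend_similar` are assumptions made for illustration and do not come from ml-100k or from surprise.

```python
import math

# Hypothetical hand-designed item features: [action, comedy, romance]
movie_features = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 1, 0],
    "Movie C": [0, 0, 1],
    "Movie D": [0, 1, 1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend_similar(liked_title, features, top_n=2):
    """Rank all other items by feature similarity to the liked item."""
    liked_vector = features[liked_title]
    scores = [
        (title, cosine_similarity(liked_vector, vector))
        for title, vector in features.items()
        if title != liked_title
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

recommendations = recommend_similar("Movie A", movie_features)
print(recommendations)
```

Note that no other user's data is involved: the ranking depends only on the item features, which is why the quality of the result rests on how well those features were designed.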

In [6]:
import numpy as np
from surprise import Dataset

In [7]:
raw_data = np.array(data.raw_ratings, dtype=int)
raw_data

array([[      196,       242,         3, 881250949],
       [      186,       302,         3, 891717742],
       [       22,       377,         1, 878887116],
       ...,
       [      276,      1090,         1, 874795795],
       [       13,       225,         2, 882399156],
       [       12,       203,         3, 879959583]])

In [8]:
# Shift user and movie IDs to start at 0 so they can be used as array indices
raw_data[:,0] -= 1
raw_data[:,1] -= 1
raw_data


array([[      195,       241,         3, 881250949],
       [      185,       301,         3, 891717742],
       [       21,       376,         1, 878887116],
       ...,
       [      275,      1089,         1, 874795795],
       [       12,       224,         2, 882399156],
       [       11,       202,         3, 879959583]])

In [9]:
# IDs are 0-based, so these hold the largest ID (one less than the count)
n_users = np.max(raw_data[:, 0])
n_movies = np.max(raw_data[:, 1])
shape = (n_users + 1, n_movies + 1)
shape

(943, 1682)

In [10]:
n_users

942

In [11]:
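# np.zeros guarantees every entry starts at 0 (np.ndarray leaves memory uninitialized)
adj_matrix = np.zeros(shape, dtype=int)
for user_id, movie_id, rating, timestamp in raw_data:
  adj_matrix[user_id][movie_id] = 1  # 1 = this user rated this movie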
adj_matrix

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])

In [12]:
my_id, my_vector = 0, adj_matrix[0]
best_match, best_match_id, best_match_vector = -1, -1, []

for user_id, user_vector in enumerate(adj_matrix):
  if my_id != user_id:
    # Dot product of two 0/1 vectors = number of movies both users rated
    similarity = np.dot(my_vector, user_vector)
    if similarity > best_match:
      best_match = similarity
      best_match_id = user_id
      best_match_vector = user_vector
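
To see why the dot product works as a similarity score here, consider two toy 0/1 vectors. The vectors below are made up for illustration (five hypothetical movies, not rows of `adj_matrix`), and the cosine variant at the end is an assumed alternative, not something the code above uses.

```python
import numpy as np

# Toy 0/1 vectors over five movies (illustrative values only)
my_vector = np.array([1, 1, 0, 1, 0])
other_vector = np.array([1, 0, 0, 1, 1])

# Each elementwise product is 1 only where both users rated the movie,
# so the dot product counts the movies the two users have in common.
shared_count = int(np.dot(my_vector, other_vector))

# The same count, written as an explicit intersection
intersection_count = int(np.sum((my_vector == 1) & (other_vector == 1)))

# Assumed alternative: cosine similarity divides by the vector lengths,
# so users who rated many movies are not favored automatically.
cosine = shared_count / (np.linalg.norm(my_vector) * np.linalg.norm(other_vector))

print(shared_count, intersection_count, round(float(cosine), 4))
```

A raw dot product tends to favor users who have rated many movies, since they overlap with almost everyone; normalizing as in the cosine line is one common way to reduce that effect.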