<a href="https://colab.research.google.com/github/younghun-cha/DM_2022/blob/main/Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Surprise

- A library for building recommender systems
- Provides a variety of models and datasets
- Usage is similar to scikit-learn

In [None]:
# Install surprise
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633715 sha256=9b46c8c6c21ce5a3cb079ae6ef69b2a8252370abe4a9a762891dabb4905b2c53
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [None]:
import numpy as np
from surprise import Dataset

In [None]:
"""
Download the MovieLens data (ml-100k), a movie rating dataset.
Each record consists of user_id, movie_id, rating, time.
"""

data = Dataset.load_builtin('ml-100k', prompt=False)
data.raw_ratings[:10]

Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


[('196', '242', 3.0, '881250949'),
 ('186', '302', 3.0, '891717742'),
 ('22', '377', 1.0, '878887116'),
 ('244', '51', 2.0, '880606923'),
 ('166', '346', 1.0, '886397596'),
 ('298', '474', 4.0, '884182806'),
 ('115', '265', 2.0, '881171488'),
 ('253', '465', 5.0, '891628467'),
 ('305', '451', 3.0, '886324817'),
 ('6', '86', 3.0, '883603013')]

In [None]:
raw_data = np.array(data.raw_ratings, dtype=int)

In [None]:
# Shift the ids so that they start at 0
raw_data[:, 0] -= 1
raw_data[:, 1] -= 1

In [None]:
raw_data

array([[      195,       241,         3, 881250949],
       [      185,       301,         3, 891717742],
       [       21,       376,         1, 878887116],
       ...,
       [      275,      1089,         1, 874795795],
       [       12,       224,         2, 882399156],
       [       11,       202,         3, 879959583]])

In [None]:
# Size of the adjacency matrix (ids start at 0, so count = max id + 1)
n_users = np.max(raw_data[:, 0]) + 1
n_movies = np.max(raw_data[:, 1]) + 1
shape = (n_users, n_movies)
shape

(943, 1682)

In [None]:
# Build the <user, movie> adjacency matrix
# np.zeros is used because np.ndarray leaves the memory uninitialized
adj_matrix = np.zeros(shape, dtype=int)
for user_id, movie_id, rating, time in raw_data:
  adj_matrix[user_id][movie_id] = 1
adj_matrix

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])
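
The loop above can also be expressed with NumPy integer-array indexing. The following is a minimal sketch on a toy array; `toy_ratings`, `toy_shape` and `toy_adj` are hypothetical names used only for illustration and stand in for `raw_data`, `shape` and `adj_matrix`.

```python
import numpy as np

# Hypothetical toy data in the same column order as raw_data:
# [user_id, movie_id, rating, timestamp], with ids already starting at 0.
toy_ratings = np.array([
    [0, 0, 5, 100],
    [0, 2, 3, 101],
    [1, 1, 4, 102],
    [2, 0, 1, 103],
])

toy_shape = (toy_ratings[:, 0].max() + 1, toy_ratings[:, 1].max() + 1)

# Start from zeros so that every unrated cell is guaranteed to be 0.
toy_adj = np.zeros(toy_shape, dtype=int)

# Mark every (user, movie) pair that appears in the data in a single step.
toy_adj[toy_ratings[:, 0], toy_ratings[:, 1]] = 1
print(toy_adj)
```

Assigning `toy_ratings[:, 2]` instead of `1` on the last line would store the ratings themselves rather than a seen/not-seen flag.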

## Implementing User-Based Collaborative Filtering from Scratch


Recommendation using the Euclidean distance ( $d(A, B) = \sqrt{\sum_{i=1}^{D} (A_i - B_i)^2}$ )

In [None]:
my_id, my_vector = 0, adj_matrix[0] # assume that my id is 0
best_match, best_match_id, best_match_vector = 9999, -1, []

# Find the most similar user by Euclidean distance (smaller is more similar)
for user_id, user_vector in enumerate(adj_matrix):
  if my_id != user_id:
    euclidean_dist = np.sqrt(np.sum(np.square(my_vector - user_vector)))
    if euclidean_dist < best_match:
      best_match = euclidean_dist
      best_match_id = user_id
      best_match_vector = user_vector

print('Best Match: {}, Best Match ID: {}'.format(best_match, best_match_id))

Best Match: 14.832396974191326, Best Match ID: 737


In [None]:
print(my_vector, best_match_vector)

[1 1 1 ... 0 0 0] [1 1 0 ... 0 0 0]


In [None]:
# Recommend the movies that user 737 has seen and I have not

recommend_list = []
for i, log in enumerate(zip(my_vector, best_match_vector)):
  log1, log2 = log
  if log1 < 1. and log2 > 0.:
    recommend_list.append(i)
print(recommend_list)

[297, 312, 317, 342, 356, 366, 379, 384, 392, 402, 404, 407, 417, 422, 428, 433, 448, 454, 469, 473, 495, 510, 516, 526, 527, 549, 567, 602, 635, 649, 650, 654, 658, 661, 664, 696, 731, 746, 750, 754, 915, 918, 925, 929, 950, 968, 1015, 1046]
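
The nearest-neighbor search and the recommendation step above can also be done without explicit Python loops. The sketch below uses a small hypothetical matrix (`toy_adj`) rather than the MovieLens data, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical toy <user, movie> matrix (1 = seen, 0 = not seen).
toy_adj = np.array([
    [1, 1, 0, 0, 0],   # user 0 ("me")
    [1, 1, 1, 0, 0],   # user 1
    [0, 0, 1, 1, 1],   # user 2
])
me = 0

# Euclidean distance from my row to every row, via broadcasting.
dists = np.sqrt(np.sum((toy_adj - toy_adj[me]) ** 2, axis=1))
dists[me] = np.inf            # exclude myself from the search
nearest = int(np.argmin(dists))

# Movies the nearest user has seen and I have not.
recommended = np.flatnonzero((toy_adj[me] == 0) & (toy_adj[nearest] > 0))
print(nearest, recommended.tolist())
```

Here user 1 differs from user 0 in a single movie, so user 1 is the nearest neighbor and that one movie (index 2) is recommended.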


Recommendation using cosine similarity ( $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ )
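
Unlike Euclidean distance, cosine similarity depends only on the direction of the two vectors, not on their length. A minimal sketch with hypothetical toy vectors (`a`, `b`, `c` and the helper `cosine` are illustrative names, not part of the notebook):

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = 2 * a                       # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero entry with a

def cosine(v1, v2):
    """Cosine of the angle between v1 and v2 (both assumed non-zero)."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine(a, b))             # close to 1: same direction
print(cosine(a, c))             # 0: orthogonal vectors
print(np.linalg.norm(a - b))    # the Euclidean distance is still non-zero
```

For rating vectors this means two users with the same taste profile score as similar even if one of them rates systematically higher.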

In [None]:
def compute_cos_similarity(v1, v2):
  """Return the cosine similarity of v1 and v2 (both must be non-zero)."""
  norm1 = np.sqrt(np.sum(np.square(v1)))
  norm2 = np.sqrt(np.sum(np.square(v2)))
  dot = np.dot(v1, v2)
  return dot / (norm1 * norm2)

In [None]:
my_id, my_vector = 0, adj_matrix[0] # assume that my id is 0
best_match, best_match_id, best_match_vector = -1, -1, []

# Find the most similar user by cosine similarity (larger is more similar)
for user_id, user_vector in enumerate(adj_matrix):
  if my_id != user_id:
    cos_similarity = compute_cos_similarity(my_vector, user_vector)
    if cos_similarity > best_match:
      best_match = cos_similarity
      best_match_id = user_id
      best_match_vector = user_vector

print('Best Match: {}, Best Match ID: {}'.format(best_match, best_match_id))

Best Match: 0.5278586163659506, Best Match ID: 915


In [None]:
# Recommend the movies that user 915 has seen and I have not
recommend_list = []
for i, log in enumerate(zip(my_vector, best_match_vector)):
  log1, log2 = log
  if log1 < 1. and log2 > 0.:
    recommend_list.append(i)
print(recommend_list)

[272, 275, 279, 280, 283, 285, 289, 294, 297, 316, 317, 355, 365, 366, 368, 379, 380, 381, 384, 386, 392, 398, 401, 404, 416, 420, 422, 424, 426, 427, 430, 432, 450, 460, 461, 466, 469, 471, 473, 474, 475, 479, 482, 483, 497, 505, 508, 510, 511, 522, 526, 527, 529, 530, 534, 536, 540, 545, 548, 549, 556, 557, 558, 560, 565, 567, 568, 569, 577, 580, 581, 582, 592, 596, 630, 635, 639, 641, 649, 651, 654, 673, 677, 678, 683, 684, 692, 696, 701, 703, 707, 708, 709, 712, 714, 719, 720, 726, 731, 734, 736, 738, 740, 745, 747, 754, 755, 761, 762, 763, 766, 780, 789, 791, 805, 819, 823, 824, 830, 843, 862, 865, 918, 929, 930, 938, 942, 943, 947, 958, 959, 960, 970, 977, 1004, 1008, 1009, 1010, 1013, 1041, 1045, 1069, 1072, 1073, 1078, 1097, 1100, 1108, 1112, 1118, 1134, 1193, 1205, 1207, 1216, 1219, 1267, 1334, 1400, 1427, 1596, 1681]


#### Adding Explicit Feedback
Repeat the experiment after adding explicit feedback (the ratings users gave to movies) to the previous approach.

In [None]:
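# Build the <user, movie> adjacency matrix with explicit feedback (ratings)
# np.zeros is used because np.ndarray leaves the memory uninitialized
adj_matrix = np.zeros(shape, dtype=int)
for user_id, movie_id, rating, time in raw_data:
  adj_matrix[user_id][movie_id] = rating
adj_matrix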

array([[5, 3, 4, ..., 0, 0, 0],
       [4, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [5, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 5, 0, ..., 0, 0, 0]])

In [None]:
my_id, my_vector = 0, adj_matrix[0] # assume that my id is 0
best_match, best_match_id, best_match_vector = 9999, -1, []

# Find the most similar user by Euclidean distance (smaller is more similar)
for user_id, user_vector in enumerate(adj_matrix):
  if my_id != user_id:
    euclidean_dist = np.sqrt(np.sum(np.square(my_vector - user_vector)))
    if euclidean_dist < best_match:
      best_match = euclidean_dist
      best_match_id = user_id
      best_match_vector = user_vector

print('Best Match: {}, Best Match ID: {}'.format(best_match, best_match_id))

Best Match: 55.06359959174482, Best Match ID: 737


In [None]:
my_id, my_vector = 0, adj_matrix[0] # assume that my id is 0
best_match, best_match_id, best_match_vector = -1, -1, []

# Find the most similar user by cosine similarity (larger is more similar)
for user_id, user_vector in enumerate(adj_matrix):
  if my_id != user_id:
    cos_similarity = compute_cos_similarity(my_vector, user_vector)
    if cos_similarity > best_match:
      best_match = cos_similarity
      best_match_id = user_id
      best_match_vector = user_vector

print('Best Match: {}, Best Match ID: {}'.format(best_match, best_match_id))

Best Match: 0.569065731527988, Best Match ID: 915


### The Surprise Library

In [None]:
from surprise import KNNBasic, SVD
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy

#### Nearest Neighbor CF
- KNNBasic()
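
The idea behind a neighborhood model such as `KNNBasic` is to estimate a rating as a similarity-weighted average of the ratings given by the most similar users. The sketch below is a simplified illustration of that idea on toy numbers, not Surprise's implementation; `knn_predict` and its inputs are hypothetical.

```python
import numpy as np

def knn_predict(sims, neighbor_ratings, k=2):
    """Similarity-weighted average of the k most similar neighbors' ratings."""
    sims = np.asarray(sims, dtype=float)
    neighbor_ratings = np.asarray(neighbor_ratings, dtype=float)
    top = np.argsort(sims)[::-1][:k]   # indices of the k largest similarities
    return np.sum(sims[top] * neighbor_ratings[top]) / np.sum(sims[top])

# Hypothetical similarities between the target user and three other users,
# and the ratings those users gave to the target item.
est = knn_predict(sims=[0.9, 0.1, 0.5], neighbor_ratings=[4, 1, 2], k=2)
print(est)
```

With `k=2` only the neighbors with similarity 0.9 and 0.5 contribute, so the estimate is (0.9 × 4 + 0.5 × 2) / (0.9 + 0.5).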

In [None]:
model_nn = KNNBasic()
cross_validate(model_nn, data, measures=['rmse', 'mae'], cv=5, n_jobs=4, verbose=True);

Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9768  0.9809  0.9814  0.9784  0.9774  0.9790  0.0018  
MAE (testset)     0.7724  0.7755  0.7728  0.7732  0.7729  0.7734  0.0011  
Fit time          0.42    0.68    0.56    0.69    0.41    0.55    0.12    
Test time         5.60    6.61    6.85    5.69    3.59    5.67    1.15    


- RMSE: Root Mean Squared Error
- MAE: Mean Absolute Error
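
Both metrics can be computed by hand from true and predicted ratings. A minimal sketch, using hypothetical values for `true_r` and `pred_r`:

```python
import numpy as np

# Hypothetical true and predicted ratings.
true_r = np.array([4.0, 3.0, 5.0, 1.0])
pred_r = np.array([3.5, 3.0, 4.0, 2.0])

errors = pred_r - true_r
rmse = np.sqrt(np.mean(errors ** 2))   # squaring penalizes large errors more
mae = np.mean(np.abs(errors))          # average absolute size of the error
print(rmse, mae)
```

Because of the squaring, RMSE is never smaller than MAE on the same predictions, which matches the tables in this section.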

#### Latent Factor CF
- SVD()
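
According to the Surprise documentation, the `SVD` algorithm predicts a rating as the global mean plus a user bias, an item bias, and the dot product of the item and user factor vectors. The sketch below evaluates that formula for one hypothetical user and item; every number is made up for illustration, and in practice these parameters are learned by `fit()`.

```python
import numpy as np

mu = 3.5                        # global mean rating (hypothetical)
b_u, b_i = 0.2, -0.1            # user and item biases (hypothetical)
p_u = np.array([0.5, -1.0])     # latent factors of the user (hypothetical)
q_i = np.array([1.0, 0.5])      # latent factors of the item (hypothetical)

# Predicted rating: mu + b_u + b_i + q_i . p_u
est = mu + b_u + b_i + np.dot(q_i, p_u)
print(est)
```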

In [None]:
model_lf = SVD()
cross_validate(model_lf, data, measures=['rmse', 'mae'], cv=5, n_jobs=4, verbose=True);

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9291  0.9401  0.9350  0.9423  0.9368  0.9367  0.0046  
MAE (testset)     0.7299  0.7437  0.7383  0.7396  0.7391  0.7381  0.0045  
Fit time          12.73   16.45   16.43   13.56   9.24    13.68   2.68    
Test time         0.56    0.54    0.40    0.25    0.17    0.39    0.15    


Recommending with SVD

In [None]:
# Use surprise's train_test_split()
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

In [None]:
model_lf.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f6c00f44650>

In [None]:
# userID 196, itemID 302
uid, iid = str(196), str(302)

# Predicted rating
pred = model_lf.predict(uid, iid)
pred

Prediction(uid='196', iid='302', r_ui=None, est=3.920563151103652, details={'was_impossible': False})

--> Prediction fields: user ID (uid), item ID (iid), true rating (r_ui), predicted rating (est)

In [None]:
pred_test = model_lf.test(testset)

print('prediction type:', type(pred_test), ' size:', len(pred_test))
pred_test[:5]

prediction type: <class 'list'>  size: 25000


[Prediction(uid='120', iid='282', r_ui=4.0, est=3.617096216862972, details={'was_impossible': False}),
 Prediction(uid='882', iid='291', r_ui=4.0, est=3.6691888293952624, details={'was_impossible': False}),
 Prediction(uid='535', iid='507', r_ui=5.0, est=4.12655334976517, details={'was_impossible': False}),
 Prediction(uid='697', iid='244', r_ui=5.0, est=3.7060821929575143, details={'was_impossible': False}),
 Prediction(uid='751', iid='385', r_ui=4.0, est=3.409436073680719, details={'was_impossible': False})]
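
A list of predictions like the one above can be turned into per-user rankings by grouping on `uid` and sorting by `est`. The following is a minimal sketch; `Pred` is a hypothetical stand-in that exposes the same `uid`, `iid`, `r_ui` and `est` fields so the example runs without Surprise, and `top_n` is an illustrative helper, not a library function.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same leading fields as the predictions above.
Pred = namedtuple('Pred', ['uid', 'iid', 'r_ui', 'est'])

def top_n(predictions, n=2):
    """Group predictions by user and keep the n items with the highest estimate."""
    per_user = defaultdict(list)
    for p in predictions:
        per_user[p.uid].append((p.iid, p.est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in per_user.items()
    }

# Made-up predictions for two users.
toy_preds = [
    Pred('120', '282', 4.0, 3.6),
    Pred('120', '50', 5.0, 4.4),
    Pred('120', '7', 3.0, 2.9),
    Pred('882', '291', 4.0, 3.7),
]
print(top_n(toy_preds, n=2))
```

Since the function only reads the `uid`, `iid` and `est` attributes, it should also accept `pred_test` directly. Note that this ranks only the items that appear in the given predictions.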

In [None]:
pred_accuracy = accuracy.rmse(pred_test)

RMSE: 0.9469
