# [Surprise](https://surpriselib.com/)
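- Surprise is a Python package dedicated to building recommender systems. It provides an API and framework similar to scikit-learn, so anyone who understands the basic recommendation algorithms and has used scikit-learn can pick it up easily.   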

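- Surprise only accepts datasets in row-level (long) format, where each row holds a user ID, an item ID, and a rating. It therefore loads data on the assumption that the first column is the user ID, the second is the item ID, and the third is the rating. Columns from the fourth onward are not used as rating data (an optional timestamp may be read, as in `ml-100k`, but other columns are not loaded).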
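
The column convention can be illustrated with a small pure-Python sketch. It is not Surprise's actual parser: the helper `parse_rating_line` is a hypothetical name, and the sketch only assumes that each line holds the user, item, and rating in that order, with anything after the third field dropped. The two sample lines follow the tab-separated layout of the `ml-100k` ratings file.

```python
def parse_rating_line(line, sep="\t"):
    """Split one raw line and keep only (user, item, rating)."""
    fields = line.strip().split(sep)
    user, item, rating = fields[:3]  # fields from the fourth onward are ignored
    return user, item, float(rating)


# Sample lines in the user / item / rating / timestamp layout
raw_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
]
parsed = [parse_rating_line(line) for line in raw_lines]
print(parsed)
```

IDs are kept as strings here, while ratings are converted to floats so they can be used in arithmetic.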

## [내장 데이터셋](https://surprise.readthedocs.io/en/stable/dataset.html)
- Movie rating datasets  
`ml-100k`, `ml-1m`

## [알고리즘](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html)
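- BaselineOnly  
A baseline algorithm that accounts for user bias and item bias. The biases are estimated with ALS by default, or with SGD if configured.
- KNNWithMeans  
A KNN algorithm that also accounts for each user's rating tendency (mean rating).
- SVD  
The SVD algorithm for latent-factor collaborative filtering via matrix factorization.
- SVDpp  
An SVD variant (SVD++) that also takes implicit feedback into account, encoded as a binary indicator of whether the user rated a given item.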
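
For reference, the prediction rules behind these four algorithms can be written as below. The notation is an assumption based on common usage: $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $\mu_u$ is the mean rating of user $u$, $N_i^k(u)$ is the set of the $k$ nearest neighbours of $u$ that rated item $i$, $p_u$ and $q_i$ are latent factor vectors, and $I_u$ is the set of items rated by $u$, with implicit item factors $y_j$. The KNN rule is shown in its user-based form.

```latex
\begin{align*}
\text{BaselineOnly:}\quad & \hat{r}_{ui} = \mu + b_u + b_i \\
\text{KNNWithMeans:}\quad & \hat{r}_{ui} = \mu_u +
    \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)\,(r_{vi} - \mu_v)}
         {\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)} \\
\text{SVD:}\quad & \hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u \\
\text{SVDpp:}\quad & \hat{r}_{ui} = \mu + b_u + b_i +
    q_i^{\top}\Bigl(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\Bigr)
\end{align*}
```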

In [1]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp39-cp39-linux_x86_64.whl size=3193652 sha256=9fe1d5d1ed8d4b71aa41373614cd1a3aa404104149c7a1e7df4107d2ce7c440d
  Stored in directory: /root/.cache/pip/wheels/c6/3a/46/9b17b3512bdf283c6cb84f59929cdd5199d4e754d596d22784
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [2]:
from surprise import Dataset, Reader

from surprise.model_selection import train_test_split

import numpy as np
import pandas as pd

In [3]:
# Load the MovieLens 100K dataset
data = Dataset.load_builtin(name='ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


Inspect the data

In [4]:
data.raw_ratings[:10]

[('196', '242', 3.0, '881250949'),
 ('186', '302', 3.0, '891717742'),
 ('22', '377', 1.0, '878887116'),
 ('244', '51', 2.0, '880606923'),
 ('166', '346', 1.0, '886397596'),
 ('298', '474', 4.0, '884182806'),
 ('115', '265', 2.0, '881171488'),
 ('253', '465', 5.0, '891628467'),
 ('305', '451', 3.0, '886324817'),
 ('6', '86', 3.0, '883603013')]

Check the location of the downloaded file

In [5]:
data.ratings_file

'/root/.surprise_data/ml-100k/ml-100k/u.data'

In [6]:
ratings = pd.read_csv(data.ratings_file, sep="\t", header=None) 
ratings.columns = ['user', 'item', 'rating', 'datetime']
ratings.head()

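|   | user | item | rating | datetime |
|---|------|------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |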


# EDA

In [7]:
import matplotlib.pyplot as plt

plt.ion();

In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user      100000 non-null  int64
 1   item      100000 non-null  int64
 2   rating    100000 non-null  int64
 3   datetime  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [9]:
ratings.isnull().sum().sum()

0

In [11]:
ratings_groupby = ratings.groupby(['item'])['rating'].agg(['mean', 'count', 'min', 'max']).sort_values(by=['count', 'mean'], ascending=False)

ratings_groupby.head()

| item | mean | count | min | max |
|------|----------|-------|-----|-----|
| 50 | 4.358491 | 583 | 1 | 5 |
| 258 | 3.803536 | 509 | 1 | 5 |
| 100 | 4.155512 | 508 | 1 | 5 |
| 181 | 4.00789 | 507 | 1 | 5 |
| 294 | 3.156701 | 485 | 1 | 5 |


In [12]:
ratings.describe()

|       | user | item | rating | datetime |
|-------|------|------|--------|----------|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 462.48475 | 425.53013 | 3.52986 | 883528900.0 |
| std | 266.61442 | 330.798356 | 1.125674 | 5343856.0 |
| min | 1.0 | 1.0 | 1.0 | 874724700.0 |
| 25% | 254.0 | 175.0 | 3.0 | 879448700.0 |
| 50% | 447.0 | 322.0 | 4.0 | 882826900.0 |
| 75% | 682.0 | 631.0 | 4.0 | 888260000.0 |
| max | 943.0 | 1682.0 | 5.0 | 893286600.0 |


# Reader

In [13]:
from surprise import Reader

In [14]:
# u.data is tab-separated and its ratings range from 1 to 5
reader = Reader(line_format='user item rating timestamp', sep='\t', rating_scale=(1, 5))

In [15]:
reader

<surprise.reader.Reader at 0x7f81154c0550>

In [16]:
# Split into train and test sets at a 0.75 : 0.25 ratio
trainset, testset = train_test_split(data, test_size=0.25)
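
Conceptually, a hold-out split like this shuffles the ratings and sets aside a fraction of them for testing. The sketch below is a simplified stand-in in plain Python, not Surprise's implementation; `holdout_split` and the toy ratings are hypothetical names and data introduced only for illustration.

```python
import random


def holdout_split(ratings, test_size=0.25, seed=0):
    """Shuffle (user, item, rating) tuples and split off a test fraction."""
    shuffled = list(ratings)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy ratings, made up for illustration
toy_ratings = [(f"u{k % 3}", f"i{k}", float(1 + k % 5)) for k in range(8)]
train, test = holdout_split(toy_ratings, test_size=0.25)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing models on the same held-out ratings.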

# Training the SVD model

In [17]:
from surprise import SVD

from surprise import accuracy

In [18]:
SEED = 42
svd = SVD(n_factors=50, random_state=SEED)
svd.fit(trainset)
predictions = svd.test(testset)
accuracy.rmse(predictions)

RMSE: 0.9400


0.9399818507888623
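
The RMSE reported here is the square root of the mean squared difference between the true and estimated ratings. A minimal sketch of that computation follows, using made-up (true rating, estimate) pairs rather than the actual predictions from the SVD run; the function name `rmse` is this sketch's own.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical pairs, not taken from the SVD predictions
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
print(round(rmse(pairs), 4))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones do.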

# Cross Validate

In [19]:
from surprise.model_selection import cross_validate

In [20]:
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9374  0.9335  0.9307  0.9375  0.9374  0.9353  0.0028  
MAE (testset)     0.7385  0.7362  0.7338  0.7412  0.7356  0.7370  0.0026  
Fit time          1.39    1.30    1.32    1.72    1.16    1.38    0.19    
Test time         0.16    0.36    0.21    0.24    0.15    0.22    0.07    


{'test_rmse': array([0.93744481, 0.93346356, 0.93068694, 0.93748618, 0.93737668]),
 'test_mae': array([0.73846362, 0.73617184, 0.73377462, 0.74121204, 0.73560347]),
 'fit_time': (1.3880789279937744,
  1.2964458465576172,
  1.319408893585205,
  1.7168397903442383,
  1.1601715087890625),
 'test_time': (0.16087889671325684,
  0.35744762420654297,
  0.20978665351867676,
  0.23643755912780762,
  0.15059995651245117)}
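
The Mean and Std columns of the table can be reproduced from the per-fold scores. The sketch below copies the `test_rmse` values from the output and assumes that the Std column is a population standard deviation, which matches the printed figures.

```python
from statistics import fmean, pstdev

# Per-fold RMSE values copied from the cross_validate output
test_rmse = [0.93744481, 0.93346356, 0.93068694, 0.93748618, 0.93737668]

mean_rmse = fmean(test_rmse)
std_rmse = pstdev(test_rmse)  # population standard deviation
print(f"RMSE mean={mean_rmse:.4f} std={std_rmse:.4f}")
```

A small standard deviation relative to the mean, as here, suggests the score is stable across folds.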