# 3. Collaborative Filtering (based on user reviews)

Run Anaconda Prompt as administrator, then run:

pip install scikit-surprise

In [1]:
import surprise
surprise.__version__

'1.1.3'

In [2]:
import pandas as pd
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [3]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [5]:
reader = Reader(rating_scale=(ratings['rating'].min(), ratings['rating'].max()))

In [6]:
data = Dataset.load_from_df(ratings[['userId','movieId','rating']], reader=reader)
data

<surprise.dataset.DatasetAutoFolds at 0x2641db70a30>

In [7]:
svd = SVD(random_state=0)

In [8]:
cross_validate(svd, data, measures=['RMSE','MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8909  0.8989  0.8961  0.9007  0.8967  0.8967  0.0033  
MAE (testset)     0.6876  0.6893  0.6907  0.6938  0.6898  0.6902  0.0020  
Fit time          0.83    0.85    0.79    0.84    0.86    0.84    0.02    
Test time         0.09    0.10    0.13    0.14    0.12    0.12    0.02    


{'test_rmse': array([0.89086616, 0.89886927, 0.89611269, 0.90072302, 0.89669086]),
 'test_mae': array([0.68764673, 0.68925419, 0.69071814, 0.69376742, 0.6898068 ]),
 'fit_time': (0.8343722820281982,
  0.8457889556884766,
  0.793694257736206,
  0.8437094688415527,
  0.8579907417297363),
 'test_time': (0.0864107608795166,
  0.10075688362121582,
  0.1278994083404541,
  0.14489531517028809,
  0.11665034294128418)}
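
The RMSE and MAE rows above summarize the prediction error on each held-out fold. As a minimal sketch of what the two measures compute, the snippet below uses made-up (actual, predicted) rating pairs; the helper names `rmse` and `mae` and the sample values are illustrative and come from neither the notebook nor the Surprise API.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, predicted) rating pairs."""
    return sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)

# Hypothetical ratings, chosen only to keep the arithmetic easy to follow
pairs = [(3.0, 2.5), (4.0, 4.5), (2.0, 3.0), (5.0, 4.0)]

print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does, which is why RMSE is the larger of the two numbers in the table.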

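### Cross-Validation (K-Fold Cross-Validation)

Suppose there are 100 data points.

They are split into as many folds as the `cv` value (5 here):

- A: 1-20
- B: 21-40
- C: 41-60
- D: 61-80
- E: 81-100

Each fold then serves once as the test set while the remaining four form the train set: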


- ABCD (train set), E (test set)
- ABCE (train set), D (test set)
- ABDE (train set), C (test set)
- ACDE (train set), B (test set)
- BCDE (train set), A (test set)
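
The splitting scheme above can be sketched in plain Python. This is a simplified illustration, not how Surprise implements it: the function name `k_fold_splits` is hypothetical, and the folds here are contiguous ranges to match the A-E example, whereas libraries typically shuffle the data before splitting.

```python
def k_fold_splits(n_items, k):
    """Yield (train_ids, test_ids) pairs for k contiguous folds over items 1..n_items."""
    ids = list(range(1, n_items + 1))
    fold_size = n_items // k  # assumes n_items is divisible by k, as in the 100 / 5 example
    folds = [ids[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Hold out the last fold first (E), then D, C, B, A, matching the listing above
    for test_index in reversed(range(k)):
        test_ids = folds[test_index]
        train_ids = [x for j, fold in enumerate(folds) if j != test_index for x in fold]
        yield train_ids, test_ids

splits = list(k_fold_splits(100, 5))
for train_ids, test_ids in splits:
    print(f"train: {len(train_ids)} items, test: {test_ids[0]}-{test_ids[-1]}")
```

Every data point lands in the test set exactly once, so the five scores reported by `cross_validate` together cover the whole dataset.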


In [9]:
trainset = data.build_full_trainset()
# Train (fit the model)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2641db707c0>

In [10]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [13]:
svd.predict(1, 302)

Prediction(uid=1, iid=302, r_ui=None, est=2.7142061734434044, details={'was_impossible': False})
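
For context, the `est` value comes from the biased matrix-factorization rule that the Surprise documentation gives for `SVD`. The symbols below follow the usual convention and do not appear in the notebook itself: $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the learned latent factor vectors for user $u$ and item $i$.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
```

By default, `predict` also clips the result to the `rating_scale` passed to `Reader`, so `est` stays within the observed rating range.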

In [15]:
# User 1 actually rated movie 1029 as 3; what rating does the model predict?
svd.predict(1, 1029, 3)

Prediction(uid=1, iid=1029, r_ui=3, est=2.8814455446761933, details={'was_impossible': False})

In [16]:
ratings[ratings['userId'] == 100]

Unnamed: 0,userId,movieId,rating,timestamp
15273,100,1,4.0,854193977
15274,100,3,4.0,854194024
15275,100,6,3.0,854194023
15276,100,7,3.0,854194024
15277,100,25,4.0,854193977
15278,100,32,5.0,854193977
15279,100,52,3.0,854194056
15280,100,62,3.0,854193977
15281,100,86,3.0,854194208
15282,100,88,2.0,854194208


In [17]:
# userId = 100, movieId = 1029 ==> est=3.77
svd.predict(100, 1029)

Prediction(uid=100, iid=1029, r_ui=None, est=3.7705476478414846, details={'was_impossible': False})