# Simple Algorithm for Recommendation (SAR) Single Node on MovieLens

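**SAR** is a fast, scalable algorithm for personalized recommendations based on users' transaction history. It produces recommendations that are easy to explain and interpret, and it can handle cold-item and semi-cold-item problems. SAR is a neighborhood-based algorithm that ranks items for each user.  
SAR recommends the items most similar to those a user already has an affinity for. Two items count as similar when users who interacted with one are also likely to have interacted with the other. A user is considered to have an affinity for an item if they have interacted with it in the past.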

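Advantages of SAR:
- High accuracy for an algorithm that is easy to train and deploy.
- Fast training, since it only requires simple computations to build the matrices used at prediction time.
- Fast scoring, since it only involves multiplying the similarity matrix by an affinity vector.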
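
The scoring step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the toy interaction matrix is made up, affinities are plain 0/1 interactions (no ratings or time decay), and item similarity is Jaccard computed from co-occurrence counts.

```python
import numpy as np

# Toy binary user-item interaction matrix (3 users x 4 items); values are illustrative.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Item co-occurrence: C[i, j] = number of users who interacted with both i and j.
cooccurrence = interactions.T @ interactions

# Jaccard similarity: C[i, j] / (C[i, i] + C[j, j] - C[i, j]).
item_counts = np.diag(cooccurrence)
similarity = cooccurrence / (item_counts[:, None] + item_counts[None, :] - cooccurrence)

# Scoring: each user's affinity vector multiplied by the item similarity matrix.
scores = interactions @ similarity

# Hide items the user has already interacted with, then pick the best remaining item.
scores_unseen = np.where(interactions > 0, -np.inf, scores)
top_item_per_user = scores_unseen.argmax(axis=1)
print(top_item_per_user)
```

Because both training (building `cooccurrence` and `similarity`) and scoring reduce to matrix products, the cost is dominated by a handful of sparse multiplications in a real implementation.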

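Things to keep in mind when using SAR:
- Because it does not use item or user features, it can be at a disadvantage compared with algorithms that do.
- It can run short of memory, because it builds an *m* x *m* sparse matrix (where *m* is the number of items).
- SAR favors an implicit-rating scenario and does not predict ratings.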
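
To get a feel for the memory concern, the following back-of-the-envelope sketch computes an upper bound on the size of the *m* x *m* item matrix if every cell were stored. The helper name and the 8 bytes per value (float64) are assumptions for illustration; a sparse matrix stores only non-zero cells and will usually need far less.

```python
def dense_item_matrix_bytes(num_items: int, bytes_per_value: int = 8) -> int:
    """Upper bound: bytes needed if every cell of the m x m matrix were stored."""
    return num_items * num_items * bytes_per_value

# 1,639 is the number of unique training items reported later in this notebook;
# the larger catalog sizes are hypothetical.
for m in (1_639, 100_000, 1_000_000):
    gib = dense_item_matrix_bytes(m) / 2**30
    print(f"m = {m:>9,}: up to {gib:,.2f} GiB if stored densely")
```

The quadratic growth is the point: a catalog 10 times larger needs up to 100 times more memory in the worst case.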

In [1]:
%load_ext autoreload
%autoreload 2

import logging
import sys
import numpy as np
import pandas as pd
import scrapbook as sb
from sklearn.preprocessing import minmax_scale

from recommenders.utils.python_utils import binarize
from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k, rmse, mae, logloss, rsquared, exp_var
from recommenders.models.sar import SAR

### 1. Load data

In [2]:
top_k = 10
movielens_data_size = '100k'
data = movielens.load_pandas_df(size=movielens_data_size)
data['rating'] = data['rating'].astype(np.float32)

100%|█████████████████████████████████████| 4.81k/4.81k [00:02<00:00, 1.85kKB/s]


In [3]:
train, test = python_stratified_split(data, ratio=0.75, col_user='userID', col_item='itemID', seed=27)

print("""
Train:
Total Ratings: {train_total}
Unique Users: {train_users}
Unique Items: {train_items}

Test:
Total Ratings: {test_total}
Unique Users: {test_users}
Unique Items: {test_items}
""".format(
    train_total=len(train),
    train_users=len(train['userID'].unique()),
    train_items=len(train['itemID'].unique()),
    test_total=len(test),
    test_users=len(test['userID'].unique()),
    test_items=len(test['itemID'].unique()),
))


Train:
Total Ratings: 74992
Unique Users: 943
Unique Items: 1639

Test:
Total Ratings: 25008
Unique Users: 943
Unique Items: 1465
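
Both splits contain all 943 users because the split is stratified by user. The idea can be sketched with plain pandas as below; this is a simplified illustration, not the code behind `python_stratified_split`, and the helper name `per_user_split` and the toy data are hypothetical.

```python
import pandas as pd

def per_user_split(df, ratio=0.75, col_user="userID", seed=27):
    """Shuffle the rows, then send roughly `ratio` of each user's rows to train."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    rank = shuffled.groupby(col_user).cumcount()                   # position within the user
    size = shuffled.groupby(col_user)[col_user].transform("size")  # rows per user
    in_train = rank < (size * ratio).round()
    return shuffled[in_train], shuffled[~in_train]

# Toy data: user 1 has 4 interactions, user 2 has 8.
toy = pd.DataFrame({"userID": [1] * 4 + [2] * 8, "itemID": list(range(12))})
toy_train, toy_test = per_user_split(toy)
print(toy_train.groupby("userID").size())
print(toy_test.groupby("userID").size())
```

Splitting within each user keeps every user present in both sets, which is what the ranking metrics later in the notebook rely on.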



### 2. Train the model

In [4]:
# Instantiate the algorithm and set the index
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-8s %(message)s')

model = SAR(col_user='userID', col_item='itemID', col_rating='rating',
           col_timestamp='timestamp', similarity_type='jaccard',
           time_decay_coefficient=30, timedecay_formula=True, normalize=True)

# Fit the model on the training data and get the top-k recommendations for the test users
with Timer() as train_time:
    model.fit(train)
print('Took {} secs for training.'.format(train_time.interval))

with Timer() as test_time:
    top_k = model.recommend_k_items(test, remove_seen=True)
print('Took {} secs for prediction.'.format(test_time.interval))


2022-06-24 10:16:41,204 INFO     Collecting user affinity matrix
2022-06-24 10:16:41,206 INFO     Calculating time-decayed affinities
2022-06-24 10:16:41,218 INFO     Creating index columns
2022-06-24 10:16:41,263 INFO     Calculating normalization factors
2022-06-24 10:16:41,281 INFO     Building user affinity sparse matrix
2022-06-24 10:16:41,284 INFO     Calculating item co-occurrence
2022-06-24 10:16:41,389 INFO     Calculating item similarity
2022-06-24 10:16:41,389 INFO     Using jaccard based similarity
2022-06-24 10:16:41,417 INFO     Done training
2022-06-24 10:16:41,418 INFO     Calculating recommendation scores
2022-06-24 10:16:41,510 INFO     Removing seen items


Took 0.21453691699997535 secs for training.
Took 0.10647214099998337 secs for prediction.
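
The model above is created with `time_decay_coefficient=30` and `timedecay_formula=True`, and the log shows a "Calculating time-decayed affinities" step. The sketch below illustrates exponential half-life decay, assuming the coefficient is a half-life in days so that an interaction's weight halves every 30 days. That reading is an assumption to check against the library documentation; the function name and the reference time are hypothetical.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60

def time_decayed_weight(timestamps, reference_time, half_life_days=30.0):
    """Weight each event by 2 ** (-(age in days) / half_life_days)."""
    age_days = (reference_time - np.asarray(timestamps, dtype=float)) / SECONDS_PER_DAY
    return np.power(2.0, -age_days / half_life_days)

reference = 879428451  # hypothetical "now", in Unix seconds
timestamps = [
    reference,                         # just happened
    reference - 30 * SECONDS_PER_DAY,  # one half-life ago
    reference - 60 * SECONDS_PER_DAY,  # two half-lives ago
]
weights = time_decayed_weight(timestamps, reference)
print(weights)
```

Under this scheme a user's affinity for an item would be the sum of their ratings for it, each multiplied by its weight, so recent interactions count more than old ones.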


In [9]:
top_k

      userID  itemID  prediction
0          1     433    3.249128
1          1     216    3.221920
2          1     367    3.175474
3          1     234    3.174095
4          1     423    3.165910
...      ...     ...         ...
9425     943     196    3.049325
9426     943      79    3.043222
9427     943       4    3.029936
9428     943      62    3.020848


In [5]:
# Evaluate
eval_map = map_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=10)
eval_ndcg = ndcg_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=10)
eval_precision = precision_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=10)
eval_recall = recall_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=10)
eval_rmse = rmse(test, top_k, col_user='userID', col_item='itemID', col_rating='rating')
eval_mae = mae(test, top_k, col_user='userID', col_item='itemID', col_rating='rating')
eval_rsquared = rsquared(test, top_k, col_user='userID', col_item='itemID', col_rating='rating')
eval_exp_var = exp_var(test, top_k, col_user='userID', col_item='itemID', col_rating='rating')

positivity_threshold = 2 # threshold (float) – Threshold below which all values are set to 0, else 1.
test_bin = test.copy()
test_bin['rating'] = binarize(test_bin['rating'], positivity_threshold)

top_k_prob = top_k.copy()
top_k_prob['prediction'] = minmax_scale(top_k_prob['prediction'].astype(float))

eval_logloss = logloss(test_bin, top_k_prob, col_user='userID', col_item='itemID', col_rating='rating')
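
The log-loss preparation above turns ratings into 0/1 labels and rescales scores into [0, 1]. A minimal NumPy sketch of those steps follows. The toy arrays are made up, the strict `>` comparison is an assumption about how `binarize` treats values equal to the threshold, and the clipped binary cross-entropy is the textbook formula rather than the library's code.

```python
import numpy as np

ratings = np.array([1.0, 2.0, 4.0, 5.0])     # toy ground-truth ratings
scores = np.array([3.02, 3.05, 3.17, 3.25])  # toy model scores

# Binarize: 1 if the rating exceeds the threshold, else 0 (boundary handling assumed).
threshold = 2
labels = np.where(ratings > threshold, 1.0, 0.0)

# Min-max scaling maps the smallest score to 0 and the largest to 1.
probs = (scores - scores.min()) / (scores.max() - scores.min())

# Binary cross-entropy, with clipping so that log(0) never occurs.
eps = 1e-15
clipped = np.clip(probs, eps, 1 - eps)
log_loss = -np.mean(labels * np.log(clipped) + (1 - labels) * np.log(1 - clipped))
print(labels, probs, log_loss)
```

Min-max scaled scores are not calibrated probabilities, so a log loss computed this way is best read as a rough relative indicator.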

In [6]:
print("Model:\t",
      "Top K:\t%d" % 10,
      "MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall,
      "RMSE:\t%f" % eval_rmse,
      "MAE:\t%f" % eval_mae,
      "R2:\t%f" % eval_rsquared,
      "Exp var:\t%f" % eval_exp_var,
      "Logloss:\t%f" % eval_logloss,
      sep='\n')


Model:	
Top K:	10
MAP:	0.112251
NDCG:	0.384147
Precision@K:	0.329692
Recall@K:	0.180919
RMSE:	1.244274
MAE:	1.045549
R2:	-0.499194
Exp var:	0.071549
Logloss:	0.593063
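
For intuition about the ranking metrics reported above, here is a per-user sketch of precision@k and recall@k. It is a simplified illustration with made-up item IDs: the helper name is hypothetical, and the library functions aggregate over all users and may differ in details.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top = list(recommended)[:k]
    hits = len(set(top) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = [11, 22, 33, 44, 55]  # ranked recommendations (illustrative)
relevant = {22, 55, 99}             # items the user interacted with in the test set
p, r = precision_recall_at_k(recommended, relevant, k=5)
print(p, r)
```

Here two of the five recommendations are relevant, and two of the user's three relevant items were recovered.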


In [8]:
# Results for a specific user
user_id = 876
ground_truth = test[test["userID"]==user_id].sort_values(by='rating', ascending=False)[:10]
print(ground_truth)
prediction = model.recommend_k_items(pd.DataFrame(dict(userID=[user_id])), remove_seen=True)
pd.merge(ground_truth, prediction, on=['userID', 'itemID'], how='left')

2022-06-24 10:21:39,589 INFO     Calculating recommendation scores
2022-06-24 10:21:39,594 INFO     Removing seen items


       userID  itemID  rating  timestamp
86057     876     511     5.0  879428354
97507     876     523     5.0  879428378
71005     876     187     4.0  879428354
92013     876      22     4.0  879428451
81517     876     289     3.0  879428145


   userID  itemID  rating  timestamp  prediction
0     876     511     5.0  879428354         NaN
1     876     523     5.0  879428378         NaN
2     876     187     4.0  879428354         NaN
3     876      22     4.0  879428451         NaN
4     876     289     3.0  879428145         NaN


In [9]:
test

       userID  itemID  rating  timestamp
39296       1     237     2.0  875071749
24493       1      30     3.0  878542515
7101        1     143     1.0  875072631
820         1     265     4.0  878542441
6541        1     258     5.0  878873389
...       ...     ...     ...        ...
76743     943     423     3.0  888639231
81296     943      98     5.0  888638980
75114     943     188     4.0  888639269
85387     943      79     5.0  888639019
