## **Exercise: Collaborative Filtering**

**Using the rating.csv and anime.csv datasets, build a recommendation system as follows:**

* Merge the two datasets (rating.csv and anime.csv) to show the columns ['user_id', 'anime_id', 'rating', 'name']
* Compare the SVD and ALS algorithms
* Tune whichever algorithm you judge to be better

After selecting the best model, predict the ratings for the following anime:

* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257 

For users:

* 50
* 200
* 400
* 800

In what order would you recommend these anime to each user?

## **Import libraries**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

# Dataset formatting
from surprise import Reader
from surprise import Dataset

from surprise import SVD            # SVD (matrix factorization)
from surprise import BaselineOnly   # baseline estimates, fit here with ALS

from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split
from surprise.model_selection import GridSearchCV, RandomizedSearchCV

import warnings 
warnings.filterwarnings('ignore')

## **Load dataset & preprocessing**

In [2]:
df_rating = pd.read_csv('rating.csv')
df_rating

,Unnamed: 0,user_id,anime_id,rating
0,47,1,8074,10.0
1,81,1,11617,10.0
2,83,1,11757,10.0
3,101,1,15451,10.0
4,153,2,11771,10.0
...,...,...,...,...
77863,96433,999,11757,6.0
77864,96434,999,16498,9.0
77865,96435,999,21881,5.0
77866,96436,999,22319,8.0


In [3]:
np.array(df_rating).shape

(77868, 4)

In [4]:
np.array(df_rating).reshape(-1,1)[0:8]

array([[4.7000e+01],
       [1.0000e+00],
       [8.0740e+03],
       [1.0000e+01],
       [8.1000e+01],
       [1.0000e+00],
       [1.1617e+04],
       [1.0000e+01]])

In [5]:
np.array(df_rating).reshape(-1,1).shape

(311472, 1)

In [6]:
# Drop the unused column
df_rating = df_rating.drop(columns='Unnamed: 0')
df_rating.head(10)

,user_id,anime_id,rating
0,1,8074,10.0
1,1,11617,10.0
2,1,11757,10.0
3,1,15451,10.0
4,2,11771,10.0
5,3,20,8.0
6,3,154,6.0
7,3,170,9.0
8,3,199,10.0
9,3,225,9.0


In [7]:
df_anime = pd.read_csv('anime.csv')[['anime_id', 'name']]
df_anime.head()

,anime_id,name
0,32281,Kimi no Na wa.
1,5114,Fullmetal Alchemist: Brotherhood
2,28977,Gintama°
3,9253,Steins;Gate
4,9969,Gintama&#039;


In [8]:
# Merge df_rating and df_anime --> left join on the anime_id column
df_merged = pd.merge(df_rating, df_anime, how='left', on=['anime_id'])
df_merged 

,user_id,anime_id,rating,name
0,1,8074,10.0,Highschool of the Dead
1,1,11617,10.0,High School DxD
2,1,11757,10.0,Sword Art Online
3,1,15451,10.0,High School DxD New
4,2,11771,10.0,Kuroko no Basket
...,...,...,...,...
77863,999,11757,6.0,Sword Art Online
77864,999,16498,9.0,Shingeki no Kyojin
77865,999,21881,5.0,Sword Art Online II
77866,999,22319,8.0,Tokyo Ghoul


In [9]:
# df_merged[df_merged['name'].str.contains("Hunter")]

In [10]:
df_merged.describe()

# ratings range from 1 to 10

,user_id,anime_id,rating
count,77868.0,77868.0,77868.0
mean,517.812786,10721.879116,7.855268
std,278.020509,9033.079184,1.53807
min,1.0,1.0,1.0
25%,288.0,2273.0,7.0
50%,529.0,9513.0,8.0
75%,753.0,16592.0,9.0
max,999.0,34240.0,10.0


In [11]:
# Pivot into a user-item matrix (mostly NaN, i.e. sparse)
user_item_rating_matrix = df_merged.pivot_table(values='rating', index ='user_id', columns ='anime_id')
user_item_rating_matrix

anime_id,1,5,6,7,8,15,16,17,18,19,...,33338,33341,33372,33421,33524,33558,33569,33964,34103,34240
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,8.0,,,6.0,,6.0,6.0,,...,,,,,,,,,,
7,,,,,,,,,,,...,,7.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,,,,,,,,,,9.0,...,,,,,,,,,,
996,,,,,,,,,,,...,,,,,,,,,,
997,9.0,,,,,,,,,,...,,,,,,,,,,
998,,,,,,,,,,,...,,,,,,,,,,


* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257 

In [12]:
user_item_rating_matrix.loc[[50,200,400,800], [11061,6438,1010,1257]]

anime_id,11061,6438,1010,1257
user_id,,,,
50,10.0,,,
200,,,,
400,9.0,,,
800,,,,


The user-item rating matrix consists of 940 users and 4510 anime.
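To put the matrix size in perspective, the density can be worked out from the figures already reported (77,868 ratings, 940 users, 4,510 anime). This is a rough sketch that assumes each (user, anime) pair appears at most once; `pivot_table` would average any duplicates, which would make the true count of filled cells slightly lower.

```python
# Figures reported earlier in this notebook
n_ratings = 77_868
n_users = 940
n_items = 4_510

n_cells = n_users * n_items      # every possible (user, anime) pair
density = n_ratings / n_cells    # share of cells that hold a rating (assumes no duplicate pairs)
sparsity = 1 - density

print(f"cells:    {n_cells:,}")      # cells:    4,239,400
print(f"density:  {density:.2%}")    # density:  1.84%
print(f"sparsity: {sparsity:.2%}")   # sparsity: 98.16%
```

Under that assumption, fewer than 2 in 100 cells are observed, which is why most of the pivot output is NaN.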

## **Modeling**

In [13]:
reader = Reader(rating_scale=(1, 10))

data = Dataset.load_from_df(df_merged[['user_id', 'anime_id', 'rating']], reader)
data 

<surprise.dataset.DatasetAutoFolds at 0x25a7c52ace0>

## **Validation**

In [14]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=1) 

### **SVD**

In [15]:
algo_svd = SVD(random_state=10)

algo_svd.fit(trainset)
prediction_svd = algo_svd.test(testset)

In [16]:
accuracy.rmse(prediction_svd) 

RMSE: 1.1983


1.198316370967339
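For reference, RMSE and MAE can be computed by hand from pairs of true and estimated ratings. The sketch below uses a few made-up pairs (not taken from the dataset) and plain Python; it illustrates the metric definitions and is not Surprise's own implementation.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical (true, estimated) pairs for illustration only
pairs = [(10.0, 9.0), (8.0, 8.5), (6.0, 7.5), (9.0, 9.0)]

print("RMSE:", rmse(pairs))
print("MAE: ", mae(pairs))
```

Because errors are squared before averaging, RMSE penalizes the 1.5-point miss more heavily than MAE does.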

### **ALS**

In [17]:
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }

algo_als = BaselineOnly(bsl_options=bsl_options)

algo_als.fit(trainset)
prediction_als = algo_als.test(testset)

Estimating biases using als...
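The "ALS" model here is a baseline predictor: each rating is estimated as a global mean plus a user bias and an item bias, and the biases are solved alternately. The following is a minimal pure-Python sketch of that idea, assuming the regularized closed-form updates usually described for this method; the function name `als_baselines` and the toy ratings are hypothetical, and this is not the library's code.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate the global mean, user biases and item biases
    from (user, item, rating) triples by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Hold user biases fixed, update every item bias
        for i, rs in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rs) / (reg_i + len(rs))
        # Hold item biases fixed, update every user bias
        for u, rs in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rs) / (reg_u + len(rs))
    return mu, bu, bi

# Hypothetical toy ratings: item "a" is liked, item "c" is not
ratings = [("u1", "a", 10.0), ("u1", "b", 8.0), ("u2", "a", 9.0),
           ("u2", "c", 5.0), ("u3", "b", 7.0), ("u3", "c", 4.0)]
mu, bu, bi = als_baselines(ratings)

# Baseline estimate for a pair that was never rated
print(mu + bu["u1"] + bi["c"])
```

Larger `reg_u` and `reg_i` shrink the biases toward zero, so users or items with few ratings stay close to the global mean.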


In [18]:
accuracy.rmse(prediction_als)

RMSE: 1.2059


1.2059336760803911

SVD has a slightly lower RMSE on the hold-out split (1.1983 vs. 1.2059), so hyperparameter tuning is applied to the SVD model.

## **Cross Validation**

### **SVD**

In [19]:
cv_svd = cross_validate(algo_svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2138  1.1926  1.2072  1.2071  1.2068  1.2055  0.0070  
MAE (testset)     0.9207  0.9032  0.9180  0.9153  0.9159  0.9146  0.0060  
Fit time          1.15    1.18    1.20    1.19    1.23    1.19    0.03    
Test time         0.35    0.16    0.15    0.35    0.17    0.24    0.09    


In [20]:
print('RMSE cv mean', cv_svd['test_rmse'].mean())

RMSE cv mean 1.2055110729488554
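The Mean and Std columns in the cross-validation table can be recomputed from the five per-fold RMSE values printed above. This sketch assumes the reported Std is a population standard deviation (divide by n), which does reproduce the figure shown.

```python
from statistics import fmean, pstdev

# Per-fold RMSE copied from the SVD cross-validation table
fold_rmse = [1.2138, 1.1926, 1.2072, 1.2071, 1.2068]

mean_rmse = fmean(fold_rmse)
std_rmse = pstdev(fold_rmse)   # population std, assumed to match the table

print(f"{mean_rmse:.4f} {std_rmse:.4f}")   # 1.2055 0.0070
```

The spread across folds (about 0.007) is a useful yardstick: differences between models smaller than this are hard to distinguish from split-to-split noise.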


### **ALS**

In [21]:
cv_als = cross_validate(algo_als, data, measures=['RMSE','MAE'], cv=5, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2053  1.2122  1.1899  1.2160  1.2060  1.2059  0.0089  
MAE (testset)     0.9247  0.9261  0.9048  0.9198  0.9190  0.9189  0.0075  
Fit time          0.15    0.17    0.21    0.25    0.24    0.21    0.04    
Test time         0.10    0.12    0.12    0.17    0.15    0.13    0.03    


In [22]:
print('RMSE cv mean', cv_als['test_rmse'].mean())

RMSE cv mean 1.2058868794617623


## **Hyperparameter tuning**

**SVD Grid Search**

In [23]:
# Tuning SVD
hyperparam_space = {
    'n_epochs':[5, 10, 20, 30],     # number of iterations
    'lr_all':[0.002, 0.005],        # learning rate
    'reg_all':[0.02, 0.4, 0.6]      # regularization
}

grid_search = GridSearchCV(SVD, hyperparam_space, measures=['rmse', 'mae'], cv=5)

grid_search.fit(data)

In [24]:
print('RMSE')
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

print('\nMAE')
print(grid_search.best_score['mae'])
print(grid_search.best_params['mae'])

RMSE
1.2065589891297477
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}

MAE
0.914304946805942
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
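A grid search tries every combination in the grid, and each combination is fitted once per fold. The sketch below enumerates the grid defined above with `itertools.product` to show how many model fits that implies; it only counts combinations and does not train anything.

```python
from itertools import product

hyperparam_space = {
    'n_epochs': [5, 10, 20, 30],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.02, 0.4, 0.6],
}
cv = 5

# Every combination of one value per hyperparameter
combos = [dict(zip(hyperparam_space, values))
          for values in product(*hyperparam_space.values())]
n_fits = len(combos) * cv

print(len(combos), n_fits)   # 24 120
```

With 24 combinations and 5 folds, the search fits 120 models, which is why it takes noticeably longer than a single `cross_validate` call.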


**SVD Randomized Search**

In [25]:
# Tuning SVD
hyperparam_space = {
    'n_epochs':np.arange(1,50,1),                   # number of iterations
    'lr_all':np.arange(0.001, 0.010, 0.001),        # learning rate
    'reg_all':np.arange(0.1, 1, 0.1)                # regularization
}

randomsearch = RandomizedSearchCV(SVD, hyperparam_space, measures=['rmse', 'mae'], cv=5)

randomsearch.fit(data)

In [26]:
print('RMSE')
print(randomsearch.best_score['rmse'])
print(randomsearch.best_params['rmse'])

print('\nMAE')
print(randomsearch.best_score['mae'])
print(randomsearch.best_params['mae'])

RMSE
1.2338995815733067
{'n_epochs': 32, 'lr_all': 0.007, 'reg_all': 0.4}

MAE
0.94409596708951
{'n_epochs': 32, 'lr_all': 0.001, 'reg_all': 0.1}
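A randomized search evaluates only a fixed number of sampled combinations (`n_iter`, which defaults to 10 in Surprise and is not overridden above), so it covers a small fraction of a large space. The sketch below mimics this with plain Python. The lists are the intended values of the `np.arange` calls written out by hand (an assumption, since floating-point endpoints can behave unexpectedly), and the sampling uses simple independent draws, not the library's sampler.

```python
import random

# Intended search space, written out explicitly (assumption)
space = {
    'n_epochs': list(range(1, 50)),
    'lr_all': [round(0.001 * k, 3) for k in range(1, 10)],
    'reg_all': [round(0.1 * k, 1) for k in range(1, 10)],
}

n_total = 1
for values in space.values():
    n_total *= len(values)

rng = random.Random(0)   # fixed seed so the draw is repeatable
n_iter = 10
samples = [{name: rng.choice(values) for name, values in space.items()}
           for _ in range(n_iter)]

print(n_total)                      # 3969
print(f"{n_iter / n_total:.2%}")    # share of the space actually tried
print(0.02 in space['reg_all'])     # False
```

Two things may explain why this search scored worse than the grid search: only about 10 of 3,969 combinations are tried, and the `reg_all` range starts at 0.1, so the value 0.02 that won the grid search cannot be sampled at all.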


In [27]:
# Example of tuning the ALS method
# param_grid = {'bsl_options': {'method': ['als'],
#                               'n_epochs': [5,10,15], 
#                               'reg_u': [12, 18, 27], 
#                               'reg_i': [5,50,100]}
#               }

# gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3)

# gs.fit(data)

## **Model with Hyperparameter Tuning**

In [28]:
svd_tuned = SVD(n_epochs = 20, lr_all = 0.005, reg_all = 0.02)
cv_svd_tuned = cross_validate(svd_tuned, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2054  1.2191  1.1821  1.2047  1.2171  1.2057  0.0132  
MAE (testset)     0.9132  0.9265  0.8983  0.9071  0.9210  0.9132  0.0100  
Fit time          1.03    1.12    1.07    1.24    1.12    1.12    0.07    
Test time         0.25    0.17    0.18    0.16    0.39    0.23    0.09    


In [29]:
# Compare RMSE before and after tuning
print('RMSE cv mean before tuning:', cv_svd['test_rmse'].mean())
print('RMSE cv mean after tuning:', cv_svd_tuned['test_rmse'].mean())

RMSE cv mean before tuning: 1.2055110729488554
RMSE cv mean after tuning: 1.2056876245924695
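The two means can be compared against the fold-to-fold spread reported in the cross-validation tables (Std of 0.0070 before tuning and 0.0132 after). This is only a rough sanity check, not a formal significance test.

```python
# Values copied from the outputs above
rmse_before = 1.2055110729488554
rmse_after = 1.2056876245924695
std_before, std_after = 0.0070, 0.0132

diff = rmse_after - rmse_before
within_noise = abs(diff) < min(std_before, std_after)

print(f"{diff:.6f}")    # 0.000177
print(within_noise)     # True
```

The gap is far smaller than the variation between folds, so tuning made no measurable difference here. That is plausible because the selected values (`n_epochs=20`, `lr_all=0.005`, `reg_all=0.02`) are, as far as I know, the same as Surprise's SVD defaults, and the two runs use different random splits.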


## **Prediction results**

* Hunter x Hunter (2011), anime_id 11061
* Detective Conan OVA 09, anime_id 6438
* Ranma ½, anime_id 1010
* Saint Seiya: Meiou Hades Juuni Kyuu-hen, anime_id 1257 

In [30]:
users = [50, 200, 400, 800]
anime_ids = [11061, 6438, 1010, 1257]
titles = ['Hunter x Hunter (2011)', 'Detective Conan OVA 09', 'Ranma ½', 'Saint Seiya: Meiou Hades Juuni Kyuu-hen']

# Build one row per (user, anime) pair together with its title.
# DataFrame.append was removed in pandas 2.0, so collect the rows first.
rows = [{'user_id': i, 'anime_id': j, 'title': k}
        for i in users
        for j, k in zip(anime_ids, titles)]

df_test = pd.DataFrame(rows, columns=['user_id', 'anime_id', 'title'])
df_test 

,user_id,anime_id,title
0,50,11061,Hunter x Hunter (2011)
1,50,6438,Detective Conan OVA 09
2,50,1010,Ranma ½
3,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
4,200,11061,Hunter x Hunter (2011)
5,200,6438,Detective Conan OVA 09
6,200,1010,Ranma ½
7,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
8,400,11061,Hunter x Hunter (2011)
9,400,6438,Detective Conan OVA 09


In [31]:
df_test.iloc[:, :-1]

,user_id,anime_id
0,50,11061
1,50,6438
2,50,1010
3,50,1257
4,200,11061
5,200,6438
6,200,1010
7,200,1257
8,400,11061
9,400,6438


In [32]:
df_merged.iloc[:, [1,3]]

,anime_id,name
0,8074,Highschool of the Dead
1,11617,High School DxD
2,11757,Sword Art Online
3,15451,High School DxD New
4,11771,Kuroko no Basket
...,...,...
77863,11757,Sword Art Online
77864,16498,Shingeki no Kyojin
77865,21881,Sword Art Online II
77866,22319,Tokyo Ghoul


In [33]:
df_hasil = pd.merge(df_test.iloc[:, :-1], df_merged.iloc[:, [1,3]], how='inner', on='anime_id')
df_hasil = df_hasil.drop_duplicates(ignore_index=True).sort_values('user_id')
df_hasil

,user_id,anime_id,name
0,50,11061,Hunter x Hunter (2011)
4,50,6438,Detective Conan OVA 09: The Stranger in 10 Yea...
8,50,1010,Ranma ½: Chou Musabetsu Kessen! Ranma Team vs....
12,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
1,200,11061,Hunter x Hunter (2011)
5,200,6438,Detective Conan OVA 09: The Stranger in 10 Yea...
9,200,1010,Ranma ½: Chou Musabetsu Kessen! Ranma Team vs....
13,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen
2,400,11061,Hunter x Hunter (2011)
6,400,6438,Detective Conan OVA 09: The Stranger in 10 Yea...


In [34]:
# define model
svd_predict = SVD(n_epochs=20, lr_all=0.005, reg_all=0.02)

# fitting
svd_predict.fit(trainset)

# to store the predicted scores
y = []

# Predict for each row
for index, row in df_test.iterrows():
    est = svd_predict.predict(row['user_id'], row['anime_id'])
    y.append(est.est)
    
df_test['predicted_rating'] = y

df_test.sort_values(by=['user_id', 'predicted_rating'], ascending=[True, False], inplace=True)
df_test

,user_id,anime_id,title,predicted_rating
0,50,11061,Hunter x Hunter (2011),9.982468
3,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,7.979478
1,50,6438,Detective Conan OVA 09,7.770977
2,50,1010,Ranma ½,7.55134
4,200,11061,Hunter x Hunter (2011),10.0
7,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,8.958208
5,200,6438,Detective Conan OVA 09,8.875329
6,200,1010,Ranma ½,8.727887
8,400,11061,Hunter x Hunter (2011),8.462316
11,400,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,6.533136


In [35]:
est

Prediction(uid=800, iid=1257, r_ui=None, est=8.3664800364734, details={'was_impossible': False})
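The `est` field comes from the usual biased matrix factorization form: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors, clipped to the rating scale. The sketch below uses made-up parameter values (not the fitted model's) to show how a raw estimate above the scale ends up as exactly 10, as happened for user 200 and Hunter x Hunter (2011). The helper `predict_rating` is hypothetical.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, scale=(1, 10)):
    """Biased MF estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, raw))

# Hypothetical parameters: a generous user and a highly rated anime
clipped = predict_rating(mu=7.86, b_u=0.9, b_i=1.5,
                         p_u=[0.2, -0.1], q_i=[0.5, 0.3])
print(clipped)   # 10

# With zero biases and zero factors the estimate is just the global mean
print(predict_rating(7.86, 0.0, 0.0, [0.0, 0.0], [0.0, 0.0]))   # 7.86
```

A clipped score of 10.0 therefore means "at or above the top of the scale", so ties at 10 do not carry ranking information.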

In [36]:
df_test[df_test['user_id'] == 50]

,user_id,anime_id,title,predicted_rating
0,50,11061,Hunter x Hunter (2011),9.982468
3,50,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,7.979478
1,50,6438,Detective Conan OVA 09,7.770977
2,50,1010,Ranma ½,7.55134


In [37]:
df_test[df_test['user_id'] == 200]

,user_id,anime_id,title,predicted_rating
4,200,11061,Hunter x Hunter (2011),10.0
7,200,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,8.958208
5,200,6438,Detective Conan OVA 09,8.875329
6,200,1010,Ranma ½,8.727887


In [38]:
df_test[df_test['user_id'] == 400]

,user_id,anime_id,title,predicted_rating
8,400,11061,Hunter x Hunter (2011),8.462316
11,400,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,6.533136
10,400,1010,Ranma ½,6.298927
9,400,6438,Detective Conan OVA 09,5.77418


In [39]:
df_test[df_test['user_id'] == 800]

,user_id,anime_id,title,predicted_rating
12,800,11061,Hunter x Hunter (2011),9.619072
15,800,1257,Saint Seiya: Meiou Hades Juuni Kyuu-hen,8.36648
13,800,6438,Detective Conan OVA 09,8.18551
14,800,1010,Ranma ½,8.035339
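One way to turn these predictions into a recommendation order is to sort by predicted rating and skip anime the user has already rated. The sketch below hard-codes the predicted ratings from the four tables above and the two existing ratings seen in the user-item matrix (users 50 and 400 have already rated anime 11061). Excluding rated titles is a design choice assumed here, not something the exercise prescribes, and the helper `recommend` is hypothetical.

```python
titles = {
    11061: 'Hunter x Hunter (2011)',
    6438: 'Detective Conan OVA 09',
    1010: 'Ranma ½',
    1257: 'Saint Seiya: Meiou Hades Juuni Kyuu-hen',
}

# Predicted ratings copied from the outputs above
predicted = {
    50:  {11061: 9.982468, 1257: 7.979478, 6438: 7.770977, 1010: 7.55134},
    200: {11061: 10.0,     1257: 8.958208, 6438: 8.875329, 1010: 8.727887},
    400: {11061: 8.462316, 1257: 6.533136, 1010: 6.298927, 6438: 5.77418},
    800: {11061: 9.619072, 1257: 8.36648,  6438: 8.18551,  1010: 8.035339},
}

# (user_id, anime_id) pairs that already have a rating in the dataset
already_rated = {(50, 11061), (400, 11061)}

def recommend(user):
    """Return unseen anime_ids for a user, best predicted rating first."""
    candidates = [(anime, score) for anime, score in predicted[user].items()
                  if (user, anime) not in already_rated]
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [anime for anime, _ in ranked]

for user in predicted:
    print(user, [titles[a] for a in recommend(user)])
```

Under this choice, users 200 and 800 get Hunter x Hunter (2011) first, while users 50 and 400 start with Saint Seiya: Meiou Hades Juuni Kyuu-hen; user 400 is also the only one for whom Ranma ½ ranks above Detective Conan OVA 09.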


In [40]:
df_merged

,user_id,anime_id,rating,name
0,1,8074,10.0,Highschool of the Dead
1,1,11617,10.0,High School DxD
2,1,11757,10.0,Sword Art Online
3,1,15451,10.0,High School DxD New
4,2,11771,10.0,Kuroko no Basket
...,...,...,...,...
77863,999,11757,6.0,Sword Art Online
77864,999,16498,9.0,Shingeki no Kyojin
77865,999,21881,5.0,Sword Art Online II
77866,999,22319,8.0,Tokyo Ghoul


In [48]:
df_merged[df_merged['name']=='Hunter x Hunter (2011)']

,user_id,anime_id,rating,name
682,7,11061,9.0,Hunter x Hunter (2011)
2578,38,11061,10.0,Hunter x Hunter (2011)
2863,41,11061,10.0,Hunter x Hunter (2011)
3856,50,11061,10.0,Hunter x Hunter (2011)
4518,71,11061,10.0,Hunter x Hunter (2011)
...,...,...,...,...
74985,958,11061,10.0,Hunter x Hunter (2011)
76182,977,11061,10.0,Hunter x Hunter (2011)
76646,984,11061,8.0,Hunter x Hunter (2011)
76902,991,11061,10.0,Hunter x Hunter (2011)


In [54]:
df_merged[df_merged['name']=='Hunter x Hunter (2011)']['rating'].mean()

9.398230088495575

In [53]:
df_merged[df_merged['name']=='Ranma ½']['rating'].mean() 

7.947368421052632

## **Anime recommendations for a single user**

In [41]:
df_merged[df_merged['user_id']==1]

,user_id,anime_id,rating,name
0,1,8074,10.0,Highschool of the Dead
1,1,11617,10.0,High School DxD
2,1,11757,10.0,Sword Art Online
3,1,15451,10.0,High School DxD New


In [42]:
df_merged['anime_id'].nunique()

4510

In [43]:
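# Predict a score for every anime for one user
user_id = 1

# One row per anime_id, so the id list and the name list stay aligned
# (calling unique() on each column separately can misalign them)
df_unique = df_merged[['anime_id', 'name']].drop_duplicates(subset='anime_id')
anime = list(df_unique['anime_id'])
name = list(df_unique['name'])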

In [44]:
svd_predict = SVD(n_epochs=20, lr_all=0.005, reg_all=0.02)
svd_predict.fit(trainset)

# predict scores for every anime for user 1
anime_score = [svd_predict.predict(user_id, anime_id).est for anime_id in anime]
anime_score

[9.429491013964357,
 9.154365292148643,
 9.87092194636782,
 9.210142875472952,
 9.254350656237678,
 8.674672149842834,
 8.64628911282941,
 9.560832006967539,
 9.82723986542903,
 7.813435759287004,
 8.381201716680007,
 8.729404826889722,
 8.136844806774922,
 8.75411585921585,
 9.81456201329848,
 8.033659837538055,
 7.88420243770729,
 7.995746873743311,
 7.659282992903695,
 8.352862143404012,
 8.036751080942379,
 8.281356054747295,
 9.26281703172028,
 7.894701420997249,
 9.164970172593886,
 8.407303436232473,
 9.019440905956216,
 7.951066938367645,
 7.919412189532413,
 8.361144011576167,
 9.015065969485319,
 8.263574978222385,
 9.558107929175002,
 8.248172717181726,
 8.293911867445276,
 9.322526730877359,
 8.522580295912151,
 8.120519952097071,
 7.754707572225768,
 8.790380012648399,
 8.519167571270163,
 7.783698310467907,
 8.793909377803317,
 9.408132945219768,
 8.802492027668812,
 8.061249943526473,
 9.139529761105704,
 7.668881901881514,
 8.647061111270052,
 8.635170370930073,
 7.7006

In [45]:
# Recommendations for a single user
recomToUser = pd.DataFrame({
                            'anime_id': anime, 
                            'title':name,
                            'score': anime_score
                            }).sort_values(by='score', ascending=False)

recomToUser.head(20)

,anime_id,title,score
895,6114,Rainbow: Nisha Rokubou no Shichinin,10.0
297,9969,Gintama&#039;,9.976159
289,9253,Steins;Gate,9.914439
2,11757,Sword Art Online,9.870922
586,11061,Hunter x Hunter (2011),9.837916
8,199,Sen to Chihiro no Kamikakushi,9.82724
14,813,Dragon Ball Z,9.814562
898,6675,Redline,9.795337
725,164,Mononoke Hime,9.755905
1344,28977,Gintama°,9.747485
