## Methodology (Continued)

With our FSM and ALS models up and running, I look for ways to improve on them. I turn to Singular Value Decomposition (SVD) from the Surprise library, the well-known algorithm popularized by Simon Funk during the Netflix Prize in 2006. Read more in the [Surprise documentation](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).

In [7]:
import pandas as pd
import numpy as np
import json
import pickle

from surprise import Reader, Dataset

from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

As before, we load our interaction dataframe from a pickle file.

In [2]:
df = pd.read_pickle('interact')

In [18]:
df['book_id'].nunique()

17034

We begin by transforming the dataset into a format compatible with `surprise`. To do this, we need the `Reader` and `Dataset` classes; `Dataset` has a method specifically for loading dataframes.

In [5]:
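# The default Reader assumes ratings on a 1-5 scale
reader = Reader()
# df must hold user ids, item ids, and ratings, in that column order
data = Dataset.load_from_df(df, reader)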

Let's look at how many users and items our dataset contains. For neighborhood-based methods, these counts help us decide whether to compute user-user or item-item similarities.

In [6]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  881 

Number of items:  17034
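To make the user-versus-item choice concrete, here is a rough back-of-the-envelope sketch. It assumes a dense square similarity matrix stored as 8-byte floats, which is my simplification and not necessarily how any library stores it; `similarity_matrix_entries` is a hypothetical helper.

```python
def similarity_matrix_entries(n_entities: int) -> int:
    """Number of entries in a dense n x n similarity matrix."""
    return n_entities * n_entities


# Counts taken from the output printed above
n_users, n_items = 881, 17034

user_entries = similarity_matrix_entries(n_users)
item_entries = similarity_matrix_entries(n_items)

# Assumption: 8 bytes per entry (float64), dense storage
BYTES_PER_ENTRY = 8

print(f"user-user: {user_entries:,} entries, "
      f"{user_entries * BYTES_PER_ENTRY / 1e6:.1f} MB")
print(f"item-item: {item_entries:,} entries, "
      f"{item_entries * BYTES_PER_ENTRY / 1e6:.1f} MB")
```

With far fewer users than items, the user-user matrix is much smaller, which is one reason to prefer user-based similarity for this dataset.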


### Determine the best model

`Surprise` offers many different models we can use. We will use RMSE to evaluate models to see which ones perform best. 
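As a reminder of what these metrics measure, here is a minimal pure-Python sketch of RMSE and MAE. The function names and the toy ratings are made up for illustration; Surprise computes these for us during cross-validation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: average size of the error, in rating units."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy data, invented for this example
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, predicted_ratings))  # 0.75
print(mae(true_ratings, predicted_ratings))   # 0.625
```

For both metrics, lower is better, and both are expressed in the same units as the ratings.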

### SVD

We start with the SVD model, using a grid search with cross-validation to find the best hyperparameters.

In [8]:
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}

g_s_svd = GridSearchCV(SVD, 
                       param_grid=params, 
                       n_jobs=-1)

g_s_svd.fit(data)

In [9]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.8373443673078397, 'mae': 0.6525923118990467}
{'rmse': {'n_factors': 20, 'reg_all': 0.02}, 'mae': {'n_factors': 20, 'reg_all': 0.02}}
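To get a feel for how much work the grid search does, the sketch below enumerates the parameter combinations by hand. This is only an illustration of the idea, not Surprise's internal code; `expand_grid` is a hypothetical helper, and the fold count of 5 assumes the default used when `cv` is not passed.

```python
from itertools import product

# Same grid as in the cell above
params = {'n_factors': [20, 50, 100],
          'reg_all': [0.02, 0.05, 0.1]}


def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = expand_grid(params)
n_folds = 5  # assumption: default number of folds when cv is not given
total_fits = len(combos) * n_folds

for combo in combos:
    print(combo)
print(f"{len(combos)} combinations x {n_folds} folds = {total_fits} model fits")
```

Each combination is fitted once per fold, so widening the grid quickly multiplies the training time.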


The SVD model reaches an RMSE of 0.84, the best we have achieved so far. I record the best hyperparameters for the model. Let's see how other models perform.
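For intuition about what `n_factors` and `reg_all` control, here is a small pure-Python sketch of Funk-style matrix factorization trained with stochastic gradient descent. It is a simplified illustration under my own assumptions (a single learning rate `lr_all`, Gaussian initialization, no clipping to the rating scale) and not the library's implementation; `train_funk_svd` and the toy ratings are hypothetical.

```python
import random


def train_funk_svd(ratings, n_factors=2, reg_all=0.02, lr_all=0.005,
                   n_epochs=100, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by SGD; return a predict function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    # n_factors sets the length of each latent vector
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # reg_all shrinks every parameter toward zero to limit overfitting
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                p_f, q_f = pu[u][f], qi[i][f]
                pu[u][f] += lr_all * (err * q_f - reg_all * p_f)
                qi[i][f] += lr_all * (err * p_f - reg_all * q_f)

    def predict(u, i):
        # Unknown users or items fall back to the terms that are available
        est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
        if u in pu and i in qi:
            est += sum(p * q for p, q in zip(pu[u], qi[i]))
        return est

    return predict


# Toy (user, item, rating) triples, invented for this example
toy_ratings = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4),
               ("b", "z", 2), ("c", "y", 1), ("c", "z", 2)]

predict = train_funk_svd(toy_ratings)
print(predict("a", "x"))
print(predict("nobody", "nothing"))  # falls back to the global mean
```

More factors give the model more capacity, while a larger `reg_all` pulls the biases and factors toward zero.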

### KNNBasic

KNNBasic is a basic collaborative filtering algorithm derived directly from the nearest-neighbors approach.

In [10]:
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [11]:
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([0.99683606, 1.00200713, 1.00280278, 1.01272573, 0.99766759]))
('test_mae', array([0.76190968, 0.76151004, 0.77457144, 0.77824296, 0.76421785]))
('fit_time', (0.37100744247436523, 0.3700108528137207, 0.36901235580444336, 0.3640265464782715, 0.34707164764404297))
('test_time', (0.30219316482543945, 0.2952110767364502, 0.2982039451599121, 0.2982029914855957, 0.3011953830718994))
-----------------------
1.0024078592096344


The RMSE from KNNBasic is 1.00, which does not beat our SVD model.
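For reference, the Pearson similarity selected in `sim_options` can be sketched in a few lines of plain Python. This version computes the correlation over co-rated items only and returns 0 when it is undefined; those edge-case choices are my assumptions, and the library's implementation may differ in details. The users and ratings are invented.

```python
import math


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items.

    Each argument maps item id -> rating.
    """
    common = ratings_u.keys() & ratings_v.keys()
    n = len(common)
    if n < 2:
        return 0.0  # assumption: treat too little overlap as no similarity
    mean_u = sum(ratings_u[i] for i in common) / n
    mean_v = sum(ratings_v[i] for i in common) / n
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    denom = math.sqrt(var_u * var_v)
    return cov / denom if denom else 0.0


# Toy users, invented for this example
alice = {"b1": 5, "b2": 3, "b3": 4}
bob = {"b1": 4, "b2": 2, "b3": 3, "b4": 5}   # same taste, rates one point lower
carol = {"b1": 1, "b2": 5, "b3": 3}          # opposite taste

print(pearson_similarity(alice, bob))    # 1.0
print(pearson_similarity(alice, carol))  # -1.0
```

Because Pearson centers each user's ratings, Alice and Bob come out as perfectly similar even though Bob rates everything one point lower.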

### KNNBaseline

KNNBaseline is another algorithm derived from the nearest-neighbors approach; unlike KNNBasic, it also takes a baseline rating into account.

In [12]:
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [13]:
for i in cv_knn_baseline.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_baseline['test_rmse']))

('test_rmse', array([0.89205312, 0.88113255, 0.8899517 , 0.88192065, 0.88542957]))
('test_mae', array([0.68712757, 0.67935697, 0.68107544, 0.68181014, 0.67876732]))
('fit_time', (0.3969690799713135, 0.44284653663635254, 0.41288208961486816, 0.42882728576660156, 0.4138932228088379))
('test_time', (0.36900758743286133, 0.3789811134338379, 0.38497066497802734, 0.38796281814575195, 0.35724592208862305))
-----------------------
0.8860975169707116


The RMSE from KNNBaseline is 0.89: close, but still not as good as the SVD model.
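The "Estimating biases using als..." lines in the log refer to the baseline estimates that KNNBaseline builds on. Below is a rough sketch of how such baselines could be computed with alternating least squares. The regularization values and epoch count are what I believe to be the library defaults, so treat them as assumptions; `als_baselines` and the toy data are hypothetical.

```python
from collections import defaultdict


def als_baselines(ratings, reg_u=15.0, reg_i=10.0, n_epochs=10):
    """Estimate global mean, user biases, and item biases by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}

    for _ in range(n_epochs):
        # Fix user biases, solve for item biases
        for i, user_ratings in by_item.items():
            dev = sum(r - mu - bu[u] for u, r in user_ratings)
            bi[i] = dev / (reg_i + len(user_ratings))
        # Fix item biases, solve for user biases
        for u, item_ratings in by_user.items():
            dev = sum(r - mu - bi[i] for i, r in item_ratings)
            bu[u] = dev / (reg_u + len(item_ratings))
    return mu, bu, bi


def baseline_estimate(mu, bu, bi, u, i):
    """Baseline rating b_ui = mu + b_u + b_i; unknown ids contribute 0."""
    return mu + bu.get(u, 0.0) + bi.get(i, 0.0)


# Toy (user, item, rating) triples, invented for this example
toy_ratings = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4),
               ("b", "z", 2), ("c", "y", 1), ("c", "z", 2)]

mu, bu, bi = als_baselines(toy_ratings)
print(baseline_estimate(mu, bu, bi, "a", "x"))
print(baseline_estimate(mu, bu, bi, "nobody", "nothing"))  # global mean
```

The neighborhood model then only has to explain how a rating deviates from this baseline, which may be why KNNBaseline scores better than KNNBasic here.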

## Making predictions 

After trying multiple models, we choose SVD with `n_factors`=20 and `reg_all`=0.02 as our final model. It is time to make some predictions.

In [14]:
svd = SVD(n_factors=20, reg_all=0.02)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x23c79ca88b0>

In [15]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=3.1289122929962794, details={'was_impossible': False})
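A single prediction like the one above is the building block for recommendations. As a sketch, the helper below ranks the items a user has not interacted with by predicted rating. `top_n_for_user` is a hypothetical function, and the stand-in predictor and its scores are invented so the example runs without the fitted model.

```python
def top_n_for_user(predict, user_id, candidate_items, seen_items, n=5):
    """Return the n unseen items with the highest predicted rating.

    `predict` is any callable (user_id, item_id) -> estimated rating.
    """
    unseen = [item for item in candidate_items if item not in seen_items]
    scored = [(item, predict(user_id, item)) for item in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Stand-in predictor with made-up scores, so the sketch is self-contained
fake_scores = {(2, 1): 3.9, (2, 3): 4.5, (2, 4): 3.1, (2, 5): 2.0}


def fake_predict(user_id, item_id):
    return fake_scores[(user_id, item_id)]


recommendations = top_n_for_user(fake_predict, user_id=2,
                                 candidate_items=[1, 3, 4, 5],
                                 seen_items={4}, n=2)
print(recommendations)  # [(3, 4.5), (1, 3.9)]
```

With the fitted model, `predict` could be something like `lambda uid, iid: svd.predict(uid, iid).est`, with the candidate items taken from `df['book_id'].unique()`.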