## Collaborative Filtering
#### Model-Based Approach

In [2]:
import pandas as pd

# Surprise: SVD algorithm, dataset loading, and accuracy metrics
from surprise import SVD, Dataset, Reader, accuracy

# Surprise: data splitting, hyperparameter search, and cross-validation
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

We will be working with the [same data](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing) we used in the previous exercise.

In [3]:
book_ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding='latin-1')

In [5]:
book_ratings.head()

|   | User-ID | ISBN | Book-Rating |
|---|---------|------|-------------|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


* Create a Surprise dataset from `book_ratings`

In [6]:
reader = Reader(rating_scale=(0, 10))

# Load the pandas DataFrame; columns must be in the order user, item, rating
data = Dataset.load_from_df(book_ratings, reader)

In [8]:
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x7fbebc4bc190>


* Split the data into training and test sets, using a test size of 15%

In [9]:
train_set, test_set = train_test_split(data, test_size=0.15)

In [10]:
train_set

<surprise.trainset.Trainset at 0x7fbef35486a0>
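
To make the split concrete, here is a minimal pure-Python sketch of a shuffled 15% hold-out. It is an illustration only, not Surprise's implementation: `holdout_split` and the toy ratings are hypothetical. Surprise's `train_test_split` also accepts a `random_state` argument when a reproducible split is needed.

```python
import random

def holdout_split(ratings, test_size=0.15, seed=0):
    """Shuffle the ratings and hold out a fraction of them for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

# Toy (user, item, rating) triples; the values are made up for illustration.
toy_ratings = [(u, i, (u * i) % 11) for u in range(1, 11) for i in range(1, 11)]
train, test = holdout_split(toy_ratings)
print(len(train), len(test))
```

Every rating ends up in exactly one of the two sets, so the test set can serve as unseen data when evaluating the model.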

* Use SVD (with default settings) to create recommendations for each user
    - Print the default model's RMSE on the test set (using the `accuracy` module imported at the beginning)

In [11]:
algo = SVD()

In [12]:
algo.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fbebc4bb550>
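
For intuition about what the fitted model computes, the sketch below shows the biased matrix-factorization prediction rule that SVD-style recommenders use: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors. The function `predict_rating` and all numbers are hypothetical; a real model learns these values during `fit`.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Estimate a rating as global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

# Made-up values for one user/item pair with two latent factors.
estimate = predict_rating(mu=3.0, b_u=0.5, b_i=-0.25, p_u=[1.0, 2.0], q_i=[0.5, 0.25])
print(estimate)
```

The `n_factors` parameter tuned later in this notebook controls the length of the vectors `p_u` and `q_i`.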

In [13]:
predictions = algo.test(test_set)

In [16]:
# Show only the first few predictions; printing the full list exceeds
# the notebook's IOPub data rate limit.
print(predictions[:5])



In [15]:
accuracy.rmse(predictions)

RMSE: 3.4983


3.4983233507908267
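
As a reminder of what this number means, here is a minimal sketch of the RMSE calculation over pairs of true and estimated ratings (in Surprise, each prediction carries these as `r_ui` and `est`). The helper `rmse` and the toy pairs are hypothetical and only illustrate the formula.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up (true, estimated) ratings for illustration.
toy_pairs = [(5, 3), (0, 2), (8, 8), (3, 5)]
toy_rmse = rmse(toy_pairs)
print(toy_rmse)
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones; on a 0 to 10 scale, an RMSE near 3.5 indicates fairly large typical errors.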

* Create a parameter grid with these parameters:
    - 'n_factors': [110, 120, 140, 160]
    - 'reg_all': [0.08, 0.1, 0.15]

In [18]:
param_grid = {'n_factors': [110, 120, 140, 160], 'reg_all': [0.08, 0.1, 0.15]}
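
A grid search tries every combination of the listed values, so this grid defines 4 × 3 = 12 candidate models, each evaluated once per cross-validation fold. The sketch below enumerates the combinations with `itertools.product`; it illustrates the idea and is not a description of how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {'n_factors': [110, 120, 140, 160], 'reg_all': [0.08, 0.1, 0.15]}

# Build one dict per combination of parameter values.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # number of candidate models
print(combinations[0], combinations[-1])
```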

* Instantiate `GridSearchCV` with SVD as the model, the parameter grid defined above, and RMSE and MAE as evaluation metrics

In [20]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'])

In [25]:
gs

<surprise.model_selection.search.GridSearchCV at 0x7fbebc5246d0>

* Fit the grid search on the full dataset

In [21]:
gs.fit(data)

* Print the best RMSE (the mean over the cross-validation folds) and the parameters that achieved it

In [22]:
print(gs.best_score['rmse'])

3.4326043828848656


In [23]:
print(gs.best_params['rmse'])

{'n_factors': 160, 'reg_all': 0.15}


In [24]:
print(gs.cv_results)

{'split0_test_rmse': array([3.45811105, 3.45252714, 3.44208241, 3.45634671, 3.45110468,
       3.43777155, 3.45176574, 3.44597729, 3.43670559, 3.45205878,
       3.44482007, 3.4335811 ]), 'split1_test_rmse': array([3.45761927, 3.44758986, 3.4402825 , 3.45304825, 3.44764012,
       3.43788563, 3.44902994, 3.44460922, 3.43579749, 3.44328379,
       3.44351064, 3.43289785]), 'split2_test_rmse': array([3.4552969 , 3.44902379, 3.43653714, 3.45493797, 3.44834668,
       3.43573586, 3.44927752, 3.44202748, 3.43303298, 3.44363833,
       3.43934838, 3.43064836]), 'split3_test_rmse': array([3.45997294, 3.45147938, 3.44131071, 3.45384359, 3.4485971 ,
       3.43849842, 3.45229593, 3.4425595 , 3.4367084 , 3.44584358,
       3.43937133, 3.43545157]), 'split4_test_rmse': array([3.45534543, 3.44668164, 3.43606881, 3.45122297, 3.44677084,
       3.43394687, 3.44631909, 3.4417492 , 3.42956474, 3.44390176,
       3.44078055, 3.43044303]), 'mean_test_rmse': array([3.45726912, 3.44946036, 3.43925631, 3.4
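
Each `splitK_test_rmse` array above holds one value per parameter combination (12 here), and `mean_test_rmse` averages them across the five folds. As a sanity check, the sketch below averages the last entry of each split array, which reproduces the best score printed earlier (about 3.4326). That this entry belongs to `{'n_factors': 160, 'reg_all': 0.15}` is an assumption that combinations are listed in grid order; the name `fold_rmse_last` is hypothetical.

```python
# Per-fold RMSE of the last parameter combination, copied from the output above.
fold_rmse_last = [3.4335811, 3.43289785, 3.43064836, 3.43545157, 3.43044303]

# Average across the five cross-validation folds.
mean_rmse_last = sum(fold_rmse_last) / len(fold_rmse_last)
print(round(mean_rmse_last, 4))
```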

* Predict the test set with the best model according to `RMSE`

In [28]:
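# best_estimator['rmse'] is an unfitted SVD configured with the best parameters
algo_best = gs.best_estimator['rmse']
# Train it on the training set, then predict the held-out test set
algo_best.fit(train_set)
predictions_test = algo_best.test(test_set)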

* Print the optimal model's RMSE on the test set
    - Is it better than the RMSE obtained with the default parameters?

In [29]:
accuracy.rmse(predictions_test)

RMSE: 3.4303


3.4302963983784998
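
To answer the question, the two test-set RMSE values reported in this notebook can be compared directly. The sketch below uses the rounded values from the outputs above; the variable names are hypothetical.

```python
# RMSE values reported on the test set earlier in this notebook.
rmse_default = 3.4983
rmse_tuned = 3.4303

# Lower RMSE is better, so a positive difference means the tuned model improved.
improvement = rmse_default - rmse_tuned
relative_improvement = improvement / rmse_default
print(f"absolute: {improvement:.4f}, relative: {relative_improvement:.2%}")
```

The tuned model is better, but only by roughly 2%. Note also that the grid search was fit on the full dataset, which includes the test ratings, so this comparison is likely somewhat optimistic for the tuned model.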