## Using GridSearchCV in Surprise

Let’s get started using the GridSearchCV class from Surprise so we can find the best parameters for our SVD recommendation engine.

First, let’s import the necessary packages.

In [1]:
# standard imports
import pandas as pd
import numpy as np

# import needed surprise libraries
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import SVD

Let’s bring in our dataset. For demonstration purposes, this time we'll use the built-in MovieLens data.

In [3]:
# Use movielens-100K
data = Dataset.load_builtin('ml-100k')

If you're using your own dataset, uncomment the code below, set your rating scale in the Reader, and swap in your own column names in the load_from_df call. Surprise expects exactly three columns, in this order: (1) the users, (2) the items, (3) the ratings.

In [None]:
'''
# read in values as a Surprise dataset
# rating_scale is the (min, max) of your ratings
reader = Reader(rating_scale=(1, 5))
# columns must be ordered user, item, rating
data = Dataset.load_from_df(df[['user_id', 'item_id', 'rating_id']], reader)
'''

⏰ NOTE: This process can be especially time-consuming depending on how much ratings data you read in. With the MovieLens data, the grid search below took twenty minutes on my machine.
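To get a feel for why the search is slow, here is a rough count of how many models get trained for the grid used in the search below. This is only a back-of-the-envelope sketch using the standard library; it does not call Surprise at all.

```python
from itertools import product

# Same grid as the search in this walkthrough.
param_grid = {
    'lr_all': [0.0025, 0.005, 0.0075],
    'n_factors': [75, 100, 150],
    'reg_all': [0.015, 0.02, 0.025],
    'biased': [True, False],
}
cv = 3  # number of cross-validation folds

# Every combination of values is one candidate model.
combinations = list(product(*param_grid.values()))

# Each candidate is trained once per fold.
n_fits = len(combinations) * cv

print(len(combinations), "combinations x", cv, "folds =", n_fits, "fits")
```

Adding one more value to any list multiplies the total, so it pays to keep each list short. If your machine has spare cores, GridSearchCV also accepts an n_jobs argument to run fits in parallel.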

Through trial and error and feedback from my instructor at Flatiron, I developed a process that can help others bring immediate structure to their search and perhaps gain new insight into their data.

In [5]:
# tune hyperparameters using GridSearch to get improved model

param_grid = {'lr_all': [0.0025,0.005,0.0075], 'n_factors': [75, 100, 150],
             'reg_all': [0.015,0.02,0.025], 'biased': [True, False]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# SVD instance configured with the best parameters (not fitted unless refit is set)
print(gs.best_estimator['rmse'])

0.9434885112563943
{'lr_all': 0.0075, 'n_factors': 75, 'reg_all': 0.025, 'biased': True}
<surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fd549a36f28>


### Results
RMSE: 0.9434885112563943

Winners: 'lr_all': 0.0075, 'n_factors': 75, 'reg_all': 0.025, 'biased': True
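For reference, RMSE is the square root of the average squared difference between predicted and actual ratings, so a score of about 0.94 means predictions are off by roughly a star or a little less on the 1 to 5 scale. Here is a minimal sketch of the calculation; the ratings are made up for illustration and are not taken from MovieLens.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, invented for this example.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
```

Lower is better, and because errors are squared before averaging, a few big misses hurt the score more than many small ones.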

**Dissecting the Parameter Grid**
Above, we took the default values of four significant SVD parameters and built a small range of candidates around each one. Once we identify a winner, we can tighten the range around that winning value and search again to keep increasing accuracy.

![Default SVD](../images/default-svd-hyperparameters.png)

**Understanding our Results**
After running our model, we saw our winning parameters. Here’s what this tells us and what our next steps could be:

* ‘lr_all’: 0.0075 was the winner and shows an increase from the default (0.005)
* ‘n_factors’: 75 was the winner and shows a decrease from the default (100)
* ‘reg_all’: 0.025 was the winner and shows an increase from the default (0.02)
* ‘biased’: True was the winner; this is the only default value that won in our param_grid
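If you rerun the search often, the comparison in the list above can be done in a few lines instead of by eye. This is just a sketch: describe_change is a helper name I made up for illustration, and the two dictionaries are typed in by hand from the SVD defaults and the winners printed earlier.

```python
# Default values of Surprise's SVD for the four tuned parameters.
svd_defaults = {'lr_all': 0.005, 'n_factors': 100, 'reg_all': 0.02, 'biased': True}

# Winning combination printed by the grid search.
best_params = {'lr_all': 0.0075, 'n_factors': 75, 'reg_all': 0.025, 'biased': True}

def describe_change(default, best):
    """Label how a winning value relates to its default (hypothetical helper)."""
    if best == default:
        return 'same as default'
    return 'increase' if best > default else 'decrease'

comparison = {name: describe_change(svd_defaults[name], value)
              for name, value in best_params.items()}

for name, change in comparison.items():
    print(f"{name}: {best_params[name]} ({change})")
```

In your own notebook, you could replace the hand-typed best_params with gs.best_params['rmse'].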

Because the default values did not win across the board, the tuned model is likely an improvement over the default one!

Keep in mind that this score came from cross-validation (remember our cv), so each model was trained on only part of the data and not the full set. Even if the score didn’t quite measure up to your default model's accuracy, the winning parameters could still bring an increase in accuracy once the model is trained on all of the data. Now, with this dataset, you can keep tuning by running another grid search over ranges centered on those winning parameters. Or, if you’re strapped for time, you can move forward and specify these parameters when running your SVD model to compare accuracies!
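One thing worth noticing is that all three numeric winners sat on the edge of our first grid, so centering the next ranges on them lets the search look past that edge. Below is a minimal sketch of building that second grid. The centered_range helper and the step sizes are my own assumptions, not anything from Surprise; pick steps that make sense for your data.

```python
def centered_range(winner, step):
    """Three candidate values centered on the previous winner (hypothetical helper)."""
    values = [winner - step, winner, winner + step]
    # Round floats to avoid artifacts like 0.008749999; keep ints as ints.
    return [v if isinstance(v, int) else round(v, 6) for v in values]

# Step sizes are assumptions chosen for illustration.
next_param_grid = {
    'lr_all': centered_range(0.0075, 0.00125),
    'n_factors': centered_range(75, 15),
    'reg_all': centered_range(0.025, 0.0025),
    'biased': [True],  # True won, so there is no need to search it again
}

print(next_param_grid)
```

This dictionary can be passed to GridSearchCV in place of the original param_grid. Fixing biased to a single value also halves the number of fits.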

If any questions arise during this process, please feel free to reach out via stanton.sierraerin@gmail.com