# Cross Validation: Finding our Best Hyperparameters for SVD

We'll use our best performing model to date, `SVD`, and aim to further improve upon the default hyperparameters included in this model.

In [1]:
# standard imports
import pandas as pd
import numpy as np

# import needed surprise libraries
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import SVD

# retrieve pickle file (the context manager closes the file handle for us)
import pickle
with open("df.pkl", "rb") as f:
    df = pickle.load(f)

In [2]:
df.head()

|   | movie_id | rating_val | user_id |
|---|---|---|---|
| 0 | happiest-season | 8 | deathproof |
| 1 | happiest-season | 7 | davidehrlich |
| 2 | happiest-season | 4 | ingridgoeswest |
| 3 | happiest-season | 7 | silentdawn |
| 4 | happiest-season | 2 | colonelmortimer |


In [3]:
# optionally take a random subset of our dataframe (about 8M rows) for gridsearch
#df_sub = df.copy()
#df_sub = df.sample(5000000)

Note that the code above lets us take a subset of the data for our grid search, which is often helpful with larger datasets. With our dataset, however, the subsets produced a noticeably higher RMSE, so I decided to run several grid searches across the full set within the time I had for this project.
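As a minimal sketch of the subsampling idea, the snippet below draws a reproducible random sample with pandas. The toy dataframe, the 10% fraction, and the `random_state` value are all assumptions made for illustration; they are not the settings used in this project.

```python
import pandas as pd

# Hypothetical stand-in for the ratings dataframe (same column names as above).
toy_df = pd.DataFrame({
    "user_id": [f"user{i % 50}" for i in range(1000)],
    "movie_id": [f"film{i % 20}" for i in range(1000)],
    "rating_val": [(i % 10) + 1 for i in range(1000)],
})

# Draw a reproducible random sample; frac and random_state are assumed values.
toy_sub = toy_df.sample(frac=0.1, random_state=42)

print(len(toy_df), len(toy_sub))
```

Fixing `random_state` keeps the subset identical between runs, which makes RMSE comparisons across grid searches easier to interpret.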

In [4]:
# read in values as surprise dataset
# (load_from_df expects columns ordered user, item, rating; line_format only applies when reading files)
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating_val']], reader)

## Gridsearch

![The Weekend Search](../images/theweekend_superbowl.gif)

As noted in the Surprise documentation, the `GridSearchCV` class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This proves useful for finding the best set of parameters for a prediction algorithm.
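To make the cost of an exhaustive search concrete, here is a small sketch that counts how many model fits a grid implies. It uses only the standard library; `example_grid` mirrors the first grid searched later in this notebook, and the variable names are illustrative rather than part of Surprise.

```python
from itertools import product

# Example grid: three hyperparameters with two candidate values each.
example_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
}
cv = 3  # number of cross-validation folds

# Every combination of values, one dict per candidate model.
combos = [dict(zip(example_grid, values)) for values in product(*example_grid.values())]

# Each candidate is trained and evaluated once per fold.
n_fits = len(combos) * cv
print(f"{len(combos)} combinations x {cv} folds = {n_fits} model fits")
```

Because the count grows multiplicatively with every added value, keeping each grid to two or three values per parameter is one way to keep runs manageable on a dataset this size.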

This process can be especially time-consuming when dealing with over 8M rows of film ratings data. Through trial and error, I developed a process that can help others bring structure to their search and perhaps gain new insight into the data they're working with.

Let's take a look at some of the key default parameters of an SVD model.

![Default SVD](../images/default-svd-model-hyperparameters.png)
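For orientation, here is how these hyperparameters enter the model. This summarizes the formulation given in the Surprise documentation for `SVD`; the symbols $\gamma$ (for `lr_all`) and $\lambda$ (for `reg_all`) are shorthand chosen for this note.

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

$$
\min \sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are latent vectors of length `n_factors`. Training uses stochastic gradient descent with step size $\gamma$ for `n_epochs` passes over the ratings.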

With the default SVD model in our `modeling` notebook, we achieved an RMSE of 1.4183 on our elongated ten-point scale, so on a five-star scale our predictions are typically off by less than one star.
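The conversion behind that statement is simple arithmetic, sketched below. It assumes the ten-point scale is just the five-star scale doubled (half-star steps), so errors scale by the same factor of 2.

```python
rmse_ten_point = 1.4183  # RMSE reported for the default SVD model

# Assumption: ten-point ratings are five-star ratings multiplied by 2,
# so an error on the ten-point scale is halved on the five-star scale.
rmse_five_star = rmse_ten_point / 2

print(f"{rmse_five_star:.2f} stars")
```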

With film recommendation systems, it's still especially helpful to understand your data at a fine granularity in order to better gauge what kinds of films will be served to the end user. It helps to note factors like behavioral psychology (e.g. recency bias) and the qualities of the films included in your database. Often, a service like Netflix serves recommendations with associated filters that provide helpful clusters rather than a smorgasbord of films or simply the most popular ones (those with the most associated ratings). The __[cold start problem](https://www.kdnuggets.com/2019/01/data-scientist-dilemma-cold-start-machine-learning.html)__ is also worth noting: we're still learning how many ratings we might need up front to get an accurate enough picture of a user to provide increasingly helpful recommendations.

Now, let's see what our quantitative measures tell us below.

Because this method is exhaustive ⏰ and can take hours to run, I've commented out the code below and listed the accompanying results after each cell, both to showcase development and to aid anyone who'd like to follow suit.

__[Here](https://surprise.readthedocs.io/en/stable/getting_started.html#tune-algorithm-parameters-with-gridsearchcv)__ is the accompanying Surprise documentation, which shows an example of this process.

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# print our best RMSE score
print(gs.best_score['rmse'])

# print the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

'''

**RESULTS**
* Best RMSE: 1.5696
* Best Parameters: 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4

So far, each winning value is the candidate closest to the SVD defaults shown above (`n_epochs` = 20, `lr_all` = 0.005, `reg_all` = 0.02), so the search is trending toward them.

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'n_epochs': [25, 27], 'lr_all': [0.0093, 0.0095],
              'reg_all': [0.01, 0.02]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

'''

**RESULTS**
* Best RMSE: 1.4772
* Best Parameters: 'n_epochs': 25, 'lr_all': 0.0093, 'reg_all': 0.02

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'n_epochs': [25, 26], 'lr_all': [0.0093, 0.009],
              'reg_all': [0.015, 0.02]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse']) 

'''

**RESULTS**
* Best RMSE: 1.4760
* Best Parameters: 'n_epochs': 25, 'lr_all': 0.009, 'reg_all': 0.02

The grid searches above show that our default parameters are largely preferred: `reg_all` lands on its default of 0.02, and `n_epochs` favors the candidate closest to the default of 20. The exception is the learning rate, where a slight decrease from 0.0093 to 0.009 improved the RMSE.

Because this is still not an improvement on our initial model (RMSE 1.4760 vs. 1.4183), let's experiment with a parameter we haven't tuned yet: `n_factors`.

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'lr_all': [0.009], 'n_factors':[50,100,150]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse']) 

'''

**RESULTS**
* Best RMSE: 1.4414
* Best Parameters: 'lr_all': 0.009, 'n_factors': 50
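To keep track of where things stand, the sketch below collects the best RMSE reported by each search in this notebook and compares the lowest one with the default-model baseline. The labels and the `search_results` structure are illustrative only. The comparison also assumes the baseline RMSE and the cross-validated scores are directly comparable, which may not hold if they were evaluated on different splits.

```python
baseline_rmse = 1.4183  # default SVD from the modeling notebook

# Best RMSE and parameters reported by each grid search above, in order.
search_results = [
    ("search 1", 1.5696, {"n_epochs": 10, "lr_all": 0.005, "reg_all": 0.4}),
    ("search 2", 1.4772, {"n_epochs": 25, "lr_all": 0.0093, "reg_all": 0.02}),
    ("search 3", 1.4760, {"n_epochs": 25, "lr_all": 0.009, "reg_all": 0.02}),
    ("search 4", 1.4414, {"lr_all": 0.009, "n_factors": 50}),
]

# Lower RMSE is better, so take the minimum by score.
best_label, best_rmse, best_params = min(search_results, key=lambda r: r[1])
gap = best_rmse - baseline_rmse

print(f"best tuned run: {best_label} (RMSE {best_rmse:.4f}) with {best_params}")
print(f"gap to default baseline: {gap:+.4f}")
```

A positive gap means the tuned model still trails the default one, which is the case for every search recorded here.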