# Cross Validation via Gridsearch: Finding our Best Parameters for SVD

We'll use our best performing model to date, `SVD`, and aim to further improve upon the default hyperparameters included in this model.

In [3]:
# standard imports
import pandas as pd
import numpy as np

# import needed surprise libraries
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import SVD

# retrieve pickle file
import pickle
with open("df.pkl", "rb") as f:
    df = pickle.load(f)

In [4]:
df.head()

Unnamed: 0,movie_id,rating_val,user_id
0,happiest-season,8,deathproof
1,happiest-season,7,davidehrlich
2,happiest-season,4,ingridgoeswest
3,happiest-season,7,silentdawn
4,happiest-season,2,colonelmortimer


In [5]:
# optionally take a random subset of our ~8M-row dataframe for gridsearch (adjust the sample size as needed)
#df_sub = df.copy()
#df_sub = df.sample(5000000)

Note that the code above lets us take a subset of the data for our grid search, which is often helpful with larger datasets. This is separate from the number of folds you can declare with `cv` below. In my case, as often occurs, RMSE was noticeably higher when I used a subset of the data, so I decided to use `cv=3` across our full dataframe instead. The run time for these cells is lengthy (none are under an hour), so if you'd like to run them, just uncomment.

In [6]:
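# read in values as a surprise dataset; load_from_df expects the columns in the order user, item, rating
# (line_format only applies when loading from a file, so it is not needed here)
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating_val']], reader)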

## Gridsearch

![The Weekend Search](../images/theweekend_superbowl.gif)

As noted in the Surprise documentation, the `GridSearchCV` class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm.

This process can be especially time-consuming when dealing with over 8M rows of film ratings data. By the end of this notebook, through trial and error, I developed a process that can help others bring structure to their search and perhaps gain newfound insight into the data they're working with.

**Let's take a look at some of the key default parameters of an SVD model that we'll be altering in this notebook:**

![Default SVD](../images/default-svd-hyperparameters.png)
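To make these parameters a little more concrete, here is a minimal pure-Python sketch of a single stochastic gradient descent update for a biased SVD model, based on my reading of the Surprise matrix factorization documentation. The function names (`predict`, `sgd_step`) and the toy values are my own assumptions for illustration, not Surprise's API: `lr_all` is the step size, `reg_all` is the regularization strength, `n_factors` is the length of the user and item factor vectors, and `n_epochs` is the number of passes over the ratings.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased SVD prediction: global mean + user bias + item bias + dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr_all=0.005, reg_all=0.02):
    """Apply one SGD update for a single observed rating and return the new parameters."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr_all * (err - reg_all * b_u)
    new_b_i = b_i + lr_all * (err - reg_all * b_i)
    # Both factor updates use the old values of p_u and q_i
    new_p_u = [p + lr_all * (err * q - reg_all * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr_all * (err * p - reg_all * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Hypothetical toy values: n_factors = 2, one observed rating of 7 on the ten-point scale
mu, b_u, b_i = 5.0, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]
rating = 7.0

error_before = rating - predict(mu, b_u, b_i, p_u, q_i)
for _ in range(20):  # plays the role of n_epochs, here over a single rating
    b_u, b_i, p_u, q_i = sgd_step(rating, mu, b_u, b_i, p_u, q_i)
error_after = rating - predict(mu, b_u, b_i, p_u, q_i)

print(f"error before: {error_before:.3f}, after 20 updates: {error_after:.3f}")
```

A larger `lr_all` shrinks the error faster per update but can overshoot, while a larger `reg_all` pulls the biases and factors back toward zero, which is the trade-off the grids below explore.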

Recall that the ratings data we imported is on a ten-point scale (vs. the five stars in Letterboxd) in order to account for the half-star increments that Letterboxd allows.

With the default SVD model in our `modeling` notebook, we achieved an RMSE of 1.4183 on the ten-point scale. Halving that gives roughly 0.71 on the five-star scale, so our typical prediction error is less than one star.
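The conversion between the two scales is simple arithmetic, since dividing every rating by two also divides the RMSE by two. Here is a small sketch of that check; the `rmse` helper and the predicted values are hypothetical and only there to illustrate the scaling.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Ratings from df.head() on the ten-point scale, with made-up predictions
actual_10 = [8, 7, 4, 7, 2]
predicted_10 = [6.5, 7.5, 5.0, 6.0, 4.0]

rmse_10 = rmse(actual_10, predicted_10)
rmse_5 = rmse([a / 2 for a in actual_10], [p / 2 for p in predicted_10])

# Halving the scale halves the RMSE
print(math.isclose(rmse_5, rmse_10 / 2))

# Applying the same conversion to the RMSE reported above
reported_rmse_10 = 1.4183
print(round(reported_rmse_10 / 2, 2))  # 0.71 stars
```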

With film recommendation systems, it's still especially helpful to understand your data at a granular level in order to better gauge what kind of films will be served to the end-user. Noting factors like behavioral psychology (e.g. recency bias) and attributed film qualities is helpful. Often, a service like Netflix attempts to serve recommendations with associated filters that can provide helpful clusters rather than a smorgasbord of films or solely the most popular ones (where we often have the most associated ratings). The __[cold start problem](https://www.kdnuggets.com/2019/01/data-scientist-dilemma-cold-start-machine-learning.html)__ is worth noting, and we're still learning how many ratings we might need from a user up front to get an accurate enough picture to provide increasingly helpful recommendations. There's a bit of a balancing act here.

Now, let's see what our quantitative measures tell us below.

⏰ NOTE: Because grid search is exhaustive and can take hours to run, I've commented out the code below and included the accompanying results in order to showcase development and aid anyone who'd like to follow suit. ⏰

__[This accompanying documentation](https://surprise.readthedocs.io/en/stable/getting_started.html#tune-algorithm-parameters-with-gridsearchcv)__ from Surprise sheds light on the GridSearch process with an example.
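To see why the search is exhaustive, it helps to count what it has to fit: every combination of values in the grid is trained and evaluated once per fold. The sketch below expands the first grid used in this notebook with plain `itertools`; the `expand_grid` helper is a hypothetical name of mine, not part of Surprise.

```python
from itertools import product

# The first grid tried in this notebook, with cv=3
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
}
cv = 3


def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combos = expand_grid(param_grid)
n_fits = len(combos) * cv

print(f"{len(combos)} combinations, {n_fits} model fits")  # 8 combinations, 24 model fits
print(combos[0])
```

Each extra value added to any parameter multiplies the number of fits, which is why I kept the grids in this notebook to two or three values per parameter.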

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# print our best RMSE score
print(gs.best_score['rmse'])

# print the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

'''

**RESULTS**
* Best RMSE: 1.5696
* Best Parameters: 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4

In our first grid search, we tested a handful of values, including some of the default SVD parameters. So far, for every parameter the winning value is the one closer to (or equal to) the default SVD parameter in our graphic above.

Let's try some more comparisons below to see which parameters most improve accuracy.

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'n_epochs': [25, 27], 'lr_all': [0.0093, 0.0095],
              'reg_all': [0.01, 0.02]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])                                

'''

**RESULTS**
* Best RMSE: 1.4772
* Best Parameters: 'n_epochs': 25, 'lr_all': 0.0093, 'reg_all': 0.02

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'n_epochs': [25, 26], 'lr_all': [0.0093, 0.009],
              'reg_all': [0.015, 0.02]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse']) 

'''

**RESULTS**
* Best RMSE: 1.4760
* Best Parameters: 'n_epochs': 25, 'lr_all': 0.009, 'reg_all': 0.02

The grid searches above show that the winning values mostly sit at or near the SVD defaults (`reg_all` landed on the default 0.02), with the exception of our learning rate: within the range tested, a slight decrease from 0.0093 to 0.009 improved RMSE a little further.

Because this RMSE (1.4760) is still higher than the 1.4183 from our initial SVD model, let's experiment with a parameter we haven't tuned yet: `n_factors`.

In [None]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'lr_all': [0.009], 'n_factors':[50,100,150]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse']) 

'''

**RESULTS**
* Best RMSE: 1.4414
* Best Parameters: 'lr_all': 0.009, 'n_factors': 50

We can see that, of the three values tested, the lowest `n_factors` (50) gives us the best RMSE.

### Let's Further Improve our Approach
At this point I revisited the available SVD parameters in Surprise, identified a wider set of parameters worth tuning, and chose 2-3 values for each to test below. The cell below took over 30 hours to run, but it produced our best grid search RMSE so far, and its parameters are now worth running across our full training set in our [modeling notebook](/notebooks/modeling.ipynb).

You can see that our learnings above were brought in and tested with a range of values surrounding those winning parameters.
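One way to bring some structure to this refinement step is to generate a small neighborhood of values around each previous winner. The sketch below is only an illustration: the `around` helper and its relative step sizes are my own assumptions, and the values in the actual grid below were picked by hand rather than generated this way.

```python
from math import prod


def around(center, rel_steps=(0.8, 1.0, 1.25), as_int=False):
    """Return sorted candidate values at the given relative steps around a center."""
    values = [center * step for step in rel_steps]
    if as_int:
        return sorted({max(1, round(v)) for v in values})
    return sorted({round(v, 6) for v in values})


# Winning values from the earlier grid searches in this notebook
winners = {"lr_all": 0.009, "n_factors": 50, "reg_all": 0.02}

refined_grid = {
    "lr_all": around(winners["lr_all"]),
    "n_factors": around(winners["n_factors"], as_int=True),
    "reg_all": around(winners["reg_all"]),
    "biased": [True, False],
}

n_combos = prod(len(values) for values in refined_grid.values())
print(refined_grid)
print(f"{n_combos} combinations, {n_combos * 3} model fits with cv=3")
```

Keeping each parameter to three candidates keeps the grid at 54 combinations, the same size as the hand-picked grid below.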

In [8]:
'''

# tune hyperparameters using GridSearch to get improved model

param_grid = {'lr_all': [0.008,0.009,0.0092], 'n_factors': [40,50,75],
             'reg_all': [0.0175,0.02,0.025], 'biased': [True, False]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# algorithm instance (SVD configured with the best parameters) that gave the best RMSE, averaged over all splits
print(gs.best_estimator['rmse'])

'''

1.4214825737387378
{'lr_all': 0.008, 'n_factors': 40, 'reg_all': 0.025, 'biased': True}
<surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fe59ac31438>


**RESULTS**
* Best RMSE: 1.4215
* Best Parameters: 'lr_all': 0.008, 'n_factors': 40, 'reg_all': 0.025, 'biased': True

**Our best RMSE from Gridsearch to date!**
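For a quick side-by-side view, the sketch below lines up the best RMSE from each grid search in this notebook against the 1.4183 from the default SVD model. The scores are copied from the results above; the labels are just my own shorthand. Keep in mind the default model's score came from the modeling notebook and may not use the same splits, so treat the differences as indicative only.

```python
baseline_rmse = 1.4183  # default SVD model, from the modeling notebook

# Best RMSE reported by each grid search above, in order
grid_runs = [
    ("n_epochs / lr_all / reg_all (coarse)", 1.5696),
    ("n_epochs / lr_all / reg_all (refined)", 1.4772),
    ("n_epochs / lr_all / reg_all (refined again)", 1.4760),
    ("lr_all / n_factors", 1.4414),
    ("lr_all / n_factors / reg_all / biased", 1.4215),
]

for name, score in grid_runs:
    print(f"{name}: {score:.4f} ({score - baseline_rmse:+.4f} vs. default)")

best_name, best_score = min(grid_runs, key=lambda run: run[1])
print(f"best grid search run: {best_name}")
```

Each round narrowed the gap, and the final grid lands within a few thousandths of the default model's score.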

### What does this tell us?
From the parameter grids we experimented with above, we determined that our best parameters are the following:

##### 'lr_all': 0.008, 'n_factors': 40, 'reg_all': 0.025, 'biased': True

It's important to note that with `cv=3`, each candidate model was trained on two-thirds of the data and evaluated on the remaining third, so we'll want to bring these parameters across our full train set in our modeling notebook to test how RMSE might change when fed a larger set of data.
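As a rough illustration of what `cv=3` does to the data, here is a minimal pure-Python sketch of a three-fold split. The `kfold_indices` helper is hypothetical and uses contiguous, unshuffled folds for readability; as far as I know, Surprise's own splitter shuffles the ratings by default.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


n_ratings = 9  # toy example; the real dataframe has over 8M rows
splits = list(kfold_indices(n_ratings, 3))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train on {len(train_idx)} ratings, test on {len(test_idx)}")
```

Every rating is used for testing exactly once, and the reported RMSE is the average over the three folds.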

With more time, we could go beyond these parameters into more granular hyperparameter tuning and see how else we can improve our model's accuracy. See the __[full list of parameters](https://surprise.readthedocs.io/en/stable/matrix_factorization.html)__ we can potentially tune.