# Letterboxd Analysis Project: Modeling

**Author:** Sierra Stanton
***

![Theater Scene](../images/zach-galifianakis-math.gif)

In this notebook, we start with a simple recommendation model and iteratively build on it to improve film recommendations for Letterboxd users.

#### Virtual Environment

To ensure you have the packages required to run the code in this notebook, an `environment.yaml` file is provided for your convenience.

In [None]:
# standard imports
import pandas as pd
import numpy as np

# import needed surprise libraries
from surprise import Reader, Dataset, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline

# load the dataframe pickled by the exploratory notebook
import pickle
with open("df.pkl", "rb") as f:  # context manager closes the file handle for us
    df = pickle.load(f)

In [None]:
# ensure exploratory notebook has brought in our resulting dataset for modeling

df.head()

We've verified that the data we feed into our reader contains the three columns Surprise requires: `user`, `item`, and `rating`.

## Load in our dataset

For our modeling, we use the __[Surprise library](https://surprise.readthedocs.io/en/stable/index.html)__, a Python scikit for recommender systems. The library contains built-in algorithms and cross-validation methods we can use to build an increasingly proficient model.


In [3]:
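# read in values as surprise dataset
# load_from_df expects the columns in user, item, rating order;
# line_format only applies when loading from a file, so it is omitted here
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating_val']], reader)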

In [4]:
# train test split

train, test = train_test_split(data, test_size=.2)

In [5]:
train

<surprise.trainset.Trainset at 0x7f8beffef2b0>

## Run our first model: KNNBasic

According to the Surprise library, `KNNBasic` will give us a basic collaborative filtering algorithm to start with. Let's see how our data starts performing with these algorithms so we can continue fine tuning and get the best recommendations possible.

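K-nearest-neighbor methods predict a user's rating for a film from the ratings that the `k` most similar users gave it (or, in the item-based variant, from the user's own ratings of the `k` most similar films), weighting each neighbor by its similarity.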
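To make the neighbor idea concrete, here is a minimal sketch of a similarity-weighted average over the `k` most similar neighbors. It is an illustration in plain Python, not Surprise's implementation; the function name `knn_basic_estimate` and the example numbers are hypothetical.

```python
import heapq

def knn_basic_estimate(neighbor_ratings, k=40):
    """Estimate a rating as the similarity-weighted mean of the k nearest neighbors.

    neighbor_ratings: list of (similarity, rating) pairs, one per neighbor
    who rated the film in question.
    """
    # keep only the k neighbors with the highest similarity
    top_k = heapq.nlargest(k, neighbor_ratings, key=lambda pair: pair[0])
    # neighbors with non-positive similarity carry no weight
    usable = [(sim, rating) for sim, rating in top_k if sim > 0]
    sim_total = sum(sim for sim, _ in usable)
    if sim_total == 0:
        return None  # no usable neighbors, so no estimate is possible
    return sum(sim * rating for sim, rating in usable) / sim_total

# hypothetical (similarity, rating) pairs on the notebook's 1-10 scale
neighbors = [(0.9, 8), (0.5, 6), (0.1, 2)]
estimate_k2 = knn_basic_estimate(neighbors, k=2)  # uses the two closest neighbors
estimate_k3 = knn_basic_estimate(neighbors, k=3)  # the distant neighbor pulls it down
```

With `k=2` only the two closest neighbors contribute, so the low rating from the least similar neighbor is ignored entirely.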

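A little on similarity: the similarity measure determines which users (or items) count as nearest neighbors and how heavily each neighbor's rating is weighted.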


Similarities:
* **cosine**: Compute the cosine similarity between all pairs of users (or items).
* **pearson**: Compute the Pearson correlation coefficient between all pairs of users (or items).
* **msd**: Compute the Mean Squared Difference similarity between all pairs of users (or items).
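The three measures can be sketched for a single pair of users as follows. This is a simplified plain-Python illustration over the films both users rated, assuming MSD is turned into a similarity as `1 / (msd + 1)`; the helper names and the two example users are hypothetical.

```python
import math

def common_ratings(ratings_u, ratings_v):
    """Return (rating_u, rating_v) pairs for the films both users rated."""
    shared = sorted(set(ratings_u) & set(ratings_v))
    return [(ratings_u[film], ratings_v[film]) for film in shared]

def cosine_sim(pairs):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)

def pearson_sim(pairs):
    """Cosine similarity after centering each user's ratings on their mean."""
    mean_u = sum(a for a, _ in pairs) / len(pairs)
    mean_v = sum(b for _, b in pairs) / len(pairs)
    centered = [(a - mean_u, b - mean_v) for a, b in pairs]
    denom = (math.sqrt(sum(a * a for a, _ in centered))
             * math.sqrt(sum(b * b for _, b in centered)))
    return 0.0 if denom == 0 else sum(a * b for a, b in centered) / denom

def msd_sim(pairs):
    """Mean squared difference, inverted so that larger means more similar."""
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1 / (msd + 1)

# two hypothetical users: user_b rates every shared film at half of user_a's score
user_a = {'film_1': 8, 'film_2': 6, 'film_3': 10, 'film_4': 3}
user_b = {'film_1': 4, 'film_2': 3, 'film_3': 5}
pairs = common_ratings(user_a, user_b)

cos, pear, msd = cosine_sim(pairs), pearson_sim(pairs), msd_sim(pairs)
```

In this example cosine and Pearson both treat the users as near-perfect matches because their ratings move together, while MSD gives a low similarity because the absolute ratings are far apart.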

In [None]:
sim_options = {'name': 'cosine'}
algo_knn_basic = KNNBasic(sim_options=sim_options)
predictions = algo_knn_basic.fit(train).test(test)

In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.5951053223214062

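Now, what if we change the similarity measure? While cosine measures the angle between two users' rating vectors over the films they have both rated, the Pearson similarity measure first centers each user's ratings on their mean, so it corrects for users who rate consistently high or low.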

In [None]:
sim_options = {'name': 'pearson'}
algo_knn_basic = KNNBasic(sim_options=sim_options)
predictions = algo_knn_basic.fit(train).test(test)

In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.5884529441409094

In [None]:
sim_options = {'name': 'msd'}
algo_knn_basic = KNNBasic(sim_options=sim_options)
predictions = algo_knn_basic.fit(train).test(test)

Computing the msd similarity matrix...
Done computing similarity matrix.


In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.4898636221946142

### Understanding Baseline Results

##### So, what exactly does our RMSE tell us?

Our RMSE (root mean squared error) shows the typical amount by which the model's predicted rating differs from the rating a user actually gave, measured by comparing our predictions against the held-out test data.
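As a quick illustration of the metric itself, here is a minimal sketch that computes RMSE by hand from hypothetical (actual, predicted) pairs. The conversion to stars assumes each point on the 1-10 scale is half a star.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# hypothetical (actual, predicted) ratings on the 1-10 scale
pairs = [(8, 7.0), (6, 7.0), (10, 8.0), (3, 5.0)]
score = rmse(pairs)

# assumption: two points on the 1-10 scale equal one star
score_in_stars = score / 2
```

Because the errors are squared before averaging, the two predictions that miss by 2 points count four times as much as the two that miss by 1.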

Since our best-performing model so far has an RMSE of 1.4899 on the 1-10 rating scale, our predicted rating for a film is already typically less than a star away from the actual rating (roughly three quarters of a star, at two points per star). Let's use what we've learned to see how we can improve the model by trying different algorithms and parameters.

Across our trial-and-error models, we tried all the similarity metrics available in the Surprise library and found that `pearson` and `msd` gave us the best results so far. The full list can be found __[here](https://surprise.readthedocs.io/en/stable/similarities.html)__.

##### What do our predictions look like in real time?

In [None]:
# output example predictions within our current model
predictions[:5]

##### Let's predict how a particular user might rate a particular film

In [None]:
# for this user
df['user_id'][1]

In [None]:
# for this item
df['movie_id'][1]

In [None]:
# here's our predicted rating
algo_knn_basic.predict(df['user_id'][1], df['movie_id'][1])

## Run our second model: KNNWithMeans

`KNNWithMeans` is also a basic collaborative filtering algorithm, but this algorithm takes into account the mean ratings of each user. We'll see if this algorithm brings better results across the two similarity metrics that have proved most successful to date.
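The mean adjustment can be sketched like this: instead of averaging the neighbors' raw ratings, we average how far each neighbor's rating sits above or below that neighbor's own mean, then add the result to the target user's mean. This is a plain-Python illustration under that assumption, not Surprise's code; the function name and numbers are hypothetical.

```python
def knn_with_means_estimate(user_mean, neighbors, k=40, min_k=1):
    """Estimate a rating as the user's mean plus a weighted average of neighbor offsets.

    neighbors: list of (similarity, rating, neighbor_mean) triples.
    """
    # keep the k most similar neighbors, then drop non-positive similarities
    top_k = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    usable = [(sim, rating, mean) for sim, rating, mean in top_k if sim > 0]
    if len(usable) < min_k:
        return user_mean  # too few neighbors: fall back to the user's mean
    sim_total = sum(sim for sim, _, _ in usable)
    offset = sum(sim * (rating - mean) for sim, rating, mean in usable) / sim_total
    return user_mean + offset

# hypothetical target user who averages 6.0, with two neighbors
neighbors = [(0.8, 9, 7.0), (0.4, 4, 5.0)]
estimate = knn_with_means_estimate(6.0, neighbors, k=15, min_k=1)
fallback = knn_with_means_estimate(6.0, neighbors, k=15, min_k=5)
```

The first neighbor liked the film 2 points more than usual and the second 1 point less, so the weighted offset is +1 and the estimate lands at 7. With `min_k=5`, two neighbors are not enough and the sketch returns the user's mean.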

In [None]:
k = 15
min_k = 5
sim_options = {'name': 'pearson'}
knn_means = KNNWithMeans(k=k, min_k=min_k, sim_options=sim_options, verbose=True)
    
predictions = knn_means.fit(train).test(test)

In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.5607923504104053

In [None]:
k = 15
min_k = 5
sim_options = {'name': 'msd'}
knn_means = KNNWithMeans(k=k, min_k=min_k, sim_options=sim_options, verbose=True)
    
predictions = knn_means.fit(train).test(test)

In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.514157952922513

In [None]:
k = 40
min_k = 5
sim_options = {'name': 'msd'}
knn_means = KNNWithMeans(k=k, min_k=min_k, sim_options=sim_options, verbose=True)
    
predictions = knn_means.fit(train).test(test)

In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.508629823133224

## Run our third model: SVD

`SVD`, which was originally popularized by Simon Funk during the Netflix Prize competition (__[see Simon's breakdown of this model](https://sifter.org/simon/journal/20061211.html)__), is now used across a variety of applications.

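Unlike the two models above, this algorithm is matrix factorization-based rather than built around similarity metrics.

A little about being matrix factorization-based: each user and each film is represented by a short vector of latent factors learned from the known ratings, and a rating is predicted from the dot product of the two vectors plus baseline terms for the overall mean, the user, and the film.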
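To show what matrix factorization means in practice, here is a small sketch of Funk-style training with stochastic gradient descent on the prediction `mu + b_u + b_i + p_u . q_i`. It is a toy written in plain Python, not Surprise's `SVD`; the function name, hyperparameter values, and ratings are all hypothetical.

```python
import math
import random

def train_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i with plain SGD; returns (predict, mu)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u, _, _ in ratings}               # user biases
    b_i = {i: 0.0 for _, i, _ in ratings}               # item biases
    # small random latent factors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in b_u}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in b_i}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # nudge each parameter along the error, with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict, mu

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# hypothetical (user, film, rating) triples on the 1-10 scale
ratings = [('ann', 'film_1', 9), ('ann', 'film_2', 3), ('bob', 'film_1', 8),
           ('bob', 'film_3', 4), ('cat', 'film_2', 2), ('cat', 'film_3', 5)]

predict, mu = train_funk_svd(ratings)
baseline_rmse = rmse([(r, mu) for _, _, r in ratings])            # always predict the mean
trained_rmse = rmse([(r, predict(u, i)) for u, i, r in ratings])  # after SGD training
```

Both errors here are measured on the training ratings, so the comparison only shows that the factors and biases are learning something; it says nothing about held-out accuracy.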

In [None]:
svd = SVD()
svd.fit(train)
predictions = svd.test(test)

In [None]:
accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.4183454579066075

This is our best RMSE to date! At two points per star on the 1-10 scale, our predictions are typically about 0.71 stars away from the actual rating.

### Changing our Default Hyperparameters

OPTION 1:

The Surprise library has methods to help us find the best parameters for tuning our model. We ran one such method, `GridSearchCV`, and experimented across parameters, but ended up determining that the default parameters gave us the very best RMSE.
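Conceptually, a grid search scores every combination of candidate parameter values and keeps the best one. The sketch below shows that loop in plain Python. `toy_score` is a made-up stand-in for a cross-validated RMSE, so its output carries no meaning for the Letterboxd data; only the parameter names `n_factors` and `reg_all` come from this notebook.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float('inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # lower is better, as with RMSE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(n_factors, reg_all):
    """Made-up scoring function; a real search would cross-validate a model here."""
    return abs(n_factors - 100) / 100 + abs(reg_all - 0.05)

# hypothetical candidate values
param_grid = {'n_factors': [50, 100, 150], 'reg_all': [0.02, 0.05, 0.1]}
best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the list lengths (nine combinations here), and each combination is refit once per fold when cross-validation is used, which is why grids are usually kept small.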

OPTION 2:

We can try to make this even better by performing a grid search to identify the best hyperparameters for tuning our model. I ran Surprise's `GridSearchCV` method for the SVD algorithm in the notebook within this folder titled `crossvalidation`. The following parameters were determined to be the best:

Let's run these new parameters on our model below and see how our accuracy changes.

In [None]:
# fit our algorithm with improved parameters, if possible
# (fit on the training split; `dataset` is not defined in this notebook)
#svd = SVD(n_factors=50, reg_all=0.05)
#svd.fit(train)

OPTION 1 PREDICTIONS

In [None]:
svd.predict(2, 4)

In [None]:
user_34_prediction = svd.predict('34', '25')
user_34_prediction

In [None]:
# the estimated rating is the `est` field of the prediction (same as index 3)
user_34_prediction.est

OPTION 2 PREDICTIONS

##### Let's predict how that same user might rate the same film

In [None]:
# for this user
df['user_id'][1]

In [None]:
# for this item
df['movie_id'][1]

In [None]:
# here's our predicted rating
svd.predict(df['user_id'][1], df['movie_id'][1])