# Letterboxd Analysis Project: Modeling

**Author:** Sierra Stanton
***

![Theater Scene](../images/zach-galifianakis-math.gif)

In this notebook, we'll start with a simple recommendation model and iteratively build on it to improve the film recommendations for Letterboxd users.

In [1]:
# standard import
import pandas as pd

# import needed surprise libraries
from surprise import Reader, Dataset, accuracy
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic

# retrieve dataframe as pickle file (the context manager closes the file for us)
import pickle
with open("df.pkl", "rb") as f:
    df = pickle.load(f)

In [2]:
# ensure our resulting dataset is brought in for modeling

df.head()

| Unnamed: 0 | movie_id | rating_val | user_id |
|---|---|---|---|
| 0 | happiest-season | 8 | deathproof |
| 1 | happiest-season | 7 | davidehrlich |
| 2 | happiest-season | 4 | ingridgoeswest |
| 3 | happiest-season | 7 | silentdawn |
| 4 | happiest-season | 2 | colonelmortimer |


We've verified that the data we feed into our reader has the three columns Surprise requires: a user (`user_id`), an item (`movie_id`), and a rating (`rating_val`).

## Load in our dataset

For our modeling, we chose the __[Surprise library](https://surprise.readthedocs.io/en/stable/index.html)__, a Python scikit for recommender systems. The library contains built-in algorithms and cross-validation methods we can use to make an increasingly proficient model.


In [3]:
# read in values as a Surprise dataset
# load_from_df expects columns in user, item, rating order and only needs the rating scale
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating_val']], reader)

In [4]:
# train test split

train, test = train_test_split(data, test_size=.2)

In [5]:
train

<surprise.trainset.Trainset at 0x7f934d69fe80>

## Run our first model: KNNBasic

According to the Surprise library, `KNNBasic` will give us a basic collaborative filtering algorithm to start with. Let's see how our data starts performing with these algorithms so we can continue fine tuning and get the best recommendations possible.

K Nearest Neighbor methods are memory-based (also called neighbor-based), so they're typically good baselines. Because this method measures the similarity between a user and every other user and predicts from a weighted average of their ratings, there are two things we want to take into account based on what we know about our data.

One is the cold start problem: without initial information on the user's preferences, we can't make an adequate comparison. Two is the tendency toward popularity bias. If a particular item is often rated 5 stars across the board, our prediction would likely be about 5 stars even if our user's taste differs from the popular opinion. We know that our film data skews toward more recent years, so this is worth noting.
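To make the weighted-average idea concrete, here is a minimal pure-Python sketch of a neighbor-based estimate. It is an illustration under simplifying assumptions rather than Surprise's implementation: the function name `knn_basic_estimate` and the sample numbers are hypothetical, and the similarities are assumed to be precomputed.

```python
def knn_basic_estimate(neighbors, k=40):
    """Estimate a rating as the similarity-weighted average of neighbor ratings.

    neighbors: list of (similarity, rating) pairs, one per user who already
    rated the target film. Returns None when there is no usable neighbor,
    which is the cold start situation described above.
    """
    # keep only positively similar users, then the k most similar of those
    usable = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    total_sim = sum(sim for sim, _ in usable)
    if total_sim == 0:
        return None
    return sum(sim * rating for sim, rating in usable) / total_sim


# hypothetical neighbors: (similarity to our user, their rating out of 10)
neighbors = [(0.9, 8), (0.1, 2)]
estimate = knn_basic_estimate(neighbors)  # pulled toward the closer user's 8
no_history = knn_basic_estimate([])       # None: nothing to compare against
```

The same arithmetic shows where popularity bias comes from: if nearly every neighbor gave a film the same score, the weighted average lands on that score no matter how the weights are distributed.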

The way we gauge similarity from user to user can be altered. Pearson correlation is often noted as a strong performer for recommendation engines, so we'll try it alongside cosine similarity in our baseline.

⏰ NOTE: Because a few of these models can be exhaustive and take time to run, I've commented out the code below and included the accompanying results in order to showcase development and aid anyone who'd like to follow suit. Simply uncomment the code if you'd like to run it yourself.

In [None]:
'''
sim_options = {'name': 'cosine'}
algo_knn_basic = KNNBasic(sim_options=sim_options)
predictions = algo_knn_basic.fit(train).test(test)
'''

In [None]:
# accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.5951053223214062

Now, what if we change the similarity metric to Pearson?

In [None]:
'''
sim_options = {'name': 'pearson'}
algo_knn_basic = KNNBasic(sim_options=sim_options)
predictions = algo_knn_basic.fit(train).test(test)
'''

In [None]:
# accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.5884529441409094

### Understanding Baseline Results

##### So, what exactly does our RMSE tell us?

Our RMSE shows us the typical amount our model prediction differs from the actual rating a user would give by comparing our predictions with our test data for accuracy.

Since our best performing model so far has an RMSE of 1.5884 on the 10-point scale, our predicted rating for a film is already typically less than a star (about 0.8 stars) from the actual rating. Let's use what we've learned to see how we can improve our model by trying different algorithms and parameters.
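For readers who want to see the arithmetic behind the metric, here is a small sketch of RMSE computed by hand and then converted from the 10-point scale to stars. The `rmse` helper and the sample pairs are hypothetical and only illustrate the calculation; in the notebook itself, `accuracy.rmse(predictions)` does this work.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    squared_errors = [(actual - est) ** 2 for actual, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# hypothetical (actual, estimated) ratings on the 10-point scale
sample = [(6, 7), (5, 4), (8, 8), (3, 5)]

error_10pt = rmse(sample)      # error in rating_val units (out of 10)
error_stars = error_10pt / 2   # Letterboxd stars are half of rating_val
```

Because the errors are squared before averaging, one large miss (the 3 estimated as 5) counts for more than several small ones.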

Across our trial and error models, we actually tried all the similarity metrics available in the surprise library but found the two above to give us the best results so far. They can be found __[here](https://surprise.readthedocs.io/en/stable/similarities.html)__.

##### What do our predictions look like in real time?

In [18]:
# output example predictions within our current model
predictions[:5]

[Prediction(uid='rbonaime', iid='the-bigger-picture', r_ui=6.0, est=6.853587940579535, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='allisoncm', iid='the-technique-and-the-rite', r_ui=5.0, est=6.0, details={'actual_k': 1, 'was_impossible': False}),
 Prediction(uid='joedicanio', iid='may', r_ui=7.0, est=6.605265251368509, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='dylanblondee', iid='22-july', r_ui=5.0, est=6.33035490908406, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='tubbs', iid='the-sisterhood-of-the-traveling-pants', r_ui=4.0, est=5.3535220692543914, details={'actual_k': 40, 'was_impossible': False})]
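One quick way to read a list like this is to sort it by how far each estimate missed. The sketch below copies the five rows above into a stand-in tuple (`Pred` is a hypothetical, trimmed-down substitute for Surprise's `Prediction`, so the snippet runs on its own) and lists the largest misses first.

```python
from collections import namedtuple

# stand-in for Surprise's Prediction tuple, trimmed to the fields used here
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est"])

preds = [
    Pred("rbonaime", "the-bigger-picture", 6.0, 6.853587940579535),
    Pred("allisoncm", "the-technique-and-the-rite", 5.0, 6.0),
    Pred("joedicanio", "may", 7.0, 6.605265251368509),
    Pred("dylanblondee", "22-july", 5.0, 6.33035490908406),
    Pred("tubbs", "the-sisterhood-of-the-traveling-pants", 4.0, 5.3535220692543914),
]


def abs_error(pred):
    """Distance between the actual rating and the estimate."""
    return abs(pred.r_ui - pred.est)


# largest misses first
worst_first = sorted(preds, key=abs_error, reverse=True)
for pred in worst_first:
    print(f"{pred.uid:>14}  {pred.iid:<40} off by {abs_error(pred):.2f}")
```

Four of these five estimates sit above the actual rating, though five rows are far too few to conclude anything about the model as a whole.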

##### Let's predict how a particular user might rate a particular film

In [19]:
# for this user
df['user_id'][1]

'davidehrlich'

In [20]:
# for this item
df['movie_id'][1]

'happiest-season'

In [21]:
# here's our predicted rating
algo_knn_basic.predict(df['user_id'][1], df['movie_id'][1])

Prediction(uid='davidehrlich', iid='happiest-season', r_ui=None, est=6.06895844485103, details={'actual_k': 40, 'was_impossible': False})

After iterating on our first model, we currently predict that user `davidehrlich` would rate the Hulu film Happiest Season 6/10, or 3 stars in the Letterboxd app. We can use these predictions to order films and serve him recommendations that better fit his taste. Still, we need to keep in mind that improved accuracy will make this feature more powerful and relevant. Narratives have a special place in our society, and with a database of over 250K films to recommend, let's try to further increase our accuracy.
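Ordering films by predicted rating can be sketched in a few lines. Everything here is hypothetical scaffolding: `top_n` is an illustrative helper, and `fake_estimates` stands in for a fitted model. With Surprise, the `estimate` function could instead return the `est` field of `algo_knn_basic.predict(user_id, film)`.

```python
def top_n(user_id, candidate_films, estimate, n=3):
    """Return the n candidate films with the highest estimated rating."""
    scored = [(estimate(user_id, film), film) for film in candidate_films]
    scored.sort(reverse=True)
    return [film for _, film in scored[:n]]


# hypothetical estimates standing in for a fitted model
fake_estimates = {"happiest-season": 6.07, "may": 6.61, "22-july": 6.33}


def estimate(user_id, film):
    """Look up the stand-in estimate; a real model would use user_id too."""
    return fake_estimates[film]


picks = top_n("davidehrlich", list(fake_estimates), estimate, n=2)
```

In practice the candidate list would exclude films the user has already rated, so that recommendations only surface films that are new to them.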

## Run our second model: KNNWithMeans

`KNNWithMeans` is also a basic collaborative filtering algorithm, but this algorithm takes into account the mean ratings of each user. We'll see if this algorithm brings better results across the two similarity metrics that have proved most successful to date.
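The difference from `KNNBasic` is easiest to see in code. In this hedged sketch (the function name and sample values are hypothetical, and this is not Surprise's implementation), each neighbor contributes how far above or below their own average they rated the film, and that weighted offset is added to our user's average. The fallback to the user's mean when fewer than `min_k` neighbors are available follows the behavior the Surprise documentation describes.

```python
def knn_with_means_estimate(user_mean, neighbors, k=15, min_k=5):
    """Mean-centered neighbor estimate.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples
    for users who rated the target film.
    """
    # keep only positively similar users, then the k most similar of those
    usable = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    total_sim = sum(sim for sim, _, _ in usable)
    if len(usable) < min_k or total_sim == 0:
        # too few neighbors: fall back to the user's own average rating
        return user_mean
    offset = sum(sim * (rating - mean) for sim, rating, mean in usable) / total_sim
    return user_mean + offset


# hypothetical neighbors: a harsh rater (mean 4) who gave the film a 6,
# and a generous rater (mean 8) who gave it a 9
neighbors = [(0.5, 6, 4), (0.5, 9, 8)]

with_neighbors = knn_with_means_estimate(6.0, neighbors, min_k=1)
too_few = knn_with_means_estimate(6.0, neighbors)  # default min_k=5
```

Centering on each user's mean means a 6 from a harsh rater counts as a strong endorsement, which a plain weighted average would miss.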

In [None]:
'''
k = 15
min_k = 5
sim_options = {'name': 'pearson'}
knn_means = KNNWithMeans(k=k, min_k=min_k, sim_options=sim_options, verbose=True)
    
predictions = knn_means.fit(train).test(test)
'''

In [None]:
# accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.5607923504104053

Here is the breakdown of similarity types:
* **cosine**:	Compute the cosine similarity between all pairs of users (or items).
* **pearson**:	Compute the Pearson correlation coefficient between all pairs of users (or items).
* **msd**:	Compute the Mean Squared Difference similarity between all pairs of users (or items).
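As a rough illustration of how these three measures differ, the sketch below computes each one for two hypothetical users over the films they have both rated. The helper names and ratings are made up for the example. Turning MSD into a similarity with `1 / (msd + 1)` follows the definition in the Surprise documentation, but this is a simplified sketch rather than the library's code.

```python
import math


def common_ratings(a, b):
    """Pair up ratings for the films both users have rated."""
    shared = sorted(set(a) & set(b))
    return [a[f] for f in shared], [b[f] for f in shared]


def cosine_sim(a, b):
    x, y = common_ratings(a, b)
    denom = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    return sum(p * q for p, q in zip(x, y)) / denom if denom else 0.0


def pearson_sim(a, b):
    x, y = common_ratings(a, b)
    if not x:
        return 0.0
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx, dy = [v - mx for v in x], [v - my for v in y]
    denom = math.sqrt(sum(v * v for v in dx)) * math.sqrt(sum(v * v for v in dy))
    return sum(p * q for p, q in zip(dx, dy)) / denom if denom else 0.0


def msd_sim(a, b):
    x, y = common_ratings(a, b)
    if not x:
        return 0.0
    msd = sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)
    return 1 / (msd + 1)  # larger squared difference, smaller similarity


# hypothetical users who rank films identically, but one rates 4 points lower
generous = {"may": 9, "22-july": 7, "happiest-season": 8}
harsh = {"may": 5, "22-july": 3, "happiest-season": 4}

cos = cosine_sim(generous, harsh)
pear = pearson_sim(generous, harsh)
msd = msd_sim(generous, harsh)
```

Pearson treats these two users as a near-perfect match because it ignores the constant 4-point offset, while MSD penalizes that offset heavily. Which behavior suits a given dataset is an empirical question, which is why each metric is tried in turn here.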
    
Let's bring in our `msd` similarity since we have yet to explore how it performs with our data.

In [None]:
'''
k = 15
min_k = 5
sim_options = {'name': 'msd'}
knn_means = KNNWithMeans(k=k, min_k=min_k, sim_options=sim_options, verbose=True)
    
predictions = knn_means.fit(train).test(test)
'''

In [None]:
# accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.514157952922513

In [None]:
'''
k = 40
min_k = 5
sim_options = {'name': 'msd'}
knn_means = KNNWithMeans(k=k, min_k=min_k, sim_options=sim_options, verbose=True)
    
predictions = knn_means.fit(train).test(test)
'''

In [None]:
#accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.508629823133224

Increasing the `k` parameter clearly helped a bit. Now, let's see if using the similarity type that's winning so far aids the first model we tried.

#### A similarity iteration with our first model: KNNBasic

In [None]:
'''
sim_options = {'name': 'msd'}
algo_knn_basic = KNNBasic(sim_options=sim_options)
predictions = algo_knn_basic.fit(train).test(test)
'''

In [None]:
# accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.4898636221946142

When we brought the winning similarity metric from our KNNWithMeans iterations back to our first model, we clearly improved our RMSE - which previously had a record of 1.5884 (with KNNBasic) and 1.5086 (with KNNWithMeans).

## Run our third model: SVD

`SVD`, or Singular Value Decomposition, was originally popularized by Simon Funk during the Netflix Prize competition (__[see Simon's breakdown of this model](https://sifter.org/simon/journal/20061211.html)__), and is now used across a variety of applications.

Unlike the prior two models we tried above - this algorithm is matrix-factorization based instead of focused around the similarity metrics above.
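To show what "matrix-factorization based" means in practice, here is a compact pure-Python sketch in the spirit of Funk's approach. Each rating is modeled as a global mean plus a user bias, an item bias, and the dot product of small user and item factor vectors, all learned by stochastic gradient descent. The function names, toy ratings, and hyperparameter values are assumptions chosen for illustration; Surprise's `SVD` is a far more efficient implementation with its own defaults.

```python
import math
import random


def train_svd(ratings, n_factors=2, n_epochs=100, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors with SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                        # user biases
    bi = {i: 0.0 for i in items}                        # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # nudge each parameter along the error, shrunk by regularization
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)
    return mu, predict


def rmse(ratings, estimate):
    """RMSE of an estimate function over (user, item, rating) triples."""
    total = sum((r - estimate(u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


# hypothetical toy ratings on the 10-point scale
toy = [("ann", "may", 8), ("ann", "22-july", 6), ("bob", "may", 7),
       ("bob", "happiest-season", 3), ("cat", "22-july", 5),
       ("cat", "happiest-season", 2)]

mu, predict = train_svd(toy)
baseline_rmse = rmse(toy, lambda u, i: mu)  # always guess the global mean
fitted_rmse = rmse(toy, predict)            # error on the training data itself
```

Note that `fitted_rmse` is measured on the same ratings the model was trained on, so it only shows that learning happened. The RMSE values reported in this notebook come from a held-out test set instead.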

In [None]:
'''
# run the default SVD model
svd = SVD()
svd.fit(train)
predictions = svd.test(test)
'''

In [None]:
# accuracy.rmse(predictions)

**RESULTS**
* RMSE: 1.4183454579066075

This is our best RMSE to date! On Letterboxd's 5-star scale, our predictions now typically land within about 0.7 stars of the actual rating.

### Changing our Default Hyperparameters

The Surprise library has methods to help us find the best parameters for tuning our model. I ran Surprise's `GridSearchCV` method across multiple values of the SVD algorithm's hyperparameters in the notebook titled [Gridsearch](/notebooks/gridsearch.ipynb) and extracted the best performing parameters. Let's run these new parameters on our model below and see how our accuracy changes.
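The core of a grid search is just an exhaustive sweep over every parameter combination, keeping whichever scores best. The sketch below shows that loop in plain Python under loud assumptions: `grid_search` is a hypothetical helper, the grid values are illustrative rather than the ones actually searched, and `fake_rmse` is a toy scoring function rigged so that its minimum sits at the parameters reported in this notebook. Surprise's grid search additionally cross-validates each combination, which this sketch omits.

```python
from itertools import product


def grid_search(param_grid, score):
    """Try every combination in param_grid and keep the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


# hypothetical grid; these values are illustrative only
param_grid = {
    "n_factors": [20, 40],
    "lr_all": [0.005, 0.008],
    "reg_all": [0.02, 0.025],
}


def fake_rmse(n_factors, lr_all, reg_all):
    """Toy stand-in for 'cross-validate SVD with these settings, return RMSE'."""
    return abs(n_factors - 40) / 100 + abs(lr_all - 0.008) + abs(reg_all - 0.025)


best_params, best_score = grid_search(param_grid, fake_rmse)
```

Because every combination is fitted, the cost grows multiplicatively with each added parameter value: this small grid already means eight model fits, before any cross-validation folds.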

In [7]:
'''
# fit our algorithm with winning hyperparameters
svd = SVD(lr_all=0.008, n_factors=40, reg_all=0.025, biased=True)
svd.fit(train)
predictions = svd.test(test)
'''

In [8]:
#accuracy.rmse(predictions)

RMSE: 1.4022


1.4021701605968926

**RESULTS**
* RMSE: 1.4021701605968926

We successfully used `GridSearchCV` to identify better parameters for our model! This is the lowest RMSE we've achieved so far, and the run time was under twenty minutes.

## Let's predict how that same user might rate the same film

In [13]:
# for this user
df['user_id'][1]

'davidehrlich'

In [14]:
# for this item
df['movie_id'][1]

'happiest-season'

In [15]:
# here's our predicted rating
svd.predict(df['user_id'][1], df['movie_id'][1])

Prediction(uid='davidehrlich', iid='happiest-season', r_ui=None, est=5.488776723015272, details={'was_impossible': False})

We can now see that user `davidehrlich` is estimated to rate the Hulu film Happiest Season 5.5 on a 10-point scale, or about 2.5 stars in Letterboxd, compared with the 3 stars our first model predicted. His actual rating in our dataset is 7/10, so this single estimate did not move closer even though the SVD model's error across the whole test set is lower. With that lower overall error, we can now surface better recommendations.
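Since Letterboxd displays half-star ratings while `rating_val` runs from 1 to 10, a small helper makes the comparison between models easier to read. This is a hypothetical convenience function, assuming one star equals two points of `rating_val`.

```python
def to_half_stars(rating_out_of_10):
    """Round a 10-point rating to the nearest half star on a 5-star scale."""
    return round(rating_out_of_10) / 2


knn_stars = to_half_stars(6.06895844485103)   # KNNBasic estimate from earlier
svd_stars = to_half_stars(5.488776723015272)  # SVD estimate above
actual_stars = to_half_stars(7)               # rating_val for this row in df.head()
```

One caveat: Python's `round` sends exact `.5` ties to the nearest even integer, which does not affect the values here but could matter for estimates such as 6.5.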

## Conclusions

* Our final, iterated SVD model currently runs in about twenty minutes and predicts within about 0.7 stars (an RMSE of 1.40 on the 10-point scale) of how a Letterboxd user tends to rate a film. Next, we'll want to create the front-end experience for a Letterboxd user to source these helpful recommendations. We could perhaps deploy this via Flask and Heroku, or aim to bring it into the Letterboxd app via web or iOS after testing. We'll want to ask new users a few questions to mitigate the cold start problem and have some initial recommendations sourced quickly, prior to the more personalized processing we're now capable of.