# Hulu Recommendation System Competition Notebook

## Overview

The aim of this project is to build a model that provides the top 5 movie recommendations to a user, based on their ratings of other movies from Hulu. Our audience is the Hulu Technology Team, headed by the CTO of Hulu. They are holding a competition to upgrade the current machine learning algorithm that produces the movie recommendations given to a Hulu user.

## Business/Industry Understanding 

Today's streaming landscape is vast and diverse, with over 1 billion users worldwide generating over 500 billion dollars in revenue. Hulu currently holds an 11% share of the market, trailing Netflix (25%), Prime Video (18%), HBO Max (17%) and Disney Plus (14%). It is also important to note that 85% of Hulu users also subscribe to Netflix. Currently, Hulu uses an item-based collaborative filtering algorithm. This is a successful approach used by many recommender systems; it relies on the ratings users give to items as its sole source of information for learning to make recommendations.

## Imports

Surprise was the main library we used to build our collaborative filtering model. It has built-in methods for splitting the data set, cross validating and performing grid searches, as well as various algorithms for building recommendation systems. As you go through this notebook, you will see that we combined Surprise and scikit-learn methods to ensure the successful deployment of the model. Because the syntax of Surprise and sklearn is similar, this made the process more efficient and keeps the notebook readable for anyone with a background in data science. We mainly used Surprise to transform the data set into a format suitable for recommendation systems. We also employed pickling to make loading the model onto a new computer easier.
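
As a minimal sketch of the pickling step mentioned above, the helpers below save and reload an arbitrary Python object with the standard-library `pickle` module. The helper names, the file name and the dictionary standing in for a fitted model are all hypothetical; they are illustrations, not the code used in this notebook.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object (for example a fitted model) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled object. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted model: any picklable object is handled the same way.
fake_model = {"algorithm": "SVD", "n_factors": 125}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "final_model.pkl"  # hypothetical file name
    save_model(fake_model, model_path)
    restored = load_model(model_path)

print(restored == fake_model)
```

Pickle files are tied to the library versions that produced them, so the same package versions should be installed on the machine that loads the file.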

In [43]:
import pandas as pd

from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate,GridSearchCV
from surprise.prediction_algorithms import KNNBasic,SVD, SVDpp, NMF, BaselineOnly, NormalPredictor
from sklearn.model_selection import train_test_split

## EDA

Our Exploratory Data Analysis also includes data cleaning and reformatting.

In [44]:
movies = pd.read_csv('../../ml-latest-small/movies.csv')
rating = pd.read_csv('../../ml-latest-small/ratings.csv')

We loaded the data in from the two csv files: **movies.csv** and **ratings.csv** 

In [45]:
#looking at the movies df
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


The movies data set has 9742 rows and 3 columns, where each row represents one movie. The first column, *movieId*, is a unique id that identifies each movie in this dataset. The second column, *title*, contains the movie title, which in this case combines the title of the movie and the year it was released. The final column, *genres*, contains the genres of the movie; some movies are tied to multiple genres. For example, "Toy Story" is considered a children's movie, but also an animation, adventure, comedy and fantasy movie.

In [46]:
movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [47]:
movies.title.value_counts()

War of the Worlds (2005)                          2
Emma (1996)                                       2
Saturn 3 (1980)                                   2
Confessions of a Dangerous Mind (2002)            2
Eros (2004)                                       2
                                                 ..
Not One Less (Yi ge dou bu neng shao) (1999)      1
Great Beauty, The (Grande Bellezza, La) (2013)    1
Sand Pebbles, The (1966)                          1
Roommate, The (2011)                              1
Auto Focus (2002)                                 1
Name: title, Length: 9737, dtype: int64

In [48]:
movies.title.duplicated().sum()

5

In [49]:
#Drop the 5 duplicated movie titles
movies.drop_duplicates(subset='title', inplace=True)

In [50]:
#Sanity check to ensure all duplicates were dropped from title column

movies.title.duplicated().sum()

0

In [51]:
movies[movies.title.duplicated() == True]

Unnamed: 0,movieId,title,genres


First, we established that there were no nulls in the data frame. We then checked for duplicates with a value_counts on the movie titles and found 5 duplicated titles, which we dropped. Finally, we reconfirmed that no duplicates remained; the new data frame has 9737 rows, 5 fewer than the original 9742.

In [52]:
len(movies)

9737

In [53]:
# Split title column into two new columns: Title and year 

movies['Title'] = movies['title'].str.split('(', n=1, expand=True)[0]

movies['year_released'] = movies['title'].str.split('(', n=1, expand=True)[1]

We decided to reformat the title column, so we split it on the first opening parenthesis '(' to separate the title from the year released. We wanted to isolate these two variables to see what effect the year released would have on the model.
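
Some titles in this data set contain more than one pair of parentheses, for example "Not One Less (Yi ge dou bu neng shao) (1999)", and splitting on the first '(' cuts those titles short. The sketch below shows one possible alternative that only treats a four-digit number in parentheses at the very end of the string as the year. The function name `split_title_year` is hypothetical and is not part of the notebook.

```python
import re

# A four-digit year in parentheses at the very end of the string.
_YEAR_AT_END = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_year(raw_title):
    """Return (title, year) where year is a string, or None if no year is found."""
    match = _YEAR_AT_END.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    return raw_title[:match.start()].strip(), match.group(1)


print(split_title_year("Toy Story (1995)"))
print(split_title_year("Not One Less (Yi ge dou bu neng shao) (1999)"))
```

Applied to a pandas column, the same function could be used with `.apply`, keeping alternate titles in parentheses intact.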

In [54]:
# Drop original column title

movies.drop(columns='title', inplace=True)

In [55]:
# Sanity Check 
movies

Unnamed: 0,movieId,genres,Title,year_released
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995)
1,2,Adventure|Children|Fantasy,Jumanji,1995)
2,3,Comedy|Romance,Grumpier Old Men,1995)
3,4,Comedy|Drama|Romance,Waiting to Exhale,1995)
4,5,Comedy,Father of the Bride Part II,1995)
...,...,...,...,...
9737,193581,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic,2017)
9738,193583,Animation|Comedy|Fantasy,No Game No Life: Zero,2017)
9739,193585,Drama,Flint,2017)
9740,193587,Action|Animation,Bungo Stray Dogs: Dead Apple,2018)


In [56]:
# Format year_released column

movies['year_released'] = movies.year_released.str.replace(')', '', regex=False)

In [57]:
# Sanity Check to ensure formatting was completed
movies

Unnamed: 0,movieId,genres,Title,year_released
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Adventure|Children|Fantasy,Jumanji,1995
2,3,Comedy|Romance,Grumpier Old Men,1995
3,4,Comedy|Drama|Romance,Waiting to Exhale,1995
4,5,Comedy,Father of the Bride Part II,1995
...,...,...,...,...
9737,193581,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic,2017
9738,193583,Animation|Comedy|Fantasy,No Game No Life: Zero,2017
9739,193585,Drama,Flint,2017
9740,193587,Action|Animation,Bungo Stray Dogs: Dead Apple,2018


The original split left the closing parenthesis attached to the year, so we wrote the additional code above to strip it and ensure clean data.

In [58]:
#checking rating df
rating

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


The ratings data frame has 100,836 rows and 4 columns, where each row is one rating given by a specific user to a specific movie. The first column, *userId*, is a unique id that identifies the user who rated the movie denoted by the second column, *movieId*. This *movieId* is the same identifying key that is present in the movies data frame. The third column is the *rating* that the user gave the movie, on a scale from 0.5 to 5 in half-star steps. The final column is the date and time the rating was made, in the format of a UNIX *timestamp*.

In [59]:
#converted timestamp to datetime
rating['timestamp'] = pd.to_datetime(rating['timestamp'], unit='s')

We converted the UNIX timestamp into a datetime object to see whether we could incorporate time as a feature in our model. After reviewing the data, we decided against it, because these datasets change over time and are not appropriate for reporting research results.
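
To make the conversion concrete, here is a small standard-library sketch that converts the first timestamp in the ratings table (964982703) by hand. The helper name `unix_to_utc` is hypothetical; the result is expressed in UTC.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert seconds since 1970-01-01 00:00:00 UTC into an aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


first_rating_time = unix_to_utc(964982703)
print(first_rating_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2000-07-30 18:45:03
```

This matches the value shown for user 1 and movie 1 once the ratings are merged with the movies.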

In [60]:
rating.rating.value_counts(normalize=True)

4.0    0.265957
3.0    0.198808
5.0    0.131015
3.5    0.130271
4.5    0.084801
2.0    0.074884
2.5    0.055040
1.0    0.027877
1.5    0.017762
0.5    0.013586
Name: rating, dtype: float64

In [61]:
rating.duplicated().sum()

0

In [62]:
rating.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

We checked the ratings data frame for fully duplicated rows and for nulls, and confirmed that it contains neither.

In [63]:
# merging rating and movies
movie_rating = movies.merge(rating, on='movieId', how='outer')

In [64]:
movie_rating

Unnamed: 0,movieId,genres,Title,year_released,userId,rating,timestamp
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995,1.0,4.0,2000-07-30 18:45:03
1,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995,5.0,4.0,1996-11-08 06:36:02
2,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995,7.0,4.5,2005-01-25 06:52:26
3,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995,15.0,2.5,2017-11-13 12:59:30
4,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995,17.0,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...,...
100849,64997,,,,68.0,2.5,2008-12-28 20:55:15
100850,144606,,,,111.0,4.0,2018-01-31 23:27:37
100851,147002,,,,318.0,4.0,2017-08-08 15:45:52
100852,26958,,,,509.0,3.5,2015-07-04 17:42:33


In [65]:
movie_rating.dropna(inplace=True)

In [66]:
movie_rating['genres']=[row.strip().lower().replace('|',',') for row in movie_rating['genres']]

The code above converts the genres from the original pipe-separated format into a lowercase, comma-separated string that we can search in our final function.
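
If an actual Python list is more convenient for iteration or filtering, the genres string can be split instead of having its separator replaced. The sketch below is one way to do that; `parse_genres` is a hypothetical helper, not a function from this notebook.

```python
def parse_genres(raw):
    """Turn 'Comedy|Drama|Romance' into ['comedy', 'drama', 'romance']."""
    return [genre.strip().lower() for genre in raw.split("|") if genre.strip()]


toy_story_genres = parse_genres("Adventure|Animation|Children|Comedy|Fantasy")
print(toy_story_genres)
print("comedy" in toy_story_genres)
```

A list makes exact membership tests straightforward, whereas a substring search on a comma-separated string can match part of a longer genre name.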

In [67]:
movie_rating['userId'].nunique()

610

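The code above determines how many unique users contributed ratings: in this case, 610. The merged data frame holds slightly fewer ratings than the 100,836 in the original ratings file, because the ratings of the dropped duplicate titles were removed along with the other incomplete rows.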

In [68]:
movie_rating

Unnamed: 0,movieId,genres,Title,year_released,userId,rating,timestamp
0,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,1.0,4.0,2000-07-30 18:45:03
1,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,5.0,4.0,1996-11-08 06:36:02
2,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,7.0,4.5,2005-01-25 06:52:26
3,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,15.0,2.5,2017-11-13 12:59:30
4,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,17.0,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...,...
100843,193581,"action,animation,comedy,fantasy",Black Butler: Book of the Atlantic,2017,184.0,4.0,2018-09-16 14:44:42
100844,193583,"animation,comedy,fantasy",No Game No Life: Zero,2017,184.0,3.5,2018-09-16 14:52:25
100845,193585,drama,Flint,2017,184.0,3.5,2018-09-16 14:56:45
100846,193587,"action,animation",Bungo Stray Dogs: Dead Apple,2018,184.0,3.5,2018-09-16 15:00:21


## Splits

The train-test split process was a little different than usual because of how the Surprise library structures its cross-validation and grid-search methods: `cross_validate` and `GridSearchCV` take a Dataset object, while `.fit` and `.test` take a trainset and a testset respectively. The documentation recommends doing cross validation **OR** a train-test split. Since we wanted to do both, we split the data once with scikit-learn and then built both kinds of object: train_data (a Dataset) is used for cross validation and grid search, while train_data2 (a trainset) and test_data2 (a testset) are used for fitting a model and evaluating its predictions.

In [69]:
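#for function building/final model
#note: the ratings in this data set range from 0.5 to 5.0, so rating_scale=(0.5, 5) would match the data more closely
reader = Reader(rating_scale=(1, 5))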
complete_data = Dataset.load_from_df(movie_rating[['userId', 'movieId', 'rating']], reader)

In [70]:
#use sklearn for model selection/choosing 
train, test = train_test_split(movie_rating, random_state=42)

reader = Reader(rating_scale=(1, 5))
train_data = Dataset.load_from_df(train[['userId', 'movieId', 'rating']], reader)
test_data = Dataset.load_from_df(test[['userId', 'movieId', 'rating']], reader)


train_data2 = train_data.build_full_trainset()
test_data2 = test_data.build_full_trainset().build_testset()

## Modeling

### Baseline/dummy

The first step in our iterative modeling process is to create and test a dummy model, which provides the baseline RMSE (root mean squared error) score against which we judge the performance of our later models.
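
For reference, RMSE is the square root of the average squared difference between the actual and the predicted ratings. A minimal sketch of the calculation, using made-up ratings rather than values from this data set, could look like this:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: two predictions are off by one star, one is exact.
example_rmse = rmse([4.0, 3.0, 5.0], [3.0, 3.0, 4.0])
print(round(example_rmse, 4))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.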

In [92]:
from surprise.prediction_algorithms.random_pred import NormalPredictor

In [93]:
dummy_model = NormalPredictor()
cross_validate(dummy_model, train_data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.4235  1.4165  1.4283  1.4101  1.4285  1.4214  0.0071  
Fit time          0.06    0.07    0.07    0.06    0.06    0.06    0.00    
Test time         0.07    0.13    0.06    0.06    0.06    0.08    0.03    


{'test_rmse': array([1.42346113, 1.4165474 , 1.42830858, 1.41006482, 1.42850583]),
 'fit_time': (0.06210899353027344,
  0.06763100624084473,
  0.06775307655334473,
  0.06429505348205566,
  0.062151193618774414),
 'test_time': (0.06682205200195312,
  0.1285099983215332,
  0.06454777717590332,
  0.06496191024780273,
  0.0649406909942627)}

In [94]:
predictions = dummy_model.fit(train_data2).test(test_data2)
baseline_rsme = accuracy.rmse(predictions)
baseline_rsme

RMSE: 1.4256


1.425648950738226

An RMSE of about 1.43 is poor for ratings on a five-star scale: it means that predictions are typically off by roughly one and a half stars.

### KNNBasic

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. Here we use it as the basis for our recommendation system. The premise is that a user's rating of a movie they have not seen can be predicted from the ratings of similar users. In this case, the nearest "neighbours" are the users whose past ratings most closely match those of the user in question.
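
The idea can be sketched in plain Python. The example below is a simplified illustration, not Surprise's implementation: it scores similarity as 1 / (mean squared difference + 1) over the movies two users have both rated, then predicts a rating as the similarity-weighted average of the k most similar users who rated the movie. The function names and the toy ratings are hypothetical.

```python
import heapq


def msd_similarity(ratings_a, ratings_b):
    """Similarity from the mean squared difference over commonly rated items."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


def knn_predict(ratings, user, item, k=15, min_k=5, default=None):
    """Similarity-weighted average of the k most similar users who rated the item."""
    candidates = [
        (msd_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbours = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    neighbours = [(sim, r) for sim, r in neighbours if sim > 0]
    if len(neighbours) < min_k:
        return default  # too few neighbours to make a prediction
    total_sim = sum(sim for sim, _ in neighbours)
    return sum(sim * r for sim, r in neighbours) / total_sim


# Toy ratings: ann agrees with bob and disagrees with cat.
toy_ratings = {
    "ann": {"A": 5.0, "B": 3.0},
    "bob": {"A": 5.0, "B": 3.0, "C": 4.0},
    "cat": {"A": 1.0, "B": 5.0, "C": 2.0},
}
prediction = knn_predict(toy_ratings, "ann", "C", k=2, min_k=1)
print(round(prediction, 3))
```

In this toy case the prediction lands close to bob's rating, because bob's similarity to ann (1.0) far outweighs cat's (1/11). The `min_k` argument plays the same role as in the grid searches in this section: with fewer than `min_k` usable neighbours, the sketch falls back to a default value.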

In [71]:
param_grid = {'k':[10, 50, 100],'min_k': [1, 5, 10]}
base_model = GridSearchCV(KNNBasic,param_grid=param_grid,joblib_verbose=5)
base_model.fit(train_data)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.5s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.0s remaining:    0.0s



[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   28.0s finished


In [72]:
base_model.best_params

{'rmse': {'k': 10, 'min_k': 5}, 'mae': {'k': 10, 'min_k': 5}}

In [73]:
#trying different parameters
param_grid = {'k':[5, 10, 15],'min_k': [1, 5, 10]}
base_model = GridSearchCV(KNNBasic,param_grid=param_grid,joblib_verbose=5)
base_model.fit(train_data)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.5s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.1s remaining:    0.0s



[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   23.6s finished


In [74]:
base_model.best_params

{'rmse': {'k': 15, 'min_k': 5}, 'mae': {'k': 15, 'min_k': 5}}

In [75]:
param_grid = {'k':[10, 15, 20],'min_k': [1, 5, 10]}
base_model = GridSearchCV(KNNBasic,param_grid=param_grid,joblib_verbose=5)
base_model.fit(train_data)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.1s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.6s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.1s remaining:    0.0s



[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   26.1s finished


In [76]:
base_model.best_params

{'rmse': {'k': 15, 'min_k': 5}, 'mae': {'k': 15, 'min_k': 5}}

We repeated the grid search, re-centering the candidate values around the previous best result, until the selected hyper-parameters stopped changing (k=15, min_k=5).
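
The sketch below illustrates what a grid search does conceptually: it enumerates every combination of the candidate values and keeps the one with the lowest score. With 3 values for each of 2 parameters and 5 folds, that gives 9 x 5 = 45 fits, which matches the "45 out of 45" in the logs above. The helper names and the toy scoring function are hypothetical; a real search would score each combination by its cross-validated RMSE.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of the candidate values as a dict."""
    names = list(param_grid)
    value_lists = (param_grid[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


def grid_search(param_grid, score_fn):
    """Return the combination with the lowest score (lower is better, like RMSE)."""
    return min(expand_grid(param_grid), key=score_fn)


param_grid = {"k": [10, 15, 20], "min_k": [1, 5, 10]}
combos = expand_grid(param_grid)
n_folds = 5  # assumed number of cross-validation folds per combination
total_fits = len(combos) * n_folds


def toy_score(params):
    """Toy stand-in for a cross-validated error; smallest at k=15, min_k=5."""
    return abs(params["k"] - 15) + abs(params["min_k"] - 5)


best = grid_search(param_grid, toy_score)
print(len(combos), total_fits, best)
```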

#### KNN Cross Validation

In [95]:
#instantiate KNN model 
model1=KNNBasic(k=15, min_k=5)

In [96]:
#perform cross validation with the training data
cross_validate(model1, train_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9500  0.9494  0.9457  0.9571  0.9621  0.9528  0.0059  
MAE (testset)     0.7267  0.7298  0.7272  0.7348  0.7341  0.7305  0.0034  
Fit time          0.03    0.04    0.04    0.04    0.04    0.04    0.00    
Test time         0.46    0.47    0.49    0.47    0.47    0.47    0.01    


{'test_rmse': array([0.94997893, 0.94937431, 0.94570357, 0.95708816, 0.96206603]),
 'test_mae': array([0.72671395, 0.7297983 , 0.72722415, 0.73475731, 0.7341103 ]),
 'fit_time': (0.03418874740600586,
  0.042458295822143555,
  0.038961172103881836,
  0.03963899612426758,
  0.03948211669921875),
 'test_time': (0.4624638557434082,
  0.4700448513031006,
  0.485076904296875,
  0.4742300510406494,
  0.47359704971313477)}

In [97]:
#use model to predict how user 10 would rate movie 20
model1.predict(uid=10, iid = 20)

Prediction(uid=10, iid=20, r_ui=None, est=2.6874288207777863, details={'actual_k': 10, 'was_impossible': False})

Looking at the output above, our model predicted that user 10 would rate movie 20 about 2.69, give or take our RMSE of roughly 0.95.

### SVD

After defining our baseline model and exploring the KNN algorithm, we tried the Singular Value Decomposition (SVD) algorithm, in the hope that it would improve the accuracy of our model and lower the RMSE. The Surprise library uses a form of SVD called Funk's SVD, which is a matrix factorization-based model. This means it transforms a user-item matrix (in our case, a user-movie ratings matrix) into latent factor matrices. It is also important to note that errors are minimized using stochastic gradient descent, which means that both regularization and learning rate are hyper-parameters that we can tune. The number of factors and the number of epochs are also hyper-parameters we tuned.
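
To show how the four hyper-parameters enter the training loop, here is a simplified pure-Python sketch of matrix factorization trained with stochastic gradient descent. It is an illustration under stated assumptions, not Surprise's implementation: the model form (global mean plus user bias, item bias and a dot product of latent factors), the function names and the toy ratings are all choices made for this example.

```python
import math
import random


def train_funk_svd(ratings, n_factors=2, n_epochs=100, lr=0.01, reg=0.1, seed=0):
    """Fit biases and latent factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(a * b for a, b in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Each step moves against the gradient; reg shrinks values toward 0.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return {"mu": mu, "bu": bu, "bi": bi, "p": p, "q": q}


def predict_rating(model, u, i):
    """Estimated rating of item i by user u."""
    dot = sum(a * b for a, b in zip(model["p"][u], model["q"][i]))
    return model["mu"] + model["bu"][u] + model["bi"][i] + dot


def training_rmse(model, ratings):
    """RMSE of the model on the ratings it was trained on."""
    errors = [(r - predict_rating(model, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(errors) / len(errors))


toy = [("u1", "A", 5.0), ("u1", "B", 3.0), ("u2", "A", 4.0),
       ("u2", "C", 2.0), ("u3", "B", 4.0), ("u3", "C", 1.0)]
untrained = train_funk_svd(toy, n_epochs=0)
trained = train_funk_svd(toy, n_epochs=200)
print(training_rmse(untrained, toy), training_rmse(trained, toy))
```

In this sketch `n_factors`, `n_epochs`, `lr` and `reg` correspond in spirit to the `n_factors`, `n_epochs`, `lr_all` and `reg_all` values searched in the cells below. The error reported here is measured on the training ratings only, so it says nothing about over-fitting; that is what the cross validation is for.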

In [None]:
model2 = SVD()

In [None]:
cross_validate(model2, train_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [None]:
param_grid = {'n_factors':[10,20,50],'n_epochs': [5, 10, 15], 'lr_all': [0.002,0.005,0.01],
             'reg_all': [0.2,0.4,0.6]}
gs_model2 = GridSearchCV(SVD,param_grid=param_grid,joblib_verbose=5)
gs_model2.fit(train_data)

In [None]:
gs_model2.best_params

In [None]:
param_grid = {'n_factors':[25, 50, 75],'n_epochs': [15, 30, 45], 'lr_all': [0.01,0.05,0.1],
             'reg_all': [0.1,0.2,0.3]}
gs_model2 = GridSearchCV(SVD,param_grid=param_grid,joblib_verbose=5)
gs_model2.fit(train_data)

In [None]:
gs_model2.best_params

In [None]:

cross_validate(SVD(n_factors=75,n_epochs=45,lr_all=0.01,reg_all=0.1), train_data, measures=['RMSE', 'MAE'], 
               cv=5, verbose=True)

In [None]:
param_grid = {'n_factors':[75, 100, 125],'n_epochs': [45, 60, 75], 'lr_all': [.005, 0.01, .015],
             'reg_all': [.05, 0.1, 0.15]}
gs_model2 = GridSearchCV(SVD,param_grid=param_grid,joblib_verbose=5)
gs_model2.fit(train_data)

In [None]:
gs_model2.best_params

In [None]:
cross_validate(SVD(n_factors=125,n_epochs=45,lr_all=0.015,reg_all=0.1), train_data, measures=['RMSE', 'MAE'], 
               cv=5, verbose=True)

### NMF

The final algorithm we explored in the Surprise library was Non-negative Matrix Factorization (NMF). It is very similar to SVD; the main difference is that the regularized stochastic gradient descent procedure uses a specific step size, chosen so that all user and item factors stay non-negative. While this model can be prone to over-fitting, this can be mitigated by reducing the number of factors.
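
The non-negativity idea can be illustrated with multiplicative updates: each factor is multiplied by a ratio of non-negative quantities, so a factor that starts positive can never turn negative. The sketch below is a simplified illustration under that assumption and is not Surprise's implementation; the function name, the initialization range and the toy ratings are hypothetical.

```python
import random


def train_nmf(ratings, n_factors=2, n_epochs=50, reg=0.06, seed=0):
    """Factorize (user, item, rating) triples with non-negative factors."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Strictly positive starting values keep every later ratio well defined.
    p = {u: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for i in items}
    n_user = {u: sum(1 for x, _, _ in ratings if x == u) for u in users}
    n_item = {i: sum(1 for _, x, _ in ratings if x == i) for i in items}
    for _ in range(n_epochs):
        u_num = {u: [0.0] * n_factors for u in users}
        u_den = {u: [0.0] * n_factors for u in users}
        i_num = {i: [0.0] * n_factors for i in items}
        i_den = {i: [0.0] * n_factors for i in items}
        for u, i, r in ratings:
            est = sum(a * b for a, b in zip(p[u], q[i]))
            for f in range(n_factors):
                u_num[u][f] += q[i][f] * r
                u_den[u][f] += q[i][f] * est
                i_num[i][f] += p[u][f] * r
                i_den[i][f] += p[u][f] * est
        # Multiplicative updates: scaling by a positive ratio preserves the sign.
        for u in users:
            for f in range(n_factors):
                p[u][f] *= u_num[u][f] / (u_den[u][f] + n_user[u] * reg * p[u][f])
        for i in items:
            for f in range(n_factors):
                q[i][f] *= i_num[i][f] / (i_den[i][f] + n_item[i] * reg * q[i][f])
    return p, q


toy = [("u1", "A", 5.0), ("u1", "B", 3.0), ("u2", "A", 4.0),
       ("u2", "C", 2.0), ("u3", "B", 4.0), ("u3", "C", 1.0)]
user_factors, item_factors = train_nmf(toy)
smallest = min(v for vec in list(user_factors.values()) + list(item_factors.values())
               for v in vec)
print(smallest >= 0)
```

Fewer factors give the model less room to memorize individual ratings, which is why the grid searches below try small values of `n_factors`.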

In [77]:
model3 = NMF()

In [78]:
cross_validate(model3, train_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9296  0.9468  0.9401  0.9397  0.9419  0.9396  0.0056  
MAE (testset)     0.7139  0.7268  0.7251  0.7211  0.7213  0.7216  0.0045  
Fit time          3.35    3.33    3.27    3.37    3.29    3.32    0.04    
Test time         0.05    0.06    0.10    0.06    0.06    0.07    0.02    


{'test_rmse': array([0.92961726, 0.94675717, 0.94010792, 0.9396986 , 0.94188696]),
 'test_mae': array([0.7138535 , 0.72682425, 0.72512208, 0.72110739, 0.72131874]),
 'fit_time': (3.3544530868530273,
  3.327238082885742,
  3.2722790241241455,
  3.373772144317627,
  3.289644956588745),
 'test_time': (0.054811954498291016,
  0.05837702751159668,
  0.09907007217407227,
  0.05770421028137207,
  0.055481910705566406)}

In [79]:
param_grid = {'n_factors':[10,15,20],'n_epochs': [25, 50, 75]}
gs_model3 = GridSearchCV(NMF,param_grid=param_grid,joblib_verbose=5)
gs_model3.fit(train_data)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    4.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    5.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:  2.5min finished


In [80]:
gs_model3.best_params

{'rmse': {'n_factors': 20, 'n_epochs': 50},
 'mae': {'n_factors': 20, 'n_epochs': 50}}

In [81]:
cross_validate(NMF(n_factors=10,n_epochs=25), train_data, measures=['RMSE', 'MAE'], 
               cv=5, verbose=True,  n_jobs= -2)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9810  0.9689  0.9558  0.9615  0.9850  0.9704  0.0111  
MAE (testset)     0.7430  0.7350  0.7246  0.7260  0.7460  0.7349  0.0086  
Fit time          1.47    1.50    1.49    1.49    1.52    1.49    0.01    
Test time         0.07    0.07    0.07    0.06    0.05    0.07    0.01    


{'test_rmse': array([0.98104052, 0.96888011, 0.95581827, 0.96153355, 0.98495958]),
 'test_mae': array([0.74296638, 0.73500565, 0.72464844, 0.72603802, 0.746013  ]),
 'fit_time': (1.474731683731079,
  1.5044727325439453,
  1.4921081066131592,
  1.485633134841919,
  1.5155940055847168),
 'test_time': (0.07366108894348145,
  0.06979990005493164,
  0.06772708892822266,
  0.06454992294311523,
  0.052494049072265625)}

In [82]:
param_grid = {'n_factors':[3,5,10],'n_epochs': [15, 20, 25]}
gs_model3 = GridSearchCV(NMF,param_grid=param_grid,joblib_verbose=5)
gs_model3.fit(train_data)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   49.4s finished


In [83]:
gs_model3.best_params

{'rmse': {'n_factors': 10, 'n_epochs': 25},
 'mae': {'n_factors': 10, 'n_epochs': 25}}

### Final Model Evaluation

In [84]:
final_model = SVD(n_factors=125,n_epochs=45,lr_all=0.015,reg_all=0.1)

In [85]:
predictions = final_model.fit(train_data2).test(test_data2)

In [86]:
accuracy.rmse(predictions)

RMSE: 0.8609


0.8608899210043914

## Deployment

In [87]:
#create trainset object from surprise dataset
full_data = complete_data.build_full_trainset()

In [88]:
final_model.fit(full_data)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb5938426d0>

#### Function Building

In [89]:
#setting index to ensure dropping
userdf = movie_rating.set_index('userId')

In [90]:
userdf

Unnamed: 0_level_0,movieId,genres,Title,year_released,rating,timestamp
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1.0,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,4.0,2000-07-30 18:45:03
5.0,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,4.0,1996-11-08 06:36:02
7.0,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,4.5,2005-01-25 06:52:26
15.0,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,2.5,2017-11-13 12:59:30
17.0,1,"adventure,animation,children,comedy,fantasy",Toy Story,1995,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...
184.0,193581,"action,animation,comedy,fantasy",Black Butler: Book of the Atlantic,2017,4.0,2018-09-16 14:44:42
184.0,193583,"animation,comedy,fantasy",No Game No Life: Zero,2017,3.5,2018-09-16 14:52:25
184.0,193585,drama,Flint,2017,3.5,2018-09-16 14:56:45
184.0,193587,"action,animation",Bungo Stray Dogs: Dead Apple,2018,3.5,2018-09-16 15:00:21


In [91]:
#create a Series of movieIds indexed by userId
userdf1 = userdf['movieId']

In [40]:
userdf1.loc[9,]

userId
9.0      41
9.0     187
9.0     223
9.0     371
9.0     627
9.0     922
9.0     923
9.0    1037
9.0    1095
9.0    1198
9.0    1270
9.0    1674
9.0    1987
9.0    2011
9.0    2012
9.0    2023
9.0    2300
9.0    2877
9.0    2901
9.0    3173
9.0    3328
9.0    3735
9.0    4131
9.0    4558
9.0    4993
9.0    5218
9.0    5378
9.0    5445
9.0    5447
9.0    5451
9.0    5481
9.0    5507
9.0    5841
9.0    5843
9.0    5872
9.0    5890
9.0    5891
9.0    5893
9.0    5902
9.0    5952
9.0    5956
9.0    5962
9.0    5965
9.0    5988
9.0    6001
9.0    6044
Name: movieId, dtype: int64

In [41]:
user_rate = list(userdf1.loc[9,])

In [42]:
user_rate

[41,
 187,
 223,
 371,
 627,
 922,
 923,
 1037,
 1095,
 1198,
 1270,
 1674,
 1987,
 2011,
 2012,
 2023,
 2300,
 2877,
 2901,
 3173,
 3328,
 3735,
 4131,
 4558,
 4993,
 5218,
 5378,
 5445,
 5447,
 5451,
 5481,
 5507,
 5841,
 5843,
 5872,
 5890,
 5891,
 5893,
 5902,
 5952,
 5956,
 5962,
 5965,
 5988,
 6001,
 6044]

In [43]:
#setting index so we can drop rows based on specific movieIds
moviedf = movies.set_index('movieId')

In [44]:
moviedf

Unnamed: 0_level_0,genres,Title,year_released
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
2,Adventure|Children|Fantasy,Jumanji,1995
3,Comedy|Romance,Grumpier Old Men,1995
4,Comedy|Drama|Romance,Waiting to Exhale,1995
5,Comedy,Father of the Bride Part II,1995
...,...,...,...
193581,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic,2017
193583,Animation|Comedy|Fantasy,No Game No Life: Zero,2017
193585,Drama,Flint,2017
193587,Action|Animation,Bungo Stray Dogs: Dead Apple,2018


In [45]:
moviedf.drop(user_rate, inplace = True)

In [46]:
moviedf

movieId,genres,Title,year_released
1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
2,Adventure|Children|Fantasy,Jumanji,1995
3,Comedy|Romance,Grumpier Old Men,1995
4,Comedy|Drama|Romance,Waiting to Exhale,1995
5,Comedy,Father of the Bride Part II,1995
...,...,...,...
193581,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic,2017
193583,Animation|Comedy|Fantasy,No Game No Life: Zero,2017
193585,Drama,Flint,2017
193587,Action|Animation,Bungo Stray Dogs: Dead Apple,2018


In [47]:
#reset index to be able to use .apply on the movieId
moviedf = moviedf.reset_index()

In [48]:
moviedf

Unnamed: 0,movieId,genres,Title,year_released
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Adventure|Children|Fantasy,Jumanji,1995
2,3,Comedy|Romance,Grumpier Old Men,1995
3,4,Comedy|Drama|Romance,Waiting to Exhale,1995
4,5,Comedy,Father of the Bride Part II,1995
...,...,...,...,...
9686,193581,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic,2017
9687,193583,Animation|Comedy|Fantasy,No Game No Life: Zero,2017
9688,193585,Drama,Flint,2017
9689,193587,Action|Animation,Bungo Stray Dogs: Dead Apple,2018


In [49]:
moviedf['est_rating'] = moviedf['movieId'].apply(lambda x: final_model.predict(9, x).est)
moviedf.sort_values(by='est_rating', ascending=False, inplace=True)

In [56]:
moviedf.head(5)

Unnamed: 0,movieId,genres,Title,year_released,est_rating
7656,89904,Comedy|Drama|Romance,The Artist,2011,4.780234
4014,5747,Drama|War,Gallipoli,1981,4.705655
4344,6442,Comedy|Romance,Belle époque,1992,4.634932
5574,27156,Action|Animation|Drama|Fantasy|Sci-Fi,Neon Genesis Evangelion: The End of Evangelion,Shin seiki Evangelion Gekijô-ban: Air/Magokoro...,4.620769
4458,6666,Comedy|Drama|Fantasy,"Discreet Charm of the Bourgeoisie, The","Charme discret de la bourgeoisie, Le (1972",4.592631


#### Recommender Function

In [60]:
#creating a dataframe that holds only each userId and the movieIds they've rated
#this uses the original train dataset, not surprise's version
users_movies_seen = movie_rating[['movieId', 'userId']]
#setting userId as the index to make it easier to use .loc on it
users_movies_seen = users_movies_seen.set_index('userId')

In [63]:
def recommender():
    user = float(input('userId: '))
    genre = input('What genres are you interested in? ')
    num_recs = int(input('How many recomendations would you like? '))

    #using the train data, list the movies this user has already rated
    seen_movie = list(users_movies_seen.loc[user, 'movieId'])

    #copy the movies table and index it by movieId so rows can be dropped by id
    train_copy = movies.copy()
    train_copy2 = train_copy.set_index('movieId')

    #drop the rated movies, leaving a df of only the unseen/unrated movies
    not_seen = train_copy2.drop(index = seen_movie)

    #subsetting the df to unseen movies of a certain genre
    #str.contains treats genre as a regex, so 'Comedy|Romance' matches either genre
    not_seen = not_seen[not_seen['genres'].str.contains(genre)]
    
    #reset the index so movieId is a column again
    not_seen = not_seen.reset_index()
    
    #estimate a rating for each unseen movie, then sort from highest to lowest
    not_seen['est_rating'] = not_seen['movieId'].apply(lambda x: final_model.predict(user, x).est)
    not_seen.sort_values(by='est_rating', ascending=False, inplace=True)

    #return the num_recs movies with the highest predicted ratings
    return (not_seen.head(num_recs))
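
Because `recommender()` reads its arguments with `input()`, it is awkward to call from other code or to test. The sketch below is one possible refactor, not the notebook's implementation: the hypothetical `recommend_for` takes the user, genre, and count as parameters and receives the data and model explicitly. `StubModel` and its `Prediction` tuple are made-up stand-ins that only mimic the `.predict(uid, iid).est` access used above, so the example runs without a fitted model. It also matches the genre as literal text, which differs from the regex matching in `recommender()`.

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for a fitted model: returns an object with an `.est` field.
Prediction = namedtuple("Prediction", ["uid", "iid", "est"])

class StubModel:
    def predict(self, uid, iid):
        # Made-up score so the example is deterministic; not a real estimate.
        return Prediction(uid, iid, est=1.0 + (iid % 5))

def recommend_for(user_id, genre, num_recs, movies, ratings, model):
    """Return the num_recs unrated movies of a genre with the highest estimates."""
    seen = ratings.loc[ratings["userId"] == user_id, "movieId"]
    candidates = movies[~movies["movieId"].isin(seen)]
    # regex=False treats `genre` as literal text rather than a pattern.
    candidates = candidates[candidates["genres"].str.contains(genre, regex=False)].copy()
    candidates["est_rating"] = [model.predict(user_id, m).est for m in candidates["movieId"]]
    return candidates.sort_values("est_rating", ascending=False).head(num_recs)

# Toy data, made up for illustration.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": ["Comedy", "Comedy|Romance", "Drama", "Comedy|Horror"],
})
toy_ratings = pd.DataFrame({"userId": [3.0, 3.0], "movieId": [1, 3]})

top = recommend_for(3.0, "Comedy", 2, toy_movies, toy_ratings, StubModel())
```

With the real objects, the call would presumably pass `movies`, `movie_rating`, and `final_model` in place of the toy arguments.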

In [67]:
movies

Unnamed: 0,movieId,genres,Title,year_released
0,1,Adventure|Animation|Children|Comedy|Fantasy,Toy Story,1995
1,2,Adventure|Children|Fantasy,Jumanji,1995
2,3,Comedy|Romance,Grumpier Old Men,1995
3,4,Comedy|Drama|Romance,Waiting to Exhale,1995
4,5,Comedy,Father of the Bride Part II,1995
...,...,...,...,...
9737,193581,Action|Animation|Comedy|Fantasy,Black Butler: Book of the Atlantic,2017
9738,193583,Animation|Comedy|Fantasy,No Game No Life: Zero,2017
9739,193585,Drama,Flint,2017
9740,193587,Action|Animation,Bungo Stray Dogs: Dead Apple,2018


In [66]:
recommender()

userId: 3
What genres are you interested in? Comedy|Romance
How many recomendations would you like? 5


Unnamed: 0,movieId,genres,Title,year_released,est_rating
4166,141718,Comedy|Horror,Deathgasm,2015,3.70497
2535,25947,Comedy,Unfaithfully Yours,1948,3.641275
3528,87234,Comedy|Drama|Romance,Submarine,2010,3.61342
3631,93988,Drama|Romance,North & South,2004,3.590857
1668,4914,Crime|Drama|Romance,Breathless,À bout de souffle (1960,3.579387
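
The results above include titles tagged `Comedy|Horror` and `Drama|Romance` because `Series.str.contains` treats its pattern as a regular expression by default, and in a regular expression `|` means "or". The input `Comedy|Romance` therefore matches any movie that is a comedy or a romance. If the intent is to require both genres (an assumption about intent), one option is to split the request and test each genre as literal text. The sketch below shows both behaviours on made-up genre strings.

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated layout as the `genres` column.
genres = pd.Series(["Comedy|Horror", "Comedy|Drama|Romance", "Drama|Romance", "Action"])

# Default behaviour: the pattern is a regex, so "|" is alternation (Comedy OR Romance).
either = genres.str.contains("Comedy|Romance")

# Requiring every requested genre: check each one as literal text and combine with AND.
wanted = "Comedy|Romance".split("|")
both = pd.Series(True, index=genres.index)
for g in wanted:
    both &= genres.str.contains(g, regex=False)
```

Here `either` selects the first three rows, while `both` selects only `Comedy|Drama|Romance`.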
