# Building a Recommendation System

To build this model, we'll use `surprise`, a Python [scikit](https://www.scipy.org/scikits.html) for building and analyzing recommender systems that deal with explicit rating data.

You can read more about the package in the [documentation](http://surpriselib.com/).

In [11]:
# ! pip install scikit-surprise

In [1]:
import surprise
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate
from surprise.prediction_algorithms import SVD, SVDpp, knns, KNNWithMeans, KNNBasic, KNNBaseline
from surprise.similarities import cosine, msd, pearson
from surprise import Dataset, accuracy
import pandas as pd
import numpy as np

In [2]:
# load the MovieLens ratings dataset

# data = Dataset.load_builtin('ml-100k')
df = pd.read_csv('./data/ratings.csv')

In [3]:
# drop unnecessary columns
new_df = df.drop(columns='timestamp')


## Transforming Data

It's now time to transform the dataset into something compatible with `surprise`. To do this, we need the `Reader` and `Dataset` classes. `Dataset` has a method specifically for loading DataFrames, `load_from_df`.


In [4]:
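from surprise import Reader, Dataset
# Reader() assumes ratings on a 1-5 scale unless rating_scale is given
reader = Reader()
# load_from_df expects the columns in the order: user id, item id, rating
data = Dataset.load_from_df(new_df, reader)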

Let's look at how many users and items we have in our dataset. If we use neighborhood-based methods, these counts will help us decide between user-user and item-item similarity.

In [5]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724
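To make the user-user versus item-item trade-off concrete, the sketch below compares the size of the similarity matrix each option would need for the counts printed above. The helper name `similarity_matrix_entries` is hypothetical, and the comparison assumes a dense square similarity matrix, which is a simplification.

```python
# Hypothetical helper: number of entries in a dense square similarity matrix.
def similarity_matrix_entries(n):
    return n * n

n_users, n_items = 610, 9724  # counts printed by the cell above

user_entries = similarity_matrix_entries(n_users)
item_entries = similarity_matrix_entries(n_items)

# The smaller side gives the cheaper similarity matrix.
user_based = user_entries <= item_entries

print('user-user entries:', user_entries)
print('item-item entries:', item_entries)
print('user_based =', user_based)
```

With far fewer users than items, user-user similarity is the cheaper option for this dataset.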


# Modeling

### Singular Value Decomposition (SVD) Model

With SVD, we turn the recommendation problem into an optimization problem: how well can we predict the rating a given user would give an item? A common metric for this is root mean square error (RMSE); a lower RMSE indicates better performance. RMSE is minimized on the known entries of the utility matrix. SVD also yields the low-rank approximation with the minimal reconstruction sum of squared errors (SSE), which is why it is also commonly used for dimensionality reduction.
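As a rough illustration of the optimization target, the sketch below computes biased matrix-factorization predictions of the form `mu + b_u + b_i + q_i · p_u` and the RMSE over known ratings only. All numbers and names (`mu`, `b_u`, `b_i`, `p`, `q`, `known`) are made up for illustration; this is not the `surprise` implementation, which learns these values from the training data.

```python
import math

# Toy, hand-picked parameters (assumptions for illustration only).
mu = 3.5                                        # global mean rating
b_u = {'alice': 0.2, 'bob': -0.3}               # user biases
b_i = {'m1': 0.1, 'm2': -0.2}                   # item biases
p = {'alice': [0.5, 0.1], 'bob': [-0.2, 0.4]}   # user factors
q = {'m1': [0.3, -0.1], 'm2': [0.2, 0.6]}       # item factors

def predict(user, item):
    """Estimated rating: mu + b_u + b_i + dot(q_i, p_u)."""
    dot = sum(pf * qf for pf, qf in zip(p[user], q[item]))
    return mu + b_u[user] + b_i[item] + dot

# Only the known entries of the utility matrix enter the error.
known = [('alice', 'm1', 4.0), ('alice', 'm2', 3.5), ('bob', 'm2', 3.0)]

def rmse(ratings):
    """Root mean square error between true and predicted ratings."""
    squared = [(r - predict(u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

print(round(predict('alice', 'm1'), 3))
print(round(rmse(known), 3))
```

Training amounts to adjusting the biases and factors so that this RMSE shrinks, with regularization (the `reg_all` parameter in `surprise`) discouraging large parameter values.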

### Grid Search on SVD

In [6]:
%%time
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

CPU times: user 4.27 s, sys: 347 ms, total: 4.62 s
Wall time: 39.8 s


In [7]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.869098218320201, 'mae': 0.6682080150046845}
{'rmse': {'n_factors': 100, 'reg_all': 0.05}, 'mae': {'n_factors': 100, 'reg_all': 0.05}}
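For intuition about what the grid search evaluates, the sketch below expands the same `params` dictionary into its individual combinations with `itertools.product`. It only enumerates the candidates and does not train anything; the variable names `names` and `combos` are illustrative.

```python
from itertools import product

params = {'n_factors': [20, 50, 100],
          'reg_all': [0.02, 0.05, 0.1]}

# Every combination of one value per hyperparameter.
names = list(params)
combos = [dict(zip(names, values)) for values in product(*params.values())]

for combo in combos:
    print(combo)

print(len(combos), 'combinations')
```

Each combination is cross-validated, so the number of model fits is the number of combinations times the number of folds.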


Even with a grid search, the best SVD model reaches an RMSE of about 0.869. Let's see if another model can improve on that score.

Next, let's cross-validate a KNN model.

In [9]:
# cross validating with KNNBasic
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [10]:
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([0.9678457 , 0.97839148, 0.96996964, 0.97358059, 0.96586955]))
('test_mae', array([0.74840297, 0.75563622, 0.74893211, 0.75109506, 0.74412268]))
('fit_time', (0.5178658962249756, 0.5363750457763672, 0.5197341442108154, 0.5028798580169678, 0.4622361660003662))
('test_time', (1.6939599514007568, 1.6690781116485596, 1.6748919486999512, 1.6828718185424805, 1.6874818801879883))
-----------------------
0.9711313922230905
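To show what the `pearson` similarity option measures, here is a minimal pure-Python sketch of the Pearson correlation between two users over the items both have rated. The function name `pearson_sim` and the toy ratings are hypothetical, and `surprise` computes this internally with its own conventions, so treat this as an illustration rather than the library's code.

```python
import math

def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)

# Toy users keyed by movieId (made-up ratings).
alice = {1: 5.0, 2: 3.0, 3: 4.0}
bob = {1: 4.0, 2: 2.0, 3: 3.0, 4: 5.0}   # same ordering as alice, one point lower
carol = {1: 1.0, 2: 3.0, 3: 2.0}         # opposite ordering to alice

print(pearson_sim(alice, bob))
print(pearson_sim(alice, carol))
```

Because each user's ratings are centered on their own mean, a user who consistently rates one point lower than another still counts as highly similar.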


Let's try KNNBaseline now.

In [11]:
# cross validating with KNNBaseline
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [12]:
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([0.87815233, 0.87536537, 0.87740202, 0.87300185, 0.87649267]))
('test_mae', array([0.67043078, 0.66889433, 0.66953151, 0.66893167, 0.67039132]))
('fit_time', (0.756706953048706, 0.769845724105835, 0.7901058197021484, 0.8651351928710938, 0.6298890113830566))
('test_time', (1.6798529624938965, 1.7510740756988525, 1.6572132110595703, 2.2883729934692383, 1.7342009544372559))


0.8760828469416297

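Based on these outputs, the best-performing model is SVD. The grid search selected `n_factors = 100` and `reg_all = 0.05` (RMSE of about 0.869), ahead of KNNBaseline (about 0.876) and KNNBasic (about 0.971). The cells below fit SVD with `n_factors = 50` and the same regularization; use that model or, if you found one that performs better, feel free to use it to make some predictions.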
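The comparison can be summarized in a few lines. The sketch below hard-codes the mean RMSE values printed earlier in this notebook (rounded to four decimals) and picks the lowest; the names `mean_rmse` and `best_model` are just for illustration.

```python
# Mean RMSE values copied from the outputs above (rounded).
mean_rmse = {
    'SVD (grid search best)': 0.8691,
    'KNNBasic': 0.9711,
    'KNNBaseline': 0.8761,
}

# Lower RMSE is better, so take the minimum.
best_model = min(mean_rmse, key=mean_rmse.get)

for name, score in sorted(mean_rmse.items(), key=lambda kv: kv[1]):
    print(f'{name}: {score:.4f}')
print('best:', best_model)
```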


## Making Predictions

In [13]:
df_movies = pd.read_csv('./data/movies.csv')

In [15]:
df_movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


## Simple Predictions

First, we'll fit the SVD model we had from before. 

In [16]:
svd = SVD(n_factors= 50, reg_all=0.05)

In [17]:
%%time

svd.fit(dataset)

CPU times: user 3.18 s, sys: 38.6 ms, total: 3.22 s
Wall time: 3.26 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb725478460>

In [18]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=3.1097730823247884, details={'was_impossible': False})

The returned `Prediction` is a named tuple, so each of its values can be accessed by index or by field name. The estimated rating, for example, is at index 3 and is also available as `.est`.
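The sketch below mimics the structure of the prediction shown above with a stand-in `namedtuple`, so the access styles can be compared without fitting a model. The stand-in is defined here for illustration and is filled with the values from that output; it is not imported from `surprise`.

```python
from collections import namedtuple

# Stand-in with the same field names as the printed prediction (illustration only).
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

pred = Prediction(uid=2, iid=4, r_ui=None, est=3.1097730823247884,
                  details={'was_impossible': False})

print(pred[3])      # access by position
print(pred.est)     # access by field name
uid, iid, r_ui, est, details = pred   # tuple unpacking also works
```

Access by name tends to be easier to read than a bare index when the value is used far from where the prediction was made.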

Next, we can make predictions for a new user.

### Obtaining User Ratings

The first step is to create a function that allows us to pick randomly selected movies. The function should present users with a movie and ask them to rate it. If they have not seen the movie, they should be able to skip rating it.

The function `movie_rater()` takes the parameters:

- `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
- `num`: int - number of ratings
- `genre`: string - a specific genre from which to draw movies

The function returns:
- `rating_list`: list - a collection of dictionaries in the format of {'userId': int , 'movieId': int , 'rating': float}


In [19]:
def movie_rater(movie_df, num, genre=None):
    userID = 1000  # id for the new user
    rating_list = []
    while num > 0:
        # sample one movie, optionally restricted to a genre
        if genre:
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print(movie)
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            # skipped movies do not count toward num
            continue
        # input() returns a string, so convert the rating to a float before storing it
        rating_one_movie = {'userId': userID, 'movieId': movie['movieId'].values[0], 'rating': float(rating)}
        rating_list.append(rating_one_movie)
        num -= 1
    return rating_list

In [20]:
user_rating = movie_rater(df_movies, 4, 'Comedy')

      movieId                    title  genres
1680     2261  One Crazy Summer (1986)  Comedy
      movieId                                title                      genres
7929    95543  Ice Age 4: Continental Drift (2012)  Adventure|Animation|Comedy
      movieId                      title          genres
3698     5102  Rookie of the Year (1993)  Comedy|Fantasy
      movieId                         title            genres
3673     5060  M*A*S*H (a.k.a. MASH) (1970)  Comedy|Drama|War
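Because `input()` returns a string, it can help to validate the answer before storing it. The helper below is a hypothetical addition (the name `parse_rating` and the 1-5 bounds are assumptions based on the prompt text): it returns `None` for a skip, a float for a valid rating, and raises `ValueError` otherwise.

```python
def parse_rating(text, low=1.0, high=5.0):
    """Turn raw user input into a rating, or None if the movie is skipped."""
    text = text.strip().lower()
    if text == 'n':
        return None
    value = float(text)   # raises ValueError for non-numeric input
    if not low <= value <= high:
        raise ValueError(f'rating must be between {low} and {high}')
    return value

print(parse_rating('4'))
print(parse_rating(' n '))
```

A rating loop could call such a helper inside a `try`/`except ValueError` block and ask again when the input is invalid.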


## Making Predictions with the New Ratings

In [22]:
## add the new ratings to the original ratings DataFrame
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
new_ratings_df = pd.concat([new_df, pd.DataFrame(user_rating)], ignore_index=True)
new_data = Dataset.load_from_df(new_ratings_df, reader)

In [23]:
%%time
# train a model using the new combined DataFrame
svd_ = SVD(n_factors= 50, reg_all=0.05)
svd_.fit(new_data.build_full_trainset())

CPU times: user 3.26 s, sys: 46.1 ms, total: 3.3 s
Wall time: 3.34 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb7396f5e20>

In [24]:
# make predictions for the user
# creating a list of tuples in the format (movie_id, predicted_score)
list_of_movies = []
for m_id in new_df['movieId'].unique():
    # .est is the estimated rating (index 3 of the Prediction tuple)
    list_of_movies.append((m_id, svd_.predict(1000, m_id).est))

In [25]:
# order the predictions from highest to lowest rated
ranked_movies = sorted(list_of_movies, key=lambda x:x[1], reverse=True)
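Sorting every prediction is fine at this scale, but when only the top few are needed, `heapq.nlargest` avoids a full sort. The sketch below also skips movies the user has already rated, which the cells above do not do; the data and the names `scored` and `already_rated` are made up for illustration.

```python
import heapq

# (movie_id, predicted_score) pairs, as in list_of_movies (toy values).
scored = [(10, 3.9), (20, 4.6), (30, 4.1), (40, 4.8), (50, 2.7)]
already_rated = {40}   # movieIds the new user rated themselves

# Drop movies the user has already rated.
candidates = [pair for pair in scored if pair[0] not in already_rated]

# Top 3 by predicted score, highest first.
top_3 = heapq.nlargest(3, candidates, key=lambda pair: pair[1])
print(top_3)
```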

The function `recommended_movies()` takes the parameters:

- `user_ratings`: list - list of tuples formatted as (movie_id, predicted_score), ordered from best to worst for this individual
- `movie_title_df`: DataFrame - a dataframe containing the movie ids and titles
- `n`: int - number of recommended movies

In [26]:
# return the top n recommendations
def recommended_movies(user_ratings, movie_title_df, n):
    for idx, rec in enumerate(user_ratings):
        title = movie_title_df.loc[movie_title_df['movieId'] == int(rec[0])]['title']
        print('Recommendation # ', idx+1, ': ', title, '\n')
        n -= 1
        if n == 0:
            break

recommended_movies(ranked_movies, df_movies, 5)

Recommendation #  1 :  4909    Eternal Sunshine of the Spotless Mind (2004)
Name: title, dtype: object 

Recommendation #  2 :  602    Dr. Strangelove or: How I Learned to Stop Worr...
Name: title, dtype: object 

Recommendation #  3 :  277    Shawshank Redemption, The (1994)
Name: title, dtype: object 

Recommendation #  4 :  906    Lawrence of Arabia (1962)
Name: title, dtype: object 

Recommendation #  5 :  901    Brazil (1985)
Name: title, dtype: object 
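The output above includes the row index and `Name: title, dtype: object` because each `title` is a one-element Series. One way to get plain strings is to look titles up in a dictionary instead. The function `top_n_titles` and the toy data below are hypothetical; with the notebook's data, the mapping could be built with `dict(zip(df_movies['movieId'], df_movies['title']))`.

```python
def top_n_titles(ranked, id_to_title, n):
    """Return the titles of the first n (movie_id, score) pairs as strings."""
    titles = []
    for movie_id, _score in ranked[:n]:
        # Fall back to a placeholder if a movieId has no title entry.
        titles.append(id_to_title.get(int(movie_id), f'<unknown id {movie_id}>'))
    return titles

# Toy stand-ins for ranked_movies and the movieId -> title mapping.
ranked = [(2, 4.7), (1, 4.5), (3, 3.2)]
id_to_title = {1: 'Toy Story (1995)', 2: 'Jumanji (1995)', 3: 'Grumpier Old Men (1995)'}

for idx, title in enumerate(top_n_titles(ranked, id_to_title, 2), start=1):
    print(f'Recommendation # {idx}: {title}')
```

Returning a list rather than printing inside the function also makes the recommendations easier to reuse elsewhere.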

