## MOVIE RECOMMENDER SYSTEM - COLLABORATIVE FILTERING

#### SREENATH S

**NOTE: It is assumed that all the required input files are present in the same folder as this notebook.**

This notebook is part of the Movie Recommendation System project. It implements collaborative filtering in the following steps:

1. Load the user rating datasets (train set and test set).
2. Load the movie metadata dataset.
3. Create a user-item ratings matrix from the train dataset.
4. Perform Singular Value Decomposition (SVD) on the ratings matrix. The number of latent factors is configurable through the config file. Store the resulting user_matrix, sigma, and item_matrix.
5. Construct the prediction matrix as follows: predicted_rating = (user_matrix) x (sigma) x (item_matrix)
6. Convert the prediction matrix to a DataFrame with userId as columns and imdbId as index.
7. Provide a class CollabFiltering, which returns a list of recommendations for a given user based on the similarity score.

**The notebook from the walkthrough session is used as the base version; changes are made on top of it as required.**

Load the required packages.

Also load the required data from configs.

In [1]:
import import_ipynb
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from configs import NUM_LATENT_FACTORS,MODEL_COLLAB

Load the other modules created as part of the Movie Recommendation System. The modules of interest are:
1. MovieRecommender_TrainTestDataGenerator
2. MovieRecommeder_evaluations

Note that output is suppressed for this cell with `%%capture`; otherwise the print statements from these modules would be shown.

In [2]:
%%capture
import MovieRecommender_TrainTestDataGenerator as RecDataGenerator
import MovieRecommeder_evaluations as evaluations 

Retrieve the train and test datasets from the MovieRecommender_TrainTestDataGenerator module.

In [3]:
user_ratings_train_df, user_ratings_test_df = RecDataGenerator.train_test_user_behaviour()

Print the first rows to confirm that the data loaded successfully.

In [4]:
user_ratings_train_df.head()

Unnamed: 0,userId,imdbId,rating
2457,73,112864,3.5
49661,472,120906,5.0
48470,584,119116,3.5
33782,18,117913,4.0
80144,614,80549,2.0


Retrieve the movie_metadata dataset from MovieRecommender_TrainTestDataGenerator

In [5]:
movie_meta_data = RecDataGenerator.get_movie_metadata()
movie_meta_data.shape

(8989, 9)

In [6]:
movie_meta_data.head()

Unnamed: 0,original_language,original_title,title,overview,movie_genre,movie_production,movie_keywords,spoken_language,imdbId
0,en,Toy Story,Toy Story,"Led by Woody, Andy's toys live happily in his ...",Animation Comedy Family,Pixar Animation Studios,jealousy toy boy friendship friends rivalry bo...,en,114709
1,en,Jumanji,Jumanji,When siblings Judy and Peter discover an encha...,Adventure Fantasy Family,"TriStar Pictures,Teitler Film,Interscope Commu...",board game disappearance based on children's b...,en fr,113497
2,en,Grumpier Old Men,Grumpier Old Men,A family wedding reignites the ancient feud be...,Romance Comedy,"Warner Bros.,Lancaster Gate",fishing best friend duringcreditsstinger old men,en,113228
3,en,Waiting to Exhale,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Comedy Drama Romance,Twentieth Century Fox Film Corporation,based on novel interracial relationship single...,en,114885
4,en,Father of the Bride Part II,Father of the Bride Part II,Just when George Banks has recovered from his ...,Comedy,"Sandollar Productions,Touchstone Pictures",baby midlife crisis confidence aging daughter ...,en,113041


One of the main steps in collaborative filtering is to create a user interaction matrix. Here we create a user-movie interaction matrix: the index of this DataFrame holds the userIds, the columns hold the movie ids (imdbIds), and the values are the ratings. Every user-movie combination without an interaction is filled with 0.

In [7]:
# generate a pivot table (mostly zeros) with users in rows and items in columns
users_items_matrix_df = user_ratings_train_df.pivot(index='userId', columns='imdbId', values='rating').fillna(0)
users_items_matrix_df.head(5)

userId \ imdbId,417,4972,6333,8133,9018,9932,10323,11237,11439,11984,...,4537842,4831420,5022702,5161502,5165344,5278868,5480340,5595168,5613056,5794766
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
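
As a small illustration of the pivot step, the sketch below builds the same kind of user-item matrix from a toy ratings table. The data and the names `toy_ratings`, `toy_matrix`, and `toy_matrix_dedup` are made up for illustration. One point worth knowing: `DataFrame.pivot` raises a `ValueError` when a (userId, imdbId) pair appears more than once, so `pivot_table` with an aggregation is shown as one possible fallback.

```python
import pandas as pd

# Toy ratings (hypothetical values, not taken from the project data)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "imdbId": [114709, 113497, 114709, 113228],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Rows are users, columns are movies; unrated pairs become 0
toy_matrix = toy_ratings.pivot(index="userId", columns="imdbId", values="rating").fillna(0)
print(toy_matrix)

# With a duplicated (userId, imdbId) pair, pivot would raise a ValueError;
# pivot_table averages the duplicates instead
dup = pd.concat([toy_ratings, toy_ratings.iloc[[0]]], ignore_index=True)
toy_matrix_dedup = dup.pivot_table(
    index="userId", columns="imdbId", values="rating", aggfunc="mean"
).fillna(0)
print(toy_matrix_dedup)
```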


To perform SVD, we first extract the underlying NumPy array from the DataFrame.

In [8]:
users_items_pivot_matrix = users_items_matrix_df.values
users_items_pivot_matrix[:1]

array([[0., 0., 0., ..., 0., 0., 0.]])

Confirm the shape of the dataset before and after conversion.

In [9]:
users_items_matrix_df.shape

(671, 8500)

In [10]:
users_items_pivot_matrix.shape

(671, 8500)

Collect the user ids and create the sparse (CSR) data matrix.

In [11]:
users_ids = list(users_items_matrix_df.index)
print("users_ids:\n",users_ids[:10])

users_ids:
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [12]:
users_items_pivot_csr_matrix = csr_matrix(users_items_matrix_df)
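
To make the benefit of the sparse format concrete, here is a minimal sketch on a toy matrix (the values and the names `dense`, `sparse`, and `density` are illustrative assumptions). A CSR matrix stores only the non-zero entries, and `nnz` gives their count, which makes it easy to check how sparse a ratings matrix is.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item matrix: 3 users x 4 movies, 0 means "not rated"
dense = np.array([
    [4.0, 0.0, 0.0, 3.5],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

# nnz counts the stored (non-zero) entries
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"stored entries: {sparse.nnz}, density: {density:.3f}")
```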

#### Performing matrix factorization on the original user item matrix


In [13]:
# Method     : perform_matrix_factorization
# Input      : num_latent_features; defaults to the NUM_LATENT_FACTORS value given in the configs module
# Description: This method factorizes the user interaction matrix into three matrices of smaller dimension:
#              user_matrix, sigma, item_matrix. svds returns sigma as a 1-D array, so it is converted to a diagonal matrix.

In [14]:
def perform_matrix_factorization(num_latent_features=NUM_LATENT_FACTORS):
    # Perform truncated SVD on the sparse user interaction matrix.
    user_matrix, sigma, item_matrix = svds(users_items_pivot_csr_matrix, k=num_latent_features)
    # svds returns the singular values as a 1-D array; convert them to a diagonal matrix.
    sigma = np.diag(sigma)
    return user_matrix, sigma, item_matrix

In [15]:
user_matrix, sigma, item_matrix = perform_matrix_factorization()

Let us confirm the dimensions of the user matrix and the item matrix.

The user matrix dimension should be (NUMBER OF USERS X NUM LATENT FACTORS).<br>
The item matrix dimension should be (NUM LATENT FACTORS X NUMBER OF MOVIES).

In [16]:
user_matrix.shape

(671, 15)

In [17]:
item_matrix.shape

(15, 8500)

Let us confirm the shape of sigma. It should be a diagonal matrix with shape (NUM LATENT FACTORS X NUM LATENT FACTORS).

In [18]:
sigma.shape

(15, 15)
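
The same shape checks can be reproduced on a toy matrix. The sketch below is illustrative only: the random 6 x 5 matrix and the choice `k = 2` are assumptions, not project data. It shows that `svds` returns the singular values as a 1-D array of length `k`, and that they agree with the `k` largest singular values from a full SVD. The values are sorted before comparing because the order in which `svds` returns them should not be relied on.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
toy = csr_matrix(rng.random((6, 5)))  # float matrix, 6 "users" x 5 "items"

k = 2  # must be smaller than min(toy.shape)
u, s, vt = svds(toy, k=k)
print(u.shape, s.shape, vt.shape)

# s is 1-D; np.diag turns it into a (k, k) diagonal matrix
approx = u @ np.diag(s) @ vt
print(approx.shape)

# Compare with the k largest singular values of a full SVD
full_s = np.linalg.svd(toy.toarray(), compute_uv=False)
print(np.allclose(np.sort(s), np.sort(full_s)[-k:], atol=1e-6))
```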

#### CREATE PREDICTION MATRIX FROM THE SVD OUTPUT

In collaborative filtering, we use the "user matrix", sigma, and the "item matrix" to reconstruct the predicted ratings as follows:
    
    predicted_rating = (user_matrix) x (sigma) x (item_matrix)

Note that the multiplication here is matrix multiplication, not scalar (element-wise) multiplication.

In [19]:
def get_predicted_user_ratings(user_matrix, sigma, item_matrix):
    # Reconstruct the low-rank rating matrix: user_matrix x sigma x item_matrix
    weighted_user_mat = np.dot(user_matrix, sigma)
    all_user_predicted_ratings = np.dot(weighted_user_mat, item_matrix)
    # Min-max scale all predictions to [0, 1] using the global minimum and maximum
    min_rating = all_user_predicted_ratings.min()
    max_rating = all_user_predicted_ratings.max()
    all_user_predicted_ratings_norm = (all_user_predicted_ratings - min_rating) / (max_rating - min_rating)
    return all_user_predicted_ratings_norm
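
To illustrate the matrix product used in the reconstruction, the following sketch compares three equivalent ways of computing it on toy factors (the shapes and values are assumptions chosen for illustration). The chained `np.dot` form matches the function above; the `@` operator is the same matrix product, and scaling the columns of the user matrix by the singular values gives the same result without building a diagonal matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.random((4, 2))      # users x latent factors
s = np.array([3.0, 1.5])    # singular values as a 1-D array
Vt = rng.random((2, 5))     # latent factors x items

via_dot = np.dot(np.dot(U, np.diag(s)), Vt)  # chained np.dot with a diagonal sigma
via_at = U @ np.diag(s) @ Vt                 # matrix-multiplication operator
via_scale = (U * s) @ Vt                     # scale columns of U, no diagonal matrix

print(via_dot.shape)
print(np.allclose(via_dot, via_at), np.allclose(via_dot, via_scale))
```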

Let us create the prediction matrix using the method above.

In [20]:
all_user_predicted_ratings_norm = get_predicted_user_ratings(user_matrix, sigma, item_matrix)
all_user_predicted_ratings_norm[:5]

array([[0.27865233, 0.27825798, 0.27816195, ..., 0.27837707, 0.27731775,
        0.27719756],
       [0.28074198, 0.27934785, 0.27870051, ..., 0.27671085, 0.27810318,
        0.27693948],
       [0.27715329, 0.27796166, 0.27692869, ..., 0.27385508, 0.27579406,
        0.27670744],
       [0.27849427, 0.27751615, 0.27825199, ..., 0.28139348, 0.27920849,
        0.27645293],
       [0.28292936, 0.28185069, 0.2801127 , ..., 0.27767576, 0.27617735,
        0.27673008]])

Explore the prediction matrix: its shape, maximum value, and minimum value.

In [21]:
all_user_predicted_ratings_norm.shape

(671, 8500)

In [22]:
all_user_predicted_ratings_norm.max()

1.0

In [23]:
all_user_predicted_ratings_norm.min()

0.0
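
The effect of the min-max scaling can be checked on a tiny example (the `raw` values below are made up). Because one global minimum and maximum are used, the scaled scores lie in [0, 1] and the ranking of items within each user's row is unchanged, although the scores are no longer on the original rating scale.

```python
import numpy as np

# Toy raw predictions: 2 users x 3 items (hypothetical values)
raw = np.array([
    [-0.5, 0.2, 1.5],
    [0.0, 3.5, 0.7],
])

# Global min-max scaling, same formula as in get_predicted_user_ratings
scaled = (raw - raw.min()) / (raw.max() - raw.min())
print(scaled.min(), scaled.max())

# The per-user ordering of items is the same before and after scaling
same_order = bool((np.argsort(raw, axis=1) == np.argsort(scaled, axis=1)).all())
print(same_order)
```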

#### Converting the reconstructed matrix back to a Pandas dataframe

The next step is to convert the prediction matrix back to a DataFrame. After the transpose, the columns correspond to userId and the index to imdbId.

In [24]:
cf_preds_df = pd.DataFrame(all_user_predicted_ratings_norm, columns = users_items_matrix_df.columns, index=users_ids).transpose()
cf_preds_df.head(10)

imdbId \ userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
417,0.278652,0.280742,0.277153,0.278494,0.282929,0.277539,0.27066,0.271179,0.273094,0.274852,...,0.2826,0.276458,0.292035,0.281113,0.279737,0.273546,0.275611,0.277805,0.274823,0.267997
4972,0.278258,0.279348,0.277962,0.277516,0.281851,0.278764,0.27243,0.272359,0.273667,0.274626,...,0.281531,0.276302,0.286211,0.282612,0.279648,0.274827,0.276195,0.277478,0.274765,0.271138
6333,0.278162,0.278701,0.276929,0.278252,0.280113,0.278535,0.274218,0.271874,0.273998,0.275418,...,0.280276,0.276396,0.284084,0.280152,0.279068,0.275229,0.275671,0.277467,0.274495,0.271539
8133,0.278289,0.279994,0.277809,0.27542,0.282326,0.279071,0.272062,0.27234,0.273236,0.273937,...,0.282088,0.276199,0.285707,0.280927,0.280021,0.275426,0.276206,0.27711,0.274569,0.271112
9018,0.278404,0.278676,0.277228,0.279246,0.281185,0.278944,0.273553,0.271308,0.273453,0.27514,...,0.280922,0.2763,0.28572,0.281897,0.279319,0.274371,0.275553,0.277693,0.274068,0.270857
9932,0.278021,0.278483,0.276964,0.278098,0.279693,0.278341,0.274641,0.272632,0.274452,0.27567,...,0.279833,0.276507,0.283097,0.279727,0.278798,0.275507,0.275886,0.277425,0.274878,0.272344
10323,0.280206,0.279051,0.281048,0.271607,0.287924,0.28364,0.266173,0.281446,0.27125,0.271082,...,0.281765,0.274064,0.300174,0.27211,0.279805,0.273133,0.281016,0.276798,0.276816,0.273185
11237,0.277167,0.277208,0.277263,0.277406,0.27719,0.277214,0.277389,0.277419,0.277203,0.27726,...,0.277198,0.277204,0.277242,0.277117,0.27718,0.27719,0.277206,0.277183,0.277232,0.277403
11439,0.277074,0.27763,0.27769,0.278072,0.277371,0.277263,0.278341,0.278162,0.277448,0.277556,...,0.277623,0.277415,0.277674,0.277525,0.27733,0.277442,0.277262,0.277256,0.277516,0.278387
11984,0.278303,0.278918,0.276893,0.278406,0.280532,0.278729,0.273796,0.271117,0.273544,0.275167,...,0.280719,0.276284,0.285071,0.280577,0.279338,0.274951,0.275456,0.277508,0.274112,0.270733


In [25]:
cf_preds_df.shape

(8500, 671)

In [26]:
# Method     : get_ratings_df
# Input      : Predicted rating matrix
# Description: This method will convert input predicted rating matrix to a DF. Columns will be userId and index will be imdbId

In [27]:
def get_ratings_df(all_user_predicted_ratings_norm):
    # Convert the reconstructed matrix back to a pandas DataFrame.
    # After the transpose, imdbId is the index and userId values are the columns.
    cf_preds_df = pd.DataFrame(all_user_predicted_ratings_norm, columns=users_items_matrix_df.columns, index=users_ids).transpose()
    return cf_preds_df

### COLLABORATIVE FILTERING MODEL

**CLASS: CollabFiltering** <br>
This class implements the collaborative filtering model and an API to retrieve collaborative filtering recommendations.<br>
**Class constructor:** the `__init__` method requires two inputs:<br>
       svd_matrix_df: predicted ratings DF and <br>
       items_df: movie metadata DF<br>
           
**Method:** get_item_recommendations()<br>
**Input:** 
user_id - Id of the user for whom recommendations are generated<br>
items_to_ignore - List of items to be excluded from the recommended list (this will be the list of movies the user has already interacted with in the train set)<br>
topn - Number of recommendations to return (default 10)<br>
       
**Description:** 
Get the predicted ratings for the given user.<br>
Exclude the items to be ignored from this list.<br>
Sort the user ratings in descending order.<br>
Take the top-n items as recommendations.<br>
Merge the recommendations with the movie metadata to get the movie titles.<br>
Return the list of recommendations for the user.<br>
       

In [28]:
class CollabFiltering:

    def __init__(self, svd_matrix_df, items_df):
        self.svd_matrix_df = svd_matrix_df
        self.items_df = items_df

    def get_item_recommendations(self, user_id, items_to_ignore=None, topn=10):
        # Avoid a mutable default argument; None means "ignore nothing"
        if items_to_ignore is None:
            items_to_ignore = []
        # Filter the predicted ratings for the specified user and sort them in descending order
        predicted_user_ratings = self.svd_matrix_df[user_id].sort_values(ascending=False).reset_index()
        sorted_user_predictions = predicted_user_ratings.rename(columns={user_id: 'sim_score'})

        # Remove the items to be ignored; boolean filtering preserves the sorted order
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['imdbId'].isin(items_to_ignore)]
        # Retrieve topn recommendations
        recommendations_df = recommendations_df.head(topn)
        movie_dataset = self.items_df[['imdbId', 'title']]
        # Merge the topn recommendations with the movie dataset to retrieve the movie titles
        recommendations_df = pd.merge(left=recommendations_df, right=movie_dataset, left_on='imdbId', right_on='imdbId')
        return recommendations_df
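
The ranking steps can be tried on toy data without running the whole pipeline. In the sketch below, `top_n_for_user`, `toy_preds`, and `toy_movies` are hypothetical names and the scores are made up; the helper mirrors the steps of `get_item_recommendations` but is not the class itself. It uses a left merge to show one design choice: `pd.merge` defaults to an inner join, which silently drops a recommended movie that has no row in the metadata, whereas a left join keeps it with a missing title.

```python
import pandas as pd

# Toy predicted scores: index is imdbId, one column per userId (hypothetical values)
toy_preds = pd.DataFrame(
    {1: [0.9, 0.4, 0.7, 0.1], 2: [0.2, 0.8, 0.3, 0.6]},
    index=pd.Index([114709, 113497, 113228, 114885], name="imdbId"),
)
# Metadata deliberately lacks imdbId 114885 to show the effect of the join type
toy_movies = pd.DataFrame({
    "imdbId": [114709, 113497, 113228],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
})

def top_n_for_user(preds, movies, user_id, items_to_ignore=(), topn=2):
    """Hypothetical helper mirroring the ranking steps described above."""
    scores = preds[user_id].rename("sim_score").reset_index()
    scores = scores[~scores["imdbId"].isin(list(items_to_ignore))]
    top = scores.sort_values("sim_score", ascending=False).head(topn)
    # how="left" keeps a recommendation even when its title is missing
    return top.merge(movies, on="imdbId", how="left")

recs = top_n_for_user(toy_preds, toy_movies, user_id=1, items_to_ignore=[114709])
print(recs)

recs3 = top_n_for_user(toy_preds, toy_movies, user_id=1, items_to_ignore=[114709], topn=3)
print(recs3)
```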

Perform matrix factorization with NUM_LATENT_FACTORS latent factors (15 in this run).<br>
Create the predicted rating matrix and use it to create the collaborative filtering model.

In [29]:
user_matrix, sigma, item_matrix = perform_matrix_factorization(NUM_LATENT_FACTORS)
all_user_predicted_ratings_norm = get_predicted_user_ratings(user_matrix, sigma, item_matrix)
svd_preds_df=get_ratings_df(all_user_predicted_ratings_norm)    
recommender_cf = CollabFiltering(svd_preds_df, movie_meta_data)

Invoke the evaluation module with the collaborative filtering model created above. The quality of the recommendations is reported as recall and hit rate.

In [30]:
collab_overall_metrics, collab_eval_results_df = evaluations.model_evaluator.evaluate_model(recommender_cf,MODEL_COLLAB)
print('overall metrics:\n', collab_overall_metrics)

Number of users processed :  667
overall metrics:
 {'model_type': 'collaborative_filtering', 'recallscore@5': 0.6098375994118826, 'recallscore@10': 0.7213125710084876}


As we can see above, recall@5 is about 0.61 and recall@10 is about 0.72 across the 667 users processed, which is a reasonably good result.

In [31]:
#let us print the result dataframe
collab_eval_results_df.head(10)

Unnamed: 0,hitrate@5_count,hitrate@10_count,interacted_count,recallscore@5,recallscore@10,userId
61,78,105,372,0.209677,0.282258,547
62,57,88,266,0.214286,0.330827,624
12,76,108,264,0.287879,0.409091,73
17,81,116,258,0.313953,0.449612,564
66,116,141,249,0.465863,0.566265,15
41,84,110,214,0.392523,0.514019,468
2,83,104,185,0.448649,0.562162,452
50,70,89,169,0.414201,0.526627,30
23,86,103,156,0.551282,0.660256,311
79,68,86,145,0.468966,0.593103,213
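
The per-user columns in this table appear to be related by recallscore@k = hitrate@k_count / interacted_count. The evaluation module is not shown in this notebook, so this is an inference from the numbers rather than a statement about its implementation. The short check below reproduces the recall values for three users from the table.

```python
# (hitrate@5_count, hitrate@10_count, interacted_count) copied from the table above
rows = {
    547: (78, 105, 372),
    311: (86, 103, 156),
    15: (116, 141, 249),
}

recalls = {}
for user_id, (hits5, hits10, interacted) in rows.items():
    # Recall is the fraction of the user's interacted items that were hit
    recalls[user_id] = (hits5 / interacted, hits10 / interacted)
    print(user_id, round(recalls[user_id][0], 6), round(recalls[user_id][1], 6))
```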


**LET US PERFORM A MANUAL EVALUATION TO CHECK HOW GOOD THESE RECOMMENDATIONS ARE**

We will take a sample user with id 311. Let's look at this user's interactions in the training dataset.

In [32]:
user_id = 311
user_specific_interaction_df = user_ratings_train_df[user_ratings_train_df.userId == user_id]
user_specific_interaction_df

Unnamed: 0,userId,imdbId,rating
94903,311,50490,4.5
89075,311,36872,4.0
92965,311,109198,3.0
95977,311,92605,3.0
10367,311,101921,3.0
...,...,...,...
70791,311,112579,4.0
83986,311,105477,1.5
95517,311,50972,3.5
96050,311,59183,3.0


In [33]:
user_specific_interaction_df.imdbId.nunique()

862

This indicates that the user has rated 862 movies in the training dataset, which is a good number for collaborative filtering. As the number of interactions increases, the model can predict more confidently because it can learn the latent features for this user more accurately.

To get recommendations for the user, we need to pass the list of movies to be excluded from the predictions. These are simply the movies the user has already interacted with in the training dataset. Let's retrieve that list and pass it as the items to be ignored.

In [34]:
list_of_interacted_items = list(user_specific_interaction_df.imdbId.unique())

Invoke the model to get top 20 recommendations for this user

In [35]:
preds_df = recommender_cf.get_item_recommendations(user_id, items_to_ignore=list_of_interacted_items, topn=20)
preds_df

Unnamed: 0,imdbId,sim_score,title
0,118571,0.637364,Air Force One
1,116282,0.590484,Fargo
2,119654,0.58715,Men in Black
3,31381,0.562397,Gone with the Wind
4,107290,0.549755,Jurassic Park
5,108160,0.542198,Sleepless in Seattle
6,114388,0.533936,Sense and Sensibility
7,109830,0.531218,Forrest Gump
8,71562,0.523932,The Godfather: Part II
9,68646,0.516621,The Godfather


Let us look at this user's hit rate and recall score.

In [36]:
evaluations.model_evaluator.evaluate_model_for_user(recommender_cf, user_id)

{'hitrate@5_count': 86,
 'hitrate@10_count': 103,
 'interacted_count': 156,
 'recallscore@5': 0.5512820512820513,
 'recallscore@10': 0.6602564102564102}

Now let us see how many of the above top 20 recommended movies this user actually watched in the unseen (test) dataset.

In [37]:
user_specific_interaction_test_df = user_ratings_test_df[user_ratings_test_df.userId == user_id]
movie_dataset = movie_meta_data[['imdbId', 'title']]
user_specific_interaction_test_df = pd.merge(left=user_specific_interaction_test_df, right=movie_dataset, left_on='imdbId', right_on='imdbId')

Let us find the movies common to the top 20 recommendations and the user's interactions in the test set.

In [38]:
merged_pred_df = pd.merge(left=user_specific_interaction_test_df, right=preds_df, left_on='imdbId', right_on='imdbId')
merged_pred_df 

Unnamed: 0,userId,imdbId,rating,title_x,sim_score,title_y
0,311,32904,2.0,The Philadelphia Story,0.505135,The Philadelphia Story
1,311,118571,4.0,Air Force One,0.637364,Air Force One
2,311,107290,4.5,Jurassic Park,0.549755,Jurassic Park
3,311,114388,2.0,Sense and Sensibility,0.533936,Sense and Sensibility
4,311,105435,4.0,Sneakers,0.510195,Sneakers
5,311,125439,5.0,Notting Hill,0.489483,Notting Hill
6,311,109830,5.0,Forrest Gump,0.531218,Forrest Gump
7,311,108160,4.0,Sleepless in Seattle,0.542198,Sleepless in Seattle
8,311,119654,3.0,Men in Black,0.58715,Men in Black
9,311,97165,1.5,Dead Poets Society,0.500819,Dead Poets Society


**This is an encouraging result: out of the 20 recommended movies, 16 were watched by the user in the test set. The recommendations are quite good, and the model is able to capture the user's preferences and provide relevant recommendations.**