## MOVIE RECOMMENDER SYSTEM - CONTENT BASED RECOMMENDATION

#### SREENATH S

**NOTE: It is assumed that all the required input files are present in the same folder where this notebook is copied to.**

This notebook is part of the Movie Recommendation System project. Its purpose is to perform content based filtering, in the following steps:

1. Get movie metadata dataset.<br>
2. Get user rating dataset, both trainset and testset<br>
3. Apply TfIdf vectorizer on the movie metadata dataset. Only two columns are used: "title" and "movie_keywords"<br>
4. Index the TfIdf DF with imdbId<br>
5. Generate a user profile for each user as follows:<br>
   a. Filter all the movies the user interacted with in the training set.<br>
   b. Create a sparse matrix in which each row corresponds to the TfIdf representation of a movie the user watched.<br>
   c. Compute the weighted average by multiplying the above matrix (step 5.b) with the user ratings.<br>
   d. Normalize the weighted average vector.<br>
6. Compute the recommendations for every user as follows:<br>
   a. Compute the cosine similarity between the user's weighted average vector and the TfIdf representation of each movie.<br>
   b. Filter out the movies already watched by the user.<br>
   c. Predict the TopN movies with the highest similarity score.
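
Steps 5 and 6 can be sketched on a toy catalogue. This is a minimal illustration with NumPy only, not the notebook's implementation; the ids, feature vectors and ratings are made up.

```python
import numpy as np

# Toy catalogue: 4 movies x 3 TfIdf features, every row already has unit L2 norm.
toy_item_ids = [101, 102, 103, 104]  # hypothetical imdbIds
item_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

# The user rated movie 101 with 5.0 and movie 102 with 1.0 (made-up ratings).
watched = {101: 5.0, 102: 1.0}
rows = [toy_item_ids.index(i) for i in watched]
ratings = np.array(list(watched.values())).reshape(-1, 1)

# Step 5: rating-weighted average of the watched rows, then L2 normalisation.
weighted_avg = (item_vectors[rows] * ratings).sum(axis=0) / ratings.sum()
profile = weighted_avg / np.linalg.norm(weighted_avg)

# Step 6: with unit-length vectors on both sides, the dot product is the cosine similarity.
scores = item_vectors @ profile
ranked = sorted(zip(toy_item_ids, scores), key=lambda pair: pair[1], reverse=True)
recommendations = [(i, s) for i, s in ranked if i not in watched][:2]
print(recommendations)
```

Movie 103 shares features with both watched movies, so it ranks first among the unwatched ones; movie 104 shares none and gets a score of 0.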

**The notebook from the walkthrough session is used as the base version; changes are made on top of it as required.**

Import all the packages as needed.

In [1]:
import numpy as np
import pandas as pd
import scipy
import math
import random
import sklearn
import import_ipynb
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from configs import MODEL_CONTENT

Load the other modules created as part of the Movie Recommendation system. The modules of interest are:
1. MovieRecommender_TrainTestDataGenerator
2. MovieRecommeder_evaluations

Note that output is suppressed for this cell with `%%capture`; otherwise it would show the print statements from these modules.

In [2]:
%%capture
import MovieRecommender_TrainTestDataGenerator as DataGen
import MovieRecommeder_evaluations as ModelEval 

Get the movie metadata from the MovieRecommender_TrainTestDataGenerator module

In [3]:
movie_meta_data = DataGen.get_movie_metadata()
movie_meta_data.shape

(8989, 9)

Get the user ratings train dataset and the user ratings test dataset. We will use them to create the weighted user profile vectors.

In [4]:
user_ratings_train_df, user_ratings_test_df = DataGen.train_test_user_behaviour()

Explore and print the imported datasets to make sure everything was imported correctly.

In [5]:
user_ratings_train_df.head(5)

Unnamed: 0,userId,imdbId,rating
2457,73,112864,3.5
49661,472,120906,5.0
48470,584,119116,3.5
33782,18,117913,4.0
80144,614,80549,2.0


In [6]:
movie_meta_data.head(5)

Unnamed: 0,original_language,original_title,title,overview,movie_genre,movie_production,movie_keywords,spoken_language,imdbId
0,en,Toy Story,Toy Story,"Led by Woody, Andy's toys live happily in his ...",Animation Comedy Family,Pixar Animation Studios,jealousy toy boy friendship friends rivalry bo...,en,114709
1,en,Jumanji,Jumanji,When siblings Judy and Peter discover an encha...,Adventure Fantasy Family,"TriStar Pictures,Teitler Film,Interscope Commu...",board game disappearance based on children's b...,en fr,113497
2,en,Grumpier Old Men,Grumpier Old Men,A family wedding reignites the ancient feud be...,Romance Comedy,"Warner Bros.,Lancaster Gate",fishing best friend duringcreditsstinger old men,en,113228
3,en,Waiting to Exhale,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",Comedy Drama Romance,Twentieth Century Fox Film Corporation,based on novel interracial relationship single...,en,114885
4,en,Father of the Bride Part II,Father of the Bride Part II,Just when George Banks has recovered from his ...,Comedy,"Sandollar Productions,Touchstone Pictures",baby midlife crisis confidence aging daughter ...,en,113041


In [7]:
user_ratings_train_df.columns

Index(['userId', 'imdbId', 'rating'], dtype='object')

Let us see how many unique movies there are in the movie metadata DF.

In [8]:
movie_meta_data.imdbId.nunique()

8989

In [9]:
meta_imdbId = set(movie_meta_data.imdbId.unique())

In [10]:
user_ratings_train_df.imdbId.nunique()

8500

We can observe that, of the 8989 movies in total, 8500 are part of the train dataset, which is good coverage.

In [11]:
train_imdbId = set(user_ratings_train_df.imdbId.unique())

In [12]:
len(meta_imdbId.intersection(train_imdbId))

8500

In [13]:
user_ratings_test_df.imdbId.nunique()

4212

Similarly, only 4212 movies are part of the testing dataset.

In [14]:
test_imdbId = set(user_ratings_test_df.imdbId.unique())

In [15]:
len(meta_imdbId.intersection(test_imdbId))

4212
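
The intersection sizes equal the unique counts (8500 and 4212), which means every rated movie has a metadata row. The same check can be written as a subset test. This is a small sketch on made-up ids, not the real datasets:

```python
# Made-up id sets standing in for meta_imdbId, train_imdbId and test_imdbId.
meta_ids = {1, 2, 3, 4, 5}
train_ids = {1, 2, 3, 4}
test_ids = {2, 4}

# An intersection as large as the smaller set means that set is fully contained.
train_covered = len(meta_ids & train_ids) == len(train_ids)
test_covered = test_ids <= meta_ids  # equivalent subset test

# Movies with metadata but no training ratings.
unrated_in_train = sorted(meta_ids - train_ids)
print(train_covered, test_covered, unrated_in_train)
```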

We train a TfIdf model with at most 1500 features (`max_features` keeps the top terms ordered by term frequency across the corpus). Terms with a document frequency lower than 0.003 or higher than 0.5 are ignored. Both unigrams and bigrams are used. The stop word list is taken from nltk and passed to the vectorizer.
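
To see how these parameters interact, here is a minimal sketch on a made-up four-document corpus. The thresholds are scaled up (`min_df=0.5`, `max_df=0.9`) so that the filtering is visible on such a small corpus, and no stop word list is passed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up keyword strings, one per "movie".
toy_corpus = [
    "space alien invasion",
    "alien invasion earth",
    "space station earth",
    "love story",
]

# A term must appear in at least 50% and at most 90% of the documents to be kept.
vectorizer = TfidfVectorizer(analyzer='word', min_df=0.5, max_df=0.9,
                             max_features=1500, ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each surviving term to its column index.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)   # unigrams and bigrams that occur in at least 2 of the 4 documents
print(tfidf.shape)  # (number of documents, number of surviving terms)
```

Only five terms survive, far below `max_features=1500`, because `max_features` is an upper bound applied after the document-frequency filter. The last document keeps no terms at all, so its row is entirely zero.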


In [16]:
stopwords_list = stopwords.words('english')

In [17]:
# model class
vectorizer = TfidfVectorizer(analyzer='word',min_df=0.003,max_df=0.5,max_features=1500,ngram_range=(1, 2),stop_words=stopwords_list)
print(vectorizer)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.5, max_features=1500,
                min_df=0.003, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                strip_accents=None, sublinear_tf=False,
                token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
                vocabulary=None)


Now we will create a list of all movies (a list of imdbIds)<br>
and fit the vectorizer on the title and movie_keywords columns.
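
How the two columns are joined affects tokenisation. The sketch below uses only the standard library and the default token pattern printed in the vectorizer output above. The keyword string is shortened and purely illustrative.

```python
import re

# Default token pattern of TfidfVectorizer, as printed in the vectorizer output above.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Lowercase the text and split it with the default token pattern (before stop word removal)."""
    return TOKEN_PATTERN.findall(text.lower())

title = "Toy Story"
keywords = "jealousy toy boy friendship"  # illustrative keyword string

fused = tokens(title + "" + keywords)      # empty separator
separated = tokens(title + " " + keywords)  # single-space separator
print(fused)
print(separated)
```

With an empty separator the last title word and the first keyword merge into one token (`storyjealousy`); with a space they stay separate.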

In [18]:
movie_meta_data['overview'] = movie_meta_data['overview'].fillna("")

In [19]:
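item_ids = movie_meta_data['imdbId'].tolist()
# NOTE: joining with "" fuses the last title word with the first keyword; " " would keep them as separate tokens.
tfidf_matrix = vectorizer.fit_transform(movie_meta_data.title+""+movie_meta_data.movie_keywords)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions.
tfidf_feature_names = vectorizer.get_feature_names()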

In [20]:
len(tfidf_feature_names)

731

We can see that there are only 731 features: the min_df/max_df document-frequency limits we specified leave fewer terms than the max_features cap of 1500.

In [21]:
print(tfidf_feature_names[:100])

['1970s', '19th', '19th century', '3d', 'abuse', 'accident', 'action', 'actor', 'addiction', 'adoption', 'adult', 'adult novel', 'adultery', 'adventure', 'affair', 'aftercreditsstinger', 'aftercreditsstinger duringcreditsstinger', 'age', 'agent', 'air', 'airplane', 'airport', 'alcohol', 'alcoholic', 'alien', 'alien invasion', 'america', 'american', 'anarchic', 'anarchic comedy', 'ancient', 'angeles', 'animal', 'animation', 'anime', 'anti', 'apartment', 'apocalypse', 'apocalyptic', 'apocalyptic dystopia', 'army', 'art', 'artist', 'arts', 'assassin', 'assassination', 'astronaut', 'asylum', 'attack', 'attempt', 'australia', 'author', 'baby', 'back', 'bad', 'ball', 'band', 'bank', 'bar', 'baseball', 'based', 'based comic', 'based novel', 'based play', 'based true', 'based tv', 'based video', 'based young', 'battle', 'beach', 'bear', 'beauty', 'best', 'best friend', 'betrayal', 'big', 'biography', 'birth', 'black', 'blackmail', 'blood', 'blue', 'boat', 'body', 'bomb', 'book', 'boss', 'boxer

In [22]:
tfidf_matrix.shape

(8989, 731)

In [23]:
type(tfidf_matrix)

scipy.sparse.csr.csr_matrix

In [24]:
tfidf_matrix_df = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=tfidf_feature_names)

In [25]:
tfidf_matrix_df['imdbId'] = item_ids
tfidf_matrix_df = tfidf_matrix_df.set_index('imdbId')
tfidf_matrix_df.shape

(8989, 731)

In [26]:
tfidf_matrix_df.head(5)

imdbId,1970s,19th,19th century,3d,abuse,accident,action,actor,addiction,adoption,...,world war,writer,year,york,york city,young,young adult,younger,youth,zombie
114709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
113497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
113228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
114885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
113041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we will wrap the above steps in a function so that it can be called from the Content Based Filtering class. This makes the class self-sufficient, so that it does not depend on any of the global variables created above.

In [27]:
# METHOD      : create_tfidf_matrix
# INPUT       : NA.
# DESCRIPTION : Returns the TfIdf DataFrame and the TfIdf sparse matrix by performing the following steps:
#               1. Get the movie metadata dataset
#               2. Get the user ratings train and test datasets
#               3. Instantiate a TfIdf vectorizer with stop words for 'English'
#               4. Fit the vectorizer on the movie metadata features title and movie_keywords. This yields a TfIdf sparse matrix.
#               5. Create a data frame from the sparse matrix output.
#               6. Set the index of the DF as 'imdbId'
def create_tfidf_matrix():
    movie_meta_data = DataGen.get_movie_metadata()
    user_ratings_train_df, user_ratings_test_df = DataGen.train_test_user_behaviour()
    stopwords_list = stopwords.words('english')
    vectorizer = TfidfVectorizer(analyzer='word',min_df=0.003,max_df=0.5,max_features=1500,ngram_range=(1, 2),stop_words=stopwords_list)
    item_ids = movie_meta_data['imdbId'].tolist()
    tfidf_matrix = vectorizer.fit_transform(movie_meta_data.title+""+movie_meta_data.movie_keywords)
    tfidf_feature_names = vectorizer.get_feature_names()
    tfidf_matrix_df = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=tfidf_feature_names)
    tfidf_matrix_df['imdbId'] = item_ids
    tfidf_matrix_df = tfidf_matrix_df.set_index('imdbId')
    return tfidf_matrix_df, tfidf_matrix

Methods to create the weighted user profiles

In [28]:
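# METHOD      : create_item_profiles
# INPUT       : all_item_ids - imdbIds of the movies to be considered for creating the profile.
#               tfidf_matrix - sparse TfIdf matrix whose rows follow the order of item_ids.
# DESCRIPTION : Get the rows corresponding to each of the movies in the input list.
#               Stack them vertically and return the sparse matrix.
# NOTE        : Relies on the global list item_ids to map an imdbId to its row position.
def create_item_profiles(all_item_ids, tfidf_matrix):
    item_profiles_list = []
    for x in all_item_ids:
        row = item_ids.index(x)  # look up the row position once per movie
        item_profiles_list.append(tfidf_matrix[row:row+1])
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles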
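
`list.index` scans the id list on every lookup. One possible alternative, sketched here on toy data, is to build an id-to-row dictionary once and select all rows in a single indexing step. The names `row_of` and `lookup_item_profiles` are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for item_ids and tfidf_matrix (3 movies x 2 features).
toy_item_ids = [114709, 113497, 113228]
toy_tfidf = sparse.csr_matrix(np.array([[0.0, 1.0],
                                        [0.5, 0.5],
                                        [1.0, 0.0]]))

# Built once: imdbId -> row position in the matrix.
row_of = {imdb_id: row for row, imdb_id in enumerate(toy_item_ids)}

def lookup_item_profiles(wanted_ids, tfidf_matrix, row_of):
    """Return the rows of tfidf_matrix for wanted_ids, in the order given."""
    rows = [row_of[i] for i in wanted_ids]
    return tfidf_matrix[rows]  # CSR matrices support selecting rows with a list

profiles = lookup_item_profiles([113228, 114709], toy_tfidf, row_of)
print(profiles.toarray())
```

The result has one row per requested id, in request order, so it can be multiplied by the rating vector in the same way as the stacked matrix above.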

In [29]:
# METHOD      : generate_users_profiles
# INPUT       : TfIdf Matrix.
# DESCRIPTION : For every user in the training dataset, perform the following:
#               Filter the movies watched by the user.
#               Invoke create_item_profiles with the list of movies filtered above.
#               Compute the weighted average vector for each user by multiplying the user ratings with the matrix returned by create_item_profiles.
#               Normalize the data.

def generate_users_profiles(TfIdfMatrix):
    user_ratings_train_df = DataGen.train_test_user_behaviour()[0]
    user_behaviour_indexed_df = user_ratings_train_df.set_index('userId')
    user_profiles = {}
    for person_id in user_behaviour_indexed_df.index.unique():
        # Filter the movies interacted by user; the list selector keeps a DataFrame even for a user with a single rating
        user_behaviour_person_df = user_behaviour_indexed_df.loc[[person_id]]
        # Create the item profile matrix
        user_item_profiles = create_item_profiles(user_behaviour_person_df['imdbId'], TfIdfMatrix)
        # Get the user ratings vector
        user_item_strengths = np.array(user_behaviour_person_df['rating']).reshape(-1,1)
        # Weighted average of item profiles by the user_behaviour strength
        user_item_strengths_weighted_avg = np.sum(user_item_profiles.multiply(user_item_strengths), axis=0) / np.sum(user_item_strengths)
        # Convert to a plain ndarray before normalizing (newer scikit-learn versions do not accept np.matrix input)
        user_profile_normalized = sklearn.preprocessing.normalize(np.asarray(user_item_strengths_weighted_avg))
        # Store it in the user_profiles dictionary
        user_profiles[person_id] = user_profile_normalized
        
    return user_profiles

Let us test the correctness of the above function by invoking it.

In [30]:
#Invoke the generate_users_profiles() to create user profile data
TfIdfDF, TfIdfMatrix = create_tfidf_matrix()
user_profiles = generate_users_profiles(TfIdfMatrix)
len(user_profiles)

671

In [31]:
#Let us print the vector for a user
print(user_profiles[472])

[[0.02882835 0.00856112 0.00856112 0.         0.03223596 0.02113753
  0.01351029 0.03791213 0.00916059 0.00340706 0.02155316 0.
  0.02531783 0.01187935 0.02148889 0.05332248 0.02724138 0.05250307
  0.0433457  0.01370486 0.03872674 0.0147711  0.02858992 0.0110799
  0.03082857 0.00121231 0.0156084  0.11350479 0.04463024 0.04463024
  0.00333713 0.04102523 0.04137946 0.04512172 0.01017047 0.01740467
  0.01058502 0.01219099 0.01670856 0.01121853 0.0601521  0.06374564
  0.01714152 0.03079727 0.03204589 0.01137682 0.00501587 0.01514912
  0.03208264 0.04742968 0.01695676 0.033104   0.00308042 0.00518507
  0.02179618 0.00777938 0.03073505 0.00955398 0.00573716 0.01226112
  0.09992833 0.00234071 0.06676847 0.02103683 0.02594992 0.01339461
  0.         0.         0.04789131 0.03236571 0.00203087 0.00671723
  0.02177413 0.01670444 0.01643529 0.05348671 0.03104512 0.01946954
  0.04057754 0.03284045 0.0229174  0.0319363  0.0269355  0.0188169
  0.02520829 0.04160455 0.03487652 0.01467232 0.00697282 0

Content Based Recommender

In [32]:
class ContentFiltering:
    
    
    def __init__(self, item_ids, items_df):
        self.item_ids = item_ids
        self.items_df = items_df
        self.TfIdfDF, self.tfidf_matrix = create_tfidf_matrix()
        self.user_profiles = generate_users_profiles(self.tfidf_matrix)
        
    def compute_user_item_profile_similarity(self, person_id, topn=2000):
        # Computes the cosine similarity between the user profile and all item profiles
        cosine_similarities = cosine_similarity(self.user_profiles[person_id], self.tfidf_matrix)

        # Sort the movies based on similarity and get the topn movies
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        
        # Sort the similar items by similarity
        similar_items = [(self.item_ids[i], cosine_similarities[0,i]) for i in similar_indices]
        similar_items = sorted(similar_items, key=lambda x: x[1], reverse = True)
        return similar_items
        
    def get_item_recommendations(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        similar_items = self.compute_user_item_profile_similarity(user_id)
        #Filter movies the user has already watched
        similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        
        recommendations_df = pd.DataFrame(similar_items_filtered, columns=['imdbId', 'sim_score']).head(topn)
        movie_dataset = self.items_df[['imdbId', 'title']]
        
        recommendations_df = pd.merge(left=recommendations_df, right=movie_dataset, left_on='imdbId', right_on='imdbId')
        #recommendations_df
        return recommendations_df
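
The candidate selection and filtering in the class can be traced on a toy score vector. This is a minimal sketch; the ids and scores are made up.

```python
import numpy as np

toy_ids = [11, 22, 33, 44]                        # hypothetical imdbIds
similarities = np.array([[0.1, 0.9, 0.4, 0.7]])   # shape (1, n_items), like the cosine_similarity output
topn = 2

# argsort is ascending, so the last `topn` positions hold the largest scores.
top_positions = similarities.argsort().flatten()[-topn:]
top_items = sorted(((toy_ids[i], similarities[0, i]) for i in top_positions),
                   key=lambda pair: pair[1], reverse=True)

# Watched movies are removed only after the candidates have been selected.
items_to_ignore = {22}
filtered = [pair for pair in top_items if pair[0] not in items_to_ignore]
print(top_items)
print(filtered)
```

Filtering shrinks the candidate list from two entries to one. That is presumably why compute_user_item_profile_similarity draws a large pool (topn=2000) before get_item_recommendations trims the result to 10.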

Let us instantiate the content based model

In [33]:
content_based_recommender_model = ContentFiltering(item_ids, movie_meta_data)

Now we will check the prediction for a particular user, by invoking get_item_recommendations with userId

In [34]:
print("content based recommendations: \n\n")
content_based_recommender_model.get_item_recommendations(472)

content based recommendations: 




Unnamed: 0,imdbId,sim_score,title
0,60304,0.400069,2 or 3 Things I Know About Her
1,110395,0.382622,Love and a .45
2,165874,0.382622,The Mating Habits of the Earthbound Human
3,159272,0.382388,Beautiful People
4,120831,0.368626,Slums of Beverly Hills
5,102721,0.361512,Proof
6,355702,0.356105,Lords of Dogtown
7,101588,0.355322,City of Hope
8,460829,0.353727,INLAND EMPIRE
9,132910,0.350745,The Crow: Salvation


**Evaluate the model on the train dataset**

In [35]:
content_overall_metrics, content_eval_results_df = ModelEval.model_evaluator.evaluate_model(content_based_recommender_model,MODEL_CONTENT)
print('overall metrics:\n', content_overall_metrics)
content_eval_results_df.head(10)

Number of users processed :  667
overall metrics:
 {'model_type': 'content_based', 'recallscore@5': 0.14422241529105126, 'recallscore@10': 0.2359152576355009}


Unnamed: 0,hitrate@5_count,hitrate@10_count,interacted_count,recallscore@5,recallscore@10,userId
61,26,50,372,0.069892,0.134409,547
62,27,48,266,0.101504,0.180451,624
12,30,53,264,0.113636,0.200758,73
17,14,31,258,0.054264,0.120155,564
66,12,32,249,0.048193,0.128514,15
41,7,22,214,0.03271,0.102804,468
2,11,29,185,0.059459,0.156757,452
50,19,30,169,0.112426,0.177515,30
23,17,33,156,0.108974,0.211538,311
79,41,62,145,0.282759,0.427586,213
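
The per-user columns in the table are consistent with recallscore@k = hitrate@k_count / interacted_count. How ModelEval counts a hit is not shown in this notebook, so the sketch below only checks that arithmetic on the first three rows.

```python
# (hitrate@5_count, hitrate@10_count, interacted_count), copied from the first three table rows.
rows = {547: (26, 50, 372), 624: (27, 48, 266), 73: (30, 53, 264)}

def recall(hits, interacted):
    """Fraction of a user's interacted items that were counted as hits."""
    return hits / interacted

recalls = {
    user_id: (round(recall(hits5, n), 6), round(recall(hits10, n), 6))
    for user_id, (hits5, hits10, n) in rows.items()
}
print(recalls)
```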
