# Content Based Recommender System

While building the knowledge-based recommender, I took only limited account of the users' preferences. For example, it did consider the users' preferences for genres, timelines, and duration, but the model and its recommendations still remained very generic.
Imagine that user A likes the movies 'The Dark Knight', 'Iron Man', and 'Man of Steel'. It is pretty evident that A has a taste for superhero movies. Our previous model would not be able to capture this detail.

An obvious fix for this problem is to ask the user for more metadata as input. This is exactly what sites like Netflix do: when you initially sign up, they ask which movies you have watched, and they build a profile based on your preferences.

In this notebook we are going to build two types of content-based recommenders:
- ### Plot-description based recommender
> This model compares the descriptions and taglines of two different movies and provides recommendations that have the most similar plot descriptions
- ### Metadata-based recommender
> This model takes a host of features, such as genres, keywords, cast and crew, into consideration and provides recommendations that are the most similar with respect to the aforementioned features.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv('./movies_metadata_clean.csv')

In [3]:
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995


# Document Vectors

The models we are building now will compute the pairwise similarity between bodies of text. The question is: how do we quantify the similarity between two bodies of text?
To put it another way, consider three movies: A, B and C. How can we show mathematically that the plot of movie A is more similar to the plot of B than to that of C, or vice versa?
> The first step towards answering these questions is to represent the bodies of text as mathematical quantities. This is done by representing these documents as vectors. In other words, every document is depicted as a series of n numbers, where each number represents a dimension and n is the size of the vocabulary of all the documents put together.

>We will be using a vectorizer to convert our documents into vectors. The two most popular vectorizers are:
- Count Vectorizer
- TFIDF Vectorizer

## Count Vectorizer

A count vectorizer represents each document by how many times each word of the vocabulary occurs in it. The vocabulary is the set of unique words across all the documents. It is common practice to exclude extremely common words such as a, the, had, and my. These are called stop words; they add computation without helping to tell documents apart, so we remove them.

After removing the stop words from our vocabulary, we can convert the documents into vectors. The size of each vector is the number of words in the vocabulary.

Each input document is then represented as a vector in which every element is the number of times the corresponding vocabulary word occurs in that document.
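To make the idea concrete, here is a minimal pure-Python sketch of count vectors. The two sentences and the tiny stop word list are made up for illustration, and the whitespace tokenizer is an assumption; scikit-learn's `CountVectorizer` does the same job with more careful tokenization.

```python
from collections import Counter

# Toy corpus (hypothetical sentences, not taken from the movie dataset).
docs = [
    "the dog saw the other dog",
    "the cat saw the dog",
]

# A tiny hand-picked stop word list, assumed for this example only.
stop_words = {"the", "a", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

# The vocabulary is the sorted set of unique words across all documents.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

def count_vector(text):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'other', 'saw']
print(vectors)     # [[0, 2, 1, 1], [1, 1, 0, 1]]
```

Every vector has the same length as the vocabulary, which is what makes the documents comparable with each other.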

## TFIDF (Term Frequency-Inverse Document Frequency)
When working with text it is important to note that not all words carry the same weight. Consider a corpus of documents about dogs. It is obvious that all these documents will frequently contain the word dog. Therefore, the appearance of the word dog isn't as informative as that of another word that only appears in a few documents.


<b>TFIDF</b> takes the aforementioned point into consideration and assigns a weight to each word according to the following formula. For every word i in document j:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

- $w_{i,j}$ is the weight of word i in document j
- $tf_{i,j}$ is the number of times word i occurs in document j
- $df_i$ is the number of documents that contain word i
- $N$ is the total number of documents

- Just keep in mind that the weight of a word in a document is greater if it occurs more frequently in that document and is present in fewer documents. The raw weights from this formula are not bounded above; once each document vector is normalized to unit length, as scikit-learn's TfidfVectorizer does by default, every weight lies between 0 and 1.
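The formula can be checked by hand on a tiny corpus. The sketch below applies it literally, before any normalization, to three made-up documents that all contain the word dog; the helper names are hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term by default, so its numbers will differ from these.

```python
import math

# Toy corpus (hypothetical), already tokenized: every document mentions "dog".
docs = [
    ["dog", "barks", "dog"],
    ["dog", "sleeps"],
    ["dog", "eats", "bone"],
]

N = len(docs)  # total number of documents

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def tfidf_weight(term, doc):
    """Weight of a term in a document: tf * log(N / df).

    Assumes the term occurs somewhere in the corpus, so df > 0.
    """
    tf = doc.count(term)
    df = document_frequency(term)
    return tf * math.log(N / df)

w_dog = tfidf_weight("dog", docs[0])      # tf = 2, df = 3
w_barks = tfidf_weight("barks", docs[0])  # tf = 1, df = 1

print(w_dog)    # 0.0, because "dog" appears in every document
print(w_barks)  # log(3), roughly 1.0986
```

Even though dog occurs twice in the first document, its weight is zero because it does not help to distinguish that document from the others.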

We will be using TFIDF Vectorizer because some words occur much more frequently in plot descriptions than others. It is therefore a good idea to assign weights in a document according to the TFIDF formula 

## Cosine Similarity Score
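The cosine similarity is robust and easy to calculate. The cosine similarity score between two documents x and y is defined as:

$$\text{cosine}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

In general, the cosine similarity can take values between -1 and 1. Because TFIDF weights are never negative, the score between two TFIDF vectors lies between 0 and 1. The higher the cosine score, the more similar the documents are to each other. We now have a good theoretical base to proceed to build the content-based recommenders using Python.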
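As a quick illustration, the score can be computed in a few lines of plain Python. This is a sketch with made-up vectors; the function name `cosine_score` is hypothetical, and it assumes neither vector is all zeros.

```python
import math

def cosine_score(x, y):
    """Dot product of x and y divided by the product of their lengths.

    Assumes both vectors are non-zero; a zero vector would divide by zero.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]    # points in the same direction as a
c = [0.0, 0.0, 3.0]    # shares no non-zero dimension with a
d = [-1.0, -2.0, 0.0]  # points in the opposite direction to a

print(cosine_score(a, b))  # approximately 1
print(cosine_score(a, c))  # 0.0
print(cosine_score(a, d))  # approximately -1
```

The score depends only on the direction of the vectors, not on their length, which is why a long and a short plot description can still be judged similar.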

# Plot Description Based Recommender

This recommender will take in a movie title as an argument and output a list of movies that are most similar based on their plots. These are the steps we are going to perform in building the model.

1. Obtain the data required to build the model
2. Create TF-IDF vectors for the plot descriptions of every movie
3. Compute the pairwise cosine similarity score of every movie
4. Write the recommender function that takes in a movie title as an argument and outputs the movies most similar to it based on the plot.
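Before working on the real data, the four steps can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the three titles and overviews are invented, and the names `toy_df`, `title_to_index` and `recommend` are hypothetical. It is a rough outline of the idea, not the final recommender.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Step 1: hypothetical toy data standing in for the movie dataframe.
toy_df = pd.DataFrame({
    "title": ["Space Rescue", "Galaxy Patrol", "Kitchen Romance"],
    "overview": [
        "astronauts repair a damaged spaceship in orbit",
        "a spaceship crew of astronauts patrols the galaxy",
        "two chefs fall in love in a small kitchen",
    ],
})

# Step 2: TF-IDF vectors for every overview, without English stop words.
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_df["overview"])

# Step 3: pairwise similarity scores. The rows are unit length by default,
# so the plain dot product equals the cosine similarity.
toy_sim = linear_kernel(toy_matrix, toy_matrix)

# Step 4: map each title to its row position and rank the other movies.
title_to_index = pd.Series(toy_df.index, index=toy_df["title"])

def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title."""
    idx = title_to_index[title]
    ranked = sorted(enumerate(toy_sim[idx]), key=lambda pair: pair[1], reverse=True)
    # Skip the movie itself, which always has the highest score.
    top = [i for i, _ in ranked if i != idx][:top_n]
    return toy_df["title"].iloc[top].tolist()

print(recommend("Space Rescue", top_n=1))  # ['Galaxy Patrol']
```

The two space movies share the words astronauts and spaceship, so they rank above the unrelated romance.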

#### Preparing the Data
In its present form, our clean dataframe doesn't contain the features that are required to build the plot description based recommender. Fortunately, these requisite features are available in the original metadata file.

In [4]:
orig_df=pd.read_csv('./movies_metadata.csv', low_memory=False)
#Add the useful features into our cleaned dataframe
df['overview'], df['id']=orig_df['overview'], orig_df['id']

In [5]:
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


Now, our dataframe contains the id and overview features. We will be using the overview feature now and the id feature in the next recommender.

## Creating the TFIDF matrix
The next step is to create a matrix where each row represents the TFIDF vector of the overview feature of the corresponding movie in our main dataframe. To do this, we will use the scikit-learn library, which gives us access to a TfidfVectorizer that performs this operation effortlessly.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Define the TFIDF Vectorizer Object. Remove all the english stopwords
tfidf=TfidfVectorizer(stop_words='english')

In [7]:
#Replace NaN with an empty string
df['overview']=df['overview'].fillna('')

In [8]:
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


In [9]:
tfidfmatrix=tfidf.fit_transform(df['overview'])

In [12]:
tfidfmatrix.shape

(45466, 75827)

We see that the vectorizer has created a 75827 dimensional vector for the overview of every movie. Now, we shall go ahead and calculate the cosine similarity score. 

We are going to create a 45466 x 45466 matrix, where the cell in the ith row and jth column represents the similarity score between the movies i and j. We can easily see that this matrix is symmetric and that every element on the diagonal is 1, since it is the similarity score of a movie with itself.

Calculating the cosine similarity is a computationally expensive process. Fortunately, scikit-learn's TfidfVectorizer normalizes every vector to unit length by default, so the magnitude of each plot vector is 1. Hence, we do not need to calculate the denominator in the cosine similarity formula, as it will always be 1; the dot product alone gives the score. Let's see this process in action.

In [13]:
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity matrix
cosine_sim=linear_kernel(tfidfmatrix, tfidfmatrix)

MemoryError: 
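The error is not surprising once we estimate the size of the result. The sketch below does the arithmetic, assuming the dense output uses 64-bit floats, and then shows one possible workaround on made-up data: computing the scores for a single movie against all the others instead of the full matrix. The names `sample_overviews`, `sample_matrix` and `row_scores` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Size of the full dense similarity matrix, assuming 8 bytes per value.
n_movies = 45466
bytes_per_value = 8
total_bytes = n_movies * n_movies * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")  # 16.5 GB

# Hypothetical toy overviews standing in for df['overview'].
sample_overviews = [
    "astronauts repair a damaged spaceship in orbit",
    "a spaceship crew of astronauts patrols the galaxy",
    "two chefs fall in love in a small kitchen",
]
sample_matrix = TfidfVectorizer(stop_words="english").fit_transform(sample_overviews)

# Scores of movie 0 against every movie: one row instead of the whole matrix.
row_scores = linear_kernel(sample_matrix[0], sample_matrix).ravel()
print(row_scores.shape)  # (3,)
```

A single row for the real dataset holds 45466 values, which is well under a megabyte, so the scores can be computed on demand for whichever movie is requested.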