<a id='section_id3'></a>
# Content Based Filtering Model

### Recommender System:
A Recommender System is an algorithm that aims to surface the items most relevant to a user by discovering patterns in a dataset.

### Content Based Recommender Systems:
Content based recommenders are built on the idea of using the content of each item to drive recommendations. The content of an item is an abstract notion that leaves many options open. For a movie, for example, we could consider the director, the cast, the genre, the plot, and so on.

This type of recommender uses the description of an item to recommend the most similar items. It uses the product features or the keywords in the description to measure the similarity between items.
Once we know which content we will consider (for example, the director), we need to transform this data into a Vector Space Model, an algebraic representation of text documents.

Generally, we do this with a Bag of Words model, which represents documents while ignoring the order of the words. In this model, each document is treated as a bag containing some words, and every bag draws its words from a shared dictionary (the vocabulary).
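
As a minimal sketch of this idea, the snippet below builds count vectors for two made-up documents using only the standard library. The function name `bag_of_words` and the sample documents are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one count vector per document."""
    # Whitespace tokenization is a simplifying assumption for this sketch.
    tokenized = [doc.lower().split() for doc in documents]
    # The dictionary is the sorted set of every token seen in the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Word order is discarded: only the count per vocabulary term is kept.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["boxing drama boxing", "space drama"]  # hypothetical documents
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['boxing', 'drama', 'space']
print(vectors)     # [[2, 1, 0], [0, 1, 1]]
```

Each document becomes a vector with one position per vocabulary term, which is the form the similarity measures later in this section operate on.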

A specific weighting scheme built on Bag of Words is the TF-IDF representation, where TF stands for Term Frequency and IDF for Inverse Document Frequency. This model combines how important a word is in the document (local importance) with how important the word is in the corpus (global importance).

TF-IDF is widely used for feature extraction in Information Retrieval and Natural Language Processing (NLP).

![title](images/tf.png)
![title](images/idf.png)

TF-IDF is a measure used to evaluate how important a word is to a document in a document corpus. The importance of the word increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
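
The weighting can be sketched by hand on a tiny corpus. The version below assumes the textbook definitions, tf = count / document length and idf = log(N / document frequency). The function name `tf_idf` and the sample documents are hypothetical. Note that scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed idf and L2 normalization by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Textbook TF-IDF: tf = count / doc length, idf = log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["boxing drama boxing", "space drama", "drama documentary"]  # hypothetical corpus
weights = tf_idf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

In this corpus "drama" appears in every document, so its idf is log(3/3) = 0 and it carries no weight, while "boxing" is frequent in the first document and absent elsewhere, so it receives the largest weight there.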

In [63]:
import pandas as pd
import numpy as np
import time

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from scipy.stats import pearsonr

In [2]:
# None disables truncation; the old -1 sentinel was deprecated in pandas 1.0
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## Load Pickled File

In [6]:
new_df = pd.read_pickle('new_df.pkl')
new_df.head()

title,director,cast,listed_in,key_words
Norm of the North: King Sized Adventure,"[richardfinn, timmaltby]","[alanmarriott, andrewtoth, briandobson, colehoward, jennifercameron, jonathanholmes, leetockar, lisadurupt, mayakay, michaeldobson]","[children & family movies, comedies]","[planning, evil, archaeologist, first, awesome, wedding, polar, bear, king, must, take, back, stolen, artifact, grandfather]"
#realityhigh,[fernandolebrija],"[nestacooper, katewalsh, johnmichaelhiggins, keithpowers, aliciasanz, jakeborelli, kidink, youseferakat, rebekahgraf, annewinters, petergilroy, patrickdavis]",[comedies],"[cross, hairs, interest, longtime, crush, ex, nerdy, high, schooler, dani, finally, attracts, lands, social, media, celebrity]"
Automata,[gabeibáñez],"[antoniobanderas, dylanmcdermott, melaniegriffith, birgittehjortsørensen, robertforster, christacampbell, timmcinnerny, andynyman, davidryall]","[international movies, sci-fi & fantasy, thrillers]","[dystopian, future, discovers, global, conspiracy, violating, protocol, tech, company, investigates, insurance, adjuster, robot, killed]"
Fabrizio Copano: Solo pienso en mi,"[rodrigotoro, franciscoschultz]",[fabriziocopano],[stand-up comedy],"[stand, sperm, banks, set, family, whatsapp, groups, fabrizio, copano, takes, audience, participation, reflecting, next, level]"
Good People,[henrikrubengenz],"[jamesfranco, katehudson, tomwilkinson, omarsy, samspruell, annafriel, thomasarnold, oliverdimsdale, dianahardcastle, michaeljibson, diarmaidmurtagh]","[action & adventure, thrillers]","[find, neighbor, apartment, luck, recently, murdered, struggling, couple, believe, stash, money]"


## Bag of Words

Building a Netflix movie recommender system based on a Bag of Words model and a cosine similarity matrix.

In [7]:
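# Collect the feature columns before adding the new column,
# so that 'bag_of_words' is not joined into itself
feature_cols = list(new_df.columns)

# Each cell holds a list of tokens: join every list, then join the columns,
# giving one space-separated string per title.
# (Assigning to `row` inside iterrows() is not guaranteed to write back to the frame.)
new_df['bag_of_words'] = new_df[feature_cols].apply(
    lambda row: ' '.join(' '.join(row[col]) for col in feature_cols), axis=1
)

# Keep only the combined text column
new_df = new_df[['bag_of_words']]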

In [8]:
new_df.head()

title,bag_of_words
Norm of the North: King Sized Adventure,richardfinn timmaltby alanmarriott andrewtoth briandobson colehoward jennifercameron jonathanholmes leetockar lisadurupt mayakay michaeldobson children & family movies comedies planning evil archaeologist first awesome wedding polar bear king must take back stolen artifact grandfather
#realityhigh,fernandolebrija nestacooper katewalsh johnmichaelhiggins keithpowers aliciasanz jakeborelli kidink youseferakat rebekahgraf annewinters petergilroy patrickdavis comedies cross hairs interest longtime crush ex nerdy high schooler dani finally attracts lands social media celebrity
Automata,gabeibáñez antoniobanderas dylanmcdermott melaniegriffith birgittehjortsørensen robertforster christacampbell timmcinnerny andynyman davidryall international movies sci-fi & fantasy thrillers dystopian future discovers global conspiracy violating protocol tech company investigates insurance adjuster robot killed
Fabrizio Copano: Solo pienso en mi,rodrigotoro franciscoschultz fabriziocopano stand-up comedy stand sperm banks set family whatsapp groups fabrizio copano takes audience participation reflecting next level
Good People,henrikrubengenz jamesfranco katehudson tomwilkinson omarsy samspruell annafriel thomasarnold oliverdimsdale dianahardcastle michaeljibson diarmaidmurtagh action & adventure thrillers find neighbor apartment luck recently murdered struggling couple believe stash money


In [9]:
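# instantiating the vectorizer and generating the TF-IDF matrix
# (despite the count_* names, TfidfVectorizer produces TF-IDF weights, not raw counts)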
count = TfidfVectorizer()
count_matrix = count.fit_transform(new_df['bag_of_words'])

In [24]:
count_matrix

<3909x35149 sparse matrix of type '<class 'numpy.float64'>'
	with 106124 stored elements in Compressed Sparse Row format>

In [11]:
# creating a Series for the movie titles so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(new_df.index)
indices[:5]

0    Norm of the North: King Sized Adventure
1    #realityhigh                           
2    Automata                               
3    Fabrizio Copano: Solo pienso en mi     
4    Good People                            
Name: title, dtype: object

To look up a movie by name, we need a mapping from each title to its row position in the TF-IDF matrix. Calling `indices.to_dict()` gives a dictionary from row position to title, so we then invert it to obtain a dictionary from title to row position.

In [12]:
inddict = indices.to_dict()

In [13]:
inddict = dict((v,k) for k,v in inddict.items())
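
The same two steps can be sketched on a small hypothetical list of titles. The names `titles`, `position_to_title` and `title_to_row` are illustrative and do not appear in the notebook. The sketch also shows a caveat of inverting the dictionary: if two rows share a title, only the last position survives.

```python
# Hypothetical titles; "Rocky" is deliberately duplicated to show the caveat.
titles = ["Rocky", "Automata", "Good People", "Rocky"]

# Step 1: position -> title (what Series.to_dict() yields for a default index).
position_to_title = dict(enumerate(titles))

# Step 2: invert to title -> position; later duplicates overwrite earlier ones.
title_to_row = {title: row for row, title in position_to_title.items()}
print(title_to_row)  # {'Rocky': 3, 'Automata': 1, 'Good People': 2}

# Listing duplicated titles makes the overwrite visible before it causes surprises.
duplicates = sorted({t for t in titles if titles.count(t) > 1})
print(duplicates)  # ['Rocky']
```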

## Cosine Similarity

Cosine similarity is a metric that measures how similar two documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space.
![title](images/cosinesimilarity)
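
To illustrate the "irrespective of their size" property, here is a small pure-Python sketch. The helper name `cosine_sim_vectors` and the three vectors are assumptions made for the example. Returning 0.0 for an all-zero vector is a convention chosen here, not something the notebook specifies.

```python
import math

def cosine_sim_vectors(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: an empty document matches nothing
    return dot / (norm_a * norm_b)

short_doc = [1, 2, 0]    # hypothetical term weights
long_doc = [10, 20, 0]   # same direction, ten times the length
unrelated = [0, 0, 5]    # shares no terms with the other two

same_direction = cosine_sim_vectors(short_doc, long_doc)
no_overlap = cosine_sim_vectors(short_doc, unrelated)
print(same_direction, no_overlap)
```

The first pair scores approximately 1 because the vectors point in the same direction despite their different lengths. The second pair scores 0 because the documents have no terms in common.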

In [14]:
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.00331223, 0.00124324, ..., 0.        , 0.        ,
        0.00335971],
       [0.00331223, 1.        , 0.        , ..., 0.        , 0.01040268,
        0.00315705],
       [0.00124324, 0.        , 1.        , ..., 0.00222796, 0.00172473,
        0.00208643],
       ...,
       [0.        , 0.        , 0.00222796, ..., 1.        , 0.13772212,
        0.17456276],
       [0.        , 0.01040268, 0.00172473, ..., 0.13772212, 1.        ,
        0.12897354],
       [0.00335971, 0.00315705, 0.00208643, ..., 0.17456276, 0.12897354,
        1.        ]])

In [15]:
# Function to get the most similar movies
def recommend_cosine(title):
    # Row position of the query movie (idx avoids shadowing the built-in id)
    idx = inddict[title]
    # Get the pairwise similarity scores of all movies compared to this movie,
    # sort them in descending order and keep the top 10
    similarity_scores = list(enumerate(cosine_sim[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Position 0 is expected to be the movie itself (similarity 1.0), so skip it
    similarity_scores = similarity_scores[1:11]

    # Get the movies index
    movies_index = [i[0] for i in similarity_scores]

    # Return the top 10 most similar movies using iloc
    return list(new_df.iloc[movies_index].index)

In [68]:
recommend_cosine('Rocky')

['Rocky III',
 'Rocky II',
 'Rocky IV',
 'Rocky V',
 'The Bleeder',
 'Tunisian Victory',
 'The Blue Planet: A Natural History of the Oceans',
 'Ghost Rider',
 'Wheelman',
 'Spy Kids 3: Game Over']
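
Slicing the sorted list with `[1:11]` relies on the query movie ranking first. If another title had an identical bag of words, it would tie at 1.0 and the query itself might not be the entry that gets skipped. A hedged alternative is to exclude the query position explicitly, as in this sketch. The function name `top_k_similar` and the sample scores are hypothetical.

```python
import heapq

def top_k_similar(scores, query_index, k=10):
    """Return the positions of the k highest scores, excluding the query itself."""
    candidates = (
        (score, position)
        for position, score in enumerate(scores)
        if position != query_index
    )
    # nlargest avoids sorting the whole list when k is small.
    best = heapq.nlargest(k, candidates)
    return [position for score, position in best]

# Hypothetical similarity row: position 2 ties with the query at position 0.
scores = [1.0, 0.2, 1.0, 0.7, 0.0]
top_three = top_k_similar(scores, query_index=0, k=3)
print(top_three)  # [2, 3, 1]
```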

## Euclidean Distance

Similar items lie close to each other when plotted in n-dimensional space. So, we can calculate the distance between items and recommend the items with the smallest distance to the one the user selected.
![title](images/euclideandist)
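
One point worth checking is how this measure relates to cosine similarity on TF-IDF vectors. `TfidfVectorizer` scales each row to unit length by default (`norm='l2'`), and for unit vectors the squared Euclidean distance equals 2 − 2·cos. The sketch below verifies that identity on two made-up vectors. All function and variable names are illustrative.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-weight vectors, normalized the way TfidfVectorizer does by default.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 0.0, 2.0])

squared_distance = euclidean(a, b) ** 2
identity_rhs = 2 - 2 * cosine(a, b)
print(squared_distance, identity_rhs)  # the two values agree up to rounding
```

Because the distance is a decreasing function of the cosine on unit vectors, ranking by smallest Euclidean distance and ranking by largest cosine similarity produce the same order for L2-normalized rows.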

In [17]:
eucli_distance = euclidean_distances(count_matrix)

In [18]:
def recommend_euclidean_distance(Title):
    ind = inddict[Title]
    distance = list(enumerate(eucli_distance[ind]))
    distance = sorted(distance,key = lambda x : x[1])
    distance = distance[1:11]
    
    # Get the movies index
    movies_index = [i[0] for i in distance]
    
    #Return the top 10 most similar movies using iloc
    return list(new_df.iloc[movies_index].index)

In [19]:
recommend_euclidean_distance('Rocky')

['Rocky III',
 'Rocky II',
 'Rocky IV',
 'Rocky V',
 'The Bleeder',
 'Tunisian Victory',
 'The Blue Planet: A Natural History of the Oceans',
 'Ghost Rider',
 'Wheelman',
 'Spy Kids 3: Game Over']

## Pearson's Correlation

It tells us how strongly two items are correlated. The higher the correlation, the greater the similarity.
![title](images/pearson)
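
As a rough sketch of what the coefficient computes, the function below subtracts each vector's mean and then takes the cosine of the centered vectors, which is one standard way to write Pearson's r. The name `pearson` and the sample vectors are assumptions for illustration. Returning NaN for a constant vector is a choice made here.

```python
import math

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    covariance = sum(x * y for x, y in zip(centered_a, centered_b))
    spread = math.sqrt(sum(x * x for x in centered_a)) * math.sqrt(
        sum(y * y for y in centered_b)
    )
    if spread == 0:
        return float("nan")  # undefined when either vector is constant
    return covariance / spread

rising = [1.0, 2.0, 3.0]    # hypothetical feature vectors
scaled = [2.0, 4.0, 6.0]
falling = [3.0, 2.0, 1.0]

r_same = pearson(rising, scaled)
r_opposite = pearson(rising, falling)
print(r_same, r_opposite)
```

The result ranges from -1 (perfectly opposed) to 1 (perfectly aligned). On sparse, high-dimensional TF-IDF rows the means are close to zero, so the centering changes little and the ranking tends to resemble the cosine ranking.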

In [20]:
tfidf_matrix_array = count_matrix.toarray()

In [21]:
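def recommend_pearson(title):
    # Row position of the query movie (idx avoids shadowing the built-in id)
    idx = inddict[title]
    # Pearson correlation between the query row and every row of the dense TF-IDF matrix
    correlation = []
    for i in range(len(tfidf_matrix_array)):
        correlation.append(pearsonr(tfidf_matrix_array[idx], tfidf_matrix_array[i])[0])
    correlation = list(enumerate(correlation))
    # Sort in descending order and skip position 0, which is the movie itself
    sorted_corr = sorted(correlation, reverse=True, key=lambda x: x[1])[1:11]
    movies_index = [i[0] for i in sorted_corr]
    return list(new_df.iloc[movies_index].index)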

In [22]:
recommend_pearson('Rocky')

['Rocky III',
 'Rocky II',
 'Rocky IV',
 'Rocky V',
 'The Bleeder',
 'Tunisian Victory',
 'The Blue Planet: A Natural History of the Oceans',
 'Ghost Rider',
 'Wheelman',
 'Spy Kids 3: Game Over']

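We computed matching scores using cosine similarity, Euclidean distance and Pearson correlation.

Our query was the movie Rocky, and all three measures returned the same list, led by the four Rocky sequels. The agreement between cosine similarity and Euclidean distance is expected: TfidfVectorizer L2-normalizes each row by default, and on unit-length vectors the two measures produce the same ranking. The results suggest that the recommendation engine is working as intended.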

## Conclusion
A major drawback of content based filtering is that it is limited to recommending items of the same type. It will not recommend items that are unlike what the user has watched or liked in the past. So, if a user has watched Rocky, the system will recommend only movies similar to Rocky. It's a very narrow way of building an engine. To improve on this type of system, we need an algorithm that recommends items based not just on their content but on the behaviour of users as well.