## Content Based Filtering

Content-based recommender systems use a user's past activity, such as browsing or viewing history, to recommend new items. They use item features such as overview, genres, cast, and crew to measure the similarity between movies and build an item similarity matrix. Given a movie the user previously liked, the system recommends the movies most similar to it.


Movie features such as the overview and genres are text, so we must convert them into vectors before we can compute similarity. We use Term Frequency-Inverse Document Frequency (TF-IDF) to turn each document into a vector whose entries reflect how important each term is. TF is the relative frequency of a term within a document, and IDF is the inverse of the document frequency, that is, of the number of documents that contain the term (usually log-scaled), so terms that occur in many documents receive a lower weight. Once we have the TF-IDF vectors, we use the Vector Space Model to compute the proximity between them. Common measures of similarity between two vectors are cosine similarity, the dot product, and Euclidean distance. Finally, we rank and sort the movies according to the resulting similarity matrix.
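
To make the TF-IDF and cosine-similarity steps concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The function names (`tf_idf`, `cosine`) and the toy overviews are illustrative assumptions. The sketch uses the textbook formula idf = log(N / df), whereas scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalisation by default, so the actual numbers in this notebook will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up overviews, not taken from the dataset.
toy_overviews = [
    "a young wizard attends a school of magic",
    "a wizard returns to the school of magic",
    "a detective hunts a killer in the city",
]
vectors = tf_idf(toy_overviews)
wizard_vs_wizard = cosine(vectors[0], vectors[1])
wizard_vs_detective = cosine(vectors[0], vectors[2])
print(wizard_vs_wizard, wizard_vs_detective)
```

The two wizard overviews share several weighted terms, so their similarity is positive. The first and third overviews share only the word "a", which appears in every document and therefore gets an IDF of zero.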

This recommender system can be divided into two parts: a recommender based on the movie description (overview) and a recommender based on cast and crew.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Read the movies metadata and the links file. Convert the 'id' column of the movies data to a numeric type so that it can be matched against the 'tmdbId' values from the links file. Values that cannot be parsed become NaN.

In [2]:
data = pd.read_csv('movies_metadata.csv',low_memory=False)
links = pd.read_csv('links_small.csv',low_memory=False)
links = links[links['tmdbId'].notna()]['tmdbId']
data = data[['id','title','overview',"genres"]]
data['id'] = pd.to_numeric(data['id'],errors="coerce")
data[data['id'].isna()]

Unnamed: 0,id,title,overview,genres
19730,,,Released,"[{'name': 'Carousel Productions', 'id': 11176}..."
29503,,,Released,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go..."
35587,,,Released,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam..."


Drop the rows that have a null value (NaN) in 'id', 'title', or 'overview'.

In [3]:
data.dropna(subset=["id","title","overview"],inplace=True)

In [4]:
df = data.copy()
df.head()

Unnamed: 0,id,title,overview,genres
0,862.0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844.0,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602.0,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357.0,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862.0,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]"


To scale down the data, we keep only the movies whose 'id' appears in the links file.

In [5]:
df = df[df['id'].isin(links)]

#### Recommender Based on Movie Description or Overview

In [6]:
df[df['overview'].duplicated()]['overview']

1465     East-Berlin, 1961, shortly after the erection ...
1613                                    No overview found.
7327                                    No overview found.
8472     Adventurer Allan Quartermain leads an expediti...
8959     Director Michael Apted revisits the same group...
9165     Hitman Jef Costello is a perfectionist who alw...
9327     In feudal India, a warrior (Khan) who renounce...
11444    Wilbur the pig is scared of the end of the sea...
15074    British nurse Catherine Barkley (Helen Hayes) ...
15765    Since women are banned from soccer matches, Ir...
21854    More than two decades after catapulting to sta...
23044    Former Danish servicemen Lars and Jimmy are th...
23534    Winter, 1915. Confined by her family to an asy...
24844    As an ex-gambler teaches a hot-shot college ki...
26160    Nick Carraway, a young Midwesterner now living...
26625    Count de Chagnie has discovered Christine's si...
28860    In Zola's Paris, an ingenue arrives at a tony .

When we check the 'overview' column, we find rows with the placeholder value 'No overview found.'. We replace these with NaN so that they can be dropped later.

In [7]:
df['overview'] = df['overview'].replace('No overview found.', np.nan)
print(df.shape)

(9087, 4)


In [8]:
df[df['id'].duplicated()]

Unnamed: 0,id,title,overview,genres
1465,105045.0,The Promise,"East-Berlin, 1961, shortly after the erection ...","[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n..."
9165,5511.0,Le Samouraï,Hitman Jef Costello is a perfectionist who alw...,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name..."
9327,23305.0,The Warrior,"In feudal India, a warrior (Khan) who renounce...","[{'id': 12, 'name': 'Adventure'}, {'id': 16, '..."
15074,22649.0,A Farewell to Arms,British nurse Catherine Barkley (Helen Hayes) ...,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n..."
15765,13209.0,Offside,"Since women are banned from soccer matches, Ir...","[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name..."
21854,152795.0,The Congress,More than two decades after catapulting to sta...,"[{'id': 18, 'name': 'Drama'}, {'id': 878, 'nam..."
23044,25541.0,Brotherhood,Former Danish servicemen Lars and Jimmy are th...,"[{'id': 18, 'name': 'Drama'}]"
23534,110428.0,Camille Claudel 1915,"Winter, 1915. Confined by her family to an asy...","[{'id': 18, 'name': 'Drama'}]"
24844,11115.0,Deal,As an ex-gambler teaches a hot-shot college ki...,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
26625,69234.0,The Phantom of the Opera,Count de Chagnie has discovered Christine's si...,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name..."


In [9]:
df.drop_duplicates(['id'],inplace=True)
df.dropna(subset=["overview"],inplace=True)
df = df.reset_index(drop=True)

In [10]:
df.isna().sum()

id          0
title       0
overview    0
genres      0
dtype: int64

The data now has no missing values and no duplicate ids, so it is ready for building the model.

Use TF-IDF (unigrams and bigrams) to convert the text in the 'overview' column to its vector-space representation.

In [11]:
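# Map each title to its row position in df (used to look up a movie by name).
mapping = pd.Series(df.index,index = df['title'])
# min_df=1 keeps every term, like the original min_df=0, which newer scikit-learn releases reject.
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
# Despite its name, sim_matrix holds the TF-IDF document-term matrix, not similarities.
sim_matrix = tf.fit_transform(df['overview'])
sim_matrix.shape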

(9067, 244199)

In [12]:
print (mapping[mapping.index.str.startswith('Harry Potter')])

title
Harry Potter and the Philosopher's Stone        3798
Harry Potter and the Chamber of Secrets         4310
Harry Potter and the Prisoner of Azkaban        5381
Harry Potter and the Goblet of Fire             6269
Harry Potter and the Order of the Phoenix       6709
Harry Potter and the Half-Blood Prince          7246
Harry Potter and the Deathly Hallows: Part 1    7635
Harry Potter and the Deathly Hallows: Part 2    7807
dtype: int64


In [13]:
cosine_sim = linear_kernel(sim_matrix, sim_matrix)
cosine_sim[0]

array([1.        , 0.00742716, 0.        , ..., 0.        , 0.        ,
       0.00496914])
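
`linear_kernel` computes plain dot products. That matches cosine similarity here because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), so every vector already has unit length. A minimal pure-Python sketch with made-up vectors and hypothetical helper names illustrates the identity:

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def cosine(u, v):
    """Cosine similarity computed from the raw, unnormalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up vectors with norms 5 and 3.
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

# After normalisation the dot product and the cosine similarity coincide.
u_hat, v_hat = l2_normalize(u), l2_normalize(v)
print(dot(u_hat, v_hat), cosine(u, v))
```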

Function to find the recommended movies given a movie name as input. It looks up the similarity of the input movie to every other movie and sorts the scores from highest to lowest. It returns the 19 most similar movies, because the slice `[1:20]` skips the first entry, which is the movie itself.

In [14]:
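def recommended_movies(movie_name):
    # Row position of the query movie in df and in cosine_sim.
    index = mapping[movie_name]
    sim_score = list(enumerate(cosine_sim[index]))
    # Sort by similarity, highest first.
    sim_score = sorted(sim_score, key=lambda x: x[1], reverse=True)
    # Skip position 0 (the movie itself) and keep the next 19.
    sim_score = sim_score[1:20]
    movie_indices = [i[0] for i in sim_score]
    return (df['title'].iloc[movie_indices])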
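
The function above sorts the whole similarity row and assumes the query movie lands at position 0 of the sorted list. A hedged alternative sketch, using a small made-up similarity matrix and a hypothetical `top_k_similar` helper, excludes the query by index and uses `heapq.nlargest` so that only the top `k` entries need to be ordered:

```python
import heapq

def top_k_similar(sim_row, query_index, k):
    """Return the indices of the k highest-scoring items, excluding the query."""
    candidates = (
        (score, idx) for idx, score in enumerate(sim_row) if idx != query_index
    )
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [idx for _, idx in best]

# Toy 4x4 similarity matrix (symmetric, ones on the diagonal); values are made up.
toy_sim = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.0, 0.5, 1.0],
]
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

neighbours = top_k_similar(toy_sim[0], query_index=0, k=2)
print([toy_titles[i] for i in neighbours])
```

Excluding by index avoids dropping a different movie when another row ties with the query at a similarity of 1.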

In [15]:
recommended_movies("Harry Potter and the Philosopher's Stone")

6269             Harry Potter and the Goblet of Fire
3249                        Harry, He's Here To Help
3534                                   The Dead Pool
5381        Harry Potter and the Prisoner of Azkaban
4310         Harry Potter and the Chamber of Secrets
6709       Harry Potter and the Order of the Phoenix
7635    Harry Potter and the Deathly Hallows: Part 1
3941                                    The Good Son
8160                      Rebecca of Sunnybrook Farm
1196                    Turbo: A Power Rangers Movie
3850                                       Clockwise
7807    Harry Potter and the Deathly Hallows: Part 2
8593                                  Once My Mother
6597                                      Epic Movie
7246          Harry Potter and the Half-Blood Prince
1692                                 Sixteen Candles
8806                                Midnight Special
5758                                 Harry and Tonto
3425                                      The 

In [16]:
recommended_movies('The Godfather')

969      The Godfather: Part II
29               Shanghai Triad
5658                       Fury
3501                       Made
2404             American Movie
1575    The Godfather: Part III
4213                    8 Women
3707                   3 Ninjas
3601              Harlem Nights
8795              Run All Night
2151              Summer of Sam
6387                Renaissance
3280          Jaws: The Revenge
5397            The Kid Brother
2184           The Color Purple
616                     Thinner
227              The Jerky Boys
3599            Family Business
5584                     Eulogy
Name: title, dtype: object

In [17]:
recommended_movies('The Dark Knight')

7917                      The Dark Knight Rises
132                              Batman Forever
1109                             Batman Returns
523                                      Batman
7552                 Batman: Under the Red Hood
8212    Batman: The Dark Knight Returns, Part 2
7887                           Batman: Year One
2688                                        JFK
8150    Batman: The Dark Knight Returns, Part 1
2571               Batman: Mask of the Phantasm
7231                  The File on Thelma Jordon
4481                                      Q & A
7919         Sherlock Holmes: A Game of Shadows
6133                              Batman Begins
1131                   Night Falls on Manhattan
5502                            To End All Wars
7332                        Law Abiding Citizen
2885                              Flying Tigers
8662                          The Young Savages
Name: title, dtype: object

#### Recommender Based on Cast and Crew

In [18]:
credits = pd.read_csv('credits.csv',low_memory=False)
credits

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


In [19]:
credits[credits.duplicated()].shape

(37, 3)

Drop duplicates from the credits file.

In [20]:
credits.drop_duplicates(['id'],inplace=True)
credits.reset_index(drop=True,inplace=True)
credits.shape

(45432, 3)

Merge the credits file with the movies file on the 'id' column.

In [21]:
df1 = data.copy()
df1 = df1[["id","title","genres"]]
df1.drop_duplicates(inplace=True)
df1 = df1.merge(credits,on="id")
df1 = df1[df1['id'].isin(links)]
df1 = df1.reset_index(drop=True)
df1

Unnamed: 0,id,title,genres,cast,crew
0,862.0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,8844.0,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,15602.0,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,31357.0,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,11862.0,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."
...,...,...,...,...,...
9065,159550.0,The Last Brickmaker in America,"[{'id': 18, 'name': 'Drama'}]","[{'cast_id': 1, 'character': 'Henry Cobb', 'cr...","[{'credit_id': '544475aac3a36819fb000578', 'de..."
9066,392572.0,Rustom,"[{'id': 53, 'name': 'Thriller'}, {'id': 10749,...","[{'cast_id': 0, 'character': 'Rustom Pavri', '...","[{'credit_id': '5951baf692514129c4016600', 'de..."
9067,402672.0,Mohenjo Daro,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...","[{'cast_id': 0, 'character': 'Sarman', 'credit...","[{'credit_id': '57cd5d3592514179d50018e8', 'de..."
9068,315011.0,Shin Godzilla,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[{'cast_id': 4, 'character': 'Rando Yaguchi : ...","[{'credit_id': '560892fa92514177550018b2', 'de..."


Many viewers choose movies based on the actors or the director. So we extract the director from the 'crew' column and combine it with the movie's cast. Together these form the item features used for this recommender.

In [22]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [23]:
df1['director'] = df1['crew'].apply(literal_eval).apply(get_director)
df1['cast'] = df1['cast'].apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
# Keep at most the first three cast members (slicing is safe on shorter lists).
df1['cast'] = df1['cast'].apply(lambda x: x[:3])
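
The 'cast' and 'crew' columns are stored as strings that look like Python lists of dictionaries, which is why `literal_eval` is applied before the director is extracted. A minimal sketch on a made-up crew string (the names and the `find_director` helper are illustrative, not taken from the dataset):

```python
from ast import literal_eval

def find_director(crew):
    """Return the name of the first crew member whose job is 'Director', or None."""
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None

# A made-up value in the same shape as the strings in the 'crew' column.
raw_crew = "[{'job': 'Producer', 'name': 'Jane Doe'}, {'job': 'Director', 'name': 'John Smith'}]"

# literal_eval parses Python literals only, turning the string into a list of dicts.
crew = literal_eval(raw_crew)
director = find_director(crew)
print(type(crew).__name__, director)
```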

In [24]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [25]:
df1['director'] = df1['director'].apply(clean_data)
df1['cast'] = df1['cast'].apply(clean_data)

In [26]:
df1['combined_feat'] = df1["cast"].str.join(" ") + " " + df1["director"]   
df1 = df1.reset_index(drop=True)

In this case we use a Count Vectorizer instead of TF-IDF. The features are names rather than free text, and every name should carry the same weight, so down-weighting names that appear in many movies (as IDF would) is not desirable.

In [27]:
# min_df=1 keeps every term, like the original min_df=0, which newer scikit-learn releases reject.
model = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
model_matrix = model.fit_transform(df1['combined_feat'])
model_matrix

<9070x40440 sparse matrix of type '<class 'numpy.int64'>'
	with 64706 stored elements in Compressed Sparse Row format>
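
Removing the spaces inside each name before vectorising keeps a person as a single token, so two different people who share a first name do not match on it. A rough pure-Python sketch of count-based cosine similarity on made-up feature strings follows. The helper `count_cosine` is hypothetical, and it ignores the bigrams and stop-word filtering that the `CountVectorizer` call uses.

```python
import math
from collections import Counter

def count_cosine(text_a, text_b):
    """Cosine similarity between raw token-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up 'combined_feat' strings: three cast members followed by the director.
film_1 = "actorone actortwo actorthree directorx"
film_2 = "actorone actorfour actorfive directorx"
film_3 = "actorsix actorseven actoreight directory"

print(count_cosine(film_1, film_2))  # shares one actor and the director
print(count_cosine(film_1, film_3))  # shares nobody
```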

Compute the cosine similarity for the matrix and recommend the top 10 similar movies.

In [28]:
# Note: this cell rebinds cosine_sim and recommended_movies from the overview-based recommender.
cosine_sim = cosine_similarity(model_matrix)

def get_title_from_index(index):
    return df1[df1.index == index]["title"].values[0]

def get_index_from_title(title):
    return df1[df1.title == title].index.values.astype(int)[0]

def recommended_movies(movie_user_likes):
    movie_index = get_index_from_title(movie_user_likes)
    similar_movies = list(enumerate(cosine_sim[movie_index]))

    # Sort by similarity (highest first) and skip position 0, assumed to be the movie itself.
    sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:]
    print("Top 10 similar movies to "+movie_user_likes+" are:\n")
    for element in sorted_similar_movies[:10]:
        print(get_title_from_index(element[0]))


In [29]:
recommended_movies('The Dark Knight')

Top 10 similar movies to The Dark Knight are:

Batman Begins
The Prestige
The Dark Knight Rises
The Man Who Would Be King
The Muppet Christmas Carol
Swing Kids
Blame It on Rio
Velvet Goldmine
Mona Lisa
Little Voice


In [30]:
recommended_movies('Deadpool')

Top 10 similar movies to Deadpool are:

National Lampoon’s Van Wilder
The Amityville Horror
Waiting...
Just Friends
Smokin' Aces
Definitely, Maybe
Chaos Theory
The Proposal
Buried
Green Lantern


In [31]:
recommended_movies('Harry Potter and the Philosopher\'s Stone')

Top 10 similar movies to Harry Potter and the Philosopher's Stone are:

Harry Potter and the Chamber of Secrets
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Goblet of Fire
Harry Potter and the Order of the Phoenix
Harry Potter and the Half-Blood Prince
Harry Potter and the Deathly Hallows: Part 2
Harry Potter and the Deathly Hallows: Part 1
Nine Months
Mrs. Doubtfire
Home Alone


The recommendations for all three movies share an actor or director with the input movie, which suggests that the model works as intended.