# Content-Based Filtering

In this tutorial, a simple approach to content-based filtering is applied to the [TMDb movie dataset](https://www.themoviedb.org/). In particular, we will explore feature engineering and inference as they relate to content-based filtering.

![](https://www.naukri.com/learning/articles/wp-content/uploads/sites/11/2022/01/Content-Based-Filtering.png)

[Tutorial Source](https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system)



## Imports

In [1]:
import pandas as pd
import numpy as np


## Dataset

In this demo, we will be using the TMDB 5000 Movie Dataset, which contains metadata for ~5000 movies from [TMDb](https://www.themoviedb.org/).

The `tmdb_5000_credits.csv` contains 
- movie_id
- title
- cast
- crew

whereas the `tmdb_5000_movies.csv` contains further details about the movie such as 
- budget
- keywords
- overview

Let's start by loading the data.

In [2]:
credits_df = pd.read_csv('/ssd003/projects/aieng/public/recsys_datasets/tmdb/tmdb_5000_credits.csv')
movies_df = pd.read_csv('/ssd003/projects/aieng/public/recsys_datasets/tmdb/tmdb_5000_movies.csv')

In [3]:
credits_df.head(5)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [4]:
movies_df.head(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [5]:
# Merge the data
credits_df.rename(columns = {'movie_id':'id'}, inplace = True)
df = movies_df.merge(credits_df, on='id', suffixes=('', '_drop'))
# Drop the duplicate columns
df.drop([col for col in df.columns if 'drop' in col], axis=1, inplace=True)

In [6]:
df.head(4)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."


## Plot-based Recommender

First, we will recommend movies to users based on their plot descriptions; movies with similar plots will rank higher. To compare plot descriptions, we will use the cosine similarity measure. 

In [7]:
df['overview']

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4798    El Mariachi just wants to play his guitar and ...
4799    A newlywed couple's honeymoon is upended by th...
4800    "Signed, Sealed, Delivered" introduces a dedic...
4801    When ambitious New York attorney Sam is sent t...
4802    Ever since the second grade when he first saw ...
Name: overview, Length: 4803, dtype: object

We now need to convert each overview into a vector by computing its Term Frequency-Inverse Document Frequency (TF-IDF) scores.

TF-IDF is a statistical measure that evaluates the relevancy of a word to a document, in a collection of documents.

Specifically, the first part, "Term Frequency", is the relative frequency of a particular word in a document. It can be the raw count (the number of times the word appears in the document) or the count scaled by the length of the document (the raw count of occurrences of the word divided by the total number of words in the document).

The second part, "Inverse Document Frequency" measures how common a word is among the collection of documents or corpus. The formula is below where *t* is the word and *N* is the number of documents *d* in the collection *D*.

![](https://ecm.capitalone.com/WCM/tech/tf-idf-1.png)

The IDF is the logarithm of the number of documents in the corpus divided by the number of documents containing the word *t*. Since a word may not appear in the corpus at all, which would cause a division by zero, it is common to add 1 to the document count in the denominator, as follows:

![](https://ecm.capitalone.com/WCM/tech/tf-idf-2.png)

IDF is needed to help correct for commonly occurring words such as "of", "the", and "as". The inverse document frequency reduces the weight of these terms and allows infrequent words to have a higher impact.

To get the final TF-IDF value, we multiply these two terms together, 

![](https://ecm.capitalone.com/WCM/tech/tf-idf-3.png)

The higher the value, the more important or relevant the term is.
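As a rough illustration of the definitions above, the sketch below computes TF-IDF by hand for a toy corpus. The corpus, the whitespace tokenizer, and the helper names are all assumptions made for this example, and the IDF uses the add-one variant described in the text. scikit-learn's `TfidfVectorizer` applies a slightly different smoothing and normalizes each row, so its numbers will not match these.

```python
import math

# Toy corpus (made up for illustration, not taken from the TMDB data)
toy_corpus = [
    "the hero saves the city",
    "the villain attacks the city",
    "the hero returns",
    "a lawyer wins a case",
]

def term_frequency(term, document):
    # Raw count of the term divided by the number of words in the document
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    # log(N / (1 + number of documents containing the term))
    n_containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / (1 + n_containing))

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Score three words from the first document
for term in ("the", "hero", "saves"):
    print(term, round(tf_idf(term, toy_corpus[0], toy_corpus), 4))
```

In this toy corpus, "the" appears in three of the four documents and receives a weight of zero, while "saves", which appears in only one document, receives the highest weight of the three.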

From this measure we will compute a matrix in which each column represents a word in the overview corpus and each row represents a movie. In this way, each row is a vector of the overview of each movie which we can compare.

We can use the built-in `TfidfVectorizer` class from scikit-learn to produce this matrix.

In [8]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object
# Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

In [9]:
# Replace NaN values with an empty string
df['overview'] = df['overview'].fillna('')

In [10]:
# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

In [11]:
tfidf_matrix.shape

(4803, 20978)

From the shape of the matrix, we can see that over 20,000 distinct words (after removing stop words) are used to describe the 4,803 movies in the dataset.

We can now compute a cosine similarity score. This similarity score is a measure of closeness between two vectors. The higher the cosine value, the closer the vectors. 

![](https://datascience-enthusiast.com/figures/cosine_sim.png)


We can use `linear_kernel` from scikit-learn's `sklearn.metrics.pairwise` module to compute the dot product of each row of the matrix with every other row. Because `TfidfVectorizer` L2-normalizes each row by default, these dot products are equal to the cosine similarities. `linear_kernel` is used instead of `cosine_similarity` because, although both produce the same result here, it skips the normalization step and is therefore faster, which is preferable when working with large amounts of data.
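The equivalence between the dot product and cosine similarity only holds for unit-length vectors. The following minimal sketch checks this on two made-up vectors using plain Python; the vectors and helper names are illustrative assumptions and are not part of the notebook.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Scale the vector so that its Euclidean length is 1
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))

# The plain dot product differs from the cosine similarity,
# but the dot product of the normalized vectors matches it.
print(dot(u, v), cos_uv, dot_normalized)
```

If the rows were not normalized (for example, raw counts), a plain dot product would no longer be a cosine similarity, which is one reason to keep the vectorizer's default normalization in mind when choosing between the two functions.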

In [12]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Since we want to define a function that takes in a movie title and returns similar movies, we first need a reverse mapping from movie titles to DataFrame indices, that is, an index for each movie title.

In [13]:
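# Construct a reverse mapping from movie titles to DataFrame indices
# Note: drop_duplicates() compares the Series values (the row indices), which are
# already unique, so titles that occur more than once remain in the mapping
indices = pd.Series(df.index, index=df['title']).drop_duplicates()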

In [14]:
indices

title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

Now, we can start writing out our function `get_recommendations`.
This function 
- takes in the movie title as an input and gets its index
- finds the cosine similarity scores between that movie and all other movies
- converts those scores to a list of tuples where the first element is the movie's index and the second element is the similarity score
- sorts this list in descending order of similarity and keeps the 10 most similar movies, skipping the query movie itself

In [15]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim, plot_based=True):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    print('Input Movie:')
    if not plot_based:
        print(pd.DataFrame(data=df.loc[idx, ['title', 'genres', 'director', 'cast', 'keywords']]))
        result = pd.DataFrame(data={'Title': df['title'].iloc[movie_indices], 'Genres': df['genres'].iloc[movie_indices], 'Director': df['director'].iloc[movie_indices], 'Cast': df['cast'].iloc[movie_indices], 'Keywords': df['keywords'].iloc[movie_indices], 'Similarity Score': [i[1] for i in sim_scores]})
    else:
        print(pd.DataFrame(data=df.loc[idx, ['original_title', 'genres', 'overview']]))
        result = pd.DataFrame(data={'Title': df['title'].iloc[movie_indices], 'Genres': df['genres'].iloc[movie_indices], 'Overview': df['overview'].iloc[movie_indices], 'Similarity Score': [i[1] for i in sim_scores]})
    
    return result

In [16]:
get_recommendations('The Dark Knight Rises')

Input Movie:
                                                                3
original_title                              The Dark Knight Rises
genres          [{"id": 28, "name": "Action"}, {"id": 80, "nam...
overview        Following the death of District Attorney Harve...


Unnamed: 0,Title,Genres,Overview,Similarity Score
65,The Dark Knight,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 28, ""name...",Batman raises the stakes in his war on crime. ...,0.301512
299,Batman Forever,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Dark Knight of Gotham City confronts a das...,0.29857
428,Batman Returns,"[{""id"": 28, ""name"": ""Action""}, {""id"": 14, ""nam...","Having defeated the Joker, Batman now faces th...",0.287851
1359,Batman,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 28, ""na...",The Dark Knight of Gotham City begins his war ...,0.264461
3854,"Batman: The Dark Knight Returns, Part 2","[{""id"": 28, ""name"": ""Action""}, {""id"": 16, ""nam...",Batman has stopped the reign of terror that Th...,0.18545
119,Batman Begins,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","Driven by tragedy, billionaire Bruce Wayne ded...",0.167996
2507,Slow Burn,"[{""id"": 9648, ""name"": ""Mystery""}, {""id"": 80, ""...",A district attorney (Ray Liotta) is involved i...,0.166829
9,Batman v Superman: Dawn of Justice,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Fearing the actions of a god-like Super Hero l...,0.13374
1181,JFK,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 53, ""name...",New Orleans District Attorney Jim Garrison dis...,0.132197
210,Batman & Robin,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",Along with crime-fighting partner Robin and ne...,0.130455


In [17]:
get_recommendations('The Avengers')

Input Movie:
                                                               16
original_title                                       The Avengers
genres          [{"id": 878, "name": "Science Fiction"}, {"id"...
overview        When an unexpected enemy emerges and threatens...


Unnamed: 0,Title,Genres,Overview,Similarity Score
7,Avengers: Age of Ultron,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",When Tony Stark tries to jumpstart a dormant p...,0.146374
3144,Plastic,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 28, ""name...",Sam &amp; Fordy run a credit card fraud scheme...,0.122791
1715,Timecop,"[{""id"": 53, ""name"": ""Thriller""}, {""id"": 878, ""...",An officer for a security agency that regulate...,0.110385
4124,This Thing of Ours,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 28, ""name...","Using the Internet and global satellites, a gr...",0.107529
3311,Thank You for Smoking,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",The chief spokesperson and lobbyist Nick Naylo...,0.106203
3033,The Corruptor,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","Danny is a young cop partnered with Nick, a se...",0.097598
588,Wall Street: Money Never Sleeps,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",As the global economy teeters on the brink of ...,0.094084
2136,Team America: World Police,"[{""id"": 10402, ""name"": ""Music""}, {""id"": 12, ""n...",Team America World Police follows an internati...,0.092244
1468,The Fountain,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 12, ""name...","Spanning over one thousand years, and three pa...",0.086643
1286,Snowpiercer,"[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",In a future where a failed global-warming expe...,0.086189


In [18]:
get_recommendations('Legally Blonde')

Input Movie:
                                                             2316
original_title                                     Legally Blonde
genres                             [{"id": 35, "name": "Comedy"}]
overview        Elle Woods has it all. She's the president of ...


Unnamed: 0,Title,Genres,Overview,Similarity Score
4418,Steppin: The Movie,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",Every college campus has its rivalries and UTS...,0.152268
1833,"Legally Blonde 2: Red, White & Blonde","[{""id"": 35, ""name"": ""Comedy""}]","After Elle Woods, the eternally perky, fashion...",0.137402
2005,The Longshots,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10751, ""n...","The true story of Jasmine Plummer who, at the ...",0.112638
3423,Dressed to Kill,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 9648, ""n...","A mysterious, tall, blonde woman, wearing sung...",0.108434
3478,College,"[{""id"": 35, ""name"": ""Comedy""}]",A wild weekend is in store for three high scho...,0.096509
3652,Decoys,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 878, ""na...",Luke and Roger are just another couple of coll...,0.088048
1385,Neighbors 2: Sorority Rising,"[{""id"": 35, ""name"": ""Comedy""}]",A sorority moves in next door to the home of M...,0.085667
3175,Black Christmas,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 53, ""nam...",An escaped maniac returns to his childhood hom...,0.083382
4229,The To Do List,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",Feeling pressured to become more sexually expe...,0.078661
1605,The Cabin in the Woods,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 53, ""nam...",Five college friends spend the weekend at a re...,0.070885


From these results, we can see that many of the top-ranking movies are prequels or sequels to the query movie, as in the results for "The Dark Knight Rises" and "The Avengers". The genres of the results for the first two examples seem similar to the query: action for the first and drama/thriller for the second. However, the third example, "Legally Blonde", is a comedy, yet several of its results are labelled as horror, which is a quite different genre.

Let's see if we can improve these results.

## Credits, Genres and Keywords Based Recommender

The quality of our recommender system can be improved by the use of better metadata. In this section, we will use the director, top 3 actors, genres, and plot keywords to find similar movies. 

Currently, the data is in the form of stringified lists, which need to be converted into usable structures. We can do this using the `ast` module's `literal_eval` function, which evaluates a string containing a Python literal and parses the stringified features into their corresponding Python objects.

In [19]:
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)
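To make the parsing step concrete, here is a small sketch on a single stringified value written in the same style as the `genres` column. The sample string is typed out by hand for illustration. Since the values are JSON-style text, `json.loads` is shown as a possible alternative; it is not what the notebook uses.

```python
import json
from ast import literal_eval

# A stringified list of dictionaries, in the style of the genres column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into a real list of dictionaries
parsed = literal_eval(raw)
print(type(parsed).__name__, [genre["name"] for genre in parsed])

# For this value, parsing the string as JSON gives the same structure
parsed_json = json.loads(raw)
print(parsed == parsed_json)
```

`literal_eval` works here because lists, dictionaries, strings, and integers are written the same way in JSON and in Python. JSON-only values such as `null` or `true` are not Python literals, so a column containing them would need `json.loads` instead.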

In [20]:
df.head(4)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.avatarmovie.com/,19995,"[{'id': 1463, 'name': 'culture clash'}, {'id':...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[{'cast_id': 242, 'character': 'Jake Sully', '...","[{'credit_id': '52fe48009251416c750aca23', 'de..."
1,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",http://disney.go.com/disneypictures/pirates/,285,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,"[{'cast_id': 4, 'character': 'Captain Jack Spa...","[{'credit_id': '52fe4232c3a36847f800b579', 'de..."
2,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '54805967c3a36829b5002c41', 'de..."
3,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",http://www.thedarkknightrises.com/,49026,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,"[{'cast_id': 2, 'character': 'Bruce Wayne / Ba...","[{'credit_id': '52fe4781c3a36847f81398c3', 'de..."


We can define some functions that will help us extract the features we need.

In [21]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [22]:
# Return the first 3 elements of the list, or the entire list if it has 3 or fewer elements.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    # Return empty list in case of missing/malformed data
    return []

In [23]:
# Define new director, cast, genres and keywords features that are in a suitable form.
df['director'] = df['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(get_list)

In [24]:
df[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"


We now need to clean up our data. This involves converting names and keywords to lowercase and removing the spaces within them. This is done so that our vectorizer does not treat the "Chris" in Chris Evans and the "Chris" in Chris Pine as the same token.

In [25]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
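The effect of removing spaces can be seen in a short sketch. It uses a simple whitespace tokenizer as a stand-in for the vectorizer (an assumption made for brevity; the real tokenizer differs in its details), and the helper names are hypothetical.

```python
def tokens(names):
    # Lowercase every name and split it on whitespace, as a rough stand-in for a vectorizer
    return {token for name in names for token in name.lower().split()}

def squash(names):
    # Same cleaning idea as clean_data: lowercase and remove spaces
    return [name.replace(" ", "").lower() for name in names]

cast_a = ["Chris Evans"]
cast_b = ["Chris Pine"]

# Without cleaning, the two different actors share the token 'chris'
shared_raw = tokens(cast_a) & tokens(cast_b)
print(shared_raw)  # {'chris'}

# After cleaning, each full name is a single token and nothing is shared
shared_clean = tokens(squash(cast_a)) & tokens(squash(cast_b))
print(shared_clean)  # set()
```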

In [26]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

clean_df = df.copy()
for feature in features:
    clean_df[feature] = clean_df[feature].apply(clean_data)

We can now define an aggregation function that will join all of the metadata into one string which will be the input to our vectorizer. 

In [27]:
def aggregate(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
clean_df['aggregation'] = clean_df.apply(aggregate, axis=1)

In [28]:
clean_df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director,aggregation
0,237000000,"[action, adventure, fantasy]",http://www.avatarmovie.com/,19995,"[cultureclash, future, spacewar]",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[samworthington, zoesaldana, sigourneyweaver]","[{'credit_id': '52fe48009251416c750aca23', 'de...",jamescameron,cultureclash future spacewar samworthington zo...
1,300000000,"[adventure, fantasy, action]",http://disney.go.com/disneypictures/pirates/,285,"[ocean, drugabuse, exoticisland]",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,"[johnnydepp, orlandobloom, keiraknightley]","[{'credit_id': '52fe4232c3a36847f800b579', 'de...",goreverbinski,ocean drugabuse exoticisland johnnydepp orland...
2,245000000,"[action, adventure, crime]",http://www.sonypictures.com/movies/spectre/,206647,"[spy, basedonnovel, secretagent]",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,"[danielcraig, christophwaltz, léaseydoux]","[{'credit_id': '54805967c3a36829b5002c41', 'de...",sammendes,spy basedonnovel secretagent danielcraig chris...


Our next steps are similar to those for the plot-based recommender. However, we now use `CountVectorizer` instead of TF-IDF.

`CountVectorizer` converts a collection of text documents (in this case, our aggregated strings) into a matrix of token counts: each row is a vector holding the number of times each word occurs in that movie's string.
We do not need TF-IDF because we do not want to decrease the weight of an actor or director simply because they have worked on more films.

In [29]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(clean_df['aggregation'])

In [30]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [31]:
# Reset index of our main DataFrame and construct reverse mapping as before
clean_df = clean_df.reset_index()
indices = pd.Series(clean_df.index, index=clean_df['title'])

In [32]:
get_recommendations('The Dark Knight Rises', cosine_sim2, plot_based=False)

Input Movie:
                                                     3
title                            The Dark Knight Rises
genres                          [Action, Crime, Drama]
director                             Christopher Nolan
cast      [Christian Bale, Michael Caine, Gary Oldman]
keywords         [dc comics, crime fighter, terrorist]


Unnamed: 0,Title,Genres,Director,Cast,Keywords,Similarity Score
65,The Dark Knight,"[Drama, Action, Crime]",Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart]","[dc comics, crime fighter, secret identity]",0.7
119,Batman Begins,"[Action, Crime, Drama]",Christopher Nolan,"[Christian Bale, Michael Caine, Liam Neeson]","[himalaya, martial arts, dc comics]",0.7
4638,Amidst the Devil's Wings,"[Drama, Action, Crime]",,[],[],0.547723
1196,The Prestige,"[Drama, Mystery, Thriller]",Christopher Nolan,"[Hugh Jackman, Christian Bale, Michael Caine]","[competition, secret, obsession]",0.4
3073,Romeo Is Bleeding,"[Action, Crime, Drama]",Peter Medak,"[Gary Oldman, Lena Olin, Annabella Sciorra]","[police operation, sex addiction, police]",0.4
3326,Black November,"[Drama, Action, Crime]",Jeta Amata,"[Razaaq Adoti, Sarah Wayne Callies, Mickey Rou...",[],0.358569
1503,Takers,"[Action, Crime, Drama]",John Luessenhop,"[Chris Brown, Hayden Christensen, Matt Dillon]",[heist],0.33541
1986,Faster,"[Crime, Drama, Action]","George Tillman, Jr.","[Dwayne Johnson, Billy Bob Thornton, Maggie Gr...",[],0.33541
303,Catwoman,"[Action, Crime]",Pitof,"[Halle Berry, Benjamin Bratt, Sharon Stone]","[white russian, sex, dc comics]",0.316228
747,Gangster Squad,"[Crime, Drama, Action]",Ruben Fleischer,"[Josh Brolin, Ryan Gosling, Nick Nolte]","[los angeles, gangster]",0.316228
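The similarity scores in this table can be reproduced by hand from the cleaned metadata shown in the output above. The sketch below is a simplified stand-in for the pipeline: it splits the aggregated strings on whitespace and counts tokens with `collections.Counter` rather than calling `CountVectorizer`, and the helper names are hypothetical.

```python
import math
from collections import Counter

def count_vector(text):
    # Token counts for one aggregated metadata string
    return Counter(text.split())

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors
    dot = sum(c1[token] * c2[token] for token in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(value * value for value in c1.values()))
    norm2 = math.sqrt(sum(value * value for value in c2.values()))
    return dot / (norm1 * norm2)

# Keywords, cast, director, and genres after cleaning, joined as in aggregate()
rises = ("dccomics crimefighter terrorist christianbale michaelcaine "
         "garyoldman christophernolan action crime drama")
knight = ("dccomics crimefighter secretidentity christianbale heathledger "
          "aaroneckhart christophernolan drama action crime")
# A movie with no keywords, cast, or director listed keeps only its genres
genres_only = "drama action crime"

score_knight = cosine(count_vector(rises), count_vector(knight))
score_sparse = cosine(count_vector(rises), count_vector(genres_only))
print(round(score_knight, 6), round(score_sparse, 6))
```

The two movies share 7 of their 10 tokens, which gives 7 / 10 = 0.7. The second score shows why "Amidst the Devil's Wings" ranks so high: with only three genre tokens, all of which match, its short vector yields 3 / sqrt(10 * 3) ≈ 0.547723 even though no cast, director, or keyword information is shared.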


In [33]:
get_recommendations('The Godfather', cosine_sim2, plot_based=False)

Input Movie:
                                                  3337
title                                    The Godfather
genres                                  [Drama, Crime]
director                          Francis Ford Coppola
cast            [Marlon Brando, Al Pacino, James Caan]
keywords  [italy, love at first sight, loss of father]


Unnamed: 0,Title,Genres,Director,Cast,Keywords,Similarity Score
867,The Godfather: Part III,"[Crime, Drama, Thriller]",Francis Ford Coppola,"[Al Pacino, Diane Keaton, Andy García]","[italy, christianity, new york]",0.527046
2731,The Godfather: Part II,"[Drama, Crime]",Francis Ford Coppola,"[Al Pacino, Robert Duvall, Diane Keaton]","[italo-american, cuba, vororte]",0.421637
4638,Amidst the Devil's Wings,"[Drama, Action, Crime]",,[],[],0.3849
2649,The Son of No One,"[Drama, Thriller, Crime]",Dito Montiel,"[Channing Tatum, Al Pacino, Juliette Binoche]",[],0.377964
1525,Apocalypse Now,"[Drama, War]",Francis Ford Coppola,"[Martin Sheen, Marlon Brando, Robert Duvall]","[guerrilla, river, vietnam]",0.333333
1018,The Cotton Club,"[Music, Drama, Crime]",Francis Ford Coppola,"[Richard Gere, Gregory Hines, Diane Lane]","[jazz, jazz musician, musical]",0.316228
1170,The Talented Mr. Ripley,"[Thriller, Crime, Drama]",Anthony Minghella,"[Matt Damon, Gwyneth Paltrow, Jude Law]","[venice, italy, gay]",0.316228
1209,The Rainmaker,"[Drama, Crime, Thriller]",Francis Ford Coppola,"[Matt Damon, Danny DeVito, Jon Voight]","[jurors, proof, court case]",0.316228
1394,Donnie Brasco,"[Crime, Drama, Thriller]",Mike Newell,"[Johnny Depp, Al Pacino, Michael Madsen]","[undercover, colombia, mafia]",0.316228
1850,Scarface,"[Action, Crime, Drama]",Brian De Palma,"[Al Pacino, Steven Bauer, Michelle Pfeiffer]","[miami, corruption, capitalism]",0.316228


In [34]:
get_recommendations('27 Dresses', cosine_sim2, plot_based=False)

Input Movie:
                                                     1560
title                                          27 Dresses
genres                                  [Comedy, Romance]
director                                    Anne Fletcher
cast      [Katherine Heigl, James Marsden, Malin Åkerman]
keywords                   [lovesickness, newspaper, bar]


Unnamed: 0,Title,Genres,Director,Cast,Keywords,Similarity Score
1300,The Ugly Truth,"[Comedy, Romance]",Robert Luketic,"[Katherine Heigl, Gerard Butler, Eric Winter]","[romantic comedy, romance, tv morning show]",0.402015
4247,Me You and Five Bucks,"[Romance, Comedy, Drama]",,[],[],0.3849
705,Couples Retreat,"[Comedy, Romance]",Peter Billingsley,"[Vince Vaughn, Malin Åkerman, Jason Bateman]","[island, married couple, yoga]",0.333333
957,Bridget Jones: The Edge of Reason,"[Comedy, Romance]",Beeban Kidron,"[Renée Zellweger, Hugh Grant, Colin Firth]","[london england, lovesickness, thailand]",0.333333
2171,My Best Friend's Girl,"[Romance, Comedy]",Howard Deutch,"[Dane Cook, Kate Hudson, Alec Baldwin]","[date, sex, bar]",0.333333
2299,Leap Year,"[Romance, Comedy]",Anand Tucker,"[Amy Adams, Matthew Goode, Adam Scott]","[taxi, bar, wales]",0.333333
1150,The Proposal,"[Comedy, Romance, Drama]",Anne Fletcher,"[Sandra Bullock, Ryan Reynolds, Mary Steenburgen]","[fictitious marriage, deportation, immigration...",0.316228
1744,Knocked Up,"[Comedy, Romance, Drama]",Judd Apatow,"[Seth Rogen, Katherine Heigl, Leslie Mann]","[alcohol, one-night stand, bed]",0.316228
1798,New Year's Eve,"[Comedy, Romance]",Garry Marshall,"[Robert De Niro, Katherine Heigl, Ashton Kutcher]","[new year's eve, illustrator, caterer]",0.316228
1806,Accidental Love,"[Romance, Comedy]",David O. Russell,"[Jake Gyllenhaal, Jessica Biel, James Marsden]","[one-night stand, romantic comedy, accidental ...",0.316228


We can see that these results are much better, as we are comparing more informative metadata. The genres of the recommended movies are similar, and the recommended movies often share directors and actors with the query movie. The keywords may not be identical, but they often express related themes.

If we wanted to recommend movies that certain fanbases would enjoy (for example, a Marvel fan would likely want to watch other Marvel movies), we could also include the production company in our recommendation system.
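One possible way to do this, sketched under assumptions, is to turn the `production_companies` column into cleaned tokens in the same way as the cast and keywords, and then append those tokens to the aggregated string. The helper `company_tokens` and its limit of three companies are hypothetical choices for illustration and are not part of the notebook.

```python
from ast import literal_eval

def company_tokens(raw_companies, limit=3):
    # Parse the stringified list, keep the first few company names,
    # then lowercase them and remove spaces (as clean_data does for names)
    companies = literal_eval(raw_companies)
    names = [company["name"] for company in companies][:limit]
    return [name.replace(" ", "").lower() for name in names]

# A production_companies value in the style of the dataset
raw = '[{"name": "Walt Disney Pictures", "id": 2}]'
print(company_tokens(raw))  # ['waltdisneypictures']
```

These tokens could then be joined with spaces and added to the string built in `aggregate`, so that movies from the same production company share an extra token.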