Example of a simple content-based recommender.
It uses the MovieLens dataset (more precisely, a subset of that dataset downloaded from Kaggle).

This system recommends movies that are similar to a particular movie. It does so by computing pairwise cosine similarity scores for all movies based on their plot descriptions, and recommending the movies with the highest similarity scores.

In [1]:
import pandas as pd

# Load movies metadata
metadata = pd.read_csv("../data/external/movies_metadata.csv", low_memory=False)

# Print the plot overviews of the first five movies in the DataFrame
metadata['overview'].head()


0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

Clearly, it is not possible to compute the similarity between any two overviews in their raw form. In order to do anything mathematical with the overviews, we first compute a word vector for each overview (each overview is also referred to as a 'document').

Word vectors are vectorized representations of the words in a document, intended to capture something of the document's meaning. The specific computation applied here is Term Frequency-Inverse Document Frequency (TF-IDF). The TF-IDF score of a word is its frequency in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their significance in computing the final similarity score.
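
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for a three-document toy corpus. The corpus and the helper name `tfidf_weights` are made up for illustration. The formula used, tf × (ln((1 + n) / (1 + df)) + 1), is the smoothed variant that scikit-learn applies by default, without its final L2 normalization step; other TF-IDF variants exist.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the movies dataset)
docs = [
    "toys live happily in the room",
    "siblings discover an enchanted board game",
    "toys discover a new room",
]

def tfidf_weights(documents):
    """Return one {term: weight} dict per document.

    weight = tf * idf, with idf = ln((1 + n) / (1 + df)) + 1.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

weights = tfidf_weights(docs)

# "toys" occurs in two documents, "happily" in only one,
# so "happily" gets the larger weight within the first document.
print(weights[0]["toys"] < weights[0]["happily"])
```

Words shared by many documents end up with a lower weight than words that are specific to one document, which is the down-weighting described above.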

The scikit-learn package provides a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Define TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english', dtype=np.float32)

# Replace NaN with empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

tfidf_matrix.shape


(45466, 75827)

In [17]:
# Array mapping from feature integer indices to feature names
tfidf.get_feature_names_out()[50000:50010]

array(['period', 'periodic', 'periodical', 'periodically', 'periods',
       'peripatetic', 'peripheral', 'periphery', 'perish', 'perished'],
      dtype=object)

Incidentally, there are a total of 75827 different words in our movies dataset!

Now, with the tfidf_matrix in hand, we can compute the cosine similarity scores. There are a number of similarity measures that could be used here: Manhattan, Euclidean, Pearson, etc. Different measures work well in different scenarios.

Because TfidfVectorizer L2-normalizes each document vector by default, the dot product of two vectors is already their cosine similarity score. We therefore use sklearn's linear_kernel() instead of cosine_similarity(), which is faster because it skips the redundant normalization step.
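
A quick sanity check of this equivalence on a toy corpus might look like the sketch below. The three sentences are made up, and the check assumes the vectorizer's default `norm='l2'` setting, under which every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up plot summaries, for illustration only
toy_docs = [
    "a cowboy doll feels threatened by a new spaceman toy",
    "two children find a magical board game",
    "a spaceman toy and a cowboy doll become friends",
]

# Default norm='l2' scales every row to unit length
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

# With unit-length rows, both results agree up to floating-point error
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two results would generally differ and cosine_similarity() would be the safer choice.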

This returns a matrix of shape 45466x45466, in which each entry is the similarity score between one movie's overview and another's. Hence, each movie corresponds to a 1x45466 row vector of scores.
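
A dense 45466x45466 float32 matrix needs roughly 8.3 GB of memory (45466 × 45466 entries × 4 bytes). If that is more than is available, one alternative is to compute a single row of scores on demand, assuming recommendations are only needed for one movie at a time. The sketch below illustrates the idea; the helper name `similarity_row` and the toy corpus are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def similarity_row(matrix, idx):
    """Similarity of document `idx` to every document, as a 1-D array."""
    # matrix[idx:idx + 1] keeps a 2-D (1 x n_features) sparse row;
    # the kernel result is 1 x n_docs, flattened here to n_docs
    return linear_kernel(matrix[idx:idx + 1], matrix).ravel()

# Made-up plot summaries, for illustration only
toy_docs = [
    "a detective hunts a masked vigilante",
    "a masked vigilante protects the city",
    "a family opens a bakery in paris",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

row = similarity_row(toy_matrix, 0)
print(row.shape)  # one score per document
```

This trades a small amount of computation per query for not having to hold the full matrix in memory.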

In [18]:
type(tfidf_matrix[0][0])

scipy.sparse._csr.csr_matrix

In [19]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim.shape

cosine_sim[1]


array([0.01504121, 1.0000001 , 0.04681952, ..., 0.        , 0.0219864 ,
       0.00929411], dtype=float32)

Next, we add a function that takes a movie title as input and outputs a list of the 10 most similar movies. For this, we first need a reverse mapping from movie title to DataFrame index.

In [38]:
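indices = pd.Series(metadata.index, index=metadata['title'])
# Keep the first row for each title so that a lookup returns a single index
# (Series.drop_duplicates() compares values, not index labels, so it is a no-op here)
indices = indices[~indices.index.duplicated(keep='first')]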

indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64
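
One caveat with a title-indexed Series: titles are not guaranteed to be unique, and looking up a repeated title returns a Series of positions rather than a single integer. The sketch below uses a made-up three-row frame to show the behaviour and one possible way to keep only the first occurrence of each title; which occurrence to keep is a judgment call.

```python
import pandas as pd

# Hypothetical mini-catalogue with a repeated title
toy = pd.DataFrame({"title": ["Heat", "Sabrina", "Heat"]})

lookup = pd.Series(toy.index, index=toy["title"])

# A repeated title yields several positions, not a scalar
print(lookup["Heat"].tolist())

# Keep the first row for each title so every lookup is a single integer
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(int(unique_lookup["Heat"]))
```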

In [40]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendation(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with the input movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the input movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movie titles
    return metadata['title'].iloc[movie_indices]
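
Sorting a Python list of (index, score) tuples works, but the same ranking can be obtained with NumPy directly. The following is a minimal sketch of that alternative, not the function used in this notebook; `top_k_similar` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def top_k_similar(scores, idx, k=10):
    """Indices of the k highest scores, excluding position `idx` itself."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[idx] = -np.inf                       # never recommend the query movie
    order = np.argsort(-scores, kind="stable")  # descending by score
    return order[:k]

# Made-up similarity scores of movie 1 against five movies
example_scores = [0.2, 1.0, 0.7, 0.0, 0.4]
print(top_k_similar(example_scores, idx=1, k=3))
```

Excluding `idx` explicitly avoids relying on the query movie always sorting first, which may not hold when another movie has an identical overview.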


In [42]:
list_of_movies = get_recommendation('The Dark Knight Rises')

list_of_movies

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

In [44]:
get_recommendation('GoldenEye')

2874           Licence to Kill
37440               Dream Work
2875          Live and Let Die
7330                 Octopussy
7329       You Only Live Twice
8331                  Doctor X
7333     Never Say Never Again
37934      Johnny Stool Pigeon
5658             Casino Royale
4316     The Way of the Dragon
Name: title, dtype: object

In [46]:
get_recommendation('The Godfather')

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

In [48]:
get_recommendation('Waiting to Exhale')

26594       I Don't Buy Kisses Anymore
27337                Robin of Locksley
34887               Starring Adam West
35930                  Poil de Carotte
41416                           Hunted
24588                       Chatterbox
43426    Robin Williams - Off the Wall
18999                           Bernie
28207                The Boy Next Door
2141                              Hero
Name: title, dtype: object