# Content-Based Filtering: Product Recommendation

Note that this tutorial has been adapted from https://www.datacamp.com/tutorial/recommender-systems-python

We are looking at a dataset of movies and their metadata attributes. 

Based on the movies that the user has chosen to watch in the past, we recommend new movies to them.

In [1]:
import utils
import numpy as np
import pandas as pd

metadata = utils.get_product_recommendation_data_content()

metadata.to_csv("output_data/product-recommendation.csv", index=False)

We are going to use a combination of the following metadata attributes to recommend new movies to the user:
- movie overview (text description)
- director
- cast members
- crew members

We have a "soup" column which combines the names of the director, cast, crew into a "soup" column. We don't necessarily have to do this, and we can vectorize this information separately as well, but we do that for simplicity. 

In [2]:
print(metadata['overview'].head())

print(metadata['soup'].head())

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
4    Just when George Banks has recovered from his ...
5    Obsessive master thief, Neil McCauley leads a ...
9    James Bond must unmask the mysterious head of ...
Name: overview, dtype: object
0    tomhanks timallen donrickles johnlasseter anim...
1    robinwilliams jonathanhyde kirstendunst joejoh...
4    stevemartin dianekeaton martinshort charlesshy...
5    alpacino robertdeniro valkilmer michaelmann ac...
9    piercebrosnan seanbean izabellascorupco martin...
Name: soup, dtype: object


# Vectorizing the attributes

There are several ways to vectorize. 

For the "overview" column, it is a natural text input, so we use the TFIDF vectorizer that is preferred for text inputs. Here, we could also use more advanced embedding methods, for example watsonx.ai's embedding models.

For the "soup" column, we want to preserve all the words, so we use a simple CountVectorizer that counts the presence of specific words.

We will combine the matrices outputted by these two methods into a single large matrix

In [3]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

print(count_matrix.shape)

(1000, 2148)


In [4]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

print(tfidf_matrix.shape)


(1000, 8859)


In [5]:
import scipy.sparse as sp

final_matrix = sp.hstack((count_matrix, tfidf_matrix), format='csr')

In [6]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(final_matrix, final_matrix)


In [8]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()


# Getting Recommendations for Similar Movies

Now, based on a specific movie that the user has watched in the past, we can generate similar movies (measured using the attributes/metadata) that are suitable for them.

In [9]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]


In [10]:

get_recommendations("The Shawshank Redemption")

2618             Spartacus
108             Braveheart
1192              Das Boot
1914         Seven Samurai
2950       The Longest Day
3618               Serpico
1165    Lawrence of Arabia
1251                Gandhi
2918               Yojimbo
512             Rising Sun
Name: title, dtype: object

# Generating Recommendations for a Specific User

In the above cell we generated movies that are "similar" in metadata to a given movie.

However, perhaps, based on the aggregated movies that the user has watched in the past, we wish to recommend some new movies to watch.

One simple way to do this is to take the averaged vectors for the metadata that the user has watched. Let's try this below:

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

def get_aggregated_recommendations_for_user(watched_titles, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[watched_titles]

    sims = sorted([(e, cosine_similarity(np.array(final_matrix[idx].mean(axis=0)), m)) for e, m in enumerate(final_matrix)], reverse=True, key=lambda x: x[1])

    sims = [i for i in sims if i[0] not in list(idx)]

    # Get the scores of the 10 most similar movies
    sim_scores = sims[:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # # Return the top 10 most similar movies
    similar = metadata['title'].iloc[movie_indices]

    return [i for i in similar if i not in watched_titles]


Let's consider a user that has watched 2 movies so far, and see what new movies to recommend to them

In [13]:
movies_watched = ["The Shawshank Redemption", "Braveheart"]

get_aggregated_recommendations_for_user(movies_watched)

['The Terminator',
 'Terminator 2: Judgment Day',
 'Seven Samurai',
 'Breakdown',
 'Enemy of the State',
 'The Abyss',
 'In the Line of Fire',
 'The Siege',
 'Clear and Present Danger',
 'The Bodyguard']

We can see that the user is automatically recommended rather similar "adventure" movies