# Content-Based Filtering: Product Recommendation

Note that this tutorial has been adapted from https://www.datacamp.com/tutorial/recommender-systems-python

We are looking at a dataset of movies and their metadata attributes. 

Based on the movies that the user has chosen to watch in the past, we recommend new movies to them.

In [4]:
import utils
import numpy as np
import pandas as pd

metadata = utils.get_product_recommendation_data_content()

metadata.to_csv("output_data/product-recommendation.csv", index=False)

# Data Preprocessing

In this cell, we preprocess the data to 
- filter out movies with lower watch counts and lower votes. In smaller datasets, we don't necessarily have to do this
- preprocess relevant metadata from the table i.e. the director's name, cast name, and crew names

In [5]:
m = metadata['vote_count'].quantile(0.90)
C = metadata['vote_average'].mean()

# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

# Print the new features of the first 3 films
metadata[['title', 'cast', 'director',  'genres']].head(3)

# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to your features.
features = ['cast',  'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

def create_soup(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)


metadata = metadata[metadata["vote_average"] > C]
metadata = metadata[metadata["vote_count"] > m]

q_movies = metadata.iloc[:10000,:]

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,cast,crew,director,soup
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[animation, comedy, family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",johnlasseter,tomhanks timallen donrickles johnlasseter anim...
1,False,,65000000,"[adventure, fantasy, family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",joejohnston,robinwilliams jonathanhyde kirstendunst joejoh...
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...",charlesshyer,stevemartin dianekeaton martinshort charlesshy...


We are going to use a combination of the following metadata attributes to recommend new movies to the user:
- movie overview (text description)
- director
- cast members
- crew members

In the previous cell we generated a "soup" column which combines the names of the director, cast, crew into a "soup" column. We don't necessarily have to do this, and we can vectorize this information separately as well, but we do that for simplicity. 

In [8]:
print(metadata['overview'].head())

print(metadata['soup'].head())

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
4    Just when George Banks has recovered from his ...
5    Obsessive master thief, Neil McCauley leads a ...
9    James Bond must unmask the mysterious head of ...
Name: overview, dtype: object
0    tomhanks timallen donrickles johnlasseter anim...
1    robinwilliams jonathanhyde kirstendunst joejoh...
4    stevemartin dianekeaton martinshort charlesshy...
5    alpacino robertdeniro valkilmer michaelmann ac...
9    piercebrosnan seanbean izabellascorupco martin...
Name: soup, dtype: object


# Vectorizing the attributes

There are several ways to vectorize. 

For the "overview" column, it is a natural text input, so we use the TFIDF vectorizer that is preferred for text inputs. Here, we could also use more advanced embedding methods, for example watsonx.ai's embedding models.

For the "soup" column, we want to preserve all the words, so we use a simple CountVectorizer that counts the presence of specific words.

We will combine the matrices outputted by these two methods into a single large matrix

In [10]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

print(count_matrix.shape)

(3797, 6285)


In [11]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

print(tfidf_matrix.shape)


(3797, 18033)


In [12]:
import scipy.sparse as sp

final_matrix = sp.hstack((count_matrix, tfidf_matrix), format='csr')

In [14]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(final_matrix, final_matrix)


In [106]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(q_movies.index, index=q_movies['title']).drop_duplicates()


# Getting Recommendations for Similar Movies

Now, based on a specific movie that the user has watched in the past, we can generate similar movies (measured using the attributes/metadata) that are suitable for them.

In [107]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return q_movies['title'].iloc[movie_indices]


In [108]:

get_recommendations("The Shawshank Redemption")

2618           Spartacus
15601           Ip Man 2
108           Braveheart
1192            Das Boot
1914       Seven Samurai
2950     The Longest Day
5294         Windtalkers
7431     Throne of Blood
10658             Munich
13276           Defiance
Name: title, dtype: object

# Generating Recommendations for a Specific User

In the above cell we generated movies that are "similar" in metadata to a given movie.

However, perhaps, based on the aggregated movies that the user has watched in the past, we wish to recommend some new movies to watch.

One simple way to do this is to take the averaged vectors for the metadata that the user has watched. Let's try this below:

In [112]:
from sklearn.metrics.pairwise import cosine_similarity

def get_aggregated_recommendations_for_user(watched_titles, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[watched_titles]

    sims = sorted([(e, cosine_similarity(np.array(final_matrix[idx].mean(axis=0)), m)) for e, m in enumerate(final_matrix)], reverse=True, key=lambda x: x[1])

    sims = [i for i in sims if i[0] not in list(idx)]

    # Get the scores of the 10 most similar movies
    sim_scores = sims[:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # # Return the top 10 most similar movies
    similar = q_movies['title'].iloc[movie_indices]

    return [i for i in similar if i not in watched_titles]


Let's consider a user that has watched 4 movies so far, and see what new movies to recommend to them

In [113]:
movies_watched = ["The Shawshank Redemption", "Spartacus", "JFK", "Das Boot"]

get_aggregated_recommendations_for_user(movies_watched)

['Throne of Blood',
 'Rashomon',
 'Serpico',
 'Ed Wood',
 'J. Edgar',
 'Munich',
 'The Imitation Game',
 'Amistad',
 'The Good Shepherd']

We can see that the user is automatically recommended rather similar theatrical thriller movies