# Content-Based Filtering: Product Recommendation

Note that this tutorial has been adapted from https://www.datacamp.com/tutorial/recommender-systems-python

We are looking at a dataset of movies and their metadata attributes. 

Based on the attributes/metadata of movies that the user has chosen to watch in the past, we recommend new movies to them.

In [1]:
import numpy as np
import pandas as pd
import os, types
from botocore.client import Config
import ibm_boto3

Click into the next empty cell, then, at the top of the Watson Studio notebook, click Code Snippets > Read Data and select the "product-recommendation.csv" data file that you want to read. Load it into a DataFrame named `metadata`.

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,cast,crew,director,soup
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"['animation', 'comedy', 'family']",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,"['tomhanks', 'timallen', 'donrickles']","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",johnlasseter,tomhanks timallen donrickles johnlasseter anim...
1,False,,65000000,"['adventure', 'fantasy', 'family']",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"['robinwilliams', 'jonathanhyde', 'kirstendunst']","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",joejohnston,robinwilliams jonathanhyde kirstendunst joejoh...
2,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,['comedy'],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"['stevemartin', 'dianekeaton', 'martinshort']","[{'credit_id': '52fe44959251416c75039ed7', 'de...",charlesshyer,stevemartin dianekeaton martinshort charlesshy...
3,False,,60000000,"['action', 'crime', 'drama']",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,"['alpacino', 'robertdeniro', 'valkilmer']","[{'credit_id': '52fe4292c3a36847f802916d', 'de...",michaelmann,alpacino robertdeniro valkilmer michaelmann ac...
4,False,"{'id': 645, 'name': 'James Bond Collection', '...",58000000,"['adventure', 'action', 'thriller']",http://www.mgm.com/view/movie/757/Goldeneye/,710,tt0113189,en,GoldenEye,James Bond must unmask the mysterious head of ...,...,Released,No limits. No fears. No substitutes.,GoldenEye,False,6.6,1194.0,"['piercebrosnan', 'seanbean', 'izabellascorupco']","[{'credit_id': '52fe426ec3a36847f801e14b', 'de...",martincampbell,piercebrosnan seanbean izabellascorupco martin...
5,False,,62000000,"['comedy', 'drama', 'romance']",,9087,tt0112346,en,The American President,"Widowed U.S. president Andrew Shepherd, one of...",...,Released,Why can't the most powerful man in the world h...,The American President,False,6.5,199.0,"['michaeldouglas', 'annettebening', 'michaelj....","[{'credit_id': '52fe44dac3a36847f80adfa3', 'de...",robreiner,michaeldouglas annettebening michaelj.fox robr...
6,False,,0,"['comedy', 'horror']",,12110,tt0112896,en,Dracula: Dead and Loving It,When a lawyer shows up at the vampire's doorst...,...,Released,,Dracula: Dead and Loving It,False,5.7,210.0,"['leslienielsen', 'melbrooks', 'amyyasbeck']","[{'credit_id': '52fe44b79251416c7503e7fb', 'de...",melbrooks,leslienielsen melbrooks amyyasbeck melbrooks c...
7,False,"{'id': 117693, 'name': 'Balto Collection', 'po...",0,"['family', 'animation', 'adventure']",,21032,tt0112453,en,Balto,An outcast half-wolf risks his life to prevent...,...,Released,Part Dog. Part Wolf. All Hero.,Balto,False,7.1,423.0,"['kevinbacon', 'bobhoskins', 'bridgetfonda']","[{'credit_id': '593f24b9c3a3680369002371', 'de...",simonwells,kevinbacon bobhoskins bridgetfonda simonwells ...
8,False,,52000000,"['drama', 'crime']",,524,tt0112641,en,Casino,The life of the gambling paradise – Las Vegas ...,...,Released,No one stays at the top forever.,Casino,False,7.8,1343.0,"['robertdeniro', 'sharonstone', 'joepesci']","[{'credit_id': '52fe424dc3a36847f80139cd', 'de...",martinscorsese,robertdeniro sharonstone joepesci martinscorse...
9,False,,16500000,"['drama', 'romance']",,4584,tt0114388,en,Sense and Sensibility,"Rich Mr. Dashwood dies, leaving his second wif...",...,Released,Lose your heart and come to your senses.,Sense and Sensibility,False,7.2,364.0,"['katewinslet', 'emmathompson', 'hughgrant']","[{'credit_id': '52fe43cec3a36847f807101f', 'de...",anglee,katewinslet emmathompson hughgrant anglee dram...


We are going to use a combination of the following metadata attributes to recommend new movies to the user:
- movie overview (text description)
- director
- cast members

The dataset already has a "soup" column that combines the names of the director and the cast members into a single string. We don't have to do this, since each attribute could be vectorized separately, but combining them keeps things simple.
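
For illustration, a soup column of this kind could be assembled roughly as follows. This is a minimal sketch on an invented toy DataFrame: the helper name `make_soup` is hypothetical, and it assumes `cast` is held as a Python list of tokens, which may not match how the dataset's own column was built.

```python
import pandas as pd

# Toy rows mirroring the cleaned, lowercase name tokens seen in the dataset
toy = pd.DataFrame({
    "cast": [["tomhanks", "timallen", "donrickles"],
             ["alpacino", "robertdeniro", "valkilmer"]],
    "director": ["johnlasseter", "michaelmann"],
})

def make_soup(row):
    # Hypothetical helper: join the cast tokens and the director into one string
    return " ".join(row["cast"]) + " " + row["director"]

toy["soup"] = toy.apply(make_soup, axis=1)
print(toy["soup"].tolist())
```

Writing each name as a single token without spaces (for example `tomhanks`) means the vectorizer treats a full name as one feature rather than matching on first names alone.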

In [3]:
print(metadata['overview'].head())

print(metadata['soup'].head())

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    Just when George Banks has recovered from his ...
3    Obsessive master thief, Neil McCauley leads a ...
4    James Bond must unmask the mysterious head of ...
Name: overview, dtype: object
0    tomhanks timallen donrickles johnlasseter anim...
1    robinwilliams jonathanhyde kirstendunst joejoh...
2    stevemartin dianekeaton martinshort charlesshy...
3    alpacino robertdeniro valkilmer michaelmann ac...
4    piercebrosnan seanbean izabellascorupco martin...
Name: soup, dtype: object


# Vectorizing the Attributes

There are several ways to vectorize these attributes.

The "overview" column is natural-language text, so we use a TF-IDF vectorizer, which down-weights words that appear in many overviews. More advanced neural-network-based embedding methods could also be used here.

For the "soup" column, we want every name to carry its full weight, so we use a simple CountVectorizer that counts how often each token occurs.

We will combine the matrices produced by these two methods into a single large matrix.
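
To make the difference between the two vectorizers concrete, here is a small sketch on an invented three-movie corpus (the strings and the `_toy` variable names are assumptions for illustration only). It shows the two feature blocks being stacked side by side, and that TF-IDF rows have unit length by default while raw counts do not.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy data: three "overviews" and three "soups"
overviews = ["a thief plans a heist",
             "a toy cowboy fears a rival toy",
             "a thief meets a cowboy"]
soups = ["alpacino robertdeniro michaelmann",
         "tomhanks timallen johnlasseter",
         "alpacino tomhanks"]

# Raw integer counts per token
count_toy = CountVectorizer().fit_transform(soups)
# Weighted values; rows are L2-normalized by default (norm='l2')
tfidf_toy = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# Stack the two feature blocks side by side: still one row per movie
combined = sp.hstack((count_toy, tfidf_toy), format="csr")
print(count_toy.shape, tfidf_toy.shape, combined.shape)

# Length of each TF-IDF row
row_norms = np.sqrt(np.asarray(tfidf_toy.multiply(tfidf_toy).sum(axis=1)).ravel())
print(row_norms)
```

Because the count block is not normalized, rows of the combined matrix are generally not unit length, which matters when choosing a similarity measure.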

In [4]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

print(count_matrix.shape)

(1000, 2148)


In [5]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

print(tfidf_matrix.shape)


(1000, 8859)


In [6]:
import scipy.sparse as sp

final_matrix = sp.hstack((count_matrix, tfidf_matrix), format='csr')

In [7]:
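# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the pairwise similarity matrix as dot products between movie vectors.
# Note: this equals cosine similarity only for unit-length rows; the raw counts
# in final_matrix are not normalized, so these scores are unnormalized.
cosine_sim = linear_kernel(final_matrix, final_matrix)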
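
The relationship between `linear_kernel` and cosine similarity can be checked on a tiny example. The sketch below uses an invented 2x2 matrix (not the movie data) and shows the two measures differing on unnormalized rows and agreeing once the rows are L2-normalized.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

# Invented toy matrix whose rows are NOT unit length
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

dot = linear_kernel(X, X)        # raw dot products
cos = cosine_similarity(X, X)    # dot products of L2-normalized rows

print(dot)  # diagonal holds the squared row norms (25 and 1)
print(cos)  # diagonal is 1; off-diagonal is 3/5 = 0.6

# After L2-normalizing the rows, the linear kernel matches cosine similarity
X_unit = normalize(X, norm="l2")
print(np.allclose(linear_kernel(X_unit, X_unit), cos))
```

If true cosine similarity is wanted for the combined matrix, one option is to call `cosine_similarity(final_matrix, final_matrix)` instead, at the cost of possibly changing the rankings shown in this notebook.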


In [8]:
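# Construct a reverse map from movie title to row index,
# keeping the first row for any title that appears more than once
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]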
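
As a toy illustration of why repeated titles matter for this reverse map (the titles below are invented, with "Heat" deliberately repeated), a lookup on a duplicated title returns several positions rather than one integer:

```python
import pandas as pd

# Invented toy titles; "Heat" appears twice on purpose
titles = pd.Series(["Toy Story", "Heat", "Jumanji", "Heat"])

lookup = pd.Series(titles.index, index=titles)
print(lookup["Toy Story"])       # a single integer position
print(lookup["Heat"].tolist())   # a repeated title yields several positions

# Keeping only the first row per title makes every lookup a single integer
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["Heat"])
```

A single integer is what row lookups such as `cosine_sim[idx]` expect, so deduplicating on the title index is the safer choice.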


# Getting Recommendations for Similar Movies

Now, given a specific movie that the user has watched in the past, we can find the movies whose attributes/metadata are most similar to it and recommend those.

In [9]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies, skipping the first entry
    # (normally the query movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]


In [10]:

get_recommendations("The Shawshank Redemption")

856       The Green Mile
363       Cool Hand Luke
793      Double Jeopardy
116        Carlito's Way
870             Papillon
442             Cop Land
60            Disclosure
257             Sleepers
473              Amistad
907    Dog Day Afternoon
Name: title, dtype: object

# Generating Recommendations for a Specific User

The cell above finds movies whose metadata is "similar" to that of a single given movie.

However, we may instead wish to recommend new movies based on everything the user has watched so far.

One simple way to do this is to average the feature vectors of the movies the user has watched and then look for movies close to that average. Let's try this below:
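
As a toy illustration of the averaging idea, the sketch below uses invented three-feature vectors and NumPy only; the names `movies`, `watched`, and `profile` are assumptions for this example and are not part of the notebook.

```python
import numpy as np

# Invented feature vectors for four movies (rows) over three features (columns)
movies = np.array([
    [1.0, 0.0, 0.0],   # movie 0 (watched)
    [0.0, 1.0, 0.0],   # movie 1 (watched)
    [1.0, 1.0, 0.0],   # movie 2
    [0.0, 0.0, 1.0],   # movie 3
])
watched = [0, 1]

# User profile = mean of the watched movies' vectors
profile = movies[watched].mean(axis=0)

# Cosine similarity of every movie to the profile
scores = movies @ profile / (np.linalg.norm(movies, axis=1) * np.linalg.norm(profile))

# Rank the unwatched movies by similarity, best first
ranked = [int(i) for i in np.argsort(-scores) if i not in watched]
print(profile, ranked)
```

Movie 2 shares features with both watched movies, so it ranks ahead of movie 3, which shares none.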

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

def get_aggregated_recommendations_for_user(watched_titles):
    # Get the row indices of the movies that match the watched titles
    idx = indices[watched_titles]

    # Average the watched movies' vectors into a single user profile (computed once)
    user_profile = np.asarray(final_matrix[idx.to_numpy()].mean(axis=0)).reshape(1, -1)

    # Score every movie against the profile and sort from most to least similar
    scores = cosine_similarity(user_profile, final_matrix).ravel()
    sims = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

    # Drop the movies the user has already watched
    sims = [i for i in sims if i[0] not in list(idx)]

    # Get the scores of the 10 most similar movies
    sim_scores = sims[:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Get the titles of the top 10 most similar movies
    similar = metadata['title'].iloc[movie_indices]

    # Return the titles, excluding any that share a name with a watched movie
    return [i for i in similar if i not in watched_titles]


Let's consider a user who has watched two movies so far, and see which new movies we would recommend to them.

In [12]:
movies_watched = ["The Shawshank Redemption", "Braveheart"]

get_aggregated_recommendations_for_user(movies_watched)

['The Patriot',
 'Payback',
 'Conspiracy Theory',
 'Pocahontas',
 'Amistad',
 'Cop Land',
 "Carlito's Way",
 'Serpico',
 'Spartacus',
 'Death Wish']

We can see that the recommendations are broadly similar "adventure/thriller" movies, in keeping with the two titles the user has watched.