# Content Filtering Recommendation Engine on the MovieLens Data

## 0. Preparation

We first import the packages we're going to need.

In [1]:
# Import packages
import pandas as pd
pd.set_option('display.max_columns', None)

from ast import literal_eval

We then read in the data we're going to use for our content filtering recommendation engine, starting with the ratings data. Here, we're interested in the user, the movie and the rating the user gave that movie.

In [2]:
# Load the ratings data and keep only the columns we need
ratings_small = pd.read_csv('ratings_small.csv')
ratings_small = ratings_small[['userId', 'movieId', 'rating']]
ratings_small.head(10)

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0
5,1,1263,2.0
6,1,1287,2.0
7,1,1293,2.0
8,1,1339,3.5
9,1,1343,2.0


We now have a look at the movie metadata. This dataset provides descriptive information about each movie, including a plot overview, the production companies, etc.

In this recommender, I'm going to assume that we have no access to past review scores (i.e. to prevent data leakage). We will be evaluating movies based only on textual information.

In [3]:
metadata = pd.read_csv('movies_metadata.csv')
metadata = metadata[['id', 'original_title', 'overview', 'production_companies', 'genres']]

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

metadata.head()


Unnamed: 0,id,original_title,overview,production_companies,genres
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'id': 35, 'name': 'Comedy'}]"
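
The cell above drops three rows by hard-coded index label, which only works if the file has exactly the same row order. As a minimal sketch of an alternative, assuming the bad rows are the ones whose `id` cannot be parsed as a number, the IDs can be coerced with `pd.to_numeric` and the failures filtered out. The toy frame and its malformed value below are made up for illustration and are not taken from `movies_metadata.csv`.

```python
import pandas as pd

# Toy stand-in for the metadata; 'not-an-id' is a hypothetical malformed value.
toy = pd.DataFrame({
    'id': ['862', '8844', 'not-an-id', '15602'],
    'original_title': ['Toy Story', 'Jumanji', 'Broken row', 'Grumpier Old Men'],
})

# Coerce IDs to numbers; anything that cannot be parsed becomes NaN.
numeric_id = pd.to_numeric(toy['id'], errors='coerce')
is_valid = numeric_id.notna()

# Keep only rows whose ID parsed cleanly, then store the IDs as integers.
clean = toy[is_valid].copy()
clean['id'] = numeric_id[is_valid].astype('int')

print(clean)
```

This approach does not depend on row positions, at the cost of silently dropping any row with an unparseable ID, so it is worth checking how many rows `is_valid` excludes.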


We also have data describing the cast, crew and keywords, which we will read in and join to our metadata.

In [4]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Convert IDs to int. Required for merging
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

metadata = metadata.dropna(subset=['id'])

metadata.head()

Unnamed: 0,id,original_title,overview,production_companies,genres,cast,crew,keywords
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'id': 35, 'name': 'Comedy'}]","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
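
Note that `merge` defaults to an inner join: movies without a match are dropped, and a movie whose `id` appears more than once in the other table is repeated. Whether the real `credits.csv` and `keywords.csv` contain duplicates or gaps is not shown here, so the sketch below uses small made-up tables to illustrate the behaviour and one possible guard (`drop_duplicates` plus the `validate` argument).

```python
import pandas as pd

# Hypothetical tables: id 2 is duplicated in creds and id 3 is missing from it.
meta = pd.DataFrame({'id': [1, 2, 3], 'original_title': ['A', 'B', 'C']})
creds = pd.DataFrame({'id': [1, 2, 2], 'cast': ['[]', '[]', '[]']})

# Default inner join: 'B' appears twice and 'C' disappears.
merged = meta.merge(creds, on='id')
print(merged['original_title'].tolist())

# De-duplicate the right-hand table first; validate raises if keys still repeat.
creds_unique = creds.drop_duplicates(subset='id')
deduped = meta.merge(creds_unique, on='id', validate='one_to_one')

# A left join keeps every movie and leaves NaN where no credits exist.
kept_all = meta.merge(creds_unique, on='id', how='left')
print(len(deduped), len(kept_all))
```

Comparing the row count before and after each merge is a quick way to spot either effect on the real data.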


In [6]:
# Replace NaN with an empty string
metadata = metadata.fillna('')

It looks like the production companies, genres, cast, crew and keywords columns are stored as "stringified" lists. I'll parse them back into Python lists.

In [7]:
features = ['production_companies', 'cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
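
One caveat: `literal_eval` raises an error on an empty string, so if any of these columns held a missing value that `fillna('')` turned into `''`, the loop above would fail. A minimal sketch of a more defensive parser follows; `parse_list` is a hypothetical helper name, not part of the original notebook, and returning an empty list for bad input is an assumption about the desired behaviour.

```python
from ast import literal_eval

import pandas as pd


def parse_list(value):
    """Parse a stringified list; return [] for empty or malformed input."""
    # Non-strings (e.g. NaN) and blank strings carry no list to parse.
    if not isinstance(value, str) or not value.strip():
        return []
    try:
        parsed = literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    # Guard against values that parse to something other than a list.
    return parsed if isinstance(parsed, list) else []


# Toy column with a normal entry, an empty string and an empty list.
toy = pd.Series(["[{'id': 16, 'name': 'Animation'}]", '', '[]'])
parsed = toy.apply(parse_list)
print(parsed.tolist())
```

With this helper, the loop body would become `metadata[feature].apply(parse_list)`, and rows without data end up with an empty list rather than stopping the run.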

In [8]:
metadata

Unnamed: 0,id,original_title,overview,production_companies,genres,cast,crew,keywords
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",[{'name': 'Twentieth Century Fox Film Corporat...,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'id': 35, 'name': 'Comedy'}]","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
5,949,Heat,"Obsessive master thief, Neil McCauley leads a ...","[{'name': 'Regency Enterprises', 'id': 508}, {...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...","[{'cast_id': 25, 'character': 'Lt. Vincent Han...","[{'credit_id': '52fe4292c3a36847f802916d', 'de...","[{'id': 642, 'name': 'robbery'}, {'id': 703, '..."
6,11860,Sabrina,An ugly duckling having undergone a remarkable...,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...","[{'cast_id': 1, 'character': 'Linus Larrabee',...","[{'credit_id': '52fe44959251416c75039da9', 'de...","[{'id': 90, 'name': 'paris'}, {'id': 380, 'nam..."
7,45325,Tom and Huck,"A mischievous young boy, Tom Sawyer, witnesses...","[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[{'cast_id': 2, 'character': 'Tom Sawyer', 'cr...","[{'credit_id': '52fe46bdc3a36847f810f797', 'de...",[]
8,9091,Sudden Death,International action superstar Jean Claude Van...,"[{'name': 'Universal Pictures', 'id': 33}, {'n...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[{'cast_id': 1, 'character': 'Darren Francis T...","[{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de...","[{'id': 949, 'name': 'terrorist'}, {'id': 1562..."
9,710,GoldenEye,James Bond must unmask the mysterious head of ...,"[{'name': 'United Artists', 'id': 60}, {'name'...","[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...","[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '52fe426ec3a36847f801e14b', 'de...","[{'id': 701, 'name': 'cuba'}, {'id': 769, 'nam..."
