# Content Filtering Recommendation Engine on the MovieLens Data

## 0. Preparation

We first import the packages we're going to need.

In [1]:
# Import packages
import pandas as pd
pd.set_option('display.max_columns', None)

# Import Numpy
import numpy as np

from ast import literal_eval

### 0.1 Ratings Data

We then read in the data we're going to use for our content filtering recommendation engine. We start with the ratings data. Here, we're interested in the user, the movie and the rating the user gave that movie.

In [2]:
# Load the ratings data and keep only the columns we need
ratings_small = pd.read_csv('ratings_small.csv')
ratings_small = ratings_small[['userId', 'movieId', 'rating']]
ratings_small.head(10)

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0
5,1,1263,2.0
6,1,1287,2.0
7,1,1293,2.0
8,1,1339,3.5
9,1,1343,2.0


We also split the ratings data into separate sets for training, validation and testing.

In [3]:
from sklearn.model_selection import train_test_split

# We want a train-validate-test split of approx 70-20-10:
# 10% is held out for testing, then 22% of the remaining 90% (about 20% overall) for validation
train_validate_ratings, test_ratings = train_test_split(ratings_small, test_size=0.1, random_state=42)
train_ratings, validate_ratings = train_test_split(train_validate_ratings, test_size=0.22, random_state=42)

# Note: .size counts cells (rows x 3 columns), not rows
print(train_ratings.size)
print(validate_ratings.size)
print(test_ratings.size)

210606
59403
30003
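
As a quick check on the split, the two `test_size` values compose into overall fractions, and because `DataFrame.size` counts cells rather than rows, the printed numbers can be divided by the three columns to recover row counts. A small sketch of that arithmetic (the variable names are illustrative and not part of the notebook):

```python
# Fractions implied by the two-stage split used above.
test_frac = 0.10                            # first split: test_size=0.1
validate_frac = (1 - test_frac) * 0.22      # second split: 22% of the remaining 90%
train_frac = (1 - test_frac) * (1 - 0.22)   # what is left for training

print(round(train_frac, 3), round(validate_frac, 3), round(test_frac, 3))

# DataFrame.size is rows * columns; the ratings frames have 3 columns.
printed_sizes = {"train": 210606, "validate": 59403, "test": 30003}
row_counts = {name: size // 3 for name, size in printed_sizes.items()}
print(row_counts)
```

The row counts give roughly a 70-20-10 split of the rated pairs, as intended.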


### 0.2 Movie Metadata

We now have a look at the movie metadata. This dataset provides the descriptive information of the movie, including a description of the movie, the production company, etc.

In this recommender, I assume that we have no access to past review scores when describing a movie (to prevent data leakage). We will be comparing movies based only on textual information.

In [4]:
metadata = pd.read_csv('movies_metadata.csv')
metadata = metadata[['id', 'title', 'overview', 'genres']]

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

metadata.head()


Unnamed: 0,id,title,overview,genres
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]"


We also have data describing the cast, crew and keywords, which we read in and join to our metadata.

In [5]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

# Print the first two movies of your newly merged metadata
metadata.head(2)

Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


It looks like the cast, crew, keywords and genres columns are in the form of "stringified" lists. I'll clean this up by parsing them into Python objects.

In [6]:
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
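
To make the parsing step concrete, here is a minimal sketch of what `literal_eval` does to one such value. The sample string is hand-written in the shape of the `genres` column and is not read from the files:

```python
from ast import literal_eval

# A hand-written sample in the shape of a stringified genres entry.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval only evaluates Python literals (lists, dicts, strings, numbers),
# so it turns the string into a real list of dicts without running arbitrary code.
parsed = literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print([genre["name"] for genre in parsed])
```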

In [7]:
metadata.head(2)

Unnamed: 0,id,title,overview,genres,cast,crew,keywords
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


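We now write functions to extract the information we need from each column: the director from the crew, and the first three names from the cast, keywords and genres.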

In [8]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [9]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
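
The behaviour of these two helpers can be checked on toy input. The sketch below restates them so that it runs on its own; the crew and cast records are hypothetical, and `math.nan` stands in for `np.nan` purely to avoid a third-party import:

```python
import math

def get_director(crew):
    # Return the name of the first crew member whose job is 'Director'.
    for member in crew:
        if member["job"] == "Director":
            return member["name"]
    return math.nan  # assumption: stand-in for np.nan in the notebook

def get_list(items):
    # Keep at most the first three names; return [] for missing/malformed input.
    if isinstance(items, list):
        return [item["name"] for item in items][:3]
    return []

# Hypothetical records shaped like the parsed crew and cast columns.
crew = [{"job": "Producer", "name": "A. Producer"},
        {"job": "Director", "name": "B. Director"}]
cast = [{"name": n} for n in ["Actor One", "Actor Two", "Actor Three", "Actor Four"]]

print(get_director(crew))   # the director's name
print(get_list(cast))       # only the first three names
print(get_list(math.nan))   # non-list input gives an empty list
```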

In [10]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [11]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


The next step is to convert the names and keywords to lowercase and strip the spaces within them, so that a multi-word name such as "Tom Hanks" becomes the single token "tomhanks".

In [12]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [13]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)
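
A short sketch of why the spaces are stripped: without it, two different people who share a first name would contribute a common token. The cast lists are hypothetical, and splitting on whitespace is a simplification of how a word-level vectorizer tokenises text:

```python
# Two hypothetical cast lists that share a first name but no actor.
cast_a = ["Tom Hanks", "Tim Allen"]
cast_b = ["Tom Cruise", "Val Kilmer"]

def tokens(names, strip_spaces):
    # Lower-case each name, optionally remove its spaces, then split on whitespace.
    cleaned = [n.lower().replace(" ", "") if strip_spaces else n.lower() for n in names]
    return set(" ".join(cleaned).split())

# Tokens shared by the two casts without and with space stripping.
shared_raw = tokens(cast_a, False) & tokens(cast_b, False)
shared_clean = tokens(cast_a, True) & tokens(cast_b, True)

print(shared_raw)
print(shared_clean)
```

With the spaces kept, the two casts overlap on "tom"; once the names are collapsed into single tokens, they no longer overlap.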

We then join the keywords, cast, director and genres of each movie into a single string, a "soup", that can be fed to a vectorizer.

In [14]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [15]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [16]:
metadata[['soup']].head(2)

Unnamed: 0,soup
0,jealousy toy boy tomhanks timallen donrickles ...
1,boardgame disappearance basedonchildren'sbook ...
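
For reference, the full soup string for one movie can be reproduced on a plain dictionary. This is a sketch: the row is typed in by hand from the cleaned Toy Story values shown earlier, with the director lower-cased and stripped of spaces as `clean_data` would do:

```python
def create_soup(row):
    # Same concatenation order as above: keywords, cast, director, genres.
    return (" ".join(row["keywords"]) + " " + " ".join(row["cast"]) + " "
            + row["director"] + " " + " ".join(row["genres"]))

# Hand-typed row mirroring the cleaned features of the first movie.
row = {
    "keywords": ["jealousy", "toy", "boy"],
    "cast": ["tomhanks", "timallen", "donrickles"],
    "director": "johnlasseter",
    "genres": ["animation", "comedy", "family"],
}

soup = create_soup(row)
print(soup)
```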


We keep only those movies whose id appears in the ratings data.

In [17]:
metadata_ratedMovies = metadata[metadata.id.isin(list(ratings_small['movieId']))]

We then turn each soup into a vector of token counts, which is the form we need to calculate similarity.

In [18]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata_ratedMovies['soup'])

In [19]:
count_matrix.shape

(2848, 9315)

We now calculate the cosine similarity between every pair of rows in our count matrix, so we know which movies are like one another based on the count vectors of their 'soup'.

In [20]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)
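
To show what each entry of such a matrix represents, here is a minimal pure-Python sketch of cosine similarity between token-count vectors. It is not how scikit-learn computes it internally, and the three soups are made up for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two token-count vectors stored as Counters:
    # dot product divided by the product of the Euclidean norms.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up soups: m1 and m2 share three tokens, m1 and m3 share one.
soups = {
    "m1": "toy boy tomhanks animation comedy",
    "m2": "toy tomhanks animation family",
    "m3": "fishing waltermatthau romance comedy",
}
counts = {name: Counter(text.split()) for name, text in soups.items()}

print(round(cosine(counts["m1"], counts["m2"]), 3))
print(round(cosine(counts["m1"], counts["m3"]), 3))
```

The more tokens two soups share relative to their lengths, the closer the score is to 1; soups with no tokens in common score 0.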

We also want a way to look up a movie's row position in the similarity matrix from its id.

In [21]:
# Reset the index of the filtered DataFrame and build a reverse mapping from movie id to row position
metadata_ratedMovies = metadata_ratedMovies.reset_index()
indices = pd.Series(metadata_ratedMovies.index, index=metadata_ratedMovies['id'])

## 1. Training

We now use our training dataset to see what predictions of user ratings we get based on different methods.

### 1.1 k-nearest neighbours

We base this method on calculating the average rating of the k closest movies, where closeness is measured by cosine similarity. We will vary k and plot the results as part of our hyperparameter optimisation.
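
One reading of this method, sketched below on toy data, is to predict a user's rating for a movie as the mean of that same user's ratings over the k closest movies they have already rated. The function names, the similarity matrix, the ratings, and the choice to return `None` when none of the neighbours has been rated are all assumptions made for illustration:

```python
def k_closest(sim_row, own_index, k):
    # Indices of the k most similar movies, excluding the movie itself.
    order = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)
    return [j for j in order if j != own_index][:k]

def predict_rating(sim, user_ratings, target, k):
    # Average the user's ratings over those of the k closest movies they have rated.
    neighbours = k_closest(sim[target], target, k)
    rated = [user_ratings[j] for j in neighbours if j in user_ratings]
    return sum(rated) / len(rated) if rated else None

# Hypothetical 4-movie similarity matrix and one user's ratings (row index -> rating).
sim = [
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]
user_ratings = {1: 4.0, 2: 2.0, 3: 3.0}

prediction = predict_rating(sim, user_ratings, target=0, k=2)
print(prediction)
```

Excluding the target by index, rather than by dropping the first sorted entry, avoids relying on the movie always ranking first when another movie has an identical soup.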

In [22]:
train_ratings.head()

Unnamed: 0,userId,movieId,rating
47814,352,111362,3.5
68971,478,59315,3.5
18162,119,2478,3.0
8534,56,8665,4.0
3129,19,48,3.0


We create a function to return the ids of the k closest movies to a given movie.

In [23]:
# Function that takes a movie id as input and returns the ids of the k most similar movies
def get_recommendations(movieId, cosine_sim=cosine_sim, k=2):

    # Get the row position of the movie that matches the id
    idx = indices[movieId]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the k most similar movies, skipping the first entry (expected to be the movie itself)
    sim_scores = sim_scores[1:(k+1)]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the ids of the k most similar movies
    return metadata_ratedMovies['id'].iloc[movie_indices]

In [25]:
get_recommendations(111362, cosine_sim, 5)

KeyError: 111362

In [28]:
metadata_ratedMovies[metadata_ratedMovies['id']==111362]

Unnamed: 0,index,id,title,overview,genres,cast,crew,keywords,director,soup


In [27]:
train_ratings[train_ratings['movieId']==111362]

Unnamed: 0,userId,movieId,rating
47814,352,111362,3.5
37827,272,111362,3.5
81434,553,111362,4.5
35162,251,111362,4.5
76590,529,111362,3.0
68807,475,111362,3.5
28135,205,111362,3.0
37678,270,111362,3.5
2543,15,111362,3.5
16193,104,111362,4.0
