# Knowledge-based recommender

In this part we will use the full MovieLens dataset from the previous lecture and start building a knowledge-based recommender on top of our IMDB Top 250 clone.

The knowledge-based recommender will be a simple function that performs the following tasks:
- Ask the user for the genre of movies they are looking for
- Ask the user for the shortest and longest acceptable duration
- Ask the user for the timeline (earliest and latest release year) of the movies recommended
- Using the information collected, recommend movies that have a high weighted rating (according to the IMDB formula) and that satisfy the preceding conditions

Our dataset contains information about duration, genres, and timelines, but it is not currently in a form that is directly usable. In other words, we need to wrangle the data before we can use it to build the recommender.

In [1]:
# Importing required libraries
import pandas as pd
import numpy as np

In [2]:
# Read the movies_metadata.csv file into a pandas DataFrame
df = pd.read_csv('../datasets/movies_metadata.csv')


In [3]:
# Print all features (or columns) of the DataFrame
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

At this point it is clear which features we require and which we do not.

In [4]:
# Keep only the features that we require (copy so that later column assignments do not trigger SettingWithCopyWarning)
df = df[['title','genres', 'release_date', 'runtime', 'vote_average', 'vote_count']].copy()

In [5]:
df.head()

Unnamed: 0,title,genres,release_date,runtime,vote_average,vote_count
0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",1995-10-30,81.0,7.7,5415.0
1,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",1995-12-15,104.0,6.9,2413.0
2,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",1995-12-22,101.0,6.5,92.0
3,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",1995-12-22,127.0,6.1,34.0
4,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",1995-02-10,106.0,5.7,173.0


In [6]:
# Convert release_date (object) to datetime format
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

In [7]:
# Extract the year from the datetime; rows with a missing date (NaT) get NaN
df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0] if pd.notna(x) else np.nan)

The 'year' feature still has the object dtype: valid years are stored as strings, and rows whose release date was missing or could not be parsed hold a null value.
We will convert these null values to the integer 0 and cast the year feature to int.

To do this, we define a helper function, convert_int, and apply it to the year feature:

In [8]:
# Helper function to convert missing or unparseable years to 0 and all other years to integers
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return 0

In [9]:
# Apply convert_int function to the year feature
df['year'] = df['year'].apply(convert_int)
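
As an alternative to the helper function, the same conversion can be sketched as a single vectorized expression. This is only an illustration under stated assumptions: the toy Series `dates` is made up for the example and is not taken from the dataset.

```python
import pandas as pd

# Toy release dates (illustrative only); the second entry is not a valid date
dates = pd.Series(['1995-10-30', 'not a date', None])

# errors='coerce' turns unparseable or missing entries into NaT
parsed = pd.to_datetime(dates, errors='coerce')

# .dt.year gives NaN wherever the date is NaT; replace those with 0 and cast to int
years = parsed.dt.year.fillna(0).astype(int)

print(years.tolist())
```

Both approaches end with 0 as the placeholder year for movies without a usable release date, so a filter such as `year >= 2000` excludes them automatically.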

We no longer require the release_date feature, so we will remove it.

In [10]:
# Drop the release_date column
df = df.drop('release_date', axis=1)

In [11]:
# Display the dataframe
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",81.0,7.7,5415.0,1995
1,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",101.0,6.5,92.0,1995
3,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",127.0,6.1,34.0,1995
4,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",106.0,5.7,173.0,1995


# Genres

Looking at the 'genres' feature, we can see that the records resemble JSON: each one looks like a list of Python dictionaries.

In [12]:
# Print genres of the first movie
df.iloc[0]['genres']

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

We can see that the output is a stringified list of dictionaries. To make this feature usable, we need to convert it into a native Python list of dictionaries.

In [13]:
# We will use literal_eval from the ast library to convert our string into a native Python object
from ast import literal_eval

In [14]:
# Define a stringified list and output its type
a = "[1, 2, 3]"

In [15]:
print(type(a))

<class 'str'>


In [16]:
# Apply literal_eval and output type
b = literal_eval(a)

In [17]:
type(b)

list

We now have all the tools required to convert the genres feature into a native Python list of dictionaries.
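
Before applying this to the whole column, here is a minimal sketch of the two steps on a single value. It uses the genres string of the first movie printed earlier; the variable names `raw`, `parsed`, and `names` are chosen for illustration only.

```python
from ast import literal_eval

# The stringified genres of the first movie, as printed earlier
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

# Step 1: turn the string into a real list of dictionaries
parsed = literal_eval(raw)

# Step 2: keep only the genre names
names = [g['name'] for g in parsed]

print(type(parsed).__name__, names)
```

`literal_eval` only evaluates Python literals (strings, numbers, lists, dictionaries, and so on), which makes it a safer choice than `eval` for data read from a file.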

In [18]:
# Convert all NaN into stringified empty lists
df['genres'] = df['genres'].fillna('[]')

In [19]:
df['genres'].head()

0    [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1    [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2    [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3    [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4                       [{'id': 35, 'name': 'Comedy'}]
Name: genres, dtype: object

In [20]:
# Apply literal_eval to convert to the list object
df['genres'] = df['genres'].apply(literal_eval)

In [21]:
# Convert list of dictionaries to a list of strings
df['genres'] = df['genres'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [22]:
df['genres'][:20]

0            [Animation, Comedy, Family]
1           [Adventure, Fantasy, Family]
2                      [Romance, Comedy]
3               [Comedy, Drama, Romance]
4                               [Comedy]
5       [Action, Crime, Drama, Thriller]
6                      [Comedy, Romance]
7     [Action, Adventure, Drama, Family]
8          [Action, Adventure, Thriller]
9          [Adventure, Action, Thriller]
10              [Comedy, Drama, Romance]
11                      [Comedy, Horror]
12        [Family, Animation, Adventure]
13                      [History, Drama]
14                   [Action, Adventure]
15                        [Drama, Crime]
16                      [Drama, Romance]
17                       [Crime, Comedy]
18            [Crime, Comedy, Adventure]
19               [Action, Comedy, Crime]
Name: genres, dtype: object

Looking at the dataframe, we can see that the 'genres' feature now holds a list of strings for each movie. However, we are not done yet. The last step is to explode the genres column. In other words, if a particular movie has multiple genres, we will create multiple copies of the movie, each copy carrying one of the genres.

For example, if there is a movie called 'Just Go With It' that has romance and comedy as its genres, we will explode this movie into two rows: one for Just Go With It as a romance movie and the other for it as a comedy.

In [23]:
# Create a new feature by exploding genres
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)

In [24]:
# Name the new feature as 'genre'
s.name = 'genre'

In [25]:
# Create a new dataframe gen_df by dropping the old 'genres' feature and joining the new 'genre' feature
gen_df = df.drop('genres', axis=1).join(s)

In [26]:
# Print the head of the new gen_df
gen_df.head()

Unnamed: 0,title,runtime,vote_average,vote_count,year,genre
0,Toy Story,81.0,7.7,5415.0,1995,Animation
0,Toy Story,81.0,7.7,5415.0,1995,Comedy
0,Toy Story,81.0,7.7,5415.0,1995,Family
1,Jumanji,104.0,6.9,2413.0,1995,Adventure
1,Jumanji,104.0,6.9,2413.0,1995,Fantasy


Looking at the newly generated dataframe, we can see that Toy Story, whose genres feature contained three elements, has been exploded into three rows.
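
For comparison, recent versions of pandas (0.25 and later) provide `DataFrame.explode`, which performs this step directly. The sketch below uses a made-up two-row dataframe, `toy`, rather than the real dataset, and is an alternative to the stack-and-join approach used above, not the method this notebook relies on.

```python
import pandas as pd

# Toy dataframe (illustrative only) with a list-valued 'genres' column
toy = pd.DataFrame({
    'title': ['Toy Story', 'Jumanji'],
    'genres': [['Animation', 'Comedy', 'Family'], ['Adventure', 'Fantasy']],
})

# explode creates one row per list element and repeats the original index
exploded = toy.explode('genres').rename(columns={'genres': 'genre'})

print(exploded)
```

As with the approach above, the original index is repeated for every copy of a movie, so the rows that belong to the same movie remain easy to identify.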

# The build_chart function

Next, we will write the function that acts as our knowledge-based recommender. We will use the IMDB weighted rating formula from the previous chapter.

Note that we cannot reuse the values of m and C computed earlier, because we will not consider every movie, only those that match the user's preferences.

The function will perform the following steps:
- Get user input on their preferences
- Extract all movies that match the conditions set by the user
- Calculate the values of m and C only for these movies and proceed to build the chart as in the previous chapter

To generate our chart, the function takes our gen_df dataframe and the percentile used to calculate the value of m. By default, the percentile is set to 80% (0.8).
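
To make the roles of m and C concrete, here is a small worked sketch of the weighted rating, v/(v+m) * R + m/(v+m) * C, where v is a movie's vote count and R its average rating. The five movies in `toy` are invented for the example and stand in for an already filtered set of movies.

```python
import pandas as pd

# Toy data (illustrative only), standing in for an already filtered set of movies
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'vote_count': [10, 50, 100, 500, 1000],
    'vote_average': [9.0, 6.0, 7.0, 8.0, 8.5],
})

# C is the mean rating of the filtered movies; m is the 80th percentile of their vote counts
C = toy['vote_average'].mean()
m = toy['vote_count'].quantile(0.8)

# Keep only movies with at least m votes, then apply the weighted rating formula
qualified = toy[toy['vote_count'] >= m].copy()
v = qualified['vote_count']
R = qualified['vote_average']
qualified['score'] = v / (v + m) * R + m / (v + m) * C

print(m, C)
print(qualified[['title', 'score']])
```

With these numbers m works out to 600 votes, so only movie E qualifies, and its score is pulled from its own rating of 8.5 towards the group mean C. Movie A, despite its rating of 9.0, is excluded because it has only 10 votes.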

In [27]:
def build_chart(gen_df, percentile=0.8):
    #Ask for preferred genres
    print("Input preferred genre")
    genre = input()
    
    #Ask for lower limit of duration
    print("Input shortest duration")
    low_time = int(input())
    
    #Ask for upper limit of duration
    print("Input longest duration")
    high_time = int(input())
    
    #Ask for lower limit of timeline
    print("Input earliest year")
    low_year = int(input())
    
    #Ask for upper limit of timeline
    print("Input latest year")
    high_year = int(input())
    
    #Define a new movies variable to store the preferred movies. Copy the contents of gen_df to movies
    movies = gen_df.copy()
    
    #Filter based on the condition
    movies = movies[(movies['genre'] == genre) & 
                    (movies['runtime'] >= low_time) & 
                    (movies['runtime'] <= high_time) & 
                    (movies['year'] >= low_year) & 
                    (movies['year'] <= high_year)]
    
    #Compute the values of C and m for the filtered movies
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)
    
    #Only consider movies that have at least m votes. Save this in a new dataframe qualified_movies
    qualified_movies = movies.copy().loc[movies['vote_count'] >= m]
    
    #Calculate score using the IMDB formula
    qualified_movies['score'] = qualified_movies.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) 
                                       + (m/(m+x['vote_count']) * C)
                                       ,axis=1)

    #Sort movies in descending order of their scores
    qualified_movies = qualified_movies.sort_values('score', ascending=False)
    
    return qualified_movies

In [28]:
# Generate the chart for the preferences entered below (Comedy in this run) and display the top 5
build_chart(gen_df).head()

Input preferred genre
Comedy
Input shortest duration
90
Input longest duration
120
Input earliest year
2000
Input latest year
2014


Unnamed: 0,title,runtime,vote_average,vote_count,year,genre,score
18465,The Intouchables,112.0,8.2,5410.0,2011,Comedy,8.125848
22841,The Grand Budapest Hotel,99.0,8.0,4644.0,2014,Comedy,7.920984
13724,Up,96.0,7.8,7048.0,2009,Comedy,7.751941
24455,Big Hero 6,102.0,7.8,6289.0,2014,Comedy,7.746291
15348,Toy Story 3,103.0,7.6,4710.0,2010,Comedy,7.53575
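
Because build_chart reads its preferences with input(), it is awkward to call from other code or from tests. The sketch below shows one possible non-interactive variant. The function name `build_chart_from_params`, its parameter names, and the `toy_gen_df` dataframe are hypothetical and introduced only for this illustration; the filtering and scoring logic follows the function above.

```python
import pandas as pd

def build_chart_from_params(gen_df, genre, low_time, high_time,
                            low_year, high_year, percentile=0.8):
    """Hypothetical non-interactive variant of build_chart."""
    # Filter on genre, runtime range, and year range (between is inclusive)
    movies = gen_df[(gen_df['genre'] == genre) &
                    gen_df['runtime'].between(low_time, high_time) &
                    gen_df['year'].between(low_year, high_year)]

    # Compute C and m for the filtered movies only
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)

    # Keep movies with at least m votes and score them with the weighted rating
    qualified = movies[movies['vote_count'] >= m].copy()
    v, R = qualified['vote_count'], qualified['vote_average']
    qualified['score'] = v / (v + m) * R + m / (v + m) * C

    return qualified.sort_values('score', ascending=False)

# Toy exploded dataframe (illustrative only) with the same columns as gen_df
toy_gen_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'runtime': [95.0, 110.0, 100.0, 150.0],
    'vote_average': [8.0, 7.0, 6.0, 9.0],
    'vote_count': [1000.0, 200.0, 50.0, 3000.0],
    'year': [2005, 2010, 2012, 2011],
    'genre': ['Comedy', 'Comedy', 'Comedy', 'Comedy'],
})

chart = build_chart_from_params(toy_gen_df, 'Comedy', 90, 120, 2000, 2014)
print(chart[['title', 'score']])
```

In this toy run, movie D is filtered out by its 150-minute runtime even though it has the highest rating, which shows that m and C are computed only from the movies that match the preferences.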


In [29]:
# Save the cleaned (non-exploded) dataframe df as a CSV file in the datasets folder
df.to_csv('../datasets/movie_metadata_clean.csv', index=False)

In [30]:
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"[Animation, Comedy, Family]",81.0,7.7,5415.0,1995
1,Jumanji,"[Adventure, Fantasy, Family]",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"[Romance, Comedy]",101.0,6.5,92.0,1995
3,Waiting to Exhale,"[Comedy, Drama, Romance]",127.0,6.1,34.0,1995
4,Father of the Bride Part II,[Comedy],106.0,5.7,173.0,1995
