# Building Content-based Recommenders

In this part we build content-based recommenders. In the previous chapter, "Building Knowledge-based recommender", we required the user to enter a genre, a timeline and the minimum and maximum runtime for a movie, and this gave satisfactory results.

However, our model is still very generic. Suppose a user likes "The Avengers", "Spiderman" and "Iron Man". It is fairly obvious that this user likes superhero movies, yet our model from the previous chapter cannot capture this detail. The best it can do is suggest action movies (when the user enters the action genre).

Another example: take two comedy movies, "Duplex" and "Isn't it romantic", and assume that they have similar runtime, genre and timeline but appeal to very different audiences. Our previous model would treat them as interchangeable.

To fix this issue, an obvious solution would be to ask the user for more metadata as input. For instance, we could introduce a sub-genre input (superhero, black comedy, romantic comedy and so on), but this solution suffers heavily in terms of usability.

One problem is that we do not possess data on sub-genres, and even if we did, our users are extremely unlikely to know the metadata of their favorite movies. Instead, we can let users tell us which movies they like or dislike and provide recommendations that match their tastes.

In this part, we are going to build two different types of content-based recommenders:
- Plot description-based recommender
- Metadata-based recommender

# Plot description-based recommender

This model compares the descriptions and taglines of different movies and provides recommendations that have the most similar plot descriptions.

To accomplish this, we compute the pairwise similarity between bodies of text. To do so, each document is represented as a vector: a series of n numbers, where each number corresponds to one dimension and n is the size of the vocabulary of all documents put together.
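
As a minimal sketch of this representation, the snippet below builds the shared vocabulary of three made-up overviews and turns each one into a vector of word counts. The sample texts and the helper name `to_count_vector` are illustrative assumptions and are not part of the dataset or of any library.

```python
# Toy overviews (invented for illustration, not taken from the dataset)
docs = [
    "a lion cub becomes king",
    "a toy cowboy meets a toy spaceman",
    "a lion and a bear become friends",
]

# The vocabulary is the sorted set of all words across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Return one number per vocabulary word: how often it occurs in doc."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]

# Every vector has n dimensions, where n is the vocabulary size
print(len(vocabulary))
for vec in vectors:
    print(vec)
```

Each document gets a vector of the same length, which is what makes pairwise comparison possible.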

But what are the values of these vectors? The answer depends on the vectorizer used to convert our documents into vectors. The two most popular ones are scikit-learn's CountVectorizer, which stores raw word counts, and TfidfVectorizer, which weights each count by how rare the word is across all documents.

We will use TfidfVectorizer in conjunction with cosine similarity to compute the similarity between documents.
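
To make the idea concrete, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity. It uses the textbook formula idf = log(N / df); scikit-learn's TfidfVectorizer applies a smoothed variant and normalizes each vector, so its numbers will differ. The documents and the helper names (`idf`, `tfidf_vector`, `cosine`) are assumptions made for this illustration.

```python
import math

# Toy tokenized documents (invented for illustration)
docs = [
    ["lion", "king", "pride"],
    ["lion", "cub", "pride"],
    ["toy", "cowboy", "spaceman"],
]
vocabulary = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

def idf(term):
    # Textbook inverse document frequency: rarer terms get larger weights
    df = sum(term in doc for doc in docs)
    return math.log(n_docs / df)

def tfidf_vector(doc):
    # Term frequency (raw count) multiplied by the inverse document frequency
    return [doc.count(term) * idf(term) for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = [tfidf_vector(doc) for doc in docs]
print(cosine(vectors[0], vectors[1]))  # shares "lion" and "pride": positive
print(cosine(vectors[0], vectors[2]))  # no shared words: 0.0
```

Documents that share words get a positive score, and documents with no words in common get a score of zero.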

The steps for building this model are as follows:
- Obtain the data required to build the model
- Create TF-IDF vectors for the plot description (or overview) of every movie
- Compute the pairwise cosine similarity score of every movie
- Write the recommender function that takes as input the movie title and outputs a list of recommended similar movies

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

In [2]:
# Import original dataframe 'movies_metadata.csv'
orig_df = pd.read_csv('../datasets/movies_metadata.csv')



In [3]:
# Display the first 5 rows
orig_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
# Import cleaned dataframe 'movie_metadata_clean.csv'
clean_df = pd.read_csv('../datasets/movie_metadata_clean.csv')

In [5]:
# Display the first 5 rows
clean_df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995


In [6]:
# Add the 'overview' and 'id' features from orig_df to clean_df (rows are matched by index)
clean_df['overview'], clean_df['id'] = orig_df['overview'], orig_df['id']

In [7]:
# Check dataframe for the last change
clean_df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


In [8]:
# Next step: represent each 'overview' as a TF-IDF vector
# Import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
# Create a vectorizer that removes all English stop words and turns each overview into a vector
tfidf = TfidfVectorizer(stop_words='english')

In [10]:
# Replace NaN with an empty string
clean_df['overview'] = clean_df['overview'].fillna('')

In [11]:
# Next, we construct the required TF-IDF matrix by applying the fit_transform method on the overview feature.
# We build the model on the first 15000 overviews only, since using all of them gives
# us a memory error.
tfidf_matrix = tfidf.fit_transform(clean_df['overview'][:15000])

In [12]:
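# Next, we compute similarity scores. TfidfVectorizer L2-normalizes its vectors by default, so the dot
# product computed by linear_kernel equals the cosine similarity without an extra normalization step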
from sklearn.metrics.pairwise import linear_kernel

In [13]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
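
The reason a plain dot product can stand in for cosine similarity here is that TfidfVectorizer scales every row to unit length by default (`norm='l2'`). The following pure-Python sketch checks this identity on two arbitrary vectors; the helper names and the sample values are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two arbitrary non-zero vectors (illustrative values)
u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 2.0]

# After L2 normalization the plain dot product is already the cosine
print(math.isclose(dot(l2_normalize(u), l2_normalize(v)), cosine(u, v)))
```

For vectors that are not normalized, such as raw word counts, the two quantities generally differ.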

In [14]:
cosine_sim.shape

(15000, 15000)
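
A quick back-of-the-envelope calculation shows why the number of movies has to be limited. The sketch assumes the similarity matrix is stored densely with 8-byte floats; `dense_matrix_gigabytes` is a hypothetical helper written for this illustration.

```python
def dense_matrix_gigabytes(n_rows, bytes_per_value=8):
    """Memory needed for an n_rows x n_rows matrix of 8-byte floats, in GB."""
    return n_rows * n_rows * bytes_per_value / 1e9

print(dense_matrix_gigabytes(15000))  # 1.8
print(dense_matrix_gigabytes(30000))  # 7.2
```

Memory grows with the square of the number of movies, so doubling the rows quadruples the footprint.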

In [15]:
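# The next step is to build our recommender function. We will create a reverse mapping of movie titles and their
# respective indices. Duplicate titles are dropped (first occurrence kept) so that every title maps to one index.
indices = pd.Series(clean_df.index, index=clean_df['title'])
indices = indices[~indices.index.duplicated(keep='first')]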

In [16]:
# Print indices
indices.head(10)

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

The recommender function performs the following steps:
- take the title of the movie as an argument
- obtain the index of the movie from the indices reverse mapping
- get the list of cosine similarity scores between that movie and all movies using cosine_sim
- convert it into a list of tuples where the first element is the position and the second is the similarity score
- sort the list of tuples in descending order of cosine similarity score
- ignore the first element, as it refers to the similarity score of the movie with itself (the most similar movie to a particular movie is obviously the movie itself)
- take the next 10 elements of the list
- return the titles corresponding to the indices of those 10 elements

In [17]:
# Create function that takes as input movie title and returns the top 10 list of recommended movies
def content_recommender(title, cosine_sim=cosine_sim, clean_df=clean_df, indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]
    print("idx: ", str(idx))
    
    # Get the pairwise similarity scores of all movies with that movie
    # And convert it to a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the 10 most similar movies. Ignore the first movie
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return clean_df['title'].iloc[movie_indices]
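
To see the ranking steps in isolation, the sketch below applies them to a tiny hand-written similarity matrix. The titles, the scores and the helper name `top_similar` are made up for illustration; only the enumerate, sort and slice pattern mirrors the function above.

```python
# Hand-written 4x4 similarity matrix (illustrative values, not real scores)
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

def top_similar(idx, sim, k=2):
    # Pair every position with its score, sort by score, skip the movie itself
    scored = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    return [position for position, _ in scored[1:k + 1]]

print([titles[i] for i in top_similar(0, toy_sim)])  # ['Movie C', 'Movie B']
```

Skipping the first element assumes the movie itself ranks first, which holds as long as no earlier row has an identical score of 1.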

In [18]:
# Test content_recommender() function
content_recommender('The Lion King')

idx:  359


9353                         The Lion King 1½
9115           The Lion King 2: Simba's Pride
6094                                Born Free
3203                         The Waiting Game
14402    Michael Jackson: Life of a Superstar
6574            Once Upon a Time in China III
3293                                 The Bear
2779                    Napoleon and Samantha
11507                     David and Bathsheba
892                          The Wizard of Oz
Name: title, dtype: object

Looking at the generated results, we can see that our recommender placed the two "The Lion King" sequels at the top of its list. The list also contains movies about other kinds of animals, as the title "The Bear" suggests. It goes without saying that a person who loves "The Lion King" is very likely to have a passion for Disney movies.
Unfortunately, our plot description model is not able to capture this information.

In the next section, we will take a look at another type of content-based recommender, which uses more advanced metadata such as genres, cast, crew, and keywords (sub-genres).

# Metadata-based recommender

This model takes a host of features such as genres, keywords, cast and crew into consideration and provides recommendations that are the most similar with respect to the aforementioned features.

To build this model the following metadata will be used:
- genre of the movie
- director of the movie (this person is part of the crew)
- the movie's three major stars (they are part of the cast)
- sub-genres or keywords

Our dataset already contains the genres, but the other features are not part of the original file. For those we will use two additional files: credits.csv, which contains information on the cast and crew of each movie, and keywords.csv, which contains information on the sub-genres.

# The keywords and credits datasets

In [19]:
# Import required libraries
import pandas as pd
import numpy as np

In [20]:
# Load "keywords.csv" file
key_df = pd.read_csv("../datasets/keywords.csv")

In [21]:
# Print the first 5 rows
key_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [22]:
# Load "credits.csv" file
cred_df = pd.read_csv("../datasets/credits.csv")

In [23]:
# Print the first 5 rows
cred_df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [24]:
# Next, we join all 3 datasets using the "id" as the join key
# First, we convert the id to integer type
# clean_df['id'] = clean_df['id'].astype('int') ---> generates invalid literal for int() with base 10: '1997-08-20' error

Looking at the error, we can see that 1997-08-20 is listed as an ID. This is clearly bad data. We therefore need to find all rows with bad IDs and remove them so that the code can run successfully.

In [25]:
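# Function to convert all non-integer IDs to NaN
def clean_ids(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        # int() raises ValueError for strings such as '1997-08-20' and TypeError for None
        return np.nan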
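
The behaviour of this kind of cleaning function can be checked on a few sample values. The sketch below is a stand-alone variant that uses `math.nan` in place of `np.nan` so that it runs without NumPy; the function name `clean_id` and the sample inputs are assumptions made for illustration.

```python
import math

def clean_id(value):
    """Stand-alone variant of clean_ids that uses math.nan instead of np.nan."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return math.nan

samples = ["862", 8844, "1997-08-20", None]
cleaned = [clean_id(s) for s in samples]
print(cleaned)  # [862, 8844, nan, nan]

# Rows whose cleaned ID is NaN are the ones to drop afterwards
valid = [c for c in cleaned if not (isinstance(c, float) and math.isnan(c))]
print(valid)  # [862, 8844]
```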

In [26]:
# Clean the ids of clean_df
clean_df['id'] = clean_df['id'].apply(clean_ids)

In [27]:
# Filter all rows that have null ID (copy the result so that later assignments do not trigger SettingWithCopyWarning)
clean_df = clean_df[clean_df['id'].notnull()].copy()

We are now in a good position to convert the IDs of all three DataFrames into integers and merge them into a single DataFrame.

In [28]:
# Convert clean_df['id'] into integer type
clean_df['id'] = clean_df['id'].astype('int')


In [29]:
# Convert key_df['id'] into integer type
key_df['id'] = key_df['id'].astype('int')

In [30]:
# Convert cred_df['id'] into integer type
cred_df['id'] = cred_df['id'].astype('int')

In [31]:
# Merge all 3 data frames into a single one using 'id' as the join key
# First merge clean_df with cred_df on 'id'
clean_df = clean_df.merge(cred_df, on='id')

In [32]:
# Merge clean_df with key_df on 'id'
clean_df = clean_df.merge(key_df, on='id')

In [33]:
# Check for the last change
clean_df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
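
Note that `DataFrame.merge` performs an inner join by default, so a movie whose id is missing from either credits.csv or keywords.csv drops out of the result. The pure-Python sketch below imitates that behaviour with dictionaries; the toy tables and the helper `inner_join` are assumptions made for illustration and ignore details such as duplicate ids.

```python
# Toy tables keyed by movie id (values invented for illustration)
movies = {862: "Toy Story", 8844: "Jumanji", 15602: "Grumpier Old Men"}
credits = {862: "John Lasseter", 8844: "Joe Johnston"}
keywords = {862: ["toy"], 15602: ["fishing"]}

def inner_join(left, right):
    """Keep only the ids present in both tables and combine their values."""
    return {key: (left[key], right[key]) for key in left if key in right}

step1 = inner_join(movies, credits)      # ids 862 and 8844 survive
step2 = inner_join(step1, keywords)      # only id 862 survives
print(sorted(step1))  # [862, 8844]
print(sorted(step2))  # [862]
```

It is therefore worth comparing the number of rows before and after the merge.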


# Wrangling keywords, cast and crew

The following steps will be performed:
- convert keywords into a list of strings where each string is a keyword (similar to genres). We will include only the top three keywords
- convert cast into a list of strings where each string is a star. Similar to keywords, we will include only the top three stars
- convert crew into director. We will extract only the director of the movie and ignore all other crew members

In [34]:
# Convert the stringified objects into native python objects
from ast import literal_eval

In [35]:
# Create a list with required features
features = ['cast', 'crew', 'keywords', 'genres']

In [36]:
for feature in features:
    clean_df[feature] = clean_df[feature].apply(literal_eval)
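
The effect of `literal_eval` is easiest to see on a single value. In the sketch below the string is a made-up cell in the same shape as a 'keywords' entry; the second dictionary is a placeholder and not a real record.

```python
from ast import literal_eval

# A stringified list of dictionaries, shaped like a 'keywords' cell
raw = "[{'id': 931, 'name': 'jealousy'}, {'id': 0, 'name': 'placeholder'}]"

parsed = literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)  # str list
print(parsed[0]["name"])                          # jealousy

# literal_eval only accepts Python literals, so arbitrary expressions are rejected
try:
    literal_eval("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```

This is why `literal_eval` is preferred over `eval` when parsing data read from a file.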

In [37]:
# Next, extract director from the crew list
# First, examine crew object
clean_df.iloc[0]['crew'][0]

{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}

In [38]:
# Create function to extract director name from the list
def get_director(x):
    for crew_member in x:
        if crew_member['job'] == 'Director':
            return crew_member['name']
    return np.nan

In [39]:
# Define new 'director' feature
clean_df['director'] = clean_df['crew'].apply(get_director)

In [40]:
# Print the directors of the first 5 movies
clean_df['director'].head()

0      John Lasseter
1       Joe Johnston
2      Howard Deutch
3    Forest Whitaker
4      Charles Shyer
Name: director, dtype: object

Both 'keywords' and 'cast' are lists of dictionaries as well. We need to extract the name attribute of the top three elements of each list.
We will write a function that wrangles both of these features. As with 'keywords' and 'cast', we will also
consider only the top three genres.

In [41]:
# Examine cast object
clean_df.iloc[0]['cast'][0]

{'cast_id': 14,
 'character': 'Woody (voice)',
 'credit_id': '52fe4284c3a36847f8024f95',
 'gender': 2,
 'id': 31,
 'name': 'Tom Hanks',
 'order': 0,
 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}

In [42]:
# Examine keywords object
clean_df.iloc[0]['keywords'][0]

{'id': 931, 'name': 'jealousy'}

In [43]:
# Create function that returns at most the first 3 names from the list
def generate_list(x):
    if isinstance(x, list):
        # Slicing returns the whole list when it has fewer than 3 elements
        return [elem['name'] for elem in x][:3]

    # Return empty list in case of missing/malformed data
    return []
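
The three cases this function has to handle (a long list, a short list and missing data) can be tried out on toy input. The variant below, with the hypothetical name `top_names` and an added `limit` parameter, is a sketch for experimentation and not the function used in the rest of this notebook.

```python
def top_names(items, limit=3):
    """Return the 'name' of the first `limit` dictionaries, or [] for bad input."""
    if isinstance(items, list):
        return [item["name"] for item in items[:limit]]
    return []

cast = [{"name": "Actor A"}, {"name": "Actor B"}, {"name": "Actor C"}, {"name": "Actor D"}]
print(top_names(cast))          # ['Actor A', 'Actor B', 'Actor C']
print(top_names(cast[:2]))      # ['Actor A', 'Actor B']
print(top_names(float("nan")))  # []
```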

In [44]:
# We will use this function to wrangle 'cast' and 'keywords' features
# Apply the generate_list function on 'cast' feature
clean_df['cast'] = clean_df['cast'].apply(generate_list)

In [45]:
# Apply the generate_list function on 'keywords' feature
clean_df['keywords'] = clean_df['keywords'].apply(generate_list)

In [46]:
# We will consider maximum 3 genres for each movie
clean_df['genres'] = clean_df['genres'].apply(lambda x: x[:3])

In [47]:
# Check clean_df for the last change on required features
clean_df[['title', 'cast', 'director', 'keywords', 'genres']].head()

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"
3,Waiting to Exhale,"[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[based on novel, interracial relationship, sin...","[Comedy, Drama, Romance]"
4,Father of the Bride Part II,"[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer,"[baby, midlife crisis, confidence]",[Comedy]


In [48]:
clean_df.iloc[1]['keywords']

['board game', 'disappearance', "based on children's book"]

Next, we will use a vectorizer to build document vectors. If two actors share the same first name, e.g. Ryan Reynolds and Ryan Gosling, the vectorizer will treat both Ryans as the same token, although they are clearly different entities.

This will impact the quality of the recommendations we receive: a person who likes Ryan Reynolds' movies does not necessarily like Ryan Gosling's movies as well.

To avoid this issue, we strip the spaces within keywords, actor names and director names and convert them all to lowercase. The two examples above become ryanreynolds and ryangosling, and our vectorizer is able to distinguish them.
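
The effect can be demonstrated with a simplified tokenizer. The helper `tokens` below only lowercases and splits on whitespace, which is a rough stand-in for what a vectorizer does, and `squash` is a hypothetical name for the space-stripping step.

```python
def tokens(text):
    # Simplified stand-in for a vectorizer's tokenizer: lowercase and split on spaces
    return set(text.lower().split())

def squash(name):
    # Remove spaces and lowercase, so each full name becomes a single token
    return name.replace(" ", "").lower()

a, b = "Ryan Reynolds", "Ryan Gosling"

print(tokens(a) & tokens(b))                  # {'ryan'}
print(tokens(squash(a)) & tokens(squash(b)))  # set()
```

Before squashing, the two names share the token "ryan"; afterwards they have nothing in common.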

In [49]:
# Function to sanitize data to prevent ambiguity
def sanitize(x):
    # Check if x is instance of list
    if isinstance(x, list):
        # Strip spaces and convert to lowercase
        return [i.replace(" ", "").lower() for i in x]
    # Check if director exists. If not, return empty string
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ''

In [50]:
# Apply the sanitize function to cast, director, keywords, genres
for feature in ['cast', 'director', 'genres', 'keywords']:
    clean_df[feature] = clean_df[feature].apply(sanitize)

# Creating the metadata soup

In the plot description-based recommender, we worked only with the 'overview' feature. It was easy to generate a vocabulary from a single feature containing string data. Now we are working with four features, one of which is a string while the others are lists of strings, so we need to create a soup that contains the actors, director, keywords and genres.

This way we can feed the soup into our vectorizer and perform similar steps as before.

In [51]:
# Function to create a soup out of the desired data
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' '  + ' '.join(x['genres']) 

In [52]:
# Create the new soup feature
clean_df['soup'] = clean_df.apply(create_soup, axis=1)

In [53]:
# Check the first soup value
clean_df.iloc[0]['soup']

'jealousy toy boy tomhanks timallen donrickles johnlasseter animation comedy family'

With the soup created, we are now in a good position to create our document vectors, compute similarity scores, and build the metadata-based recommender function.

# Generating the recommendations

Next we will use CountVectorizer instead of TfidfVectorizer. The reason is that TfidfVectorizer would give
less weight to actors and directors who have acted in or directed a relatively large number of movies, which we do not want here.
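
The down-weighting can be seen with the textbook inverse document frequency, idf = log(N / df). The sketch below uses made-up soups and names; scikit-learn applies a smoothed variant of this formula, so the exact numbers differ, but the ordering is the same.

```python
import math

# Toy soups: one director appears in three of four movies (names are made up)
soups = [
    "prolificdirector actorone comedy",
    "prolificdirector actortwo drama",
    "prolificdirector actorthree comedy",
    "raredirector actorfour drama",
]
docs = [soup.split() for soup in soups]

def idf(term):
    # Textbook inverse document frequency, log(N / df)
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

print(round(idf("prolificdirector"), 3))  # 0.288
print(round(idf("raredirector"), 3))      # 1.386
```

With plain counts, both directors would contribute a weight of 1 to every movie they appear in.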

In [54]:
# Create CountVectorizer object
from sklearn.feature_extraction.text import CountVectorizer

In [55]:
count_vect = CountVectorizer(stop_words='english')

In [56]:
count_matrix = count_vect.fit_transform(clean_df['soup'][:10000])

In [57]:
# Using the CountVectorizer means that we are forced to use the more computationally expensive cosine_similarity function
from sklearn.metrics.pairwise import cosine_similarity

In [58]:
# Check the shape of the matrix
count_matrix.shape

(10000, 20264)

In [59]:
# Applying cosine_similarity on the full vocabulary (73881 terms) is too expensive, so count_matrix was
# built from the first 10000 movies only
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [60]:
# Since we dropped a few movies with bad IDs, we need to construct our reverse mapping again
clean_df = clean_df.reset_index()

In [61]:
indices2 = pd.Series(clean_df.index, index=clean_df['title'])

In [62]:
# Reuse the content_recommender function written for the plot description-based recommender
content_recommender('The Lion King', cosine_sim2, clean_df, indices2)

idx:  359


3336                Creature Comforts
3497                     Time Masters
3724    Thomas and the Magic Railroad
7090                    Teacher's Pet
1018              So Dear to My Heart
2787                       Thumbelina
4949            The Flight of Dragons
1648                 Ill Gotten Gains
3487       Jails, Hospitals & Hip-Hop
651         James and the Giant Peach
Name: title, dtype: object

Looking at the generated recommendations, we can see that our model captured more than just lions. Several of the titles refer to creatures such as dragons and giants, which suggests that the model picked up on the animation genre.

The model we built is nowhere near as powerful as the models used in industry. There is still room for improvement.

Suggestions for improving the model:
- we could experiment with including more than just 3 keywords in the metadata soup.

- we could also experiment with different features, since our dataset contains information on production companies, countries, and languages.

- we could come up with more well-defined sub-genres. As with genres, we could define a fixed number of sub-genres and assign only these sub-genres to the movies.

- we could give more weight to the director. Our model gave as much importance to the director as to each actor. We could give more emphasis to the director by mentioning this individual multiple times in our soup instead of just once.

- we could also consider other members of the crew. The director isn't the only person who gives the movie its character. We could consider adding other crew members, such as producers and screenwriters, to our soup.

- we could introduce a popularity filter. It is possible that two movies have the same genres and sub-genres, but differ widely in quality and popularity. In such cases we could consider the n most similar movies, compute a weighted rating and display the top five results.
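
As one possible sketch of the director-weighting idea, the function below repeats the director's name in the soup so that a count-based vectorizer assigns it a larger value. The name `create_weighted_soup` and the `director_weight` parameter are hypothetical, and the value 3 is an arbitrary choice that would need tuning.

```python
def create_weighted_soup(row, director_weight=3):
    """Join the metadata into one string, repeating the director name."""
    parts = row["keywords"] + row["cast"] + [row["director"]] * director_weight + row["genres"]
    # Skip empty entries, e.g. a missing director stored as ''
    return " ".join(part for part in parts if part)

# Sanitized values in the shape produced earlier in this notebook
row = {
    "keywords": ["jealousy", "toy", "boy"],
    "cast": ["tomhanks", "timallen", "donrickles"],
    "director": "johnlasseter",
    "genres": ["animation", "comedy", "family"],
}
print(create_weighted_soup(row))
```

With `director_weight=1` the result matches the layout of the original soup.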
