The rapid growth of data collection has led to a new era of information. Data is being used to create more efficient systems, and this is where Recommendation Systems come into play. Recommendation Systems are a type of information filtering system: they improve the quality of search results and provide items that are more relevant to the search item or are related to the search history of the user.

They are used to predict the rating or preference that a user would give to an item. Almost every major tech company has applied them in some form or other: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow. Moreover, companies like Netflix and Spotify depend heavily on the effectiveness of their recommendation engines for their business and success.

In this notebook, I will attempt to implement a few recommendation algorithms (content based, popularity based and collaborative filtering) and try to build an ensemble of these models to come up with our final recommendation system. We have two MovieLens datasets to work with.

The Full Dataset: Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
The Small Dataset: Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

I will build a Simple Recommender using movies from the Full Dataset, whereas all personalised recommender systems will use the Small Dataset (because the computing power available to me is very limited). As a first step, I will build the simple recommender system.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
#from surprise import evaluate

import warnings; warnings.simplefilter('ignore')

# Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is straightforward. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [2]:
md = pd.read_csv('movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
md['genres']

0        [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1        [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2        [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4                           [{'id': 35, 'name': 'Comedy'}]
                               ...                        
45461    [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462                        [{'id': 18, 'name': 'Drama'}]
45463    [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464                                                   []
45465                                                   []
Name: genres, Length: 45466, dtype: object

In [4]:
md['genres']=md['genres'].fillna('[]').apply(literal_eval).apply(lambda x:[i['name'] for i in x] if isinstance (x,list) else [])
md['genres']

0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres, Length: 45466, dtype: object
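
To make the parsing step above easier to follow, here is a minimal sketch on a toy column. The sample strings and the helper name `parse_names` are my own illustrations, not part of the dataset or the notebook; the idea is simply that `literal_eval` turns each stringified list of dicts into real Python objects, from which only the `name` values are kept.

```python
from ast import literal_eval

import pandas as pd

# Toy column mimicking the raw 'genres' strings (values are illustrative).
raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[]",
    None,  # a missing value
])


def parse_names(cell):
    """Return the 'name' values from a stringified list of dicts (hypothetical helper)."""
    parsed = literal_eval(cell)
    return [d['name'] for d in parsed] if isinstance(parsed, list) else []


# Missing values are replaced by the string '[]' so literal_eval always gets a string.
genre_lists = raw.fillna('[]').apply(parse_names)
print(genre_lists.tolist())
```

Filling missing cells with the string `'[]'` before parsing keeps the column uniformly list-valued, which is what the later genre split relies on.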

I use the TMDB Ratings to come up with our Top Movies Chart. I will use IMDB's weighted rating formula to construct my chart. Mathematically, it is represented as follows:

$$\text{Weighted Rating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$$
where,

v is the number of votes for the movie
m is the minimum votes required to be listed in the chart
R is the average rating of the movie
C is the mean vote average across the whole report
The next step is to determine an appropriate value for m, the minimum votes required to be listed in the chart. We will use 95th percentile as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [5]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [6]:
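# v: vote counts for rows where the count is present
v=md[md['vote_count'].notnull()]['vote_count'].astype('int')
# R: average ratings; note that the int cast truncates them (e.g. 7.7 becomes 7),
# which lowers the weighted ratings computed from R
R=md[md['vote_average'].notnull()]['vote_average'].astype('int')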
C=md['vote_average'].mean()
C

5.618207215133889

In [7]:
m=v.quantile(0.95)
m

434.0

In [8]:
C

5.618207215133889

In [9]:

rating= (v/(v+m)*R)+(m/(v+m)*C)

In [10]:
md['Weighted_Rating']=rating

In [11]:
rating

0        6.897470
1        5.941799
2        5.684985
3        5.645944
4        5.442013
           ...   
45461    5.614487
45462    5.641423
45463    5.582504
45464    5.618207
45465    5.618207
Length: 45460, dtype: float64
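
As a sanity check on the formula, the first row of this output (Toy Story, 5415 votes, average 7.7 in the metadata) can be reproduced by hand with the m and C printed above. The function name `weighted_rating` is my own; the sketch also suggests that the printed 6.897470 corresponds to a truncated average of 7 rather than 7.7, which I take to be an effect of the integer cast applied to R.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: pulls R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C


m = 434.0                  # 95th percentile of vote counts, as printed above
C = 5.618207215133889      # mean vote average, as printed above

wr_truncated = weighted_rating(5415, 7, m, C)    # average after an int cast
wr_full = weighted_rating(5415, 7.7, m, C)       # average kept as a float

print(round(wr_truncated, 6))
print(round(wr_full, 6))
```

The gap between the two values shows how much the truncation matters for a heavily voted film; for a film with few votes the result stays close to C either way.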

So let's see which movies come out on top based on Weighted Rating.

In [12]:
md['release_date']

0        1995-10-30
1        1995-12-15
2        1995-12-22
3        1995-12-22
4        1995-02-10
            ...    
45461           NaN
45462    2011-11-17
45463    2003-08-01
45464    1917-10-21
45465    2017-06-09
Name: release_date, Length: 45466, dtype: object

In [13]:
# NaN never compares equal to anything, so use pd.notnull to detect missing dates
md['year']=md['release_date'].apply(lambda x:str(x).split('-')[0] if pd.notnull(x) else np.nan)
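
A short aside on missing dates: NaN compares unequal to everything, including itself, so a test of the form `x != np.nan` is always true and cannot be used to detect missing values. The sketch below, with a hypothetical helper `extract_year` and made-up sample dates, shows the behaviour and a check based on `pd.notnull` instead.

```python
import numpy as np
import pandas as pd

value = np.nan
print(value != np.nan)   # True: NaN never compares equal, even to itself
print(pd.isnull(value))  # True: the reliable way to detect a missing value


def extract_year(date_str):
    """Return the year part of a 'YYYY-MM-DD' string, or NaN if the date is missing."""
    return str(date_str).split('-')[0] if pd.notnull(date_str) else np.nan


dates = pd.Series(['1995-10-30', np.nan, '2017-06-09'])
years = dates.apply(extract_year)
print(years.tolist())
```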

In [14]:
qualified=md[(md['vote_count']>m)&(md['vote_average'].notnull())&(md['vote_count'].notnull())][['title','Weighted_Rating','genres','vote_count','vote_average','year']].sort_values(by='Weighted_Rating',ascending=False)

In [15]:
qualified['vote_count']=qualified['vote_count'].astype('int')
qualified['vote_average']=qualified['vote_average'].astype('int')

In [16]:
qualified.shape

(2268, 6)

Therefore, to be considered for the chart, a movie has to have more than 434 votes on TMDB. We also see that the average rating for a movie on TMDB is 5.618 on a scale of 10. 2268 movies qualify to be on our chart.

In [17]:
qualified.head(10)

Unnamed: 0,title,Weighted_Rating,genres,vote_count,vote_average,year
15480,Inception,7.928755,"[Action, Thriller, Science Fiction, Mystery, A...",14075,8,2010
12481,The Dark Knight,7.918626,"[Drama, Action, Crime, Thriller]",12269,8,2008
22879,Interstellar,7.911049,"[Adventure, Drama, Science Fiction]",11187,8,2014
2843,Fight Club,7.897775,[Drama],9678,8,1999
4863,The Lord of the Rings: The Fellowship of the Ring,7.88916,"[Adventure, Fantasy, Action]",8892,8,2001
292,Pulp Fiction,7.886457,"[Thriller, Crime]",8670,8,1994
314,The Shawshank Redemption,7.882427,"[Drama, Crime]",8358,8,1994
7000,The Lord of the Rings: The Return of the King,7.880635,"[Adventure, Fantasy, Action]",8226,8,2003
351,Forrest Gump,7.879536,"[Comedy, Drama, Romance]",8147,8,1994
5814,The Lord of the Rings: The Two Towers,7.871988,"[Adventure, Fantasy, Action]",7641,8,2002


We see that three Christopher Nolan Films, Inception, The Dark Knight and Interstellar occur at the very top of our chart. The chart also indicates a strong bias of TMDB Users towards particular genres and directors.

Let us now construct our function that builds charts for particular genres. For this, we will relax our default cutoff to the 85th percentile instead of the 95th.

In [18]:
m=v.quantile(0.85)
m

82.0

In [19]:
s=md.apply(lambda x:pd.Series(x['genres']),axis=1).stack().reset_index(level=1,drop=True)
s.name='genre'
gen_md=md.drop('genres',axis=1).join(s)

In [20]:
gen_md['genre']

0        Animation
0           Comedy
0           Family
1        Adventure
1          Fantasy
           ...    
45463       Action
45463        Drama
45463     Thriller
45464          NaN
45465          NaN
Name: genre, Length: 93548, dtype: object
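
The same one-row-per-genre reshaping can also be sketched with `DataFrame.explode`, which I believe is available from pandas 0.25 onwards. This is an alternative to the apply/stack/join approach used above, shown on toy data of my own, not the notebook's method.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story', 'Heat', 'Untitled'],
    'genres': [['Animation', 'Comedy'], ['Action'], []],
})

# explode() repeats each row once per list element and keeps the original index;
# rows with an empty list are kept with NaN in the exploded column.
gen = movies.explode('genres').rename(columns={'genres': 'genre'})
print(gen)
```

As in the output above, a movie with no genres survives as a single row whose genre is NaN, and the repeated index values tell you which original row each genre came from.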

In [21]:
def genre(gen):
    qualified=gen_md[(gen_md['vote_count']>m)&(gen_md['vote_average'].notnull())&(gen_md['vote_count'].notnull())&(gen_md['genre']==gen)][['title','Weighted_Rating','genre','vote_count','vote_average','year']].sort_values(by='Weighted_Rating',ascending=False)
    return qualified.head(10)
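
One thing to keep in mind is that the function above filters on the new m but reuses the Weighted_Rating column that was computed earlier with the 95th percentile cutoff. A possible variant, sketched here on toy data, recomputes m, C and the weighted rating inside each genre. The names `build_chart`, `percentile`, `top_n` and `wr` are hypothetical, and this is only one way to do it.

```python
import pandas as pd


def build_chart(df, genre_name, percentile=0.85, top_n=10):
    """Hypothetical variant: recompute m, C and the weighted rating within one genre."""
    sub = df[(df['genre'] == genre_name)
             & df['vote_count'].notnull()
             & df['vote_average'].notnull()]
    C = sub['vote_average'].mean()                 # mean rating within the genre
    m = sub['vote_count'].quantile(percentile)     # vote cutoff within the genre
    chart = sub[sub['vote_count'] >= m].copy()
    v, R = chart['vote_count'], chart['vote_average']
    chart['wr'] = (v / (v + m)) * R + (m / (v + m)) * C
    return chart.sort_values('wr', ascending=False).head(top_n)


# Toy data (illustrative values only).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Comedy', 'Comedy', 'Comedy', 'Drama'],
    'vote_count': [1000, 500, 5, 800],
    'vote_average': [8.0, 7.0, 9.5, 6.0],
})
chart = build_chart(toy, 'Comedy', percentile=0.5)
print(chart[['title', 'wr']])
```

In the toy data, movie C has the highest raw average but only 5 votes, so it falls below the cutoff and does not appear in the chart.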

In [22]:
genre('Animation')

Unnamed: 0,title,Weighted_Rating,genre,vote_count,vote_average,year
359,The Lion King,7.826386,Animation,5520.0,8.0,1994
5481,Spirited Away,7.765175,Animation,3968.0,8.3,2001
9698,Howl's Moving Castle,7.58369,Animation,2049.0,8.2,2004
2884,Princess Mononoke,7.582344,Animation,2041.0,8.2,1997
5833,My Neighbor Totoro,7.522321,Animation,1730.0,8.0,1988
40251,Your Name.,7.293922,Animation,1030.0,8.5,2016
5553,Grave of the Fireflies,7.265839,Animation,974.0,8.2,1988
19901,Paperman,7.114985,Animation,734.0,8.0,2012
13724,Up,6.919848,Animation,7048.0,7.8,2009
30315,Inside Out,6.916372,Animation,6737.0,7.9,2015


So the top animated movies are The Lion King, Spirited Away and so on. Let's check other categories.

In [23]:
genre('Romance').head(15)

Unnamed: 0,title,Weighted_Rating,genre,vote_count,vote_average,year
351,Forrest Gump,7.879536,Romance,8147.0,8.2,1994
10309,Dilwale Dulhania Le Jayenge,7.659636,Romance,661.0,9.1,1995
876,Vertigo,7.35232,Romance,1162.0,8.0,1958
40251,Your Name.,7.293922,Romance,1030.0,8.5,2016
883,Some Like It Hot,7.185423,Romance,835.0,8.0,1959
1132,Cinema Paradiso,7.184781,Romance,834.0,8.2,1988
19901,Paperman,7.114985,Romance,734.0,8.0,2012
37863,Sing Street,7.06283,Romance,669.0,8.0,2016
1639,Titanic,6.926902,Romance,7770.0,7.5,1997
882,The Apartment,6.890882,Romance,498.0,8.1,1960


The top romance movie according to our metrics is Forrest Gump, followed by Bollywood's Dilwale Dulhania Le Jayenge. This Shahrukh Khan starrer also happens to be one of my personal favorites.

In [24]:
genre('Comedy').head(15)

Unnamed: 0,title,Weighted_Rating,genre,vote_count,vote_average,year
351,Forrest Gump,7.879536,Comedy,8147.0,8.2,1994
1225,Back to the Future,7.845092,Comedy,6239.0,8.0,1985
18465,The Intouchables,7.823118,Comedy,5410.0,8.2,2011
22841,The Grand Budapest Hotel,7.796436,Comedy,4644.0,8.0,2014
2211,Life Is Beautiful,7.746456,Comedy,3643.0,8.3,1997
10309,Dilwale Dulhania Le Jayenge,7.659636,Comedy,661.0,9.1,1995
732,Dr. Strangelove or: How I Learned to Stop Worr...,7.457661,Comedy,1472.0,8.0,1964
3342,Modern Times,7.213918,Comedy,881.0,8.1,1936
883,Some Like It Hot,7.185423,Comedy,835.0,8.0,1959
1236,The Great Dictator,7.131346,Comedy,756.0,8.1,1940


# Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves Dilwale Dulhania Le Jayenge, My Name is Khan and Kabhi Khushi Kabhi Gham. One inference we can obtain is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as Content Based Filtering.

I will build two Content Based Recommenders based on:

Movie Overviews and Taglines

Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.

In [25]:
links_small = pd.read_csv('links_small.csv')

In [26]:
links_small

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9120,162672,3859980,402672.0
9121,163056,4262980,315011.0
9122,163949,2531318,391698.0
9123,164977,27660,137608.0


In [27]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'Weighted_Rating', 'year'],
      dtype='object')

In [28]:
md['id']

0           862
1          8844
2         15602
3         31357
4         11862
          ...  
45461    439050
45462    111109
45463     67758
45464    227506
45465    461257
Name: id, Length: 45466, dtype: object

In [29]:
#md = md.drop([19730, 29503, 35587])

In [30]:
md[md['id']=='1997-08-20']
md=md.drop([19730])

In [31]:
md[md['id']=='2012-09-29']
md=md.drop([29503])

In [32]:
md[md['id']=='2014-01-01']
md=md.drop([35587])

In [33]:
md['id']=md['id'].astype('int')
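
The three rows dropped above were located by hand (their id field holds a date string). A more general sketch, assuming the only problem is ids that are not numeric, is to let `pd.to_numeric` flag them with `errors='coerce'`. The toy frame below is illustrative.

```python
import pandas as pd

# Toy frame with one malformed id, similar to the rows dropped above.
ids = pd.DataFrame({'id': ['862', '8844', '1997-08-20', '15602']})

# Values that cannot be parsed as numbers become NaN instead of raising an error.
numeric_id = pd.to_numeric(ids['id'], errors='coerce')
valid = numeric_id.notnull()

clean = ids[valid].copy()
clean['id'] = numeric_id[valid].astype('int')
print(clean)
```

This avoids hard-coding row labels such as 19730, which would silently point at the wrong rows if the file were re-sorted or updated.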

In [34]:
links_small.describe()

Unnamed: 0,movieId,imdbId,tmdbId
count,9125.0,9125.0,9112.0
mean,31123.291836,479824.4,39104.545544
std,40782.633604,743177.4,62814.519801
min,1.0,417.0,2.0
25%,2850.0,88846.0,9451.75
50%,6290.0,119778.0,15852.0
75%,56274.0,428441.0,39160.5
max,164979.0,5794766.0,416437.0


In [35]:
links_small=links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype(int)

In [36]:
links_small

0          862
1         8844
2        15602
3        31357
4        11862
         ...  
9120    402672
9121    315011
9122    391698
9123    137608
9124    410803
Name: tmdbId, Length: 9112, dtype: int32

In [37]:
smd=md[md['id'].isin(links_small)]

In [38]:
smd.columns,smd.shape

(Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
        'imdb_id', 'original_language', 'original_title', 'overview',
        'popularity', 'poster_path', 'production_companies',
        'production_countries', 'release_date', 'revenue', 'runtime',
        'spoken_languages', 'status', 'tagline', 'title', 'video',
        'vote_average', 'vote_count', 'Weighted_Rating', 'year'],
       dtype='object'), (9099, 26))

We have 9099 movies available in our small movies metadata dataset, which is about five times smaller than our original dataset of roughly 45,000 movies.

# Movie Description Based Recommender

In [39]:
smd['overview']=smd['overview'].fillna('')
smd['tagline']=smd['tagline'].fillna('')
smd['description']=smd['overview']+smd['tagline']


In [40]:
smd['description']

0        Led by Woody, Andy's toys live happily in his ...
1        When siblings Judy and Peter discover an encha...
2        A family wedding reignites the ancient feud be...
3        Cheated on, mistreated and stepped on, the wom...
4        Just when George Banks has recovered from his ...
                               ...                        
40224    From the mind behind Evangelion comes a hit la...
40503    The band stormed Europe in 1963, and, in 1964,...
44821    When Molly Hale's sadness of her father's disa...
44826    All your favorite Pokémon characters are back,...
45265    While holidaying in the French Alps, a Swedish...
Name: description, Length: 9099, dtype: object

In [41]:
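# min_df=1 keeps every term, the same effect as min_df=0, which newer scikit-learn releases may reject
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,2),min_df=1,stop_words='english')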
tfidf_matrix=tf.fit_transform(smd['description'])

In [42]:
tfidf_matrix.shape

(9099, 268124)

In [43]:
tfidf_matrix

<9099x268124 sparse matrix of type '<class 'numpy.float64'>'
	with 540591 stored elements in Compressed Sparse Row format>
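
To see why the matrix has so many columns, here is a small sketch of what `ngram_range=(1,2)` does on two made-up sentences: every remaining word and every pair of adjacent remaining words becomes its own feature. The sentences are my own examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "A toy cowboy meets a space ranger",
    "A space ranger fights an evil emperor",
]

# Unigrams and bigrams, with English stop words removed before n-grams are formed.
vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
matrix = vec.fit_transform(docs)

print(matrix.shape)
print(sorted(vec.vocabulary_))
```

Because bigrams are built after stop-word removal, the number of features grows much faster than the number of distinct words, which is consistent with the very wide and very sparse matrix above.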

Cosine Similarity

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

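$$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$$

Since the TF-IDF Vectorizer L2-normalizes each row by default, every description vector has unit length, so the dot product of two rows is already their cosine similarity score. Therefore, we will use sklearn's linear_kernel instead of cosine_similarity, since it skips the normalization step and is faster.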

In [44]:
cosine_sim=linear_kernel(tfidf_matrix,tfidf_matrix)

In [45]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])
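
The claim that the dot product equals the cosine similarity for TF-IDF rows can be checked on a toy corpus. The three sentences below are made up; the sketch relies on `TfidfVectorizer` using `norm='l2'` by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "batman fights crime in gotham",
    "batman returns to gotham",
    "a family wedding reignites a feud",
]
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Each row has unit L2 norm because of the default norm='l2'.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

dot = linear_kernel(tfidf, tfidf)        # plain dot products
cos = cosine_similarity(tfidf, tfidf)    # normalizes first, then dot products

print(row_norms)
print(np.allclose(dot, cos))
```

The two Batman sentences share words and get a positive score, while the third sentence shares none with the first and scores zero.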

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [46]:
smd.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'Weighted_Rating', 'year', 'description'],
      dtype='object')

In [47]:
smd=smd.reset_index()
titles=smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [48]:
indices

title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
Shin Godzilla                                         9094
The Beatles: Eight Days a Week - The Touring Years    9095
Pokémon: Spell of the Unknown                         9096
Pokémon 4Ever: Celebi - Voice of the Forest           9097
Force Majeure                                         9098
Length: 9099, dtype: int64

In [49]:
def get_recommendation(title):
    idx=indices[title]
    sim_score=list(enumerate(cosine_sim[idx]))
    sim_score=sorted(sim_score,key=lambda x:x[1],reverse=True)
    sim_score=sim_score[1:31]
    movie_indices=[i[0] for i in sim_score]
    return titles.iloc[movie_indices]
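
The function above drops the first entry of the sorted list on the assumption that the movie itself always ranks first. A slightly more defensive sketch, on a toy similarity matrix of my own, excludes the query index explicitly. The name `top_similar` and the parameter `k` are hypothetical.

```python
import numpy as np
import pandas as pd

titles = pd.Series(['A', 'B', 'C', 'D'])
sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])
indices = pd.Series(titles.index, index=titles)


def top_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    idx = indices[title]
    order = np.argsort(-sim[idx])                   # positions by descending similarity
    order = [i for i in order if i != idx][:k]      # drop the query movie explicitly
    return titles.iloc[order]


print(top_similar('A').tolist())
```

Excluding by index rather than by position still works when another movie has an identical description and ties with the query at a similarity of 1.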

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [50]:
get_recommendation('The Godfather').head(10)


973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

In [51]:
get_recommendation('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

We see that for The Dark Knight, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked The Dark Knight probably likes it more because of Nolan and would hate Batman Forever and every other substandard movie in the Batman franchise.

Therefore, we are going to use much more suggestive metadata than Overview and Tagline. In the next subsection, we will build a more sophisticated recommender that takes genre, keywords, cast and crew into consideration.

Metadata Based Recommender

To build our standard metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.

In [52]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [53]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [54]:
credits['crew'][0]

'[{\'credit_id\': \'52fe4284c3a36847f8024f49\', \'department\': \'Directing\', \'gender\': 2, \'id\': 7879, \'job\': \'Director\', \'name\': \'John Lasseter\', \'profile_path\': \'/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f4f\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12891, \'job\': \'Screenplay\', \'name\': \'Joss Whedon\', \'profile_path\': \'/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f55\', \'department\': \'Writing\', \'gender\': 2, \'id\': 7, \'job\': \'Screenplay\', \'name\': \'Andrew Stanton\', \'profile_path\': \'/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f5b\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12892, \'job\': \'Screenplay\', \'name\': \'Joel Cohen\', \'profile_path\': \'/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f61\', \'department\': \'Writing\', \'gender\': 0, \'id\': 12893, \'job\': \'Screenplay\', \'name\': \'A

In [55]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [56]:
keywords['keywords'][0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [57]:
credits['cast'][0]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

Crew: From the crew, we will only pick the director as our feature since the others don't contribute that much to the feel of the movie.

Cast: Choosing the cast is a little trickier. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we should select only the major characters and their respective actors. Somewhat arbitrarily, we will choose the top 3 actors that appear in the credits list.

In [58]:
credits['cast']=credits['cast'].apply(literal_eval).apply(lambda x:[i['name'] for i in x] if isinstance (x,list) else [])

In [59]:
credits['cast']

0        [Tom Hanks, Tim Allen, Don Rickles, Jim Varney...
1        [Robin Williams, Jonathan Hyde, Kirsten Dunst,...
2        [Walter Matthau, Jack Lemmon, Ann-Margret, Sop...
3        [Whitney Houston, Angela Bassett, Loretta Devi...
4        [Steve Martin, Diane Keaton, Martin Short, Kim...
                               ...                        
45471          [Leila Hatami, Kourosh Tahami, Elham Korda]
45472    [Angel Aquino, Perry Dizon, Hazel Orencio, Joe...
45473    [Erika Eleniak, Adam Baldwin, Julie du Page, J...
45474    [Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...
45475                                                   []
Name: cast, Length: 45476, dtype: object

In [60]:
credits['cast'][0]

['Tom Hanks',
 'Tim Allen',
 'Don Rickles',
 'Jim Varney',
 'Wallace Shawn',
 'John Ratzenberger',
 'Annie Potts',
 'John Morris',
 'Erik von Detten',
 'Laurie Metcalf',
 'R. Lee Ermey',
 'Sarah Freeman',
 'Penn Jillette']

In [61]:
credits['cast']=credits['cast'].apply(lambda x:x[:3])

In [62]:
credits['cast']

0                      [Tom Hanks, Tim Allen, Don Rickles]
1           [Robin Williams, Jonathan Hyde, Kirsten Dunst]
2               [Walter Matthau, Jack Lemmon, Ann-Margret]
3        [Whitney Houston, Angela Bassett, Loretta Devine]
4               [Steve Martin, Diane Keaton, Martin Short]
                               ...                        
45471          [Leila Hatami, Kourosh Tahami, Elham Korda]
45472           [Angel Aquino, Perry Dizon, Hazel Orencio]
45473         [Erika Eleniak, Adam Baldwin, Julie du Page]
45474    [Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...
45475                                                   []
Name: cast, Length: 45476, dtype: object

In [63]:
credits['crew'][0]

'[{\'credit_id\': \'52fe4284c3a36847f8024f49\', \'department\': \'Directing\', \'gender\': 2, \'id\': 7879, \'job\': \'Director\', \'name\': \'John Lasseter\', \'profile_path\': \'/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f4f\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12891, \'job\': \'Screenplay\', \'name\': \'Joss Whedon\', \'profile_path\': \'/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f55\', \'department\': \'Writing\', \'gender\': 2, \'id\': 7, \'job\': \'Screenplay\', \'name\': \'Andrew Stanton\', \'profile_path\': \'/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f5b\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12892, \'job\': \'Screenplay\', \'name\': \'Joel Cohen\', \'profile_path\': \'/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f61\', \'department\': \'Writing\', \'gender\': 0, \'id\': 12893, \'job\': \'Screenplay\', \'name\': \'A

In [64]:
def get_director(x):
    for i in x:
        if i['job']=='Director':
            return i['name']
    return np.nan
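
The same lookup can be sketched a little more generally so that any crew role can be extracted. The helper name `first_with_job` is hypothetical, and the crew string is a shortened version of the Toy Story entry shown earlier, reduced to the two fields that matter here.

```python
from ast import literal_eval

import numpy as np


def first_with_job(crew, job='Director'):
    """Return the name of the first crew member with the given job, or NaN if none."""
    return next((member['name'] for member in crew if member.get('job') == job), np.nan)


# Shortened crew entry (only 'job' and 'name' kept for illustration).
crew_str = ("[{'job': 'Director', 'name': 'John Lasseter'}, "
            "{'job': 'Screenplay', 'name': 'Joss Whedon'}]")
crew = literal_eval(crew_str)

print(first_with_job(crew))
print(first_with_job(crew, job='Screenplay'))
print(first_with_job([]))
```

Using `member.get('job')` instead of `member['job']` means an entry without a job field is skipped rather than raising a KeyError.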



In [65]:
credits['director']=credits['crew'].apply(literal_eval).apply(get_director)

In [66]:
md=md.merge(credits,on='id')
md=md.merge(keywords,on='id')

In [67]:
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,Weighted_Rating,year,cast,crew,director,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,6.89747,1995,"[Tom Hanks, Tim Allen, Don Rickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",John Lasseter,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,5.941799,1995,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",Joe Johnston,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,5.684985,1995,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...",Howard Deutch,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,5.645944,1995,"[Whitney Houston, Angela Bassett, Loretta Devine]","[{'credit_id': '52fe44779251416c91011acb', 'de...",Forest Whitaker,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,5.442013,1995,"[Steve Martin, Diane Keaton, Martin Short]","[{'credit_id': '52fe44959251416c75039ed7', 'de...",Charles Shyer,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [68]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'Weighted_Rating', 'year', 'cast', 'crew',
       'director', 'keywords'],
      dtype='object')

In [69]:
md.shape

(46628, 30)

In [70]:
# Take an explicit copy so that later column assignments on smd do not act on a view of md
smd=md[md['id'].isin(links_small)].copy()

In [71]:
smd.shape

(9219, 30)

In [72]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,Weighted_Rating,year,cast,crew,director,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,6.89747,1995,"[Tom Hanks, Tim Allen, Don Rickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",John Lasseter,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,5.941799,1995,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",Joe Johnston,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,5.684985,1995,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...",Howard Deutch,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,5.645944,1995,"[Whitney Houston, Angela Bassett, Loretta Devine]","[{'credit_id': '52fe44779251416c91011acb', 'de...",Forest Whitaker,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,5.442013,1995,"[Steve Martin, Diane Keaton, Martin Short]","[{'credit_id': '52fe44959251416c75039ed7', 'de...",Charles Shyer,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [73]:
smd['keywords'][0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [74]:
smd['keywords']=smd['keywords'].apply(literal_eval).apply(lambda x:[i['name'] for i in x] if isinstance (x,list) else [])


In [75]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,Weighted_Rating,year,cast,crew,director,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,6.89747,1995,"[Tom Hanks, Tim Allen, Don Rickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",John Lasseter,"[jealousy, toy, boy, friendship, friends, riva..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,5.941799,1995,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",Joe Johnston,"[board game, disappearance, based on children'..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,5.684985,1995,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...",Howard Deutch,"[fishing, best friend, duringcreditsstinger, o..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,5.645944,1995,"[Whitney Houston, Angela Bassett, Loretta Devine]","[{'credit_id': '52fe44779251416c91011acb', 'de...",Forest Whitaker,"[based on novel, interracial relationship, sin..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,5.442013,1995,"[Steve Martin, Diane Keaton, Martin Short]","[{'credit_id': '52fe44959251416c75039ed7', 'de...",Charles Shyer,"[baby, midlife crisis, confidence, aging, daug..."


In [76]:
smd.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'Weighted_Rating', 'year', 'cast', 'crew',
       'director', 'keywords'],
      dtype='object')

My approach to building this recommender is admittedly hacky. I create a metadata dump for every movie, consisting of its genres, director, main actors and keywords. I then use a CountVectorizer to build the count matrix, as we did in the Description Recommender. The remaining steps are the same as before: we calculate the cosine similarities and return the movies that are most similar.

These are the steps I follow in preparing the genres and credits data:

Strip spaces and convert all features to lowercase. This way, our engine will not confuse Johnny Depp with Johnny Galecki.
Mention the director 3 times to give it more weight relative to the entire cast.
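The two preparation steps above can be illustrated with a small, self-contained sketch. It uses plain Python (collections.Counter and a hand-written cosine function) as a stand-in for CountVectorizer and cosine_similarity. The helper names normalize, build_soup and cosine, and the director_weight parameter, are assumptions made for illustration and are not part of the notebook.

```python
from collections import Counter
from math import sqrt


def normalize(name):
    """Lowercase a name and strip its spaces: 'Johnny Depp' -> 'johnnydepp'."""
    return name.replace(" ", "").lower()


def build_soup(keywords, cast, director, genres, director_weight=3):
    """Join all metadata into one space-separated string.

    The director token is repeated director_weight times so that it
    counts for more than any single cast member.
    """
    tokens = [normalize(k) for k in keywords]
    tokens += [normalize(c) for c in cast]
    tokens += [normalize(director)] * director_weight
    tokens += [normalize(g) for g in genres]
    return " ".join(tokens)


def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Without normalization, the shared first name creates a spurious match.
raw_similarity = cosine("Johnny Depp".lower(), "Johnny Galecki".lower())
# With spaces stripped, each actor becomes a single distinct token.
clean_similarity = cosine(normalize("Johnny Depp"), normalize("Johnny Galecki"))

soup = build_soup(["toy", "friendship"], ["Tom Hanks"], "John Lasseter", ["Animation"])
```

In this toy setting the raw names overlap on the token "johnny", while the normalized names share nothing. Changing director_weight is one simple way to experiment with how strongly the director influences the similarity score.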

In [77]:
#smd['director']=smd['director'].astype(str)

In [78]:
smd['director']

0             John Lasseter
1              Joe Johnston
2             Howard Deutch
3           Forest Whitaker
4             Charles Shyer
                ...        
40952        Gregg Champion
41172     Tinu Suresh Desai
41225    Ashutosh Gowariker
41391          Hideaki Anno
41669            Ron Howard
Name: director, Length: 9219, dtype: object

In [79]:
smd['director']=smd['director'].astype(str).apply(lambda x: str.lower(x.replace(" ","")))
smd['director']=smd['director'].apply(lambda x:[x,x,x])
#smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))

In [80]:
smd['director']

0               [johnlasseter, johnlasseter, johnlasseter]
1                  [joejohnston, joejohnston, joejohnston]
2               [howarddeutch, howarddeutch, howarddeutch]
3         [forestwhitaker, forestwhitaker, forestwhitaker]
4               [charlesshyer, charlesshyer, charlesshyer]
                               ...                        
40952        [greggchampion, greggchampion, greggchampion]
41172    [tinusureshdesai, tinusureshdesai, tinusureshd...
41225    [ashutoshgowariker, ashutoshgowariker, ashutos...
41391              [hideakianno, hideakianno, hideakianno]
41669                    [ronhoward, ronhoward, ronhoward]
Name: director, Length: 9219, dtype: object

In [81]:
#smd['cast']=smd['cast']

In [82]:
smd['cast']=smd['cast'].apply(lambda x:[str.lower(i.replace(" ",""))for i in x])

In [83]:
smd['cast']

0                         [tomhanks, timallen, donrickles]
1              [robinwilliams, jonathanhyde, kirstendunst]
2                 [waltermatthau, jacklemmon, ann-margret]
3           [whitneyhouston, angelabassett, lorettadevine]
4                  [stevemartin, dianekeaton, martinshort]
                               ...                        
40952          [sidneypoitier, wendycrewson, jayo.sanders]
41172               [akshaykumar, ileanad'cruz, eshagupta]
41225               [hrithikroshan, poojahegde, kabirbedi]
41391    [hirokihasegawa, yutakatakenouchi, satomiishih...
41669              [paulmccartney, ringostarr, johnlennon]
Name: cast, Length: 9219, dtype: object

In [84]:
smd=smd.drop("crew",axis=1)

In [85]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,Weighted_Rating,year,cast,director,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,6.89747,1995,"[tomhanks, timallen, donrickles]","[johnlasseter, johnlasseter, johnlasseter]","[jealousy, toy, boy, friendship, friends, riva..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,5.941799,1995,"[robinwilliams, jonathanhyde, kirstendunst]","[joejohnston, joejohnston, joejohnston]","[board game, disappearance, based on children'..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,5.684985,1995,"[waltermatthau, jacklemmon, ann-margret]","[howarddeutch, howarddeutch, howarddeutch]","[fishing, best friend, duringcreditsstinger, o..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,5.645944,1995,"[whitneyhouston, angelabassett, lorettadevine]","[forestwhitaker, forestwhitaker, forestwhitaker]","[based on novel, interracial relationship, sin..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.442013,1995,"[stevemartin, dianekeaton, martinshort]","[charlesshyer, charlesshyer, charlesshyer]","[baby, midlife crisis, confidence, aging, daug..."


Keywords

We will do a small amount of pre-processing on our keywords before putting them to any use. As a first step, we calculate the frequency count of every keyword that appears in the dataset.

In [86]:
s=smd.apply(lambda x:pd.Series(x['keywords']),axis=1).stack().reset_index(level=1,drop=True)

In [87]:
s.name='Keyword'

In [88]:
s

0           jealousy
0                toy
0                boy
0         friendship
0            friends
            ...     
41391    destruction
41391          kaiju
41391          toyko
41669          music
41669    documentary
Name: Keyword, Length: 64407, dtype: object

In [89]:
s=s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: Keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 610. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as Dogs and Dog are considered the same.
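The pruning rule is easy to see on a tiny example. The sketch below uses collections.Counter instead of pandas and made-up keyword lists, so the variable names and data are illustrative assumptions only. A keyword that appears in a single movie can never match a keyword in another movie, which is why dropping it loses nothing for similarity purposes.

```python
from collections import Counter

# Toy keyword lists, one per movie (illustrative data only).
movie_keywords = [
    ["toy", "friendship", "rivalry"],
    ["friendship", "board game"],
    ["toy", "board game", "fishing"],
]

# Count how many times each keyword appears across all movies.
counts = Counter(k for keywords in movie_keywords for k in keywords)

# Keep only keywords that occur more than once.
frequent = {k for k, n in counts.items() if n > 1}

# Drop the rare keywords from every movie, preserving the original order.
filtered = [[k for k in keywords if k in frequent] for keywords in movie_keywords]
```

Here "rivalry" and "fishing" are removed because each occurs only once, while the remaining keywords are kept in their original order.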

In [90]:
s = s[s > 1]

In [91]:
s

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
                       ... 
fondling                  2
expatriate                2
racial injustice          2
core melt                 2
kissing                   2
Name: Keyword, Length: 6709, dtype: int64

In [92]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [93]:
def filter_keywords(x):
    words=[]
    for i in x:
        if i in s:
            words.append(i)
    return words

In [94]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [95]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,Weighted_Rating,year,cast,director,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,6.89747,1995,"[tomhanks, timallen, donrickles]","[johnlasseter, johnlasseter, johnlasseter]","[jealousi, toy, boy, friendship, friend, rival..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,5.941799,1995,"[robinwilliams, jonathanhyde, kirstendunst]","[joejohnston, joejohnston, joejohnston]","[boardgam, disappear, basedonchildren'sbook, n..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,5.684985,1995,"[waltermatthau, jacklemmon, ann-margret]","[howarddeutch, howarddeutch, howarddeutch]","[fish, bestfriend, duringcreditssting]"
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,5.645944,1995,"[whitneyhouston, angelabassett, lorettadevine]","[forestwhitaker, forestwhitaker, forestwhitaker]","[basedonnovel, interracialrelationship, single..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.442013,1995,"[stevemartin, dianekeaton, martinshort]","[charlesshyer, charlesshyer, charlesshyer]","[babi, midlifecrisi, confid, age, daughter, mo..."


In [96]:
smd.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'Weighted_Rating', 'year', 'cast',
       'director', 'keywords'],
      dtype='object')

In [97]:
#smd['keywords'].describe()
smd['cast']

0                         [tomhanks, timallen, donrickles]
1              [robinwilliams, jonathanhyde, kirstendunst]
2                 [waltermatthau, jacklemmon, ann-margret]
3           [whitneyhouston, angelabassett, lorettadevine]
4                  [stevemartin, dianekeaton, martinshort]
                               ...                        
40952          [sidneypoitier, wendycrewson, jayo.sanders]
41172               [akshaykumar, ileanad'cruz, eshagupta]
41225               [hrithikroshan, poojahegde, kabirbedi]
41391    [hirokihasegawa, yutakatakenouchi, satomiishih...
41669              [paulmccartney, ringostarr, johnlennon]
Name: cast, Length: 9219, dtype: object

In [98]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director']+ smd['genres']
smd['soup'][0]

['jealousi',
 'toy',
 'boy',
 'friendship',
 'friend',
 'rivalri',
 'boynextdoor',
 'newtoy',
 'toycomestolif',
 'tomhanks',
 'timallen',
 'donrickles',
 'johnlasseter',
 'johnlasseter',
 'johnlasseter',
 'Animation',
 'Comedy',
 'Family']

In [99]:
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))
smd['soup']

0        jealousi toy boy friendship friend rivalri boy...
1        boardgam disappear basedonchildren'sbook newho...
2        fish bestfriend duringcreditssting waltermatth...
3        basedonnovel interracialrelationship singlemot...
4        babi midlifecrisi confid age daughter motherda...
                               ...                        
40952    friendship sidneypoitier wendycrewson jayo.san...
41172    bollywood akshaykumar ileanad'cruz eshagupta t...
41225    bollywood hrithikroshan poojahegde kabirbedi a...
41391    monster godzilla giantmonst destruct kaiju hir...
41669    music documentari paulmccartney ringostarr joh...
Name: soup, Length: 9219, dtype: object

In [100]:
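# min_df=1 (the default) keeps every term, as min_df=0 did; some newer scikit-learn releases reject an integer min_df of 0
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')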
count_matrix = count.fit_transform(smd['soup'])

In [101]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [102]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the get_recommendation function that we wrote earlier. Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results. Let us check The Dark Knight again and see what recommendations we get this time around.

In [103]:
get_recommendation('The Dark Knight').head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

I am much more satisfied with the results I get this time around. The recommendations seem to have recognized other Christopher Nolan movies (due to the high weight given to the director) and placed them at the top. I enjoyed watching The Dark Knight as well as some of the other movies in the list, including Batman Begins, The Prestige and The Dark Knight Rises.

We can of course experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

Let me also get recommendations for another movie, Mean Girls.

In [104]:
get_recommendation('Mean Girls').head(10)

3319               Head Over Heels
4763                 Freaky Friday
1329              The House of Yes
6277              Just Like Heaven
7905         Mr. Popper's Penguins
7332    Ghosts of Girlfriends Past
6959     The Spiderwick Chronicles
8883                      The DUFF
6698         It's a Boy Girl Thing
7377       I Love You, Beth Cooper
Name: title, dtype: object

Popularity and Ratings
One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that Batman & Robin shares many characters with The Dark Knight, but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

I will take the top 25 movies based on similarity scores and calculate the vote count at the 60th percentile of that group. Then, using this as the value of m, we will calculate the weighted rating of each movie using IMDB's formula, as we did in the Simple Recommender section.
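As a reminder of how the formula behaves, here is a minimal sketch that takes m and C as explicit arguments. The function name imdb_weighted_rating and the numbers are illustrative assumptions; the arithmetic is the same as in the notebook's weighted_rating.

```python
def imdb_weighted_rating(v, R, m, C):
    """IMDB-style weighted rating.

    v: number of votes for the movie
    R: average rating of the movie
    m: minimum number of votes required to qualify
    C: mean rating across the candidate movies
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative numbers only: a heavily voted film stays close to its own
# average, while a barely qualifying one is pulled towards C.
popular = imdb_weighted_rating(v=10000, R=8.0, m=100, C=6.0)
obscure = imdb_weighted_rating(v=100, R=8.0, m=100, C=6.0)
```

With v equal to m, the result sits exactly halfway between R and C. As v grows, the result approaches R.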

In [105]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [106]:
def improved_recommendations(title):
    # Index of the requested movie in the similarity matrix
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Top 25 most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    # Note: weighted_rating resolves m and C in the global scope, so the local
    # C and m below only influence the vote_count filter, not the 'wr' column.
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    # Copy the filtered frame so the column assignments below do not act on a view
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [107]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.986204
8613,Interstellar,11187,8,2014,7.982669
6623,The Prestige,4510,8,2006,7.957468
3381,Memento,4168,8,2000,7.954045
8031,The Dark Knight Rises,9263,7,2012,6.987875
6218,Batman Begins,7511,7,2005,6.985077
1134,Batman Returns,1706,6,1992,5.98249
132,Batman Forever,1529,5,1995,5.031467
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.006972
1260,Batman & Robin,1447,4,1997,4.086784


Unfortunately, Batman & Robin does not disappear from our recommendation list. This is probably because it is rated a 4, which is only slightly below average on TMDB. It certainly doesn't deserve a 4 when amazing movies like The Dark Knight Rises have only a 7. However, there is not much we can do about this. Therefore, we will conclude our Content Based Recommender section here and come back to it when we build a hybrid engine.

# Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Therefore, in this section, we will use a technique called Collaborative Filtering to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.
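To make the idea concrete, here is a toy user-based sketch in plain Python. It is not what the Surprise library does internally; the ratings, the user names and the helper functions similarity and predict are all assumptions made for illustration. A missing rating is predicted as the similarity-weighted average of the ratings given by other users.

```python
from math import sqrt

# Toy ratings: user -> {movie: rating}. Illustrative data only.
ratings = {
    "me":    {"A": 5.0, "B": 1.0},
    "alice": {"A": 5.0, "B": 1.0, "C": 4.0},
    "bob":   {"A": 1.0, "B": 5.0, "C": 2.0},
}


def similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        w = similarity(user, other)
        num += w * their[movie]
        den += w
    return num / den if den else None


prediction = predict("me", "C")
```

Because "me" agrees with alice on movies A and B and disagrees with bob, the predicted rating for C lands closer to alice's 4 than to bob's 2.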

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the Surprise library, which uses powerful algorithms like Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) and give great recommendations.
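For reference, the two error measures used in the evaluation below can be written out in a few lines. This is a plain-Python sketch with made-up ratings; the function names rmse and mae are chosen here for illustration and are not the Surprise API.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equal-length rating lists."""
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(errors) / len(errors))


def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative ratings only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE and is never smaller than MAE on the same data.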

In [108]:
reader = Reader()    

In [109]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [110]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
#data.split(n_folds=5)

In [111]:
data

<surprise.dataset.DatasetAutoFolds at 0x20a42c2f048>

In [112]:
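from surprise.model_selection import cross_validate

svd = SVD()
# evaluate() was removed from newer Surprise releases; cross_validate() runs the k-fold evaluation instead
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)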