# Movies Recommender System

This is the second part of my Springboard Capstone Project on Movie Data Analysis and Recommendation Systems. In the first notebook, the story of film was narrated through an extensive exploratory data analysis of movie metadata collected from TMDB. Two minimal predictive models were also built to predict movie revenue and movie success, and to visualise which features influence each output.

In this notebook, a few recommendation algorithms (content based, popularity based and collaborative filtering) will be implemented, and an ensemble of these models will be built to produce the final recommendation system. There are two MovieLens datasets.

* **The Full Dataset:** Consists of 33,000,000 ratings and 2,000,000 tag applications applied to 86,000 movies by 330,975 users. Includes tag genome data with 14 million relevance scores across 1,100 tags.

* **The Small Dataset:** Comprises 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

The Simple Recommender will be built using movies from the *Full Dataset*, whereas all personalised recommender systems will use the *Small Dataset* (because the available computing power is very limited).

In [1]:
from ast import literal_eval
import warnings
import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate



In [2]:
warnings.simplefilter('ignore')

## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user. 

The implementation of this model is straightforward. All that has to be done is sort movies by rating and popularity and display the top of the list. As an added step, a genre argument can be passed in to get the top movies of a particular genre.

In [3]:
movies_df = pd.read_csv('data/movies_metadata.csv')
movies_df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,/lxD5ak7BOoinRNehOCA85CQ8ubr.jpg,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,1995-10-30,394400000,81,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Hang on for the comedy that goes to infinity a...,Toy Story,False,7.971,17351
1,False,/pYw10zrqfkdm3yD9JTO6vEGQhKy.jpg,"{'id': 495527, 'name': 'Jumanji Collection', '...",65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",http://www.sonypictures.com/movies/jumanji/,8844,tt0113497,en,Jumanji,...,1995-12-15,262821940,104,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Roll the dice and unleash the excitement!,Jumanji,False,7.239,9937
2,False,/1J4Z7VhdAgtdd97nCxY7dcBpjGT.jpg,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",25000000,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,...,1995-12-22,71500000,101,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.487,353
3,False,/jZjoEKXMTDoZAGdkjhAdJaKtXSN.jpg,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,...,1995-12-22,81452156,127,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.2,143
4,False,/lEsjVrGU21BeJjF5AF9EWsihDpw.jpg,"{'id': 96871, 'name': 'Father of the Bride (St...",0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10751, '...",,11862,tt0113041,en,Father of the Bride Part II,...,1995-12-08,76594107,106,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Just when his world is back to normal... he's ...,Father of the Bride Part II,False,6.234,672


In [4]:
movies_df['genres'] = movies_df['genres'].apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

TMDB ratings are used to build the **Top Movies Chart**, and the chart is ranked with IMDB's ***weighted rating*** formula. Mathematically, it is represented as follows:

Weighted Rating (WR) = $(\frac{v}{v + m} \cdot R) + (\frac{m}{v + m} \cdot C)$

Where,
* ***v*** is the number of votes for the movie
* ***m*** is the minimum votes required to be listed in the chart
* ***R*** is the average rating of the movie
* ***C*** is the mean vote across all movies in the dataset

The next step is to determine an appropriate value for ***m***, the minimum votes required to be listed in the chart. The **95th percentile** is used as the cutoff. In other words, for a movie to feature in the chart, it must have at least as many votes as 95% of the movies in the list.
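
The percentile cutoff can be illustrated on a toy list. The sketch below is plain Python rather than the notebook's pandas code. It assumes linear interpolation between neighbouring values, which is the default behaviour of pandas' `quantile`. The vote counts and the helper name `quantile` are hypothetical, and the 50th percentile is used only to keep the arithmetic easy to follow.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q      # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Hypothetical vote counts for six movies (not taken from the dataset).
toy_vote_counts = [10, 20, 50, 100, 400, 1000]

cutoff = quantile(toy_vote_counts, 0.5)
qualified = [v for v in toy_vote_counts if v >= cutoff]

print(cutoff)      # 75.0
print(qualified)   # [100, 400, 1000]
```

Raising the percentile shrinks the qualifying set, which is why a 95th percentile cutoff keeps only the most voted movies.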

An overall **Top 250 Chart** is built first, followed by a function that builds charts for a particular genre. Let's begin!

In [5]:
vote_averages = movies_df[movies_df['vote_average'].notnull()]['vote_average'].astype(int)
C = vote_averages.mean()
C

5.425442754971849

In [6]:
vote_counts = movies_df[movies_df['vote_count'].notnull()]['vote_count'].astype(int)
m = vote_counts.quantile(0.95)
m

888.0

In [7]:
# Keep the year part of each parsed date; missing or unparseable dates become the string 'NaT'.
movies_df['year'] = pd.to_datetime(movies_df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0])

In [8]:
qualified_df = movies_df[(movies_df['vote_count'].notnull()) & (movies_df['vote_count'] >= m) & (movies_df['vote_average'].notnull())][['title', 'genres', 'popularity', 'vote_average',  'vote_count', 'year']]
qualified_df['vote_count'] = qualified_df['vote_count'].astype('int')
qualified_df['vote_average'] = qualified_df['vote_average'].astype('int')
qualified_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4274 entries, 0 to 85330
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         4274 non-null   object 
 1   genres        4274 non-null   object 
 2   popularity    4274 non-null   float64
 3   vote_average  4274 non-null   int32  
 4   vote_count    4274 non-null   int32  
 5   year          4274 non-null   object 
dtypes: float64(1), int32(2), object(3)
memory usage: 200.3+ KB


Therefore, to qualify for the chart, a movie must have at least **888 votes** on TMDB. The average rating for a movie on TMDB is **5.4254** on a scale of 10 (computed on vote averages truncated to integers). **4274** movies qualify to be on this chart.

In [9]:
def weighted_rating(df, m, C):
    v = df['vote_count']
    R = df['vote_average']

    WR = (v/(v + m) * R) + (m/(v + m) * C)
    return WR
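
As a sanity check on the formula, the sketch below plugs in the values of `m` and `C` computed above for a movie with 34,845 votes and a truncated rating of 8, which are the figures the dataset reports for Inception. The second call uses a hypothetical movie with the same rating but exactly `m` votes, to show how strongly a low vote count pulls the score towards the mean. The function name `weighted_rating_value` is illustrative and not part of the notebook.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating for a single movie."""
    return (v / (v + m)) * R + (m / (v + m)) * C


m = 888.0                  # 95th-percentile vote count computed above
C = 5.425442754971849      # mean (truncated) vote average computed above

# A movie with 34,845 votes and an integer rating of 8.
wr_popular = weighted_rating_value(v=34845, R=8, m=m, C=C)

# Hypothetical movie with the same rating but exactly m votes.
wr_borderline = weighted_rating_value(v=888, R=8, m=m, C=C)

print(round(wr_popular, 5))      # close to 8: many votes, so R dominates
print(round(wr_borderline, 5))   # halfway between R and C
```

With exactly `m` votes the two terms get equal weight, so the score is the midpoint of `R` and `C`.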

In [10]:
qualified_df['weighted_rating'] = qualified_df.apply(weighted_rating, args=(m, C), axis=1)

In [11]:
top_250 = qualified_df.sort_values('weighted_rating', ascending=False).head(250)

### Top Movies

In [12]:
top_250.head(15)

Unnamed: 0,title,genres,popularity,vote_average,vote_count,year,weighted_rating
14848,Inception,"[Action, Science Fiction, Adventure]",118.271,8,34845,2010,7.93602
20995,Interstellar,"[Adventure, Drama, Science Fiction]",158.041,8,33043,2014,7.932622
12167,The Dark Knight,"[Drama, Action, Crime, Thriller]",94.471,8,31010,2008,7.928328
24844,Avengers: Infinity War,"[Adventure, Action, Science Fiction]",211.354,8,28055,2018,7.92101
2848,Fight Club,[Drama],73.879,8,27610,1999,7.919777
292,Pulp Fiction,"[Thriller, Crime]",78.343,8,26219,1994,7.91566
351,Forrest Gump,"[Comedy, Drama, Romance]",82.404,8,25752,1994,7.914181
314,The Shawshank Redemption,"[Drama, Crime]",259.23,8,25045,1994,7.911842
18866,Django Unchained,"[Drama, Western]",56.438,8,24935,2012,7.911466
24845,Avengers: Endgame,"[Adventure, Science Fiction, Action]",134.347,8,24152,2019,7.908698


Three Christopher Nolan Films, **Inception**, **Interstellar** and **The Dark Knight** occur at the very top of the chart. The chart also indicates a strong bias of TMDB Users towards particular genres and directors.

Next, a function is constructed that builds charts for particular genres. For this, the default cutoff is relaxed to the **85th** percentile instead of the 95th.

In [13]:
stack_genres = movies_df.apply(lambda x: pd.Series(x['genres']), axis=1).stack().reset_index(level=1, drop=True)
stack_genres.name = 'genre'

In [14]:
genres_df = movies_df.drop('genres', axis=1).join(stack_genres)
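
The two cells above turn each movie into one row per genre, so that a chart can be filtered by a single genre. The idea can be sketched in plain Python on hypothetical data (the titles and genres below are invented for illustration). One difference from the notebook: its join keeps a movie without genres as a row with a missing genre, while this sketch simply produces no row for it. Either way, such a movie never matches a genre filter.

```python
# Hypothetical miniature of movies_df: each movie carries a list of genres.
toy_movies = [
    {"title": "Movie A", "genres": ["Drama", "Romance"]},
    {"title": "Movie B", "genres": ["Action"]},
    {"title": "Movie C", "genres": []},          # no genres: produces no rows here
]

# One (title, genre) row per genre, similar to the stack-and-join result.
genre_rows = [
    {"title": movie["title"], "genre": genre}
    for movie in toy_movies
    for genre in movie["genres"]
]

romance_titles = [row["title"] for row in genre_rows if row["genre"] == "Romance"]

print(len(genre_rows))    # 3
print(romance_titles)     # ['Movie A']
```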

In [15]:
def build_chart(genre, percentile=0.85):
    df = genres_df[genres_df['genre'] == genre]

    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')

    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified_df = df[(df['vote_count'].notnull()) & (df['vote_count'] >= m) & (df['vote_average'].notnull())][['title', 'popularity', 'vote_average',  'vote_count', 'year']]
    qualified_df['vote_count'] = qualified_df['vote_count'].astype('int')
    qualified_df['vote_average'] = qualified_df['vote_average'].astype('int')

    qualified_df['weighted_rating'] = qualified_df.apply(weighted_rating, args=(m, C), axis=1)
    
    top_movies = qualified_df.sort_values('weighted_rating', ascending=False).head(250)

    return top_movies

Let's see the method in action by displaying the Top 15 Romance Movies (Romance almost didn't feature at all in the generic top chart despite being one of the most popular movie genres).

### Top Romance Movies

In [16]:
build_chart('Romance').head(15)

Unnamed: 0,title,popularity,vote_average,vote_count,year,weighted_rating
351,Forrest Gump,82.404,8,25752,1994,7.980267
7211,Eternal Sunshine of the Spotless Mind,53.919,8,13900,2004,7.963695
44531,Call Me by Your Name,38.573,8,11458,2017,7.956099
42046,Your Name.,81.862,8,10543,2016,7.952364
10359,Pride & Prejudice,53.092,8,7489,2005,7.933476
52232,"Love, Simon",32.447,8,5784,2018,7.914562
877,Vertigo,27.338,8,5330,1958,7.907564
58360,Five Feet Apart,35.339,8,5304,2019,7.907129
886,Casablanca,26.816,8,5013,1943,7.901957
10136,Dilwale Dulhania Le Jayenge,36.331,8,4293,1995,7.886292


The top romance movie according to these metrics is Hollywood's **Forrest Gump**. It is based on the 1986 novel of the same name by Winston Groom and stars Tom Hanks and Robin Wright.

## Content Based Recommender

The recommender built in the previous section suffers from some severe limitations. For one, it gives the same recommendation to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at the Top 15 Chart, she/he probably wouldn't like most of the movies. If she/he were to go one step further and look at the charts by genre, she/he still wouldn't be getting the best recommendations.

To personalise recommendations further, an engine will be built that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since movie metadata (or content) is used to build this engine, this is also known as **Content Based Filtering.**

Two Content Based Recommenders will be built, based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, only a subset of the available movies is used because of the limited computing power available.

In [17]:
links_small = pd.read_csv('data_small/links_small.csv')

In [18]:
tmdbId_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype(int)

In [19]:
movies_df['id'] = movies_df['id'].astype(int)

In [20]:
small_df = movies_df[movies_df['id'].isin(tmdbId_small)]

In [21]:
small_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9392 entries, 0 to 80245
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  9392 non-null   bool   
 1   backdrop_path          9209 non-null   object 
 2   belongs_to_collection  1974 non-null   object 
 3   budget                 9392 non-null   int64  
 4   genres                 9392 non-null   object 
 5   homepage               2498 non-null   object 
 6   id                     9392 non-null   int32  
 7   imdb_id                9390 non-null   object 
 8   original_language      9392 non-null   object 
 9   original_title         9392 non-null   object 
 10  overview               9383 non-null   object 
 11  popularity             9392 non-null   float64
 12  poster_path            9380 non-null   object 
 13  production_companies   9392 non-null   object 
 14  production_countries   9392 non-null   object 
 15  rel

There are **9392** movies available in the small movies metadata dataset, which is about 9 times smaller than the original dataset of 85,000 movies.

### Movie Description Based Recommender

First, a recommender is built using movie descriptions and taglines. There is no quantitative metric to judge the engine's performance, so this will have to be done qualitatively.

In [22]:
# Work on a copy and assign results back, rather than filling in place on a slice of movies_df.
small_df = small_df.copy()
small_df['tagline'] = small_df['tagline'].fillna('')
small_df['description'] = (small_df['overview'] + small_df['tagline']).fillna('')

In [23]:
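# min_df=1 keeps every term, as min_df=0 did; newer scikit-learn releases may reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')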
tfidf_matrix = tf.fit_transform(small_df['description'])

#### Cosine Similarity

Cosine similarity is used to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\lVert x \rVert \cdot \lVert y \rVert}$

Since the TF-IDF Vectorizer returns rows normalised to unit length by default, the dot product of two rows directly gives the cosine similarity score. Therefore, sklearn's **linear_kernel** is used instead of **cosine_similarity**, since it skips the normalisation step and is faster.
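
The shortcut rests on one fact: for vectors of unit length, the denominator of the cosine formula is 1, so the dot product alone is the cosine similarity. A minimal plain-Python check, using two hypothetical term-weight vectors and illustrative helper names (`dot`, `l2_normalize`, `cosine_similarity_value`), might look like this:

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def l2_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


def cosine_similarity_value(x, y):
    """Cosine similarity computed from the full formula."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


# Two hypothetical term-weight vectors.
x = [3.0, 0.0, 4.0]
y = [1.0, 2.0, 2.0]

full_cosine = cosine_similarity_value(x, y)
dot_of_unit_vectors = dot(l2_normalize(x), l2_normalize(y))

print(math.isclose(full_cosine, dot_of_unit_vectors))  # True
```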

In [24]:
cosine = linear_kernel(tfidf_matrix, tfidf_matrix)

In [25]:
cosine[0]

array([1.        , 0.00658619, 0.        , ..., 0.        , 0.        ,
       0.00930901])

This gives a pairwise cosine similarity matrix for all the movies in the dataset.

The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.
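
The ranking step can be sketched on a toy matrix before applying it to the real one. Everything below is hypothetical (titles, scores and the helper `most_similar`). Unlike the notebook's function, which drops the first entry of the sorted list, this sketch excludes the query movie by its index, which also behaves sensibly if another movie happens to have a similarity of exactly 1.

```python
# Hypothetical 4x4 similarity matrix; row i holds movie i's similarity to every movie.
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_similarity = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]


def most_similar(title, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    idx = toy_titles.index(title)
    scored = [(j, s) for j, s in enumerate(toy_similarity[idx]) if j != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [toy_titles[j] for j, _ in scored[:n]]


print(most_similar("Movie A"))   # ['Movie C', 'Movie B']
```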

In [26]:
small_df = small_df.reset_index()

In [27]:
small_df.rename(columns={'index': 'previous_index'}, inplace=True)

In [28]:
def get_recommendations(title):
    # Row positions of every movie whose title matches exactly.
    idx = pd.Series(data=small_df[small_df['title'] == title].index,
                    index=small_df[small_df['title'] == title]['title'])

    # Several movies can share a title; keep the most recent one.
    if idx.shape[0] > 1:
        indices_list = list(idx)
        indices_df = small_df[small_df.index.isin(indices_list)]
        indices_df = indices_df[indices_df['year'] == indices_df['year'].max()]
        idx = pd.Series(data=indices_df.index, index=indices_df['title'])

    # Take the first match by position.
    idx = idx.iloc[0]
    similarity_score = list(enumerate(cosine[idx]))
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next 30.
    similarity_score = similarity_score[1:31]
    movie_indices = [i[0] for i in similarity_score]

    return small_df['title'].iloc[movie_indices]

In [29]:
get_recommendations('Iron Man').head(10)

7895                 Iron Man 3
7160                 Iron Man 2
8473                      Clown
8369    Avengers: Age of Ultron
8684                       Room
2297          Felicia's Journey
7723                      Brake
5745                    Hostage
9286         The Shape of Water
6787                       Igor
Name: title, dtype: object

In [30]:
get_recommendations('Captain America').head(10)

7459                   Captain America: The First Avenger
9070                                            Team Thor
9107                        Marvel One-Shot: Agent Carter
8096                  Captain America: The Winter Soldier
8379                           Captain America: Civil War
6643    Indiana Jones and the Kingdom of the Crystal S...
1601                           House II: The Second Story
2361                                  Hell in the Pacific
6286                                Letters from Iwo Jima
1115                                       Absolute Power
Name: title, dtype: object

For **Captain America**, the system is able to identify it as a Captain America film and subsequently recommend other Captain America films as its top recommendations.

But unfortunately, that is all this system can do at the moment. This is not of much use to most people, as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked **Captain America** probably likes it more because it belongs to **Marvel Studios**, and would enjoy the other high quality movies in the MCU.

Therefore, much more suggestive metadata than **Overview** and **Tagline** is going to be used. In the next subsection, a more sophisticated recommender is built that takes **genre**, **keywords**, **cast** and **crew** into consideration.

### Metadata Based Recommender

To build a standard metadata based content recommender, the current dataset must first be merged with the credits and keywords datasets.

In [31]:
credits_df = pd.read_csv('data/credits.csv')
keywords_df = pd.read_csv('data/keywords.csv')

In [32]:
credits_df['id'] = credits_df['id'].astype('int')
keywords_df['id'] = keywords_df['id'].astype('int')

In [33]:
credits_df.drop_duplicates('id', inplace=True)
keywords_df.drop_duplicates('id', inplace=True)

In [34]:
movies_df = movies_df.merge(credits_df, on='id', how='left')
movies_df = movies_df.merge(keywords_df, on='id', how='left')

In [35]:
small_df = movies_df[movies_df['id'].isin(tmdbId_small)]

Cast, crew, genres and keywords are now all in one dataframe. Let's wrangle this a little more using the following intuitions:

1. **Crew:** From the crew, only the director is picked as a feature, since the others don't contribute that much to the *feel* of the movie.

2. **Cast:** Choosing the cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, only the major characters and their respective actors are selected. The top 3 actors that appear in the credits list are kept, which is an arbitrary choice.
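
The two selection rules above can be sketched on a single hypothetical record. The names and the helper functions `pick_director` and `pick_top_cast` are invented for illustration; the record only assumes the same shape as the parsed credits, namely a list of dictionaries with `job` and `name` keys for the crew and `name` keys for the cast.

```python
# Hypothetical credits for one movie, shaped like the parsed cast/crew lists.
toy_crew = [
    {"job": "Producer", "name": "Pat Producer"},
    {"job": "Director", "name": "Dana Director"},
]
toy_cast = [{"name": f"Actor {i}"} for i in range(1, 6)]   # in billing order


def pick_director(crew):
    """Return the first crew member whose job is Director, or None."""
    return next((member["name"] for member in crew if member["job"] == "Director"), None)


def pick_top_cast(cast, n=3):
    """Keep the first n billed actors (fewer if the cast is shorter)."""
    return [member["name"] for member in cast[:n]]


print(pick_director(toy_crew))   # Dana Director
print(pick_top_cast(toy_cast))   # ['Actor 1', 'Actor 2', 'Actor 3']
```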

In [36]:
small_df['cast'] = small_df['cast'].apply(literal_eval)
small_df['crew'] = small_df['crew'].apply(literal_eval)
small_df['keywords'] = small_df['keywords'].apply(literal_eval)
small_df['cast_size'] = small_df['cast'].apply(lambda x: len(x))
small_df['crew_size'] = small_df['crew'].apply(lambda x: len(x))

In [37]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [38]:
small_df['director'] = small_df['crew'].apply(get_director)

In [39]:
small_df['cast'] = small_df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_df['cast'] = small_df['cast'].apply(lambda x: x[:3])  # slicing already handles lists shorter than 3

In [40]:
small_df['keywords'] = small_df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

The plan is to create a metadata dump for every movie consisting of **genres, director, main actors and keywords**, then use a **Count Vectorizer** to create a count matrix, in the same way the TF-IDF matrix was created in the Description Recommender. The remaining steps are similar to what was done earlier: calculate the cosine similarities and return the movies that are most similar.
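
To make the idea concrete, the sketch below computes cosine similarity between token-count vectors for three hypothetical metadata dumps. It is a simplified stand-in, not the notebook's pipeline: it counts single tokens only, whereas the Count Vectorizer used later also counts two-word sequences and removes English stop words. The function name `count_cosine` and all tokens are invented for illustration.

```python
import math
from collections import Counter


def count_cosine(doc_a, doc_b):
    """Cosine similarity between the token-count vectors of two strings."""
    counts_a, counts_b = Counter(doc_a.split()), Counter(doc_b.split())
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical metadata dumps; the director token appears twice in each.
dump_1 = "action adventure actorone actortwo directorx directorx"
dump_2 = "action comedy actorthree actortwo directorx directorx"
dump_3 = "romance drama actorfour actorfive directory directory"

print(round(count_cosine(dump_1, dump_2), 3))   # 0.75
print(count_cosine(dump_1, dump_3))             # 0.0
```

The first two dumps share a genre, an actor and the doubled director token, so they score high; the third shares nothing and scores zero.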

These are the steps followed in preparing the credits data:

1. **Strip Spaces and Convert to Lowercase** from all features. This way, the engine will not confuse **Johnny Depp** with **Johnny Galecki.**

2. **Mention Director 2 times** to give it more weight relative to the entire cast.
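
Both preparation steps can be shown in a few lines of plain Python. The helper `normalise_name` and the director name are hypothetical. The point of the first step is that a vectorizer splits on spaces, so without it two different people called Johnny would share a token.

```python
def normalise_name(name):
    """Lowercase a name and remove spaces so it becomes a single token."""
    return name.replace(" ", "").lower()


names = ["Johnny Depp", "Johnny Galecki"]

# Without normalisation both names contribute the token "johnny".
raw_tokens = [set(name.lower().split()) for name in names]
print(raw_tokens[0] & raw_tokens[1])            # {'johnny'}

# After normalisation each person is one distinct token.
tokens = [normalise_name(name) for name in names]
print(tokens)                                   # ['johnnydepp', 'johnnygalecki']

# Repeating the director token doubles its count in the metadata dump.
director = normalise_name("Dana Director")      # hypothetical name
director_tokens = [director, director]
print(director_tokens)                          # ['danadirector', 'danadirector']
```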

In [41]:
small_df['cast'] = small_df['cast'].apply(lambda x: [str.lower(i.replace(' ', '')) for i in x])

In [42]:
small_df['director'] = small_df['director'].astype('str').apply(lambda x: str.lower(x.replace(' ', '')))
small_df['director'] = small_df['director'].apply(lambda x: [x,x])

#### Keywords

A small amount of pre-processing is applied to the keywords before putting them to any use. As a first step, the frequency count of every keyword that appears in the dataset is calculated.

In [43]:
stack_keywords = small_df.apply(lambda x: pd.Series(x['keywords']), axis=1).stack().reset_index(level=1, drop=True)
stack_keywords.name = 'keyword'

In [44]:
stack_keywords = stack_keywords.value_counts()

In [45]:
stack_keywords[:5]

based on novel or book    918
woman director            567
murder                    511
new york city             400
duringcreditsstinger      344
Name: keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 918. Keywords that occur only once cannot link one movie to another, so they can be safely removed.

Finally, every word will be converted to its stem so that words such as *Heroes* and *Hero* are considered the same.
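
The frequency filter can be sketched with `collections.Counter` on hypothetical keyword lists. Stemming is left out of this sketch because it needs the NLTK stemmer used in the cells that follow; only the "drop keywords that occur once" step is shown.

```python
from collections import Counter

# Hypothetical keyword lists for three movies.
toy_keywords = [
    ["superhero", "based on comic", "billionaire"],
    ["superhero", "based on comic", "alien invasion"],
    ["superhero", "time travel"],
]

# Count how often each keyword occurs across the whole collection.
keyword_counts = Counter(kw for movie in toy_keywords for kw in movie)

# Keep only keywords that occur more than once.
frequent = {kw for kw, count in keyword_counts.items() if count > 1}
filtered = [[kw for kw in movie if kw in frequent] for movie in toy_keywords]

print(filtered)
# [['superhero', 'based on comic'], ['superhero', 'based on comic'], ['superhero']]
```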

In [46]:
stack_keywords = stack_keywords[stack_keywords > 1]

In [47]:
stack_keywords

based on novel or book    918
woman director            567
murder                    511
new york city             400
duringcreditsstinger      344
                         ... 
necklace                    2
auckland                    2
pet cemetery                2
cuban refugees              2
ferry                       2
Name: keyword, Length: 8056, dtype: int64

In [48]:
stemmer = SnowballStemmer('english')

In [49]:
stemmer.stem('heroes')

'hero'

In [50]:
def filter_keywords(keyword):
    words = []
    for i in keyword:
        if i in stack_keywords:
            words.append(i)
    return words

In [51]:
small_df['keywords'] = small_df['keywords'].apply(filter_keywords)
small_df['keywords'] = small_df['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
small_df['keywords'] = small_df['keywords'].apply(lambda x: [str.lower(i.replace(' ', '')) for i in x])

In [52]:
small_df['union'] = small_df['genres'] + small_df['keywords'] + small_df['cast'] + small_df['director']
small_df['union'] = small_df['union'].apply(lambda x: ' '.join(x))

In [53]:
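# min_df=1 keeps every term, as min_df=0 did; newer scikit-learn releases may reject an integer min_df of 0.
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')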
count_matrix = count.fit_transform(small_df['union'])

In [54]:
cosine = cosine_similarity(count_matrix, count_matrix)

In [55]:
cosine[0]

array([1.        , 0.04843595, 0.02216755, ..., 0.02594996, 0.        ,
       0.        ])

In [56]:
small_df = small_df.reset_index()

In [57]:
small_df.rename(columns={'index': 'previous_index'}, inplace=True)

The get_recommendations function written earlier is reused. Since the cosine similarity scores have changed, it is expected to give different (and probably better) results. Let's check **Iron Man** again and see what recommendations come up this time around.

In [58]:
get_recommendations('Iron Man').head(10)

7160                             Iron Man 2
7895                             Iron Man 3
8376                 Avengers: Infinity War
8379             Captain America: Civil War
7502                           The Avengers
8375                          Black Panther
8166                Guardians of the Galaxy
8378         Guardians of the Galaxy Vol. 2
8096    Captain America: The Winter Soldier
8369                Avengers: Age of Ultron
Name: title, dtype: object

The results this time around are far more satisfying. The recommendations seem to have recognized the other Iron Man films and other MCU movies (which share genres, keywords and cast members with it) and put them as top recommendations. I enjoyed watching **Iron Man** as well as some of the other ones in the list, including **The Avengers**, **Captain America: The Winter Soldier** and **Black Panther**.

In [59]:
get_recommendations("Twilight").head(10)

7038                  The Twilight Saga: New Moon
7196                   The Twilight Saga: Eclipse
7379                              Red Riding Hood
7554    The Twilight Saga: Breaking Dawn - Part 1
7795    The Twilight Saga: Breaking Dawn - Part 2
6305                          Blood and Chocolate
7851                          Beautiful Creatures
5833                             Lords of Dogtown
8439                                  Paper Towns
6267                           The Nativity Story
Name: title, dtype: object

In [60]:
get_recommendations('Fast & Furious').head(10)

7906                         Fast & Furious 6
6145    The Fast and the Furious: Tokyo Drift
7414                                Fast Five
4285                     Better Luck Tomorrow
6039                                Annapolis
8487                                Furious 7
3221                 The Fast and the Furious
4569                               The Rookie
6727                           Righteous Kill
8088                           Need for Speed
Name: title, dtype: object

This engine can clearly be experimented with further by trying out different weights for the features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

#### Popularity and Ratings

One thing to notice about the recommendation system is that it recommends movies regardless of ratings and popularity. It is true that **Blood and Chocolate** shares many characteristics with **Twilight**, but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, a mechanism will be added to remove bad movies and return movies which are popular and have had a good critical response.

The aim is to take the **Top 25 movies** based on similarity scores and calculate the vote count at the **60th percentile** of that set. This value is then used as ***m*** to calculate the ***weighted rating*** of each movie using IMDB's formula, as was done in the Simple Recommender section.
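
The re-ranking step can be sketched in plain Python on six hypothetical candidates. The sketch assumes linear interpolation for the percentile (the pandas default); the helper `quantile` and every title, vote count and rating are invented for illustration.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


# Hypothetical candidates already selected by similarity: (title, votes, rating).
candidates = [
    ("Movie A", 1000, 7),
    ("Movie B", 50, 9),
    ("Movie C", 400, 6),
    ("Movie D", 2000, 8),
    ("Movie E", 10, 3),
    ("Movie F", 800, 5),
]

m = quantile([votes for _, votes, _ in candidates], 0.6)          # vote cutoff
C = sum(rating for _, _, rating in candidates) / len(candidates)  # mean rating

# Drop candidates below the cutoff, then sort the rest by weighted rating.
ranked = sorted(
    (
        (title, (votes / (votes + m)) * rating + (m / (votes + m)) * C)
        for title, votes, rating in candidates
        if votes >= m
    ),
    key=lambda pair: pair[1],
    reverse=True,
)

print([title for title, _ in ranked])   # ['Movie D', 'Movie A', 'Movie F']
```

Movie B has the highest raw rating but only 50 votes, so it falls below the cutoff and never reaches the final list.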

In [61]:
def improved_recommendations(title):
    # Row positions of every movie whose title matches exactly.
    idx = pd.Series(data=small_df[small_df['title'] == title].index,
                    index=small_df[small_df['title'] == title]['title'])

    # Several movies can share a title; keep the most recent one.
    if idx.shape[0] > 1:
        indices_list = list(idx)
        indices_df = small_df[small_df.index.isin(indices_list)]
        indices_df = indices_df[indices_df['year'] == indices_df['year'].max()]
        idx = pd.Series(data=indices_df.index, index=indices_df['title'])

    # Take the first match by position.
    idx = idx.iloc[0]
    similarity_score = list(enumerate(cosine[idx]))
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the 25 most similar candidates.
    similarity_score = similarity_score[1:26]
    movie_indices = [i[0] for i in similarity_score]

    movies = small_df.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype(int)
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype(int)

    # m: 60th percentile of the candidates' vote counts; C: their mean (truncated) rating.
    m = vote_counts.quantile(0.6)
    C = vote_averages.mean()

    # Copy so the columns below are set on a new frame rather than on a slice of `movies`.
    qualified = movies[(movies['vote_count'].notnull()) & (movies['vote_count'] >= m) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')

    qualified['weighted_rating'] = qualified.apply(weighted_rating, args=(m, C), axis=1)
    qualified = qualified.sort_values('weighted_rating', ascending=False).head(10)

    return qualified

In [62]:
improved_recommendations('Iron Man')

Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
8376,Avengers: Infinity War,28055,8,2018,7.349975
7502,The Avengers,29412,7,2012,6.773061
8166,Guardians of the Galaxy,26869,7,2014,6.760758
8369,Avengers: Age of Ultron,21951,7,2015,6.732739
8379,Captain America: Civil War,21721,7,2016,6.731267
8375,Black Panther,21221,7,2018,6.72801
8378,Guardians of the Galaxy Vol. 2,20544,7,2017,6.723473
7406,Thor,20187,6,2011,6.219199
7459,Captain America: The First Avenger,20446,6,2011,6.217797
7895,Iron Man 3,21239,6,2013,6.213613


In [63]:
improved_recommendations('Fast & Furious')

Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
8487,Furious 7,10086,7,2015,6.936316
7414,Fast Five,7666,7,2011,6.917744
915,The Killer,689,7,1989,6.483301
7906,Fast & Furious 6,10096,6,2013,5.995287
3221,The Fast and the Furious,9322,6,2001,5.994921
6145,The Fast and the Furious: Tokyo Drift,6318,6,2006,5.992725
8088,Need for Speed,4101,6,2014,5.989318
7449,Takers,1170,6,2010,5.971942
6056,Running Scared,952,6,2006,5.968081
6727,Righteous Kill,1143,5,2008,5.327572


In [64]:
improved_recommendations('Twilight')

Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
7038,The Twilight Saga: New Moon,8706,6,2009,5.985141
7795,The Twilight Saga: Breaking Dawn - Part 2,8433,6,2012,5.984721
7196,The Twilight Saga: Eclipse,8312,6,2010,5.984527
7554,The Twilight Saga: Breaking Dawn - Part 1,8311,6,2011,5.984526
8439,Paper Towns,4901,6,2015,5.975919
6876,17 Again,4839,6,2009,5.975673
7851,Beautiful Creatures,2726,6,2013,5.962681
7411,Water for Elephants,2408,6,2011,5.95942
4848,Thirteen,1486,6,2003,5.945646
7379,Red Riding Hood,2991,5,2011,5.256491


Fortunately, **Blood and Chocolate** disappears from the recommendation list.

This concludes the Content Based Recommender section. It will be revisited when building a hybrid engine later.

## Collaborative Filtering

The Content Based engine suffers from some severe limitations. It is only capable of suggesting movies which are *close* to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine built so far is not really personal, in that it doesn't capture the personal tastes and biases of a user. Anyone querying this engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who she/he is.

Therefore, in this section, a technique called **Collaborative Filtering** is used to make recommendations to movie watchers. Collaborative filtering filters information by using the interactions and data collected by the system from other users. It’s based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future.

Collaborative Filtering will not be implemented from scratch. Instead, the **Surprise** library is used, which provides powerful algorithms like **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations.
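
To give a feel for what such a model does, here is a minimal plain-Python sketch of biased matrix factorisation trained with stochastic gradient descent, which is the family of model that Surprise's SVD belongs to. It is an illustration under stated assumptions, not the library's implementation: the ratings are hypothetical, and the hyperparameters (`n_factors=2`, `n_epochs=200`, `lr=0.01`, `reg=0.02`) are chosen for this toy example. Each rating is estimated as the global mean plus a user bias, an item bias and the dot product of a user and an item factor vector.

```python
import math
import random

# Hypothetical (user, item, rating) triples; not taken from MovieLens.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u1", "i3", 4.0),
    ("u2", "i1", 4.0), ("u2", "i2", 2.0),
    ("u3", "i2", 5.0), ("u3", "i3", 1.0),
]


def estimate(mu, bu, bi, p, q, u, i):
    """Estimated rating: global mean + user bias + item bias + factor dot product."""
    return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))


def rmse(mu, bu, bi, p, q):
    """Root mean squared error of the model on the training triples."""
    errors = [r - estimate(mu, bu, bi, p, q, u, i) for u, i, r in ratings]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def train(n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit the toy model with SGD; return the training RMSE before and after."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)        # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}                              # user biases
    bi = {i: 0.0 for i in items}                              # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    before = rmse(mu, bu, bi, p, q)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - estimate(mu, bu, bi, p, q, u, i)
            # Gradient steps with L2 regularisation on biases and factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return before, rmse(mu, bu, bi, p, q)


rmse_before, rmse_after = train()
print(rmse_after < rmse_before)
```

The model never looks at what an item is; it only learns numbers that make the known ratings easier to reproduce.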

In [65]:
reader = Reader()

In [66]:
ratings_df = pd.read_csv('data_small/ratings_small.csv')

In [67]:
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader=reader)

In [68]:
# Use the famous SVD algorithm.
svd = SVD()

In [69]:
# Run 5-fold cross-validation and print results.
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8758  0.8813  0.8719  0.8703  0.8740  0.8747  0.0038  
MAE (testset)     0.6730  0.6772  0.6687  0.6686  0.6737  0.6722  0.0033  
Fit time          2.53    1.17    1.33    1.24    1.24    1.50    0.52    
Test time         0.90    0.15    0.39    0.17    0.18    0.36    0.28    


{'test_rmse': array([0.87580527, 0.88130383, 0.87187043, 0.870342  , 0.87401289]),
 'test_mae': array([0.6730075 , 0.67717802, 0.66865805, 0.66857067, 0.67374549]),
 'fit_time': (2.5314135551452637,
  1.1726486682891846,
  1.3252294063568115,
  1.2438914775848389,
  1.2426700592041016),
 'test_time': (0.9007225036621094,
  0.15437030792236328,
  0.38810062408447266,
  0.16626310348510742,
  0.18392682075500488)}
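
For reference, the two error measures reported above can be written out in a few lines. This is a minimal sketch on made-up numbers, using hypothetical helper names `toy_rmse` and `toy_mae`. Surprise computes these measures internally for each test fold.

```python
from math import sqrt

def toy_rmse(actual, predicted):
    """Root Mean Square Error: penalises large errors more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def toy_mae(actual, predicted):
    """Mean Absolute Error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model estimates.
toy_actual = [4.0, 3.0, 5.0, 2.0]
toy_predicted = [3.5, 3.0, 4.0, 3.0]

print(toy_rmse(toy_actual, toy_predicted))  # 0.75
print(toy_mae(toy_actual, toy_predicted))   # 0.625
```

Both measures are in rating units, so an RMSE below 1 means the typical error is less than one star.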

The mean **Root Mean Square Error** is **0.8747**, which is more than good enough for this case. The next step is to train on the full dataset and arrive at predictions.

In [70]:
train_set = data.build_full_trainset()

In [71]:
svd.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1e5053e9150>

In [72]:
ratings_df[ratings_df['userId'] == 1]

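| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 227 | 1 | 3744 | 4.0 | 964980694 |
| 228 | 1 | 3793 | 5.0 | 964981855 |
| 229 | 1 | 3809 | 4.0 | 964981220 |
| 230 | 1 | 4006 | 4.0 | 964982903 |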


In [73]:
svd.predict(uid=196, iid=302, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 3.40   {'was_impossible': False}


Prediction(uid=196, iid=302, r_ui=4, est=3.39642517514232, details={'was_impossible': False})

For the movie with ID 302, the estimated rating for user 196 is **3.396**. One startling feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have rated the movie.

## Hybrid Recommender

The purpose of this section is to build a simple Hybrid Recommender that brings together the techniques implemented in the Content Based and Collaborative Filtering engines. This is how it will work:

* **Input:** User ID and the Title of a Movie

* **Output:** Similar movies sorted on the basis of expected ratings by that particular user.
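
Before the actual implementation, the two-stage idea can be illustrated on toy data: first pick the candidates most similar to the given title, then reorder them by the rating estimated for the given user. Everything in this sketch is hypothetical (the function `toy_hybrid`, the similarity scores and the rating estimates). In the notebook itself these roles are played by the cosine similarity matrix and `svd.predict`.

```python
def toy_hybrid(user_id, title, similar_titles, estimate_rating,
               n_candidates=25, n_results=10):
    """Pick the most content-similar titles, then rank them by estimated rating."""
    # Step 1 (content based): candidates ordered by similarity to `title`.
    candidates = sorted(similar_titles[title].items(),
                        key=lambda kv: kv[1], reverse=True)[:n_candidates]
    # Step 2 (collaborative): estimate a rating of each candidate for this user.
    scored = [(name, estimate_rating(user_id, name)) for name, _ in candidates]
    # Step 3: highest estimated rating first.
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:n_results]

# Hypothetical content similarities to movie 'M0'.
toy_similar = {'M0': {'M1': 0.9, 'M2': 0.8, 'M3': 0.1}}

# Hypothetical per-user rating estimates, standing in for a trained model.
toy_estimates = {(1, 'M1'): 3.0, (1, 'M2'): 4.5, (1, 'M3'): 5.0,
                 (2, 'M1'): 4.0, (2, 'M2'): 2.0, (2, 'M3'): 5.0}

def toy_estimate_rating(user_id, name):
    return toy_estimates[(user_id, name)]

toy_for_user_1 = toy_hybrid(1, 'M0', toy_similar, toy_estimate_rating, n_candidates=2)
toy_for_user_2 = toy_hybrid(2, 'M0', toy_similar, toy_estimate_rating, n_candidates=2)
print(toy_for_user_1)  # [('M2', 4.5), ('M1', 3.0)]
print(toy_for_user_2)  # [('M1', 4.0), ('M2', 2.0)]
```

Both users get the same candidates but in a different order, and `M3` never appears despite its high estimates because it is not similar enough to `M0` to pass the first stage.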

In [74]:
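def convert_int(x):
    # Convert to int where possible; return NaN for missing or malformed IDs.
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan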

In [75]:
id_map = pd.read_csv('data_small/links_small.csv')[['movieId', 'tmdbId']]

In [76]:
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)

In [77]:
id_map = id_map.merge(small_df[['title', 'id']], left_on='tmdbId', right_on='id').set_index('title')

In [78]:
id_map.drop('tmdbId', axis=1, inplace=True)

In [79]:
def hybrid_recommendations(userId, title):
    # All rows of small_df whose title matches the query.
    matches = small_df[small_df['title'] == title]

    # If several movies share the title, keep the most recent release.
    if matches.shape[0] > 1:
        matches = matches[matches['year'] == matches['year'].max()]

    # Index of the chosen movie, used to look up its row in the cosine similarity matrix.
    idx = matches.index[0]
    similarity_score = list(enumerate(cosine[idx]))
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the movie itself) and keep the next 25 most similar movies.
    similarity_score = similarity_score[1:26]
    movie_indices = [i[0] for i in similarity_score]

    movies = small_df.iloc[movie_indices][['title', 'id', 'vote_count', 'vote_average', 'year']].copy()
    # Build the TMDB id -> MovieLens movieId lookup once instead of once per row.
    movie_id_by_tmdb_id = id_map.set_index('id')['movieId']
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, movie_id_by_tmdb_id.loc[x]).est)
    movies = movies.sort_values('est', ascending=False).head(10)

    return movies

In [80]:
hybrid_recommendations(1, 'Iron Man')

| | title | id | vote_count | vote_average | year | est |
|---|---|---|---|---|---|---|
| 8166 | Guardians of the Galaxy | 118340 | 26869 | 7.905 | 2014 | 4.769449 |
| 8377 | Thor: Ragnarok | 284053 | 19720 | 7.594 | 2017 | 4.676347 |
| 7434 | X-Men: First Class | 49538 | 12093 | 7.297 | 2011 | 4.6225 |
| 4317 | X2 | 36658 | 9521 | 7.0 | 2003 | 4.57888 |
| 7502 | The Avengers | 24428 | 29412 | 7.711 | 2012 | 4.526972 |
| 7459 | Captain America: The First Avenger | 1771 | 20446 | 6.995 | 2011 | 4.503477 |
| 8375 | Black Panther | 284054 | 21221 | 7.388 | 2018 | 4.488888 |
| 8096 | Captain America: The Winter Soldier | 100402 | 17866 | 7.67 | 2014 | 4.477583 |
| 8378 | Guardians of the Galaxy Vol. 2 | 283995 | 20544 | 7.621 | 2017 | 4.406626 |
| 2826 | X-Men | 36657 | 10643 | 6.998 | 2000 | 4.39223 |


In [81]:
hybrid_recommendations(500, 'Iron Man')

| | title | id | vote_count | vote_average | year | est |
|---|---|---|---|---|---|---|
| 8378 | Guardians of the Galaxy Vol. 2 | 283995 | 20544 | 7.621 | 2017 | 3.740729 |
| 8377 | Thor: Ragnarok | 284053 | 19720 | 7.594 | 2017 | 3.6596 |
| 7434 | X-Men: First Class | 49538 | 12093 | 7.297 | 2011 | 3.501074 |
| 8372 | Ant-Man | 102899 | 18882 | 7.081 | 2015 | 3.488785 |
| 8376 | Avengers: Infinity War | 299536 | 28055 | 8.252 | 2018 | 3.433095 |
| 9107 | Marvel One-Shot: Agent Carter | 211387 | 690 | 7.2 | 2013 | 3.431697 |
| 8379 | Captain America: Civil War | 271110 | 21721 | 7.442 | 2016 | 3.402881 |
| 8096 | Captain America: The Winter Soldier | 100402 | 17866 | 7.67 | 2014 | 3.358924 |
| 4317 | X2 | 36658 | 9521 | 7.0 | 2003 | 3.349765 |
| 8369 | Avengers: Age of Ultron | 99861 | 21951 | 7.273 | 2015 | 3.346312 |


The Hybrid Recommender returns different recommendations for different users even though the movie is the same. Hence, these recommendations are more personalized and tailored towards particular users.

## Conclusion

In this notebook, four different recommendation engines have been built based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used overall TMDB Vote Count and Vote Averages to build Top Movies Charts, in general and for a specific genre. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.

2. **Content Based Recommender:** This system consists of two content based engines: one that took movie overviews and taglines as input, and another that took metadata such as cast, crew, genre and keywords to come up with predictions. A simple filter was also devised to give greater preference to movies with more votes and higher ratings.

3. **Collaborative Filtering:** The Surprise library was used to build a Collaborative Filter based on singular value decomposition. The mean RMSE obtained in 5-fold cross-validation was about 0.87, and the engine gave estimated ratings for a given user and movie.

4. **Hybrid Engine:** The ideas from Content Based and Collaborative Filtering were brought together to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.