[Kaggle - Movie_Recommendation_System](https://www.kaggle.com/rounakbanik/the-movies-dataset)

# Table of Contents
1. [Data Preprocessing](#preprocess)
2. [Simple Recommender](#simple)
    1. [All genres](#2.1)
    2. [Designated genre](#2.2)
3. [Content Based Recommender](#content)
    1. [Movie Description Based Recommender](#3.1)
    2. [Metadata Based Recommender](#3.2)
    3. [Metadata & Score Based Recommender](#3.3)
4. [Collaborative Filtering](#collaborative)
5. [Hybrid Recommender](#hybrid)

---
Recommendation systems are a type of information filtering system: they improve the quality of search results by surfacing items that are more relevant to the search item or related to the user's search history.

They are used to predict the rating or preference that a user would give to an item.

**Simple Recommender:**
* gives generalized recommendations based on movie *genre* and *popularity* only (uses *vote_average* & *vote_count* to compute a *score*)
* The idea is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.

**Content Based Recommender:**
* uses item metadata, such as genre, director, description, actors, etc., to make recommendations
* The idea is that if a person liked a particular item, they will also like an item that is similar to it.
* computes the similarity of movies' text features and suggests the movies most similar to a particular movie that a user liked
* content features:
    * *Overviews* and *Taglines*
    * *Cast*, *Crew*, *Keywords* and *Genre*
    * Finally, text similarity with *score* prioritising (output: content-similar movies sorted by *score*)
* not really personal as it doesn't capture the personal tastes of a user. Anyone querying our engine with a movie name will receive the same recommendations for that movie.

**Collaborative Filtering:**
* predicts what the user will like (the estimated rating the user would give to an unseen movie)
    * training: needs a table of ratings given by users to different items (movie IDs)
    * prediction: works purely on user ID and movie ID to estimate the user's rating of an unseen movie (based on how other users have rated that movie)
* uses the Surprise library, a Python scikit for building recommender systems
    * similar to sklearn (fit, predict, cv, gridsearchcv)
    * different: sklearn works on np.array, while Surprise has its own dataset object, loaded from a CSV whose columns must be ['userid', 'itemid', 'rating']
* [Here more about CF theory](https://blog.xuite.net/metafun/life/131996342-Recommendation+Systems%E5%92%8C%E5%8D%94%E5%90%8C%E9%81%8E%E6%BF%BE%28CF%29%E7%B0%A1%E4%BB%8B)

**Hybrid Recommender**
* more personalized, can return different movie recommendations for different users although the searched movie is the same.
* Content-based + collaborative-filter-based engine
* Input: User ID, movie title
* Output: Content-similar movies sorted by estimated ratings by the particular user

---
# 1) Data Preprocessing<a class="anchor" id="preprocess"></a>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from ast import literal_eval   # safely parse a Python literal (list, dict, ...) from a string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer

import warnings
warnings.simplefilter('ignore')

In [2]:
md = pd.read_csv('./input/movies_metadata.csv', parse_dates=['release_date'])

md = md.drop_duplicates(subset=['id'])

# convert id to int (values that cannot be parsed become NaN)
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan
md['id'] = md['id'].apply(convert_int)
md = md[md['id'].notnull()]
md['id'] = md['id'].astype(int)

display(md.head())
display(md.info())

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45433 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45433 non-null  object 
 1   belongs_to_collection  4488 non-null   object 
 2   budget                 45433 non-null  object 
 3   genres                 45433 non-null  object 
 4   homepage               7774 non-null   object 
 5   id                     45433 non-null  int32  
 6   imdb_id                45416 non-null  object 
 7   original_language      45422 non-null  object 
 8   original_title         45433 non-null  object 
 9   overview               44479 non-null  object 
 10  popularity             45430 non-null  object 
 11  poster_path            45047 non-null  object 
 12  production_companies   45430 non-null  object 
 13  production_countries   45430 non-null  object 
 14  release_date           45346 non-null  object 
 15  re

None

* *genres* - The genres of the movie (Action, Comedy, Thriller, etc.).
* *id* - movie_id
* *keywords* - The keywords or tags related to the movie.
* *overview* - A brief description of the movie.
* *popularity* - A numeric quantity specifying the movie popularity.
* *release_date* - The date on which it was released.
* *tagline* - Movie's tagline.
* *title* - Title of the movie.
* *vote_average* - the average rating the movie received.
* *vote_count* - the number of votes received.

In [3]:
# year column
md['release_date'] = pd.to_datetime(md['release_date'], errors='coerce')
md['year'] = md['release_date'].dt.year

md.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995.0


In [4]:
# literal_eval example: parse the string in each cell into a Python list of dicts
md['genres'].fillna('[]').apply(literal_eval)

0        [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1        [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2        [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4                           [{'id': 35, 'name': 'Comedy'}]
                               ...                        
45461    [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462                        [{'id': 18, 'name': 'Drama'}]
45463    [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464                                                   []
45465                                                   []
Name: genres, Length: 45433, dtype: object
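To make the parsing step concrete, here is a minimal sketch on a single hand-written string in the same shape as a `genres` cell. The variable names `raw`, `parsed` and `names` are illustrative only and are not part of the notebook.

```python
from ast import literal_eval

# A string shaped like one cell of the raw 'genres' column (hand-written sample).
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval only accepts Python literals (lists, dicts, strings, numbers, ...),
# so it is a safer choice than eval for data read from a file.
parsed = literal_eval(raw)

# Keep just the genre names, as the next cell does for the whole column.
names = [g['name'] for g in parsed]
print(names)

# Anything that is not a plain literal is rejected with a ValueError.
try:
    literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

This is also why missing values are filled with the string `'[]'` first: it parses to an empty list instead of raising an error.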

In [5]:
# extract value of 'name' from 'genres' column 
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# select first genre from genres
md['1stgenre'] = md['genres'].apply(lambda x: x[0] if len(x) != 0 else np.nan)

md.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year,1stgenre
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995.0,Animation
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995.0,Adventure


---
# 2) Simple Recommender<a class="anchor" id="simple"></a>
Offers generalized recommendations based on movie popularity and genre

## 2.1) All genres<a class="anchor" id="2.1"></a>

### Calculate C, m for calculating the score later

In [6]:
# Calculate the 95th percentile of vote_count
m = md['vote_count'].quantile(0.95)
print('95% of vote_count (m):', m)

# Calculate mean of vote_averages
C = md['vote_average'].mean()
print('mean of vote_average (C):', C)

95% of vote_count (m): 434.0
mean of vote_average (C): 5.618329297820823


### Select qualified data ('vote_count' >= m, its 95th percentile)

In [7]:
col = ['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres', '1stgenre']

# Select rows where df['vote_count'] >= m (the 95th percentile, not the mean)
qualified = md[md['vote_count'] >= m][col]

qualified['vote_count'] = qualified['vote_count'].astype(int)
qualified['year'] = qualified['year'].astype(int)

qualified

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,1stgenre
0,Toy Story,1995,5415,7.7,21.9469,"[Animation, Comedy, Family]",Animation
1,Jumanji,1995,2413,6.9,17.0155,"[Adventure, Fantasy, Family]",Adventure
5,Heat,1995,1886,7.7,17.9249,"[Action, Crime, Drama, Thriller]",Action
9,GoldenEye,1995,1194,6.6,14.686,"[Adventure, Action, Thriller]",Adventure
15,Casino,1995,1343,7.8,10.1374,"[Drama, Crime]",Drama
...,...,...,...,...,...,...,...
44624,What Happened to Monday,2017,598,7.3,60.581223,"[Science Fiction, Thriller]",Science Fiction
44632,Atomic Blonde,2017,748,6.1,14.455104,"[Action, Thriller]",Action
44678,Dunkirk,2017,2712,7.5,30.938854,"[Action, Drama, History, Thriller, War]",Action
44842,Transformers: The Last Knight,2017,1440,6.2,39.186819,"[Action, Science Fiction, Thriller, Adventure]",Action


### Calculate score for movie ranking

![](https://i.ibb.co/6ZXJ3bd/wr.png)
* v is the number of votes for the movie;
* m is the minimum votes required to be listed in the chart;
* R is the average rating of the movie; and
* C is the mean vote across the whole dataset
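As a quick sanity check of the formula, the sketch below plugs in the `m` and `C` values printed earlier in this notebook together with the vote count and average listed for The Shawshank Redemption. The function name `weighted_score` and the ten-vote comparison case are illustrative assumptions, not part of the original code.

```python
def weighted_score(v, R, m, C):
    """Weighted rating: pulls R toward the global mean C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434.0                 # 95th percentile of vote_count, as printed above
C = 5.618329297820823     # mean of vote_average, as printed above

# Many votes: the score stays close to the movie's own average R.
shawshank = weighted_score(v=8358, R=8.5, m=m, C=C)

# Hypothetical movie with the same average but only 10 votes:
# the score is pulled most of the way back toward C.
few_votes = weighted_score(v=10, R=8.5, m=m, C=C)

print(round(shawshank, 3), round(few_votes, 3))
```

The two weights `v/(v+m)` and `m/(v+m)` always sum to 1, so the score is an average of `R` and `C` that leans toward `R` as the number of votes grows.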

In [8]:
def weighted_rating(x, C, m):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

qualified['score'] = qualified.apply(weighted_rating, args=(C, m), axis=1)

qualified.sort_values('score', ascending=False).head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,1stgenre,score
314,The Shawshank Redemption,1994,8358,8.5,51.6454,"[Drama, Crime]",Drama,8.357752
834,The Godfather,1972,6024,8.5,41.1093,"[Drama, Crime]",Drama,8.306342
12481,The Dark Knight,2008,12269,8.3,123.167,"[Drama, Action, Crime, Thriller]",Drama,8.20838
2843,Fight Club,1999,9678,8.3,63.8696,[Drama],Drama,8.184905
292,Pulp Fiction,1994,8670,8.3,140.95,"[Thriller, Crime]",Thriller,8.172161
351,Forrest Gump,1994,8147,8.2,48.3072,"[Comedy, Drama, Romance]",Comedy,8.069427
522,Schindler's List,1993,4436,8.3,41.7251,"[Drama, History, War]",Drama,8.061017
23673,Whiplash,2014,4376,8.3,64.3,[Drama],Drama,8.058036
5481,Spirited Away,2001,3968,8.3,41.0489,"[Fantasy, Adventure, Animation, Family]",Fantasy,8.03561
1154,The Empire Strikes Back,1980,5998,8.2,19.471,"[Adventure, Action, Science Fiction]",Adventure,8.025801


## 2.2) Designated genre<a class="anchor" id="2.2"></a>

In [9]:
def build_chart(genre, percentile=0.85, col = ['title', 'year', 'vote_count', 'vote_average', 'popularity']):
    # Select genre
    df = md[md['1stgenre'] == genre]
    
    # Calculate C, m
    C = df['vote_average'].mean()
    m = df['vote_count'].quantile(percentile)
    
    # Select df['vote_count'] >= m (the chosen percentile)
    qualified = df[df['vote_count'] >= m][col]
    
    # Calculate score for ranking movies
    qualified['score'] = qualified.apply(weighted_rating, args=(C, m), axis=1)

    qualified = qualified.sort_values('score', ascending=False)
    
    return qualified

In [10]:
build_chart('Romance')

Unnamed: 0,title,year,vote_count,vote_average,popularity,score
40251,Your Name.,2016.0,1030.0,8.5,34.461252,8.365183
22168,Her,2013.0,4215.0,7.9,13.8295,7.872953
37863,Sing Street,2016.0,669.0,8.0,10.672862,7.832634
7834,The Notebook,2004.0,3163.0,7.7,15.239,7.667241
23512,The Fault in Our Stars,2014.0,3868.0,7.6,16.2747,7.574424
...,...,...,...,...,...,...
7378,Look Who's Talking Too,1990.0,317.0,5.0,6.42968,5.084572
7720,Soul Plane,2004.0,109.0,4.7,8.60439,4.989844
5318,Look Who's Talking Now!,1993.0,200.0,4.7,9.01608,4.884552
31401,Accidental Love,2015.0,95.0,3.9,5.21373,4.495396


---
# 3) Content Based Recommender<a class="anchor" id="content"></a>

In [11]:
display(md.head())
display(md.info())

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year,1stgenre
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995.0,Animation
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995.0,Adventure
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995.0,Romance
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995.0,Comedy
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995.0,Comedy


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45433 entries, 0 to 45465
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   adult                  45433 non-null  object        
 1   belongs_to_collection  4488 non-null   object        
 2   budget                 45433 non-null  object        
 3   genres                 45433 non-null  object        
 4   homepage               7774 non-null   object        
 5   id                     45433 non-null  int32         
 6   imdb_id                45416 non-null  object        
 7   original_language      45422 non-null  object        
 8   original_title         45433 non-null  object        
 9   overview               44479 non-null  object        
 10  popularity             45430 non-null  object        
 11  poster_path            45047 non-null  object        
 12  production_companies   45430 non-null  object        
 13  p

None

## 3.1) Movie Description Based Recommender<a class="anchor" id="3.1"></a>
Uses a description field built by concatenating **overview** and **tagline**

In [12]:
md['description'] = md.overview.fillna('')+' '+md.tagline.fillna('')
md['description']

0        Led by Woody, Andy's toys live happily in his ...
1        When siblings Judy and Peter discover an encha...
2        A family wedding reignites the ancient feud be...
3        Cheated on, mistreated and stepped on, the wom...
4        Just when George Banks has recovered from his ...
                               ...                        
45461    Rising and falling between a man and woman. Ri...
45462    An artist struggles to finish his work while a...
45463    When one of her hits goes wrong, a professiona...
45464    In a small town live two brothers, one a minis...
45465    50 years after decriminalisation of homosexual...
Name: description, Length: 45433, dtype: object

### TfidfVectorizer
TF-IDF (Term Frequency - Inverse Document Frequency) returns a matrix in which each row represents a movie and each column represents a term in the description vocabulary (every unigram or bigram that appears in at least one document, since `ngram_range=(1, 2)`). TF-IDF reduces the weight of terms that occur in many documents (is, the, ...), whereas CountVectorizer only counts occurrences.
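The weighting can be illustrated by hand on a toy corpus. The sketch below is a pure-Python approximation, not the notebook's code: it uses single words only, applies no stop-word list, and assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which are scikit-learn's documented defaults. All names (`docs`, `idf`, `tfidf_vector`, `vectors`) are hypothetical.

```python
import math
from collections import Counter

# Three made-up descriptions; "the" occurs in every one of them.
docs = [
    "the toy cowboy meets the new toy",
    "the detective hunts the killer",
    "the cowboy rides into the sunset",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed to match sklearn's default).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    # Term count times IDF, then scale the row to unit (L2) length.
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in tokenized]

# "toy" and "the" both occur twice in the first description, but "toy" is rare
# across the corpus, so it ends up with the larger weight.
print(vectors[0]['toy'] > vectors[0]['the'])
```

With `stop_words='english'`, as in the notebook, a word like "the" would be dropped before weighting rather than merely down-weighted.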

In [13]:
# Wall time: 7.57 s
# min_df=1 (the default) keeps every term; newer scikit-learn releases reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')   # TF-IDF vectorizer
tfidf_matrix = tf.fit_transform(md['description'])

tfidf_matrix.shape

(45433, 1103618)

* 1103618 distinct terms (unigrams and bigrams) were used to describe the 45433 movies in the md dataset

### Cosine Similarity
![](https://neo4j.com/docs/graph-algorithms/current/images/cosine-similarity.png)
Because TfidfVectorizer L2-normalizes each row by default, the dot product of two rows is already their cosine similarity. We therefore use sklearn's linear_kernel instead of cosine_similarity, since it skips the redundant normalization and is much faster.
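The equivalence between a dot product and cosine similarity for unit-length vectors can be checked on two small vectors. This is a plain-Python sketch with made-up numbers; the helper names `dot`, `cosine` and `l2_normalize` are illustrative and not sklearn functions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Full cosine similarity: dot product divided by the product of the norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(a):
    # Scale a vector to unit length.
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Once both vectors have unit length, the plain dot product is the cosine similarity.
print(abs(cosine(a, b) - dot(a_unit, b_unit)) < 1e-12)
```

Since the norms are 5 and 3 and the dot product is 11, both expressions evaluate to 11/15.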

In [14]:
# Wall time: 21.5 s
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim.shape

(45433, 45433)

In [15]:
titles = md['title']
titles

0                          Toy Story
1                            Jumanji
2                   Grumpier Old Men
3                  Waiting to Exhale
4        Father of the Bride Part II
                    ...             
45461                         Subdue
45462            Century of Birthing
45463                       Betrayal
45464               Satan Triumphant
45465                       Queerama
Name: title, Length: 45433, dtype: object

In [16]:
indices = pd.Series(md.index, index = md['title'])
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45433, dtype: int64

In [17]:
def get_recommendations(title):
    # get user searched movie index
    idx = indices[title]
    
    # select the first match if two or more movies share the same title
    idx = idx.iloc[0] if np.size(idx) > 1 else idx
    
    # get the cosine similarity of the movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # sort the cosine similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # drop the first entry (the searched movie itself) and keep the next 30 most similar movies
    sim_scores = sim_scores[1:31]
    
    # find the index corresponding to the recommended movies
    movie_indices = [i[0] for i in sim_scores]
    
    # return the recommended movie titles by using the index
    return titles.iloc[movie_indices].head(10)
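One thing to watch in a lookup like this: `cosine_sim` is addressed by row position, whereas a `pd.Series` built from `md.index` stores index labels, and the two drift apart once rows have been dropped (here the index runs up to 45465 for 45433 rows). Below is a minimal sketch of a position-based lookup on a toy frame. `toy`, `toy_sim`, `title_to_pos` and `most_similar` are hypothetical names, and the similarity values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap (label 2 is missing), as happens after dropping rows.
toy = pd.DataFrame({'title': ['A', 'B', 'C']}, index=[0, 1, 3])

# Similarity matrix addressed by position: row/column k belongs to the k-th row of `toy`.
toy_sim = np.array([[1.0, 0.2, 0.9],
                    [0.2, 1.0, 0.1],
                    [0.9, 0.1, 1.0]])

# Map each title to its row position (0..n-1) instead of its index label,
# keeping the first position when a title occurs more than once.
title_to_pos = pd.Series(np.arange(len(toy)), index=toy['title'])
title_to_pos = title_to_pos[~title_to_pos.index.duplicated(keep='first')]

def most_similar(title, n=1):
    pos = title_to_pos[title]
    order = np.argsort(-toy_sim[pos])      # positions sorted by descending similarity
    order = order[order != pos][:n]        # drop the movie itself, keep the top n
    return toy['title'].iloc[order].tolist()

print(most_similar('C'))
```

In this toy case a label-based lookup would return 3 for `'C'`, which is out of range for a 3x3 matrix. On the full data the label is usually still in range, so the mismatch would go unnoticed and simply return the similarities of a different movie. Calling `reset_index(drop=True)` on the frame before building the lookup is another way to keep labels and positions aligned.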

In [18]:
result1 = get_recommendations('The Dark Knight')

result1

43461                                Megafault
21412        Jamie and Jessie Are Not Together
27441                          I Know You Know
11689                             Dead Silence
3977                         Empire of the Sun
28231                      Jamie Marks Is Dead
21676                                  The Pit
864      Halloween: The Curse of Michael Myers
27204                  It Happened in Brooklyn
4904                        Truly Madly Deeply
Name: title, dtype: object

* This recommender can only find movies with similar descriptions (**Overview** & **Tagline**); it ignores genre, keywords, cast and crew.

In [19]:
del tfidf_matrix
del cosine_sim

## 3.2) Metadata Based Recommender<a class="anchor" id="3.2"></a>
This recommender takes **genre**, **keywords**, **cast** and **crew** into account

### cast & crew

In [20]:
df_credits = pd.read_csv('./input/credits.csv')

df_credits = df_credits.drop_duplicates(subset=['id'])

df_credits

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


* id - movie_id
* cast - actors, character
* crew - Director, Editor, Composer, Writer etc.

In [21]:
# extract python expression from str
df_credits['cast'] = df_credits['cast'].apply(literal_eval)
df_credits['crew'] = df_credits['crew'].apply(literal_eval)

In [22]:
df_credits['cast'][0]

[{'cast_id': 14,
  'character': 'Woody (voice)',
  'credit_id': '52fe4284c3a36847f8024f95',
  'gender': 2,
  'id': 31,
  'name': 'Tom Hanks',
  'order': 0,
  'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'},
 {'cast_id': 15,
  'character': 'Buzz Lightyear (voice)',
  'credit_id': '52fe4284c3a36847f8024f99',
  'gender': 2,
  'id': 12898,
  'name': 'Tim Allen',
  'order': 1,
  'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'},
 {'cast_id': 16,
  'character': 'Mr. Potato Head (voice)',
  'credit_id': '52fe4284c3a36847f8024f9d',
  'gender': 2,
  'id': 7167,
  'name': 'Don Rickles',
  'order': 2,
  'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'},
 {'cast_id': 17,
  'character': 'Slinky Dog (voice)',
  'credit_id': '52fe4284c3a36847f8024fa1',
  'gender': 2,
  'id': 12899,
  'name': 'Jim Varney',
  'order': 3,
  'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'},
 {'cast_id': 18,
  'character': 'Rex (voice)',
  'credit_id': '52fe4284c3a36847f8024fa5',
  'gender': 2,
  'id': 12900,
 

In [23]:
df_credits['crew'][0]

[{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f4f',
  'department': 'Writing',
  'gender': 2,
  'id': 12891,
  'job': 'Screenplay',
  'name': 'Joss Whedon',
  'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f55',
  'department': 'Writing',
  'gender': 2,
  'id': 7,
  'job': 'Screenplay',
  'name': 'Andrew Stanton',
  'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f5b',
  'department': 'Writing',
  'gender': 2,
  'id': 12892,
  'job': 'Screenplay',
  'name': 'Joel Cohen',
  'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f61',
  'department': 'Writing',
  'gender': 0,
  'id': 12893,
  'job': 'Screenplay',
  'name': 'Alec Sokolow',
  'profile_path': '/v79vlRYi94BZUQnkkyzn

In [24]:
# get cast_size & crew_size
df_credits['cast_size'] = df_credits['cast'].apply(lambda x: len(x))
df_credits['crew_size'] = df_credits['crew'].apply(lambda x: len(x))

df_credits.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7


In [25]:
# get director from crew
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

df_credits['director'] = df_credits['crew'].apply(get_director)

df_credits.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size,director
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106,John Lasseter
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16,Joe Johnston
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4,Howard Deutch
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10,Forest Whitaker
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7,Charles Shyer


In [26]:
# extract actor name from value of name in cast
df_credits['actors'] = df_credits['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# keep the first 3 actors (fewer if the cast is smaller)
df_credits['3actors'] = df_credits['actors'].apply(lambda x: x[:3] if len(x) >=3 else x)

df_credits.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size,director,actors,3actors
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106,John Lasseter,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Tom Hanks, Tim Allen, Don Rickles]"
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16,Joe Johnston,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Robin Williams, Jonathan Hyde, Kirsten Dunst]"
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4,Howard Deutch,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Walter Matthau, Jack Lemmon, Ann-Margret]"
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10,Forest Whitaker,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Whitney Houston, Angela Bassett, Loretta Devine]"
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7,Charles Shyer,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[Steve Martin, Diane Keaton, Martin Short]"


In [27]:
# remove whitespace & lowercase in 3actors
df_credits['3actors'] = df_credits['3actors'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

df_credits.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size,director,actors,3actors
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106,John Lasseter,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[tomhanks, timallen, donrickles]"
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16,Joe Johnston,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[robinwilliams, jonathanhyde, kirstendunst]"
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4,Howard Deutch,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[waltermatthau, jacklemmon, ann-margret]"
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10,Forest Whitaker,"[Whitney Houston, Angela Bassett, Loretta Devi...","[whitneyhouston, angelabassett, lorettadevine]"
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7,Charles Shyer,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[stevemartin, dianekeaton, martinshort]"


In [28]:
# remove whitespace & lowercase in director
df_credits['director'] = df_credits['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))

# Mention Director 3 times to give it more weight relative to 3actors
df_credits['director'] = df_credits['director'].apply(lambda x: [x,x,x])

df_credits.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size,director,actors,3actors
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106,"[johnlasseter, johnlasseter, johnlasseter]","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[tomhanks, timallen, donrickles]"
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16,"[joejohnston, joejohnston, joejohnston]","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[robinwilliams, jonathanhyde, kirstendunst]"
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4,"[howarddeutch, howarddeutch, howarddeutch]","[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[waltermatthau, jacklemmon, ann-margret]"
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10,"[forestwhitaker, forestwhitaker, forestwhitaker]","[Whitney Houston, Angela Bassett, Loretta Devi...","[whitneyhouston, angelabassett, lorettadevine]"
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7,"[charlesshyer, charlesshyer, charlesshyer]","[Steve Martin, Diane Keaton, Martin Short, Kim...","[stevemartin, dianekeaton, martinshort]"
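
To see what repeating the director does to the similarity, the sketch below compares toy token lists with plain counts. It is a pure-Python illustration rather than the notebook's pipeline: `cosine`, `soup` and the made-up names such as `directorx` are hypothetical, and only single tokens are counted.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token lists, using raw token counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def soup(director, actors, weight):
    """Repeat the director `weight` times, then append the actors."""
    return [director] * weight + actors

# Hypothetical movies: one shares only the director with the query,
# the other shares only a single actor.
actors_q = ["actora", "actorb", "actorc"]
actors_d = ["actord", "actore", "actorf"]
actors_a = ["actora", "actorg", "actorh"]

for weight in (1, 3):
    query = soup("directorx", actors_q, weight)
    same_director = soup("directorx", actors_d, weight)
    same_actor = soup("directory", actors_a, weight)
    print(weight, round(cosine(query, same_director), 3), round(cosine(query, same_actor), 3))
```

With a weight of 1 both candidates score 0.25, so a shared director counts no more than a shared actor. With a weight of 3 the shared director scores 0.75 and the shared actor drops to about 0.083.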


### keywords

In [29]:
df_keywords = pd.read_csv('./input/keywords.csv')

df_keywords = df_keywords.drop_duplicates(subset=['id'])

df_keywords

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
...,...,...
46414,439050,"[{'id': 10703, 'name': 'tragic love'}]"
46415,111109,"[{'id': 2679, 'name': 'artist'}, {'id': 14531,..."
46416,67758,[]
46417,227506,[]


In [30]:
# extract python expression from str
df_keywords['keywords'] = df_keywords['keywords'].apply(literal_eval)

In [31]:
df_keywords['keywords'][0]

[{'id': 931, 'name': 'jealousy'},
 {'id': 4290, 'name': 'toy'},
 {'id': 5202, 'name': 'boy'},
 {'id': 6054, 'name': 'friendship'},
 {'id': 9713, 'name': 'friends'},
 {'id': 9823, 'name': 'rivalry'},
 {'id': 165503, 'name': 'boy next door'},
 {'id': 170722, 'name': 'new toy'},
 {'id': 187065, 'name': 'toy comes to life'}]

In [32]:
# extract keywords from value of name
df_keywords['keywords'] = df_keywords['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df_keywords.head()

Unnamed: 0,id,keywords
0,862,"[jealousy, toy, boy, friendship, friends, riva..."
1,8844,"[board game, disappearance, based on children'..."
2,15602,"[fishing, best friend, duringcreditsstinger, o..."
3,31357,"[based on novel, interracial relationship, sin..."
4,11862,"[baby, midlife crisis, confidence, aging, daug..."


In [33]:
# select the first 3 keywords (slicing also handles lists with fewer than 3 items)
df_keywords['3keywords'] = df_keywords['keywords'].apply(lambda x: x[:3])

df_keywords.head()

Unnamed: 0,id,keywords,3keywords
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[jealousy, toy, boy]"
1,8844,"[board game, disappearance, based on children'...","[board game, disappearance, based on children'..."
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[fishing, best friend, duringcreditsstinger]"
3,31357,"[based on novel, interracial relationship, sin...","[based on novel, interracial relationship, sin..."
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[baby, midlife crisis, confidence]"


In [34]:
# Stemming words & lowercase & remove whitespace in keywords
stemmer = SnowballStemmer('english')

df_keywords['3keywords'] = df_keywords['3keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
df_keywords['3keywords'] = df_keywords['3keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

df_keywords.head()

Unnamed: 0,id,keywords,3keywords
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[jealousi, toy, boy]"
1,8844,"[board game, disappearance, based on children'...","[boardgam, disappear, basedonchildren'sbook]"
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[fish, bestfriend, duringcreditssting]"
3,31357,"[based on novel, interracial relationship, sin...","[basedonnovel, interracialrelationship, single..."
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[babi, midlifecrisi, confid]"


### Combine genres, 3actors, director, 3keywords

In [38]:
md = md.merge(df_credits[['id', 'director', '3actors']], on='id', how='left')
md = md.merge(df_keywords[['id', '3keywords']], on='id', how='left')

md

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,year,1stgenre,description,director,3actors,3keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,1995.0,Animation,"Led by Woody, Andy's toys live happily in his ...","[johnlasseter, johnlasseter, johnlasseter]","[tomhanks, timallen, donrickles]","[jealousi, toy, boy]"
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,1995.0,Adventure,When siblings Judy and Peter discover an encha...,"[joejohnston, joejohnston, joejohnston]","[robinwilliams, jonathanhyde, kirstendunst]","[boardgam, disappear, basedonchildren'sbook]"
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,1995.0,Romance,A family wedding reignites the ancient feud be...,"[howarddeutch, howarddeutch, howarddeutch]","[waltermatthau, jacklemmon, ann-margret]","[fish, bestfriend, duringcreditssting]"
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,1995.0,Comedy,"Cheated on, mistreated and stepped on, the wom...","[forestwhitaker, forestwhitaker, forestwhitaker]","[whitneyhouston, angelabassett, lorettadevine]","[basedonnovel, interracialrelationship, single..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,1995.0,Comedy,Just when George Banks has recovered from his ...,"[charlesshyer, charlesshyer, charlesshyer]","[stevemartin, dianekeaton, martinshort]","[babi, midlifecrisi, confid]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45428,False,,0,"[Drama, Family]",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,Subdue,False,4.0,1.0,,Drama,Rising and falling between a man and woman. Ri...,"[hamidnematollah, hamidnematollah, hamidnemato...","[leilahatami, kouroshtahami, elhamkorda]",[tragiclov]
45429,False,,0,[Drama],,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,Century of Birthing,False,9.0,3.0,2011.0,Drama,An artist struggles to finish his work while a...,"[lavdiaz, lavdiaz, lavdiaz]","[angelaquino, perrydizon, hazelorencio]","[artist, play, pinoy]"
45430,False,,0,"[Action, Drama, Thriller]",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,Betrayal,False,3.8,6.0,2003.0,Action,"When one of her hits goes wrong, a professiona...","[markl.lester, markl.lester, markl.lester]","[erikaeleniak, adambaldwin, juliedupage]",[]
45431,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,Satan Triumphant,False,0.0,0.0,1917.0,,"In a small town live two brothers, one a minis...","[yakovprotazanov, yakovprotazanov, yakovprotaz...","[iwanmosschuchin, nathalielissenko, pavelpavlov]",[]


In [39]:
# Combine all features in md
md['soup'] = md['genres'] + md['director'] + md['3actors'] + md['3keywords']

# replace nan by ''
md['soup'] = md['soup'].fillna('')

# join each item in list by whitespace
md['soup'] = md['soup'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')

md['soup']

0        Animation Comedy Family johnlasseter johnlasse...
1        Adventure Fantasy Family joejohnston joejohnst...
2        Romance Comedy howarddeutch howarddeutch howar...
3        Comedy Drama Romance forestwhitaker forestwhit...
4        Comedy charlesshyer charlesshyer charlesshyer ...
                               ...                        
45428    Drama Family hamidnematollah hamidnematollah h...
45429    Drama lavdiaz lavdiaz lavdiaz angelaquino perr...
45430    Action Drama Thriller markl.lester markl.leste...
45431    yakovprotazanov yakovprotazanov yakovprotazano...
45432               daisyasquith daisyasquith daisyasquith
Name: soup, Length: 45433, dtype: object
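
The cell above concatenates four list columns and then blanks out anything that is not a list. A possible per-row alternative is sketched below, under the assumption that a missing field should simply contribute no tokens. The helpers `as_list` and `make_soup` are hypothetical names, not part of the notebook.

```python
def as_list(value):
    """Return the value if it is a list, otherwise an empty list (covers NaN and None)."""
    return value if isinstance(value, list) else []

def make_soup(genres, director, actors, keywords):
    """Join all metadata tokens into one whitespace-separated string."""
    tokens = as_list(genres) + as_list(director) + as_list(actors) + as_list(keywords)
    return " ".join(tokens)

nan = float("nan")  # stands in for a missing value after a left merge

# A movie with no keyword row still keeps its genre, director and actor tokens.
print(make_soup(["Drama"], ["lavdiaz"] * 3, ["angelaquino"], nan))
```

In the notebook this could be applied row by row, for example with `md.apply(lambda r: make_soup(r['genres'], r['director'], r['3actors'], r['3keywords']), axis=1)`, at the cost of being slower than column-wise concatenation.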

### CountVectorizer

In [40]:
# Wall time: 2.36 s
# min_df=1 keeps every term (same effect as min_df=0, which recent scikit-learn versions reject)
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(md['soup'])

count_matrix.shape

(45433, 308433)

In [41]:
# Cosine Similarity

# Wall time: 45.2 s
cosine_sim = cosine_similarity(count_matrix, count_matrix)
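
The full similarity matrix is large, which helps explain the 45 s wall time. The arithmetic below is a rough estimate only: it assumes the result is a dense array of 8-byte floats, and `top_k = 30` mirrors the number of neighbours that `get_recommendations` reads per movie.

```python
n_movies = 45433        # rows of count_matrix, from the shape printed above
bytes_per_value = 8     # assumption: similarities are stored as float64

# Dense n x n matrix of pairwise similarities.
dense_bytes = n_movies * n_movies * bytes_per_value
print(f"dense matrix: {dense_bytes / 1e9:.1f} GB")

# Alternative: keep only the top_k neighbours per movie
# (one 8-byte index plus one 8-byte score for each neighbour).
top_k = 30
topk_bytes = n_movies * top_k * (8 + bytes_per_value)
print(f"top-{top_k} neighbours only: {topk_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense matrix needs about 16.5 GB, while storing only the 30 nearest neighbours per movie needs about 21.8 MB.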

### Recreate titles & indices for get_recommendations

In [42]:
titles = md['title']
titles

0                          Toy Story
1                            Jumanji
2                   Grumpier Old Men
3                  Waiting to Exhale
4        Father of the Bride Part II
                    ...             
45428                         Subdue
45429            Century of Birthing
45430                       Betrayal
45431               Satan Triumphant
45432                       Queerama
Name: title, Length: 45433, dtype: object

In [43]:
indices = pd.Series(md.index, index = md['title'])
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45428
Century of Birthing            45429
Betrayal                       45430
Satan Triumphant               45431
Queerama                       45432
Length: 45433, dtype: int64

In [44]:
def get_recommendations(title):
    # get user searched movie index
    idx = indices[title]
    
    # if several movies share this title, indices[title] is a Series: take the first match by position
    idx = idx.iloc[0] if np.size(idx) > 1 else idx
    
    # get the cosine similarity of the movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # sort the cosine similarity
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # skip the first entry (the searched movie itself) and keep the next 30 most similar movies
    sim_scores = sim_scores[1:31]
    
    # find the index corresponding to the recommended movies
    movie_indices = [i[0] for i in sim_scores]
    
    # return the recommended movie titles by using the index
    return titles.iloc[movie_indices].head(10)

In [45]:
result2 = get_recommendations('The Dark Knight')

result2

18244    The Dark Knight Rises
10119            Batman Begins
11351             The Prestige
2465                 Following
5253                  Insomnia
25879                Doodlebug
4098                   Memento
44648                  Dunkirk
22864             Interstellar
15474                Inception
Name: title, dtype: object

* The system is able to recommend other Christopher Nolan movies
* It prioritises movies by similarity (genre, keywords, cast, crew)
* We may improve the content-based recommender by taking more text features into account (e.g. spoken_languages, production_countries)

## 3.3) Metadata & Score Based Recommender<a class="anchor" id="3.3"></a>
Besides **genre**, **keywords**, **cast** and **crew**, this recommender also takes **Score** into account

In [46]:
def improved_recommendations(title):
    # same as before: get the indices of the content-similar movies
    idx = indices[title]
    idx = idx.iloc[0] if np.size(idx) > 1 else idx
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    # look up the metadata of the content-similar movies
    movies = md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    
    # C: mean vote_average of the candidates; m: 60th percentile of their vote_count
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(0.60)
    
    # keep movies whose vote_count is at least the 60th percentile (copy to avoid SettingWithCopyWarning)
    qualified = movies[movies['vote_count'] >= m].copy()
    
    # calculate score
    qualified['score'] = qualified.apply(weighted_rating, args=(C, m), axis=1)
    
    # sort content-similar movies by score
    qualified = qualified.sort_values('score', ascending=False)
    
    return qualified.head(10)

In [47]:
result3 = improved_recommendations('The Dark Knight')

result3

Unnamed: 0,title,vote_count,vote_average,year,score
15474,Inception,14075.0,8.1,2010.0,8.03788
22864,Interstellar,11187.0,8.1,2014.0,8.022707
4098,Memento,4168.0,8.1,2000.0,7.909785
11351,The Prestige,4510.0,8.0,2006.0,7.834791
18244,The Dark Knight Rises,9263.0,7.6,2012.0,7.539828
10119,Batman Begins,7511.0,7.5,2005.0,7.434699
44648,Dunkirk,2712.0,7.5,2017.0,7.341109
5253,Insomnia,1181.0,6.8,2002.0,6.752377
12207,Hitman,982.0,5.9,2007.0,6.200426
29683,Hitman: Agent 47,1183.0,5.5,2015.0,5.90715


* The system is able to recommend other Christopher Nolan movies
* It prioritises the movies by score
* It can only suggest movies that are close to a given movie. That is, it cannot capture a user's tastes or provide recommendations across genres. So, we need Collaborative Filtering.
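
The score column comes from `weighted_rating`, which is defined earlier in the notebook. The sketch below assumes it is the IMDB-style weighted rating, $\frac{v}{v+m}R + \frac{m}{v+m}C$, where $v$ is the vote count and $R$ the vote average. The function name `weighted_rating_sketch` and the values of `C` and `m` are illustrative assumptions, not the ones computed in the notebook.

```python
def weighted_rating_sketch(vote_count, vote_average, C, m):
    """IMDB-style weighted rating: shrink a movie's average towards the mean C.

    Assumed form; the notebook's own weighted_rating may differ.
    """
    v, R = vote_count, vote_average
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (not the C and m computed inside improved_recommendations).
C, m = 7.0, 1000

many_votes = weighted_rating_sketch(14075, 8.1, C, m)  # stays close to its own average
few_votes = weighted_rating_sketch(100, 8.1, C, m)     # pulled strongly towards C
print(round(many_votes, 3), round(few_votes, 3))
```

The design point is that two movies with the same average can rank differently: the one with few votes is treated as less reliable and drifts towards the group mean.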

---
## 4) Collaborative Filtering<a class="anchor" id="collaborative"></a>

In [3]:
from surprise import Reader, Dataset, SVD, NMF, KNNBasic
from surprise.model_selection import cross_validate, GridSearchCV

In [4]:
ratings = pd.read_csv('./input/ratings.csv')

ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
...,...,...,...,...
26024284,270896,58559,5.0,1257031564
26024285,270896,60069,5.0,1257032032
26024286,270896,63082,4.5,1257031764
26024287,270896,64957,4.5,1257033990


In [7]:
ratings.userId.value_counts()

45811     18276
8659       9279
270123     7638
179792     7515
228291     7410
          ...  
114594        1
111195        1
266491        1
45691         1
177573        1
Name: userId, Length: 270896, dtype: int64

### Load data

In [5]:
# ratings in ratings.csv range from 0.5 to 5.0, so the scale starts at 0.5
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()

### Test for different algorithms using CV (find the best algorithm)

In [51]:
algorithms = [SVD(),
              NMF(),
              KNNBasic()
             ]

cv_result={}
for algorithm in algorithms:
    cv_result[algorithm.__class__.__name__] = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

cv_result

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9026  0.9008  0.9021  0.9019  0.0007  
MAE (testset)     0.6961  0.6933  0.6969  0.6954  0.0016  
Fit time          4.22    3.79    3.56    3.86    0.27    
Test time         0.38    0.28    1.64    0.77    0.62    
Evaluating RMSE, MAE of algorithm NMF on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9603  0.9612  0.9617  0.9611  0.0006  
MAE (testset)     0.7380  0.7388  0.7409  0.7392  0.0012  
Fit time          4.61    4.93    5.22    4.92    0.25    
Test time         0.21    0.16    0.19    0.18    0.02    
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1

{'SVD': {'test_rmse': array([0.90263805, 0.90084508, 0.90206854]),
  'test_mae': array([0.69611384, 0.69328406, 0.69692204]),
  'fit_time': (4.219578981399536, 3.7868263721466064, 3.559969186782837),
  'test_time': (0.3758361339569092, 0.2818269729614258, 1.6395964622497559)},
 'NMF': {'test_rmse': array([0.96029026, 0.96121951, 0.96169043]),
  'test_mae': array([0.73795466, 0.73876517, 0.74090792]),
  'fit_time': (4.605369806289673, 4.928183555603027, 5.2249884605407715),
  'test_time': (0.20604252815246582,
   0.15990877151489258,
   0.18889212608337402)},
 'KNNBasic': {'test_rmse': array([0.98396915, 0.97796982, 0.97149423]),
  'test_mae': array([0.75655947, 0.75099769, 0.74874904]),
  'fit_time': (0.14191842079162598, 0.13892054557800293, 0.13992047309875488),
  'test_time': (2.359658718109131, 1.9089040756225586, 1.948892593383789)}}

In [52]:
cv_result_df = pd.DataFrame(cv_result).T
cv_result_df = cv_result_df.applymap(lambda x: np.mean(x))   # apply np.mean to each cell

cv_result_df

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
SVD,0.901851,0.69544,3.855458,0.765753
NMF,0.961067,0.739209,4.919514,0.184948
KNNBasic,0.977811,0.752102,0.140253,2.072485


**SVD has the lowest RMSE and MAE. It is not the fastest: KNNBasic has the shortest fit time and NMF the shortest test time.**
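
RMSE and MAE, the two measures compared above, can be computed by hand on a few ratings. The helper names and the sample ratings below are made up for illustration. RMSE squares each error before averaging, so it penalises large misses more than MAE does.

```python
from math import sqrt

def rmse(true, pred):
    """Root mean squared error between two equal-length rating lists."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def mae(true, pred):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

# Made-up ratings: errors are 0.5, 0.0, 1.0 and -1.0.
true = [4.0, 3.0, 5.0, 1.0]
pred = [3.5, 3.0, 4.0, 2.0]

print(rmse(true, pred), mae(true, pred))
```

For these four ratings the RMSE is 0.75 and the MAE is 0.625.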

### GridSearch over SVD (find the best hyperparams)

In [53]:
param_grid = {'n_factors': [45, 50],   #  number of factors
              'n_epochs': [20],
              'lr_all': [0.02],        # learning rate
              'reg_all': [0.1]}        # regularization 

gs = GridSearchCV(SVD,
                  param_grid,
                  refit=True,          # refit the best model on the full dataset after the search
                  measures=['rmse'],
                  cv=3,
                  return_train_measures=True,
                  n_jobs=-1)           # use all cpu

# training
gs.fit(data)

# save best model
best_model = gs.best_estimator['rmse']

print('best_params:', gs.best_params['rmse'])
print('best_score:', gs.best_score['rmse'], '\n')

print('mean_train_score:', gs.cv_results['mean_train_rmse'][gs.best_index['rmse']])
print('std_train_score:', gs.cv_results['std_train_rmse'][gs.best_index['rmse']])
print('mean_test_score:', gs.cv_results['mean_test_rmse'][gs.best_index['rmse']])
print('std_test_score:', gs.cv_results['std_test_rmse'][gs.best_index['rmse']])

best_params: {'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.02, 'reg_all': 0.1}
best_score: 0.8858986178145648 

mean_train_score: 0.6307864480699031
std_train_score: 0.000130558439780237
mean_test_score: 0.8858986178145648
std_test_score: 0.0014121610770557443


### Prediction

In [54]:
# best_model.fit(trainset)
best_model.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.644969196636758, details={'was_impossible': False})

* uid  = 1     (user id)
* iid  = 302   (item id)
* r_ui = 3     (true rating, optional)
* est  = 2.64  (estimated rating)

In [55]:
best_model.predict(1, 302)

Prediction(uid=1, iid=302, r_ui=None, est=2.644969196636758, details={'was_impossible': False})

* Prediction works purely on the user ID and movie ID: it estimates the rating the user would give to an unseen movie, based on how other users have rated it.
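
For intuition, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors. `predict` then clips the result to the rating scale by default. The sketch below reproduces that rule in plain Python. The function name `predict_svd`, all numbers and the rating scale are made up for illustration, not values from the fitted model.

```python
def predict_svd(mu, b_u, b_i, p_u, q_i, rating_scale):
    """Estimate a rating as mu + b_u + b_i + dot(p_u, q_i), clipped to the rating scale."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, est))

scale = (0.5, 5.0)  # illustrative rating scale

# Made-up parameters: a user who rates below average, an item rated slightly above average.
est = predict_svd(mu=3.5, b_u=-0.4, b_i=0.2, p_u=[0.1, -0.3], q_i=[0.5, 0.2], rating_scale=scale)
print(round(est, 2))

# An estimate above the scale is clipped to its upper bound.
print(predict_svd(4.8, 0.5, 0.3, [0.0], [0.0], scale))
```

In a fitted Surprise model the corresponding arrays are exposed as `pu`, `qi`, `bu` and `bi`, indexed by inner ids rather than by the raw `userId` and `movieId`.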

---
## 5) Hybrid Recommender<a class="anchor" id="hybrid"></a>

In [88]:
id_map = pd.read_csv('./input/links.csv')
id_map = id_map[['movieId', 'tmdbId']]
id_map.columns = ['movieId', 'id']

id_map = id_map[id_map['id'].notnull()]
id_map['id'] = id_map['id'].astype(int)

id_map

Unnamed: 0,movieId,id
0,1,862
1,2,8844
2,3,15602
3,4,31357
4,5,11862
...,...,...
45838,176269,439050
45839,176271,111109
45840,176273,67758
45841,176275,227506


In [89]:
id_map = id_map.merge(md[['title', 'id']], on='id').set_index('title')
id_map

title,movieId,id
Toy Story,1,862
Jumanji,2,8844
Grumpier Old Men,3,15602
Waiting to Exhale,4,31357
Father of the Bride Part II,5,11862
...,...,...
Subdue,176269,439050
Century of Birthing,176271,111109
Betrayal,176273,67758
Satan Triumphant,176275,227506


In [90]:
indices_map = id_map.reset_index().set_index('id')
indices_map

id,title,movieId
862,Toy Story,1
8844,Jumanji,2
15602,Grumpier Old Men,3
31357,Waiting to Exhale,4
11862,Father of the Bride Part II,5
...,...,...
439050,Subdue,176269
111109,Century of Birthing,176271
67758,Betrayal,176273
227506,Satan Triumphant,176275


In [101]:
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45428
Century of Birthing            45429
Betrayal                       45430
Satan Triumphant               45431
Queerama                       45432
Length: 45433, dtype: int64

In [96]:
def hybrid(userId, title):
    # same as before: get the indices of the content-similar movies
    idx = indices[title]
    idx = idx.iloc[0] if np.size(idx) > 1 else idx
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    # look up the metadata (including the TMDB id) of the content-similar movies
    movies = md.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    
    # calculate estimated rating
    movies['est'] = movies['id'].apply(lambda x: best_model.predict(userId, indices_map.loc[x]['movieId']).est)
    
    # sort content-similar movies by estimated rating
    movies = movies.sort_values('est', ascending=False)
    
    return movies.head(10)

In [99]:
hybrid(1, 'The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,id,est
4098,Memento,4168.0,8.1,2000.0,77,3.24208
22864,Interstellar,11187.0,8.1,2014.0,157336,3.171741
11351,The Prestige,4510.0,8.0,2006.0,1124,3.168749
15474,Inception,14075.0,8.1,2010.0,27205,3.109525
10119,Batman Begins,7511.0,7.5,2005.0,272,2.962197
14906,Harry Brown,351.0,6.7,2009.0,25941,2.895921
9128,Thursday,84.0,7.0,1998.0,9812,2.89003
18244,The Dark Knight Rises,9263.0,7.6,2012.0,49026,2.859057
5064,The Long Good Friday,87.0,7.1,1980.0,14807,2.683143
2465,Following,363.0,7.2,1998.0,11660,2.638185


In [102]:
hybrid(300, 'The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,id,est
4098,Memento,4168.0,8.1,2000.0,77,4.374946
9128,Thursday,84.0,7.0,1998.0,9812,4.33799
11351,The Prestige,4510.0,8.0,2006.0,1124,4.326895
22864,Interstellar,11187.0,8.1,2014.0,157336,4.240085
10119,Batman Begins,7511.0,7.5,2005.0,272,4.201206
15474,Inception,14075.0,8.1,2010.0,27205,4.178889
18244,The Dark Knight Rises,9263.0,7.6,2012.0,49026,4.113329
5064,The Long Good Friday,87.0,7.1,1980.0,14807,4.104606
14906,Harry Brown,351.0,6.7,2009.0,25941,3.977914
44648,Dunkirk,2712.0,7.5,2017.0,374720,3.934325


The hybrid recommender is more personalized: for the same searched movie, it returns different recommendations, with different estimated ratings, for different users.
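
The two-stage idea can be sketched without any library: content similarity picks the candidates, and a per-user rating estimate orders them. Everything below is hypothetical. `hybrid_rerank` is an illustrative name, and `fake_predict` stands in for `best_model.predict(userId, movieId).est` using made-up ratings.

```python
def hybrid_rerank(user_id, candidates, predict, top_n=10):
    """Re-rank content-similar candidates by a user's estimated rating.

    candidates: list of (title, movie_id) pairs already chosen by content similarity.
    predict:    callable (user_id, movie_id) -> estimated rating.
    """
    scored = [(title, predict(user_id, movie_id)) for title, movie_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up estimates for two users over the same three candidate movies.
fake_estimates = {
    (1, 10): 3.2, (1, 20): 2.9, (1, 30): 3.1,
    (300, 10): 4.1, (300, 20): 4.4, (300, 30): 3.9,
}

def fake_predict(user_id, movie_id):
    """Hypothetical stand-in for the collaborative filtering model."""
    return fake_estimates[(user_id, movie_id)]

candidates = [("Movie A", 10), ("Movie B", 20), ("Movie C", 30)]
print(hybrid_rerank(1, candidates, fake_predict))
print(hybrid_rerank(300, candidates, fake_predict))
```

Both users see the same candidate set, but user 1 gets Movie A first while user 300 gets Movie B first, which is the behaviour shown in the two `hybrid` outputs above.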