# Exercise 2 - Movie Recommender System

## Text Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They are typically used to recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:

- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on specific global metrics and thresholds such as movie popularity or global ratings.
- Content-based Recommenders: These recommend entities similar to a specific entity of interest, using content metadata such as movie descriptions, genre, cast, director and so on.
- Collaborative filtering Recommenders: Here we don't need metadata; instead, we try to predict recommendations and ratings from the past ratings that different users gave to specific items.
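
To make the first approach concrete, here is a minimal sketch of a rule-based recommender. The five rows are copied from the first rows of the TMDB dataset used in this notebook; the function name `rule_based_top_n` and the `min_votes` and `top_n` thresholds are illustrative assumptions, not part of the exercise.

```python
# (title, popularity, vote_average, vote_count) for the first five dataset rows
movies = [
    ("Avatar", 150.437577, 7.2, 11800),
    ("Pirates of the Caribbean: At World's End", 139.082615, 6.9, 4500),
    ("Spectre", 107.376788, 6.3, 4466),
    ("The Dark Knight Rises", 112.31295, 7.6, 9106),
    ("John Carter", 43.926995, 6.1, 2124),
]

def rule_based_top_n(rows, min_votes=3000, top_n=3):
    """Keep movies with enough votes, then rank them by average rating.

    min_votes and top_n are hypothetical thresholds chosen for illustration.
    """
    eligible = [row for row in rows if row[3] >= min_votes]
    ranked = sorted(eligible, key=lambda row: row[2], reverse=True)
    return [row[0] for row in ranked[:top_n]]

top_movies = rule_based_top_n(movies)
print(top_movies)
```

Every user gets the same list here, which is the main limitation of rule-based recommenders and the reason the rest of this exercise uses a content-based approach.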

__We will build a movie recommendation system that uses the data/metadata of different movies to recommend similar movies of interest!__

In this exercise we will apply the concepts learnt in the Week 1 tutorials. Let's get started!

In [1]:
!nvidia-smi

Fri Mar 26 01:17:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Load Data
If you are using Google Colab, use the upload button under the 'Files' icon in the left pane to upload the tmdb_5000_movies.csv.gz dataset.

In [2]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

## View the top few rows of the dataframe

In [3]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
column_list = ['title', 'tagline', 'overview', 'genres', 'popularity']
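# copy the selected columns so later assignments do not act on a view of the original frame
df = df[column_list].copy()
# assign back instead of using a chained inplace fillna
df['tagline'] = df['tagline'].fillna('')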

In [5]:
df['description'] = df['tagline'].map(str) + ' ' + df['overview'].map(str)

In [6]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   genres       4800 non-null   object 
 4   popularity   4800 non-null   float64
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


# __Question 1:__ Simple Text Preprocessing

The first step is to clean the text columns: we normalize them here before extracting features from them.

In [14]:
from tqdm import tqdm
import re

def normalize_corpus(docs):
    norm_docs = []
    for doc in tqdm(docs):
        # use regex to remove special characters (whitespace is kept as-is)
        doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)

        # lower case  
        doc = doc.lower()

        # store in new list
        norm_docs.append(doc)

    return norm_docs
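
As a quick check of what this normalization does, the sketch below applies the same two steps (regex filter, then lower-casing) to single strings taken from the dataset preview above. The helper name `normalize_text` is an assumption made for this illustration.

```python
import re

def normalize_text(doc):
    # same regex and flags as normalize_corpus, applied to one string
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I | re.A)
    return doc.lower()

tagline = normalize_text("At the end of the world, the adventure begins.")
title = normalize_text("Pirates of the Caribbean: At World's End")
accented = normalize_text("Français")

print(tagline)
print(title)
print(accented)
```

Note that the pattern only keeps ASCII letters, digits and whitespace, so accented characters such as `ç` are dropped rather than transliterated, and apostrophes vanish without leaving a space.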

### Preprocess the __description__ column

In [15]:
norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

100%|██████████| 4800/4800 [00:00<00:00, 73428.35it/s]


4800

In [16]:
movies_list = df['title'].values
movies_list, movies_list.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
        ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
        'My Date with Drew'], dtype=object), (4800,))

## Movie Recommendation with Embeddings
Let us use sentence-level embeddings to compute movie similarity. Here we will use the **Universal Sentence Encoder (USE)** model with its pretrained weights.


# __Question 2:__  Use ``TensorFlow Hub`` to get embeddings using Universal Sentence Encoder for the normalized corpus

In [17]:
import tensorflow_hub as hub

### The following may take some time to load

In [18]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(module_url)

In [19]:
movie_embeddings = embed(norm_corpus)

movie_embeddings, movie_embeddings.shape

(<tf.Tensor: shape=(4800, 512), dtype=float32, numpy=
 array([[-0.05723435, -0.04214483, -0.03309576, ..., -0.04181182,
          0.07057297, -0.06043932],
        [-0.02595232, -0.05021415, -0.04850538, ...,  0.01649479,
          0.09187799,  0.01649959],
        [ 0.03469853, -0.00659053, -0.04887958, ..., -0.01806329,
          0.0558591 ,  0.04785299],
        ...,
        [-0.06091747, -0.07319183, -0.05980853, ..., -0.0440587 ,
          0.05961612,  0.03807229],
        [ 0.02981014, -0.04371533,  0.01389753, ..., -0.06064739,
          0.0625461 , -0.07192577],
        [ 0.02773671,  0.02169432,  0.01284114, ..., -0.03256613,
          0.07879637,  0.00417032]], dtype=float32)>, TensorShape([4800, 512]))

# __Question 3:__ Get Movie Similarity Scores

We will again use cosine similarity, this time to generate pairwise similarity scores from the universal sentence embeddings of each movie description.
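
For intuition, cosine similarity between two vectors is their dot product divided by the product of their lengths. The dependency-free sketch below computes it for toy 3-dimensional vectors; the helper `cosine_sim` and the vectors are illustrative assumptions, while scikit-learn's `cosine_similarity` computes the same quantity for every pair of rows in a matrix.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, different length
c = [-3.0, 0.0, 1.0]   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

sim_ab = cosine_sim(a, b)
sim_ac = cosine_sim(a, c)
print(sim_ab, sim_ac)
```

Because the score depends only on direction, two descriptions of very different length can still be rated as highly similar.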

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

### Complete the following snippet to prepare a dataframe of pair-wise similarity of different movies

In [21]:
doc_sim = cosine_similarity(movie_embeddings)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.37031,0.27023,0.303317,0.490067,0.300835,0.216841,0.339045,0.125727,0.359996,0.299888,0.27968,0.268741,0.221823,0.369573,0.217014,0.315118,0.365516,0.296259,0.07829,0.224455,0.242932,0.031772,0.265121,0.319318,0.297341,0.30964,0.535508,0.200572,0.290005,0.171498,0.22886,0.316474,0.219332,0.026685,0.258095,0.409237,0.126592,0.090387,0.221365,...,0.159003,0.210327,0.240855,0.10986,0.147659,0.108712,0.267276,0.153641,0.177927,0.17799,0.149592,0.147655,0.140445,0.283661,0.254252,0.191211,0.227098,0.217724,0.138087,0.155133,0.198775,0.142843,0.119446,0.239787,0.217328,0.243447,0.113469,0.168782,0.17697,0.168851,0.225016,0.158125,0.211958,0.238056,0.334394,0.24699,0.159282,0.337353,0.296578,0.199609
1,0.37031,1.0,0.273559,0.269835,0.46766,0.254533,0.205088,0.423249,0.336953,0.277845,0.385039,0.298162,0.514984,0.184884,0.340919,0.338991,0.276438,0.453332,0.456182,0.264194,0.264794,0.411272,0.338335,0.357643,0.338843,0.406974,0.229609,0.43757,0.259389,0.305183,0.224193,0.276285,0.296924,0.15965,0.175197,0.311526,0.352198,0.318858,0.271062,0.32191,...,0.188163,0.120696,0.21963,0.27882,0.159315,0.078154,0.232236,0.212466,0.270001,0.185166,0.126465,0.032557,0.25631,0.109888,0.167769,0.19253,0.195525,0.121543,0.090885,0.149044,0.155588,0.194115,0.092071,0.214344,0.261706,0.161937,0.116368,0.072176,0.193571,0.078899,0.130996,0.12449,0.145686,0.127262,0.171083,0.142741,0.136436,0.206349,0.278405,0.286379
2,0.27023,0.273559,1.0,0.361671,0.362316,0.31379,0.410651,0.3634,0.307635,0.337055,0.41304,0.564182,0.328071,0.313736,0.307849,0.306189,0.410696,0.350931,0.405821,0.167056,0.366334,0.327678,0.157384,0.310658,0.2958,0.220914,0.253636,0.365689,0.114588,0.584885,0.284819,0.328625,0.327941,0.170136,0.201661,0.346827,0.31884,0.230602,0.311371,0.378564,...,0.267528,0.297702,0.29695,0.229227,0.35229,0.226862,0.347788,0.14607,0.211539,0.411881,0.108722,0.173397,0.163943,0.26693,0.255895,0.369495,0.212622,0.285534,0.093386,0.319064,0.358788,0.261349,0.217676,0.320932,0.308553,0.331278,0.128674,0.095174,0.195017,0.308913,0.354539,0.013209,0.231663,0.239572,0.247505,0.228359,0.13643,0.391457,0.333736,0.18794
3,0.303317,0.269835,0.361671,1.0,0.369392,0.420292,0.312396,0.439125,0.309762,0.604672,0.458128,0.426594,0.330067,0.477312,0.229705,0.324304,0.389298,0.33253,0.395817,0.230356,0.353617,0.415113,0.197593,0.223433,0.378224,0.226113,0.391815,0.30545,0.217988,0.396326,0.4371,0.325928,0.311563,0.220026,0.229926,0.253101,0.29854,0.332662,0.299922,0.357311,...,0.332407,0.365379,0.415144,0.185083,0.345054,0.170273,0.428435,0.237021,0.192488,0.334642,0.167639,0.109335,0.192922,0.245478,0.351271,0.277548,0.299203,0.262908,0.150187,0.286641,0.252908,0.185427,0.196216,0.221006,0.447648,0.351993,0.093106,-0.023532,0.214032,0.408449,0.323629,0.151408,0.314587,0.140397,0.327477,0.349046,0.080062,0.382139,0.365542,0.243153
4,0.490067,0.46766,0.362316,0.369392,1.0,0.353694,0.383349,0.540597,0.31896,0.417028,0.468761,0.316004,0.354037,0.291076,0.47386,0.386948,0.43243,0.40852,0.418897,0.336631,0.385106,0.383677,0.283964,0.324472,0.455083,0.371649,0.409955,0.533266,0.305571,0.399661,0.299656,0.428817,0.405242,0.283335,0.187262,0.329307,0.495405,0.324113,0.21914,0.45393,...,0.306446,0.31518,0.376574,0.15778,0.271571,0.164579,0.322216,0.268848,0.198748,0.303838,0.074708,0.172849,0.141996,0.25926,0.399254,0.335308,0.281231,0.252165,0.108472,0.307709,0.299894,0.341389,0.16869,0.323875,0.294609,0.277634,0.137704,0.212714,0.259115,0.24203,0.254525,0.081997,0.184798,0.220947,0.333953,0.337351,0.194649,0.450151,0.394229,0.263465


# __Question 4:__  Movie Recommender

Build a recommendation utility function, similar to the one from Week 1, that returns the top 5 movies most similar to a given movie.

In [25]:
import numpy as np

def movie_recommender(movie_title, movies=movies_list, doc_sims=None):
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]

    # get movie similarities
    # Hint: the movie index selects the matching row
    movie_similarities = doc_sims.iloc[movie_idx].values
    
    # get top 5 similar movie IDs, skipping position 0 (normally the movie itself, similarity 1.0)
    # Hint: use a numpy utility to do the sort
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    
    # get top 5 movies
    similar_movies = movies[similar_movie_idxs]
    
    # return the top 5 movies
    return similar_movies
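
The ranking logic can be tried on a small scale with the sketch below. It is a dependency-free variant, not the function above: the name `recommend` and the parameter `k` are assumptions, and it excludes the query movie explicitly instead of relying on it sorting into position 0. The titles and similarity values are the first five rows and columns of the similarity matrix shown earlier.

```python
titles = [
    "Avatar",
    "Pirates of the Caribbean: At World's End",
    "Spectre",
    "The Dark Knight Rises",
    "John Carter",
]

# top-left 5x5 corner of the pairwise similarity matrix printed above
sims = [
    [1.0,      0.37031,  0.27023,  0.303317, 0.490067],
    [0.37031,  1.0,      0.273559, 0.269835, 0.46766],
    [0.27023,  0.273559, 1.0,      0.361671, 0.362316],
    [0.303317, 0.269835, 0.361671, 1.0,      0.369392],
    [0.490067, 0.46766,  0.362316, 0.369392, 1.0],
]

def recommend(title, titles, sims, k=2):
    query_idx = titles.index(title)  # raises ValueError for an unknown title
    row = sims[query_idx]
    # drop the query movie itself, then rank the rest by similarity
    candidates = [i for i in range(len(row)) if i != query_idx]
    candidates.sort(key=lambda i: row[i], reverse=True)
    return [titles[i] for i in candidates[:k]]

avatar_recs = recommend("Avatar", titles, sims)
spectre_recs = recommend("Spectre", titles, sims)
print(avatar_recs)
print(spectre_recs)
```

Since only five movies are considered, these results differ from what the full recommender returns over all 4800 movies.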

# __Question 5:__ Complete the following snippet to get movie recommendations

In [27]:
popular_movies = ['Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [28]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Interstellar
Top 5 recommended Movies: ['Event Horizon' 'Gattaca' 'Space Battleship Yamato' 'Lost in Space'
 'Star Trek IV: The Voyage Home']

Movie: Deadpool
Top 5 recommended Movies: ['American Hero' 'Hancock'
 'Teenage Mutant Ninja Turtles: Out of the Shadows'
 'X-Men Origins: Wolverine' 'The Expendables 3']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'Walking With Dinosaurs'
 'Sea Rex 3D: Journey to a Prehistoric World'
 'The Lost World: Jurassic Park' 'The Land Before Time']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ['The Pirates! In an Adventure with Scientists!'
 'Pirates of the Caribbean: On Stranger Tides' 'Waterworld'
 "Pirates of the Caribbean: Dead Man's Chest"
 "VeggieTales: The Pirates Who Don't Do Anything"]

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies: ['Battle for the Planet of the Apes' 'The 5th Wave'
 'The Day the Earth Stood Still' 'Beneath the Planet of the Apes'
 'Soldie