# Exercise 1 - Movie Recommender System with FastText Embeddings

## Text Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:

- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies and shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on global metrics and thresholds such as movie popularity or global ratings.
- Content-based Recommenders: Recommend entities similar to a specific entity of interest, using content metadata such as movie descriptions, genre, cast, director and so on.
- Collaborative filtering Recommenders: Need no metadata; instead they predict recommendations and ratings from the past ratings that different users gave to specific items.
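
To make the first category concrete, here is a minimal sketch of a rule-based recommender. The toy titles and numbers are made up for illustration, and `top_rated`, `min_votes` and `n` are hypothetical names; only the `vote_average` and `vote_count` column names come from the dataset used in this exercise.

```python
import pandas as pd

# Toy data; titles and numbers are invented for illustration only.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [8.1, 6.2, 7.9, 9.0],
    "vote_count": [1200, 5000, 40, 3000],
})

def top_rated(frame, min_votes=100, n=2):
    """Return the n best-rated titles that have at least min_votes votes."""
    # Global threshold: ignore movies with too few votes to trust the rating.
    eligible = frame[frame["vote_count"] >= min_votes]
    # Global metric: rank the remaining movies by average rating.
    ranked = eligible.sort_values("vote_average", ascending=False)
    return ranked["title"].head(n).tolist()

recommendations = top_rated(movies)
print(recommendations)
```

Every user gets the same list here, which is the main limitation of rule-based recommenders and the reason the rest of this exercise uses a content-based approach.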

__We will build a movie recommendation system here: based on data and metadata about different movies, we try to recommend similar movies of interest!__

In this exercise we will apply the concepts covered in the week 1 tutorials. Let's get started.

## Load Data

If you are using Google Colab, use the upload button under the 'Files' icon in the left pane to upload the `tmdb_5000_movies.csv.gz` dataset.

In [1]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

### **Question 1**: **View** top few rows of the dataframe

In [2]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
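column_list = ['title', 'tagline', 'overview', 'genres', 'popularity']
# copy the selection so later assignments do not act on a slice of the original frame
df = df[column_list].copy()
# missing taglines become empty strings so they can be concatenated safely
df['tagline'] = df['tagline'].fillna('')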

### **Question 2**: Merge text from tagline column with text from overview column

In [4]:
df['description'] = df['tagline'].map(str) + ' ' + df['overview'].map(str)
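
One detail worth checking when concatenating with `.map(str)`: a missing value is converted to the literal string `'nan'` rather than staying missing. The sketch below uses a made-up two-row frame (`toy`, `naive` and `safe` are hypothetical names) to compare that behaviour with a `fillna('')` variant.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has no tagline and no overview.
toy = pd.DataFrame({
    "tagline": ["Short tagline.", np.nan],
    "overview": ["Longer overview.", np.nan],
})

# .map(str) turns NaN into the string 'nan', so the result is never null.
naive = toy["tagline"].map(str) + " " + toy["overview"].map(str)

# Filling missing values first keeps empty descriptions genuinely empty.
safe = (toy["tagline"].fillna("") + " " + toy["overview"].fillna("")).str.strip()

print(repr(naive[1]))
print(repr(safe[1]))
```

In this notebook the rows with a missing `overview` are still removed, because the `dropna` call in the next cell sees the `NaN` that remains in the `overview` column itself.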

In [5]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   genres       4800 non-null   object 
 4   popularity   4800 non-null   float64
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


## Text Preprocessing

The first step is to prepare the text columns for analysis. In this section we normalize the text before extracting features from it.

In [6]:
import nltk

In [7]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### **Question 3**: Complete the text normalization utility function

In [8]:
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

In [9]:
def normalize_document(doc):
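    # remove special characters, keeping only ASCII letters, digits and whitespace
    # (flags must be passed by keyword; the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)

    # lower case
    doc = doc.lower()

    # strip leading and trailing whitespace
    doc = doc.strip()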

    # tokenize document
    tokens = nltk.word_tokenize(doc)

    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # re-create/merge sentences from filtered content
    doc = ' '.join(filtered_tokens)
    return doc
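
To see what the normalization does to a single string without downloading any NLTK data, here is a simplified sketch. It is not the exercise's function: `normalize_toy` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked set, and tokenization is a plain whitespace split instead of `nltk.word_tokenize`. Note that `flags` is passed by keyword, because the fourth positional parameter of `re.sub` is `count`.

```python
import re

# Tiny hypothetical stop-word list; the exercise uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "of", "a", "is", "to", "in"}

def normalize_toy(doc, stop_words=TOY_STOP_WORDS):
    # Drop everything except ASCII letters, digits and whitespace.
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc, flags=re.I | re.A)
    # Lower-case and trim surrounding whitespace.
    doc = doc.lower().strip()
    # Whitespace tokenization (an assumption; NLTK's tokenizer is richer).
    tokens = doc.split()
    # Remove stop words and rebuild the string.
    return " ".join(t for t in tokens if t not in stop_words)

result = normalize_toy("Enter the World of Pandora.")
print(result)  # enter world pandora
```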

In [10]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

4800

In [11]:
movies_list = df['title'].values
movies_list, movies_list.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
        ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
        'My Date with Drew'], dtype=object), (4800,))

## Movie Recommendation with Embeddings
We used count based features in a similar assignment in the first course. Can we use word embeddings and then compute movie similarity? We definitely can! Here we will use the FastText model and train it on our corpus.

### **Question 4**: Use ``gensim`` to train a FastText model on the normalized corpus

You can keep:

- the embedding size to be 300
- context to be around 30
- min word count to be 2 (feel free to try more if needed as a filter)
- use a skipgram model
- iterations can be 50 (reduce it if it takes too long)

This might take a while to train!

In [12]:
from gensim.models import FastText

# iterate normalized corpus and split
tokenized_docs = [doc.split() for doc in norm_corpus]

# Set values for various parameters
feature_size = 300   # Set Word embedding dimensionality 
window_context = 30  # Set Context window size                                                                                  
min_word_count = 2   # Set Minimum word count                    
sg = 1               # set skip-gram model flag

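# train FastText model
# note: `size` and `iter` are the gensim 3.x argument names;
# gensim 4.x renamed them to `vector_size` and `epochs`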
ft_model = FastText(tokenized_docs, 
                    size=feature_size, 
                    window=window_context, 
                    min_count=min_word_count, 
                    sg=sg, 
                    iter=50)

## Generate document level embeddings

Word embedding models give us one embedding per word, so how can we use them for downstream ML/DL tasks that need one vector per document? One way is to flatten the word vectors or feed them to sequential models. A simpler approach is to average the embeddings of all words in a document, which yields a fixed-length document-level embedding.
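
The averaging idea can be illustrated on a toy vocabulary before applying it to a trained model. In this sketch the 3-dimensional vectors are invented, and `toy_vectors` and `average_vector` are hypothetical names; a plain dictionary stands in for the model's word vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a two-word vocabulary.
toy_vectors = {
    "space": np.array([1.0, 0.0, 2.0]),
    "pirate": np.array([0.0, 4.0, 2.0]),
}

def average_vector(tokens, vectors, num_features):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

# "unknown" is not in the vocabulary, so only two vectors are averaged.
doc_vec = average_vector(["space", "pirate", "unknown"], toy_vectors, 3)
print(doc_vec)  # [0.5 2.  2. ]
```

Whatever the document length, the result always has `num_features` entries, which is what makes the averaged vectors directly comparable across documents.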

### **Question 5**: Complete the following utility to prepare document vectors by averaging word vectors

In [13]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    # `index2word` is the gensim 3.x attribute; gensim 4.x calls it `index_to_key`
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

In [14]:
doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs, ft_model, 300)
doc_vecs_ft.shape

(4800, 300)

## Get Movie Recommendations

Recommendation, in its simplest form, means identifying the items most similar to a given user's preferences. Here we use a content-based recommender that finds similar movies based on movie content, i.e. the description.

To identify similar items and generate recommendations, we will use a pairwise similarity measure called **cosine similarity**.
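
Cosine similarity between two vectors is their dot product divided by the product of their lengths, so it depends on direction and not on magnitude. The following is a minimal NumPy sketch of the pairwise computation on made-up 2-dimensional vectors; `cosine_sim_matrix` is a hypothetical helper written for illustration, not the scikit-learn function used in the exercise.

```python
import numpy as np

def cosine_sim_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid division by zero: all-zero rows stay zero (an illustrative choice).
    norms[norms == 0] = 1.0
    unit = vectors / norms
    # Dot products of unit vectors are exactly the cosines of the angles.
    return unit @ unit.T

toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = cosine_sim_matrix(toy)
print(np.round(sims, 4))
```

The first two rows are orthogonal, so their similarity is 0, while the third row lies at 45 degrees to both and scores about 0.7071 against each.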

### **Question 6**: Complete the following snippet to prepare a dataframe of pair-wise cosine similarity of different movies

Create pairwise cosine similarity based on the document embeddings

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

In [16]:
doc_sim = cosine_similarity(doc_vecs_ft)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.543656,0.571502,0.576771,0.585465,0.514846,0.5481,0.593339,0.480626,0.555727,0.514425,0.529462,0.447749,0.469812,0.590477,0.509425,0.525093,0.521376,0.542951,0.520712,0.532792,0.509176,0.47691,0.453607,0.501605,0.487991,0.607755,0.647696,0.450206,0.589642,0.52175,0.561021,0.52054,0.504351,0.47667,0.57052,0.661068,0.424902,0.52581,0.522546,...,0.511775,0.50959,0.523163,0.462074,0.468663,0.437186,0.48631,0.498676,0.514734,0.52291,0.435219,0.556549,0.479761,0.517283,0.553285,0.506227,0.532386,0.533296,0.421123,0.537898,0.62577,0.51281,0.524349,0.496454,0.50348,0.470713,0.473932,0.403589,0.457871,0.520597,0.499712,0.485649,0.55202,0.574626,0.544143,0.480639,0.403255,0.536898,0.49115,0.506592
1,0.543656,1.0,0.531973,0.526304,0.61074,0.541783,0.577951,0.605901,0.589881,0.576854,0.509377,0.534877,0.600052,0.575227,0.636577,0.633299,0.601086,0.645071,0.626382,0.541157,0.596385,0.558236,0.55806,0.602392,0.553261,0.619319,0.593906,0.637502,0.525238,0.553103,0.578348,0.56038,0.575687,0.497541,0.51553,0.578095,0.60473,0.610149,0.585717,0.549419,...,0.629926,0.550752,0.521691,0.537824,0.561665,0.532062,0.531339,0.518528,0.540151,0.577129,0.51309,0.470172,0.573096,0.540939,0.622705,0.562116,0.587132,0.534118,0.508154,0.606822,0.597617,0.609799,0.511232,0.553702,0.533357,0.505693,0.482825,0.474862,0.587907,0.533585,0.581166,0.5352,0.600071,0.557906,0.541416,0.554876,0.443544,0.598936,0.556741,0.554506
2,0.571502,0.531973,1.0,0.580201,0.517781,0.529581,0.532595,0.596229,0.58324,0.565277,0.531609,0.724202,0.527497,0.546672,0.586876,0.534116,0.587137,0.606047,0.640868,0.570551,0.595836,0.527107,0.568406,0.557415,0.498409,0.494578,0.567184,0.560195,0.486927,0.713793,0.567058,0.575884,0.593751,0.550938,0.512813,0.664628,0.607573,0.548501,0.621753,0.563033,...,0.542687,0.592129,0.540951,0.489796,0.619029,0.480655,0.560138,0.505169,0.52078,0.585978,0.484433,0.483752,0.529018,0.53052,0.574858,0.582521,0.527885,0.569417,0.46571,0.559461,0.622671,0.522889,0.506316,0.547452,0.571032,0.5684,0.502715,0.490683,0.535784,0.565864,0.574146,0.478943,0.568461,0.549357,0.570718,0.520024,0.485759,0.535344,0.540606,0.519636
3,0.576771,0.526304,0.580201,1.0,0.556169,0.532214,0.542953,0.601797,0.571174,0.70134,0.520844,0.573428,0.4884,0.566145,0.566526,0.508648,0.527558,0.533691,0.590783,0.536273,0.5433,0.560087,0.508397,0.497619,0.559591,0.522373,0.575346,0.578125,0.467484,0.59091,0.552141,0.607856,0.601968,0.535657,0.457637,0.551117,0.610363,0.480101,0.573761,0.518896,...,0.58908,0.545485,0.571707,0.498981,0.563343,0.51363,0.558023,0.500063,0.5381,0.556101,0.52466,0.457818,0.482889,0.548536,0.559314,0.510231,0.487639,0.556048,0.553153,0.548332,0.604891,0.54081,0.480729,0.537616,0.521999,0.5299,0.521071,0.445925,0.528786,0.545271,0.553623,0.460783,0.614081,0.525804,0.540561,0.490932,0.467734,0.626549,0.557838,0.524413
4,0.585465,0.61074,0.517781,0.556169,1.0,0.500372,0.570013,0.623576,0.544509,0.566309,0.591789,0.537607,0.54631,0.575351,0.569946,0.544484,0.631793,0.606118,0.602823,0.558317,0.55744,0.505204,0.51334,0.498325,0.584458,0.552691,0.662019,0.66034,0.525247,0.580918,0.511814,0.55558,0.562019,0.584817,0.495729,0.538533,0.617747,0.431302,0.514509,0.549256,...,0.605153,0.524197,0.535379,0.493007,0.545192,0.462239,0.49311,0.491888,0.536544,0.5673,0.501358,0.460066,0.536477,0.525778,0.593337,0.493285,0.502796,0.532923,0.510724,0.544733,0.59595,0.536642,0.502615,0.552876,0.553793,0.498339,0.484628,0.502945,0.519216,0.527451,0.542307,0.48521,0.559869,0.574549,0.538851,0.492888,0.451816,0.600651,0.521886,0.52885


## Step by Step Methodology for Recommendation

### **Question 7**: Get a list of Movie titles

In [17]:
# movie titles
movies_list = df['title'].values
movies_list

array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
       ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
       'My Date with Drew'], dtype=object)

### **Question 8**: Given a movie title, get its index value 

Here let's get the ID for the movie __Minions__

__Hint:__ NumPy has dedicated functions for finding the index of a value in an array, or you can use list indexing functions. The output should be a number.

In [18]:
## movie ID
movie_idx = np.where(movies_list == 'Minions')[0][0]
movie_idx

546

## Get Similar Movies

We already calculated the pairwise similarity between all movies in our dataset. The next step is to extract the movies similar to a given movie.

Let us use the movie _Minions_ at index _546_ to find some similar movies using the ``doc_sim_df`` dataframe.

### **Question 9**: Extract row of dataframe given an index

In [19]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

array([0.4838717 , 0.54614931, 0.54717532, ..., 0.5131163 , 0.50184772,
       0.55177593])

### Top Similar Movies

### **Question 10**: Get top 5 most similar movies in descending order of similarity

_Hint: in descending order, index 0 represents the movie itself (a movie description is 100% similar to itself), so it is safe to skip index 0._

#### Get top 5 movie IDs

In [20]:
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
similar_movie_idxs

array([ 614, 2825, 4568, 1358,  506])
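
The `argsort` step can be traced on a small made-up similarity row. This sketch also shows an alternative to skipping position 0: filtering out the query index explicitly, which may be safer if another item happens to tie with the query at similarity 1.0. The names `sims`, `query_idx`, `order` and `top` are hypothetical.

```python
import numpy as np

# Made-up similarities of item 1 to five items (item 1 is the query itself).
sims = np.array([0.2, 1.0, 0.9, 0.4, 0.7])
query_idx = 1

# Negating the scores makes argsort return indices in descending order.
order = np.argsort(-sims)

# Drop the query by index instead of assuming it sits at position 0.
top = order[order != query_idx][:3]

print(order)  # [1 2 4 3 0]
print(top)    # [2 4 3]
```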

#### Get top 5 movie names

In [21]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

array(['Despicable Me', 'Time Bandits',
       'Rise of the Entrepreneur: The Search for a Better Way',
       'Austin Powers: The Spy Who Shagged Me', 'Despicable Me 2'],
      dtype=object)

## Movie Recommender

It is time to combine the smaller steps we have gone through so far into a single recommendation utility.

### **Question 11**: Complete the utility function for getting movie recommendations

In [22]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=None):
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]

    # get movie similarities
    # Hint: movie index helps find the exact row
    movie_similarities = doc_sims.iloc[movie_idx].values
    
    # get top 5 similar movie IDs
    # Hint: use numpy utility to do a sort
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    
    # get top 5 movies
    similar_movies = movies[similar_movie_idxs]
    
    # return the top 5 movies
    return similar_movies
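
A lookup such as `np.where(movies == movie_title)[0][0]` raises an `IndexError` when the title is not in the list. The sketch below is one possible defensive variant on toy data, not the exercise's solution: `recommend_safe`, `toy_titles`, `toy_sims` and `top_n` are hypothetical names, and returning an empty array for unknown titles is an assumed design choice.

```python
import numpy as np
import pandas as pd

# Toy titles and a made-up symmetric similarity matrix.
toy_titles = np.array(["A", "B", "C", "D"], dtype=object)
toy_sims = pd.DataFrame([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.6],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.6, 0.5, 1.0],
])

def recommend_safe(title, titles, sims, top_n=2):
    """Return the top_n most similar titles, or an empty array if unknown."""
    matches = np.where(titles == title)[0]
    if matches.size == 0:
        # Unknown title: return nothing instead of raising IndexError.
        return np.array([], dtype=object)
    idx = matches[0]
    row = sims.iloc[idx].values
    # Sort descending and exclude the query movie itself.
    order = np.argsort(-row)
    order = order[order != idx][:top_n]
    return titles[order]

recs = recommend_safe("B", toy_titles, toy_sims)
missing = recommend_safe("Z", toy_titles, toy_sims)
print(recs)     # ['A' 'D']
print(missing)  # []
```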

### Find Similar Movies

In [23]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [24]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me' 'Time Bandits'
 'Rise of the Entrepreneur: The Search for a Better Way'
 'Austin Powers: The Spy Who Shagged Me' 'Despicable Me 2']

Movie: Interstellar
Top 5 recommended Movies: ['Prometheus' 'Gattaca' 'Starship Troopers'
 'Sea Rex 3D: Journey to a Prehistoric World' 'Space Cowboys']

Movie: Deadpool
Top 5 recommended Movies: ['Banshee Chapter' 'Fantastic Four' 'Enough' 'The Hunted' 'Spider-Man 3']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'The Lost World: Jurassic Park' 'Jurassic Park III'
 "National Lampoon's Vacation" 'Walking With Dinosaurs']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ['Pirates of the Caribbean: On Stranger Tides'
 'The Pirates! In an Adventure with Scientists!'
 "Pirates of the Caribbean: Dead Man's Chest"
 'In the Name of the King III' 'American Ninja 2: The Confrontation']

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies