# Exercise 1 - Movie Recommender System with FastText Embeddings

## Text Similarity

Recommender systems are one of the popular and most adopted applications of machine learning. They are typically used to recommend entities to users and these entites can be anything like products, movies, services and so on.

Popular examples of recommendations include,

- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies\shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on specific global metrics and thresholds like movie popularity, global ratings etc.
- Content-based Recommenders: This is based on providing similar entities based on a specific entity of interest. Content metadata can be used here like movie descriptions, genre, cast, director and so on
- Collaborative filtering Recommenders: Here we don't need metadata but we try to predict recommendations and ratings based on past ratings of different users and specific items.

__We will be building a movie recommendation system here where based on data\metadata pertaining to different movies, we try and recommend similar movies of interest!__

With this exercise we will learn how to apply concepts learnt through tutorials of week1. Let's get started

___Fill in the blanks \ areas of code snippet with `<YOUR CODE HERE>` in the following code cells___

## Load Data

If you are using google colab please use the upload file button option from the 'Files' icon on the left pane to upload the `tmdb_5000_movies.csv.gz` dataset. 

In [1]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

### **Question 1**: **View** top few rows of the dataframe (1 point)

In [2]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
column_list = ['title', 'tagline', 'overview', 'genres', 'popularity']
df = df[column_list]
df.tagline.fillna('', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,


### **Question 2**: Merge text from tagline column with text from overview column (1 point)

In [4]:
df['description'] = df['tagline'].map(str) + ' ' + df['overview'].map(str)

In [5]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   genres       4800 non-null   object 
 4   popularity   4800 non-null   float64
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


## Text Preprocessing

First step is to prepare the text columns for analysis. In this section we will prepare textual columns before we extract features from them

In [6]:
import nltk

In [7]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### **Question 3**: Complete the text normalization utility function (2 points)

In [8]:
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

In [9]:
def normalize_document(doc):
    # remove special characters\whitespaces, ignore case
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, re.IGNORECASE|re.A)

    # lower case  
    doc = doc.lower()

    # remove whitespaces
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()

    # tokenize document
    tokens = nltk.word_tokenize(doc)

    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # re-create/merge sentences from filtered content
    doc = ' '.join(filtered_tokens)
    return doc

In [10]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

4800

In [11]:
movies_list = df['title'].values
movies_list, movies_list.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
        ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
        'My Date with Drew'], dtype=object), (4800,))

## Movie Recommendation with Embeddings
We used count based features in a similar assignment in the first course. Can we use word embeddings and then compute movie similarity? We definitely can! Here we will use the FastText model and train it on our corpus.

### **Question 4**: Use ``gensim`` to train a FastText model on the normalized corpus (1 point)

You can keep:

- the embedding size to be 300
- context to be around 30
- min word count to be 2 (feel free to try more if needed as a filter)
- use a skipgram model
- iterations can be 50 (reduce it if it takes too long)

This might take a while to train!

In [12]:
from gensim.models import FastText

# iterate normalized corpus and split
tokenized_docs = [nltk.word_tokenize(doc) for doc in norm_corpus]

# Set values for various parameters
feature_size = 300   # Set Word embedding dimensionality 
window_context = 30  # Set Context window size                                                                                  
min_word_count = 2   # Set Minimum word count                    
sg = 1               # set skip-gram model flag

# train FastText model
ft_model = FastText(tokenized_docs, size=feature_size, 
                     window=window_context, min_count = min_word_count,
                     sg=sg, iter=50)

##Generate document level embeddings

Word embedding models give us an embedding for each word, how can we use it for downstream ML\DL tasks? one way is to flatten it or use sequential models. A simpler approach is to average all word embeddings for words in a document and generate a fixed-length document level emebdding

### **Question 5**: Complete the following utility to prepare document vectors by averaging word vectors (3 points)

In [13]:
def average_word_vectors(words, model, vocabulary, num_features):
    
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    
    for word in words:
        if word in vocabulary: 
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])
    
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector

def averaged_word2vec_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

In [14]:
doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs, ft_model, 300)
doc_vecs_ft.shape

  if __name__ == '__main__':


(4800, 300)

## Get Movie Recommendations

Recommendations in its most simplest form is a method of identifying items which are most similar to given user's preferences. In this scenario we use a content based recommendation system which tries to find similar movies based on the movie content i.e. description.

To identify similar items, we will use pairwise similarity measure called **cosine similarity**

We will leverage cosine similarity to generate recommendations

### **Question 6**: Complete the following snippet to prepare a dataframe of pair-wise cosine similarity of different movies (1 point)

Create pairwise cosine similarity based on the document embeddings

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

In [18]:
doc_sim = cosine_similarity(doc_vecs_ft)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.544096,0.574028,0.586273,0.586716,0.513154,0.52478,0.585586,0.484311,0.556266,0.516617,0.545262,0.427396,0.474566,0.582865,0.498785,0.545152,0.52339,0.532972,0.51947,0.530975,0.498208,0.47592,0.471106,0.500777,0.494523,0.614883,0.643992,0.461723,0.583654,0.519035,0.557089,0.52465,0.500724,0.488408,0.566384,0.649833,0.429932,0.520832,0.52203,...,0.512999,0.523415,0.520055,0.466202,0.474431,0.438843,0.483846,0.498108,0.50244,0.532637,0.461943,0.555855,0.487958,0.516169,0.545966,0.503349,0.542265,0.528426,0.424817,0.547031,0.609431,0.513883,0.536077,0.501146,0.52551,0.468985,0.455308,0.417074,0.466493,0.524026,0.509253,0.498762,0.549868,0.577909,0.547308,0.494403,0.429733,0.552062,0.482813,0.505799
1,0.544096,1.0,0.53078,0.515946,0.623231,0.551669,0.565873,0.612478,0.57652,0.586182,0.512775,0.542721,0.593563,0.571942,0.644757,0.622417,0.609347,0.641814,0.635311,0.521585,0.601757,0.559315,0.549339,0.596379,0.547089,0.614159,0.603092,0.640145,0.533429,0.552545,0.579618,0.545391,0.587102,0.501398,0.532179,0.575488,0.592412,0.618342,0.58759,0.547346,...,0.626331,0.552716,0.529888,0.546989,0.57089,0.532375,0.532692,0.529679,0.55147,0.569802,0.526461,0.466845,0.581284,0.53993,0.613684,0.570754,0.584245,0.5246,0.517972,0.624898,0.593577,0.618915,0.515012,0.561141,0.539312,0.518334,0.505797,0.468164,0.566546,0.529691,0.584704,0.530688,0.585371,0.559174,0.556612,0.550232,0.449113,0.604774,0.55544,0.559372
2,0.574028,0.53078,1.0,0.587342,0.536151,0.521857,0.533096,0.592139,0.586232,0.560852,0.53162,0.721949,0.516257,0.556277,0.586271,0.528687,0.592197,0.599767,0.645568,0.555127,0.592534,0.512072,0.555483,0.552103,0.500818,0.487613,0.562732,0.567708,0.485878,0.720686,0.558431,0.556814,0.592571,0.545036,0.522728,0.658379,0.604158,0.536271,0.618803,0.554552,...,0.540455,0.604253,0.541938,0.499872,0.626494,0.477896,0.561558,0.502952,0.516438,0.590364,0.486799,0.476754,0.520751,0.53644,0.580422,0.582873,0.520939,0.580172,0.490341,0.546288,0.625191,0.539644,0.492316,0.551099,0.577029,0.5459,0.506469,0.479482,0.528173,0.556342,0.59344,0.487794,0.564246,0.545383,0.561137,0.514568,0.488869,0.53725,0.543664,0.517098
3,0.586273,0.515946,0.587342,1.0,0.553608,0.520636,0.530988,0.589412,0.564539,0.707334,0.529777,0.57956,0.480264,0.571933,0.564332,0.523172,0.536695,0.515065,0.594482,0.540421,0.550373,0.558061,0.50559,0.500313,0.55972,0.514836,0.567616,0.577624,0.467226,0.589679,0.546303,0.608629,0.590335,0.529318,0.467183,0.558114,0.622847,0.483069,0.576161,0.54316,...,0.586504,0.545612,0.571009,0.50577,0.57018,0.502594,0.557176,0.500705,0.531858,0.561728,0.532806,0.456826,0.48359,0.551554,0.56286,0.517786,0.502661,0.578146,0.561386,0.567011,0.60152,0.550621,0.480931,0.534457,0.530166,0.519822,0.521476,0.461086,0.546576,0.54573,0.546552,0.457831,0.613432,0.53375,0.551322,0.508012,0.463044,0.635904,0.563764,0.537014
4,0.586716,0.623231,0.536151,0.553608,1.0,0.498991,0.551627,0.623652,0.549274,0.570605,0.592966,0.542204,0.540633,0.576532,0.585191,0.552937,0.644237,0.607651,0.600822,0.562413,0.564523,0.515025,0.510214,0.504371,0.583872,0.564738,0.657242,0.663388,0.527729,0.592478,0.521141,0.553637,0.567834,0.571973,0.49476,0.545182,0.61703,0.44567,0.535554,0.560854,...,0.604689,0.517914,0.533691,0.474049,0.549196,0.465247,0.504387,0.483177,0.542009,0.564051,0.488866,0.462406,0.524149,0.530894,0.592476,0.49832,0.504916,0.54579,0.511028,0.546401,0.602596,0.532607,0.489795,0.561493,0.548034,0.514308,0.485106,0.511092,0.520338,0.534863,0.542865,0.479,0.563256,0.579525,0.536653,0.502785,0.446601,0.584623,0.522301,0.533792


## Step by Step Methodology for Recommendation

### **Question 7**: Get a list of Movie titles (1 point)

In [19]:
# movie titles
movies_list = df['title']
movies_list

0                                         Avatar
1       Pirates of the Caribbean: At World's End
2                                        Spectre
3                          The Dark Knight Rises
4                                    John Carter
                          ...                   
4798                                 El Mariachi
4799                                   Newlyweds
4800                   Signed, Sealed, Delivered
4801                            Shanghai Calling
4802                           My Date with Drew
Name: title, Length: 4800, dtype: object

### **Question 8**: Given a movie title, get its index value (1 point)

Here let's get the ID for the movie __Minions__

__Hint:__ Numpy has dedicated functions to find the index from a numpy array or you can use list indexing functions also. The output should be a number

In [20]:
## movie ID
movie_idx = np.where(movies_list == 'Minions')
movie_idx

(array([546]),)

## Get Similar Movies

We already calculated pairwise similarity between all movies in our dataset. Next step is to extract moview similar to a given movie.

Let us use the movie _Minions_ at index _546_ to find some similar movies using ``doc_sim_df`` dataframe

### **Question 9**: Extract row of dataframe given an index (1 point)

In [21]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

array([[0.47014731, 0.53810713, 0.53270891, ..., 0.51057078, 0.50204268,
        0.54564011]])

### Top Similar Movies

### **Question 10**: Get top 5 most similar movies in descending order of similarity (1 point)

_hint: In descending order the index 0 represents the movie itself (as a movie description is 100% similar to itself, so it is safe to skip index 0_

#### Get top 5 movie IDs

In [22]:
similar_movie_idxs = np.argsort(movie_similarities)
similar_movie_idxs = similar_movie_idxs[0][-6:-1][::-1]
similar_movie_idxs

array([ 614, 2825, 4568, 1358,  506])

#### Get top 5 movie names

In [23]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

614                               Despicable Me
2825                                      Yentl
4568    Poultrygeist: Night of the Chicken Dead
1358      Austin Powers: The Spy Who Shagged Me
506                             Despicable Me 2
Name: title, dtype: object

## Movie Recommender

Time to make use of all the smaller steps we have gone through so far to prepare a recommendation utility

### **Question 11**: Complete the utility function for getting movie recommendations (2 points)

In [24]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=None):
    # find movie id
    movie_idx = np.where(movies_list == movie_title)

    # get movie similarities. 
    #Hint: movie index helps find the exact row
    movie_similarities = doc_sim_df.iloc[movie_idx].values
    
    # get top 5 similar movie IDs
    # Hint: use numpy utility to do a sort
    similar_movie_idxs = np.argsort(movie_similarities)
    similar_movie_idxs = similar_movie_idxs[0][-6:-1][::-1]
    
    # get top 5 movies
    similar_movies = movies_list[similar_movie_idxs]
    
    # return the top 5 movies
    return similar_movies

### Find Similar Movies

In [25]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [26]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Minions
Top 5 recommended Movies: 614                               Despicable Me
2825                                      Yentl
4568    Poultrygeist: Night of the Chicken Dead
1358      Austin Powers: The Spy Who Shagged Me
506                             Despicable Me 2
Name: title, dtype: object

Movie: Interstellar
Top 5 recommended Movies: 220                     Prometheus
1352                       Gattaca
3625    Keeping Up with the Steins
1645                      The Cave
300              Starship Troopers
Name: title, dtype: object

Movie: Deadpool
Top 5 recommended Movies: 4426       A Fine Step
242     Fantastic Four
1308            Enough
30        Spider-Man 2
5         Spider-Man 3
Name: title, dtype: object

Movie: Jurassic World
Top 5 recommended Movies: 675                     Jurassic Park
508     The Lost World: Jurassic Park
334                 Jurassic Park III
2527      National Lampoon's Vacation
1536                         Vacation
Name: title, dtype: