# Exercise 1 - Movie Recommender System with FastText Embeddings

## Text Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They are typically used to recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:

- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on specific global metrics and thresholds like movie popularity, global ratings etc.
- Content-based Recommenders: These recommend entities similar to a specific entity of interest, using content metadata such as movie descriptions, genre, cast, director and so on.
- Collaborative filtering Recommenders: These need no metadata; instead, they predict recommendations and ratings from the past ratings that different users gave to specific items.

__Here we will build a movie recommendation system that uses data/metadata about different movies to recommend similar movies of interest!__

In this exercise we will apply the concepts learnt in the week 1 tutorials. Let's get started!

___Fill in the blanks / areas of the code snippets marked with `<YOUR CODE HERE>` in the following code cells___

## Load Data

If you are using Google Colab, use the upload button under the 'Files' icon in the left pane to upload the `tmdb_5000_movies.csv.gz` dataset.

In [None]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

### **Question 1**: **View** the top few rows of the dataframe (1 point)

In [None]:
df._______()

In [None]:
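column_list = ['title', 'tagline', 'overview', 'genres', 'popularity']
# take an explicit copy so later assignments do not act on a view of the original frame
df = df[column_list].copy()
df['tagline'] = df['tagline'].fillna('')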

### **Question 2**: Merge text from tagline column with text from overview column (1 point)

In [None]:
df['description'] = df['tagline'].map(str) + ' ' + ________

In [None]:
df.dropna(inplace=True)
df.info()

## Text Preprocessing

The first step is to prepare the text columns for analysis. In this section we will clean and normalize the text before extracting features from it.

In [None]:
import nltk

In [None]:
nltk.download('stopwords')
nltk.download('punkt')

### **Question 3**: Complete the text normalization utility function (2 points)

In [None]:
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

In [None]:
def normalize_document(doc):
    # remove special characters/whitespaces, ignore case
    doc = <YOUR CODE HERE>

    # lower case  
    doc = <YOUR CODE HERE>

    # remove whitespaces
    doc = <YOUR CODE HERE>

    # tokenize document
    tokens = <YOUR CODE HERE>

    # filter stopwords out of document
    filtered_tokens = <YOUR CODE HERE>

    # re-create/merge sentences from filtered content
    doc = <YOUR CODE HERE>
    return doc

In [None]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

In [None]:
movies_list = df['title'].values
movies_list, movies_list.shape

## Movie Recommendation with Embeddings
We used count-based features in a similar assignment in the first course. Can we use word embeddings instead and then compute movie similarity? We definitely can! Here we will use the FastText model and train it on our corpus.

### **Question 4**: Use ``gensim`` to train a FastText model on the normalized corpus (1 point)

You can use the following settings:

- an embedding size of 300
- a context window of around 30
- a minimum word count of 2 (feel free to try a higher value as a filter)
- a skip-gram model
- 50 iterations (reduce this if training takes too long)

This might take a while to train!

In [None]:
from gensim.models import FastText

# iterate normalized corpus and split
tokenized_docs = <YOUR CODE HERE>

# Set values for various parameters
feature_size = <YOUR CODE HERE>    # word embedding dimensionality
window_context = <YOUR CODE HERE>  # context window size
min_word_count = <YOUR CODE HERE>  # minimum word count
sg = <YOUR CODE HERE>              # skip-gram model flag

# train FastText model
ft_model = <YOUR CODE HERE>

## Generate Document-Level Embeddings

Word embedding models give us an embedding for each word. How can we use these for downstream ML/DL tasks? One way is to flatten them or use sequential models. A simpler approach is to average the embeddings of all words in a document to generate a fixed-length document-level embedding.
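To make the averaging idea concrete, here is a minimal sketch on toy data. It is not the solution to the question below: the dictionary `toy_vectors`, its values and the helper `average_vectors` are made up for illustration, and a plain dictionary stands in for a trained model.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; the words and values are invented for illustration.
toy_vectors = {
    "space":  np.array([1.0, 0.0, 2.0]),
    "pirate": np.array([0.0, 4.0, 2.0]),
    "robot":  np.array([2.0, 2.0, 2.0]),
}

def average_vectors(tokens, vectors, num_features):
    """Average the vectors of the tokens found in `vectors`.

    Returns a zero vector when none of the tokens is known.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

# "unknown" has no vector, so only "space" and "robot" contribute to the average
doc_vec = average_vectors(["space", "robot", "unknown"], toy_vectors, 3)
empty_vec = average_vectors(["unknown"], toy_vectors, 3)
print(doc_vec)
```

Whatever the document length, the result has `num_features` entries, which is what makes it usable as a fixed-length input. Note that FastText can also compose vectors for unseen words from character n-grams, so a version built on a real model may not need to skip unknown tokens.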

### **Question 5**: Complete the following utility to prepare document vectors by averaging word vectors (3 points)

In [None]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    <YOUR CODE HERE>

In [None]:
doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs, ft_model, 300)
doc_vecs_ft.shape

## Get Movie Recommendations

In its simplest form, recommendation is a method of identifying the items most similar to a given user's preferences. In this scenario we use a content-based recommendation system, which tries to find similar movies based on movie content, i.e. the description.

To identify similar items, we will use a pairwise similarity measure called **cosine similarity** and leverage it to generate recommendations.
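Cosine similarity between two vectors is their dot product divided by the product of their lengths, so it depends only on direction and not on magnitude. The sketch below computes it by hand with NumPy on made-up vectors; `cosine_sim_matrix` and `toy_docs` are hypothetical names used only for illustration, and the helper assumes no row is all zeros.

```python
import numpy as np

def cosine_sim_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # assumes no all-zero rows
    return unit @ unit.T

# rows 0 and 1 point in the same direction; row 2 is orthogonal to both
toy_docs = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
])
sims = cosine_sim_matrix(toy_docs)
print(sims.round(2))
```

Rows 0 and 1 differ in length but score 1.0 against each other, while both score 0.0 against row 2. Each row of the result holds one item's similarity to every other item.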

### **Question 6**: Complete the following snippet to prepare a dataframe of pair-wise cosine similarity of different movies (1 point)

Compute the pairwise cosine similarity from the document embeddings.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
doc_sim = <YOUR CODE HERE>
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

## Step by Step Methodology for Recommendation

### **Question 7**: Get a list of Movie titles (1 point)

In [None]:
# movie titles
movies_list = <YOUR CODE HERE>
movies_list

### **Question 8**: Given a movie title, get its index value (1 point)

Here let's get the ID for the movie __Minions__.

__Hint:__ NumPy has dedicated functions for finding the index of a value in an array, or you can use list indexing functions instead. The output should be a number.
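As an illustration of the two lookup styles mentioned in the hint, here is a small sketch on a made-up array of titles (`toy_titles` is not part of the dataset):

```python
import numpy as np

toy_titles = np.array(["Alpha", "Bravo", "Charlie", "Delta"])

# np.where returns a tuple of index arrays; take the first match
idx_np = np.where(toy_titles == "Charlie")[0][0]

# equivalent lookup through a plain Python list
idx_list = toy_titles.tolist().index("Charlie")
print(idx_np, idx_list)
```

Both approaches return the position of the first match. With `np.where`, indexing an empty result raises an `IndexError` when the title is absent, while `list.index` raises a `ValueError`.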

In [None]:
## movie ID
movie_idx = <YOUR CODE HERE>
movie_idx

## Get Similar Movies

We already calculated the pairwise similarity between all movies in our dataset. The next step is to extract movies similar to a given movie.

Let us use the movie _Minions_ at index _546_ to find some similar movies using the ``doc_sim_df`` dataframe.

### **Question 9**: Extract row of dataframe given an index (1 point)

In [None]:
movie_similarities = <YOUR CODE HERE>
movie_similarities

### Top Similar Movies

### **Question 10**: Get top 5 most similar movies in descending order of similarity (1 point)

_Hint: in descending order, index 0 represents the movie itself (a movie description is 100% similar to itself), so it is safe to skip index 0._
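The sorting idea in the hint can be sketched on a toy array as follows. The scores in `toy_sims` are made up, and the example takes the top 3 rather than the top 5 to keep it short.

```python
import numpy as np

# made-up similarity scores of five items to item 2 (the item itself scores 1.0)
toy_sims = np.array([0.10, 0.80, 1.00, 0.30, 0.55])

order = np.argsort(-toy_sims)   # indices sorted by descending similarity
top3 = order[1:4]               # skip position 0, which is the item itself
print(top3)
```

Negating the scores before `np.argsort` is one way to get descending order; reversing an ascending sort with `[::-1]` is an equally valid choice.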

#### Get top 5 movie IDs

In [None]:
similar_movie_idxs = <YOUR CODE HERE>
similar_movie_idxs

#### Get top 5 movie names

In [None]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

## Movie Recommender

Time to combine all the smaller steps we have gone through so far into a recommendation utility.

### **Question 11**: Complete the utility function for getting movie recommendations (2 points)

In [None]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=None):
    # find movie id
    movie_idx = <YOUR CODE HERE>

    # get movie similarities
    # Hint: the movie index helps find the exact row
    movie_similarities = <YOUR CODE HERE>
    
    # get top 5 similar movie IDs
    # Hint: use numpy utility to do a sort
    similar_movie_idxs = <YOUR CODE HERE>
    
    # get top 5 movies
    similar_movies = <YOUR CODE HERE>
    
    # return the top 5 movies
    return similar_movies

### Find Similar Movies

In [None]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [None]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()