# Exercise 2 - Movie Recommender System

## Text Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They are typically used to recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:

- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: These rely on global metrics and thresholds such as movie popularity or overall ratings.
- Content-based Recommenders: These suggest entities similar to a specific entity of interest, using content metadata such as movie descriptions, genre, cast and director.
- Collaborative filtering Recommenders: These need no metadata; instead, they predict recommendations and ratings from the past ratings that different users gave to specific items.
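
To make the first category concrete, here is a minimal sketch of a rule-based recommender. The titles, the `rating` and `popularity` values, and the `min_rating` threshold are all made up for illustration; they are not taken from the dataset used in this exercise.

```python
# Toy catalogue: every title and number here is hypothetical.
movies = [
    {"title": "Movie A", "rating": 8.1, "popularity": 120.0},
    {"title": "Movie B", "rating": 5.0, "popularity": 300.0},
    {"title": "Movie C", "rating": 7.4, "popularity": 340.0},
    {"title": "Movie D", "rating": 6.9, "popularity": 90.0},
]

def rule_based_top_n(catalogue, min_rating=6.5, n=2):
    """Keep movies above a rating threshold, then rank them by popularity."""
    eligible = [m for m in catalogue if m["rating"] >= min_rating]
    ranked = sorted(eligible, key=lambda m: m["popularity"], reverse=True)
    return [m["title"] for m in ranked[:n]]

top_titles = rule_based_top_n(movies)
print(top_titles)
```

Every user gets the same list from such a rule, which is why the content-based approach built in this exercise compares movie descriptions instead.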

__We will be building a movie recommendation system here: based on data/metadata pertaining to different movies, we try to recommend similar movies of interest!__

In this exercise we will apply the concepts learnt in the Week 1 tutorials. Let's get started!

In [None]:
!nvidia-smi

## Load Data
If you are using Google Colab, use the upload button under the 'Files' icon in the left pane to upload the tmdb_5000_movies.csv.gz dataset.

In [None]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

## View the top few rows of the DataFrame

In [None]:
df.head()

In [None]:
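column_list = ['title', 'tagline', 'overview', 'genres', 'popularity']
# take an explicit copy so later assignments do not act on a view of the original frame
df = df[column_list].copy()
# replace missing taglines with empty strings (plain assignment instead of chained inplace fillna)
df['tagline'] = df['tagline'].fillna('')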

In [None]:
df['description'] = df['tagline'].map(str) + ' ' + df['overview'].map(str)

In [None]:
df.dropna(inplace=True)
df.info()

# __Question 1:__ Simple Text Preprocessing (2 points)

The first step is to clean the text columns so that features can be extracted from them. In this section we will normalize the text by removing special characters and extra whitespace and by converting it to lower case.
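
As a rough illustration of this kind of cleaning on a single string, consider the sketch below. The helper name `normalize_text`, the regex patterns, and the sample string are assumptions chosen for the example; they are one reasonable option, not the required solution.

```python
import re

def normalize_text(doc):
    """Strip special characters, collapse whitespace, and lower-case a string."""
    # Replace anything that is not a letter, digit, or whitespace with a space
    doc = re.sub(r"[^a-zA-Z0-9\s]", " ", doc)
    # Collapse runs of whitespace into one space and trim both ends
    doc = re.sub(r"\s+", " ", doc).strip()
    return doc.lower()

sample = "  Hello,   WORLD!  It's 2024...  "  # made-up sample text
normalized = normalize_text(sample)
print(normalized)
```

Note that replacing punctuation with a space splits contractions (`It's` becomes `it s`); removing the characters outright is an equally valid choice with a different trade-off.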

In [None]:
from tqdm import tqdm
import re

def normalize_corpus(docs):
    norm_docs = []
    for doc in tqdm(docs):
        # use regex to remove special characters and extra whitespace
        doc = <YOUR CODE HERE>

        # convert to lower case
        doc = <YOUR CODE HERE>

        # store in new list
        <YOUR CODE HERE>

    return norm_docs

### Preprocess the __description__ column

In [None]:
norm_corpus = <YOUR CODE HERE>
len(norm_corpus)

In [None]:
movies_list = df['title'].values
movies_list, movies_list.shape

## Movie Recommendation with Embeddings
Let us compute sentence-level embeddings for each movie description and then use them to measure movie similarity. We will use the pretrained **Universal Sentence Encoder (USE)** model.


# __Question 2:__  Use ``TensorFlow Hub`` to get embeddings using Universal Sentence Encoder for the normalized corpus (1 point)

In [None]:
import tensorflow_hub as hub

### The following may take some time to load

In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(module_url)

In [None]:
movie_embeddings = <YOUR CODE HERE>

movie_embeddings, movie_embeddings.shape

# __Question 3:__ Get Movie Similarity Scores (1 point)

We will leverage cosine similarity again to generate pairwise similarity scores from the Universal Sentence Encoder embeddings of each movie description.
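
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The dependency-free sketch below computes it for made-up 3-dimensional vectors (real USE embeddings have 512 dimensions); the function name `cosine_sim` and the toy data are assumptions for illustration, and in the exercise itself the scikit-learn import below does this work for the whole matrix at once.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", not real model output.
toy_embeddings = {
    "Movie A": [1.0, 0.0, 1.0],
    "Movie B": [2.0, 0.0, 2.0],  # same direction as Movie A
    "Movie C": [0.0, 1.0, 0.0],  # orthogonal to Movie A
}

titles = list(toy_embeddings)
# Pairwise similarity matrix: rows and columns follow the order of `titles`
sim_matrix = [
    [cosine_sim(toy_embeddings[r], toy_embeddings[c]) for c in titles]
    for r in titles
]
for title, row in zip(titles, sim_matrix):
    print(title, [round(x, 3) for x in row])
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.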

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

### Complete the following snippet to prepare a dataframe of pair-wise similarity of different movies

In [None]:
doc_sim = <YOUR CODE HERE>
doc_sim_df = <YOUR CODE HERE>
doc_sim_df.head()

# __Question 4:__  Movie Recommender (2 points)

Build a recommendation utility function that returns the top 5 movies most similar to a given movie, much like the one you built in Week 1.
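
The lookup logic can be sketched on a tiny, made-up similarity matrix as follows. The titles, the scores, and the helper name `top_k_similar` are hypothetical, and plain Python lists stand in for the DataFrame and NumPy arrays used in the exercise.

```python
# Hypothetical titles and a made-up symmetric similarity matrix
# (rows and columns follow the order of `titles`).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sims = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
]

def top_k_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)  # row of the query movie
    row = sims[idx]
    # Rank every other index by similarity, highest first
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: row[i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:k]]

result = top_k_similar("Movie A")
print(result)
```

With NumPy, `np.argsort` on the negated row yields the same ordering; the query movie then typically ranks first with a similarity of 1, so it has to be skipped when slicing.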

In [None]:
import numpy as np

def movie_recommender(movie_title, movies=movies_list, doc_sims=None):
    # find movie id
    movie_idx = <YOUR CODE HERE>

    # get movie similarities
    # Hint: movie index helps find the exact row
    movie_similarities = <YOUR CODE HERE>
    
    # get top 5 similar movie IDs
    # Hint: use numpy utility to do a sort
    similar_movie_idxs = <YOUR CODE HERE>
    
    # get top 5 movies
    similar_movies = <YOUR CODE HERE>
    
    # return the top 5 movies
    return similar_movies

# __Question 5:__ Complete the following snippet to get movie recommendations (1 point)

In [None]:
popular_movies = ['Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [None]:
for movie in popular_movies:
    <YOUR CODE HERE>