# Exercise 1 - Movie Recommender System (10 points)

## Text Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They are typically used to recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:

- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically, recommender systems can be implemented in three ways:

- Simple rule-based recommenders: Typically based on specific global metrics and thresholds, such as movie popularity or global ratings.
- Content-based recommenders: These provide similar entities based on a specific entity of interest. Content metadata such as movie descriptions, genre, cast and director can be used here.
- Collaborative filtering recommenders: Here we don't need metadata; instead we try to predict recommendations and ratings based on the past ratings of different users for specific items.

__We will be building a movie recommendation system here where, based on data/metadata pertaining to different movies, we try to recommend similar movies of interest!__

In this exercise we will apply the concepts learnt in the week 1 and week 2 tutorials. Let's get started!

## Load Data

If you are using Google Colab, use the upload button under the 'Files' icon in the left pane to upload the `tmdb_5000_movies.csv.gz` dataset.

In [1]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

### **Question 1**: **View** top few rows of the dataframe (1 point)

In [2]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
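column_list = ['title', 'tagline', 'overview', 'genres', 'popularity']
# copy the selected columns so later edits don't act on a slice of the original frame
df = df[column_list].copy()
# replace missing taglines with empty strings
df['tagline'] = df['tagline'].fillna('')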

### **Question 2**: Merge text from tagline column with text from overview column (1 point)

In [8]:
df['description'] = df['tagline'].map(str) + ' ' + df['overview'].map(str)

In [9]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   genres       4800 non-null   object 
 4   popularity   4800 non-null   float64
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


## Text Preprocessing

The first step is to prepare the text columns for analysis. In this section we clean the text before extracting features from it.

In [10]:
import nltk

In [11]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to C:\Users\New
[nltk_data]     User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to C:\Users\New
[nltk_data]     User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

### **Question 3**: Complete the text normalization utility function (1 point)

In [12]:
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

In [17]:
def normalize_document(doc):
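    # remove special characters (keep ASCII letters, digits and whitespace)
    # pass flags by keyword: the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.IGNORECASE|re.A)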

    # lower case  
    doc = doc.lower()

    # remove whitespaces
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()

    # tokenize document
    tokens = nltk.word_tokenize(doc)

    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # re-create/merge sentences from filtered content
    doc = ' '.join(filtered_tokens)
    return doc

In [18]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

4800
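
To see what the normalization steps do to a single string, here is a minimal sketch that uses only the standard library. It is not the function above: `TOY_STOP_WORDS` is a small hypothetical stand-in for NLTK's English stop-word list, and `str.split` stands in for `nltk.word_tokenize`, so results on real data may differ.

```python
import re

# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words('english')
TOY_STOP_WORDS = {"the", "of", "a", "is", "in", "to"}

def normalize_sketch(doc):
    # Drop everything except ASCII letters, digits and whitespace
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc, flags=re.A)
    # Lower-case, then split on whitespace (this also collapses repeated spaces)
    tokens = doc.lower().split()
    # Remove stop words and re-join the remaining tokens into one string
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

sample = "Enter the World of Pandora.  In the 22nd century, a Marine is dispatched!"
print(normalize_sketch(sample))
# enter world pandora 22nd century marine dispatched
```

Punctuation, casing and stop words are gone, which keeps the later feature vocabulary focused on content-bearing terms.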

## TF-IDF Features

Free text can be vectorized for use in different NLP tasks using a number of methods. TF-IDF is a robust method we covered in the week 2 tutorials.

Let us use it to prepare features from the description column.

### **Question 4**: Prepare Features based on TF-IDF (1 point)

Use the `TfidfVectorizer` class with the following settings:

- `ngram_range` as `(1, 2)` to use both unigrams and bigrams
- `min_df` as `2` to drop terms that appear in fewer than two movie descriptions
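
To make these two settings concrete, the following standard-library sketch builds a vocabulary by hand. It only approximates what the vectorizer does internally; the helper `ngrams` and the toy `docs` corpus are hypothetical and exist purely for illustration.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    # All contiguous n-grams for n in [n_min, n_max], joined with spaces
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus of already-normalized descriptions
docs = ["space marine alien", "space marine war", "alien war"]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split())))

# A min_df of 2 keeps only terms found in at least two documents
vocabulary = sorted(term for term, count in doc_freq.items() if count >= 2)
print(vocabulary)
# ['alien', 'marine', 'space', 'space marine', 'war']
```

The bigram `space marine` survives because it occurs in two documents, while bigrams seen in only one document are dropped.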

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range = (1,2), min_df = 2)

# fit and transform the text corpus
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(4800, 20667)

## Pair-wise Similarity

In its simplest form, a recommender identifies the items most similar to a given user's preferences. Here we use a content-based recommender that finds similar movies based on movie content, i.e. the description.

To identify similar items, we will use a pairwise similarity measure called **cosine similarity**.
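
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. A minimal pure-Python sketch follows; the three vectors are hypothetical term weights, and returning `0.0` for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_sim(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors over a shared 4-term vocabulary
doc_a = [0.5, 0.0, 0.5, 0.0]
doc_b = [0.5, 0.5, 0.0, 0.0]
doc_c = [0.0, 0.0, 0.0, 1.0]

print(cosine_sim(doc_a, doc_a))  # approximately 1: identical direction
print(cosine_sim(doc_a, doc_b))  # approximately 0.5: one shared term
print(cosine_sim(doc_a, doc_c))  # 0.0: no shared terms
```

Because TF-IDF weights are non-negative, the similarities between descriptions fall between 0 and 1.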

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.010701,0.0,0.01903,0.028687,0.024901,0.0,0.026516,0.0,0.00742,...,0.009702,0.0,0.023336,0.033549,0.0,0.0,0.0,0.006892,0.0,0.0
1,0.010701,1.0,0.011891,0.0,0.041623,0.0,0.014564,0.027122,0.034688,0.007614,...,0.009956,0.0,0.004818,0.0,0.0,0.012593,0.0,0.022391,0.013724,0.0
2,0.0,0.011891,1.0,0.0,0.0,0.0,0.0,0.022242,0.015854,0.004891,...,0.042617,0.0,0.0,0.0,0.016519,0.0,0.0,0.011682,0.0,0.004
3,0.01903,0.0,0.0,1.0,0.008793,0.0,0.015976,0.023172,0.027452,0.07361,...,0.0,0.0,0.009667,0.0,0.0,0.0,0.0,0.028354,0.021785,0.027735
4,0.028687,0.041623,0.0,0.008793,1.0,0.0,0.022912,0.028676,0.0,0.023538,...,0.0148,0.0,0.0,0.0,0.0,0.01076,0.0,0.010514,0.0,0.0


## Step by Step Methodology for Recommendation

### **Question 5**: Get a list of Movie titles (1 point)

In [21]:
# movie titles
movies_list = df['title'].values
movies_list

array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
       ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
       'My Date with Drew'], dtype=object)

### **Question 6**: Given a movie title, get its index value (1 point)

Let's get the index for the movie __Minions__.

__Hint:__ NumPy has dedicated functions to find the index of a value in an array, or you can use list indexing functions. The output should be a number.

In [23]:
## movie ID for Minions
movie_idx = np.where(movies_list == 'Minions')
movie_idx

(array([546], dtype=int64),)
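
The hint asks for a plain number, whereas `np.where` returns a tuple of arrays. One possible way to get a scalar index, sketched here with a plain list, is a small lookup helper. `find_title_index` and the short `titles` list are hypothetical and stand in for `movies_list`.

```python
def find_title_index(titles, title):
    # Return the position of the first exact match, or None if absent
    for i, t in enumerate(titles):
        if t == title:
            return i
    return None

# Hypothetical short title list standing in for movies_list
titles = ["Avatar", "Spectre", "Minions", "John Carter"]
print(find_title_index(titles, "Minions"))      # 2
print(find_title_index(titles, "Not A Movie"))  # None
```

Returning `None` for an unknown title makes it easy to show a clear message instead of failing later with an empty selection.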

## Get Similar Movies

We have already calculated the pairwise similarity between all movies in our dataset. The next step is to extract movies similar to a given movie.

Let us use the movie _Minions_ at index _546_ to find some similar movies using the ``doc_sim_df`` dataframe.

### **Question 7**: Extract row of dataframe given an index (1 point)

In [26]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

array([[0.0104544 , 0.01072835, 0.        , ..., 0.00690954, 0.        ,
        0.        ]])

### Top Similar Movies

### **Question 8**: Get top 5 most similar movies in descending order of similarity (1 point)

_Hint: In descending order, index 0 represents the movie itself (a movie description is 100% similar to itself), so it is safe to skip index 0._
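
The ranking idea can be sketched in plain Python: sort the other movies by similarity, highest first, and keep the first `k`. This sketch excludes the query movie by index instead of by position, which is an assumption of the sketch and not a requirement of the exercise. `top_k_similar` and `row` are hypothetical names.

```python
def top_k_similar(similarities, query_idx, k=5):
    # Consider every index except the query movie itself
    candidates = [i for i in range(len(similarities)) if i != query_idx]
    # Rank the remaining indices by similarity, highest first
    candidates.sort(key=lambda i: similarities[i], reverse=True)
    return candidates[:k]

# Hypothetical similarity row for the movie at index 2
row = [0.10, 0.40, 1.00, 0.05, 0.30, 0.00, 0.25]
print(top_k_similar(row, query_idx=2, k=3))
# [1, 4, 6]
```

Excluding by index still behaves sensibly if another movie happens to have an identical description and therefore also scores 1.0.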

#### Get top 5 movie IDs

In [105]:
similar_movie_idxs = np.argsort(movie_similarities)
similar_movie_idxs = similar_movie_idxs[0][-6:-1][::-1]
similar_movie_idxs

array([506, 614, 241, 813, 154], dtype=int64)

#### Get top 5 movie names

In [92]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

array(['Despicable Me 2', 'Despicable Me',
       'Teenage Mutant Ninja Turtles: Out of the Shadows', 'Superman',
       'Rise of the Guardians'], dtype=object)

## Movie Recommender

Time to make use of all the smaller steps we have gone through so far to prepare a recommendation utility

### **Question 9**: Complete the utility function for getting movie recommendations (2 points)

In [93]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    # find movie id
    movie_idx = np.where(movies == movie_title)

    # get movie similarities
    # Hint: movie index helps find the exact row number
    movie_similarities = doc_sims.iloc[movie_idx].values
    
    # get top 5 similar movie IDs
    # Hint: use numpy utility to do a sort as before
    similar_movie_idxs = np.argsort(movie_similarities)
    similar_movie_idxs = similar_movie_idxs[0][-6:-1][::-1]
    
    # get top 5 movies names
    similar_movies = movies[similar_movie_idxs]
    
    # return the top 5 movies
    return similar_movies

### Find Similar Movies

In [94]:
popular_movies = ['Minions', 
                  'Interstellar', 
                  'Deadpool', 
                  'Jurassic World', 
                  'Pirates of the Caribbean: The Curse of the Black Pearl',
                  'The Dark Knight Rises',
                  'Dawn of the Planet of the Apes',
                  'Iron Man']

In [95]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Movie: Interstellar
Top 5 recommended Movies: ['Gattaca' 'Space Pirate Captain Harlock' 'Space Cowboys'
 'Starship Troopers' 'Final Destination 2']

Movie: Deadpool
Top 5 recommended Movies: ['Silent Trigger' 'Underworld: Evolution' 'Bronson' 'Shaft' 'Don Jon']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'The Lost World: Jurassic Park' 'The Nut Job'
 "National Lampoon's Vacation" 'Vacation']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ["Pirates of the Caribbean: Dead Man's Chest" 'The Pirate'
 'Pirates of the Caribbean: On Stranger Tides'
 'The Pirates! In an Adventure with Scientists!' 'Joyful Noise']

Movie: The Dark Knight Rises
Top 5 recommended Movies: ['Batman Forever' 'The Dark Knight' 'Batman Returns' 'Batman'
 'Batman: The Dark Knight Returns, Part

## **Bonus**: BM25 Similarity

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.

It is widely used in the popular text search engine **Elasticsearch**. Get more details here: [link](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)
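
As a compact reference, the score computed by the implementation in this section can be written as below. This is a sketch read off that code, not a statement about the exact variant used by any search engine. Here $f(q, D)$ is the frequency of term $q$ in document $D$, $|D|$ is the length of $D$ in tokens, $\text{avgdl}$ is the average document length, $N$ is the number of documents and $n_q$ is the number of documents containing $q$.

$$
\text{score}(Q, D) = \sum_{q \in Q,\; q \in D} \text{idf}(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

$$
\text{idf}(q) = \ln\left(N - n_q + 0.5\right) - \ln\left(n_q + 0.5\right)
$$

The sum runs over the tokens of the query document $Q$ that also occur in $D$. The parameters $k_1$ and $b$ correspond to `PARAM_K1` and `PARAM_B`. When $\text{idf}(q)$ is negative, which happens for terms found in more than half of the documents, the code substitutes `EPSILON` times the average idf.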

In [96]:
import math
from six import iteritems
from six.moves import xrange

In [97]:
PARAM_K1 = 2.5
PARAM_B = 0.85
EPSILON = 0.2

In [98]:
class BM25(object):
    """Implementation of Best Matching 25 ranking function.
    Attributes
    ----------
    corpus_size : int
        Size of corpus (number of documents).
    avgdl : float
        Average length of document in `corpus`.
    corpus : list of list of str
        Corpus of documents.
    f : list of dicts of int
        Term frequencies for each document in `corpus`. Words used as keys and frequencies as values.
    df : dict
        Document frequencies for the whole `corpus`. Words used as keys and the number of documents containing each word as values.
    idf : dict
        Inverse document frequencies for the whole `corpus`. Words used as keys and idf scores as values.
    doc_len : list of int
        List of document lengths.
    """

    def __init__(self, corpus):
        """
        Parameters
        ----------
        corpus : list of list of str
            Given corpus.
        """
        self.corpus_size = len(corpus)
        self.avgdl = sum(float(len(x)) for x in corpus) / self.corpus_size
        self.corpus = corpus
        self.f = []
        self.df = {}
        self.idf = {}
        self.doc_len = []
        self.initialize()

    def initialize(self):
        """Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
        for document in self.corpus:
            frequencies = {}
            self.doc_len.append(len(document))
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.f.append(frequencies)

            for word, freq in iteritems(frequencies):
                if word not in self.df:
                    self.df[word] = 0
                self.df[word] += 1

        for word, freq in iteritems(self.df):
            self.idf[word] = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

    def get_score(self, document, index, average_idf):
        """Computes BM25 score of given `document` in relation to item of corpus selected by `index`.
        Parameters
        ----------
        document : list of str
            Document to be scored.
        index : int
            Index of document in corpus selected to score with `document`.
        average_idf : float
            Average idf in corpus.
        Returns
        -------
        float
            BM25 score.
        """
        score = 0
        for word in document:
            if word not in self.f[index]:
                continue
            idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
            score += (idf * self.f[index][word] * (PARAM_K1 + 1)
                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.doc_len[index] / self.avgdl)))
        return score

    def get_scores(self, document, average_idf):
        """Computes and returns BM25 scores of given `document` in relation to
        every item in corpus.
        Parameters
        ----------
        document : list of str
            Document to be scored.
        average_idf : float
            Average idf in corpus.
        Returns
        -------
        list of float
            BM25 scores.
        """
        scores = []
        for index in xrange(self.corpus_size):
            score = self.get_score(document, index, average_idf)
            scores.append(score)
        return scores


def get_bm25_weights(corpus):
    """Returns BM25 scores (weights) of documents in corpus.
    Each document has to be weighted with every document in given corpus.
    Parameters
    ----------
    corpus : list of list of str
        Corpus of documents.
    Returns
    -------
    list of list of float
        BM25 scores.
    Examples
    --------
    >>> from gensim.summarization.bm25 import get_bm25_weights
    >>> corpus = [
    ...     ["black", "cat", "white", "cat"],
    ...     ["cat", "outer", "space"],
    ...     ["wag", "dog"]
    ... ]
    >>> result = get_bm25_weights(corpus)
    """
    bm25 = BM25(corpus)
    average_idf = sum(float(val) for val in bm25.idf.values()) / len(bm25.idf)

    weights = []
    for doc in corpus:
        scores = bm25.get_scores(doc, average_idf)
        weights.append(scores)

    return weights

In [99]:
# documents have different lengths, so store the token lists in an object array
norm_corpus_tokens = np.array([nltk.word_tokenize(doc) for doc in norm_corpus], dtype=object)
norm_corpus_tokens[:3]


array([list(['enter', 'world', 'pandora', '22nd', 'century', 'paraplegic', 'marine', 'dispatched', 'moon', 'pandora', 'unique', 'mission', 'becomes', 'torn', 'following', 'orders', 'protecting', 'alien', 'civilization']),
       list(['end', 'world', 'adventure', 'begins', 'captain', 'barbossa', 'long', 'believed', 'dead', 'come', 'back', 'life', 'headed', 'edge', 'earth', 'turner', 'elizabeth', 'swann', 'nothing', 'quite', 'seems']),
       list(['plan', 'one', 'escapes', 'cryptic', 'message', 'bonds', 'past', 'sends', 'trail', 'uncover', 'sinister', 'organization', 'battles', 'political', 'forces', 'keep', 'secret', 'service', 'alive', 'bond', 'peels', 'back', 'layers', 'deceit', 'reveal', 'terrible', 'truth', 'behind', 'spectre'])],
      dtype=object)

The following cell may take some time to execute, depending on available compute.

In [100]:
%%time
wts = get_bm25_weights(norm_corpus_tokens)

Wall time: 1min 19s


In [None]:
bm25_wts_df = pd.DataFrame(wts)
bm25_wts_df.head()

In [None]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, doc_sims=bm25_wts_df))
    print()