<a href="https://colab.research.google.com/github/ramapriyakp/Portfolio/blob/master/NLP/Movie_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendations with Document Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They are typically used to recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:


*   Amazon suggesting products on its website
*   Amazon Prime, Netflix, Hotstar recommending movies/shows
*   YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

*   **Simple Rule-based Recommenders**: Typically based on specific global metrics and thresholds such as movie popularity, global ratings, etc.
*   **Content-based Recommenders**: These suggest entities similar to a specific entity of interest. Content metadata such as movie descriptions, genre, cast, director and so on can be used here.
*   **Collaborative Filtering Recommenders**: Here we don't need metadata; instead we try to predict recommendations and ratings based on past ratings of different users for specific items.
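
To make the first category concrete, here is a minimal sketch of a rule-based recommender. The records, the field names (`title`, `rating`, `votes`) and the `min_votes` threshold are all made up for illustration and do not come from the dataset used in this notebook.

```python
# Toy catalogue: every value here is invented purely for illustration.
catalogue = [
    {"title": "Movie A", "rating": 8.1, "votes": 12000},
    {"title": "Movie B", "rating": 9.4, "votes": 35},
    {"title": "Movie C", "rating": 7.6, "votes": 8000},
    {"title": "Movie D", "rating": 6.2, "votes": 15000},
]

def rule_based_top_n(items, n=2, min_votes=1000):
    """Keep items with enough votes, then rank them by rating (highest first)."""
    eligible = [item for item in items if item["votes"] >= min_votes]
    ranked = sorted(eligible, key=lambda item: item["rating"], reverse=True)
    return [item["title"] for item in ranked[:n]]

top_titles = rule_based_top_n(catalogue)
print(top_titles)
```

The vote threshold keeps an item with a high rating but very few votes ("Movie B" here) from dominating the list, which is the kind of global rule this category relies on.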

<p>
We will build a movie recommendation system here: based on data/metadata pertaining to different movies, we try to recommend similar movies of interest!</p>

Since our focus is not really recommendation engines but NLP, we will leverage the text-based metadata for each movie to recommend movies similar to specific movies of interest. This falls under content-based recommenders.






In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
cd /content/drive/My Drive/NLP

/content/drive/My Drive/NLP


## Load Dataset

In [0]:
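# NOTE: this /blob/ URL returns GitHub's HTML page (see text/html below), not the archive; swapping /blob/ for /raw/ should fetch the file itself
#!wget https://github.com/dipanjanS/nlp_workshop_dhs18/blob/master/Solutions/tmdb_5000_movies.csv.gz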

--2019-09-06 05:16:59--  https://github.com/dipanjanS/nlp_workshop_dhs18/blob/master/Solutions/tmdb_5000_movies.csv.gz
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘tmdb_5000_movies.csv.gz’

tmdb_5000_movies.cs     [ <=>                ]  62.68K  --.-KB/s    in 0.1s    

2019-09-06 05:17:00 (490 KB/s) - ‘tmdb_5000_movies.csv.gz’ saved [64180]



In [0]:
#!gzip -d  tmdb_5000_movies.csv.gz

In [0]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64

In [0]:
df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


In [0]:
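# work on an explicit copy of the columns we need to avoid chained-assignment warnings
df = df[['title', 'tagline', 'overview', 'genres', 'popularity']].copy()
# missing taglines become empty strings so they do not turn the description into NaN
df['tagline'] = df['tagline'].fillna('')
df['description'] = df['tagline'].map(str) + ' ' + df['overview']
# drop the rows whose overview (and hence description) is missing
df = df.dropna()
df.info()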

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
title          4800 non-null object
tagline        4800 non-null object
overview       4800 non-null object
genres         4800 non-null object
popularity     4800 non-null float64
description    4800 non-null object
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


In [0]:
df.head()

Unnamed: 0,title,tagline,overview,genres,popularity,description
0,Avatar,Enter the World of Pandora.,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",150.437577,Enter the World of Pandora. In the 22nd centur...
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.","Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",139.082615,"At the end of the world, the adventure begins...."
2,Spectre,A Plan No One Escapes,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",107.376788,A Plan No One Escapes A cryptic message from B...
3,The Dark Knight Rises,The Legend Ends,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",112.31295,The Legend Ends Following the death of Distric...
4,John Carter,"Lost in our world, found in another.","John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",43.926995,"Lost in our world, found in another. John Cart..."


## Your Turn: Build a Movie Recommender System
Here you will build your own movie recommender system. We will use the following pipeline:


*   Text pre-processing
*   Feature Engineering
*   Document Similarity Computation
*   Find top similar movies
*   Build a movie recommendation function












## Document Similarity

Recommendations are about understanding the underlying features that make us favour one choice over another. Similarity between items (in this case, movies) is one way of understanding why we choose one movie over another. There are different ways to calculate the similarity between two items. One of the most widely used measures is cosine similarity, which we have already used in the previous unit.

## Cosine Similarity

Cosine similarity is used to calculate a numeric score that denotes the similarity between two text documents, represented as vectors $x$ and $y$:

$$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\lVert x \rVert \, \lVert y \rVert}$$
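
As a small worked illustration of the formula, the sketch below computes the score by hand using only the standard library. The three vectors are hypothetical term counts over a shared three-word vocabulary, and `cosine` is a helper defined here, not a library function. It assumes neither vector is all zeros.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-count vectors over a shared 3-word vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, twice as long
doc_c = [0, 0, 3]   # shares no terms with doc_a

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(round(sim_ab, 6), sim_ac)
```

Because the score depends only on direction, a document and a longer document with the same term proportions score (up to floating-point rounding) 1, while documents with no terms in common score 0.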
 

In [0]:
import nltk
nltk.data.path.append('/content/drive/My Drive/NLP')
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Text pre-processing

We will do some basic text pre-processing on our movie descriptions before we build our features.

In [0]:

import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
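    # remove special characters, then lower case and strip surrounding whitespace
    # (flags must be passed by keyword: the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)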
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

  
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['overview']))
len(norm_corpus)

4800
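
One detail of `re.sub` is easy to get wrong in a normalization step like this: its fourth positional parameter is `count` (the maximum number of substitutions), while flags have to be passed by keyword. The sketch below illustrates the difference on a made-up string; `int(re.I)` is used to show what handing a flag to `count` amounts to.

```python
import re

text = "A!B!C!D!"
pattern = r'[^a-z0-9\s]'

# re.I is an integer flag with value 2, so using it where `count` is expected
# limits the call to 2 substitutions and applies no flag at all.
as_count = re.sub(pattern, '', text, count=int(re.I))

# Passing the flag by keyword applies it to the pattern and removes every match.
as_flags = re.sub(pattern, '', text, flags=re.I)

print(as_count)
print(as_flags)
```

With the flag misplaced, only the first two non-matching characters are stripped and the rest of the string is left untouched, which is hard to notice on short inputs.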

## Extract TF-IDF Features

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

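# TfidfVectorizer lowercases and removes English stop words itself, so it is fit on the raw descriptions
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words='english')
tfidf_matrix = tf.fit_transform(df['description'])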
tfidf_matrix.shape

(4800, 19029)
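
To show what a TF-IDF weight is made of, here is a hand-rolled sketch on three made-up one-line documents, using only the standard library. It follows the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default (raw term count times `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document), but it is a simplification: unigrams only, whitespace tokenization, no `min_df` and no stop-word removal.

```python
import math
from collections import Counter

# Made-up documents, purely for illustration.
docs = [
    "space pirate adventure",
    "space station rescue",
    "pirate treasure adventure",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Raw term counts weighted by idf, then scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_rows = [tfidf_vector(tokens) for tokens in tokenized]
print(vocab)
print([round(w, 3) for w in tfidf_rows[0]])
```

Terms that occur in fewer documents (such as "treasure") receive a larger idf than terms shared by several documents (such as "space"), and every row ends up with unit length, so the cosine similarity of two rows reduces to their dot product.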

## Compute Pairwise Document Similarity
The similarity between two documents is a function of the angle between their vectors in the term vector space.

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.0109,0.0,0.017381,0.014895,0.020905,0.0,0.027303,0.0,0.006796,0.0,0.013663,0.0,0.0,0.00854,0.0,0.008047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008298,0.0,0.035638,0.043384,0.007492,0.026091,0.0,0.053908,0.006067,0.0,0.0,0.0,0.059387,0.0,0.0,0.007564,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047855,0.0,0.0,0.005291,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011336,0.0,0.022451,0.033565,0.0,0.0,0.0,0.00638,0.0,0.0
1,0.0109,1.0,0.0,0.0,0.046851,0.0,0.013986,0.048842,0.036272,0.008119,0.0,0.0,0.041431,0.0,0.032665,0.005338,0.028153,0.038403,0.005539,0.031881,0.025883,0.0,0.0,0.024763,0.009912,0.030826,0.041082,0.03568,0.00895,0.00705,0.0,0.012648,0.007247,0.0,0.0,0.00879,0.018041,0.0,0.018382,0.009036,...,0.014591,0.0,0.0,0.0,0.016982,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007424,0.010514,0.01661,0.035169,0.008152,0.0,0.009869,0.007231,0.014142,0.019248,0.009004,0.005798,0.0,0.0,0.0,0.0,0.009477,0.0,0.013542,0.0,0.005337,0.0,0.0,0.014553,0.0,0.024862,0.014993,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.026555,0.016419,0.0,0.0,0.134325,0.0,0.0,0.0,0.0,0.0,0.038649,0.025757,0.0,0.0,0.0,0.0,0.017114,0.0,0.0,0.0,0.0,0.0,0.124928,0.0,0.0,0.022071,0.0,0.0,0.060224,0.014304,0.0,0.011407,0.0,...,0.016081,0.017824,0.025515,0.0,0.057629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045249,0.0,0.0,0.0,0.0,0.018116,0.0,0.0,0.0,0.0,0.024311,0.0,0.0,0.0,0.0,0.04134,0.0,0.0,0.0,0.017991,0.0,0.0,0.013308,0.0,0.0
3,0.017381,0.0,0.0,1.0,0.008594,0.00344,0.013215,0.023328,0.024011,0.129431,0.0,0.0,0.0,0.0,0.0,0.005828,0.015224,0.0,0.004603,0.009242,0.005352,0.026179,0.0,0.0,0.01398,0.027135,0.015355,0.0,0.006054,0.021185,0.014182,0.022524,0.018967,0.0,0.0,0.0,0.023523,0.0,0.022488,0.008728,...,0.010891,0.01499,0.029758,0.0,0.026441,0.0,0.005819,0.009617,0.0,0.017402,0.0,0.0,0.0,0.0,0.004275,0.0,0.0,0.0,0.033549,0.0,0.009128,0.009251,0.0,0.01955,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009462,0.0,0.0,0.0,0.0,0.027134,0.034467,0.028035
4,0.014895,0.046851,0.0,0.008594,1.0,0.0,0.008966,0.031135,0.0,0.02032,0.017168,0.0,0.028309,0.0,0.01394,0.0,0.091171,0.015654,0.013326,0.015671,0.007845,0.015258,0.0,0.0,0.027581,0.0,0.05514,0.051313,0.027138,0.019616,0.0,0.017284,0.026499,0.0,0.0,0.012268,0.045677,0.0,0.0,0.025142,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023107,0.0,0.0,0.0,0.0,0.008636,0.0,0.0,0.0,0.0,0.0,0.01493,0.007985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018505,0.0,0.0,0.0,0.0,0.0,0.0,0.023291,0.0,0.0


## Get List of Movie Titles

In [0]:
movies_list = df['title'].values
movies_list, movies_list.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
        ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
        'My Date with Drew'], dtype=object), (4800,))

## Sort Dataset by Popular Movies

In [0]:
pop_movies = df.sort_values(by='popularity', ascending=False)
pop_movies.head(10)

Unnamed: 0,title,tagline,overview,genres,popularity,description
546,Minions,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by...","[{""id"": 10751, ""name"": ""Family""}, {""id"": 16, ""...",875.581305,"Before Gru, they had a history of bad bosses M..."
95,Interstellar,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 18, ""...",724.247784,Mankind was born on Earth. It was never meant ...
788,Deadpool,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",514.569956,Witness the beginning of a happy ending Deadpo...
94,Guardians of the Galaxy,All heroes start somewhere.,"Light years from Earth, 26 years after being a...","[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",481.098624,All heroes start somewhere. Light years from E...
127,Mad Max: Fury Road,What a Lovely Day.,An apocalyptic story set in the furthest reach...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",434.278564,What a Lovely Day. An apocalyptic story set in...
28,Jurassic World,The park is open.,Twenty-two years after the events of Jurassic ...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",418.708552,The park is open. Twenty-two years after the e...
199,Pirates of the Caribbean: The Curse of the Bla...,Prepare to be blown out of the water.,"Jack Sparrow, a freewheeling 17th-century pira...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",271.972889,Prepare to be blown out of the water. Jack Spa...
82,Dawn of the Planet of the Apes,One last chance for peace.,A group of scientists in San Francisco struggl...,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...",243.791743,One last chance for peace. A group of scientis...
200,The Hunger Games: Mockingjay - Part 1,Fire burns brighter in the darkness,Katniss Everdeen reluctantly becomes the symbo...,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...",206.227151,Fire burns brighter in the darkness Katniss Ev...
88,Big Hero 6,From the creators of Wreck-it Ralph and Frozen,The special bond that develops between plus-si...,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 10751...",203.73459,From the creators of Wreck-it Ralph and Frozen...


## Find Top Similar Movies for a Sample Movie

Let's take **Minions**, the most popular movie in the dataframe above, and try to find the most similar movies that can be recommended.

### Find movie ID

In [0]:
movie_idx = np.where(movies_list == 'Minions')[0][0]
movie_idx

546

### Get movie similarities

In [0]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

array([0.00980645, 0.01171429, 0.        , ..., 0.00685701, 0.        ,
       0.        ])

### Get top 5 similar movie IDs

In [0]:
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
similar_movie_idxs

array([ 506,  614, 3943,  221, 2511])
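
Negating the scores turns the ascending `argsort` into a descending ranking, and the slice `[1:6]` skips position 0 on the assumption that the movie itself ranks first with a self-similarity of 1.0. The sketch below spells out the same idea in plain Python on made-up scores, excluding the query index explicitly rather than by position, which also stays correct if another item ties at 1.0 (for example a duplicated description). `top_k_similar` is a hypothetical helper, not part of the notebook.

```python
def top_k_similar(scores, query_idx, k=5):
    """Indices of the k highest scores, skipping the query item itself."""
    candidates = [i for i in range(len(scores)) if i != query_idx]
    # Sort by descending score; sorted() is stable, so ties keep index order.
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Made-up similarity row for the item at index 2 (its self-similarity is 1.0).
row = [0.10, 0.45, 1.00, 0.00, 0.30, 1.00]
top3 = top_k_similar(row, query_idx=2, k=3)
print(top3)
```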

### Get top 5 similar movies

In [0]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

array(['Despicable Me 2', 'Despicable Me', 'Freeway', 'Stuart Little 2',
       'Home Alone'], dtype=object)

## Build a movie recommender function to recommend top 5 similar movies for any movie

The movie title, movie title list and document similarity matrix dataframe will be given as inputs to the function.

In [0]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]
    # get movie similarities
    movie_similarities = doc_sims.iloc[movie_idx].values
    # get top 5 similar movie IDs
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    # get top 5 movies
    similar_movies = movies[similar_movie_idxs]
    # return the top 5 movies
    return similar_movies
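
As written, the function raises an `IndexError` when the title is not in the list, because `np.where` then returns an empty array. Below is a minimal sketch of a more defensive variant, written with plain lists so that it runs on its own. The name `recommend_or_empty`, the toy titles and the toy similarity values are all hypothetical.

```python
def recommend_or_empty(title, titles, sim_rows, k=5):
    """Return up to k titles most similar to `title`, or [] if it is unknown."""
    if title not in titles:
        return []
    idx = titles.index(title)
    scores = sim_rows[idx]
    # Exclude the query item explicitly, then rank the rest by descending score.
    others = [i for i in range(len(titles)) if i != idx]
    ranked = sorted(others, key=lambda i: scores[i], reverse=True)
    return [titles[i] for i in ranked[:k]]

# Toy inputs: three made-up titles and a symmetric similarity matrix.
toy_titles = ["Alpha", "Beta", "Gamma"]
toy_sims = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]

print(recommend_or_empty("Alpha", toy_titles, toy_sims, k=2))
print(recommend_or_empty("Missing", toy_titles, toy_sims))
```

Returning an empty list is only one possible design; raising a descriptive error may be preferable when a missing title indicates a bug in the calling code.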

## Now use this function on the top 20 popular movies

Hint: get the first 20 titles from the `pop_movies` dataframe

In [0]:
popular_movies = pop_movies['title'].values[:20]

In [0]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me 2' 'Despicable Me' 'Freeway' 'Stuart Little 2'
 'Home Alone']

Movie: Interstellar
Top 5 recommended Movies: ['Gattaca' 'Space Pirate Captain Harlock' 'Space Cowboys'
 'Final Destination 2' 'Starship Troopers']

Movie: Deadpool
Top 5 recommended Movies: ['Don Jon' 'Shaft' 'Mars Attacks!' 'Bronson' 'Underworld: Evolution']

Movie: Guardians of the Galaxy
Top 5 recommended Movies: ['Krull' 'E.T. the Extra-Terrestrial' 'American Sniper' 'Due Date'
 'Space Battleship Yamato']

Movie: Mad Max: Fury Road
Top 5 recommended Movies: ['Star Trek Beyond' 'The 6th Day' 'Kites' 'The Notebook'
 'Killing Them Softly']

Movie: Jurassic World
Top 5 recommended Movies: ['The Lost World: Jurassic Park' 'Jurassic Park' 'The Nut Job'
 "National Lampoon's Vacation" 'Vacation']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ["Pirates of the Caribbean: Dead Man's Chest" 'The Pirate'
 'Pirates of the Caribbean: O