# Recommendation Systems

Because of limited computing resources, I split the project into two notebooks. This is the second one, which concentrates on section 8.2: Content Based Recommendation Systems. Most cells from the other sections have been removed; only the cells that 8.2 needs, such as loading libraries and loading the dataset, remain.

Section 8.2 uses TF-IDF for content-based recommendation. The TF-IDF operation creates a huge sparse matrix that consumes a lot of memory. When 8.2 is integrated into the first notebook, the Jupyter kernel keeps resetting. That is why 8.2 was extracted and run in this notebook.

# 1. Dataset Acquisition

# 2. Import Necessary Dependencies

We will leverage __`keras`__ on top of __`tensorflow`__ to build some of the collaborative filtering and hybrid models. TensorFlow 2 currently has compatibility issues when combining sparse layers with dense layers, so we use native Keras for now. Once this issue is resolved, we can switch to __`tf.keras`__ with minimal code updates.

In [1]:
# filter out unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# To store and load the data
import pandas as pd

# To do linear algebra
import numpy as np

# To create plots
import matplotlib.pyplot as plt
import seaborn as sns


# To compute similarities between vectors
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# data load progress bars
from tqdm import tqdm

from collections import deque

# To create deep learning models
import tensorflow as tf
import keras
from keras.layers import Input, Embedding, Reshape, Dot, Concatenate, Dense, Dropout
from keras.models import Model

# To stack sparse matrices
from scipy.sparse import vstack

Using TensorFlow backend.


In [3]:
# remove unnecessary TF logs
import logging
tf.get_logger().setLevel(logging.ERROR)

In [4]:
# check keras and TF version used
print('TF Version:', tf.__version__)
print('Keras Version:', keras.__version__)
# TF Version: 1.15.0
# Keras Version: 2.2.5

TF Version: 1.15.0
Keras Version: 2.2.5


Let's start by loading the data that will be used to build the recommendation systems.

# 3. Load Datasets

## 3.1: Load Movie Metadata Datasets

First, we will load the movie metadata from `movies_metadata.csv.zip`, keeping only the title, overview, and vote count columns.

In [5]:
# Load a movie metadata dataset
movie_metadata = (pd.read_csv('./data/movies_metadata.csv.zip', 
                              low_memory=False)[['original_title', 'overview', 'vote_count']]
                    .set_index('original_title')
                    .dropna())

# Remove the long tail of rarely rated movies
movie_metadata = movie_metadata[movie_metadata['vote_count']>10].drop('vote_count', axis=1)

print('Shape Movie-Metadata:\t{}'.format(movie_metadata.shape))
movie_metadata.sample(5)

Shape Movie-Metadata:	(21604, 1)


original_title,overview
The Legend Fong Sai Yuk,This Hong Kong martial-arts extravaganza tells...
Locked Down,"Danny, a respected cop, is setup after an inve..."
There's Only One Jimmy Grimble,Jimmy Grimble is a shy Manchester school boy. ...
Sniper: Reloaded,"Brandon Beckett (Collins), the son of the prev..."
Dear Frankie,Nine-year-old Frankie and his single mum Lizzi...


About 21,600 movies remain in the metadata dataset after filtering.

# 4. Exploratory Data Analysis

# 5. Dimensionality Reduction & Filtering

# 6. Create Train and Test Datasets

# 7. Transformation

# 8. Building Recommendation Systems

## 8.2: Content Based Recommendation Systems


A content-based recommender relies on the similarity between the items being recommended. The basic idea is that if you like an item, you will probably also like a “similar” item. It generally works well when the properties of each item are easy to determine. When there is no historical data for a user, or when reliable metadata exists for each movie, comparing movie metadata is a useful way to find similar movies.

![](./images/Content-based.png)

### Cosine TFIDF Movie Description Similarity

#### TF-IDF 

TF-IDF is a text vectorization technique that measures how important a word is to a document (an article, a news item, a movie description, etc.) relative to a whole corpus of documents.

TF (term frequency) is simply the frequency of a word in a document.

IDF (inverse document frequency) is the inverse of the fraction of documents in the corpus that contain the word, usually log-scaled.

TF-IDF is needed because raw frequency is a poor measure of importance. Suppose we search for “the results of latest European Soccer games” on Google. It is certain that “the” will occur more frequently than “soccer games”, but from the point of view of the search query, “soccer games” is far more important.

In such cases, TF-IDF weighting negates the effect of high-frequency words in determining the importance of an item (document).

![](./images/TF-IDF-FORMULA.png)
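
To make the weighting concrete, here is a minimal sketch on a hypothetical three-document corpus (the sentences are invented for illustration and are not part of the dataset). Stop-word removal is switched off so that the effect on a very common word stays visible. Note that scikit-learn applies a smoothed IDF by default, so the exact values may differ from the formula pictured above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only
toy_docs = [
    "the toys live in the room",
    "the cop chases the thief",
    "the toys chase the thief",
]

# Same vectorizer class as used in this notebook, but without stop-word
# removal so that the down-weighting of "the" can be observed
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ maps each term to its column; idf_ holds one IDF weight per column
vocab = toy_tfidf.vocabulary_
idf_the = toy_tfidf.idf_[vocab["the"]]    # "the" appears in every document
idf_room = toy_tfidf.idf_[vocab["room"]]  # "room" appears in one document only

print("IDF of 'the': ", idf_the)
print("IDF of 'room':", idf_room)
```

The word that occurs in every document receives the lowest possible IDF, while the rarer word receives a higher one, which is the behavior described above.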


#### Cosine Similarity 
After calculating TF-IDF scores, how do we determine which items are closer to each other, or closer to the user profile? This is accomplished using the Vector Space Model, which computes proximity based on the angle between the vectors.

Consider the following example

![](./images/Vector-space-model.png)

Sentence 2 is more likely to be using Term 2 than Term 1, and vice versa for Sentence 1.

This relative measure is calculated by taking the cosine of the angle between the vectors in this space.

The reason for using the cosine is that its value increases as the angle between two vectors decreases, so a larger cosine signifies greater similarity.

The vectors are length-normalized to unit length, after which the cosine is simply the dot product (the sum of element-wise products) of the two vectors.
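
The normalization step can be sketched with two hypothetical two-dimensional vectors (the numbers are made up and simply stand in for the weights of Term 1 and Term 2). The manual result is cross-checked against `cosine_similarity` from scikit-learn, which this notebook already imports.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical term-weight vectors (weights for Term 1 and Term 2)
a = np.array([3.0, 1.0])
b = np.array([1.0, 2.0])

# Length-normalize each vector so that it has length 1
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# For unit vectors, the cosine of the angle is just the dot product
cos_manual = float(np.dot(a_unit, b_unit))

# Cross-check with scikit-learn, which expects 2-D inputs (one row per vector)
cos_sklearn = float(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0])

print(cos_manual, cos_sklearn)
```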

In this approach we use the movie descriptions to create a TF-IDF matrix, which counts and weights the words in all descriptions, and then compute the cosine similarity between all of those sparse text vectors. This can easily be extended to more or different features if you like.
An RMSE score cannot be computed for this model, since it does not predict ratings; it only ranks movies by how similar they are to a given movie.
In this way it is possible to find movies closely related to each other.

Content-based filtering can be extended to improve model performance by adding more features such as genres, cast, and crew.

In [6]:
# view sample movie descriptions
movie_metadata['overview'].head(5)

original_title
Toy Story                      Led by Woody, Andy's toys live happily in his ...
Jumanji                        When siblings Judy and Peter discover an encha...
Grumpier Old Men               A family wedding reignites the ancient feud be...
Waiting to Exhale              Cheated on, mistreated and stepped on, the wom...
Father of the Bride Part II    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [7]:
# Create tf-idf matrix for text comparison
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movie_metadata['overview'])
tfidf_matrix

<21604x48083 sparse matrix of type '<class 'numpy.float64'>'
	with 574154 stored elements in Compressed Sparse Row format>
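
A back-of-the-envelope estimate, using only the shape and element count reported above, suggests why memory becomes a problem as soon as anything is converted to a dense array. This is a rough sketch that counts the raw `float64` payload only and ignores any overhead.

```python
# Shape and number of stored elements as reported for tfidf_matrix above
n_movies, n_terms, n_stored = 21604, 48083, 574154
bytes_per_float64 = 8

# Dense TF-IDF matrix, i.e. what tfidf_matrix.toarray() has to allocate
dense_tfidf_gb = n_movies * n_terms * bytes_per_float64 / 1e9

# Dense movie-by-movie similarity matrix
dense_sim_gb = n_movies * n_movies * bytes_per_float64 / 1e9

# Fraction of cells in the TF-IDF matrix that are actually non-zero
density = n_stored / (n_movies * n_terms)

print(f"Dense TF-IDF matrix:     {dense_tfidf_gb:.1f} GB")
print(f"Dense similarity matrix: {dense_sim_gb:.1f} GB")
print(f"TF-IDF density:          {density:.4%}")
```

The sparse representation itself is small because only a tiny fraction of the cells is non-zero; it is the dense arrays derived from it that dominate memory use.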

In [8]:
# Dense view for inspection only: 21604 x 48083 float64 values need roughly 8 GB of RAM
tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [9]:
# Compute cosine similarity between all movie-descriptions
similarity = cosine_similarity(tfidf_matrix)
similarity

array([[1.        , 0.01538454, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01538454, 1.        , 0.04685421, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.04685421, 1.        , ..., 0.        , 0.00710093,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.00710093, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [10]:
# Wrap the similarity matrix computed in the previous cell in a labeled DataFrame
similarity_df = pd.DataFrame(similarity, 
                             index=movie_metadata.index.values, 
                             columns=movie_metadata.index.values)
similarity_df.head(10)

Unnamed: 0,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Bloed, Zweet en Tranen",To Be Fat Like Me,Cadet Kelly,L'Homme à la tête de caoutchouc,Le locataire diabolique,L'Homme orchestre,Maa,Robin Hood
Toy Story,1.0,0.015385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.023356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.015385,1.0,0.046854,0.0,0.0,0.047646,0.0,0.0,0.098488,0.0,...,0.0,0.0,0.0,0.004192,0.0,0.014642,0.0,0.0,0.0,0.0
Grumpier Old Men,0.0,0.046854,1.0,0.0,0.023903,0.0,0.0,0.006463,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015409,0.0,0.0,0.007101,0.0
Waiting to Exhale,0.0,0.0,0.0,1.0,0.0,0.007417,0.0,0.008592,0.0,0.0,...,0.02846,0.0,0.0,0.0,0.0,0.0,0.016324,0.00684,0.0,0.0
Father of the Bride Part II,0.0,0.0,0.023903,0.0,1.0,0.0,0.030866,0.0,0.033213,0.0,...,0.0,0.0,0.0,0.022816,0.0,0.0,0.0,0.0,0.012584,0.0
Heat,0.0,0.047646,0.0,0.007417,0.0,1.0,0.0,0.0,0.046349,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.015837,0.0,0.0,0.0
Sabrina,0.0,0.0,0.0,0.0,0.030866,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.028344,0.0,0.0,0.105139,0.0,0.0,0.0
Tom and Huck,0.0,0.0,0.006463,0.008592,0.0,0.0,0.0,1.0,0.0,0.0,...,0.164136,0.071019,0.0,0.0,0.0,0.0,0.0,0.0,0.006162,0.0
Sudden Death,0.0,0.098488,0.0,0.0,0.033213,0.046349,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014963,0.0
GoldenEye,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.043867,0.0,0.0,0.0,0.0,0.076444,0.0,0.016266,0.0,0.0


In [11]:
# movie list 
movie_list = similarity_df.columns.values


# sample movie
movie = 'Batman Begins'

# top recommendation movie count
top_n = 10

# get movie similarity records
movie_sim = similarity_df[similarity_df.index == movie].values[0]

# get movies sorted by similarity
sorted_movie_ids = np.argsort(movie_sim)[::-1]

# get recommended movie names
recommended_movies = movie_list[sorted_movie_ids[1:top_n+1]]

print('\n\nTop Recommended Movies for:', movie, 'are:-\n', recommended_movies)



Top Recommended Movies for: Batman Begins are:-
 ['Batman Unmasked: The Psychology of the Dark Knight'
 'Batman: The Dark Knight Returns, Part 1' 'Batman: Bad Blood'
 'Batman: Year One' 'Batman: Under the Red Hood'
 'Batman Beyond: The Movie' 'Batman Forever'
 'Batman: Mask of the Phantasm' 'Batman & Bill' 'Batman']


__Your turn:__ Create a function as defined below, __`content_movie_recommender()`__ which can take in sample movie names and print a list of top N recommended movies

In [12]:
def content_movie_recommender(input_movie, similarity_database=similarity_df, movie_database_list=movie_list, top_n=10):
    # get movie similarity records
    movie_sim = similarity_database[similarity_database.index == input_movie].values[0]

    # get movies sorted by similarity
    sorted_movie_ids = np.argsort(movie_sim)[::-1]

    # get recommended movie names
    recommended_movies = movie_database_list[sorted_movie_ids[1:top_n+1]]

    print('\n\nTop Recommended Movies for:', input_movie, 'are:-\n', recommended_movies)
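
If memory is tight, one possible variation is to skip the full movie-by-movie matrix and compare a single TF-IDF row against all rows on demand. The sketch below is an assumption-laden alternative, not the approach used in this notebook: `recommend_from_tfidf` is a hypothetical helper, and the three-movie DataFrame is an invented stand-in for `movie_metadata`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend_from_tfidf(title, titles, tfidf_matrix, top_n=10):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    titles = np.asarray(titles)
    # Position of the first movie carrying this title
    matches = np.flatnonzero(titles == title)
    if matches.size == 0:
        raise KeyError(f"Unknown title: {title}")
    idx = matches[0]
    # Compare one row against all rows: result has shape (1, n_movies)
    sims = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).ravel()
    # Sort by descending similarity and drop the query movie itself
    order = np.argsort(sims)[::-1]
    order = order[order != idx][:top_n]
    return titles[order].tolist()


# Tiny hypothetical stand-in for movie_metadata
toy = pd.DataFrame(
    {"overview": [
        "a bat themed vigilante fights crime in the city",
        "the vigilante returns to fight crime in the city",
        "two friends plan a wedding in the countryside",
    ]},
    index=["Hero Begins", "Hero Returns", "Wedding Plans"],
)
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
toy_recs = recommend_from_tfidf("Hero Begins", toy.index.values, toy_matrix, top_n=2)
print(toy_recs)
```

With the real data, the call would presumably look like `recommend_from_tfidf('Batman Begins', movie_metadata.index.values, tfidf_matrix)`. Removing the query movie by position, rather than by skipping the first sorted entry, also avoids relying on the movie itself always ranking first.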

__Your turn:__ Test your function below on the given sample movies

In [13]:
sample_movies = ['Captain America', 'The Terminator', 'The Exorcist', 
                 'The Hunger Games: Mockingjay - Part 1', 'The Blair Witch Project']
                 
for movie_name in sample_movies:
    content_movie_recommender(movie_name, similarity_df, movie_list, top_n=10)



Top Recommended Movies for: Captain America are:-
 ['Iron Man & Captain America: Heroes United'
 'Captain America: The First Avenger' 'Team Thor' 'Education for Death'
 'Captain America: The Winter Soldier' '49th Parallel' 'Ultimate Avengers'
 'Philadelphia Experiment II' 'Vice Versa' 'The Lair of the White Worm']


Top Recommended Movies for: The Terminator are:-
 ['Terminator 2: Judgment Day' 'Terminator Salvation'
 'Terminator 3: Rise of the Machines' 'Silent House' 'They Wait'
 'Another World' 'Teenage Caveman' 'Appleseed Alpha' 'Respire'
 'Just Married']


Top Recommended Movies for: The Exorcist are:-
 ['Exorcist II: The Heretic' 'Domestic Disturbance' 'Damien: Omen II'
 'The Exorcist III' 'Like Sunday, Like Rain' 'People Like Us'
 'Quand on a 17 Ans' "Don't Knock Twice" 'Zero Day' 'Brick Mansions']


Top Recommended Movies for: The Hunger Games: Mockingjay - Part 1 are:-
 ['The Hunger Games: Catching Fire' 'The Hunger Games: Mockingjay - Part 2'
 'Last Train from Gun Hill' 'Th