<h1>Collaborative Filtering Item-Based Recommendation Systems using Deep Learning</h1>

Source code originally from
https://github.com/Wann-Jiun/nycdsa_project_5_recommender/blob/master/deep_learning.ipynb

Blog post from
https://nycdatascience.com/blog/student-works/deep-learning-meets-recommendation-systems/


<h2>Step 1 - Read in initial movie data</h2>

The first step is to gather the data and load it into the programming environment. For this use case, we download the MovieLens (small) dataset. Only the list of movies is needed for this project. We use the small dataset because downloading the movie posters and running the model over the full dataset takes a long time.

In [1]:
import numpy as np
import pandas as pd

The links.csv file lists the movies along with their IMDB IDs. This is the movie dataset used in this notebook.

In [2]:
# Read in the links.csv file and store the information in a dataframe
df_data = pd.read_csv('/home/nbuser/library/dataset/ml-latest-small/links.csv', sep=',')

In [3]:
# Check the type of df_data
type(df_data)

pandas.core.frame.DataFrame

In [5]:
# Print the first few records from the dataframe
df_data.head(10)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
5,6,113277,949.0
6,7,114319,11860.0
7,8,112302,45325.0
8,9,114576,9091.0
9,10,113189,710.0


In [6]:
# Check the number of movie links that were read in.
len(df_data.index)

9125

The IMDB ID is the only information needed from the input data. Extract it from the dataframe and store it in a separate variable.

In [7]:
# Iterate over the dataframe rows as namedtuples. row[1] is the movieId and row[2] is the imdbId;
# store the IMDB ID in the idx_to_movie dict under the key movieId - 1.
idx_to_movie = {}
for row in df_data.itertuples():
    idx_to_movie[row[1]-1] = row[2]
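
The same index-to-ID mapping can also be built without an explicit loop. The following is a minimal sketch, assuming a small hand-made dataframe (`links`) that stands in for links.csv; the name `idx_to_imdb` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for links.csv (same column names as the real file).
links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, 15602.0],
})

# Key each IMDB ID by movieId - 1, mirroring the row-by-row loop.
keys = (links["movieId"] - 1).tolist()
values = links["imdbId"].tolist()
idx_to_imdb = dict(zip(keys, values))

print(idx_to_imdb)
```

Selecting columns by name rather than by tuple position keeps the code readable if the column order of the file ever changes.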

In [8]:
type(idx_to_movie)

dict

In [9]:
# Print the records from the dict. Each entry maps an index value to the corresponding IMDB ID
idx_to_movie

{0: 114709,
 1: 113497,
 2: 113228,
 3: 114885,
 4: 113041,
 5: 113277,
 6: 114319,
 7: 112302,
 8: 114576,
 9: 113189,
 10: 112346,
 11: 112896,
 12: 112453,
 13: 113987,
 14: 112760,
 15: 112641,
 16: 114388,
 17: 113101,
 18: 112281,
 19: 113845,
 20: 113161,
 21: 112722,
 22: 112401,
 23: 114168,
 24: 113627,
 25: 114057,
 26: 114011,
 27: 114117,
 28: 112682,
 29: 115012,
 30: 112792,
 31: 114746,
 33: 112431,
 34: 112637,
 35: 112818,
 36: 112286,
 37: 113442,
 38: 112697,
 39: 112749,
 40: 114279,
 41: 112819,
 42: 114272,
 43: 113855,
 44: 114681,
 45: 113347,
 46: 114369,
 47: 114148,
 48: 114916,
 49: 114814,
 51: 113819,
 52: 110299,
 53: 112499,
 54: 113158,
 56: 113321,
 57: 110877,
 58: 112714,
 59: 113419,
 60: 116260,
 61: 113862,
 62: 116126,
 63: 118002,
 64: 115683,
 65: 116839,
 67: 113149,
 68: 113118,
 69: 116367,
 70: 113010,
 71: 113537,
 72: 113828,
 73: 115644,
 75: 114367,
 76: 113973,
 77: 112744,
 78: 116731,
 79: 112445,
 80: 114660,
 81: 112379,
 82: 1140

In [10]:
# Check the number of movies that are present in the dict object.
len(idx_to_movie)

9125

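Get a valid list of movies from the originally retrieved list. Two levels of filtering happen here:
1. Only select IMDB IDs which are 6 digits long (the poster lookup in Step 2 prefixes each ID with "tt0", which gives a seven-digit IMDB title ID only for 6-digit values)
2. Only select non-zero values, which removes the placeholder zeros left for entries that were skipped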

In [11]:
#total_movies = len(idx_to_movie)
#movies = [0]*total_movies
movies = [0] * len(idx_to_movie)

In [12]:
type(movies)

list

In [13]:
len(movies)

9125

In [14]:
# Only indices 0 .. len(movies) - 1 are visited (len(movies) equals len(idx_to_movie)), so any entry
# whose key (movieId - 1) is larger than that is skipped, as are indices that are missing from the dict.

# Loop through the list positions and keep only those IMDB IDs that are 6 digits long
for i in range(len(movies)):
    if i in idx_to_movie and len(str(idx_to_movie[i])) == 6:
        movies[i] = idx_to_movie[i]

In [15]:
# filter(function, iterable) keeps only the elements for which function returns True.
# The function filter(f, l) needs a function
#  - f as its first argument. f returns a Boolean value, i.e. either True or False.
#  - This function is applied to every element of the list l.
# Only if f returns True is the element of the list included in the result.
# Here, we only keep IMDB IDs which are non-zero, dropping the placeholder zeros left by the previous step.

# In Python 3, filter returns an iterator, hence the list() wrapper so that a list is returned
movies = list(filter(lambda imdb: imdb != 0, movies))
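
The two filtering passes can also be expressed as a single list comprehension, which avoids the placeholder zeros altogether. This is a minimal sketch on made-up data; `idx_to_movie_sample` and `six_digit_ids` are hypothetical names used only for illustration.

```python
# Hypothetical sample: index -> IMDB ID, with a gap at index 2,
# a short ID at index 3 and a seven-digit ID at index 4.
idx_to_movie_sample = {0: 114709, 1: 113497, 3: 1140, 4: 1234567}

def six_digit_ids(mapping, size):
    """Return the IDs at positions 0..size-1 that exist and have exactly six digits."""
    return [
        mapping[i]
        for i in range(size)
        if i in mapping and len(str(mapping[i])) == 6
    ]

valid = six_digit_ids(idx_to_movie_sample, size=5)
print(valid)  # [114709, 113497]
```

Because only matching IDs are ever added to the result, no second pass is needed to strip zeros.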

In [16]:
type(movies)

list

In [18]:
# This is the total number of movies in our dataset. The number changes later on as further filtering is applied.
total_movies = len(movies)
total_movies

3091

<h2> Step 2 - Fetch Movie Poster images</h2>

In [19]:
# Import the libraries needed to fetch poster images over the themoviedb.org web API.
import requests
import json

from IPython.display import Image
from IPython.display import display
from IPython.display import HTML

In [20]:
# Get the base URL for image files. w185 selects the size of the movie poster.
headers = {'Accept': 'application/json'}
payload = {'api_key': 'bb3beb7ec7af6d1c0c23ca7381b62a89'} 
response = requests.get("http://api.themoviedb.org/3/configuration", params=payload, headers=headers)
response = json.loads(response.text)
base_url = response['images']['base_url'] + 'w185'
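
The API key above is written directly into the notebook. One common alternative is to read it from an environment variable. The sketch below assumes a variable named `TMDB_API_KEY`; both that name and the helper `tmdb_payload` are hypothetical and do not appear in the original notebook.

```python
import os

def tmdb_payload(env_var="TMDB_API_KEY"):
    """Build the request parameters from an API key stored in the environment.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError("Set the {} environment variable first.".format(env_var))
    return {"api_key": key}

# Usage (after exporting the variable in the shell that starts the notebook):
# payload = tmdb_payload()
```

The returned dict has the same shape as the `payload` used in the cells of this step, so it could be passed to `requests.get` unchanged.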

In [21]:
# get_poster queries the themoviedb.org API for the poster path of a movie.
# The api_key used here is the same one that was present in the source notebook.
def get_poster(imdbid, base_url):
    # Build the IMDB title ID. Prefixing "tt0" gives a seven-digit ID only for 6-digit values,
    # which is why the movie list was filtered to 6-digit IDs in Step 1.
    movie_id = "tt0" + str(imdbid)

    # Query themoviedb.org API for movie poster path.
    movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
    headers = {'Accept': 'application/json'}
    payload = {'api_key': 'bb3beb7ec7af6d1c0c23ca7381b62a89'}
    response = requests.get(movie_url, params=payload, headers=headers)
    try:
        file_path = json.loads(response.text)['posters'][0]['file_path']
    except (ValueError, KeyError, IndexError, TypeError):
        # No usable poster entry in the response; signal this with an empty path.
        file_path = ""

    return (base_url + file_path, imdbid)
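
The "tt0" prefix used in `get_poster` only works for 6-digit IDs. A minimal sketch of a more general formatter follows, assuming that title IDs are the numeric ID zero-padded to seven digits (as in tt0114709); `imdb_title_id` is a hypothetical helper, not part of the original notebook.

```python
def imdb_title_id(imdbid):
    """Format a numeric IMDB ID as a 'tt'-prefixed ID, zero-padded to seven digits."""
    return "tt{:07d}".format(int(imdbid))

print(imdb_title_id(114709))  # tt0114709
print(imdb_title_id(1140))    # tt0001140
```

With a formatter like this, shorter IDs would not have to be discarded before the poster lookup, provided the padding assumption holds for them.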

In [22]:
# URL and IMDB hold the result of each lookup; the URL_IMDB dict collects only the movies for which a poster URL was found
URL = [0]*total_movies 
IMDB = [0]*total_movies 
URL_IMDB = {"url":[],"imdb":[]}

Iterate through the entire list of movies and look up the poster URL for each one.

In [23]:
# Look up the poster URL for each movie and keep only the movies for which a poster was found.
# get_poster returns base_url with an empty path appended when no poster is available.
for i, movie in enumerate(movies):
    (URL[i], IMDB[i]) = get_poster(movie, base_url)
    if URL[i] != base_url:
        URL_IMDB["url"].append(URL[i])
        URL_IMDB["imdb"].append(IMDB[i])
# URL = filter(lambda url: url != base_url+"", URL)

In [24]:
df = pd.DataFrame(data=URL_IMDB) 
df

Unnamed: 0,imdb,url
0,114709,http://image.tmdb.org/t/p/w185/uMZqKhT4YA6mqo2...
1,113497,http://image.tmdb.org/t/p/w185/vgpXmVaVyUL7GGi...
2,113228,http://image.tmdb.org/t/p/w185/6ksm1sjKMFLbO7U...
3,114885,http://image.tmdb.org/t/p/w185/eT79mN6LqeDXeLi...
4,113041,http://image.tmdb.org/t/p/w185/e64sOI48hQXyru7...
5,113277,http://image.tmdb.org/t/p/w185/zMyfPUelumio3ti...
6,114319,http://image.tmdb.org/t/p/w185/jQh15y5YB7bWz1N...
7,112302,http://image.tmdb.org/t/p/w185/sGO5Qa55p7wTu7F...
8,114576,http://image.tmdb.org/t/p/w185/eoWvKD60lT95Ss1...
9,113189,http://image.tmdb.org/t/p/w185/trtANqAEy9dxRCe...


In [None]:
# images = ''
# for i in range(n_display):
#     images += "<img style='width: 120px; margin: 0px; \
#                 float: left; border: 1px solid black;' src='%s' />" \
#                 % URL[i]

# display(HTML(images))

In [25]:
# Update the total_movies variable based on the present dataset that we have
total_movies = len(df)
total_movies

899

In [26]:
# Download the movie poster images from the URLs collected above and store them locally
import urllib.request

poster_path = "/home/nbuser/library/dataset/ml-latest-small/posters/"

# Commenting out this code as the movie posters have already been downloaded, and re-downloading takes a long time.
# Only need to download it once.
# for i in range(total_movies):
#     urllib.request.urlretrieve(df.url[i], poster_path + str(i) + ".jpg")
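
Instead of commenting the download loop in and out, the loop can skip files that already exist. This is a minimal sketch, not the original author's code; `download_missing` and its `fetch` parameter are hypothetical names, and `fetch` exists only so that the logic can be exercised without network access.

```python
import os
import urllib.request

def download_missing(urls, dest_dir, fetch=urllib.request.urlretrieve):
    """Save urls[i] as dest_dir/i.jpg, skipping files that are already present.

    `fetch` takes (url, filename); it defaults to urllib's urlretrieve.
    Returns the list of files that were newly downloaded.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for i, url in enumerate(urls):
        target = os.path.join(dest_dir, "{}.jpg".format(i))
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        fetch(url, target)
        downloaded.append(target)
    return downloaded

# Usage in this notebook would look like:
# download_missing(df.url, poster_path)
```

The file naming (`0.jpg`, `1.jpg`, ...) follows the scheme of the commented-out loop, so the pre-processing step can find the images at the same paths.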

<h2> Step 3 - Image Pre-processing</h2>

VGG is the Visual Geometry Group at the University of Oxford (http://www.robots.ox.ac.uk/~vgg/). Using deep Convolutional Neural Networks, the group's entry to the 2014 ILSVRC challenge secured first place in the localization track and second place in the classification track. The research paper outlining their approach and method is available at https://arxiv.org/pdf/1409.1556.pdf

In [27]:
# Import the VGG model that is included as part of the keras distribution.
# Here 16 refers to the 16 weight layers of the convolutional neural network.
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input
from keras.preprocessing import image as kimage

# Create two lists, each the size of the total number of movies
img = [0]*total_movies
x = [0]*total_movies

# Loop through all the movies, and do the following
#   1. Load the downloaded poster image, resized to the 224x224 pixels that VGG expects.
#   2. Convert the image instance to a NumPy array using the keras preprocessing function
#   3. Expand the array by inserting a new batch axis at position 0, giving shape (1, 224, 224, 3).
#   4. Pre-process the array with the vgg16 preprocess_input function so it matches the input format the model was trained on.
for i in range(total_movies):
    img[i] = kimage.load_img(poster_path + str(i) + ".jpg", target_size=(224, 224))
    x[i] = kimage.img_to_array(img[i])
    x[i] = np.expand_dims(x[i], axis=0)
    x[i] = preprocess_input(x[i])
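
To make the last pre-processing step less opaque, here is a NumPy sketch of what `preprocess_input` is understood to do for VGG16: reorder the channels from RGB to BGR and subtract per-channel ImageNet means, without scaling. It is an approximation for illustration, not the Keras implementation; `caffe_style_preprocess` and `IMAGENET_BGR_MEAN` are hypothetical names.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed values for this sketch).
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68])

def caffe_style_preprocess(batch_rgb):
    """Approximate VGG16 pre-processing for a batch of shape (n, height, width, 3)."""
    batch_bgr = batch_rgb[..., ::-1].astype("float64")  # RGB -> BGR, copies the data
    return batch_bgr - IMAGENET_BGR_MEAN                 # zero-center each channel

# A uniform grey dummy image in place of a real poster.
dummy = np.full((1, 224, 224, 3), 128.0)
out = caffe_style_preprocess(dummy)
print(out.shape)  # (1, 224, 224, 3)
```

The shape is unchanged; only the channel order and the value range differ from the raw pixel array.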

Using TensorFlow backend.


<h2> Step 4 - Image Classification</h2>

Image pre-processing is done. The next step is to run the images through the pre-built / pre-trained VGG16 model. A pre-trained model has already been trained on a dataset (here ImageNet) and contains the weights and biases that represent the features of that dataset. Using a pre-trained model saves considerable computing time and resources. Because the model is loaded below without its top fully-connected layers, its output for each poster is a set of learned image features rather than a class label.

In [28]:
# The function returns a Keras model instance for VGG16
# Arguments
#  - include_top: whether to include the 3 fully-connected layers at the top of the network
#  - weights: None (random initialization), 'imagenet' (pre-training on ImageNet), or the path to the weights file
model = VGG16(include_top=False, weights='imagenet')

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [37]:
# Model prediction takes a very long time, so the size of the dataset is reduced here.
# total_movies=992

total_movies = 100

In [38]:
# pred is the list containing the feature vector predicted for each movie poster
pred = [0]*total_movies
# pred_norm = [0]*total_movies

# matrix_res is an array of zeros of shape total_movies x 25088.
# With include_top=False, each 224x224 poster yields 25088 values (a 7 x 7 x 512 block, flattened),
# and we use these features to characterize each movie in our data set.
matrix_res = np.zeros([total_movies,25088])

for i in range(total_movies):
    pred[i] = model.predict(x[i]).ravel()
    matrix_res[i,:] = pred[i]
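
The number 25088 follows from the shape of the model output. The sketch below uses random numbers as a stand-in for `model.predict` output, assuming a 7 x 7 x 512 block per image, and shows that flattening each block gives one 25088-value row; `fake_features` is a hypothetical name.

```python
import numpy as np

# Stand-in for the VGG16 output of four images (assumed shape: 7 x 7 x 512 each).
n_images = 4
fake_features = np.random.rand(n_images, 7, 7, 512)

# Flatten each image's block into one row, as .ravel() does per image in the loop.
matrix = fake_features.reshape(n_images, -1)
print(matrix.shape)  # (4, 25088)
```

Reshaping a whole batch at once gives the same rows as filling the matrix one image at a time.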

The (cosine) similarity among users/movies is calculated based on the following formula.

<img src="http://nycdatascience.com/blog/wp-content/uploads/2017/01/eq1.png">

where s(u,v) is the cosine similarity measure between user u and user v. In this notebook the same measure is applied to the feature vectors of two movie posters.

In [31]:
# Compute the cosine similarity matrix for the movie posters based on the formula above
similarity_items = matrix_res.dot(matrix_res.T)
norms = np.array([np.sqrt(np.diagonal(similarity_items))])
similarity_items = similarity_items / (norms * norms.T) 
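
The three lines above can be checked by hand on a tiny example. Cosine similarity between two vectors is their dot product divided by the product of their lengths; the sketch below applies the same steps to a made-up 3 x 2 feature matrix (`features` is a hypothetical stand-in for `matrix_res`).

```python
import numpy as np

features = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])

gram = features.dot(features.T)                  # all pairwise dot products
norms = np.array([np.sqrt(np.diagonal(gram))])   # vector lengths, shape (1, 3)
similarity = gram / (norms * norms.T)            # divide by the outer product of lengths

# Check one entry against the definition: (1*1 + 0*1) / (1 * sqrt(2))
expected_01 = 1.0 / np.sqrt(2.0)
print(np.isclose(similarity[0, 1], expected_01))  # True
```

The diagonal is always 1 because every vector is perfectly similar to itself. A row of all zeros would have length 0 and lead to a division by zero, so such rows would need separate handling.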

In [32]:
similarity_items

array([[ 1.        ,  0.16737608,  0.17290133,  0.18040296,  0.11182734,
         0.20095944,  0.09577529,  0.14465904,  0.11048406,  0.08306888],
       [ 0.16737608,  1.        ,  0.12153619,  0.11896121,  0.10495908,
         0.18118612,  0.13995847,  0.11421868,  0.06020713,  0.13731053],
       [ 0.17290133,  0.12153619,  1.        ,  0.13918662,  0.09197251,
         0.10816018,  0.09191217,  0.09381042,  0.09077845,  0.07768695],
       [ 0.18040296,  0.11896121,  0.13918662,  1.        ,  0.11697109,
         0.10033753,  0.11295806,  0.0608156 ,  0.10836169,  0.09469287],
       [ 0.11182734,  0.10495908,  0.09197251,  0.11697109,  1.        ,
         0.10164092,  0.08618508,  0.1084675 ,  0.16586467,  0.05208971],
       [ 0.20095944,  0.18118612,  0.10816018,  0.10033753,  0.10164092,
         1.        ,  0.11285575,  0.13310798,  0.06355406,  0.11566143],
       [ 0.09577529,  0.13995847,  0.09191217,  0.11295806,  0.08618508,
         0.11285575,  1.        ,  0.06938227

<h2> Step 5 - Build Movie Recommender</h2>

In [33]:
# Load in movie data
idx_to_movie2 = {}
i = 0

# df is the dataframe that contains the (reduced) list of movie items. Map each row position to the movie's IMDB ID.
for row in df.itertuples():
    idx_to_movie2[i] = row[1]
    i += 1

We are now ready to define a function that returns the top recommended movies based on their similarity to a given movie.

In [34]:
# np.argsort performs an indirect sort of one row of the similarity matrix (quicksort by default)
# and returns the indices that would put that row in ascending order.
# The slice [:-k-1:-1] walks those indices backwards, so the k most similar movies come first.
# The query movie itself is normally among them, since its similarity to itself is 1.
def top_k_movies(similarity, mapper, movie_idx, k=6):
    return [mapper[x] for x in np.argsort(similarity[movie_idx,:])[:-k-1:-1]]
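
Since `top_k_movies` normally returns the query movie as its own best match, a variant that skips it may be useful. This is a minimal sketch on a made-up 3 x 3 similarity matrix; `top_k_excluding_self`, `sim` and `labels` are hypothetical names.

```python
import numpy as np

def top_k_excluding_self(similarity, mapper, movie_idx, k=5):
    """Return the k items most similar to movie_idx, skipping movie_idx itself."""
    order = np.argsort(similarity[movie_idx, :])[::-1]  # most similar first
    return [mapper[j] for j in order if j != movie_idx][:k]

sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
labels = {0: "A", 1: "B", 2: "C"}

print(top_k_excluding_self(sim, labels, 0, k=2))  # ['C', 'B']
```

For item 0 the row is [1.0, 0.2, 0.9], so item 2 ("C") ranks ahead of item 1 ("B") once item 0 itself is removed.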

<h2> Step 6 - Generate Movie Recommendations</h2>

In [35]:
#idx = 1811
idx=5
movies = top_k_movies(similarity_items, idx_to_movie2, idx)
movies = movies[:5]

In [36]:
# Fetch the poster URL for each recommended movie
n_display = 5
URL = [0]*n_display
for i, movie in enumerate(movies):
    (URL[i], IMDB[i]) = get_poster(movie, base_url)
    
images = ''
for i in range(n_display):
    images += "<img style='width: 110px; margin: 0px; \
                float: left; border: 1px solid black;' src='%s' />" \
                % URL[i]

display(HTML(images))
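
The HTML string for the poster strip can also be assembled with a small helper, which avoids the line continuations inside the string literal. This is a sketch only; `poster_strip_html` is a hypothetical name and the example URLs are placeholders.

```python
def poster_strip_html(urls, width_px=110):
    """Return one <img> tag per URL, styled to sit side by side."""
    style = "width: {}px; margin: 0px; float: left; border: 1px solid black;".format(width_px)
    return "".join("<img style='{}' src='{}' />".format(style, url) for url in urls)

# Placeholder URLs for illustration.
html = poster_strip_html(["http://example.com/a.jpg", "http://example.com/b.jpg"])
print(html.count("<img"))  # 2
```

In the notebook the result would be passed to `display(HTML(...))` in the same way as the `images` string.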