<a href="https://colab.research.google.com/github/rtkilian/recommendation-engine-movie-lens/blob/main/Recommending_Movies_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommending movies: retrieval
Real-world recommender systems are often composed of two stages:
1. Retrieval: selects an initial set of hundreds of candidates from all possible candidates. This stage needs to be computationally efficient.
2. Ranking: takes the output of the retrieval model and narrows it down to the best few recommendations.

Retrieval models are often composed of two sub-models:
1. Query model: computes the query representation (normally a fixed-dimensionality embedding vector) using query features.
2. Candidate model: computes the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then multiplied together (typically as a dot product of the two vectors) to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
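
As a minimal sketch of how such an affinity score can be computed, assuming the two outputs are combined with a dot product: the embedding values below are made up for illustration and do not come from a trained model.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; in a real model these would be
# produced by the query tower and the candidate tower.
query_embedding = np.array([0.5, -1.0, 0.25, 2.0])
candidate_embeddings = np.array([
    [1.0, 0.0, 0.0, 1.0],     # candidate 0
    [0.0, 1.0, 0.0, 0.0],     # candidate 1
    [0.5, -1.0, 0.25, 2.0],   # candidate 2 (same direction as the query)
])

# Affinity score: dot product of the query with every candidate.
scores = candidate_embeddings @ query_embedding

# Candidate indices sorted from best to worst match.
ranked = np.argsort(-scores)
print(scores, ranked)
```

Candidate 2 gets the highest score because its vector points in the same direction as the query vector.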

In this notebook, I am going to build a two-tower model using the MovieLens dataset. I will:
1. Get the data and split into a training and test set.
2. Implement a retrieval model.
3. Fit and evaluate the model.
4. Export the model for efficient serving by building an approximate nearest neighbours (ANN) index.

## Imports

In [1]:
!pip install -q numpy==1.18.5 # we have to downgrade otherwise we get an error

!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann


In [2]:
import numpy as np

print(np.__version__)

1.19.4


In [3]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [4]:
import tensorflow_recommenders as tfrs

## Data
We can use the MovieLens data in two ways:
1. Explicitly: use the ratings from 1 to 5 as a measure of how much a user liked a movie.
2. Implicitly: use a binary value of 0 or 1, where 1 means the user has watched the movie.

We are going to use the latter, with the dataset variant that contains 100k ratings.
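
To make the implicit view concrete, here is a small sketch in plain Python. The records and the `watched` field are hypothetical and only illustrate the idea: the rating value is discarded, and every observed user-movie pair counts as a positive example.

```python
# Made-up (user_id, movie_title, rating) records, for illustration only.
explicit_ratings = [
    ("138", "One Flew Over the Cuckoo's Nest (1975)", 4.0),
    ("92", "You So Crazy (1994)", 2.0),
]

# Implicit view: ignore the rating value and mark each observed pair as 1.
implicit_interactions = [
    {"user_id": user_id, "movie_title": title, "watched": 1}
    for user_id, title, _rating in explicit_ratings
]

print(implicit_interactions)
```

Pairs that never appear in the data are then treated as implicit negatives, so a low rating and a high rating carry the same signal here.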

In [5]:
# Ratings data
ratings = tfds.load("movielens/100k-ratings", split="train") # this data does not have any predefined splits

# Features of all the available movies
movies = tfds.load("movielens/100k-movies", split="train")

Downloading and preparing dataset movielens/100k-ratings/0.1.0 (download: 4.70 MiB, generated: 32.41 MiB, total: 37.10 MiB) to /root/tensorflow_datasets/movielens/100k-ratings/0.1.0...
Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-ratings/0.1.0. Subsequent calls will reuse this data.
Downloading and preparing dataset movielens/100k-movies/0.1.0 (download: 4.70 MiB, generated: 150.35 KiB, total: 4.84 MiB) to /root/tensorflow_datasets/movielens/100k-movies/0.1.0...
Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-movies/0.1.0. Subsequent calls will reuse this data.


Each example in the ratings dataset is a dictionary containing the movie id, user id, assigned rating, timestamp, movie information, and user information.

In [7]:
for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


The movies dataset contains the movie id, movie title, and data on what genres it belongs to. Note that the genres are encoded with integer labels.

In [8]:
for x in movies.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


From the ratings, we are only going to keep the movie title and the user id. From the movies, we keep only the title.

In [9]:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})

movies = movies.map(lambda x: x["movie_title"])

To fit and evaluate the model, we need to split the data into a training set and an evaluation set. In an industrial recommender system, this would most likely be done by time: the data up to a certain point would be used to predict the interactions after that point.

However, for the purpose of this example, I am going to use a random 80/20 split.

In [13]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False) # shuffle the data

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
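
For comparison, a time-based split could be sketched as follows. This is an illustration in plain Python on made-up records, not what this notebook does; it also assumes the `timestamp` field is kept, whereas the mapping above drops it.

```python
# Made-up interaction records; the timestamps are hypothetical Unix times.
interactions = [
    {"user_id": "1", "movie_title": "A", "timestamp": 879024327},
    {"user_id": "2", "movie_title": "B", "timestamp": 875000000},
    {"user_id": "1", "movie_title": "C", "timestamp": 881000000},
    {"user_id": "3", "movie_title": "A", "timestamp": 878000000},
    {"user_id": "2", "movie_title": "D", "timestamp": 890000000},
]

# Order by time, then cut: the earliest 80% train, the latest 20% test.
ordered = sorted(interactions, key=lambda row: row["timestamp"])
cutoff = int(len(ordered) * 0.8)
train_by_time = ordered[:cutoff]
test_by_time = ordered[cutoff:]

print(len(train_by_time), len(test_by_time))
```

With this kind of split, every test interaction happens after every training interaction, which is closer to how a deployed model is used.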

I am also going to determine the unique user ids and movie titles present in the data.

This is required because I need to map the raw values of the categorical features to embedding vectors in the models. To do that, I need a vocabulary that maps each raw feature value to an integer in a contiguous range, which makes it possible to look up the corresponding embedding in the embedding tables.

In [20]:
movie_titles = movies.batch(1_000) # combines consecutive elements into batches
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_movie_titles[:10]

array([b"'Til There Was You (1997)", b'1-900 (1994)',
       b'101 Dalmatians (1996)', b'12 Angry Men (1957)', b'187 (1997)',
       b'2 Days in the Valley (1996)',
       b'20,000 Leagues Under the Sea (1954)',
       b'2001: A Space Odyssey (1968)',
       b'3 Ninjas: High Noon At Mega Mountain (1998)',
       b'39 Steps, The (1935)'], dtype=object)
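
To illustrate why the vocabulary matters, here is a small NumPy sketch of the lookup idea. The names `title_to_index`, `embedding_table` and `embed` are hypothetical, the table holds random numbers rather than trained weights, and reserving index 0 for unknown values is an assumption of this sketch.

```python
import numpy as np

# A tiny made-up vocabulary standing in for unique_movie_titles.
vocab = sorted({b"12 Angry Men (1957)", b"187 (1997)", b"1-900 (1994)"})

# Map each raw value to an integer index; index 0 is reserved for
# values that are not in the vocabulary.
title_to_index = {title: i + 1 for i, title in enumerate(vocab)}

# One row per index (vocabulary size + 1 for the unknown slot).
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab) + 1, embedding_dim))

def embed(title: bytes) -> np.ndarray:
    """Return the embedding row for a raw title (row 0 if unknown)."""
    return embedding_table[title_to_index.get(title, 0)]

print(title_to_index)
print(embed(b"187 (1997)"))
```

Because the indices form a contiguous range, the table needs exactly one row per known value plus one row for everything unseen.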