<a href="https://colab.research.google.com/github/rtkilian/recommendation-engine-movie-lens/blob/main/Recommending_Movies_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommending movies: retrieval
Real-world recommender systems are often made up of two tasks:
1. Retrieval: select an initial set of hundreds of candidates from all possible candidates. This needs to be computationally efficient.
2. Ranking: take the candidates returned by the retrieval model and refine them to select only the best.

Retrieval models are often composed of two sub-models:
1. Query model: computes the query representation (normally a fixed-dimensionality embedding vector) using query features.
2. Candidate model: computes the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then combined, typically via a dot product, to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
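As a minimal sketch of the scoring step, assuming the affinity score is the dot product of the two vectors, the toy example below ranks three candidates for one query. The embedding values and the names `movie_a`, `movie_b` and `movie_c` are made up for illustration; nothing here is learnt.

```python
# Toy illustration: the embedding values below are invented, not learnt.
def affinity(query_vec, candidate_vec):
    """Dot product of two equally-sized embedding vectors."""
    return sum(q * c for q, c in zip(query_vec, candidate_vec))

query_embedding = [0.5, -1.0, 2.0]          # hypothetical user embedding
candidate_embeddings = {                    # hypothetical movie embeddings
    "movie_a": [1.0, 0.0, 1.0],
    "movie_b": [0.0, 1.0, 0.0],
    "movie_c": [2.0, -1.0, 0.5],
}

# Score every candidate against the query, then sort from best to worst.
scores = {title: affinity(query_embedding, vec)
          for title, vec in candidate_embeddings.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The candidate with the highest score is treated as the best match for the query.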

In this notebook, I am going to build a two-tower model using the Movielens dataset. I will:
1. Get the data and split into a training and test set.
2. Implement a retrieval model.
3. Fit and evaluate the model.
4. Export the model for efficient serving by building an approximate nearest neighbours (ANN) index.

## Imports

In [17]:
!pip install -q numpy==1.18.5 # we have to downgrade otherwise we get an error

!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann

ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.


In [18]:
import numpy as np

print(np.__version__)

1.19.4


In [19]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [20]:
import tensorflow_recommenders as tfrs

## Data
We can use the Movielens data in two ways:
1. Explicitly: use the ratings from 1-5
2. Implicitly: a binary signal of 0 or 1, where 1 means the user has watched the movie

We are going to use the latter.

We are going to use the data with 100k ratings.
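To make the explicit/implicit distinction concrete, here is a small sketch that turns rating records into implicit feedback. The records and the helper `watched` are hypothetical and only illustrate the idea that any rating, high or low, counts as "watched".

```python
# Hypothetical explicit ratings: (user_id, movie_title, rating on a 1-5 scale).
explicit_ratings = [
    ("u1", "Movie A", 4.0),
    ("u1", "Movie C", 1.0),
    ("u2", "Movie B", 5.0),
]

# Implicit view: keep only the (user, movie) pair and drop the rating value.
implicit_pairs = {(user, movie) for user, movie, _ in explicit_ratings}

def watched(user, movie):
    """Return 1 if the user has interacted with the movie, else 0."""
    return 1 if (user, movie) in implicit_pairs else 0

print(watched("u1", "Movie C"), watched("u1", "Movie B"))
```

Note that the one-star rating still produces a 1: the implicit signal records that the interaction happened, not whether the user liked the movie.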

In [21]:
# Ratings data
ratings = tfds.load("movielens/100k-ratings", split="train") # this data does not have any predefined splits

# Features of all the available movies
movies = tfds.load("movielens/100k-movies", split="train")

The ratings dataset returns a dictionary of movie id, user id, the assigned rating, timestamp, movie information and user information.

In [22]:
for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


The movies dataset contains the movie id, movie title, and data on what genres it belongs to. Note that the genres are encoded with integer labels.

In [23]:
for x in movies.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


We are only going to keep the movie title and the user id in this data.

In [24]:
ratings = ratings.map(lambda x: {
    "movie_title": x['movie_title'],
    "user_id": x["user_id"],
})

movies = movies.map(lambda x: x["movie_title"])

To fit and evaluate the model, we need to split the data into a training and an evaluation set. In an industrial recommender system, this would likely be done by time: the data up until a certain point would be used to predict the interactions after that point.
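A time-based split could look roughly like the sketch below. It is not what this notebook does; the rows, the helper name `split_by_time` and the `train_fraction` parameter are all assumptions made for illustration.

```python
def split_by_time(interactions, train_fraction=0.8):
    """Sort by timestamp, then cut: earlier rows train, later rows test."""
    ordered = sorted(interactions, key=lambda row: row["timestamp"])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]

# Hypothetical interactions with made-up timestamps.
interactions = [
    {"user_id": "u1", "movie_title": "Movie A", "timestamp": 300},
    {"user_id": "u2", "movie_title": "Movie B", "timestamp": 100},
    {"user_id": "u1", "movie_title": "Movie C", "timestamp": 500},
    {"user_id": "u3", "movie_title": "Movie A", "timestamp": 200},
    {"user_id": "u2", "movie_title": "Movie D", "timestamp": 400},
]

train_rows, test_rows = split_by_time(interactions)
print(len(train_rows), len(test_rows))
```

Every training interaction happens before every test interaction, so the model is never evaluated on the past using knowledge of the future.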

However, for the purpose of this example, I am going to use an 80/20 split.

In [25]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False) # shuffle the data

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

I am also going to determine the unique user ids and movie titles present in the data.

This is required as I need to be able to map the raw values of the categorical features to embedding vectors in the models. To do this, I need a vocabulary that maps each raw feature value to an integer in a contiguous range: this allows me to look up the corresponding embedding in the embedding tables.

In [26]:
movie_titles = movies.batch(1_000) # combines consecutive elements into batches
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_movie_titles[:10]

array([b"'Til There Was You (1997)", b'1-900 (1994)',
       b'101 Dalmatians (1996)', b'12 Angry Men (1957)', b'187 (1997)',
       b'2 Days in the Valley (1996)',
       b'20,000 Leagues Under the Sea (1954)',
       b'2001: A Space Odyssey (1968)',
       b'3 Ninjas: High Noon At Mega Mountain (1998)',
       b'39 Steps, The (1935)'], dtype=object)

In [27]:
unique_user_ids[:10]

array([b'1', b'10', b'100', b'101', b'102', b'103', b'104', b'105',
       b'106', b'107'], dtype=object)
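The vocabulary idea can be sketched in plain Python. The sample ids are hypothetical, and reserving index 0 for unknown values is my assumption about how the lookup layer used later behaves when `mask_token=None`.

```python
raw_user_ids = [b"138", b"92", b"301", b"92", b"138"]  # hypothetical sample

# Sorted unique values, mirroring what np.unique returns above.
vocabulary = sorted(set(raw_user_ids))

# Index 0 is reserved for out-of-vocabulary (OOV) values, so known ids start at 1.
id_to_index = {value: i + 1 for i, value in enumerate(vocabulary)}

def lookup(value):
    """Map a raw id to its integer index, or to 0 if it was never seen."""
    return id_to_index.get(value, 0)

print([lookup(v) for v in [b"138", b"301", b"999"]])
```

Because the ids are byte strings, they sort lexicographically (`b"138"` before `b"301"` before `b"92"`), which is also why `b'10'` follows `b'1'` in the output above.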

## Modelling
As a reminder, we are building a two-tower retrieval model: query-tower and candidate tower. These can be built independently and combined at the end.

### Query tower
The query tower represents the user: given a user id, which of the movies represented by the candidate tower (i.e. movie titles) should we recommend?

The first step is to decide on the dimensionality of the query and candidate representations.

More information on embedding layers can be found here: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [28]:
embedding_dimension = 32

Higher values may result in a more accurate model but it will also be slower to fit (i.e. more parameters to learn) and may be prone to overfitting.

The second step is to define the model itself. I am going to take the user ids and convert them into integers. These will then be converted into our learnt embeddings. The user ids will form part of our vocabulary.

In [29]:
user_model = tf.keras.Sequential([
                                  tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=unique_user_ids, mask_token=None),
                                  # I add an additional embedding to account for unknown tokens
                                  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])

[tf.keras.layers.experimental.preprocessing.StringLookup](https://www.tensorflow.org/guide/keras/preprocessing_layers): 
* Translates a set of arbitrary strings into an integer output via a table-based lookup, with optional out-of-vocabulary handling

[tf.keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding):
* Turns positive integers (indexes) into dense vectors of fixed size

A simple model like this corresponds exactly to a classic matrix factorisation approach. While defining a subclass of `tf.keras.Model` for this simple model might be overkill, it could easily be extended to an arbitrarily complex model using standard Keras components, as long as we return an `embedding_dimension`-wide output at the end.
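To show why the embedding layer gets `len(vocabulary) + 1` rows, here is a plain-Python sketch of the lookup-then-embed pipeline. The vocabulary, the tiny dimension and the random initial values are illustrative assumptions; a real embedding table is learnt during training.

```python
import random

random.seed(0)
embedding_dimension = 4               # small for readability; the notebook uses 32
vocabulary = [b"1", b"10", b"100"]    # hypothetical user-id vocabulary

# One row per known id plus one extra row (index 0) for unknown ids.
num_rows = len(vocabulary) + 1
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]
    for _ in range(num_rows)
]

index_of = {value: i + 1 for i, value in enumerate(vocabulary)}

def embed(user_id):
    """String -> integer index -> row of the embedding table."""
    return embedding_table[index_of.get(user_id, 0)]

print(len(embedding_table), len(embed(b"10")))
```

Any id outside the vocabulary shares the row at index 0, so the model still returns a vector for users it has never seen.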

### Candidate Tower
I will use the same approach for the movies.

In [30]:
movie_model = tf.keras.Sequential([
                                   tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=unique_movie_titles, mask_token=None),
                                   tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
])

### Metrics
In the training data, we have positive (user, movie) pairs. To figure out how good the model is, we need to compare the affinity score that the model calculates for this pair to the scores of all the other possible candidates. If the score for the positive pair is higher than for all the other candidates, the model is highly accurate.

To do this, I will use the [`tfrs.metrics.FactorizedTopK`](https://www.tensorflow.org/recommenders/examples/basic_retrieval) metric. The metric has one required argument: the dataset of candidate embeddings that are used as implicit negatives for evaluation. 

In [31]:
metrics = tfrs.metrics.FactorizedTopK(
    candidates=movies.batch(128).map(movie_model)
)
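The idea behind a top-k metric can be sketched in plain Python. This is my own illustration of the concept, not the TFRS implementation, and the vectors and the helper `top_k_accuracy` are hypothetical.

```python
def top_k_accuracy(user_vecs, positive_titles, candidate_vecs, k):
    """Fraction of (user, positive) pairs whose positive ranks in the top k."""
    hits = 0
    for user_vec, positive in zip(user_vecs, positive_titles):
        # Score the user against every candidate, including the positive.
        scores = {title: sum(u * c for u, c in zip(user_vec, vec))
                  for title, vec in candidate_vecs.items()}
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += positive in top_k
    return hits / len(positive_titles)

candidate_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
user_vecs = [[2.0, 0.1], [0.1, 2.0]]   # two hypothetical users
positive_titles = ["A", "C"]           # the movie each user actually watched

print(top_k_accuracy(user_vecs, positive_titles, candidate_vecs, k=1))
print(top_k_accuracy(user_vecs, positive_titles, candidate_vecs, k=2))
```

Every candidate other than the positive acts as an implicit negative, which is why the metric needs the full set of candidate embeddings.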

### Loss
The next component is the loss used to train the model. TFRS has several loss layers and tasks to make this easy.

In this instance, I'll make use of the [`Retrieval`](https://www.tensorflow.org/recommenders/api_docs/python/tfrs/tasks/Retrieval) task object: a convenient wrapper that bundles together the loss functions and metrics computation.

According to the documentation, the default loss function is categorical crossentropy.

In [32]:
task = tfrs.tasks.Retrieval(
    metrics=metrics
)

The task itself is a Keras layer that takes the query and candidate embeddings as arguments, and returns the computed loss: I'll use this to implement the model's training loop.
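For intuition, here is a plain-Python sketch of a categorical cross-entropy loss over a batch, where each user's own movie is the correct class and the other movies in the batch act as negatives. That in-batch setup, and summing rather than averaging over the batch, are my assumptions about how the task behaves; the helper `in_batch_softmax_loss` is hypothetical.

```python
import math

def in_batch_softmax_loss(user_vecs, movie_vecs):
    """Sum over the batch of -log softmax(score of the true pair).

    Row i of user_vecs is paired with row i of movie_vecs; every other
    movie in the batch acts as a negative for that user.
    """
    total = 0.0
    for i, user_vec in enumerate(user_vecs):
        logits = [sum(u * m for u, m in zip(user_vec, movie_vec))
                  for movie_vec in movie_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total

users = [[3.0, 0.0], [0.0, 3.0]]
aligned_movies = [[3.0, 0.0], [0.0, 3.0]]      # each user matches their own movie
misaligned_movies = [[0.0, 3.0], [3.0, 0.0]]   # each user matches the other movie

good = in_batch_softmax_loss(users, aligned_movies)
bad = in_batch_softmax_loss(users, misaligned_movies)
print(good < bad)
```

The loss is small when each user scores highest against their own movie and large when another movie in the batch scores higher.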

### The Full Model
I will now combine it all into a single model. TFRS exposes a base model class [`tfrs.models.Model`](https://www.tensorflow.org/recommenders/api_docs/python/tfrs/models/Model) which streamlines building models. Here, all I need to do is set up the components in the `__init__` method and implement the `compute_loss` method, taking in the raw features and returning the loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [43]:
class MovielensModel(tfrs.Model):

  def __init__(self, user_model, movie_model):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["user_id"])
    # And pick out the movie features and pass them into the movie model,
    # getting embeddings back.
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings)

## Fitting and Evaluating
After defining the model, I use the standard Keras fitting and evaluation routines to fit and evaluate the model.

I will first instantiate the model.

In [44]:
model = MovielensModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch and cache the training and evaluation data.

In [45]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

Then train the model:

In [46]:
model.fit(cached_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f968c173550>

As the model trains, the loss falls and a set of top-k retrieval metrics is updated. These tell me whether the true positive is in the top-k retrieved items from the entire candidate set. For example, a top-5 categorical accuracy of 0.2 would tell me that, on average, the true positive is in the top 5 retrieved items 20% of the time.

In this example, I am calculating the metrics during training as well as evaluation. Because this can be quite slow with large candidate sets, it may be best to turn metric calculation off in training and only run it in evaluation.

Finally, we can evaluate our model on the test set:

In [47]:
model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_100_categorical_accuracy': 0.23600000143051147,
 'factorized_top_k/top_10_categorical_accuracy': 0.020899999886751175,
 'factorized_top_k/top_1_categorical_accuracy': 0.0010999999940395355,
 'factorized_top_k/top_50_categorical_accuracy': 0.12250000238418579,
 'factorized_top_k/top_5_categorical_accuracy': 0.009100000374019146,
 'loss': 28240.8515625,
 'regularization_loss': 0,
 'total_loss': 28240.8515625}

Test set performance is much worse than training performance due to two factors:
1. Overfitting, which can be reduced with regularization or by adding user and movie features that help the model generalise to unseen data.
2. Re-recommendation of users' already watched movies. These known-positive watches can crowd test movies out of the top-K recommendations.

The second phenomenon can be tackled by excluding previously seen movies from test recommendations. This approach is relatively common in the recommender systems literature, but it isn't done in this notebook. If not recommending past movies is important, we should expect appropriately specified models to learn this behaviour from past user history and contextual information. Additionally, it is often appropriate to recommend the same item multiple times (e.g. an evergreen TV series or a regularly purchased item).
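Although it isn't done in this notebook, excluding seen movies can be as simple as a post-processing filter over the ranked list. The sketch below is one hypothetical way to do it; the titles and the helper `recommend_unseen` are made up.

```python
def recommend_unseen(ranked_titles, seen_titles, k):
    """Keep the model's ordering but drop titles the user has already watched."""
    unseen = [title for title in ranked_titles if title not in seen_titles]
    return unseen[:k]

ranked_titles = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
seen_titles = {"Movie A", "Movie C"}   # hypothetical watch history

print(recommend_unseen(ranked_titles, seen_titles, k=2))
```

In practice, more than k candidates have to be retrieved in the first place so that enough remain after filtering.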

## Making predictions
Now I have my model I can start to make predictions with it. I can use the [`tfrs.layers.factorized_top_k.BruteForce`](https://www.tensorflow.org/recommenders/api_docs/python/tfrs/layers/factorized_top_k/BruteForce) layer to do this.

In [50]:
# Create a model that takes in raw query features, and
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# recommends movies out of the entire movies dataset
index.index(movies.batch(100).map(model.movie_model), movies)

# Get recommendations
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'Bridges of Madison County, The (1995)' b'Aristocats, The (1970)'
 b'Rudy (1993)']


In [52]:
# Get recommendations
_, titles = index(tf.constant(["22"]))
print(f"Recommendations for user 22: {titles[0, :20]}")

Recommendations for user 22: [b'Super Mario Bros. (1993)' b'Shadow, The (1994)'
 b'Private Benjamin (1980)' b'Clean Slate (1994)'
 b'Naked Gun 33 1/3: The Final Insult (1994)' b'Home Alone (1990)'
 b'Judge Dredd (1995)' b'Star Trek V: The Final Frontier (1989)'
 b'Star Trek: The Motion Picture (1979)' b'Real Genius (1985)']


Alternatively, we can use an approximate retrieval index to speed up predictions. This makes it possible to efficiently surface recommendations from sets of tens of millions of candidates.
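The general idea behind approximate retrieval is to score only a promising subset of candidates rather than all of them. The toy sketch below partitions candidates around hand-picked centroids and searches only the partition closest to the query. It is an illustration of the concept under my own assumptions, not a description of any particular library's algorithm, and all names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_partitions(candidates, centroids):
    """Assign each candidate to the centroid it scores highest against."""
    partitions = {i: [] for i in range(len(centroids))}
    for title, vec in candidates.items():
        best = max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))
        partitions[best].append(title)
    return partitions

def approximate_top_k(query, candidates, centroids, partitions, k,
                      partitions_to_search=1):
    """Score only the candidates in the partitions closest to the query."""
    closest = sorted(range(len(centroids)),
                     key=lambda i: dot(query, centroids[i]),
                     reverse=True)[:partitions_to_search]
    shortlist = [title for i in closest for title in partitions[i]]
    return sorted(shortlist, key=lambda t: dot(query, candidates[t]),
                  reverse=True)[:k]

candidates = {
    "A": [1.0, 0.1], "B": [0.9, 0.2],   # point mostly along the first axis
    "C": [0.1, 1.0], "D": [0.2, 0.9],   # point mostly along the second axis
}
centroids = [[1.0, 0.0], [0.0, 1.0]]    # hand-picked; a real index learns these
partitions = build_partitions(candidates, centroids)

print(approximate_top_k([1.0, 0.0], candidates, centroids, partitions, k=2))
```

Here only two of the four candidates are scored against the query. The trade-off is that a good candidate sitting in an unsearched partition can be missed, which is what makes the result approximate.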

To do so, we can use the `scann` package. 

In [53]:
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.user_model)
scann_index.index(movies.batch(100).map(model.movie_model), movies)

<tensorflow_recommenders.layers.factorized_top_k.ScaNN at 0x7f968225ddd8>

In [54]:
# Get recommendations from the ScaNN index (not the brute-force `index` above)
_, titles = scann_index(tf.constant(["22"]))
print(f"Recommendations for user 22: {titles[0, :20]}")

Recommendations for user 22: [b'Super Mario Bros. (1993)' b'Shadow, The (1994)'
 b'Private Benjamin (1980)' b'Clean Slate (1994)'
 b'Naked Gun 33 1/3: The Final Insult (1994)' b'Home Alone (1990)'
 b'Judge Dredd (1995)' b'Star Trek V: The Final Frontier (1989)'
 b'Star Trek: The Motion Picture (1979)' b'Real Genius (1985)']
