# Building a movie retrieval system

In this tutorial, we're going to build an end-to-end retrieval system for movies. 

A retrieval system is normally the first stage in a multi-stage recommender system and is responsible for retrieving a set of candidates out of a large corpus in response to a user query.

Retrieval models are often composed of two sub-models:

1. A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features.
2. A candidate model computing the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then multiplied together (typically as a dot product) to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
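
As a minimal sketch of this scoring step, assuming the two outputs are combined with a dot product, the following NumPy snippet scores one query against three candidates. The embedding values are made up for illustration and do not come from the model built in this tutorial.

```python
import numpy as np

# Toy embeddings; the values are made up for illustration.
query_embedding = np.array([1.0, 0.0, 2.0])  # one query, dimension 3
candidate_embeddings = np.array([
    [0.5, 1.0, 0.0],   # candidate 0
    [1.0, 0.0, 1.0],   # candidate 1
    [0.0, 2.0, -1.0],  # candidate 2
])

# One dot product per candidate gives one affinity score per candidate.
scores = candidate_embeddings @ query_embedding

# Candidate indices ordered from best to worst match.
ranking = np.argsort(-scores)
```

Here candidate 1 gets the highest score, so it would be retrieved first.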

In this tutorial, we're going to build and train such a two-tower model using the Movielens dataset.

We're going to:

1. Get our data and split it into a training and test set.
2. Implement a retrieval model.
3. Fit and evaluate it.
4. Export it for efficient serving by building an approximate nearest neighbours (ANN) index.

## The dataset

The Movielens dataset is a classic dataset from the [GroupLens](https://grouplens.org/datasets/movielens/) research group at the University of Minnesota. It contains a set of ratings given to movies by a set of users, and is a workhorse of recommender system research.

The data can be treated in two ways:

1. It can be interpreted as expressing which movies the users watched (and rated), and which they did not. This is a form of implicit feedback, where users' watches tell us which things they prefer to see and which they'd rather not see.
2. It can also be seen as expressing how much the users liked the movies they did watch. This is a form of explicit feedback: given that a user watched a movie, we can tell roughly how much they liked it by looking at the rating they have given.

In this tutorial, we are focusing on a retrieval system: a model that predicts a set of movies from the catalogue that the user is likely to watch. Often, implicit data is more useful here, and so we are going to treat Movielens as an implicit system. This means that every movie a user watched is a positive example, and every movie they have not seen is an implicit negative example.
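
To make the implicit interpretation concrete, here is a small sketch in plain Python. The interaction log and catalogue are hypothetical toy data, not Movielens itself. Note that a low rating still counts as a positive, because only the fact that the movie was watched matters.

```python
# Hypothetical toy interaction log of (user_id, movie_id, rating) triples.
ratings_log = [(1, 10, 5.0), (1, 11, 1.0), (2, 10, 3.0)]
catalogue = {10, 11, 12}

# Implicit view: any rated movie counts as watched, whatever the rating.
positives = {(user, movie) for user, movie, _ in ratings_log}

# Every (user, movie) pair that was never observed is an implicit negative.
users = {user for user, _, _ in ratings_log}
negatives = {(u, m) for u in users for m in catalogue} - positives
```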


## Imports


Let's first get our imports out of the way.



In [ ]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf

In [ ]:
import tensorflow_recommenders as tfrs

## Preparing the dataset

Let's first have a look at the data.

The `movielens_100K` function built into TFRS returns a pair of `tf.data.Dataset` objects: a dataset of ratings and a dataset of movie features.

In [ ]:
ratings, movies = tfrs.datasets.movielens_100K()

The ratings dataset returns a dictionary of movie id, user id, the assigned rating, and timestamp:

In [ ]:
for x in ratings.batch(5).take(1).as_numpy_iterator():
  pprint.pprint(x)

{'movie_id': array([242, 302, 377,  51, 346], dtype=int32),
 'rating': array([3., 3., 1., 2., 1.], dtype=float32),
 'timestamp': array([881250949, 891717742, 878887116, 880606923, 886397596], dtype=int32),
 'user_id': array([196, 186,  22, 244, 166], dtype=int32)}


The movies dataset contains the movie id, movie title, its release date, and data on what genres it belongs to:

In [ ]:
for x in movies.batch(1).take(1).as_numpy_iterator():
  pprint.pprint(x)

{'Action': array([0.], dtype=float32),
 'Adventure': array([0.], dtype=float32),
 'Animation': array([1.], dtype=float32),
 "Children's": array([1.], dtype=float32),
 'Comedy': array([1.], dtype=float32),
 'Crime': array([0.], dtype=float32),
 'Documentary': array([0.], dtype=float32),
 'Drama': array([0.], dtype=float32),
 'Fantasy': array([0.], dtype=float32),
 'Film-Noir': array([0.], dtype=float32),
 'Horror': array([0.], dtype=float32),
 'IMDb URL': array([b'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)'],
      dtype=object),
 'Musical': array([0.], dtype=float32),
 'Mystery': array([0.], dtype=float32),
 'Romance': array([0.], dtype=float32),
 'Sci-Fi': array([0.], dtype=float32),
 'Thriller': array([0.], dtype=float32),
 'War': array([0.], dtype=float32),
 'Western': array([0.], dtype=float32),
 'movie_id': array([1], dtype=int32),
 'movie_title': array([b'Toy Story (1995)'], dtype=object),
 'release_date': array([b'01-Jan-1995'], dtype=object),
 'unknown': array([0.], 

In this example, we're going to focus on the ratings data. Other tutorials explore how to use the movie information data as well to improve the model quality.

To fit and evaluate the model, we need to split it into a training and evaluation set. In an industrial recommender system, this would most likely be done by time: the data up to time $T$ would be used to predict interactions after $T$.
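
A time-based split of this kind could look roughly like the sketch below. The arrays and the cutoff value are hypothetical and chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical toy interactions; only the timestamps matter for the split.
toy_timestamps = np.array([50, 10, 40, 20, 30])
toy_movie_ids = np.array([5, 1, 4, 2, 3])

cutoff = 30  # the time T; an arbitrary value chosen for this example

# Interactions up to and including T are for training, later ones for testing.
train_mask = toy_timestamps <= cutoff
train_movie_ids = toy_movie_ids[train_mask]
test_movie_ids = toy_movie_ids[~train_mask]
```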


In this simple example, however, let's use a random split, putting 80% of the ratings in the train set, and 20% in the test set.

In [ ]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

Let's also figure out unique user ids and movie ids present in the data. 

This is important because we need to be able to map the raw values of our categorical features to embedding vectors in our models. To do that, we need a vocabulary that maps a raw feature value to an integer in a contiguous range: this allows us to look up the corresponding embeddings in our embedding tables.

In [ ]:
movie_ids = ratings.batch(1_000_000).map(lambda x: x["movie_id"])
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_ids = np.unique(np.concatenate(list(movie_ids)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_movie_ids[:10]

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)
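
The role of the vocabulary can be sketched with NumPy alone. The raw ids and the embedding table below are toy values. In the real model the feature column layers perform this lookup internally, so this is only an illustration of the idea.

```python
import numpy as np

# Hypothetical raw ids; they need not be contiguous or start at zero.
raw_ids = np.array([42, 7, 42, 1000, 7])

# The sorted unique values form the vocabulary ...
vocabulary = np.unique(raw_ids)
# ... and each raw id maps to its position in that vocabulary.
id_to_index = {int(raw): index for index, raw in enumerate(vocabulary)}
indices = np.array([id_to_index[int(raw)] for raw in raw_ids])

# The indices select rows of an embedding table with one row per vocabulary entry.
embedding_table = np.arange(len(vocabulary) * 2, dtype=np.float32).reshape(-1, 2)
looked_up = embedding_table[indices]
```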

## Implementing a model

This is the critical section where we choose the architecture of our model.

Because we are building a two-tower retrieval model, we can build each tower separately and then combine them in the final model.

### The query tower

Let's start with the query tower.

The first step is to decide on the dimensionality of the query and candidate representations:

In [ ]:
embedding_dimension = 32

The second is to define the input features. Here, we're going to use [feature columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns) to define a simple embedding layer, taking `user_id` as its only input feature. Note that we use the list of unique user ids we computed earlier as a vocabulary:

In [ ]:
user_features = [tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "user_id", unique_user_ids),
        embedding_dimension)]

# The model itself is a single embedding layer.
# However, we could expand this to an arbitrarily complicated Keras model, as long
# as the output is a vector `embedding_dimension` wide.
user_model = tf.keras.Sequential([tf.keras.layers.DenseFeatures(user_features)])

A simple model like this corresponds exactly to a classic [matrix factorization](https://ieeexplore.ieee.org/abstract/document/4781121) approach. However, we could easily extend it to an arbitrarily complex model using standard Keras components, as long as we return an `embedding_dimension`-wide output at the end.
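
The link to matrix factorization can be illustrated with a short NumPy sketch. It assumes both towers are plain id embedding tables combined by a dot product. The table sizes and random values are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_movies, dim = 4, 6, 2  # toy sizes, chosen for illustration

# One embedding row per user and one per movie (the two towers).
user_table = rng.normal(size=(num_users, dim))
movie_table = rng.normal(size=(num_movies, dim))

# All user-movie affinities at once: a low-rank factorization U @ V.T.
affinity = user_table @ movie_table.T

# A single entry is the dot product of the two looked-up embeddings.
single = user_table[2] @ movie_table[5]
```

The full affinity matrix has rank at most `dim`, which is exactly the low-rank structure that matrix factorization fits.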

### The candidate tower

We can do the same with the candidate tower.

In [ ]:
movie_features = [tf.feature_column.embedding_column(
  tf.feature_column.categorical_column_with_vocabulary_list(
    "movie_id", list(unique_movie_ids)),
  embedding_dimension)]

movie_model = tf.keras.Sequential([tf.keras.layers.DenseFeatures(movie_features)])

### Metrics

In our training data we have positive (user, movie) pairs. To figure out how good our model is, we need to compare the affinity score that the model calculates for this pair to the scores of all the other possible candidates: if the score for the positive pair is higher than for all other candidates, our model is highly accurate.

To do this, we can use the `tfrs.metrics.FactorizedTopK` metric. The metric has one required argument: the dataset of candidates that are used as implicit negatives for evaluation.

In our case, that's the `movies` dataset, converted into embeddings via our movie model:

In [ ]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(movie_model)
)



### Loss

The next component is the loss used to train our model. TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the `RetrievalTask` object: a convenience wrapper that bundles together the loss function and metric computation:

In [ ]:
task = tfrs.tasks.RetrievalTask(
  corpus_metrics=metrics
)

The task itself is a Keras layer that takes the query and candidate embeddings as arguments, and returns the computed loss: we'll use that to implement the model's training loop.
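
To give an intuition for what such a retrieval loss can compute, here is a NumPy sketch of an in-batch softmax loss. Two things in it are assumptions rather than statements about TFRS: that the other candidates in the batch serve as negatives, and that the per-example losses are summed. The function name `in_batch_softmax_loss` is hypothetical.

```python
import numpy as np

def in_batch_softmax_loss(query_emb, candidate_emb):
    """Summed cross-entropy where row i's positive is candidate i.

    Assumption: the other candidates in the batch act as negatives.
    """
    logits = query_emb @ candidate_emb.T                 # (batch, batch) scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The true pairs sit on the diagonal of the score matrix.
    return float(-np.trace(log_probs))

queries = np.eye(3) * 5.0
# Candidates aligned with their queries give a small loss ...
loss_matched = in_batch_softmax_loss(queries, np.eye(3))
# ... while misaligned candidates give a much larger one.
loss_shuffled = in_batch_softmax_loss(queries, np.eye(3)[[1, 2, 0]])
```

Minimizing a loss of this shape pushes the score of each observed pair above the scores of the other candidates in the batch.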

### The full model

We can now put it all together into a model. TFRS exposes a base model class `tfrs.models.Model` which streamlines building models: all we need to do is to set up the components in the `__init__` method, and implement the `train_loss` method, taking in the raw features and returning a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [ ]:
class MovielensModel(tfrs.models.Model):

  def __init__(self):
    super().__init__()

    self.movie_model: tf.keras.layers.Layer = movie_model
    self.user_model: tf.keras.layers.Layer = user_model
    self.task: tf.keras.layers.Layer = task

  def train_loss(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:

    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model({"user_id": features["user_id"]})
    # And pick out the movie features and pass them into the movie model,
    # getting embeddings back.
    positive_movie_embeddings = self.movie_model(
        {"movie_id": features["movie_id"]})

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings)

## Fitting and evaluating

After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

Let's first instantiate the model.

In [ ]:
model = MovielensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training and evaluation data.

In [ ]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

Then train the model:

In [ ]:
model.fit(cached_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fcd87698630>

As the model trains, the loss is falling and a set of top-k retrieval metrics is updated. These tell us whether the true positive is in the top-k retrieved items from the entire candidate set. For example, a top-5 categorical accuracy metric of 0.2 would tell us that, on average, the true positive is in the top 5 retrieved items 20% of the time.
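
The top-k metric described above can be sketched in NumPy as follows. The helper `top_k_accuracy` and the score matrix are hypothetical illustrations, not the TFRS implementation.

```python
import numpy as np

def top_k_accuracy(scores, true_index, k):
    """Fraction of queries whose true candidate is among the k highest scores."""
    # Indices of the k best-scoring candidates for every query (row).
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_index[:, None]).any(axis=1)
    return float(hits.mean())

# Toy scores for 3 queries over 4 candidates; the values are made up.
toy_scores = np.array([
    [0.9, 0.1, 0.3, 0.2],  # true candidate 0 is ranked 1st
    [0.2, 0.8, 0.5, 0.1],  # true candidate 2 is ranked 2nd
    [0.7, 0.6, 0.5, 0.1],  # true candidate 3 is ranked 4th
])
true_index = np.array([0, 2, 3])

top_1 = top_k_accuracy(toy_scores, true_index, k=1)
top_2 = top_k_accuracy(toy_scores, true_index, k=2)
```

In this toy case one of three queries is a top-1 hit and two of three are top-2 hits.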

Note that, in this example, we evaluate the metrics during training as well as evaluation. Because this can be quite slow with large candidate sets, it may be prudent to turn metric calculation off in training, and only run it in evaluation.

Finally, we can evaluate our model on the test set:

In [ ]:
model.evaluate(cached_test, return_dict=True)



{'factorized_top_k': array([5.0000e-05, 2.6500e-03, 6.7500e-03, 7.3600e-02, 1.6655e-01],
       dtype=float32),
 'factorized_top_k/top_1_categorical_accuracy': 4.999999873689376e-05,
 'factorized_top_k/top_5_categorical_accuracy': 0.0026499999221414328,
 'factorized_top_k/top_10_categorical_accuracy': 0.006750000175088644,
 'factorized_top_k/top_50_categorical_accuracy': 0.07360000163316727,
 'factorized_top_k/top_100_categorical_accuracy': 0.16654999554157257,
 'loss': 29046.48046875}

Test set performance is much worse than training performance. This is due to two factors:

1. Our model is likely to perform better on the data that it has seen, simply because it can memorize it. This overfitting phenomenon is especially strong when models have many parameters. It can be mitigated by model regularization and use of user and movie features that help the model generalize better to unseen data.
2. The model is re-recommending some of users' already watched movies. These known-positive watches can crowd test movies out of the top K recommendations.

The second phenomenon can be tackled by excluding previously seen movies from test recommendations.

In [ ]:
tfrs.examples.movielens.evaluate(
    user_model=model.user_model,
    movie_model=model.movie_model,
    test=test,
    movies=movies,
    train=train,
    k=10
)

{'precision_at_k': 0.06666666666666667, 'recall_at_k': 0.05449043107495953}

These values are higher than if we did not exclude the training set watches:

In [ ]:
tfrs.examples.movielens.evaluate(
    user_model=model.user_model,
    movie_model=model.movie_model,
    test=test,
    movies=movies,
    train=None,
    k=10
)

{'precision_at_k': 0.014437367303609342, 'recall_at_k': 0.014993950708787618}

Of course, accuracy on the training set is still much higher:

In [ ]:
tfrs.examples.movielens.evaluate(
    user_model=model.user_model,
    movie_model=model.movie_model,
    test=train,
    movies=movies,
    k=10
)

{'precision_at_k': 0.6779427359490986, 'recall_at_k': 0.15321799439374054}
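
The effect of excluding previously seen movies can be illustrated with a small NumPy sketch. The `recommend` helper and the scores are hypothetical. This is not how `tfrs.examples.movielens.evaluate` is implemented, only the general idea of masking out watched items before ranking.

```python
import numpy as np

def recommend(scores, seen, k):
    """Top-k candidate indices for one user, skipping already watched items."""
    masked = scores.astype(float)  # astype returns a copy, so scores is untouched
    seen_idx = np.array(sorted(seen), dtype=int)
    masked[seen_idx] = -np.inf     # watched items can never be ranked
    return np.argsort(-masked)[:k].tolist()

toy_scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])  # made-up affinities for 5 movies
seen_in_train = {0, 1}                             # movies the user already watched
held_out = {2}                                     # the user's test-set movie

with_exclusion = recommend(toy_scores, seen_in_train, k=2)
without_exclusion = recommend(toy_scores, set(), k=2)
```

Without the exclusion, the two training movies occupy both slots and the held-out movie is missed. With it, the held-out movie moves to the top.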

## Making predictions

Now that we have a model, we would like to be able to make predictions. We can use the `DatasetIndexedTopK` layer to do this.

In [ ]:
top_k = tfrs.layers.corpus.DatasetIndexedTopK(
    # We transform the movies dataset into pairs of (movie title, movie embedding)
    # to allow us to retrieve most highly scored titles given embedding.
    # We use the `cache` transformation to make sure we don't recompute
    # movie embeddings every time we score a query.
    candidates=movies.batch(4096).map(lambda x: (x["movie_title"], model.movie_model(x))).cache()
)

Now that we have the candidate layer, all that remains is to get some user embeddings and run the top k queries:

In [ ]:
for user_id in (10, 123, 557):
  _, top_titles = top_k(model.user_model({"user_id": [user_id]}))
  print(f"Top titles for user {user_id}: {top_titles}")

Top titles for user 10: [[b'Wonderful, Horrible Life of Leni Riefenstahl, The (1993)'
  b'House of the Spirits, The (1993)' b'Laura (1944)'
  b'Big Sleep, The (1946)' b'Charade (1963)' b'Red Rock West (1992)'
  b'His Girl Friday (1940)' b'Kicking and Screaming (1995)'
  b'Band Wagon, The (1953)' b'Shall We Dance? (1937)']]
Top titles for user 123: [[b'Love in the Afternoon (1957)' b'Laura (1944)'
  b'House of the Spirits, The (1993)' b'Innocents, The (1961)'
  b'Third Man, The (1949)' b'Roman Holiday (1953)' b'Raging Bull (1980)'
  b'Bonnie and Clyde (1967)' b'Paths of Glory (1957)'
  b'Secrets & Lies (1996)']]
Top titles for user 557: [[b'Ice Storm, The (1997)' b'Welcome To Sarajevo (1997)'
  b'In the Company of Men (1997)' b'Gabbeh (1996)' b'Senseless (1998)'
  b'House of Yes, The (1997)' b"Eve's Bayou (1997)"
  b'Pillow Book, The (1995)' b'Cop Land (1997)' b'U Turn (1997)']]


## Model serving

After the model is trained, we need a way to deploy it.

In a two-tower retrieval model, serving has two components:

- a serving query model, taking in features of the query and transforming them into a query embedding, and
- a serving candidate model. This most often takes the form of an approximate nearest neighbours (ANN) index which allows fast approximate lookup of candidates in response to a query produced by the query model.

### Exporting a query model to serving

Exporting the query model is easy: we can either serialize the Keras model directly, or export it to a `SavedModel` format to make it possible to serve using [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).

To export to a `SavedModel` format, we can do the following:

In [ ]:
# Export the query model.
with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, "query_model")
  tf.saved_model.save(model.user_model, path)
  loaded = tf.saved_model.load(path)
  infer = loaded.signatures["serving_default"]

  query_embedding = infer(user_id=tf.constant([10], dtype=tf.int32))['output_1']

  print(f"Query embedding: {query_embedding[0, :3]}")

Query embedding: [-0.83877546 -0.16073981  0.10314572]


### Building a candidate ANN index

Exporting candidate representations is more involved. Firstly, we want to pre-compute them to make sure serving is fast; this is especially important if the candidate model is computationally intensive (for example, if it has many or wide layers; or uses complex representations for text or images). Secondly, we would like to take the precomputed representations and use them to construct a fast approximate retrieval index.


We can use [Annoy](https://github.com/spotify/annoy) to build such an index.

To do so, first instantiate the index object.

In [ ]:
from annoy import AnnoyIndex

index = AnnoyIndex(embedding_dimension, "dot")

Then take the candidate dataset and transform its raw features into embeddings using the movie model:

In [ ]:
movie_embeddings = (
    movies.batch(128).map(lambda x: (
        x["movie_id"], model.movie_model(x))))

And then index the movie_id, movie embedding pairs into our Annoy index:

In [ ]:
movie_id_to_title = dict(movies.map(lambda x: (x["movie_id"], x["movie_title"])).as_numpy_iterator())

# We unbatch the dataset because Annoy accepts only scalar (id, embedding) pairs.
for movie_id, movie_embedding in movie_embeddings.unbatch().as_numpy_iterator():
  index.add_item(movie_id, movie_embedding)

# Build a 10-tree ANN index.
index.build(10)

True

We can then retrieve nearest neighbours:

In [ ]:
for row in test.batch(1).take(10):
  query_embedding = model.user_model(row)[0]
  candidates = index.get_nns_by_vector(query_embedding, 3)
  print(f"Candidates: {[movie_id_to_title[x] for x in candidates]}.")


Candidates: [b'To Gillian on Her 37th Birthday (1996)', b'Jack (1996)', b'Associate, The (1996)'].
Candidates: [b'Oliver & Company (1988)', b'Star Trek V: The Final Frontier (1989)', b'Highlander III: The Sorcerer (1994)'].
Candidates: [b'Blue Chips (1994)', b'Free Willy (1993)', b'House Party 3 (1994)'].
Candidates: [b"Daniel Defoe's Robinson Crusoe (1996)", b'Police Story 4: Project S (Chao ji ji hua) (1993)', b'Audrey Rose (1977)'].
Candidates: [b'Foreign Correspondent (1940)', b"Daniel Defoe's Robinson Crusoe (1996)", b'Mis\xe9rables, Les (1995)'].
Candidates: [b'Pink Floyd - The Wall (1982)', b'Grand Day Out, A (1992)', b'Wrong Trousers, The (1993)'].
Candidates: [b'Ace Ventura: Pet Detective (1994)', b'Grease (1978)', b'Ref, The (1994)'].
Candidates: [b"Robert A. Heinlein's The Puppet Masters (1994)", b'In the Army Now (1994)', b'GoldenEye (1995)'].
Candidates: [b'Conan the Barbarian (1981)', b'Tank Girl (1995)', b'With Honors (1994)'].
Candidates: [b'Kissed (1996)', b'Wild Thing
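
When judging an approximate index, it can help to have an exact reference to compare against. The sketch below is a brute-force top-k search in NumPy over random stand-in embeddings. The `exact_top_k` helper is hypothetical and not part of Annoy or TFRS.

```python
import numpy as np

def exact_top_k(query, candidate_matrix, k):
    """Indices of the k candidates with the highest dot product, best first."""
    scores = candidate_matrix @ query
    # argpartition selects the k best without sorting everything;
    # only those k entries are then sorted.
    best = np.argpartition(-scores, k - 1)[:k]
    return best[np.argsort(-scores[best])].tolist()

rng = np.random.default_rng(0)
candidate_matrix = rng.normal(size=(1000, 32))  # stand-in for movie embeddings
query = rng.normal(size=32)                     # stand-in for a user embedding

exact = exact_top_k(query, candidate_matrix, k=3)
```

Comparing the overlap between these exact results and the ids returned by the ANN index gives a rough estimate of the recall lost to approximation.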

## Next steps

This concludes the retrieval tutorial.

To expand on what is presented here, have a look at:

1. Learning multi-task models: jointly optimizing for ratings and clicks.
2. Using movie metadata: building a more complex movie model to alleviate cold-start.