##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Building complex features

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/recommenders/examples/movielens"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/recommenders/blob/main/docs/examples/featurization.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/recommenders/blob/main/docs/examples/featurization.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/recommenders/docs/examples/featurization.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

One of the great advantages of using a two-tower retrieval model built with a deep learning framework is the freedom to build rich, flexible feature representations.

In this tutorial we are going to explore how to do this using TensorFlow Recommenders (TFRS).

## The MovieLens dataset

The [MovieLens dataset](https://grouplens.org/datasets/movielens/) gives us multiple features we can play with.

In [None]:
import pprint

import tensorflow_datasets as tfds

ratings = tfds.load("movielens/100k-ratings", split="train")

for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)

Few of these features are immediately usable in a deep learning model; we need to process most of them before the model can use them.

In this tutorial, we're going to cover:

1. how to process categorical features (for example, `movie_id` or `user_occupation`) into embeddings,
2. how to normalize continuous features.

We will use Keras feature processing components throughout. This has the key advantage of keeping feature processing consistent between training and serving: whatever preprocessing we add during training is applied in exactly the same way at serving time. This makes deploying TensorFlow models straightforward, as we can send them raw features at serving time and be sure that they will do the right thing.

## Turning categorical features into embeddings

A [categorical feature](https://en.wikipedia.org/wiki/Categorical_variable) is a feature that does not express a continuous quantity, but rather takes on one of a set of fixed values. For example, the id or the title of a movie are categorical features.

Most deep learning models express these features by turning them into high-dimensional vectors. During model training, the value of that vector is adjusted to help the model predict its objective better.

For example, suppose that our goal is to predict which user is going to watch which movie. To do that, we represent each user and each movie by an embedding vector. Initially, these embeddings will take on random values - but during training, we will adjust them so that embeddings of users and the movies they watch end up closer together.

Taking raw categorical features and turning them into embeddings is normally a two-step process.

### Defining the vocabulary

The first step is to define a vocabulary: a mapping from the raw feature value (say, "doctor") to a nonnegative integer.

We can do this easily using Keras preprocessing layers.

In [None]:
import numpy as np
import tensorflow as tf

movie_title_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()

The layer itself does not have a vocabulary yet, but we can build it using our data.

In [None]:
movie_title_lookup.adapt(ratings.map(lambda x: x["movie_title"]))

print(f"Vocabulary: {movie_title_lookup.get_vocabulary()[:3]}")

Once we have this we can use the layer to translate raw tokens to embedding ids:

In [None]:
movie_title_lookup(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])

Note that the layer's vocabulary includes one (or more!) unknown (or "out of vocabulary", OOV) tokens. This is really handy: it means that the layer can handle categorical values that are not in the vocabulary. In practical terms, this means that the model can continue to learn about and make recommendations even using features that have not been seen during vocabulary construction.
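To make the role of the OOV slot concrete, here is a minimal pure-Python sketch of a vocabulary lookup with a single reserved unknown index. It is a conceptual illustration only, not how `StringLookup` is implemented: the helpers `build_vocabulary` and `lookup`, the choice of index 0 for OOV, and the occupation values are all assumptions made for the example.

```python
from collections import Counter

# Assumption for this sketch: index 0 is reserved for unknown values.
OOV_INDEX = 0


def build_vocabulary(values):
    """Map each distinct value to an integer id, most frequent first."""
    counts = Counter(values)
    # Ids start at 1 because 0 is reserved for out-of-vocabulary values.
    return {
        token: index
        for index, (token, _) in enumerate(counts.most_common(), start=1)
    }


def lookup(vocabulary, values):
    """Translate raw values to ids, falling back to the OOV index."""
    return [vocabulary.get(value, OOV_INDEX) for value in values]


# Illustrative data, not taken from MovieLens.
occupations = ["doctor", "student", "student", "engineer", "student", "doctor"]
vocabulary = build_vocabulary(occupations)

# "astronaut" was never seen while building the vocabulary.
known_and_unknown = lookup(vocabulary, ["student", "doctor", "astronaut"])
print(known_and_unknown)  # [1, 2, 0]
```

Every unseen value shares the single OOV id here, so the model learns one shared embedding for all of them.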

### Using feature hashing

In fact, the `StringLookup` layer allows us to configure multiple OOV indices. If we do that, any raw value that is not in the vocabulary will be deterministically hashed to one of the OOV indices. The more such indices we have, the less likely it is that two different raw feature values will hash to the same OOV index. Consequently, if we have enough such indices, the model should be able to train about as well as a model with an explicit vocabulary, without the disadvantage of having to maintain the token list.
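The idea of hashing unknown values into several OOV indices can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the layer's actual hashing scheme: the helper names, the use of SHA-256, the number of OOV indices, and the toy vocabulary are all hypothetical.

```python
import hashlib

# Assumption for this sketch: ids 0..3 are reserved for unknown values.
NUM_OOV_INDICES = 4


def stable_hash(value):
    """Hash a string to an integer that is identical across processes.

    Python's built-in hash() is randomized per process for strings,
    so it would not give reproducible ids.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


def lookup_with_hashed_oov(vocabulary, value):
    """Return the vocabulary id, or a hashed OOV id for unknown values."""
    if value in vocabulary:
        return vocabulary[value]
    return stable_hash(value) % NUM_OOV_INDICES


# Known tokens receive ids after the reserved OOV range.
vocabulary = {"doctor": 4, "student": 5, "engineer": 6}

ids = [
    lookup_with_hashed_oov(vocabulary, value)
    for value in ["student", "astronaut", "astronaut", "pilot"]
]
print(ids)
```

The same unknown value always maps to the same OOV id, while two different unknown values may or may not collide depending on how many OOV indices are available.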

We can take this to its logical extreme and rely entirely on feature hashing, with no vocabulary at all. This is implemented in the `tf.keras.layers.experimental.preprocessing.Hashing` layer.

In [None]:
# We set up a large number of bins to reduce the chance of hash collisions.
num_hashing_bins = 200_000

movie_title_hashing = tf.keras.layers.experimental.preprocessing.Hashing(
    num_bins=num_hashing_bins
)

We can do the lookup as before without the need to build vocabularies:

In [None]:
movie_title_hashing(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])
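How many bins are enough? A rough back-of-the-envelope estimate, assuming the hash spreads values uniformly and independently, is that `n` distinct values in `m` bins produce about `n * (n - 1) / (2 * m)` colliding pairs. The sketch below evaluates that formula; the figure of 1,700 distinct titles is an illustrative assumption rather than a number taken from the dataset, and `expected_colliding_pairs` is a hypothetical helper.

```python
def expected_colliding_pairs(num_values, num_bins):
    """Expected number of value pairs sharing a bin under uniform hashing."""
    num_pairs = num_values * (num_values - 1) / 2
    # Each pair lands in the same bin with probability 1 / num_bins.
    return num_pairs / num_bins


# Illustrative assumption: roughly 1,700 distinct movie titles.
num_titles = 1_700

estimates = {
    num_bins: expected_colliding_pairs(num_titles, num_bins)
    for num_bins in (2_000, 200_000, 20_000_000)
}
for num_bins, pairs in estimates.items():
    print(f"{num_bins:>10} bins: ~{pairs:.2f} colliding pairs")
```

Under these assumptions, 200,000 bins would still leave a handful of colliding pairs (about seven), so more bins reduce collisions without necessarily eliminating them.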

### Defining the embeddings

Now that we have integer ids, we can use the [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer to turn those into embeddings.

An embedding layer has two dimensions: the first dimension tells us how many distinct categories we can embed; the second tells us how large the vector representing each of them can be.

When creating the embedding layer for movie titles, we are going to set the first value to the size of our title vocabulary (or the number of hashing bins). The second is up to us: the larger it is, the higher the capacity of the model, but the slower it is to fit and serve.

In [None]:
movie_title_embedding = tf.keras.layers.Embedding(
    # Let's use the hashing approach.
    input_dim=num_hashing_bins,
    output_dim=32
)
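Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns, and embedding an id simply selects a row. The pure-Python sketch below illustrates this under simplifying assumptions (small random initial values, no training); `make_embedding_table` and `embed` are hypothetical helpers, not Keras functions.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """Create an input_dim x output_dim table of small random values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
        for _ in range(input_dim)
    ]


def embed(table, ids):
    """Look up one row of the table per integer id."""
    return [table[i] for i in ids]


# 10 categories, each represented by a 4-dimensional vector.
table = make_embedding_table(input_dim=10, output_dim=4)
vectors = embed(table, [3, 7, 3])

print(len(vectors), len(vectors[0]))  # 3 4
```

Ids that repeat (here, 3) receive the same vector, and any id outside `0..input_dim - 1` has no row to select, which is why `input_dim` must cover the whole vocabulary or every hashing bin.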

We can put the two together into a single layer which takes raw text in and yields embeddings.

In [None]:
movie_title_model = tf.keras.Sequential([movie_title_hashing, movie_title_embedding])

Just like that, we can directly get the embeddings for our movie titles:

In [None]:
movie_title_model(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])

We can do the same with user embeddings:

In [None]:
user_id_lookup = tf.keras.layers.experimental.preprocessing.Hashing(
    num_bins=num_hashing_bins)
user_id_embedding = tf.keras.layers.Embedding(num_hashing_bins, 32)

user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])

## Normalizing continuous features

Continuous features also need normalization. For example, the `timestamp` feature is far too large to be used directly in a deep model:

In [None]:
for x in ratings.take(3).as_numpy_iterator():
  print(f"Timestamp: {x['timestamp']}.")

We need to process it before we can use it. While there are many ways in which we can do this, discretization and standardization are two common ones.

### Standardization

[Standardization](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)) rescales features to normalize their range by subtracting the feature's mean and dividing by its standard deviation. It is a common preprocessing transformation.

This can be easily accomplished using the [`tf.keras.layers.experimental.preprocessing.Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/Normalization) layer:

In [None]:
timestamp_normalization = tf.keras.layers.experimental.preprocessing.Normalization()
timestamp_normalization.adapt(ratings.map(lambda x: x["timestamp"]).batch(1024))

for x in ratings.take(3).as_numpy_iterator():
  print(f"Normalized timestamp: {timestamp_normalization(x['timestamp'])}.")
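For intuition, the same transformation can be written by hand: estimate the mean and standard deviation once, then apply `(x - mean) / std` to every value. The sketch below uses a handful of made-up timestamps and the population standard deviation; both choices, and the helper names, are assumptions for illustration.

```python
import statistics


def fit_standardizer(values):
    """Estimate the mean and (population) standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(value, mean, std):
    """Rescale a value to zero mean and unit variance."""
    return (value - mean) / std


# Illustrative timestamps in seconds since the epoch.
timestamps = [879024327, 875654590, 882075110, 883601324, 879270459]

mean, std = fit_standardizer(timestamps)
normalized = [standardize(t, mean, std) for t in timestamps]

for raw, scaled in zip(timestamps, normalized):
    print(f"{raw} -> {scaled:+.3f}")
```

After the transformation the values are centered on zero with a spread of about one, which is a far friendlier scale for gradient-based training than raw timestamps in the hundreds of millions.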

### Discretization

Another common transformation is to turn a continuous feature into a number of categorical features. This makes good sense if we have reasons to suspect that a feature's effect is non-continuous.

To do this, we first need to establish the boundaries of the buckets we will use for discretization. The easiest way is to identify the minimum and maximum value of the feature, and divide the resulting interval equally:

In [None]:
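# Reduce over the dataset to find the largest timestamp (all are positive).
max_timestamp = ratings.map(lambda x: x["timestamp"]).reduce(
    tf.cast(0, tf.int64), tf.maximum).numpy()
# Start from the largest int64 so that any timestamp replaces the initial value.
min_timestamp = ratings.map(lambda x: x["timestamp"]).reduce(
    tf.constant(tf.int64.max, dtype=tf.int64), tf.minimum).numpy()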

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000)

print(f"Buckets: {timestamp_buckets[:3]}")

Given the bucket boundaries we can transform timestamps into embeddings:

In [None]:
timestamp_embedding_model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
  tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32)
])

for timestamp in ratings.take(1).map(lambda x: x["timestamp"]).batch(1).as_numpy_iterator():
  print(f"Timestamp embedding: {timestamp_embedding_model(timestamp)}.")
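To see what bucketing does to individual values, here is a small standard-library sketch that builds evenly spaced boundaries and assigns each value a bucket id with `bisect`. It is meant to mirror the idea of discretization rather than reproduce the Keras layer exactly; the helper names and the toy range 0 to 100 are assumptions.

```python
import bisect


def make_boundaries(minimum, maximum, num):
    """Return `num` evenly spaced boundaries from minimum to maximum."""
    step = (maximum - minimum) / (num - 1)
    return [minimum + i * step for i in range(num)]


def bucketize(value, boundaries):
    """Return the id of the bucket that `value` falls into."""
    # Values below the first boundary get id 0; values at or above the
    # last boundary get id len(boundaries).
    return bisect.bisect_right(boundaries, value)


boundaries = make_boundaries(0.0, 100.0, num=5)  # [0, 25, 50, 75, 100]
bucket_ids = [
    bucketize(value, boundaries)
    for value in (-3.0, 0.0, 10.0, 50.0, 99.0, 100.0, 250.0)
]
print(bucket_ids)  # [0, 1, 1, 3, 4, 5, 5]
```

With `N` boundaries there are `N + 1` possible bucket ids in this sketch, so an embedding table needs at least `N + 1` rows; the `len(timestamp_buckets) + 2` rows allocated in the cell above leave a little headroom.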

## Training a model


With all these components present we can start building a model.

First, we'll build a user model that incorporates both timestamp and user id information:

In [None]:
class UserModel(tf.keras.Model):
  
  def __init__(self):
    super().__init__()

    self._user_id_model = user_id_model
    self._timestamp_embedding_model = timestamp_embedding_model
    self._timestamp_normalization_model = timestamp_normalization
    self._projection = tf.keras.layers.Dense(32)

  def call(self, inputs):

    # Take the input dictionary, pass it through each input layer,
    # and concatenate the result.
    features = tf.concat([
        self._user_id_model(inputs["user_id"]),
        self._timestamp_embedding_model(inputs["timestamp"]),
        self._timestamp_normalization_model(inputs["timestamp"])
    ], axis=1)

    # We need to add a dense layer to ensure that the output dimension
    # of the user model is the same as that of the movie model.
    return self._projection(features)
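It can help to trace the shapes through this model for a single example. The sketch below assumes the two embeddings are 32-dimensional and that the normalized timestamp contributes a single scalar, giving 65 concatenated features that a linear projection maps back to 32. The random values, the `dense` helper, and the weight initialization are purely illustrative.

```python
import random

EMBEDDING_DIM = 32
rng = random.Random(0)


def dense(features, weights, bias):
    """Apply a linear layer; weights[j] holds the weights of output unit j."""
    return [
        sum(f * w for f, w in zip(features, unit_weights)) + b
        for unit_weights, b in zip(weights, bias)
    ]


# Stand-ins for the outputs of the three feature sub-models.
user_id_vector = [rng.gauss(0, 1) for _ in range(EMBEDDING_DIM)]
timestamp_bucket_vector = [rng.gauss(0, 1) for _ in range(EMBEDDING_DIM)]
normalized_timestamp = [0.37]  # assumption: one standardized scalar

# Concatenate along the feature axis: 32 + 32 + 1 = 65 features.
features = user_id_vector + timestamp_bucket_vector + normalized_timestamp

# Project back down so the user and movie vectors have the same size.
weights = [
    [rng.gauss(0, 0.1) for _ in range(len(features))]
    for _ in range(EMBEDDING_DIM)
]
bias = [0.0] * EMBEDDING_DIM
user_vector = dense(features, weights, bias)

print(len(features), len(user_vector))  # 65 32
```

The projection matters because the retrieval score is a dot product between user and movie vectors, which is only defined when both have the same dimensionality.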

We can then set up the full model.

In [None]:
import tensorflow_recommenders as tfrs


class Model(tfrs.Model):

  def __init__(self):
    super().__init__()
    self.user_model = UserModel()
    self.movie_model = movie_title_model
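    # `movies` is loaded in the data preparation cell below; that cell must
    # run before `Model()` is instantiated.
    self.task = tfrs.tasks.RetrievalTask(
        corpus_metrics=tfrs.metrics.FactorizedTopK(
          candidates=movies.batch(128).map(lambda x: x["movie_title"]).map(self.movie_model)
        )
    )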

  def compute_loss(self, features, training=False):
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(
        {k: v for k, v in features.items() if k in ("timestamp", "user_id")})
    # And pick out the movie features and pass them into the movie model,
    # getting embeddings back.
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings)
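Two-tower retrieval models are commonly trained with an in-batch softmax loss: every user in a batch is scored against every movie in the batch, and the movie that user actually watched is treated as the correct class. The sketch below illustrates that general idea in plain Python. It is not the TFRS implementation, and the library's exact loss, reduction, and negative-sampling behaviour may differ; the function names and toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Mean cross-entropy where query i's positive is candidate i."""
    total = 0.0
    for i, query in enumerate(query_embeddings):
        scores = [dot(query, candidate) for candidate in candidate_embeddings]
        # Log-sum-exp, shifted by the maximum for numerical stability.
        top = max(scores)
        log_normalizer = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_normalizer - scores[i]
    return total / len(query_embeddings)


candidates = [[1.0, 0.0], [0.0, 1.0]]
aligned_queries = [[1.0, 0.0], [0.0, 1.0]]     # each query matches its positive
misaligned_queries = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the other item

aligned_loss = in_batch_softmax_loss(aligned_queries, candidates)
misaligned_loss = in_batch_softmax_loss(misaligned_queries, candidates)
print(aligned_loss < misaligned_loss)  # True
```

Minimizing a loss of this shape pulls each user embedding toward the movies that user watched and away from the other movies in the batch.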

Prepare the data:

In [None]:
movies = tfds.load("movielens/100k-movies", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"]
})
movies = movies.map(lambda x: {
    "movie_title": x["movie_title"],
})

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

In [None]:
model = Model()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Fit:

In [None]:
model.fit(train.batch(4096), epochs=3)

And evaluate:

In [None]:
model.evaluate(test.batch(4096))

## Serving a model

Now that we have a model, let's prepare it for serving recommendations.

The simplest way to do this is by brute force: when given a query, compute scores across all possible candidates, and return the top ones. Of course, this is only feasible when the candidate set is relatively small; however, it will serve for demonstration purposes here.
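As a rough illustration of what brute-force retrieval means, the following sketch scores one query vector against every candidate with a dot product and keeps the `k` best. The movie identifiers, the vectors, and the helper `brute_force_top_k` are made up for the example and are not part of TFRS.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query_embedding, candidate_embeddings, identifiers, k):
    """Score every candidate and return the k best (score, identifier) pairs."""
    scored = [
        (dot(query_embedding, candidate), identifier)
        for candidate, identifier in zip(candidate_embeddings, identifiers)
    ]
    # nlargest returns the results sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical candidates and their embeddings.
identifiers = ["Movie A", "Movie B", "Movie C", "Movie D"]
candidate_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

query_embedding = [1.0, 0.5]
top = brute_force_top_k(query_embedding, candidate_embeddings, identifiers, k=2)
print(top)  # [(1.5, 'Movie C'), (1.0, 'Movie A')]
```

The cost grows linearly with the number of candidates, which is why this approach only suits small candidate sets.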

We first precompute the candidate embeddings:

In [None]:
movie_titles = np.concatenate(list(
    movies.batch(1_000)
    .map(lambda x: x["movie_title"])
    .as_numpy_iterator()
))
movie_embeddings = np.concatenate(list(
    movies.batch(1_000)
    .map(lambda x: model.movie_model(x["movie_title"]))
    .as_numpy_iterator()
))

And then use those to create a new layer for serving recommendations using the `tfrs.layers.ann.BruteForce` layer.

When constructing the layer, we will pass in the query model we just trained. By doing so, we can send raw query features to the model and have it automatically transform them into query embeddings. This is operationally very convenient, as it (1) makes sure we use the same feature processing in serving as we do in training, and (2) eliminates the need to maintain separate query models.

In [None]:
serving_model = tfrs.layers.ann.BruteForce(model.user_model)

Once we have the layer, we need to index it with the candidate embeddings we want to use. When calling the `index` method we pass movie titles as the identifiers argument. That way, when we issue a query we will get the movie titles of top recommended movies back (instead of more opaque identifiers we would have to post-process later):

In [None]:
movie_titles = movies.map(lambda x: x["movie_title"])

serving_model.index(
    candidates=movie_titles.batch(128).map(model.movie_model),
    identifiers=movie_titles)

With this, we're ready to get our recommendations:

In [None]:
scores, titles = serving_model(
    {"user_id": np.array(["42"]), "timestamp": np.array([879024327])},
    num_candidates=3
)
print(f"Top recommendations: {titles.numpy().tolist()}")

Once saved and restored, this model can be used in any TensorFlow serving infrastructure (such as [TensorFlow Serving](https://www.tensorflow.org/tfx/tutorials/serving/rest_simple) or your own microservice).

In [None]:
import os
import tempfile

tmp = tempfile.TemporaryDirectory()

path = os.path.join(tmp.name, "model")
tf.saved_model.save(serving_model, path)

In [None]:
loaded = tf.keras.models.load_model(path)

scores, titles = loaded(
    {"user_id": np.array(["42"]), "timestamp": np.array([879024327])},
)
print(f"Top recommendations: {titles.numpy().tolist()[0][:3]}")