https://www.tensorflow.org/recommenders/examples/deep_recommenders

In [the featurization tutorial](featurization) we incorporated multiple features into our models, but the models consist of only an embedding layer. We can add more dense layers to our models to increase their expressive power.

In general, deeper models are capable of learning more complex patterns than shallower models. For example, our [user model](featurization#user_model) incorporates user ids and timestamps to model user preferences at a point in time. A shallow model (say, a single embedding layer) may only be able to learn the simplest relationships between those features and movies: a given movie is most popular around the time of its release, and a given user generally prefers horror movies to comedies. To capture more complex relationships, such as user preferences evolving over time, we may need a deeper model with multiple stacked dense layers.

Of course, complex models also have their disadvantages. The first is computational cost, as larger models require both more memory and more computation to fit and serve. The second is the requirement for more data: in general, more training data is needed to take advantage of deeper models. With more parameters, deep models might overfit or even simply memorize the training examples instead of learning a function that can generalize. Finally, training deeper models may be harder, and more care needs to be taken in choosing settings like regularization and learning rate.

Finding a good architecture for a real-world recommender system is a complex art, requiring good intuition and careful [hyperparameter tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization). For example, factors such as the depth and width of the model, activation function, learning rate, and optimizer can radically change the performance of the model. Modelling choices are further complicated by the fact that good offline evaluation metrics may not correspond to good online performance, and that the choice of what to optimize for is often more critical than the choice of model itself.

Nevertheless, effort put into building and fine-tuning larger models often pays off. In this tutorial, we will illustrate how to build deep retrieval models using TensorFlow Recommenders. We'll do this by building progressively more complex models to see how this affects model performance.

## Preliminaries

We first import the necessary packages.

In [1]:
import os
import tempfile

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

plt.style.use('seaborn-whitegrid')

In this tutorial we will use the models from [the featurization tutorial](featurization) to generate embeddings. Hence we will only be using the viewer id, viewer age, and broadcaster id features.

In [3]:
def load_data_file_cold(file, stats):
    print('loading file:' + file)
    training_df = pd.read_csv(
        file,
        skiprows=[0],
        names=[
            "viewer", "broadcaster", "viewer_age", "viewer_gender",
            "viewer_longitude", "viewer_latitude", "viewer_lang", "viewer_country",
            "broadcaster_age", "broadcaster_gender", "broadcaster_longitude",
            "broadcaster_latitude", "broadcaster_lang", "broadcaster_country",
            "duration", "viewer_network", "broadcaster_network", "count",
        ],
        # np.unicode and np.int were aliases of the builtins and have been
        # removed from recent NumPy releases, so use str and int directly.
        dtype={
            'viewer': str,
            'broadcaster': str,
            'viewer_age': np.single,
            'viewer_gender': str,
            'viewer_longitude': np.single,
            'viewer_latitude': np.single,
            'viewer_lang': str,
            'viewer_country': str,
            'broadcaster_age': np.single,
            'broadcaster_longitude': np.single,
            'broadcaster_latitude': np.single,
            'broadcaster_lang': str,
            'broadcaster_country': str,
            'viewer_network': str,
            'broadcaster_network': str,
            'count': int
        })

    values = {
        'viewer': 'unknown',
        'broadcaster': 'unknown',
        'viewer_age': 30,
        'viewer_gender': 'unknown',
        'viewer_longitude': 0,
        'viewer_latitude': 0,
        'viewer_lang': 'unknown',
        'viewer_country': 'unknown',
        'broadcaster_age': 30,
        'broadcaster_longitude': 0,
        'broadcaster_latitude': 0,
        'broadcaster_lang': 'unknown',
        'broadcaster_country': 'unknown',
        'duration': 0,
        'viewer_network': 'unknown',
        'broadcaster_network': 'unknown',
        'count': 0
    }
    training_df.fillna(value=values, inplace=True)
#     print(training_df.head(10))
#     print(training_df.iloc[-10:])
#     stats.send_stats('data-size', len(training_df.index))

    sampled_df = training_df.sample(frac=0.1)
    print(sampled_df.head(10))
    print(sampled_df.iloc[-10:])
    return sampled_df

def load_training_data_cold(file, stats):
    ratings_df = load_data_file_cold(file, stats)
    print('creating data set')
    training_ds = (
        tf.data.Dataset.from_tensor_slices(
            ({
                "viewer": tf.cast(
                    ratings_df['viewer'].values,
                    tf.string),
                "viewer_gender": tf.cast(
                    ratings_df['viewer_gender'].values,
                    tf.string),
                "viewer_lang": tf.cast(
                    ratings_df['viewer_lang'].values,
                    tf.string),
                "viewer_country": tf.cast(
                    ratings_df['viewer_country'].values,
                    tf.string),
                "viewer_age": tf.cast(
                    ratings_df['viewer_age'].values,
                    tf.int32),
                "viewer_longitude": tf.cast(
                    ratings_df['viewer_longitude'].values,
                    tf.float16),
                "viewer_latitude": tf.cast(
                    ratings_df['viewer_latitude'].values,
                    tf.float16),
                "broadcaster": tf.cast(
                    ratings_df['broadcaster'].values,
                    tf.string),
                "viewer_network": tf.cast(
                    ratings_df['viewer_network'].values,
                    tf.string),
                "broadcaster_network": tf.cast(
                    ratings_df['broadcaster_network'].values,
                    tf.string),
            })))

    return training_ds

In [4]:
ratings = load_training_data_cold(file="a3d86f3b-eb45-4641-b05d-30dff7423e6b.csv", stats="")

ratings = ratings.map(lambda x: {
    "broadcaster": x["broadcaster"],
    "viewer": x["viewer"],   
    "viewer_age": x["viewer_age"],
})

broadcaster = ratings.map(lambda x: x["broadcaster"])

loading file:a3d86f3b-eb45-4641-b05d-30dff7423e6b.csv
                   viewer       broadcaster  viewer_age viewer_gender  \
1938231     pof:333553197     pof:322045884        48.0          male   
3207580  meetme:291011164  meetme:312245706        30.0          male   
2528752     pof:326740458  meetme:266264990        34.0        female   
5037884    skout:69436719   skout:178518892        45.0          male   
5224811  meetme:278962360  meetme:275412896        21.0          male   
2677221  meetme:316544521     pof:192040040        66.0          male   
4143312     pof:333471872     pof:322761842        31.0          male   
2444346     pof:307744471  meetme:152006167        39.0          male   
5304332  meetme:317948716  meetme:317712363        25.0          male   
1868980  meetme:201849202  meetme:314739694        36.0          male   

         viewer_longitude  viewer_latitude viewer_lang viewer_country  \
1938231        -88.000000        43.000000          en             US

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'

We also do some housekeeping to prepare feature vocabularies.

In [5]:
# Discretization
max_viewer_age = ratings.map(lambda x: x["viewer_age"]).reduce(
    tf.cast(0, tf.int32), tf.maximum).numpy().max()
min_viewer_age = ratings.map(lambda x: x["viewer_age"]).reduce(
    np.int32(100), tf.minimum).numpy().min()

viewer_age_buckets = np.linspace(
    min_viewer_age, max_viewer_age, num=10)
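
To make the role of these boundaries concrete, here is a minimal pure-Python sketch of bucketing by boundaries. The age range, the sample ages, and the helper names `linspace` and `bucket_index` are made up for illustration. The sketch assumes a left-inclusive convention, where a value equal to a boundary falls into the bucket above it.

```python
from bisect import bisect_right


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def bucket_index(value, boundaries):
    """Count the boundaries <= value; the result is in 0..len(boundaries)."""
    return bisect_right(boundaries, value)


# Hypothetical age range of 18 to 90, split by 10 boundaries.
boundaries = linspace(18, 90, num=10)  # 18.0, 26.0, 34.0, ..., 90.0
ages = [17, 18, 30, 66, 90, 120]
indices = [bucket_index(age, boundaries) for age in ages]
print(indices)  # [0, 1, 2, 7, 10, 10]
```

Ten boundaries produce eleven possible bucket indices (0 through 10), which is why an embedding table placed after such a layer needs `len(viewer_age_buckets) + 1` rows.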


In [6]:
user_ids = ratings.batch(100_000).map(lambda x: x["viewer"])
unique_user_ids = np.unique(np.concatenate(list(user_ids)))
len(unique_user_ids)

249546

In [7]:
broadcaster_ids = ratings.batch(100_000).map(lambda x: x["broadcaster"])
unique_broadcaster_ids = np.unique(np.concatenate(list(broadcaster_ids)))
len(unique_broadcaster_ids)

69801

## Model definition

### Query model - UserModel

We start with the user model defined in [the featurization tutorial](featurization) as the first layer of our model, tasked with converting raw input examples into feature embeddings.

In [16]:
viewer_ages = np.concatenate(list(ratings.map(lambda x: x["viewer_age"]).batch(1000)))

In [25]:
class UserModel(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.StringLookup(
                vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
        ])
        
        self.viewer_age_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.Discretization(viewer_age_buckets.tolist()),
            tf.keras.layers.Embedding(len(viewer_age_buckets) + 1, 32),
        ])
        
        self.normalized_viewer_age = tf.keras.layers.experimental.preprocessing.Normalization(
            axis=None
        )

        self.normalized_viewer_age.adapt(viewer_ages)

    def call(self, inputs):
        # Take the input dictionary, pass it through each input layer,
        # and concatenate the result.
        return tf.concat([
            self.user_embedding(inputs["viewer"]),
            self.viewer_age_embedding(inputs["viewer_age"]),
            tf.reshape(self.normalized_viewer_age(inputs["viewer_age"]), (-1, 1)),
        ], axis=1)
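
The `+ 1` in the embedding sizes above leaves room for ids that are not in the vocabulary. The following pure-Python sketch shows the idea; it is not the Keras implementation. It assumes a single out-of-vocabulary slot at index 0 with known ids numbered from 1, and the helper names and the three viewer ids are hypothetical.

```python
def build_lookup(vocabulary):
    """Map each known token to 1..len(vocabulary); index 0 is kept for unknowns."""
    return {token: index + 1 for index, token in enumerate(vocabulary)}


def lookup(token, table):
    """Return the token's index, or 0 if the token was never seen."""
    return table.get(token, 0)


# Hypothetical vocabulary of three viewer ids.
vocabulary = ["meetme:1", "pof:2", "skout:3"]
table = build_lookup(vocabulary)

indices = [lookup(token, table) for token in ["pof:2", "meetme:1", "never-seen"]]
num_embedding_rows = len(vocabulary) + 1  # one extra row for unknown ids

print(indices)             # [2, 1, 0]
print(num_embedding_rows)  # 4
```

Every unseen id shares the row at index 0, so the model can still produce an embedding for a viewer it never encountered during training.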

Defining deeper models will require us to stack more layers on top of this first input. A progressively narrower stack of layers, separated by an activation function, is a common pattern:

```
                            +----------------------+
                            |      128 x 64        |
                            +----------------------+
                                       | relu
                          +--------------------------+
                          |        256 x 128         |
                          +--------------------------+
                                       | relu
                        +------------------------------+
                        |          ... x 256           |
                        +------------------------------+
```
Since the expressive power of deep linear models is no greater than that of shallow linear models, we use ReLU activations for all but the last hidden layer. The final hidden layer does not use any activation function: using an activation function would limit the output space of the final embeddings and might negatively impact the performance of the model. For instance, if ReLUs are used in the projection layer, all components in the output embedding would be non-negative.
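
Both claims can be checked on a toy example. The sketch below uses small hand-picked integer matrices (all values are arbitrary, chosen only for illustration) to show that two stacked linear layers equal one linear layer, that a ReLU in between changes the result, and that a ReLU on the output removes negative components.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(matrix):
    """Apply max(0, v) to every entry."""
    return [[max(0, v) for v in row] for row in matrix]


x = [[1, 2]]                      # one example with two features
w1 = [[1, 0, -1], [2, 1, 0]]      # first layer: 2 -> 3 units
w2 = [[1, -1], [0, 2], [3, 1]]    # second layer: 3 -> 2 units

two_linear_layers = matmul(matmul(x, w1), w2)
one_linear_layer = matmul(x, matmul(w1, w2))     # same map, pre-multiplied
with_relu_between = matmul(relu(matmul(x, w1)), w2)
relu_on_output = relu(two_linear_layers)

print(two_linear_layers)   # [[2, -2]]
print(one_linear_layer)    # [[2, -2]]
print(with_relu_between)   # [[5, -1]]
print(relu_on_output)      # [[2, 0]]
```

Without the nonlinearity the two weight matrices collapse into their product, so the extra layer adds parameters but no expressive power. Applying ReLU to the final output clips the negative component to zero, which is the restriction on the embedding space described above.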

We're going to try something similar here. To make experimentation with different depths easy, let's define a model whose depth (and width) is defined by a set of constructor parameters. 

In [26]:
class QueryModel(tf.keras.Model):
    """Model for encoding user queries."""

    def __init__(self, layer_sizes):
        """Model for encoding user queries.

        Args:
          layer_sizes:
            A list of integers where the i-th entry represents the number of units
            the i-th layer contains.
        """
        super().__init__()

        # We first use the user model for generating embeddings.
        self.embedding_model = UserModel()

        # Then construct the layers.
        self.dense_layers = tf.keras.Sequential()

        # Use the ReLU activation for all but the last layer.
        for layer_size in layer_sizes[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))

        # No activation for the last layer.
        for layer_size in layer_sizes[-1:]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size))

    def call(self, inputs):
        feature_embedding = self.embedding_model(inputs)
        return self.dense_layers(feature_embedding)

The `layer_sizes` parameter gives us the depth and width of the model. We can vary it to experiment with shallower or deeper models.
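
As a rough guide to how `layer_sizes` affects model size, the sketch below counts the weights and biases of the dense stack alone. It assumes the 65-dimensional input produced by `UserModel` (two 32-dimensional embeddings plus one normalized scalar) and ignores the embedding tables; the helper name `dense_parameter_count` is hypothetical.

```python
def dense_parameter_count(input_dim, layer_sizes):
    """Total weights and biases of fully connected layers stacked in order."""
    total = 0
    for units in layer_sizes:
        total += input_dim * units + units  # weight matrix plus bias vector
        input_dim = units                   # this layer's output feeds the next
    return total


# UserModel concatenates two 32-dimensional embeddings and one scalar.
input_dim = 32 + 32 + 1

counts = {
    tuple(sizes): dense_parameter_count(input_dim, sizes)
    for sizes in ([32], [64, 32], [128, 64, 32])
}
for sizes, count in counts.items():
    print(sizes, count)
# (32,) 2112
# (64, 32) 6304
# (128, 64, 32) 18784
```

Each additional, wider layer roughly triples the dense parameter count in this setup, which is one reason deeper variants need more data and more careful tuning.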

### Candidate model - BroadcasterModel

We can adopt the same approach for the candidate model. Again, we start with a `BroadcasterModel`, patterned on the `MovieModel` from the [featurization](featurization) tutorial:

In [27]:
def split_on_colons(text):
    return tf.strings.split(text, sep=":")

In [33]:
class BroadcasterModel(tf.keras.Model):

	def __init__(self):
		super().__init__()
		embedding_dimension = 32
		max_tokens = 32

		self.broadcaster_embedding = tf.keras.Sequential([
			tf.keras.layers.experimental.preprocessing.StringLookup(
				vocabulary=unique_broadcaster_ids, mask_token=None),
			tf.keras.layers.Embedding(len(unique_broadcaster_ids) + 1, embedding_dimension)
		])

		self.broadcaster_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
			standardize= None, split=split_on_colons, max_tokens = max_tokens)

		self.broadcaster_text_embedding = tf.keras.Sequential([
			self.broadcaster_vectorizer,
			tf.keras.layers.Embedding(max_tokens, embedding_dimension, mask_zero=True),
			tf.keras.layers.GlobalAveragePooling1D(),
		])

		self.broadcaster_vectorizer.adapt(broadcaster)

	def call(self, broadcaster):
		return tf.concat([
			self.broadcaster_embedding(broadcaster),
			self.broadcaster_text_embedding(broadcaster),
		], axis=1)
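
To see what the text branch receives, here is a pure-Python sketch of colon splitting followed by a capped vocabulary. It is an illustration rather than the `TextVectorization` implementation: it assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and the ids and the helper `build_vocabulary` are made up.

```python
from collections import Counter


def split_on_colons(text):
    return text.split(":")


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; indices 0 and 1 are reserved."""
    counts = Counter(token for text in texts for token in split_on_colons(text))
    kept = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(kept)}


# Hypothetical broadcaster ids in the "network:number" format.
ids = ["pof:1", "pof:2", "pof:3", "meetme:4", "meetme:5", "skout:6", "skout:7"]
vocab = build_vocabulary(ids, max_tokens=5)
encoded = [[vocab.get(token, 1) for token in split_on_colons(text)] for text in ids]

print(vocab)    # {'pof': 2, 'meetme': 3, 'skout': 4}
print(encoded)  # [[2, 1], [2, 1], [2, 1], [3, 1], [3, 1], [4, 1], [4, 1]]
```

With a small `max_tokens`, the frequent network prefixes get their own indices while the numeric parts map to the shared out-of-vocabulary index. Under these assumptions the text branch mostly learns a per-network embedding, which complements the per-broadcaster embedding from the lookup branch.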

And expand it with hidden layers:

In [34]:
class CandidateModel(tf.keras.Model):
	"""Model for encoding broadcasters."""

	def __init__(self, layer_sizes):
		"""Model for encoding broadcasters.
	
		Args:
		  layer_sizes:
			A list of integers where the i-th entry represents the number of units
			the i-th layer contains.
		"""
		super().__init__()

		self.embedding_model = BroadcasterModel()

		# Then construct the layers.
		self.dense_layers = tf.keras.Sequential()

		# Use the ReLU activation for all but the last layer.
		for layer_size in layer_sizes[:-1]:
			self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))

		# No activation for the last layer.
		for layer_size in layer_sizes[-1:]:
			self.dense_layers.add(tf.keras.layers.Dense(layer_size))

	def call(self, inputs):
		feature_embedding = self.embedding_model(inputs)
		return self.dense_layers(feature_embedding)

### Combined model

With both `QueryModel` and `CandidateModel` defined, we can put together a combined model and implement our loss and metrics logic. To make things simple, we'll enforce that the model structure is the same across the query and candidate models.

In [35]:
class FinalModel(tfrs.models.Model):

    def __init__(self, layer_sizes):
        super().__init__()
        self.query_model = QueryModel(layer_sizes)
        self.candidate_model = CandidateModel(layer_sizes)
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                # `broadcaster` is the dataset of broadcaster ids defined above.
                candidates=broadcaster.batch(256).map(self.candidate_model),
            ),
        )

    def compute_loss(self, features, training=False):
        # We only pass the viewer id and viewer age features into the query
        # model. This is to ensure that the training inputs would have the same
        # keys as the query inputs. Otherwise the discrepancy in input structure
        # would cause an error when loading the query model after saving it.
        query_embeddings = self.query_model({
            "viewer": features["viewer"],
            "viewer_age": features["viewer_age"],
        })
        broadcaster_embeddings = self.candidate_model(features["broadcaster"])

        return self.task(
            query_embeddings, broadcaster_embeddings, compute_metrics=not training)
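
The top-K metric reported later can be understood with a small pure-Python sketch. It is a conceptual illustration, not the `tfrs.metrics.FactorizedTopK` implementation: it scores every candidate by its dot product with the query embedding and checks whether the true candidate ranks among the K best. The embeddings and the helper `top_k_accuracy` are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k_accuracy(query_embeddings, true_indices, candidate_embeddings, k):
    """Fraction of queries whose true candidate is among the k highest scores."""
    hits = 0
    for query, true_index in zip(query_embeddings, true_indices):
        scores = [dot(query, candidate) for candidate in candidate_embeddings]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(query_embeddings)


# Hypothetical 2-dimensional embeddings: four candidates, three queries.
candidates = [[1, 0], [0, 1], [1, 1], [-1, 0]]
queries = [[3, 1], [1, 3], [1, -1]]
true_indices = [2, 1, 0]

top_1 = top_k_accuracy(queries, true_indices, candidates, k=1)
top_2 = top_k_accuracy(queries, true_indices, candidates, k=2)
print(top_1, top_2)
```

Here two of the three queries rank their true candidate first, and all three rank it within the top two, so accuracy can only stay the same or grow as K increases.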

## Training the model

### Prepare the data

We first split the data into a training set and a testing set.

In [36]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(16384)
cached_test = test.batch(4096).cache()
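
The split relies on the shuffled order being fixed (`reshuffle_each_iteration=False`), so that `take` and `skip` always see the same sequence and the two sets never overlap. A plain-Python analogue, using a hypothetical helper `shuffled_split` and integer ids in place of real examples, looks like this:

```python
import random


def shuffled_split(items, train_size, test_size, seed):
    """Shuffle once with a fixed seed, then slice disjoint train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:train_size]                          # like .take(train_size)
    test = shuffled[train_size:train_size + test_size]     # like .skip(...).take(...)
    return train, test


train_ids, test_ids = shuffled_split(range(100), train_size=80, test_size=20, seed=42)
again_train, again_test = shuffled_split(range(100), train_size=80, test_size=20, seed=42)

print(len(train_ids), len(test_ids))
print(set(train_ids).isdisjoint(test_ids))
```

Unlike this sketch, `tf.data` shuffles through a buffer, so when the buffer is smaller than the dataset the shuffle is only approximate.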

### Shallow model

We're ready to try out our first, shallow, model!

In [37]:
# validation_freq=5 below runs validation only every 5th epoch, so num_epochs
# must be at least 5. With the previous value of 3, validation never ran and
# the history lookup below raised
# KeyError: 'val_factorized_top_k/top_100_categorical_accuracy'.
# 15 epochs gives three validation points; adjust as needed.
num_epochs = 15

model = FinalModel([32])
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

one_layer_history = model.fit(
    cached_train,
    validation_data=cached_test,
    validation_freq=5,
    epochs=num_epochs,
    verbose=0)

accuracy = one_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"][-1]
print(f"Top-100 accuracy: {accuracy:.2f}.")

This gives us a top-100 accuracy of around 0.27. We can use this as a reference point for evaluating deeper models.



### Deeper model

What about a deeper model with two layers?

In [None]:
model = FinalModel([64, 32])
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

two_layer_history = model.fit(
    cached_train,
    validation_data=cached_test,
    validation_freq=5,
    epochs=num_epochs,
    verbose=0)

accuracy = two_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"][-1]
print(f"Top-100 accuracy: {accuracy:.2f}.")

The accuracy here is 0.29, quite a bit better than the shallow model.

We can plot the validation accuracy curves to illustrate this:

In [None]:
num_validation_runs = len(one_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"])
epochs = [(x + 1)* 5 for x in range(num_validation_runs)]

plt.plot(epochs, one_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"], label="1 layer")
plt.plot(epochs, two_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"], label="2 layers")
plt.title("Accuracy vs epoch")
plt.xlabel("epoch")
plt.ylabel("Top-100 accuracy");
plt.legend()

Even early on in the training, the larger model has a clear and stable lead over the shallow model, suggesting that adding depth helps the model capture more nuanced relationships in the data.

However, even deeper models are not necessarily better. The following model extends the depth to three layers:

In [None]:
model = FinalModel([128, 64, 32])
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

three_layer_history = model.fit(
    cached_train,
    validation_data=cached_test,
    validation_freq=5,
    epochs=num_epochs,
    verbose=0)

accuracy = three_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"][-1]
print(f"Top-100 accuracy: {accuracy:.2f}.")


In fact, we don't see improvement over the shallow model:

In [None]:
plt.plot(epochs, one_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"], label="1 layer")
plt.plot(epochs, two_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"], label="2 layers")
plt.plot(epochs, three_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"], label="3 layers")
plt.title("Accuracy vs epoch")
plt.xlabel("epoch")
plt.ylabel("Top-100 accuracy");
plt.legend()

This is a good illustration of the fact that deeper and larger models, while capable of superior performance, often require very careful tuning. For example, throughout this tutorial we used a single, fixed learning rate. Alternative choices may give very different results and are worth exploring. 
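
One simple way to explore such alternatives is to enumerate a small grid of configurations and train one model per entry. The sketch below only builds the grid; the learning-rate values other than 0.1 are hypothetical, and each configuration would still need to be passed to `FinalModel(...)` and `tf.keras.optimizers.Adagrad(...)` as in the cells above.

```python
from itertools import product

layer_size_options = [[32], [64, 32], [128, 64, 32]]
learning_rates = [0.01, 0.1, 0.5]  # hypothetical values; 0.1 is the one used above

# One dictionary per (architecture, learning rate) combination.
configurations = [
    {"layer_sizes": sizes, "learning_rate": rate}
    for sizes, rate in product(layer_size_options, learning_rates)
]

for config in configurations:
    print(config)
```

Three architectures and three learning rates already give nine training runs, so in practice the grid is usually kept small or replaced by a random or adaptive search.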

With appropriate tuning and sufficient data, the effort put into building larger and deeper models is in many cases well worth it: larger models can lead to substantial improvements in prediction accuracy.



## Next Steps

In this tutorial we expanded our retrieval model with dense layers and activation functions. To see how to create a model that can perform not only retrieval tasks but also rating tasks, take a look at [the multitask tutorial](multitask).