##### Copyright 2025 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Unified Embedding Tables (UET)

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/recommenders/examples/uet"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/recommenders/blob/main/docs/examples/uet.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/recommenders/blob/main/docs/examples/uet.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/recommenders/docs/examples/uet.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This tutorial demonstrates how to use Unified Embedding to improve performance in recommendation systems.

# Background

Embeddings are the first and largest component of most recommender models. In a typical recommendation problem, most of the inputs are categorical features that must be embedded as a high-dimensional vector before getting fed into a neural network. The embedding layer is a crucial information bottleneck in the model - if an important relationship isn't represented in the embeddings, the downstream model cannot learn from that signal.

**Example:** To make our discussion more concrete, let's imagine that we are building a recommender system for a clothing store. Our goal is to predict the probability that a user will be interested in a particular product, so that we can rank the search results by relevance. Our input features might be the customer's city, the product ID, and the customer's search terms. The supervision signal might be clicks, purchases, or some other measure of user interest. The modeling setup might look something like the following.

<div>
<center>
<img src="https://github.com/tensorflow/recommenders/blob/main/assets/embedding_baseline.png?raw=true" width="400"/>
</center>
</div>


## Why are embeddings challenging to learn?

1. **Large vocabulary:** Ideally, we would learn a separate embedding for every feature value, but this is often not possible in production settings. Our clothing store might have an inventory with 100M products - too many to fit a full table into memory.

2. **Dynamic vocabulary:** In some applications, new feature values come and go too quickly to keep an up-to-date list of all the values. For example, our clothing store might introduce many new products during the holiday shopping season. To avoid out-of-vocab performance losses and frequent re-trains, we need an embedding algorithm that can handle new values.

3. **Hash collisions:** The industry-standard solution to the previous two challenges is to assign feature values to embeddings using a hash function. However, this introduces even more problems stemming from hash collisions. If two values share the same embedding representation, these values become indistinguishable to the model.

To see how things could go wrong, suppose that customers in Anchorage, Alaska and Miami, Florida both issue queries including the words "warm" and "weather." The customer in Alaska probably wants a warm coat to protect against winter weather, but the Floridian might want clothing appropriate for warm weather in the sub-tropics. However, if Anchorage and Miami collide under the hash function, our model won't be able to distinguish between these search intents. We might accidentally recommend heavy winter coats to customers in Miami.

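To make the collision problem concrete, here is a minimal sketch that hashes a few city names into a small number of buckets and reports which values end up sharing a bucket. The city list, the bucket count, and the use of MD5 as the hash function are all assumptions chosen for illustration; production systems typically use a faster hash.

```python
import hashlib


def hash_bucket(value: str, num_buckets: int) -> int:
  """Maps a string to a bucket index with a stable hash (illustrative choice)."""
  digest = hashlib.md5(value.encode("utf-8")).hexdigest()
  return int(digest, 16) % num_buckets


def find_collisions(values, num_buckets):
  """Returns {bucket: [values]} for every bucket shared by two or more values."""
  buckets = {}
  for value in values:
    buckets.setdefault(hash_bucket(value, num_buckets), []).append(value)
  return {b: vs for b, vs in buckets.items() if len(vs) > 1}


# Hypothetical city vocabulary. Six values cannot fit into four buckets
# without at least one collision.
cities = ["Anchorage", "Miami", "Seattle", "Denver", "Boston", "Austin"]
collisions = find_collisions(cities, num_buckets=4)
for bucket, shared in sorted(collisions.items()):
  print(f"bucket {bucket} is shared by {shared}")
```

Every value listed for the same bucket would receive the same embedding row, which is exactly the situation described above for Anchorage and Miami.
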

## What is Unified Embedding?

Unified embedding is an approach that uses a single indexing range to hash all of the categorical features in the model. Conceptually, you can think about this as using one massive embedding table that is shared by every feature.

In practice, we load-balance the embeddings across a few sub-tables because this helps alleviate hot-spot issues and allows us to more easily shard the parameters across accelerators. We also do multiple lookups, so that different features can have embeddings of different dimensions. The unified version of our clothing model might look like the following picture.

<div>
<center>
<img src="https://github.com/tensorflow/recommenders/blob/main/assets/embedding_unified.png?raw=true" width="400"/>
</center>
</div>

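The lookup described above can be sketched in plain Python. This is not the TFRS implementation: the salting scheme (mixing the feature name and lookup number into the hash key), the table sizes, and the helper names `salted_bucket`, `make_table`, and `unified_lookup` are all assumptions made for illustration.

```python
import hashlib
import random


def salted_bucket(feature: str, value: str, lookup: int, num_buckets: int) -> int:
  """Hashes (feature, lookup number, value) into the shared index range."""
  key = f"{feature}|{lookup}|{value}".encode("utf-8")
  return int(hashlib.md5(key).hexdigest(), 16) % num_buckets


def make_table(num_buckets: int, dim: int, seed: int):
  """Creates a randomly initialized table as a list of rows."""
  rng = random.Random(seed)
  return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
          for _ in range(num_buckets)]


# Hypothetical sizes: two shared sub-tables, each with 8-dimensional rows.
NUM_TABLES, BUCKETS_PER_TABLE, DIM_PER_TABLE = 2, 1000, 8
tables = [make_table(BUCKETS_PER_TABLE, DIM_PER_TABLE, seed=t)
          for t in range(NUM_TABLES)]


def unified_lookup(feature: str, value: str):
  """Queries every sub-table once and concatenates the retrieved rows."""
  pieces = []
  for t, table in enumerate(tables):
    row = salted_bucket(feature, value, t, BUCKETS_PER_TABLE)
    pieces.extend(table[row])
  return pieces


city_vec = unified_lookup("city", "Miami")
product_vec = unified_lookup("product_id", "SKU-123")
print(len(city_vec), len(product_vec))
```

Both features read from the same tables, so no per-feature table size has to be chosen. A feature that needs a wider embedding could simply perform more lookups.
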


### Why does Unified Embedding work?

There are two reasons why Unified Embedding is a good idea.

1. **Tuning table sizes:** To design individual tables for each feature, we have hundreds of knobs to tune - one per feature. In this situation, it's easy to over / under-allocate parameters, leading to lost performance. With Unified Embedding, we only have to tune the size of one table. Unified Embedding is actually *optimal* in the sense that it has about the same number of collisions as an optimally-tuned set of individual tables. This auto-tuning behavior makes hyperparameter configuration much easier.


2. **Feature Multiplexing:** Surprisingly, we find that values from different features can share the same embedding representation without sacrificing quality. The intuition is that these features are processed by different downstream network parameters, so the model can learn to "undo" the collision and interpret the same embedding in different ways. We call this phenomenon *Feature Multiplexing,* and you can read more about it in our [NeurIPS 2023 paper](https://arxiv.org/abs/2305.12102).

**Intuition:** Going back to our clothing retailer example, advantage #1 means that Unified Embedding protects against poorly-tuned configurations. For example, if we had a catalog of 10M products as well as a list of 500 cities, Unified Embedding would automatically allocate more buckets for the products, simply because the products hash into the shared range far more often. Advantage #2 means that if a product happens to collide with a city, the model can still correctly interpret the semantic meaning of the embedding for both features.

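The auto-allocation effect is easy to check with a small simulation. The sketch below scales the example down to 10,000 products and 50 cities sharing 2,000 rows, and compares that against a deliberately mis-tuned baseline that splits the same budget evenly between two tables. All sizes, value names, and the MD5-based hash are assumptions for illustration.

```python
import hashlib


def shared_bucket(feature: str, value: str, num_buckets: int) -> int:
  """Hashes a (feature, value) pair into a range of num_buckets rows."""
  key = f"{feature}|{value}".encode("utf-8")
  return int(hashlib.md5(key).hexdigest(), 16) % num_buckets


NUM_PRODUCTS, NUM_CITIES, TOTAL_BUCKETS = 10_000, 50, 2_000
products = [f"product_{i}" for i in range(NUM_PRODUCTS)]
cities = [f"city_{i}" for i in range(NUM_CITIES)]

# Unified: both features hash into one shared range of TOTAL_BUCKETS rows.
product_rows = {shared_bucket("product", p, TOTAL_BUCKETS) for p in products}
city_rows = {shared_bucket("city", c, TOTAL_BUCKETS) for c in cities}
unified_rows_used = len(product_rows | city_rows)

# Mis-tuned baseline: the same budget split evenly between two tables.
half = TOTAL_BUCKETS // 2
split_rows_used = (
    len({shared_bucket("product", p, half) for p in products})
    + len({shared_bucket("city", c, half) for c in cities}))

print(f"unified: products touch {len(product_rows)} rows, "
      f"cities touch {len(city_rows)} rows")
print(f"rows in use: unified={unified_rows_used}, even split={split_rows_used}")
```

In the even split, the city table can never use more than 50 of its 1,000 rows, so most of its capacity is wasted. In the shared range, the products spread over nearly all of the rows without any tuning.
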

# Movielens 100K Example

Next, let's take a look at how unified embeddings work in practice. We'll work with the Movielens-100K dataset, which is a popular benchmark in recommendation systems research. The task is to predict user movie ratings from features related to the user and the movie.

## Setup

First, we'll install and import the necessary packages for the code.

In [None]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets

In [None]:

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from tensorflow_recommenders.layers.feature_multiplexing import unified_embedding

UnifiedEmbedding = unified_embedding.UnifiedEmbedding
UnifiedEmbeddingConfig = unified_embedding.UnifiedEmbeddingConfig


## Loading the dataset

Next, we'll load the movielens dataset using the TensorFlow Datasets library. This follows a similar process to the [DCN](https://www.tensorflow.org/recommenders/examples/dcn) and [basic ranking](https://www.tensorflow.org/recommenders/examples/basic_ranking) tutorials. However, instead of casting this as a regression problem (where we predict the user's movie rating), we will pose it as a binary classification problem (where we predict whether the user's movie rating is at least 3 stars).

This is the same setup used in [our paper](https://arxiv.org/abs/2305.12102) and in the [DCNv2 paper](https://arxiv.org/abs/2008.13535).


In [None]:
ratings = tfds.load("movie_lens/100k-ratings", split="train")
ratings = ratings.map(lambda x: {
    "movie_id": x["movie_id"],
    "user_id": x["user_id"],
    # Binarize the label: 1 if the rating is at least 3 stars, else 0.
    "user_rating": tf.cast(x["user_rating"] >= 3, tf.int32),
    "user_gender": tf.strings.as_string(x["user_gender"]),
    "user_zip_code": x["user_zip_code"],
    "user_occupation_text": x["user_occupation_text"],
    "bucketized_user_age": tf.strings.as_string(x["bucketized_user_age"]),
})


Now let's partition the data into an 80-20 train-test split.

In [None]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

Finally, let's compute some basic statistics (such as vocabulary size) from the dataset. We will also store the vocabulary for later, since we'll need it to build the collisionless embedding model.

In [None]:
feature_names = ["movie_id", "user_id", "user_gender", "user_zip_code",
                 "user_occupation_text", "bucketized_user_age"]

vocabularies = {}
for feature_name in feature_names:
  vocab = ratings.batch(1_000_000).map(lambda x: x[feature_name])
  vocabularies[feature_name] = np.unique(np.concatenate(list(vocab)))

for feature_name in feature_names:
  print(f"Feature '{feature_name}' has cardinality "
        f"{len(vocabularies[feature_name])}")


In a moment, we'll construct three models that differ only in their embedding component, so let's also define some convenience functions to easily construct the network and task parts of the model.

In [None]:
def build_network(layer_sizes):
  network_layers = []
  # Concatenate the list of embedding features.
  network_layers.append(tf.keras.layers.Concatenate(axis=-1))
  # Pass the concatenated embeddings through the hidden layers.
  for layer_size in layer_sizes:
    network_layers.append(
        tf.keras.layers.Dense(layer_size, activation="relu"))
  # Add a sigmoid output layer so the network emits a probability, not a logit.
  network_layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
  return tf.keras.Sequential(network_layers)

Since we binarized the movie ratings, our recommendation problem is a binary prediction task. We'll report both the AUC and the LogLoss as metrics.

In [None]:

# We will use the same task for all models.
recsys_task = tfrs.tasks.Ranking(
    loss=tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.SUM
    ),
    metrics=[
        tf.keras.metrics.AUC(name="AUC"),
        tf.keras.metrics.BinaryCrossentropy(name="LogLoss"),
    ]
)

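For readers who want to see what these two metrics measure, here is a small pure-Python sketch. It computes LogLoss as the mean binary cross-entropy and AUC as the fraction of (positive, negative) pairs that are ranked correctly. Note that `tf.keras.metrics.AUC` approximates this quantity with a fixed set of thresholds, so its value can differ slightly. The helper names and the toy labels are assumptions for illustration.

```python
import math


def log_loss(labels, probs, eps=1e-7):
  """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
  total = 0.0
  for y, p in zip(labels, probs):
    p = min(max(p, eps), 1.0 - eps)
    total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
  return total / len(labels)


def pairwise_auc(labels, probs):
  """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
  positives = [p for y, p in zip(labels, probs) if y == 1]
  negatives = [p for y, p in zip(labels, probs) if y == 0]
  wins = sum((pos > neg) + 0.5 * (pos == neg)
             for pos in positives for neg in negatives)
  return wins / (len(positives) * len(negatives))


# Toy example: every positive is scored above every negative.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
print(f"LogLoss = {log_loss(labels, probs):.3f}, "
      f"AUC = {pairwise_auc(labels, probs):.3f}")
```

AUC only depends on the ordering of the predictions, while LogLoss also penalizes poorly calibrated probabilities, which is why it is useful to report both.
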
## Implementing the Models

Now let's implement Keras model classes for each of the modeling strategies we talked about - collisionless embeddings, hash embeddings, and unified embeddings.

### Collisionless Embedding Model

For our first model, we'll explicitly assign a single embedding to each vocabulary item. This is the ideal setup where we don't have any hash collisions or representation sharing. Unfortunately, collisionless embeddings aren't feasible in large-scale production systems. We're including it as an optimistic headroom point, to give a reference point for how large the improvement could be if we were able to use a full embedding table.


In [None]:
class CollisionlessEmbeddingModel(tfrs.Model):

  def __init__(self, layer_sizes, embed_dimension, vocabularies):
    super().__init__()
    self._embedding_dimension = embed_dimension
    self._embedding_tables = {}
    self._feature_names = vocabularies.keys()

    for feature_name, vocab in vocabularies.items():
      lookup_layer = tf.keras.layers.StringLookup(
          vocabulary=vocab, mask_token=None)
      # Collisionless embedding, without hashing.
      embed_layer = tf.keras.layers.Embedding(
          len(vocab) + 1, self._embedding_dimension)
      self._embedding_tables[feature_name] = tf.keras.Sequential(
          [lookup_layer, embed_layer])

    self._network = build_network(layer_sizes)

    self.task = recsys_task

  def call(self, features):
    embeddings = []
    for feature_name in self._feature_names:
      embedding_fn = self._embedding_tables[feature_name]
      embeddings.append(embedding_fn(features[feature_name]))
    return self._network(embeddings)

  def compute_loss(self, features, training=False):
    labels = features.pop("user_rating")
    scores = self(features)
    return self.task(
        labels=labels,
        predictions=scores,
    )

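The `len(vocab) + 1` table size in the model above comes from how the lookup layer numbers its outputs. The sketch below mimics that numbering in plain Python, assuming the `StringLookup` defaults used above (no mask token and a single out-of-vocabulary index). The helper names and the toy vocabulary are hypothetical.

```python
def build_index(vocabulary):
  """Assigns index 0 to out-of-vocabulary values and 1..N to known values."""
  return {value: i + 1 for i, value in enumerate(vocabulary)}


def lookup(index, value):
  """Returns the row for a value, falling back to the shared OOV row 0."""
  return index.get(value, 0)


vocab = ["doctor", "engineer", "student"]  # Hypothetical occupation values.
index = build_index(vocab)
num_rows = len(vocab) + 1  # One row per known value, plus one OOV row.

print(lookup(index, "doctor"), lookup(index, "pilot"), num_rows)
```

Known values never share a row, which is what makes the table collisionless. Only values that were missing from the vocabulary fall back to the single shared row.
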
### Hash Embedding Model

Next, we construct a model with a much smaller embedding table. This is a more realistic setup for production settings, because we cannot afford to store the full table with millions or billions of inputs. Instead, we can adjust the model size to fit our memory budget. Unfortunately, hash embeddings introduce some unavoidable representation sharing because there are more items than embeddings.

In [None]:
class HashEmbeddingModel(tfrs.Model):

  def __init__(self, layer_sizes, embed_dimension, embed_buckets):
    super().__init__()
    self._embed_dimension = embed_dimension
    self._embedding_tables = {}
    self._feature_names = embed_buckets.keys()

    for feature_name, buckets in embed_buckets.items():
      lookup_layer = tf.keras.layers.Hashing(num_bins=buckets)
      embed_layer = tf.keras.layers.Embedding(
          buckets, self._embed_dimension)
      self._embedding_tables[feature_name] = tf.keras.Sequential(
          [lookup_layer, embed_layer])

    self._network = build_network(layer_sizes)

    self.task = recsys_task

  def call(self, features):
    embeddings = []
    for feature_name in self._feature_names:
      embedding_fn = self._embedding_tables[feature_name]
      embeddings.append(embedding_fn(features[feature_name]))
    return self._network(embeddings)

  def compute_loss(self, features, training=False):
    labels = features.pop("user_rating")
    scores = self(features)
    return self.task(
        labels=labels,
        predictions=scores,
    )

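To get a feel for how much sharing a hash table introduces, we can estimate the probability that a given value shares its bucket with at least one other value. The sketch below assumes an ideal, uniformly random hash, which real hash functions only approximate. The `movie_lens/100k-ratings` data loaded above contains 1,682 movies and 943 users, and the bucket counts match the ones used for the hash model later in this tutorial.

```python
def collision_probability(num_values: int, num_buckets: int) -> float:
  """P(a given value shares its bucket with at least one other value)."""
  return 1.0 - (1.0 - 1.0 / num_buckets) ** (num_values - 1)


for name, num_values, num_buckets in [("movie_id", 1682, 600),
                                      ("user_id", 943, 400)]:
  p = collision_probability(num_values, num_buckets)
  print(f"{name}: {p:.1%} of values are expected to share a bucket")
```

With tables this small, nearly every movie and user shares an embedding with some other value, so the model has to cope with collisions rather than avoid them.
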
### Unified Embedding Model

Finally, let's define a class for a unified embedding model using the implementation in TFRS. The implementation is similar to the previous models, though it requires us to override the `compile` method (as the embedding layer needs to know the optimizer if it is running on TPU).

In [None]:
translate_keras_optimizer = tfrs.layers.embedding.tpu_embedding_layer.translate_keras_optimizer

class UnifiedEmbeddingModel(tfrs.Model):

  def __init__(self, layer_sizes, embed_config):
    super().__init__()
    self._embed_config = embed_config
    self._network = build_network(layer_sizes)
    self.task = recsys_task

  def compile(self, **kwargs):
    # Because the embedding layer might have to run on TPU, we can only
    # construct the layer once the optimizer is known (at model.compile() time).
    tpu_embed_optimizer = translate_keras_optimizer(kwargs["optimizer"])
    self._embedding_layer = UnifiedEmbedding(
        self._embed_config, tpu_embed_optimizer)
    super().compile(**kwargs)

  def call(self, features):
    embeddings = self._embedding_layer(features)
    return self._network(embeddings)

  def compute_loss(self, features, training=False):
    labels = features.pop("user_rating")
    scores = self(features)
    return self.task(
        labels=labels,
        predictions=scores,
    )

## Training the models

Now we're ready to construct and train the models. We'll start by shuffling the training data, then batching and caching both the training and test sets.


In [None]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

Next, we set some hyperparameters for the network. These parameters are just for demonstration purposes and are not tuned for performance. For better results, we could tune these parameters in tandem with the choice of embedding method.

We're using the legacy Keras optimizers because we need to maintain compatibility with the TPU embedding layer (in case we choose to run unified embedding on TPU). We're also going to run each model 5 times (`num_runs` below) and report the mean and standard deviation for the performance metrics.

In [None]:
layer_sizes = [128, 64]
embed_dimension = 16

epochs = 5
learning_rate = 0.01
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate)
# Number of times to independently repeat the training process.
num_runs = 5


Now let's create the models. We'll start with the collisionless model, which is very easy to configure.

In [None]:
# Configure and construct the collisionless model.
collisionless_models = []
for _ in range(num_runs):
  collisionless_model = CollisionlessEmbeddingModel(
      layer_sizes, embed_dimension, vocabularies)
  collisionless_model.compile(optimizer=optimizer)
  collisionless_models.append(collisionless_model)

Next, we'll build the hash embedding model. This is slightly more complicated to configure, since we now have to set table sizes for each of the features. For the best practical results, we would need to tune each of these hyperparameters. Another optimization is to use collisionless tables for the low-cardinality embeddings (such as user gender and occupation). We did this for the experiments in our paper, but it takes a little more work to set up. For demonstration purposes, we've just increased the number of embeddings to be larger than the cardinality for these features.

In [None]:
# Configure and construct the hash model.
hash_buckets = {
    "movie_id": 600,
    "user_id": 400,
    "user_gender": 20,
    "user_zip_code": 400,
    "user_occupation_text": 60,
    "bucketized_user_age": 20,
}
hash_models = []
for _ in range(num_runs):
  hash_model = HashEmbeddingModel(layer_sizes, embed_dimension, hash_buckets)
  hash_model.compile(optimizer=optimizer)
  hash_models.append(hash_model)

Finally, let's build the unified embedding model. We consider a setup where we have two tables, each with half the embedding dimension, and get our embeddings by querying both tables and concatenating the results. To have a valid comparison with the hash embedding model, we construct the table so that it has the same overall number of parameters.

There are other ways to configure the unified table, with some tradeoffs. For example, we could use a single lookup in a full-dimension table for each feature - this would slightly reduce the representation capacity but reduce the number of hash lookups and improve latency. In the end, the best configuration will be the one that meets the latency, memory, and performance demands of the application.

In [None]:
# Configure and construct a unified model with the same size as the hash model.
total_buckets = sum(hash_buckets.values())
num_tables = 2
embed_config = UnifiedEmbeddingConfig(
    buckets_per_table=total_buckets,
    dim_per_table=embed_dimension // 2,
    num_tables=num_tables,
    name="unified_table",
)
for feature_name in feature_names:
  embed_config.add_feature(feature_name, 2)
unified_models = []
for _ in range(num_runs):
  unified_model = UnifiedEmbeddingModel(layer_sizes, embed_config)
  unified_model.compile(optimizer=optimizer)
  unified_models.append(unified_model)

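The claim that this configuration matches the hash model's embedding budget can be checked with a little arithmetic. The sketch below only counts embedding parameters (the dense network is identical in both models) and assumes one bucket count per feature, with the sizes taken from the hash model configuration.

```python
embed_dimension = 16
hash_buckets = {
    "movie_id": 600,
    "user_id": 400,
    "user_gender": 20,
    "user_zip_code": 400,
    "user_occupation_text": 60,
    "bucketized_user_age": 20,
}

# Hash model: one table per feature, each row has embed_dimension entries.
hash_params = sum(b * embed_dimension for b in hash_buckets.values())

# Unified model: num_tables shared tables, each with half-width rows.
num_tables = 2
buckets_per_table = sum(hash_buckets.values())
dim_per_table = embed_dimension // num_tables
unified_params = num_tables * buckets_per_table * dim_per_table

# Each feature does two lookups and concatenates the results.
lookups_per_feature = 2
unified_output_dim = lookups_per_feature * dim_per_table

print(f"hash: {hash_params}, unified: {unified_params}, "
      f"per-feature output dim: {unified_output_dim}")
```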
For convenience, let's define a function to train several models and aggregate the outputs.

In [None]:
def run_models(models):
  aucs = []
  losses = []
  for model in models:
    model.fit(cached_train, epochs=epochs, verbose=False)
    metrics = model.evaluate(cached_test, return_dict=True, verbose=False)
    aucs.append(metrics["AUC"])
    losses.append(metrics["LogLoss"])
  return aucs, losses

## Results
Finally, let's train the models and report the results.

In [None]:
collisionless_aucs, collisionless_losses = run_models(collisionless_models)
hash_aucs, hash_losses = run_models(hash_models)
unified_aucs, unified_losses = run_models(unified_models)

def print_metrics(model_name, aucs, losses):
  s = f"{model_name} model: {np.mean(aucs):.3f} AUC (std = {np.std(aucs):.3f}) "
  s += f"and {np.mean(losses):.3f} LogLoss (std = {np.std(losses):.3f})."
  print(s)

print_metrics("Collisionless", collisionless_aucs, collisionless_losses)
print_metrics("Hash Embedding", hash_aucs, hash_losses)
print_metrics("Unified Embedding", unified_aucs, unified_losses)

We ended up with the following results, but there will be differences each time this is run due to the optimizer and initialization.

```
Collisionless model: 0.797 AUC (std = 0.000) and 0.380 LogLoss (std = 0.005).
Hash Embedding model: 0.743 AUC (std = 0.001) and 0.407 LogLoss (std = 0.003).
Unified Embedding model: 0.790 AUC (std = 0.001) and 0.380 LogLoss (std = 0.001).
```

Unified embedding should have about 4-5 percentage points higher AUC than the hash embeddings (0.790 vs. 0.743 in the run above), while having the same overall parameter budget. The performance is close to the collisionless model, but only uses about 57% of the parameters (43k vs 76k).

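To put the gap in perspective, the sketch below expresses it in absolute and relative terms, and as a share of the headroom between the hash and collisionless models. It uses the sample AUC values reported above, so the exact figures will change from run to run.

```python
# Mean AUC values from the sample run reported above.
hash_auc, unified_auc, collisionless_auc = 0.743, 0.790, 0.797

absolute_gain = unified_auc - hash_auc
relative_gain = absolute_gain / hash_auc
headroom_recovered = absolute_gain / (collisionless_auc - hash_auc)

print(f"absolute gain: {absolute_gain:.3f} AUC")
print(f"relative gain: {relative_gain:.1%}")
print(f"share of collisionless headroom recovered: {headroom_recovered:.0%}")
```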
To learn more about how the method works, you may want to play around with the unified embedding configuration. For example, using only one table (no chunking) will reduce the AUC by about 2%. It should also be possible to increase the performance of the hash embedding model by carefully tuning the table sizes. It's also interesting to see how small the models can go - for very small table sizes, unified embedding should have even larger gains.

## Conclusion

We hope you developed some intuition about feature multiplexing and enjoyed learning how unified embedding can work in practice. To learn more, you can check out our NeurIPS paper, ["Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems."](https://arxiv.org/abs/2305.12102)

