# Overview
Hierarchical Parameter Server (HPS) is a distributed recommendation inference framework that combines a high-performance GPU embedding cache with a hierarchical storage architecture to provide low-latency embedding retrieval for inference tasks. It is provided as a Python toolkit and can be integrated into a TensorFlow (TF) model graph.

This tutorial shows how to combine the HPS backend and the TensorFlow backend in a Triton ensemble. With HPS, trained TensorFlow DNN models with large embedding tables can be deployed efficiently through the Triton Inference Server. For more details about HPS, refer to [HugeCTR Hierarchical Parameter Server (HPS)](https://nvidia-merlin.github.io/HugeCTR/master/hugectr_parameter_server.html#hugectr-hierarchical-parameter-server-database-backend).

The **01_model_training.ipynb** notebook covers the following tasks:
  * Generate mock datasets that meet the HPS input format
  * Train a native TensorFlow DNN model
  * Separate the trained DNN model graph into two parts: the embedding lookup and the dense model graph
  * Reconstruct the dense model graph
  * Construct the HPS lookup model: extract the DNN model's embedding weights and transfer them to HPS

## Configurations

In [1]:
import os
import numpy as np
import tensorflow as tf
import struct


In [2]:
args = dict()
# define model training settings
args["gpu_num"] = 4                               # the number of available GPUs
args["iter_num"] = 10                             # the number of training iterations
args["embed_vec_size"] = 16                       # the dimension of embedding vectors
args["global_batch_size"] = 65536                 # the global batch size across all GPUs
args["slot_num"] = 3                              # the number of feature fields in this embedding layer
args["max_vocabulary_size"] = 30000
args["vocabulary_range_per_slot"] = [[0,10000],[10000,20000],[20000,30000]]
# define model save path
args["dense_model_path"]          = "naive_dnn_dense.model"
args["reshape_dense_model_path"]  = "naive_dnn_reshape_dense.model"
args["embedding_table_path"]      = "naive_dnn_sparse.model"
# define data type
args["np_key_type"]    = np.int64
args["np_vector_type"] = np.float32
args["tf_key_type"]    = tf.int64
args["tf_vector_type"] = tf.float32

# GPU environment configuration for model training
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, range(args["gpu_num"])))
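
In this configuration, every key is a row index into a single embedding table with `max_vocabulary_size` rows, so each slot's `[low, high)` range has to stay inside that table. The sketch below is a hypothetical sanity check and is not part of the original notebook. It also assumes the slot ranges are meant to be disjoint, as they are in the settings above.

```python
# Hypothetical helper (not from the notebook): validates the per-slot key ranges.
def check_vocabulary_ranges(slot_num, max_vocabulary_size, ranges):
    """Raise ValueError if the per-slot key ranges are inconsistent."""
    if len(ranges) != slot_num:
        raise ValueError("expected exactly one [low, high) range per slot")
    previous_high = 0
    for low, high in sorted(ranges):
        # Every key must be a valid row index of the embedding table.
        if not (0 <= low < high <= max_vocabulary_size):
            raise ValueError(f"range [{low}, {high}) falls outside the table")
        # Assumption: slots own disjoint key ranges.
        if low < previous_high:
            raise ValueError(f"range [{low}, {high}) overlaps another slot")
        previous_high = high
    return True


# Same values as the configuration cell above.
ok = check_vocabulary_ranges(
    slot_num=3,
    max_vocabulary_size=30000,
    ranges=[[0, 10000], [10000, 20000], [20000, 30000]],
)
print(ok)
```

Running such a check before training would surface a mismatch between `slot_num`, `max_vocabulary_size`, and `vocabulary_range_per_slot` early, rather than as an out-of-range lookup later.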

## Data generation
Generate mock datasets that meet the HPS input format.

In [3]:
def generate_random_samples(num_samples, vocabulary_range_per_slot, key_dtype = args["np_key_type"]):
    """
    Data generator
    
    Returns a randomly generated set of values for keys and labels
    """
    keys = list()
    for vocab_range in vocabulary_range_per_slot:
        keys_per_slot = np.random.randint(low=vocab_range[0], 
                                          high=vocab_range[1], 
                                          size=(num_samples, 1), 
                                          dtype=key_dtype)
        keys.append(keys_per_slot)
    keys = np.concatenate(keys, axis=1)
    labels = np.random.randint(low=0, high=2, size=(num_samples, 1))
    return keys, labels

def tf_dataset(keys, labels, batchsize):
    """
    Slice tensor into batches
    """
    dataset = tf.data.Dataset.from_tensor_slices((keys, labels))
    dataset = dataset.batch(batchsize, drop_remainder=True)
    return dataset
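
To make the shape of the generated data concrete, here is a small NumPy-only sketch of the key generation step. The function name `generate_keys` is illustrative and not part of the notebook; it mirrors the per-slot sampling above on a tiny sample count.

```python
import numpy as np


def generate_keys(num_samples, vocabulary_range_per_slot, key_dtype=np.int64):
    # One column of keys per slot, each drawn from that slot's [low, high) range.
    columns = [
        np.random.randint(low=low, high=high, size=(num_samples, 1), dtype=key_dtype)
        for low, high in vocabulary_range_per_slot
    ]
    # Concatenate the columns into a [num_samples, slot_num] matrix.
    return np.concatenate(columns, axis=1)


ranges = [[0, 10000], [10000, 20000], [20000, 30000]]
keys = generate_keys(8, ranges)
print(keys.shape, keys.dtype)  # (8, 3) int64

# Column i only contains keys from the range assigned to slot i.
for slot, (low, high) in enumerate(ranges):
    assert ((keys[:, slot] >= low) & (keys[:, slot] < high)).all()
```

Each row therefore holds one key per slot, which is the `[batch, slot_num]` layout the model input expects.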

## Model construction, native TF DNN model for training
We define the training model graph with native TF layers, i.e., `tf.nn.embedding_lookup` and `tf.keras.layers.Dense`. The embedding weights are stored in a `tf.Variable`. After training, we extract the trained weights of the embedding table. The dense layers are saved as a separate model graph, which can be loaded directly during inference.
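
As a rough illustration of what the lookup and reshape steps do to tensor shapes, the sketch below reproduces them in NumPy. It assumes a single embedding table, in which case the lookup amounts to gathering rows by key. The table contents here are made up so that individual rows are distinguishable.

```python
import numpy as np

slot_num, embed_vec_size, vocab_size = 3, 16, 30000

# Stand-in embedding table with distinct rows (illustrative values only).
params = np.arange(vocab_size * embed_vec_size, dtype=np.float32)
params = params.reshape(vocab_size, embed_vec_size)

# A batch of two samples, one key per slot: shape [batch, slot_num].
ids = np.array([[5, 10005, 20005],
                [7, 10007, 20007]], dtype=np.int64)

# Row gather: shape [batch, slot_num, embed_vec_size].
looked_up = params[ids]

# Flatten the per-slot vectors into one dense input row per sample.
flat = looked_up.reshape(-1, slot_num * embed_vec_size)

print(looked_up.shape, flat.shape)  # (2, 3, 16) (2, 48)
```

The flattened `[batch, 48]` tensor is what the first fully connected layer consumes.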

In [4]:
class DenseModel(tf.keras.models.Model):
    """
    Model with 1 input, 3 fully connected layers, and 2 outputs (the logit and the flattened embedding vector)
    """
    def __init__(self,
                 init_tensors,
                 slot_num,
                 embed_vec_size,
                 **kwargs):
        super(DenseModel, self).__init__(**kwargs)
        self.slot_num       = slot_num
        self.embed_vec_size = embed_vec_size
        self.init_tensors   = init_tensors
        self.params         = tf.Variable(initial_value=tf.concat(self.init_tensors, axis=0))
        
        # define FC layers
        self.fc_1 = tf.keras.layers.Dense(units=256, activation=None,
                                                 kernel_initializer="ones",
                                                 bias_initializer="zeros",
                                                 name='fc_1')
        self.fc_2 = tf.keras.layers.Dense(units=128, activation=None,
                                                 kernel_initializer="ones",
                                                 bias_initializer="zeros",
                                                 name='fc_2')
        self.fc_3 = tf.keras.layers.Dense(units=1, activation=None,
                                                 kernel_initializer="ones",
                                                 bias_initializer="zeros",
                                                 name='fc_3')

    def call(self, inputs):
        embedding_vector = tf.nn.embedding_lookup(params=self.params, 
                                                  ids=inputs)
        embedding_vector = tf.reshape(embedding_vector, 
                                      shape=[-1, self.slot_num * self.embed_vec_size])
        fc1   = self.fc_1(embedding_vector)
        fc2   = self.fc_2(fc1)
        logit = self.fc_3(fc2)
        return logit, embedding_vector

    def summary(self):
        inputs = tf.keras.Input(shape=(self.slot_num,), 
                                dtype=args["tf_key_type"], 
                                name="input_dense")
        model  = tf.keras.models.Model(inputs=inputs, 
                                       outputs=self.call(inputs),
                                       name='tf_model')
        return model.summary()

In [5]:
def train(args):
    """
    Define TF DNN model training process
    """
    # initialize input tensor
    init_tensors = np.ones(shape=[args["max_vocabulary_size"], args["embed_vec_size"]], 
                           dtype=args["np_vector_type"])
    
    # model construction
    model = DenseModel(init_tensors, args["slot_num"], args["embed_vec_size"])
    model.summary()
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
    loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    
    # define model training steps
    def _train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logit, embedding_vector = model(inputs)
            loss = loss_fn(labels, logit)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return logit, embedding_vector, loss

    # prepare dataset
    keys, labels = generate_random_samples(args["global_batch_size"] * args["iter_num"], 
                                           args["vocabulary_range_per_slot"],  
                                           args["np_key_type"])
    dataset = tf_dataset(keys, labels, args["global_batch_size"])
    
    # model training
    for i, (id_tensors, labels) in enumerate(dataset):
        _, embedding_vector, loss = _train_step(id_tensors, labels)
        print("-"*20, "Step {}, loss: {}".format(i, loss),  "-"*20)

    return model

In [6]:
trained_model = train(args)

# get trained model weights for HPS lookup
weights_list  = trained_model.get_weights()
embedding_weights = weights_list[-1]

Model: "tf_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_dense (InputLayer)    [(None, 3)]               0         
                                                                 
 tf.compat.v1.nn.embedding_l  (None, 3, 16)            0         
 ookup (TFOpLambda)                                              
                                                                 
 tf.reshape (TFOpLambda)     (None, 48)                0         
                                                                 
 fc_1 (Dense)                (None, 256)               12544     
                                                                 
 fc_2 (Dense)                (None, 128)               32896     
                                                                 
 fc_3 (Dense)                (None, 1)                 129       
                                                          

2022-11-22 03:27:01.518012: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-22 03:27:03.638472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30970 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
2022-11-22 03:27:03.639636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30970 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0
2022-11-22 03:27:03.640571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device

-------------------- Step 0, loss: 783168.0 --------------------
-------------------- Step 1, loss: 518110.40625 --------------------
-------------------- Step 2, loss: 327503.125 --------------------
-------------------- Step 3, loss: 199153.59375 --------------------
-------------------- Step 4, loss: 115212.203125 --------------------
-------------------- Step 5, loss: 64676.38671875 --------------------
-------------------- Step 6, loss: 34635.75390625 --------------------
-------------------- Step 7, loss: 17304.05859375 --------------------
-------------------- Step 8, loss: 8057.4736328125 --------------------
-------------------- Step 9, loss: 3385.7255859375 --------------------
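
The very large loss at step 0 is consistent with the all-ones initialization used in this example. The following back-of-the-envelope sketch is an interpretation of the printed output, not something the notebook states. It assumes the loss follows the usual numerically stable sigmoid cross-entropy formula and only applies before the first weight update.

```python
import math

slot_num, embed_vec_size, batch_size = 3, 16, 65536
units = [256, 128, 1]  # fc_1, fc_2, fc_3

# With all embeddings and kernels equal to 1, all biases equal to 0, and no
# activation, every unit outputs the sum of its inputs.
activation, width = 1.0, slot_num * embed_vec_size
for n_units in units:
    activation, width = activation * width, n_units
logit = activation
print(logit)  # 1572864.0


def bce_with_logits(z, y):
    # Stable form: max(z, 0) - z * y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


print(bce_with_logits(logit, 0), bce_with_logits(logit, 1))  # 1572864.0 0.0

# Every sample has the same logit, so the mean loss is logit * (share of label 0).
step0_loss = 783168.0  # value printed at step 0 above
implied_zero_labels = step0_loss / logit * batch_size
print(implied_zero_labels)  # 32632.0
```

Under these assumptions, the step 0 loss corresponds to 32632 of the 65536 labels being 0, which is close to the expected half for uniformly random binary labels.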


## Save dense model graph
Separate the trained DNN model graph into two parts: the embedding layer and the dense model graph.

In [7]:
# save the dense layers as a separate model graph
dense_model = tf.keras.models.Model(trained_model.get_layer("fc_1").input, 
                                    trained_model.get_layer("fc_3").output, 
                                    name='tf_dense_model')
dense_model.summary()

# the saved dense model graph will be loaded directly in the Triton inference part
dense_model.save(args["dense_model_path"])

Model: "tf_dense_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 48)]              0         
                                                                 
 fc_1 (Dense)                (None, 256)               12544     
                                                                 
 fc_2 (Dense)                (None, 128)               32896     
                                                                 
 fc_3 (Dense)                (None, 1)                 129       
                                                                 
Total params: 45,569
Trainable params: 45,569
Non-trainable params: 0
_________________________________________________________________
INFO:tensorflow:Assets written to: naive_dnn_dense.model/assets
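
The parameter counts in the summary above can be reproduced by hand: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` kernel weights plus `n_out` biases. A minimal check, where the helper name `dense_params` is illustrative:

```python
def dense_params(n_in, n_out):
    # Kernel weights plus one bias per unit.
    return n_in * n_out + n_out


# Input width (slot_num * embed_vec_size = 48) followed by the units of fc_1, fc_2, fc_3.
layer_sizes = [3 * 16, 256, 128, 1]
counts = [dense_params(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [12544, 32896, 129] 45569
```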


## Reconstruct the dense model graph
Add a reshape layer on top of the loaded dense model to meet the input format of the Triton TensorFlow backend.

For a Triton input, a model that supports batching expects a batched input to have shape [batch-size], which means that the batch dimension fully describes the shape. For the inference API, the equivalent shape [batch-size, 1] must be specified, since each input must specify non-empty dims ([link](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#reshape)).

In [8]:
class ReshapeDenseModel(tf.keras.models.Model):
    """
    Add a reshape layer on top of the loaded dense model,
    to support the input format of Triton's TensorFlow backend.
    The output of HPS is one-dimensional [-1], while the TF backend requires a two-dimensional input [batch_size, -1].
    """
    def __init__(self,
                 slot_num,
                 embed_vec_size,
                 dense_model_path,
                 **kwargs):
        super(ReshapeDenseModel, self).__init__(**kwargs)
        self.slot_num = slot_num
        self.embed_vec_size = embed_vec_size
        self.dense_model = tf.keras.models.load_model(dense_model_path)
        
    def call(self, inputs):
        reshaped_input_vector = tf.reshape(inputs, 
                                           shape=[-1, self.slot_num * self.embed_vec_size])
        logit = self.dense_model(reshaped_input_vector)
        return logit

In [9]:
reshape_dense_model = ReshapeDenseModel(args["slot_num"], 
                                        args["embed_vec_size"], 
                                        args["dense_model_path"])



In [10]:
# test the new reshaped dense model
input_test  = np.random.random((96)).astype(np.float32)
output_test = reshape_dense_model(input_test)
print("For reshape dense model, input shape is:{}".format(input_test.shape), 
      "output shape is:{}".format(output_test.shape))

For reshape dense model, input shape is:(96,) output shape is:(2, 1)
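
The shapes printed above follow from the reshape alone: a flat vector of 96 values is split into rows of `slot_num * embed_vec_size = 48` values, giving a batch of 2. A NumPy-only sketch of that step, without the dense layers:

```python
import numpy as np

slot_num, embed_vec_size = 3, 16

# Flat vector standing in for the one-dimensional HPS output.
flat = np.random.random(96).astype(np.float32)

# -1 lets NumPy infer the batch dimension from the total length.
batched = flat.reshape(-1, slot_num * embed_vec_size)
print(flat.shape, "->", batched.shape)  # (96,) -> (2, 48)

# A length that is not a multiple of 48 cannot be reshaped this way.
try:
    np.zeros(100, dtype=np.float32).reshape(-1, slot_num * embed_vec_size)
except ValueError as err:
    print("reshape failed:", err)
```

The flat input length therefore has to be a multiple of 48 for the reshape to succeed.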


In [11]:
# save the reshaped dense model graph
reshape_dense_model.save(args["reshape_dense_model_path"])

INFO:tensorflow:Assets written to: naive_dnn_reshape_dense.model/assets


## Construct HPS lookup model
To use HPS for the embedding lookup, we need to convert the lookup part of the model into a [format supported by HPS](../../docs/architecture.md#hierarchical-parameter-server-input-format).
Here, the `tf.nn.embedding_lookup` of the native TF DNN model is replaced by HPS. The embedding weights of the trained model are exported and transferred to HPS as a sparse model, in preparation for inference deployment.

In [12]:
def convert_to_sparse_model(embeddings_weights, embedding_table_path, embedding_vec_size):
    """
    Convert the lookup part of the model to a format supported by HPS (key-vector pair files),
    the embedding weights of the trained dense model will be reloaded.
    
    Outputs(key-vector pair files) will be saved to defined sparse model path
    """
    os.makedirs(embedding_table_path, exist_ok=True)
    
    with open("{}/key".format(embedding_table_path), 'wb') as key_file, \
        open("{}/emb_vector".format(embedding_table_path), 'wb') as vec_file:
        for key in range(embeddings_weights.shape[0]):
            vec = embeddings_weights[key]
            key_struct = struct.pack('q', key)
            vec_struct = struct.pack(str(embedding_vec_size) + "f", *vec)
            key_file.write(key_struct)
            vec_file.write(vec_struct)

In [13]:
convert_to_sparse_model(embedding_weights, 
                        args["embedding_table_path"], 
                        args["embed_vec_size"])
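
One way to check the exported files is to read them back with the same `struct` formats used for writing. The sketch below is a self-contained round trip on a tiny made-up table in a temporary directory. The helper names `write_table` and `read_table` are illustrative, and the layout it assumes (8-byte keys in `key`, packed 4-byte floats in `emb_vector`) is simply the one produced by the conversion function above.

```python
import os
import struct
import tempfile


def write_table(rows, path, vec_size):
    # Same layout as the conversion above: one int64 key and one float32 vector per row.
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "key"), "wb") as key_file, \
            open(os.path.join(path, "emb_vector"), "wb") as vec_file:
        for key, vec in enumerate(rows):
            key_file.write(struct.pack("q", key))
            vec_file.write(struct.pack(f"{vec_size}f", *vec))


def read_table(path, vec_size):
    vec_format = f"{vec_size}f"
    key_bytes, vec_bytes = struct.calcsize("q"), struct.calcsize(vec_format)
    table = {}
    with open(os.path.join(path, "key"), "rb") as key_file, \
            open(os.path.join(path, "emb_vector"), "rb") as vec_file:
        while True:
            raw_key = key_file.read(key_bytes)
            if not raw_key:
                break  # end of the key file
            (key,) = struct.unpack("q", raw_key)
            table[key] = struct.unpack(vec_format, vec_file.read(vec_bytes))
    return table


# Values chosen to be exactly representable in float32.
rows = [(0.5, 1.0), (2.0, -4.0), (8.0, 0.25)]
with tempfile.TemporaryDirectory() as tmp_dir:
    table_path = os.path.join(tmp_dir, "toy_sparse.model")
    write_table(rows, table_path, vec_size=2)
    key_file_size = os.path.getsize(os.path.join(table_path, "key"))
    restored = read_table(table_path, vec_size=2)

print(key_file_size, restored)
```

For the real table, the same reasoning gives expected file sizes of `8 * max_vocabulary_size` bytes for `key` and `4 * embed_vec_size * max_vocabulary_size` bytes for `emb_vector`.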

## Convert dense model graph to ONNX
We convert the dense TF SavedModel to ONNX, which will be used to build the TensorRT engine.

In [14]:
import numpy as np
import tensorflow as tf
import tf2onnx
import onnx

BZ = 1024
TF_MODEL_PATH = "naive_dnn_dense.model"
ONNX_MODEL_PATH = "naive_dnn_dense.onnx"

model = tf.keras.models.load_model(TF_MODEL_PATH)

spec = [tf.TensorSpec(shape=(BZ, 3 * 16), dtype=tf.float32, name="input_1")]

onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=10)

with open(ONNX_MODEL_PATH, "wb") as f:
    f.write(onnx_model.SerializeToString())

print("loading onnx")
onnx_model = onnx.load(ONNX_MODEL_PATH)
onnx.checker.check_model(onnx_model)
print("onnx model checked")

loading onnx
onnx model checked


2022-11-22 03:27:10.445160: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 4
2022-11-22 03:27:10.445278: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2022-11-22 03:27:10.456497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30970 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
2022-11-22 03:27:10.457164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30970 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0
2022-11-22 03:27:10.457812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 30970 MB memory:  -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:0a:

In [15]:
# Release the GPU memory occupied by TensorFlow and Keras
from numba import cuda
cuda.select_device(0)
cuda.close()

After finishing this notebook, launch the container `nvcr.io/nvidia/tensorrt:22.11-py3` and build the TensorRT engine from the ONNX model with the following command:

```bash
trtexec --onnx=naive_dnn_dense.onnx --saveEngine=naive_dnn_dense.trt
```

Then you can continue with the notebooks [02_model_inference_hps_tf_ensemble.ipynb](02_model_inference_hps_tf_ensemble.ipynb) and [03_model_inference_hps_trt_ensemble.ipynb](03_model_inference_hps_trt_ensemble.ipynb) within the container `nvcr.io/nvidia/merlin/merlin-hugectr:22.11`.