# Verifying a working TF Serving setup

This tutorial shows:
- how to run TF Serving for a custom model (installed here from the apt package rather than run from a Docker image)
- how to request predictions via both gRPC and REST API calls
- how the prediction timings of the two kinds of call compare

This notebook was written with reference to the [official TF Serving gRPC example](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/resnet_k8s.yaml) and the [official TF Serving REST API example](https://www.tensorflow.org/tfx/tutorials/serving/rest_simple).

### Imports

In [None]:
!pip install -q requests
!pip install -q tensorflow-serving-api

In [2]:
import os
import tempfile
import pandas as pd
import tensorflow as tf
import numpy as np
import json
import requests

# gRPC request specific imports
import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

## Model

### Get a sample model 

The target model is the plain `ResNet50` trained on ImageNet.

In [3]:
core = tf.keras.applications.ResNet50(include_top=True, input_shape=(224, 224, 3))

inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="image_input")
preprocess = tf.keras.applications.resnet50.preprocess_input(inputs)
outputs = core(preprocess, training=False)
model = tf.keras.Model(inputs=[inputs], outputs=[outputs])

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5


### Save the model

The code below saves the model under `MODEL_DIR`, in a subdirectory named after the model version.

In [4]:
MODEL_DIR = tempfile.gettempdir()
version = 1
export_path = os.path.join(MODEL_DIR, str(version))
print('export_path = {}\n'.format(export_path))

tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None
)

print('\nSaved model:')
!ls -l {export_path}

export_path = /tmp/1

INFO:tensorflow:Assets written to: /tmp/1/assets

Saved model:
total 4040
drwxr-xr-x 2 root root    4096 Mar 23 07:32 assets
-rw-r--r-- 1 root root  557217 Mar 23 07:32 keras_metadata.pb
-rw-r--r-- 1 root root 3565545 Mar 23 07:32 saved_model.pb
drwxr-xr-x 2 root root    4096 Mar 23 07:32 variables
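TF Serving treats each numeric subdirectory of the model base path as one version of the model and, by default, serves the highest-numbered one. The sketch below lists such version directories using only the standard library. `list_model_versions` is a hypothetical helper name, not part of TensorFlow, and the demo builds a throwaway directory instead of touching `MODEL_DIR`.

```python
import os
import tempfile


def list_model_versions(base_path):
    """Return the numeric subdirectory names of base_path as sorted ints.

    Hypothetical helper: it only mirrors the <base_path>/<version>/ layout
    produced by the export above; it does not call TF Serving itself.
    """
    versions = []
    for name in os.listdir(base_path):
        full_path = os.path.join(base_path, name)
        if os.path.isdir(full_path) and name.isdigit():
            versions.append(int(name))
    return sorted(versions)


# Demo on a throwaway directory that imitates a model base path.
with tempfile.TemporaryDirectory() as demo_base:
    for name in ("1", "2", "notes"):
        os.makedirs(os.path.join(demo_base, name))
    demo_versions = list_model_versions(demo_base)

print(demo_versions)  # [1, 2]
```

Because `MODEL_DIR` here is the system temp directory, a call such as `list_model_versions(MODEL_DIR)` may be a quick way to check that no unrelated numeric folder sits next to the exported version.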


### Examine your saved model

TensorFlow ships with a handy `saved_model_cli` tool for inspecting a SavedModel.

Notice in `signature_def['serving_default']` that:
- the input name is `image_input`
- the output name is `resnet50`

You need these names to make requests to the TF Serving server later.

In [5]:
!saved_model_cli show --dir {export_path} --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['image_input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: serving_default_image_input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['resnet50'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict

Concrete Functions:
  Function Name: '__call__'
    Option #1
      Callable with:
        Argument #1
          inpu

## TF Serving

### Create dummy data

The dummy data is simply a batch of 32 image-shaped tensors filled with random numbers.

In [6]:
dummy_inputs = tf.random.normal((32, 224, 224, 3))
dummy_inputs.shape

TensorShape([32, 224, 224, 3])

### Install TF Serving tool

In [None]:
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
!sudo apt update

In [None]:
!sudo apt-get install tensorflow-model-server

### Run TF Serving server

In [26]:
os.environ["MODEL_DIR"] = MODEL_DIR

The `tensorflow_model_server` command accepts a set of options.
- `--rest_api_port` exposes an additional port for the REST API. The gRPC port defaults to `8500`.
- `--model_name` sets the name under which TF Serving serves the model. This name appears in the REST API's URI.
- `--enable_model_warmup` 
  - The TensorFlow runtime has components that are lazily initialized, which can cause high latency for the first request(s) sent to a model after it is loaded. To reduce the impact of lazy initialization on request latency, it's possible to trigger the initialization of the sub-systems and components at model load time by providing a sample set of inference requests along with the SavedModel. This process is known as "warming up" the model.
  - To trigger warmup of the model at load time, attach a warmup data file under the `assets.extra` subfolder of the SavedModel directory.
  - The `--enable_model_warmup` option triggers this process.
  - For further information, see the [official documentation](https://www.tensorflow.org/tfx/serving/saved_model_warmup?hl=en).
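To illustrate how the model name and REST port end up in the request URI, here is a minimal sketch of a URL builder. `predict_url` is a hypothetical helper, and the host, port and version values are assumptions that match the server command used in this notebook.

```python
def predict_url(model_name, host="localhost", port=8501, version=None):
    """Build a TF Serving REST predict URL.

    Hypothetical helper. Without a version the server picks the version
    it is currently serving; with one, the URL pins that version.
    """
    base = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version is not None:
        base += "/versions/{}".format(version)
    return base + ":predict"


url_latest = predict_url("resnet_model")
url_v1 = predict_url("resnet_model", version=1)
print(url_latest)  # http://localhost:8501/v1/models/resnet_model:predict
print(url_v1)      # http://localhost:8501/v1/models/resnet_model/versions/1:predict
```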

In [None]:
!nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=resnet_model \
  --model_base_path=$MODEL_DIR >server.log 2>&1 &

# --enable_model_warmup for warmup(https://www.tensorflow.org/tfx/serving/saved_model_warmup)

In [28]:
!cat server.log

Notice that the server listens on two ports: `8501` for the REST API and `8500` for gRPC.

In [29]:
!sudo lsof -i -P -n | grep LISTEN

node         7 root   21u  IPv6  25789      0t0  TCP *:8080 (LISTEN)
colab-fil   30 root    5u  IPv4  26644      0t0  TCP *:3453 (LISTEN)
colab-fil   30 root    6u  IPv6  26645      0t0  TCP *:3453 (LISTEN)
jupyter-n   43 root    6u  IPv4  25864      0t0  TCP 172.28.0.2:9000 (LISTEN)
python3     61 root   15u  IPv4  27814      0t0  TCP 127.0.0.1:50215 (LISTEN)
python3     61 root   18u  IPv4  27818      0t0  TCP 127.0.0.1:54779 (LISTEN)
python3     61 root   21u  IPv4  27822      0t0  TCP 127.0.0.1:40395 (LISTEN)
python3     61 root   24u  IPv4  27826      0t0  TCP 127.0.0.1:60517 (LISTEN)
python3     61 root   30u  IPv4  27832      0t0  TCP 127.0.0.1:40255 (LISTEN)
python3     61 root   43u  IPv4  28831      0t0  TCP 127.0.0.1:53235 (LISTEN)
python3     81 root    3u  IPv4  29267      0t0  TCP 127.0.0.1:15144 (LISTEN)
python3     81 root    5u  IPv4  28223      0t0  TCP 127.0.0.1:42197 (LISTEN)
python3     81 root    9u  IPv4  28356      0t0  TCP 127.0.0.1:41627 (LISTEN)
tensorflo 593
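Besides checking open ports, the REST API exposes a model status endpoint at `/v1/models/<model_name>` that reports the state of each loaded version. The sketch below polls it until a version is `AVAILABLE`. The helper names, retry count and delay are assumptions for illustration, and the demo runs on a canned response so that it works without a live server.

```python
import json
import time
import urllib.request


def is_model_available(status):
    """True if any version in a model-status response is AVAILABLE."""
    versions = status.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def fetch_status(url):
    """GET the status endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_available(url, fetch=fetch_status, retries=30, delay=1.0):
    """Poll until the model is AVAILABLE or the retries run out."""
    for _ in range(retries):
        try:
            if is_model_available(fetch(url)):
                return True
        except OSError:
            pass  # server is not accepting connections yet
        time.sleep(delay)
    return False


# Offline demo with a canned response in the documented shape.
sample_status = {"model_version_status": [
    {"version": "1", "state": "AVAILABLE",
     "status": {"error_code": "OK", "error_message": ""}}]}
print(is_model_available(sample_status))  # True

# Against the server started above (not executed here):
# wait_until_available("http://localhost:8501/v1/models/resnet_model")
```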

## REST API request

### Convert dummy data to JSON format

In [30]:
data = json.dumps({"signature_name": "serving_default", "instances": dummy_inputs.numpy().tolist()})
print('Data: {} ... {}'.format(data[:50], data[len(data)-52:]))

Data: {"signature_name": "serving_default", "instances": ... 442383, 0.8007770776748657, -0.7472004890441895]]]]}
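The TF Serving REST API accepts a predict request body in two shapes: the row format used above, with an `instances` list holding one entry per example, and a columnar format with an `inputs` object keyed by input name. The sketch below builds both for a tiny stand-in batch. `row_payload` and `columnar_payload` are hypothetical helper names, and the two-element rows are illustrative only, not valid ResNet inputs.

```python
import json


def row_payload(batch, signature_name="serving_default"):
    """Row format: one list entry per example under "instances"."""
    return json.dumps({"signature_name": signature_name, "instances": batch})


def columnar_payload(batch, input_name, signature_name="serving_default"):
    """Columnar format: the whole batch under its input name in "inputs"."""
    return json.dumps({"signature_name": signature_name,
                       "inputs": {input_name: batch}})


tiny_batch = [[0.0, 1.0], [2.0, 3.0]]  # stand-in for dummy_inputs.numpy().tolist()
row_body = row_payload(tiny_batch)
col_body = columnar_payload(tiny_batch, "image_input")
print(row_body)
print(col_body)
```

For a model with a single input, such as the one in this notebook, the row format is usually the simpler choice.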


### Make a request

In [31]:
headers = {"content-type": "application/json"}

In [32]:
%%timeit
json_response = requests.post('http://localhost:8501/v1/models/resnet_model:predict', 
                              data=data, headers=headers)

1 loop, best of 5: 4.11 s per loop
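Variables assigned inside a `%%timeit` cell are not kept afterwards, which is presumably why the request is issued again in the next section. As an alternative, a small timing helper can return both the best time and the last result. This is a minimal sketch using `time.perf_counter`; `time_call` is a hypothetical helper, and the demo times a trivial function rather than a real request.

```python
import time


def time_call(fn, repeats=5):
    """Call fn() `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


best_seconds, value = time_call(lambda: sum(range(1000)))
print(value)  # 499500

# With the request from this notebook (not executed here):
# best_seconds, json_response = time_call(
#     lambda: requests.post('http://localhost:8501/v1/models/resnet_model:predict',
#                           data=data, headers=headers))
```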


### Interpret the output

In [36]:
json_response = requests.post('http://localhost:8501/v1/models/resnet_model:predict', 
                              data=data, headers=headers)
rest_predictions = json.loads(json_response.text)['predictions']
print('Prediction class: {}'.format(np.argmax(rest_predictions, axis=-1)))

Prediction class: [664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664
 664 664 664 664 664 664 664 851 664 664 851 664 664 664]
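The JSON response holds plain nested lists, so class rankings can also be computed without NumPy. The sketch below returns the top-k class indices with their scores for each row. `top_k` is a hypothetical helper and the toy scores are made up for illustration. Since the inputs here are random noise, the predicted classes carry no meaning beyond showing that the server responds.

```python
def top_k(scores, k=3):
    """Return the k (class_index, score) pairs with the highest scores."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy stand-in for rest_predictions: two examples, three classes each.
toy_predictions = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
]
toy_top2 = [top_k(row, k=2) for row in toy_predictions]
print(toy_top2)  # [[(1, 0.7), (2, 0.2)], [(0, 0.5), (1, 0.3)]]
```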


## gRPC request

### Open up gRPC channel

In [37]:
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

### Prepare a request

In [38]:
request = predict_pb2.PredictRequest()
request.model_spec.name = 'resnet_model'
request.model_spec.signature_name = 'serving_default'
request.inputs['image_input'].CopyFrom(
    tf.make_tensor_proto(dummy_inputs)) #, shape=[32,224,224,3]))

### Make a request

In [39]:
%%timeit
result = stub.Predict(request, 10.0)  # 10 secs timeout

1 loop, best of 5: 3.63 s per loop


### Interpret the output

In [40]:
grpc_predictions = stub.Predict(request, 10.0)  # 10 secs timeout
grpc_predictions = grpc_predictions.outputs['resnet50'].float_val
grpc_predictions = np.array(grpc_predictions).reshape(32, -1)
print('Prediction class: {}'.format(np.argmax(grpc_predictions, axis=-1)))

Prediction class: [664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664
 664 664 664 664 664 664 664 851 664 664 851 664 664 664]
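`float_val` is a flat sequence of numbers, so the batch dimension has to be restored before taking the per-example argmax. As a plain-Python illustration of what `reshape(32, -1)` does above, the sketch below splits a flat list into equally sized rows. `to_rows` is a hypothetical helper, not part of the TF Serving API.

```python
def to_rows(flat_values, batch_size):
    """Split a flat sequence into batch_size rows of equal width."""
    flat_values = list(flat_values)
    if batch_size <= 0 or len(flat_values) % batch_size != 0:
        raise ValueError("length must be a positive multiple of batch_size")
    width = len(flat_values) // batch_size
    return [flat_values[i * width:(i + 1) * width] for i in range(batch_size)]


rows = to_rows([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], batch_size=2)
print(rows)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```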


## Check that the two results match

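`np.testing.assert_allclose` raises an `AssertionError` when the two arrays differ by more than the given tolerance. With `atol=1e-4`, small floating-point differences between the two responses are accepted.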

In [41]:
np.testing.assert_allclose(rest_predictions, grpc_predictions, atol=1e-4)
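For reference, `assert_allclose` checks each element against the rule `|actual - desired| <= atol + rtol * |desired|`, with `rtol` defaulting to `1e-7`. The sketch below restates that rule for single numbers in plain Python. `within_tolerance` is a hypothetical helper for illustration, not a NumPy function.

```python
def within_tolerance(actual, desired, rtol=1e-7, atol=0.0):
    """Scalar version of the element-wise rule used by assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


print(within_tolerance(0.50004, 0.5, atol=1e-4))  # True: differs by about 4e-5
print(within_tolerance(0.5002, 0.5, atol=1e-4))   # False: differs by about 2e-4
```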

## Conclusion

The gRPC call took about 3.63 seconds while the REST API call took about 4.11 seconds on a batch of 32 inputs. In this measurement, gRPC was therefore roughly 12% faster than the REST API.
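The size of the gap can be worked out directly from the two `%%timeit` results reported above. The short calculation below uses only those two numbers; the variable names are chosen for illustration.

```python
rest_seconds = 4.11  # best-of-5 time for the REST API request
grpc_seconds = 3.63  # best-of-5 time for the gRPC request

saved = rest_seconds - grpc_seconds
relative = saved / rest_seconds
print("{:.2f} s saved, {:.1%} faster".format(saved, relative))
# 0.48 s saved, 11.7% faster
```

These are single best-of-5 measurements on one machine, so the figure is best read as indicative rather than as a general benchmark.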

Also note that this is very close to the performance of ONNX inference without any server framework involved. We can therefore expect TF Serving with gRPC to be faster than ONNX hosted on a FastAPI server, since FastAPI is a Python framework while TF Serving is implemented in C++.