# Using TF Serving with AI Platform Prediction Custom Containers (Beta)

This notebook demonstrates how to deploy a TensorFlow 2.x model using AI Platform Prediction Custom Containers (Beta) and TensorFlow Serving.


Although this notebook uses the custom serving module developed in the `01-prepare-for-serving.ipynb` notebook, the techniques discussed here can be applied to any TF 2.x model.

For more information about the AI Platform Prediction Custom Containers feature refer to [TBD].

In [1]:
import base64
import os
import json
import time
import numpy as np
import tensorflow as tf

import google.auth

from google.auth.credentials import Credentials
from google.auth.transport.requests import AuthorizedSession

from typing import List, Optional, Text, Tuple

## Setting up the environment

This notebook was tested on **AI Platform Notebooks** using the standard TF 2.2 image.

### Set the model store path

Set `SAVED_MODEL_PATH` to the GCS location of the `SavedModel` created in the `01-prepare-for-serving.ipynb` notebook.

In [2]:
SAVED_MODEL_PATH = 'gs://mlops-dev-workspace/models/resnet_serving'

### Push the TF Serving container image to the project's Container Registry

A container image used with AI Platform Prediction must be stored in the GCP Container Registry. Retrieving the image from an external registry like Docker Hub is not supported. This example uses the standard TensorFlow Serving Docker images (CPU and GPU). To make them available to AI Platform Prediction, you will push them to the current project's Container Registry.

In [3]:
_ , project_id = google.auth.default()

cpu_image_name = 'gcr.io/{}/tensorflow_serving:latest-cpu'.format(project_id)
gpu_image_name = 'gcr.io/{}/tensorflow_serving:latest-gpu'.format(project_id)

In [4]:
!docker pull tensorflow/serving:latest
!docker pull tensorflow/serving:latest-gpu

latest: Pulling from tensorflow/serving
Digest: sha256:a94b7e3b0e825350675e83b0c2f2fc28f34be358c34e4126a1d828de899ec44f
Status: Image is up to date for tensorflow/serving:latest
docker.io/tensorflow/serving:latest
latest-gpu: Pulling from tensorflow/serving
Digest: sha256:9f2154baa458bf7b523d5f3c9f545056ed14d75ceac00742d1903d37d80393e9
Status: Image is up to date for tensorflow/serving:latest-gpu
docker.io/tensorflow/serving:latest-gpu


In [5]:
!docker tag tensorflow/serving:latest {cpu_image_name}
!docker tag tensorflow/serving:latest-gpu {gpu_image_name}

In [6]:
!docker push {cpu_image_name}
!docker push {gpu_image_name}

The push refers to repository [gcr.io/mlops-dev-env/tensorflow_serving]
ac716820: Preparing
bd8c4bd3: Preparing
e785c230: Preparing
a73fd165: Preparing
f9a74649: Preparing
da143c91: Preparing
287e1f04: Preparing
68776582: Layer already exists
latest-cpu: digest: sha256:a94b7e3b0e825350675e83b0c2f2fc28f34be358c34e4126a1d828de899ec44f size: 1989
The push refers to repository [gcr.io/mlops-dev-env/tensorflow_serving]
41b4553f: Preparing
6ab262b7: Preparing
fdb5f1f9: Preparing
64ade40f: Preparing
0889ee68: Preparing
d332a58a: Preparing
f11cbf29: Preparing
a4b22186: Preparing
afb09dc3: Preparing
b5a53aac: Preparing
c8e5063e: Preparing
7c529ced: Layer already exists
latest-gpu: digest: sha256:9f2154baa458bf7b523d5f3c9f545056ed14d75ceac00742d1903d37d80393e9 size: 2835


## Deploying model versions

### Create an authorized session 

You will be using the AI Platform Prediction REST API to deploy a container. The API uses OAuth 2 for authentication. Instead of manually generating and maintaining OAuth tokens, you will use the `google.auth.transport.requests.AuthorizedSession` client that encapsulates the OAuth workflow.

In [7]:
service_endpoint = 'https://alpha-ml.googleapis.com'

credentials, project_ = google.auth.default()
authed_session = AuthorizedSession(credentials)

### List all models in the project

In [8]:
url = f'{service_endpoint}/v1/projects/{project_id}/models/'

response = authed_session.get(url)
response.json()

{'models': [{'name': 'projects/mlops-dev-env/models/ResNet101',
   'regions': ['us-central1'],
   'etag': 'S7FgvSfwfUY='}]}

### Create a model resource

In [9]:
model_name = 'ResNet101'

url = f'{service_endpoint}/v1/projects/{project_id}/models/'

request_body = {
    "name": model_name
}

response = authed_session.post(url, data=json.dumps(request_body))
response.json()

{'error': {'code': 409,
  'message': 'Field: model.name Error: A model with the same name already exists.',
  'status': 'ALREADY_EXISTS',
  'details': [{'@type': 'type.googleapis.com/google.rpc.BadRequest',
    'fieldViolations': [{'field': 'model.name',
      'description': 'A model with the same name already exists.'}]}]}}
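
The 409 response above only means that the model resource already exists, which is harmless when the cell is re-run. As a minimal sketch, the create call could be made idempotent as shown below. The helper name `create_model_if_missing` is hypothetical (it is not part of the notebook), and `session` is assumed to behave like `AuthorizedSession`, that is, `post` returns a `requests`-style response.

```python
import json


def create_model_if_missing(session, url, model_name):
    """Create a model resource, treating HTTP 409 (ALREADY_EXISTS) as success.

    `session` is assumed to expose `post(url, data=...)` and to return an
    object with `status_code` and `raise_for_status()`, as a requests
    session does. Returns True if the model was created and False if it
    already existed.
    """
    response = session.post(url, data=json.dumps({"name": model_name}))
    if response.status_code == 409:
        # The model already exists; nothing to do.
        return False
    # Surface any other error (permissions, malformed request, ...).
    response.raise_for_status()
    return True
```

In the notebook, the call would look like `create_model_if_missing(authed_session, url, model_name)`.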

### Get the model's info

In [10]:
url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}'

response = authed_session.get(url)
response.json()

{'name': 'projects/mlops-dev-env/models/ResNet101',
 'regions': ['us-central1'],
 'etag': 'S7FgvSfwfUY='}

### Create a model version

When deploying a custom container to AI Platform Prediction, you need to configure two groups of settings. The first group defines the configuration of the AI Platform Prediction service that hosts your container: for example, the node type, manual or autoscaling parameters, and the accelerator configuration. The second group contains the settings specific to a given container.

Refer to [TBD]() for a detailed discussion of the available service settings.

There are three ways of passing configuration settings to a container:
* the settings can be embedded in a custom container image
* you can pass the settings as command line arguments, or 
* you can supply a configuration file. 

In the first method, the configuration settings are supplied at the time the container image is built. The other two methods allow you to supply the settings at deployment time.

Some model servers commonly used in AI Platform Prediction custom containers, including TF Serving used in this notebook, also expose a management API that allows you to change configurations after the server has been deployed. Configuring the server through the management API is currently not supported due to the constraints of the REST interface exposed by AI Platform Prediction.


Supplying configuration settings through a command line interface is straightforward. The AI Platform Prediction REST API utilizes JSON to encode requests and responses. You can provide the command line arguments as the `args` key of the JSON `container` object in the create model version request body.


Passing a config file to a container hosted in AI Platform Prediction is slightly more involved. The container runs in an isolated environment and does not have access to resources (including Cloud Storage) outside of this environment. To pass file-based assets (including a config file) to the container, you need to stage them in the GCS deployment location. The GCS deployment location, set through the `deployment_uri` field of the REST API request body, is copied to the isolated environment when the model version is created. The location of the copy in the isolated environment is exposed to the container through the `AIP_STORAGE_URI` environment variable.
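
To make the staging convention concrete, here is a small sketch that derives, for a local file, both the GCS destination to copy it to and the string to use in the container `args`. The helper name `staged_file_locations` and the bucket path are illustrative assumptions; only the `$(AIP_STORAGE_URI)` reference syntax comes from this notebook.

```python
import posixpath


def staged_file_locations(deployment_uri, file_name):
    """Return (gcs_destination, container_reference) for a staged file.

    gcs_destination is where the file should be copied before the version
    is created; container_reference is the value to pass in the container
    `args`. Only the base name of the local file is used, so a local path
    such as /tmp/batching.pbtxt lands at the top of the deployment location.
    """
    base_name = posixpath.basename(file_name)
    gcs_destination = deployment_uri.rstrip("/") + "/" + base_name
    container_reference = "$(AIP_STORAGE_URI)/" + base_name
    return gcs_destination, container_reference


# Hypothetical bucket path, used only for illustration.
destination, reference = staged_file_locations(
    "gs://my-bucket/models/resnet_serving", "/tmp/batching.pbtxt")
print(destination)
print(reference)
```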

In the following example you will use both command line arguments and a configuration file to configure the TF Serving model server. Most of the settings will be passed as command line arguments. The [server-side batching](https://www.tensorflow.org/tfx/serving/serving_config#batching_configuration) parameters will be passed as a config file.


#### Create the config file with batching settings

In [11]:
batching_config = '/tmp/batching.pbtxt'

In [12]:
%%writefile {batching_config}

max_batch_size { value: 128 }
batch_timeout_micros { value: 150000 }
max_enqueued_batches { value: 16 }
num_batch_threads { value: 8 }

Writing /tmp/batching.pbtxt
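
To get a feel for what these values imply, the sketch below parses the same four parameters and derives two quantities: how long a partially filled batch may wait before it is dispatched, and how many instances can sit in the queue. The regex-based parser is a simplification written for this illustration (it is not how TF Serving reads the file), and the interpretation of the parameters follows the TF Serving batching documentation as I understand it.

```python
import re

BATCHING_CONFIG = """
max_batch_size { value: 128 }
batch_timeout_micros { value: 150000 }
max_enqueued_batches { value: 16 }
num_batch_threads { value: 8 }
"""


def parse_batching_config(text):
    """Extract `name { value: N }` pairs from the text-format config."""
    pattern = r"(\w+)\s*\{\s*value:\s*(\d+)\s*\}"
    return {name: int(value) for name, value in re.findall(pattern, text)}


params = parse_batching_config(BATCHING_CONFIG)

# Longest time a partially filled batch waits before being dispatched.
timeout_ms = params["batch_timeout_micros"] / 1000

# Upper bound on queued instances: full batches times batch size.
max_queued_instances = params["max_batch_size"] * params["max_enqueued_batches"]

print(f"partial batch wait: up to {timeout_ms:.0f} ms")
print(f"queued instances: up to {max_queued_instances}")
```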


#### Copy the batch config file to the staging location in GCS

You are going to use the folder where the custom ResNet101 model was saved as the staging location.

In [13]:
!gsutil cp {batching_config} {SAVED_MODEL_PATH}/batching.pbtxt

Copying file:///tmp/batching.pbtxt [Content-Type=application/octet-stream]...
/ [1 files][  136.0 B/  136.0 B]                                                
Operation completed over 1 objects/136.0 B.                                      


In [14]:
!gsutil cat {SAVED_MODEL_PATH}/batching.pbtxt


max_batch_size { value: 128 }
batch_timeout_micros { value: 150000 }
max_enqueued_batches { value: 16 }
num_batch_threads { value: 8 }


#### Deploy the container

In [15]:
version_name = 'batching_150'

url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}/versions'

request_body = {
    # Service settings
    "name": version_name,
    "deployment_uri": SAVED_MODEL_PATH,
    "machine_type": 'n1-standard-8',
    "accelerator_config": {
        "count": 1,
        "type": 'NVIDIA_TESLA_P4'},
    "routes": {
        "predict": f"/v1/models/{model_name}:predict",
        "health": f"/v1/models/{model_name}"},
    
    # Container settings
    "container": {
        "image": gpu_image_name,
        "args": [
            "--rest_api_port=8080",
            f"--model_name={model_name}",
            "--model_base_path=$(AIP_STORAGE_URI)",
            "--enable_batching",
            "--batching_parameters_file=$(AIP_STORAGE_URI)/batching.pbtxt"]}
}
            
response = authed_session.post(url, data=json.dumps(request_body))
response.json()

{'name': 'projects/mlops-dev-env/operations/create_ResNet101_batching_150-1597681616585',
 'metadata': {'@type': 'type.googleapis.com/google.cloud.ml.v1.OperationMetadata',
  'createTime': '2020-08-17T16:26:57Z',
  'operationType': 'CREATE_VERSION',
  'modelName': 'projects/mlops-dev-env/models/ResNet101',
  'version': {'name': 'projects/mlops-dev-env/models/ResNet101/versions/batching_150',
   'deploymentUri': 'gs://mlops-dev-workspace/models/resnet_serving',
   'createTime': '2020-08-17T16:26:56Z',
   'etag': 'ZhHucsxrHMI=',
   'machineType': 'n1-standard-8',
   'acceleratorConfig': {'count': '1', 'type': 'NVIDIA_TESLA_P4'},
   'container': {'image': 'gcr.io/mlops-dev-env/tensorflow_serving:latest-gpu',
    'args': ['--rest_api_port=8080',
     '--model_name=ResNet101',
     '--model_base_path=$(AIP_STORAGE_URI)',
     '--enable_batching',
     '--batching_parameters_file=$(AIP_STORAGE_URI)/batching.pbtxt']},
   'routes': {'predict': '/v1/models/ResNet101:predict',
    'health': '/v1

#### Check the deployment status

In [16]:
url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}/versions/{version_name}'

response = authed_session.get(url)
response.json()

{'name': 'projects/mlops-dev-env/models/ResNet101/versions/batching_150',
 'deploymentUri': 'gs://mlops-dev-workspace/models/resnet_serving',
 'createTime': '2020-08-17T16:26:56Z',
 'state': 'CREATING',
 'etag': '8O+3Tg7ULTM=',
 'machineType': 'n1-standard-8',
 'acceleratorConfig': {'count': '1', 'type': 'NVIDIA_TESLA_P4'},
 'container': {'image': 'gcr.io/mlops-dev-env/tensorflow_serving:latest-gpu',
  'args': ['--rest_api_port=8080',
   '--model_name=ResNet101',
   '--model_base_path=$(AIP_STORAGE_URI)',
   '--enable_batching',
   '--batching_parameters_file=$(AIP_STORAGE_URI)/batching.pbtxt']},
 'routes': {'predict': '/v1/models/ResNet101:predict',
  'health': '/v1/models/ResNet101'}}
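
Creating a version takes a while, so the status check above usually has to be repeated until the `state` field changes from `CREATING`. A minimal polling sketch follows. The helper name `wait_for_version` and the default timeout and interval are assumptions chosen for illustration; the state lookup is injected as a callable so the loop does not depend on any particular client.

```python
import time


def wait_for_version(get_state, timeout_s=1800, poll_interval_s=30,
                     sleep=time.sleep):
    """Poll `get_state()` until the version leaves the CREATING state.

    `get_state` is any zero-argument callable that returns the version's
    `state` string. Returns the first state other than CREATING, or raises
    TimeoutError if the version is still being created at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state != "CREATING":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Version still in state {state!r} after {timeout_s} s")
        sleep(poll_interval_s)
```

In this notebook, `get_state` could be `lambda: authed_session.get(url).json().get('state')`. The caller should still inspect the returned value, since the version may end up in a state such as `FAILED` rather than `READY`.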

## Testing the deployed model

You will now run inference by invoking the TF Serving `Predict` API.

Refer to the [TF Serving REST API Reference](https://www.tensorflow.org/tfx/serving/api_rest) for more information about the API format.

#### Load sample images

In [None]:
image_folder = 'locust/locust-image/test_images'
raw_images = [
    tf.io.read_file(os.path.join(image_folder, image_path)).numpy()
    for image_path in os.listdir(image_folder)
]

encoded_images = [{'b64': base64.b64encode(image).decode('utf-8')} for image in raw_images]  
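
The list built above follows the TF Serving REST convention of wrapping base64-encoded binary values in a `{"b64": ...}` object. The sketch below shows how such instances fit into a complete row-format request body. The helper name `build_predict_request` is hypothetical, and a short byte string stands in for a real image file.

```python
import base64
import json


def build_predict_request(raw_items, signature_name):
    """Build a TF Serving row-format request body from raw byte strings."""
    instances = [
        {"b64": base64.b64encode(item).decode("utf-8")} for item in raw_items
    ]
    return {"signature_name": signature_name, "instances": instances}


# Placeholder bytes; in the notebook these would be the contents of an image.
sample = b"\x89PNG placeholder bytes"
body = build_predict_request([sample], "serving_preprocess")

# JSON string suitable for the body of an HTTP POST request.
payload = json.dumps(body)
```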

#### Call the `predict` endpoint 

In [None]:
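signature = 'serving_preprocess'

# Send the request to the version's predict endpoint through the authorized session.
url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}/versions/{version_name}:predict'

request_body = {
    'signature_name': signature,
    'instances': encoded_images
}

response = authed_session.post(url, data=json.dumps(request_body))
response.json()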

## Cleaning up

### Delete model version and model resources
#### List model versions

In [None]:
model_name = 'ResNet101'

url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}/versions'

response = authed_session.get(url)
response.json()

#### Delete the specific version

In [None]:
version_name = 'batching_150'

url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}/versions/{version_name}'

response = authed_session.delete(url)
response.json()

#### Delete the model

In [None]:
url = f'{service_endpoint}/v1/projects/{project_id}/models/{model_name}'

response = authed_session.delete(url)
response.json()

## Next Steps

Walk through the `aipp_deploy.ipynb` notebook to learn how to deploy the custom serving module created in this notebook to **AI Platform Prediction** using TF Serving container image.

## License

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.</font>