# Deep Learning Model Deployment at Scale

**Contents**

* Introduction
* Serving TF model using **TF serving**
     * TF serving
* Create your model
* Save model as per timestamp
* Installing TF Serving
* Querying TF Serving by REST API
* Querying TF Serving by gRPC API
* Deploying a new model version after retraining

## Introduction

Once we have built a model and are satisfied with its results, we put it into production, where users query it for predictions. Before doing so, we must consider the following points:

1. The model can be served through a simple **REST API**.
2. It should have a quick response time.
3. As time passes, we should schedule **retraining** of the model on fresh data to avoid model drift and keep the model robust.
4. After retraining, we should push the updated version into production.
5. Proper **versioning of models** is needed, and the transition from one version to a newer one should be handled without disrupting the service.
6. We should maintain multiple models for A/B testing. In case of any issue, we should be able to roll back to the previous stable version.
7. The model should handle a high number of **queries per second (QPS)** and scale when the number of requests spikes.
8. There are two ways to meet the requirements above:
    * i.  Using your own hardware setup or server.
    * ii. Using cloud services (PaaS, Platform as a Service) such as GCP, Azure, AWS, etc.

## Serving TF model using **TF serving**

Once you have built your model, you can get predictions by calling its `predict()` method. But as the app grows and you want to launch it publicly so that users across the world can use it, you need a thin Python wrapper that serves the model as a service through a simple REST API or gRPC API.

Creating a separate prediction API isolates the model from the rest of the infrastructure and makes it easier to manage. There are many ways to create such a microservice, two of which are:  
* Flask Library
* TF serving
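
To make the "thin wrapper" idea concrete, here is a minimal sketch that uses only Python's standard library rather than Flask or TF Serving. It is an illustration under assumptions, not a production setup: `fake_predict` is a hypothetical stand-in for `model.predict()`, and the `/predict` route and port 8000 are arbitrary choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_predict(instances):
    """Hypothetical stand-in for model.predict(): one uniform 10-class row per instance."""
    return [[0.1] * 10 for _ in instances]


def handle_predict(body: bytes) -> bytes:
    """Decode a JSON request, run the (stand-in) model, and encode a JSON response."""
    payload = json.loads(body)
    predictions = fake_predict(payload["instances"])
    return json.dumps({"predictions": predictions}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # assumed route name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        result = handle_predict(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) the server; call .serve_forever() on it to run."""
    return HTTPServer(("localhost", port), PredictHandler)
```

Keeping the request handling in a plain function (`handle_predict`) means the prediction logic can be tested without starting a server. A hand-written wrapper like this leaves versioning, batching and reloading to you.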

### TF serving
* **TF Serving** is a model server from TensorFlow written in C++, so it is fast and efficient.
* It can serve multiple models and automatically deploy the latest available version of each model, so it does most of the heavy lifting for us.


<img src="TF_deployment_setup_imgs/TF_serving.jpeg">
<a href="https://pbs.twimg.com/media/C4vf8SQUcAALCyl?format=jpg&name=large">source</a>

## Create your model

Let's create a simple ANN model using the Fashion MNIST data.

In [None]:
import os

# uncomment for Google colab or if you want to run this notebook at different location
# ROOT = "/content/drive/My Drive/iNeuron_Retraining_trails"
# os.chdir(ROOT)
# os.getcwd()

In [None]:
# import tf and keras
import tensorflow as tf
from tensorflow import keras

In [None]:
# load the Fashion MNIST dataset from keras.datasets
fashion_mnist = keras.datasets.fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

In [None]:
# hold out the first 5000 images for validation and scale pixel values to [0, 1]
X_valid, X_train = X_train[:5000]/255.0, X_train[5000:]/255.0
y_valid, y_train = y_train[:5000], y_train[5000:]

In [None]:
# define the list of class names (index = label id)
class_names = ["T-shirt",
               "Trouser",
               "Pullover",
               "Dress",
               "Coat",
               "Sandal",
               "Shirt",
               "Sneaker",
               "Bag",
               "Ankle boot"]

In [None]:
# class_names[y_train[0]]

In [None]:
# define the layers of the ANN
LAYERS = [keras.layers.Flatten(input_shape=[28,28]),
          keras.layers.Dense(300, activation="relu"),
          keras.layers.Dense(100, activation="relu"),
          keras.layers.Dense(10, activation="softmax")]
model = keras.models.Sequential(LAYERS)

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_4 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_5 (Dense)              (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________
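
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Flatten` has no parameters. A small sketch of that arithmetic (the helper name `dense_params` is just for illustration):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


flattened = 28 * 28  # Flatten turns each 28x28 image into 784 values, no parameters
layer_sizes = [300, 100, 10]

counts = []
n_inputs = flattened
for n_units in layer_sizes:
    counts.append(dense_params(n_inputs, n_units))
    n_inputs = n_units  # the next layer's inputs are this layer's outputs

print(counts)       # [235500, 30100, 1010]
print(sum(counts))  # 266610
```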


In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

In [None]:
history = model.fit(X_train, 
                    y_train, 
                    epochs=30, 
                    validation_data=(X_valid, y_valid))

Epoch 1/30
...
Epoch 30/30


## Save model as per timestamp

In [None]:
import time
# timestamp in the form YYYYMMDD_HHMMSS (%m is the month; %M is the minute)
fileName = time.strftime("%Y%m%d_%H%M%S")
model.save(f"model_{fileName}.h5")

In [None]:
# define version of the model and then save the model
model_version = "0002"
model_name = "the_mnist_model"
model_path = os.path.join(model_name, model_version)
tf.saved_model.save(model, model_path)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: the_mnist_model/0002/assets


> **Let's check the directory structure after saving the model:**

```
└── the_mnist_model
    ├── 0001
    │   ├── assets
    │   ├── saved_model.pb
    │   └── variables
    │       ├── variables.data-00000-of-00001
    │       └── variables.index
    └── 0002
        ├── assets
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            ├── variables.data-00000-of-00002
            ├── variables.data-00001-of-00002
            └── variables.index

```
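
Each version lives in its own numbered subdirectory, as in the tree above. Instead of hard-coding `model_version`, the next version number could be derived from the directories that already exist. The following is a sketch under assumptions: the helper names are hypothetical, the four-digit zero padding simply mirrors `0001` and `0002`, and the demo runs on a throwaway temporary directory.

```python
import os
import tempfile


def existing_versions(model_root: str) -> list:
    """Return the numeric version subdirectories of model_root, sorted ascending."""
    if not os.path.isdir(model_root):
        return []
    return sorted(
        int(name)
        for name in os.listdir(model_root)
        if name.isdigit() and os.path.isdir(os.path.join(model_root, name))
    )


def next_version_path(model_root: str, width: int = 4) -> str:
    """Path for the next version, zero-padded like '0001', '0002', ..."""
    versions = existing_versions(model_root)
    next_version = versions[-1] + 1 if versions else 1
    return os.path.join(model_root, str(next_version).zfill(width))


# Demo on a throwaway directory that mimics the tree above.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "the_mnist_model")
    for v in ("0001", "0002"):
        os.makedirs(os.path.join(root, v))
    demo_path = next_version_path(root)
    print(os.path.basename(demo_path))  # 0003
```

The returned path could then be passed to `tf.saved_model.save(model, path)` in place of the `model_path` built by hand.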


In [None]:
# load the model back
saved_model = tf.keras.models.load_model(model_path) # since used keras model to save
y_pred = saved_model(X_valid, training=False)



In [None]:
y_pred[0]

<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([0.05720502, 0.23146962, 0.07682422, 0.15916158, 0.0462238 ,
       0.20618983, 0.05293825, 0.04751342, 0.07723207, 0.04524222],
      dtype=float32)>
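
Each row of `y_pred` holds one probability per class, so the predicted label is the class with the highest value. A small sketch using the probabilities printed above (the helper `top_class` is illustrative):

```python
class_names = ["T-shirt", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Probabilities copied from the y_pred[0] output above.
probs = [0.05720502, 0.23146962, 0.07682422, 0.15916158, 0.0462238,
         0.20618983, 0.05293825, 0.04751342, 0.07723207, 0.04524222]


def top_class(probabilities, names):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return names[best], probabilities[best]


label, p = top_class(probs, class_names)
print(label, round(p, 3))  # Trouser 0.231
```

With the full `y_pred` tensor, `tf.argmax(y_pred, axis=1)` gives the same index for every row at once.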

> **TF comes with a CLI tool to inspect SavedModels.**
The command below can also be run in a command prompt or terminal.

In [None]:
!saved_model_cli show --dir the_mnist_model/0001/ --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['flatten_input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 28, 28)
        name: serving_default_flatten_input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['dense_2'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 10)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Defined Functions:
  Function Name: '__c

## Installing TF Serving

There are 3 ways to install TF Serving, of which Docker is the easiest.

**STEP 1: Install Docker**
To use the Docker route, first install Docker on your machine. Installation guides:

* For ubuntu- [Click here](https://docs.docker.com/engine/install/ubuntu/)
* For mac- [Click here](https://docs.docker.com/docker-for-mac/install/)
* For Windows - [Click here](https://docs.docker.com/docker-for-windows/install/)
* Home Page - [Click here](https://docs.docker.com/get-docker/)

**STEP 2: Get TF Serving**
Run the command below in your terminal:
`docker pull tensorflow/serving`

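**STEP 3: Create a Docker container to run the image**
```
docker run -it --rm -p 8500:8500 -p 8501:8501 -v "LOCAL_MODEL_PATH:/models/the_mnist_model" -e MODEL_NAME=the_mnist_model tensorflow/serving
```

Replace `LOCAL_MODEL_PATH` with the absolute path to the local `the_mnist_model` directory. The model name set here must match the name used in the query URLs later on. The command loads the latest model version and serves it on the following ports: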

API type | port
-|-
gRPC | 8500
REST | 8501
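
Before sending predictions, it can help to confirm that the model actually loaded. TF Serving's REST API exposes a model status endpoint at `/v1/models/<name>`. The sketch below builds that URL and fetches it with the standard library. It assumes the container is reachable on `localhost:8501` and that the name passed in matches the `MODEL_NAME` given to `docker run`; the helper names are hypothetical.

```python
import json
import urllib.request


def model_status_url(model_name, host="localhost", port=8501, version=None):
    """Build the TF Serving REST URL that reports a model's load status."""
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        url += f"/versions/{int(version)}"  # "0002" is addressed as version 2
    return url


def fetch_model_status(url, timeout=5.0):
    """GET the status document; only works while the container is running."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(model_status_url("the_mnist_model"))
# http://localhost:8501/v1/models/the_mnist_model
print(model_status_url("the_mnist_model", version="0002"))
# http://localhost:8501/v1/models/the_mnist_model/versions/2
```

Calling `fetch_model_status(model_status_url("the_mnist_model"))` while the server is up should return a JSON document describing the state of each loaded version.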



## Querying TF Serving by REST API

In [None]:
import json

In [None]:
X_new = X_valid.copy()

In [None]:
input_json_data = json.dumps({
    "signature_name": "serving_default",
    "instances": X_new.tolist(),
})

In [None]:
# input_json_data

'{"signature_name": "serving_default", "instances": [[0.0, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.0, 0.0, 0.08627450980392157, 0.34509803921568627, 0.7372549019607844, 0.6745098039215687, 0.5176470588235295, 0.49019607843137253, 0.5529411764705883, 0.7803921568627451, 0.5607843137254902, 0.03529411764705882, 0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.0, 0.0784313725490196, 0.5137254901960784, 0.7803921568627451, 0.807843137254902, 0.7686274509803922, 0.792156862745098, 0.9490196078431372, 1.0, 1.0, 0.9803921568627451, 0.8705882352941177, 0.7725490196078432, 0.807843137254902, 0.7372549019607844, 0.49411764705882355, 0.06666666666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.00392156862745098, 0.0, 0.13725490196078433, 0.8392156862745098, 0.7490196078431373, 0.7176470588235294, 0.6980392156862745, 0.6862745098039216, 0.6588235294117647, 0.5882352941176471, 0.6352941176470588, 0.6235294117647059, 0.5960784

In [None]:
import requests
SERVER_URL = "http://localhost:8501/v1/models/the_mnist_model:predict"

response = requests.post(SERVER_URL, data=input_json_data)
response.raise_for_status()
response = response.json()

In [None]:
response

{'predictions': [[0.0572050065,
   0.231469616,
   0.076824218,
   0.159161627,
   0.0462238118,
   0.206189841,
   0.0529382303,
   0.0475134291,
   0.0772320405,
   0.0452422164],
  [0.0472382307,
   0.168135911,
   0.0528706238,
   0.137472361,
   0.0496005043,
   0.261057943,
   0.0847247913,
   0.0483505391,
   0.130521834,
   0.0200273097],
  [0.0820436627,
   0.158727407,
   0.0849240944,
   0.113156088,
   0.0787044615,
   0.109994702,
   0.104340911,
   0.0901152864,
   0.113607585,
   0.0643857121],
  [0.0858198553,
   0.192299441,
   0.0818804,
   0.126614451,
   0.064550519,
   0.159590483,
   0.0838090479,
   0.0623709597,
   0.0973394588,
   0.0457253531],
  [0.0475638956,
   0.169087052,
   0.070210278,
   0.152007759,
   0.0479101427,
   0.152159974,
   0.0957847759,
   0.0606140494,
   0.15392442,
   0.0507376678],
  [0.0463236161,
   0.21379827,
   0.0508589186,
   0.1030728,
   0.0623621419,
   0.199949175,
   0.120301619,
   0.0502906628,
   0.113015287,
   0.040027

> The API output above can be verified with the Postman tool:

<img src="TF_deployment_setup_imgs/postman_output.png">
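
The request above posts the whole validation set (5,000 images) in a single JSON body. One way to keep individual requests small is to split the instances into batches. This is only a sketch: the helper name `batched_payloads` and the batch sizes are arbitrary assumptions, and the demo uses tiny 2x2 stand-in "images" instead of `X_new.tolist()`.

```python
import json


def batched_payloads(instances, batch_size=100, signature_name="serving_default"):
    """Yield JSON request bodies holding at most batch_size instances each."""
    for start in range(0, len(instances), batch_size):
        yield json.dumps({
            "signature_name": signature_name,
            "instances": instances[start:start + batch_size],
        })


# Tiny stand-in for X_new.tolist(): 5 "images" of shape 2x2 instead of 28x28.
fake_instances = [[[0.0, 0.5], [1.0, 0.25]] for _ in range(5)]
payloads = list(batched_payloads(fake_instances, batch_size=2))
print(len(payloads))  # 3
print([len(json.loads(p)["instances"]) for p in payloads])  # [2, 2, 1]
```

Each payload can be posted with `requests.post(SERVER_URL, data=payload)` as before, and the `predictions` lists from the responses concatenated in order.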

## Querying TF Serving by gRPC API

**gRPC**

> gRPC (gRPC Remote Procedure Calls) is an open source remote procedure call (RPC) system initially developed at Google in 2015. It uses HTTP/2 for transport, Protocol Buffers as the interface description language, and provides features such as authentication, bidirectional streaming and flow control, blocking or nonblocking bindings, and cancellation and timeouts. It generates cross-platform client and server bindings for many languages. Most common usage scenarios include connecting services in microservices style architecture and connect mobile devices, browser clients to backend services.
>
> (Source: Wikipedia)

Before using this API, we have to install the following package:

* TensorFlow Serving API:
`pip install tensorflow-serving-api==2.2.0`

In [None]:
!pip install tensorflow-serving-api==2.2.0

Collecting tensorflow-serving-api==2.2.0
  Downloading tensorflow_serving_api-2.2.0-py2.py3-none-any.whl (38 kB)
Installing collected packages: tensorflow-serving-api


Successfully installed tensorflow-serving-api-2.2.0


In [None]:
from tensorflow_serving.apis.predict_pb2 import PredictRequest

In [None]:
request = PredictRequest()
model_name = "the_mnist_model"

request.model_spec.name = model_name
request.model_spec.signature_name = "serving_default"
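input_name = model.input_names[0]
# input_name= "flatten_input"
# the serving signature expects DT_FLOAT (float32); X_new is float64 after dividing by 255.0
request.inputs[input_name].CopyFrom(tf.make_tensor_proto(X_new, dtype=tf.float32))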

In [None]:
# model.input_names

In [None]:
import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

In [None]:
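# NOTE: in the original notebook this call raised an error that was not yet resolved.
# Things worth checking (assumptions, not a confirmed diagnosis): the input dtype matches the
# signature's DT_FLOAT, and the model name matches MODEL_NAME used when starting the container.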
channel = grpc.insecure_channel('localhost:8500')
predict_service = prediction_service_pb2_grpc.PredictionServiceStub(channel)
response = predict_service.Predict(request, timeout=10.0)

## Deploying a new model version after retraining

* TF Serving checks the model directory for new versions at regular intervals (the interval is configurable), so a new version is picked up once you or the system saves it after retraining.

* Once it finds a new version, it handles the transition by itself.

* During this transition, it handles pending requests with the previous version of the model.
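
The same mechanism can be used for a rollback. The sketch below assumes TF Serving's default behaviour of serving the highest-numbered version directory, and that it notices a removed directory on its next check. Moving the newest version out of the model directory then makes the previous version the latest again. The function names and the `archived_versions` directory are hypothetical, and the demo runs on a temporary directory rather than a real model.

```python
import os
import shutil
import tempfile


def latest_version(model_root):
    """Highest numeric version directory name, or None if there is none."""
    versions = [d for d in os.listdir(model_root)
                if d.isdigit() and os.path.isdir(os.path.join(model_root, d))]
    return max(versions, key=int) if versions else None


def roll_back(model_root, archive_dir):
    """Move the newest version out of model_root so the previous one becomes latest."""
    newest = latest_version(model_root)
    if newest is None:
        raise RuntimeError("no versions to roll back")
    os.makedirs(archive_dir, exist_ok=True)
    shutil.move(os.path.join(model_root, newest), os.path.join(archive_dir, newest))
    return latest_version(model_root)


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "the_mnist_model")
    for v in ("0001", "0002"):
        os.makedirs(os.path.join(root, v))
    before = latest_version(root)
    after = roll_back(root, os.path.join(tmp, "archived_versions"))
    print(before, "->", after)  # 0002 -> 0001
```

Archiving the directory instead of deleting it keeps the faulty version available for later inspection.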