# Vertex AI - Training Job and Serving

This notebook creates a Python package for a TensorFlow training project that uses the `<PROJECT_ID>.digits.digits_prepped` BigQuery table. Vertex AI clients are used to set up a custom training job that runs the package, then to upload the resulting model and deploy it to an endpoint for online predictions.

**Prerequisites**
- `00 - Initial Setup`
- `01 - BigQuery - Data`
- `04 - Vertex AI - Notebook` (helpful background)
    - the model built in that notebook is packaged as a Python training package in this one
    
**Resources**
- Adapted from:
    - https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/custom_job_image_classification_model_for_online_prediction.ipynb
- Using Google Cloud Client Libraries (Python for Vertex AI):
    - https://googleapis.dev/python/aiplatform/latest/index.html

**Overview**

<img src="architectures/statmike-mlops-05b.png">

---
## Setup

Set up the environment:

In [15]:
from google.cloud import aiplatform
from google.cloud import aiplatform_v1beta1
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
from datetime import datetime

Define Parameters:

In [16]:
# Locations
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
BUCKET_NAME='gs://{}/digits/model/05_aip_train_job'.format(PROJECT_ID)
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = '05_AIP_DIGITS_'
JOB_NAME = EXPERIMENT_NAME+TIMESTAMP
MODEL_DIR = '{}/{}'.format(BUCKET_NAME, JOB_NAME)
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION

# Resources
TRAIN_IMAGE='us-docker.pkg.dev/cloud-aiplatform/training/tf-cpu.2-4:latest'
DEPLOY_IMAGE ='us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-3:latest'
TRAIN_COMPUTE='n1-standard-4'
DEPLOY_COMPUTE='n1-standard-4'

# TF Parameters to pass
EPOCHS = 25
BATCH_SIZE = 30
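
Every name and path above is derived from the project, the region, and a timestamp. The following is a minimal sketch of that derivation, gathered into one function so it can be checked without touching any cloud resource. `build_names` is a hypothetical helper introduced only for illustration; it is not part of the notebook or of any Google library.

```python
from datetime import datetime


def build_names(project_id, region, experiment, now=None):
    """Derive the job name and storage paths the same way the parameter cell does."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d%H%M%S")
    bucket_path = "gs://{}/digits/model/05_aip_train_job".format(project_id)
    job_name = experiment + timestamp
    return {
        "job_name": job_name,
        "bucket_path": bucket_path,
        "model_dir": "{}/{}".format(bucket_path, job_name),
        "parent": "projects/{}/locations/{}".format(project_id, region),
    }


# A fixed timestamp makes the result reproducible for this illustration.
names = build_names("statmike-mlops", "us-central1", "05_AIP_DIGITS_",
                    now=datetime(2021, 7, 14, 16, 54, 25))
print(names["model_dir"])
```

Passing `now` explicitly is a design choice for the sketch: it keeps the function deterministic, while omitting it reproduces the notebook's behavior of stamping each run with the current time.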

Get Service Account email:

In [36]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()

# cache_discovery=False skips the discovery file cache, which is unavailable
# with recent oauth2client versions and otherwise logs an import traceback
service = discovery.build('iam', 'v1', credentials=credentials, cache_discovery=False)

In [37]:
request = service.projects().serviceAccounts().list(name="projects/"+PROJECT_ID)
response = request.execute()

In [54]:
SERVICE_ACCOUNT = response['accounts'][0]['email']
SERVICE_ACCOUNT

'691911073727-compute@developer.gserviceaccount.com'
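
Taking `response['accounts'][0]` depends on the order in which the accounts are listed. As a hedged alternative, the sketch below selects the Compute Engine default service account by its email suffix. `pick_compute_service_account` and the sample response are hypothetical; the custom account in the sample is invented purely for the illustration.

```python
def pick_compute_service_account(response):
    """Return the email of the Compute Engine default service account, if listed."""
    suffix = "-compute@developer.gserviceaccount.com"
    for account in response.get("accounts", []):
        email = account.get("email", "")
        if email.endswith(suffix):
            return email
    raise LookupError("no Compute Engine default service account in response")


# Hypothetical response, shaped like the `accounts` list used above.
sample_response = {
    "accounts": [
        {"email": "custom-sa@statmike-mlops.iam.gserviceaccount.com"},
        {"email": "691911073727-compute@developer.gserviceaccount.com"},
    ]
}
print(pick_compute_service_account(sample_response))
```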

Set up the Vertex AI Python clients:
- https://googleapis.dev/python/aiplatform/latest/index.html

In [55]:
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
client_options = {"api_endpoint": API_ENDPOINT}
clients = {}

---
## Get TensorBoard Instance Name
The training job appears as an experiment in the TensorBoard instance, and the experiment is named after the training job ID.

In [56]:
clients['tb'] = aiplatform_v1beta1.TensorboardServiceClient(client_options=client_options)
BASE_RESOURCE_PATH = clients['tb'].common_location_path(PROJECT_ID, REGION)

In [57]:
tensorboard = list(clients['tb'].list_tensorboards(parent=PARENT))[0]
tensorboard

name: "projects/691911073727/locations/us-central1/tensorboards/3200124194595536896"
display_name: "Tensorboad for handwritten digits model training"
create_time {
  seconds: 1626258358
  nanos: 718869000
}
update_time {
  seconds: 1626258390
  nanos: 357761000
}
etag: "AMEw9yOmSjLuxd9EdRyz-ArmyrlA60OHilHfg5_ZYfcIbJNTEyTXbJkRkzf373PvkWwJ"
blob_storage_path_prefix: "cloud-ai-platform-f89edcdf-ac24-4545-a8c4-14e268905904"

In [59]:
TENSORBOARD_NAME = tensorboard.name

---
## Training Job

Create a client for the AIP Job Service:

In [60]:
#clients['job'] = aiplatform.gapic.JobServiceClient(client_options=client_options)
clients['job'] = aiplatform_v1beta1.JobServiceClient(client_options=client_options)

Define job parameters to pass to the client:

In [66]:
MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}


CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE)
]

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [BUCKET_NAME + "/trainer_cifar.tar.gz"],
            "python_module": "trainer.task",
            "args": CMDARGS
        }
    }
]

JOB_SPEC = {
    "worker_pool_specs": WORKER_POOL_SPEC,
    "base_output_directory": {"output_uri_prefix": MODEL_DIR},
    "tensorboard": TENSORBOARD_NAME,
    "service_account": SERVICE_ACCOUNT
}

CUSTOM_JOB = {
    "display_name": JOB_NAME,
    "job_spec": JOB_SPEC
}
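
The nested dictionaries above can also be assembled by a small function, which makes it easier to reuse the same structure with different settings. This is only a sketch: `build_custom_job` is a hypothetical helper, and it does nothing more than reproduce the keys already used in the cell above.

```python
def build_custom_job(display_name, image_uri, package_uri, module, machine_type,
                     output_dir, epochs, batch_size,
                     tensorboard=None, service_account=None):
    """Assemble a custom-job dictionary with the same keys as the cell above."""
    worker_pool = {
        "replica_count": 1,
        "machine_spec": {"machine_type": machine_type, "accelerator_count": 0},
        "python_package_spec": {
            "executor_image_uri": image_uri,
            "package_uris": [package_uri],
            "python_module": module,
            "args": ["--epochs={}".format(epochs),
                     "--batch_size={}".format(batch_size)],
        },
    }
    job_spec = {
        "worker_pool_specs": [worker_pool],
        "base_output_directory": {"output_uri_prefix": output_dir},
    }
    # Optional fields are added only when a value is supplied.
    if tensorboard:
        job_spec["tensorboard"] = tensorboard
    if service_account:
        job_spec["service_account"] = service_account
    return {"display_name": display_name, "job_spec": job_spec}


example_job = build_custom_job(
    display_name="05_AIP_DIGITS_20210714165425",
    image_uri="us-docker.pkg.dev/cloud-aiplatform/training/tf-cpu.2-4:latest",
    package_uri="gs://statmike-mlops/digits/model/05_aip_train_job/trainer_cifar.tar.gz",
    module="trainer.task",
    machine_type="n1-standard-4",
    output_dir="gs://statmike-mlops/digits/model/05_aip_train_job/05_AIP_DIGITS_20210714165425",
    epochs=25,
    batch_size=30,
)
print(example_job["job_spec"]["worker_pool_specs"][0]["python_package_spec"]["args"])
```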

Assemble the Python training package:

In [67]:
!rm -rf custom
!mkdir custom


# setup.py for the trainer package (install_requires is left commented out, as in the original)
setup_py = "import setuptools\n\
REQUIRED_PACKAGES = ['tensorflow_io']\n\
setuptools.setup(\n\
    name='trainer',\n\
    version='0.1',\n\
    #install_requires=REQUIRED_PACKAGES,\n\
    packages=setuptools.find_packages(),\n\
    #include_package_data=True,\n\
    description='Digit Training Package')"

!echo "$setup_py" > custom/setup.py

!mkdir custom/trainer
!touch custom/trainer/__init__.py



In [68]:
%%writefile custom/trainer/task.py
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
import tensorflow as tf
from google.cloud import bigquery
import argparse
import os
import sys

parser = argparse.ArgumentParser()
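# add_argument takes the flag name; dest: attribute name on args; default: value used when the flag is absent (the model dir falls back to the AIP_MODEL_DIR environment variable); type: conversion applied to the value; help: description of the argument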
parser.add_argument('--model-dir', dest='model_dir', default=os.getenv("AIP_MODEL_DIR"), type=str, help='Model dir.')
parser.add_argument('--epochs',dest='epochs', default=10, type=int, help='Number of Epochs')
parser.add_argument('--batch_size',dest='batch_size', default=32, type=int, help='Batch Size')
#parser.add_argument('',dest='', default=, type=, help='')
args = parser.parse_args()

# built in parameters for data source:
PROJECT_ID='statmike-mlops'
BQDATASET_ID='digits'
BQTABLE_ID='digits_prepped'

# 64 pixel columns (p0 ... p63) plus the label column
selected_fields = ['p{}'.format(i) for i in range(64)] + ['target']
# BigQuery types: FLOAT64 for each pixel column, INT64 for the label
output_types = ['FLOAT64'] * 64 + ['INT64']

feature_columns = []
feature_layer_inputs = {}
for header in selected_fields:
    if header != 'target':
        feature_columns.append(tf.feature_column.numeric_column(header))
        feature_layer_inputs[header] = tf.keras.Input(shape=(1,),name=header)

# map the BigQuery type names to TensorFlow dtypes using the public tf API
output_types = [tf.float64 if x=='FLOAT64' else tf.int64 for x in output_types]

def transTable(row_dict):
    target=row_dict.pop('target')
    target = tf.one_hot(tf.cast(target,tf.int64),10)
    target = tf.cast(target,tf.float32)
    return(row_dict,target)

client = BigQueryClient()
session = client.read_session("projects/"+PROJECT_ID,PROJECT_ID,BQTABLE_ID,BQDATASET_ID,selected_fields,output_types,row_restriction="SPLITS='TRAIN'",requested_streams=3)
table = session.parallel_read_rows()
table = table.map(transTable)
train = table.shuffle(100000).batch(args.batch_size)

client = BigQueryClient()
session = client.read_session("projects/"+PROJECT_ID,PROJECT_ID,BQTABLE_ID,BQDATASET_ID,selected_fields,output_types,row_restriction="SPLITS='TEST'",requested_streams=3)
table = session.parallel_read_rows()
table = table.map(transTable)
test = table.batch(args.batch_size)

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
feature_layer_outputs = feature_layer(feature_layer_inputs)
model = tf.keras.Model(inputs=[v for v in feature_layer_inputs.values()],outputs=tf.keras.layers.Dense(10,activation=tf.nn.softmax)(feature_layer_outputs))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
tf.keras.utils.plot_model(model,show_shapes=True, show_dtype=True)

# setup tensorboard logs
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'], histogram_freq=1) #(log_dir=model_dir+'/logs', histogram_freq=1)

# use the TEST split built above as validation data during training
history = model.fit(train,epochs=args.epochs, validation_data=test, callbacks=[tensorboard_callback])

model.save(args.model_dir)

Writing custom/trainer/task.py
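
The argument parsing in `task.py` can be tried out locally before a job is submitted. The sketch below repeats the three `add_argument` calls inside a hypothetical `parse_training_args` function and passes the argument list explicitly instead of reading `sys.argv`. The `gs://example-bucket/model` value is a placeholder standing in for the `AIP_MODEL_DIR` variable that the training service is expected to set.

```python
import argparse
import os


def parse_training_args(argv):
    """Parse the same flags as task.py from an explicit argument list."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', dest='model_dir',
                        default=os.getenv("AIP_MODEL_DIR"), type=str, help='Model dir.')
    parser.add_argument('--epochs', dest='epochs', default=10, type=int,
                        help='Number of Epochs')
    parser.add_argument('--batch_size', dest='batch_size', default=32, type=int,
                        help='Batch Size')
    return parser.parse_args(argv)


# The values passed by this notebook override the defaults of 10 and 32.
args = parse_training_args(["--epochs=25", "--batch_size=30"])
print(args.epochs, args.batch_size)

# With no flags, the defaults apply and model_dir comes from the environment.
os.environ["AIP_MODEL_DIR"] = "gs://example-bucket/model"  # placeholder value
defaults = parse_training_args([])
print(defaults.epochs, defaults.batch_size, defaults.model_dir)
```

Note that `os.getenv` is evaluated when the parser is built, so the environment variable has to be set before `parse_training_args` is called.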


Store the training package in a Cloud Storage Bucket:

In [69]:
!rm -f custom.tar custom.tar.gz
!tar cvf custom.tar custom
!gzip custom.tar
!gsutil cp custom.tar.gz $BUCKET_NAME/trainer_cifar.tar.gz

custom/
custom/setup.py
custom/trainer/
custom/trainer/task.py
custom/trainer/__init__.py
Copying file://custom.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  1.7 KiB/  1.7 KiB]                                                
Operation completed over 1 objects/1.7 KiB.                                      
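
The `tar` and `gzip` shell commands can also be expressed with Python's standard `tarfile` module, which avoids the intermediate `custom.tar` file. The sketch below rebuilds the same folder layout in a temporary directory with placeholder file contents; `build_package` is a hypothetical helper, and the upload to Cloud Storage is left out.

```python
import os
import tarfile
import tempfile


def build_package(root, archive_path):
    """Create a gzipped tar of `root`, keeping its folder name as the top level."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(root, arcname=os.path.basename(root))
    # Re-open the archive to report what it contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Recreate the layout used above in a temporary directory.
workdir = tempfile.mkdtemp()
root = os.path.join(workdir, "custom")
os.makedirs(os.path.join(root, "trainer"))
for parts in (("setup.py",), ("trainer", "__init__.py"), ("trainer", "task.py")):
    with open(os.path.join(root, *parts), "w") as handle:
        handle.write("# placeholder\n")

members = build_package(root, os.path.join(workdir, "custom.tar.gz"))
print(members)
```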


Submit the training job:

In [70]:
jobID = clients['job'].create_custom_job(parent=PARENT, custom_job=CUSTOM_JOB)

Get information as the job runs:

In [72]:
jobIDresponse = clients['job'].get_custom_job(name=jobID.name)
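
A single `get_custom_job` call returns a snapshot of the job, so the cell has to be rerun until the job finishes. One way to automate that is a polling loop, sketched below under stated assumptions: `wait_for_job` is a hypothetical helper, the state names are those of the `JobState` enum, and the job lookup is replaced by a fake sequence so the sketch runs without a cloud project.

```python
import time

# States after which the job no longer changes (assumed sufficient for this sketch).
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_job(get_state, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or the polls run out."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


# Fake state sequence standing in for repeated job lookups.
fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0,
                           sleep=lambda seconds: None)
print(final_state)
```

In the notebook, `get_state` could be something like `lambda: clients['job'].get_custom_job(name=jobID.name).state.name`, which returns the enum member's name as a string.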

In [None]:
#response = clients['job'].list_custom_jobs(parent=PARENT)
#response = clients['job'].cancel_custom_job(name=name)

---
## Serving

---
### Upload the Model

Check that model training was successful (and update the path to the model store):

In [73]:
if jobIDresponse.state == aiplatform.gapic.JobState.JOB_STATE_SUCCEEDED:
    MODEL_DIR = MODEL_DIR + "/model"
    
print(MODEL_DIR)

gs://statmike-mlops/digits/model/05_aip_train_job/05_AIP_DIGITS_20210714165425/model


Create a client to the Model Service:

In [74]:
clients['model'] = aiplatform.gapic.ModelServiceClient(client_options=client_options)

Upload the model using the client:

In [75]:
MODEL = {
    "display_name": jobIDresponse.display_name,
    "metadata_schema_uri": "",
    "artifact_uri": MODEL_DIR,
    "container_spec": {
        "image_uri": DEPLOY_IMAGE,
        "command": [],
        "args": [],
        "env": [{"name": "env_name", "value": "env_value"}],
        "ports": [{"container_port": 8080}],
        "predict_route": "",
        "health_route": ""
    }
}

uploaded_model = clients['model'].upload_model(parent=PARENT, model=MODEL)

Review the uploaded model's information:

In [76]:
model_info = clients['model'].get_model(name=uploaded_model.result(timeout=180).model)
model_info.name

'projects/691911073727/locations/us-central1/models/7751688917216133120'

---
### Endpoint Creation

Create a client to the endpoint service:

In [77]:
clients['endpoint'] = aiplatform.gapic.EndpointServiceClient(client_options=client_options)

Create the endpoint:

In [78]:
ENDPOINT_NAME = 'ENDPOINT_'+JOB_NAME
endpoint = clients['endpoint'].create_endpoint(parent=PARENT, endpoint={"display_name": ENDPOINT_NAME})

In [79]:
endpoint_info = clients['endpoint'].get_endpoint(name=endpoint.result(timeout=180).name)
endpoint_info.name

'projects/691911073727/locations/us-central1/endpoints/5623500598771974144'

---
### Deploy Model to Endpoint

Set up deployment parameters:

In [80]:
MACHINE_SPEC = {
    "machine_type": DEPLOY_COMPUTE,
    "accelerator_count": 0,
}
DMODEL = {
        "model": model_info.name,
        "display_name": 'DEPLOYED_'+JOB_NAME,
        "dedicated_resources": {
            "min_replica_count": 1,
            "max_replica_count": 2,
            "machine_spec": MACHINE_SPEC
        }   
}
TRAFFIC = {
    '0' : 100
}
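
In the traffic split, the key `'0'` stands for the model being deployed by this request, and the percentages across all keys are expected to add up to 100. The sketch below checks a split before it is sent. `traffic_split_problems` is a hypothetical helper, and the ID `1234567890` is a made-up stand-in for a model that is already deployed.

```python
def traffic_split_problems(traffic_split):
    """Return a list of problems found in a traffic split; empty means it looks valid."""
    problems = []
    for key, percent in traffic_split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            problems.append("{}: percentage must be an integer from 0 to 100".format(key))
    total = sum(v for v in traffic_split.values() if isinstance(v, int))
    if total != 100:
        problems.append("percentages add up to {} instead of 100".format(total))
    return problems


# The split used in this notebook: all traffic goes to the new model.
print(traffic_split_problems({"0": 100}))

# Hypothetical gradual rollout: 20% to the new model, 80% to an existing one.
print(traffic_split_problems({"0": 20, "1234567890": 80}))

# An invalid split, reported rather than sent.
print(traffic_split_problems({"0": 50}))
```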

Deploy the Model to the Endpoint:

In [81]:
dmodel = clients['endpoint'].deploy_model(endpoint=endpoint_info.name, deployed_model=DMODEL, traffic_split=TRAFFIC)

In [83]:
dmodel_info = dmodel.result().deployed_model
dmodel_info.id

'361396277910437888'

In [84]:
clients['endpoint'].get_endpoint(name=endpoint_info.name)

name: "projects/691911073727/locations/us-central1/endpoints/5623500598771974144"
display_name: "ENDPOINT_05_AIP_DIGITS_20210714165425"
deployed_models {
  id: "361396277910437888"
  model: "projects/691911073727/locations/us-central1/models/7751688917216133120"
  display_name: "DEPLOYED_05_AIP_DIGITS_20210714165425"
  create_time {
    seconds: 1626286423
    nanos: 890928000
  }
  dedicated_resources {
    machine_spec {
      machine_type: "n1-standard-4"
    }
    min_replica_count: 1
    max_replica_count: 2
  }
}
traffic_split {
  key: "361396277910437888"
  value: 100
}
etag: "AMEw9yOOLDZ8_xK41oRaU8igfK8CG5X_ZerivcX9BGZUPVsB0r9QSG0keij9yFtSOCoU"
create_time {
  seconds: 1626286418
  nanos: 295380000
}
update_time {
  seconds: 1626286730
  nanos: 555078000
}

---
## Prediction

Create a client to the prediction service:

In [85]:
clients['prediction'] = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

Set up an observation for prediction:

In [86]:
%%bigquery pred
SELECT *
FROM `digits.digits_prepped`
WHERE splits='TEST'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 667.25query/s] 
Downloading: 100%|██████████| 373/373 [00:01<00:00, 324.53rows/s]


In [87]:
pred.head(1)

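|   | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | ... | p57 | p58 | p59 | p60 | p61 | p62 | p63 | target | target_OE | SPLITS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 12.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 10.0 | 16.0 | 11.0 | 0.0 | 0.0 | 0 | Even | TEST |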


In [88]:
newob = pred.loc[:0,'p0':'p63'].to_dict(orient='records')[0]
#newob

Request prediction from prediction service:

In [89]:
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

response = clients['prediction'].predict(endpoint=endpoint_info.name, instances=[json_format.ParseDict(newob, Value())], parameters=json_format.ParseDict({}, Value()))

In [90]:
response.predictions

[[0.978379905, 1.45449587e-06, 1.92477455e-06, 2.8487952e-09, 1.43555098e-05, 1.3682502e-06, 0.0216009039, 5.85146653e-11, 1.4910124e-07, 9.21134902e-09]]

In [91]:
import numpy as np
np.argmax(response.predictions[0])

0
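
The prediction is a list of ten softmax probabilities, one per digit, and `np.argmax` returns the position of the largest one. The sketch below does the same ranking in plain Python on the values returned above and also shows the runner-up class. `top_classes` is a hypothetical helper.

```python
# Probabilities copied from the response shown above (one value per digit 0-9).
predictions = [[0.978379905, 1.45449587e-06, 1.92477455e-06, 2.8487952e-09,
                1.43555098e-05, 1.3682502e-06, 0.0216009039, 5.85146653e-11,
                1.4910124e-07, 9.21134902e-09]]


def top_classes(probabilities, k=2):
    """Return the k (class index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


best, runner_up = top_classes(predictions[0])
print("predicted digit:", best[0], "probability:", best[1])
print("second choice:", runner_up[0], "probability:", runner_up[1])
```

For this observation the predicted digit is 0, which matches the `target` column of the row sent for prediction, and the second most likely digit is 6.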

---

## Remove Resources
- undeploy model
- delete endpoint
- remove model
- delete training > custom job
- delete the Python training application (optional, since later notebooks in this project may reuse it):
    - delete `custom` and `custom.tar.gz`

Undeploy Model:

In [34]:
dmodel = clients['endpoint'].get_endpoint(name=endpoint_info.name).deployed_models[0].id
clients['endpoint'].undeploy_model(endpoint=endpoint_info.name, deployed_model_id=dmodel)

<google.api_core.operation.Operation at 0x7fd29f70be90>

Delete Endpoint:

In [35]:
clients['endpoint'].delete_endpoint(name=endpoint_info.name)

<google.api_core.operation.Operation at 0x7fd29f7e84d0>

Remove Model:

In [36]:
clients['model'].delete_model(name=model_info.name)

<google.api_core.operation.Operation at 0x7fd29effb150>

Delete Training > Custom Job:

In [38]:
clients['job'].delete_custom_job(name=jobIDresponse.name)

<google.api_core.operation.Operation at 0x7fd29f70e550>
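
Each call above returns a long-running `Operation` right away, before the work is finished. Because an endpoint generally cannot be deleted while a model is still deployed to it, it can help to wait on each operation before starting the next. The sketch below shows that ordering with a stand-in operation class; `run_cleanup` and `FakeOperation` are hypothetical, and only the `result(timeout=...)` method of the real operation object is assumed.

```python
def run_cleanup(steps, timeout=600):
    """Run each step in order and block on its operation before starting the next."""
    completed = []
    for label, start in steps:
        operation = start()                # kicks off the long-running operation
        operation.result(timeout=timeout)  # waits until it has finished
        completed.append(label)
    return completed


class FakeOperation:
    """Stand-in for a long-running operation; only `result` is implemented."""

    def result(self, timeout=None):
        return None


# In the notebook, each lambda would wrap the matching client call instead.
steps = [
    ("undeploy model", lambda: FakeOperation()),
    ("delete endpoint", lambda: FakeOperation()),
    ("delete model", lambda: FakeOperation()),
    ("delete custom job", lambda: FakeOperation()),
]
completed = run_cleanup(steps)
print(completed)
```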

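Delete Model Files:
- This includes the training package and the saved model
- These could be useful for other notebooks, so skip this step if they are still needed
- The `prefix` must match the folder the files were written to. This notebook stores them under `digits/model/05_aip_train_job` (see `BUCKET_NAME`), while the cell and output below come from an earlier run that used `digits/aip_train_job`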

In [49]:
from google.cloud import storage
gcs = storage.Client()

path = gcs.bucket(PROJECT_ID)
blobs = path.list_blobs(prefix='digits/aip_train_job')
for blob in blobs:
    print(blob)
    blob.delete()

<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/, 1618319425187771>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/model/, 1618319425350552>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/model/assets/, 1618319427738045>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/model/saved_model.pb, 1618319428158820>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/model/variables/, 1618319425484405>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/model/variables/variables.data-00000-of-00001, 1618319426990754>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413130018/model/variables/variables.index, 1618319427197772>
<Blob: statmike-mlops, digits/aip_train_job/trainer_cifar.tar.gz, 1618318849607448>
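
The bucket name and prefix in the cleanup cell are typed by hand, which makes it easy for them to drift away from the `BUCKET_NAME` parameter defined during setup. A small sketch of deriving both from the `gs://` path instead is shown below. `split_gcs_uri` is a hypothetical helper, and the resulting values would be passed to `gcs.bucket(...)` and `list_blobs(prefix=...)`.

```python
def split_gcs_uri(uri):
    """Split a gs://bucket/path URI into (bucket name, object prefix)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError("expected a gs:// URI, got: {}".format(uri))
    bucket, _, prefix = uri[len(scheme):].partition("/")
    return bucket, prefix


# Same value as BUCKET_NAME in the parameter cell of this notebook.
bucket_name, prefix = split_gcs_uri("gs://statmike-mlops/digits/model/05_aip_train_job")
print(bucket_name)
print(prefix)
```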
