# Vertex AI - Training > Hyperparameter Tuning Job
## Using A Python Package

This notebook creates a Python package for a TensorFlow training project that uses the `<PROJECT_ID>.digits.digits_prepped` BigQuery table. Vertex AI clients are used to set up a Vertex AI custom training job, wrapped in a hyperparameter tuning job, to run the training package. Vertex AI clients are then used to upload the model and deploy it to an endpoint for online predictions.

**Prerequisites**
- `00 - Initial Setup`
- `01 - BigQuery - Data`
- `05 - Vertex AI - Tensorboard`
- `05b - Vertex AI - Training Job from Python Package` (helpful to understand)
    - Contains the Python package used in this notebook's Pipeline Job
    
**Resources**
- Adapted from:
    - https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/custom_job_image_classification_model_for_online_prediction.ipynb
- Using Google Cloud Client Libraries (Python for Vertex AI):
    - https://googleapis.dev/python/aiplatform/latest/index.html

**Overview**

<img src="architectures/statmike-mlops-05c.png">

---
## Setup

Set up the environment:

In [96]:
from google.cloud import aiplatform

from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

from datetime import datetime

Define Parameters:

In [97]:
# Locations
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
BUCKET_NAME='gs://{}/digits/model/05c_aip_train_job'.format(PROJECT_ID)
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = '05C_AIP_DIGITS_'
JOB_NAME = EXPERIMENT_NAME+TIMESTAMP
MODEL_DIR = '{}/{}'.format(BUCKET_NAME, JOB_NAME)
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION

# files
PACKAGE = 'custom_5c'

# Resources
TRAIN_IMAGE='us-docker.pkg.dev/cloud-aiplatform/training/tf-cpu.2-4:latest'
DEPLOY_IMAGE ='us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-3:latest'
TRAIN_COMPUTE='n1-standard-4'
DEPLOY_COMPUTE='n1-standard-4'

# TF Parameters to pass
EPOCHS = 25
BATCH_SIZE = 30

# Tensorboard Info
TENSORBOARD_ID = 'digits_tensorboard'
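
The job name and output directory above are both derived from a timestamp. A minimal sketch of the same derivation as a reusable function follows; `build_job_paths` is a hypothetical helper introduced here for illustration and is not part of the notebook or of any Vertex AI library.

```python
from datetime import datetime


def build_job_paths(project_id, experiment_name, now=None):
    """Return (job_name, model_dir) following the notebook's naming scheme."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d%H%M%S")
    bucket_name = "gs://{}/digits/model/05c_aip_train_job".format(project_id)
    job_name = experiment_name + timestamp
    model_dir = "{}/{}".format(bucket_name, job_name)
    return job_name, model_dir


# A fixed datetime keeps the example reproducible.
job_name, model_dir = build_job_paths(
    "statmike-mlops", "05C_AIP_DIGITS_", now=datetime(2021, 9, 1, 20, 50, 4)
)
print(job_name)
print(model_dir)
```

Passing `now` explicitly makes the naming easy to check; leaving it out falls back to the current time, as in the cell above.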

Get Service Account email:

In [98]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()

service = discovery.build('iam', 'v1', credentials=credentials)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 33, in <module>
    from oauth2client.contrib.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.contrib.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 37, in <module>
    from oauth2client.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/__init__.py", line 44, in autodetect
    from . import file_cache
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    "file_cach

In [99]:
request = service.projects().serviceAccounts().list(name="projects/"+PROJECT_ID)
response = request.execute()

INFO:oauth2client.transport:Attempting refresh to obtain initial access_token


In [100]:
SERVICE_ACCOUNT = response['accounts'][0]['email']
SERVICE_ACCOUNT

'691911073727-compute@developer.gserviceaccount.com'
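
Taking the first entry of `response['accounts']` works when the project has a single service account. As a hedged alternative, the sketch below prefers the account whose email ends in `-compute@developer.gserviceaccount.com` and falls back to the first entry otherwise. `pick_compute_service_account` is a hypothetical helper, and the first email in `sample_response` is made up for illustration.

```python
COMPUTE_SUFFIX = "-compute@developer.gserviceaccount.com"


def pick_compute_service_account(response):
    """Pick the Compute Engine default service account from a list response."""
    accounts = response.get("accounts", [])
    if not accounts:
        raise ValueError("no service accounts found in response")
    for account in accounts:
        if account["email"].endswith(COMPUTE_SUFFIX):
            return account["email"]
    # Fall back to the first account, matching the notebook's behavior.
    return accounts[0]["email"]


# Sample response shaped like the one used above; the first email is hypothetical.
sample_response = {
    "accounts": [
        {"email": "custom-sa@statmike-mlops.iam.gserviceaccount.com"},
        {"email": "691911073727-compute@developer.gserviceaccount.com"},
    ]
}
selected = pick_compute_service_account(sample_response)
print(selected)
```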

Set up AI Platform Python Clients
- https://googleapis.dev/python/aiplatform/latest/index.html

In [101]:
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
client_options = {"api_endpoint": API_ENDPOINT}
clients = {}

In [102]:
aiplatform.init(project=PROJECT_ID, location=REGION)

---
## Get Tensorboard Instance Name
The training job will show up as an experiment for the Tensorboard instance and have the same name as the training job ID.

In [103]:
tb = aiplatform.Tensorboard.list(filter='display_name={}'.format(TENSORBOARD_ID))

In [104]:
TENSORBOARD_NAME = tb[0].resource_name

In [105]:
TENSORBOARD_NAME

'projects/691911073727/locations/us-central1/tensorboards/3436703912520843264'

In [111]:
tb[0].display_name

'digits_tensorboard'
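
The resource name returned above follows a `collection/id` pattern. A small sketch for splitting such a name into its parts, assuming that alternating layout holds; `parse_resource_name` is a hypothetical helper, not an SDK function.

```python
def parse_resource_name(resource_name):
    """Split 'projects/<n>/locations/<r>/tensorboards/<id>' into a dict."""
    parts = resource_name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating collection/id segments")
    # Even positions hold collection names, odd positions hold identifiers.
    return dict(zip(parts[0::2], parts[1::2]))


parsed = parse_resource_name(
    "projects/691911073727/locations/us-central1/tensorboards/3436703912520843264"
)
print(parsed["tensorboards"])
```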

---
## Training

### Assemble Python Package for Training

Make a subdirectory, called `custom_5c`. If the directory has been previously created then delete it first.

In [106]:
!rm -rf {PACKAGE}
!mkdir {PACKAGE}

Create the `setup.py` file in the subdirectory:

In [107]:
# setup.py for the trainer package; install_requires is left commented out.
setup_py = "import setuptools\n\
REQUIRED_PACKAGES = ['tensorflow_io']\n\
setuptools.setup(\n\
    name='trainer',\n\
    version='0.1',\n\
    #install_requires=REQUIRED_PACKAGES,\n\
    packages=setuptools.find_packages(),\n\
    #include_package_data=True,\n\
    description='Digit Training Package')"

!echo "$setup_py" > {PACKAGE}/setup.py

Make a subdirectory, called `trainer`. Include an `__init__.py` file in the subdirectory.

In [108]:
!mkdir {PACKAGE}/trainer
!touch {PACKAGE}/trainer/__init__.py
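
The shell commands above build the package layout by hand. As a sketch of the same layout in plain Python (useful outside a notebook), the block below writes the files with `pathlib` into a temporary directory. `create_package_skeleton` and `SETUP_PY` are names introduced here for illustration only.

```python
import tempfile
from pathlib import Path

SETUP_PY = """import setuptools
setuptools.setup(
    name='trainer',
    version='0.1',
    packages=setuptools.find_packages(),
    description='Digit Training Package')
"""


def create_package_skeleton(root, package="custom_5c"):
    """Create <root>/<package>/setup.py and <root>/<package>/trainer/__init__.py."""
    package_dir = Path(root) / package
    trainer_dir = package_dir / "trainer"
    trainer_dir.mkdir(parents=True, exist_ok=True)
    (package_dir / "setup.py").write_text(SETUP_PY)
    (trainer_dir / "__init__.py").touch()
    return package_dir


with tempfile.TemporaryDirectory() as tmp:
    package_dir = create_package_skeleton(tmp)
    # Collect created paths relative to the temporary root.
    created = sorted(p.relative_to(tmp).as_posix() for p in package_dir.rglob("*"))
    print(created)
```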

Create the main Python trainer file as `trainer/task.py`:

In [109]:
%%writefile {PACKAGE}/trainer/task.py
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
import tensorflow as tf
from google.cloud import bigquery
import argparse
import os
import sys

parser = argparse.ArgumentParser()
# add_argument fields: the flag name, dest: attribute name on args, default: value used when the flag is absent, type: type to convert to, help: description of the argument
parser.add_argument('--epochs',dest='epochs', default=10, type=int, help='Number of Epochs')
parser.add_argument('--batch_size',dest='batch_size', default=32, type=int, help='Batch Size')
#parser.add_argument('',dest='', default=, type=, help='')
args = parser.parse_args()

# built in parameters for data source:
PROJECT_ID='statmike-mlops'
BQDATASET_ID='digits'
BQTABLE_ID='digits_prepped'

# 64 pixel columns (p0..p63) plus the label column
selected_fields = ['p{}'.format(i) for i in range(64)] + ['target']
output_types = ['FLOAT64'] * 64 + ['INT64']

feature_columns = []
feature_layer_inputs = {}
for header in selected_fields:
    if header != 'target':
        feature_columns.append(tf.feature_column.numeric_column(header))
        feature_layer_inputs[header] = tf.keras.Input(shape=(1,),name=header)

from tensorflow.python.framework import dtypes
output_types = [dtypes.float64 if x=='FLOAT64' else dtypes.int64 for x in output_types]

def transTable(row_dict):
    target=row_dict.pop('target')
    target = tf.one_hot(tf.cast(target,tf.int64),10)
    target = tf.cast(target,tf.float32)
    return(row_dict,target)

client = BigQueryClient()
session = client.read_session("projects/"+PROJECT_ID,PROJECT_ID,BQTABLE_ID,BQDATASET_ID,selected_fields,output_types,row_restriction="SPLITS='TRAIN'",requested_streams=3)
table = session.parallel_read_rows()
table = table.map(transTable)
train = table.shuffle(100000).batch(args.batch_size)

client = BigQueryClient()
session = client.read_session("projects/"+PROJECT_ID,PROJECT_ID,BQTABLE_ID,BQDATASET_ID,selected_fields,output_types,row_restriction="SPLITS='TEST'",requested_streams=3)
table = session.parallel_read_rows()
table = table.map(transTable)
test = table.batch(args.batch_size)

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
feature_layer_outputs = feature_layer(feature_layer_inputs)
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0)
model = tf.keras.Model(inputs=[v for v in feature_layer_inputs.values()],outputs=tf.keras.layers.Dense(10,activation=tf.nn.softmax)(feature_layer_outputs))
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
tf.keras.utils.plot_model(model,show_shapes=True, show_dtype=True)

# setup tensorboard logs
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'], histogram_freq=1) #(log_dir=model_dir+'/logs', histogram_freq=1)

history = model.fit(train,epochs=args.epochs, callbacks=[tensorboard_callback])

model.save(os.getenv("AIP_MODEL_DIR"))

Writing custom_5c/trainer/task.py
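
The trainer above accepts `--epochs` and `--batch_size`, and it reads `AIP_TENSORBOARD_LOG_DIR` with `os.environ[...]`, which raises `KeyError` when the variable is not set (for example, in a local test run). The sketch below shows one way the argument parsing could be extended for a tuned hyperparameter and how the directories could fall back to local paths. The `--learning_rate` flag, `parse_args`, `resolve_dir`, and the fallback paths are all assumptions made for this illustration; none of them appear in the trainer above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse trainer arguments; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', dest='epochs', default=10, type=int, help='Number of Epochs')
    parser.add_argument('--batch_size', dest='batch_size', default=32, type=int, help='Batch Size')
    # Hypothetical extra flag so a tuning job could vary the SGD learning rate.
    parser.add_argument('--learning_rate', dest='learning_rate', default=0.01, type=float, help='SGD learning rate')
    return parser.parse_args(argv)


def resolve_dir(env_var, fallback):
    """Return the directory from the environment, or a local fallback if unset."""
    return os.environ.get(env_var, fallback)


args = parse_args(['--epochs=25', '--batch_size=30', '--learning_rate=0.05'])
# The fallback paths below are placeholders for local runs.
log_dir = resolve_dir('AIP_TENSORBOARD_LOG_DIR', './logs')
model_dir = resolve_dir('AIP_MODEL_DIR', './model')
print(args.epochs, args.batch_size, args.learning_rate)
```

With a flag like this in place, the optimizer line could use `learning_rate=args.learning_rate` instead of the fixed `0.01`.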


Store the training package in a Cloud Storage Bucket:

In [110]:
!rm -f {PACKAGE}.tar {PACKAGE}.tar.gz
!tar cvf {PACKAGE}.tar {PACKAGE}
!gzip {PACKAGE}.tar
!gsutil cp {PACKAGE}.tar.gz $BUCKET_NAME/{PACKAGE}.tar.gz

custom_5c/
custom_5c/setup.py
custom_5c/trainer/
custom_5c/trainer/task.py
custom_5c/trainer/__init__.py
Copying file://custom_5c.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  1.7 KiB/  1.7 KiB]                                                
Operation completed over 1 objects/1.7 KiB.                                      
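
The `tar` and `gzip` steps above can also be done with Python's standard `tarfile` module. The sketch below builds a small stand-in package in a temporary directory and archives it. `build_archive` is a hypothetical helper, and the upload to Cloud Storage is intentionally left out.

```python
import tarfile
import tempfile
from pathlib import Path


def build_archive(package_dir, archive_path):
    """Write package_dir into a gzip-compressed tar archive."""
    package_dir = Path(package_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the package folder.
        tar.add(package_dir, arcname=package_dir.name)
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in package contents for the example.
    pkg = Path(tmp) / "custom_5c"
    (pkg / "trainer").mkdir(parents=True)
    (pkg / "setup.py").write_text("import setuptools\nsetuptools.setup(name='trainer')\n")
    (pkg / "trainer" / "__init__.py").touch()

    archive = build_archive(pkg, Path(tmp) / "custom_5c.tar.gz")
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())
    print(members)
```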


### Setup Training Job

In [None]:
MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}


CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE)
]

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [BUCKET_NAME + "/{}.tar.gz".format(PACKAGE)],
            "python_module": "trainer.task",
            "args": CMDARGS
        }
    }
]
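
`CMDARGS` above is assembled by string concatenation. A short sketch of building the same list from a dictionary, which may be easier to extend as arguments are added; `to_cmdargs` is a hypothetical helper.

```python
def to_cmdargs(params):
    """Turn {'epochs': 25} into ['--epochs=25'], preserving insertion order."""
    return ["--{}={}".format(name, value) for name, value in params.items()]


cmdargs = to_cmdargs({"epochs": 25, "batch_size": 30})
print(cmdargs)
```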

In [None]:
customJob = aiplatform.CustomJob(
    display_name = JOB_NAME,
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = MODEL_DIR,
    staging_bucket = MODEL_DIR
)

### Setup Hyperparameter Tuning Job

In [None]:
METRIC_SPEC = {}

PARAMETER_SPEC = {}

In [None]:
htJob = aiplatform.HyperparameterTuningJob(
    display_name = JOB_NAME,
    custom_job = customJob,
    metric_spec = METRIC_SPEC,
    parameter_spec = PARAMETER_SPEC,
    max_trial_count = 12,
    parallel_trial_count = 3,
    search_algorithm = 'random'
)
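
To make `max_trial_count = 12`, `parallel_trial_count = 3`, and `search_algorithm = 'random'` more concrete, here is a simplified, local simulation of random search in plain Python. It is not the Vertex AI implementation or API: `SEARCH_SPACE`, `sample_trial`, and `plan_trials` are hypothetical, the parameter ranges are invented, and the real service schedules trials itself and may not run them in strict batches.

```python
import math
import random

# Hypothetical search space: one log-scaled float and one discrete choice.
SEARCH_SPACE = {
    "learning_rate": ("log", 0.001, 0.1),
    "batch_size": ("choice", [16, 32, 64]),
}


def sample_trial(space, rng):
    """Draw one random set of hyperparameters from the search space."""
    params = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "log":
            _, low, high = spec
            # Sample uniformly in log space, then map back.
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        elif kind == "choice":
            params[name] = rng.choice(spec[1])
        else:
            raise ValueError("unknown spec kind: {}".format(kind))
    return params


def plan_trials(space, max_trial_count, parallel_trial_count, seed=0):
    """Sample all trials up front and group them into parallel batches."""
    rng = random.Random(seed)
    trials = [sample_trial(space, rng) for _ in range(max_trial_count)]
    return [
        trials[i:i + parallel_trial_count]
        for i in range(0, max_trial_count, parallel_trial_count)
    ]


batches = plan_trials(SEARCH_SPACE, max_trial_count=12, parallel_trial_count=3)
print(len(batches), "batches of", len(batches[0]), "trials")
```

Each sampled dictionary corresponds to one trial's command-line arguments; random search draws them independently rather than using earlier results to choose later trials.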

### Run Training Job

In [None]:
htJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = TENSORBOARD_NAME
)

In [64]:
CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE)
]

In [65]:
runJob = customJob.run(
    base_output_dir = MODEL_DIR,
    service_account = SERVICE_ACCOUNT,
    args = CMDARGS,
    replica_count = 1,
    machine_type = TRAIN_COMPUTE,
    accelerator_count = 0,
    tensorboard = TENSORBOARD_NAME
)

INFO:google.cloud.aiplatform.training_jobs:Training Output directory:
gs://statmike-mlops/digits/model/05_aip_train_job/05_AIP_DIGITS_20210901205004 
INFO:google.cloud.aiplatform.training_jobs:View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1611833468784738304?project=691911073727
INFO:google.cloud.aiplatform.training_jobs:CustomPythonPackageTrainingJob projects/691911073727/locations/us-central1/trainingPipelines/1611833468784738304 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/576075923233701888?project=691911073727
INFO:google.cloud.aiplatform.training_jobs:View tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+691911073727+locations+us-central1+tensorboards+3436703912520843264+experiments+576075923233701888
INFO:google.cloud.aiplatform.training_jobs:Custom

---
## Serving

### Upload The Model

In [75]:
model = aiplatform.Model.upload(
    display_name = JOB_NAME,
    serving_container_image_uri = DEPLOY_IMAGE,
    artifact_uri = MODEL_DIR+"/model"
)

INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/691911073727/locations/us-central1/models/8992342653626482688/operations/8141787671291756544
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/691911073727/locations/us-central1/models/8992342653626482688
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/691911073727/locations/us-central1/models/8992342653626482688')


In [78]:
model.display_name

'05_AIP_DIGITS_20210901205004'

### Create An Endpoint

In [79]:
endpoint = aiplatform.Endpoint.create(display_name = JOB_NAME)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/691911073727/locations/us-central1/endpoints/7922236364823724032/operations/9078536393784819712
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/691911073727/locations/us-central1/endpoints/7922236364823724032
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/691911073727/locations/us-central1/endpoints/7922236364823724032')


In [80]:
endpoint.display_name

'05_AIP_DIGITS_20210901205004'

### Deploy Model To Endpoint

In [81]:
endpoint.deploy(
    model=model,
    deployed_model_display_name=JOB_NAME+'_DEPLOYED',
    traffic_percentage = 100,
    machine_type = DEPLOY_COMPUTE,
    min_replica_count = 1,
    max_replica_count = 1,
    sync=True
)

INFO:google.cloud.aiplatform.models:Deploying Model projects/691911073727/locations/us-central1/models/8992342653626482688 to Endpoint : projects/691911073727/locations/us-central1/endpoints/7922236364823724032
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/691911073727/locations/us-central1/endpoints/7922236364823724032/operations/4270380841613393920
INFO:google.cloud.aiplatform.models:Endpoint model deployed. Resource name: projects/691911073727/locations/us-central1/endpoints/7922236364823724032


---
## Prediction

### Data For Prediction

In [82]:
%%bigquery pred
SELECT *
FROM `digits.digits_source`
LIMIT 10

Query complete after 0.02s: 100%|██████████| 1/1 [00:00<00:00, 385.12query/s]                          
Downloading: 100%|██████████| 10/10 [00:01<00:00,  9.39rows/s]


In [83]:
pred

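| | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | ... | p56 | p57 | p58 | p59 | p60 | p61 | p62 | p63 | target | target_OE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 5.0 | 16.0 | 15.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 16.0 | 7.0 | 0.0 | 2 | Even |
| 1 | 0.0 | 5.0 | 16.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | ... | 0.0 | 8.0 | 16.0 | 16.0 | 16.0 | 16.0 | 4.0 | 0.0 | 2 | Even |
| 2 | 0.0 | 5.0 | 15.0 | 16.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | ... | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 13.0 | 3.0 | 0.0 | 2 | Even |
| 3 | 0.0 | 4.0 | 15.0 | 15.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 7.0 | 14.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | Even |
| 4 | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 15.0 | 10.0 | 0.0 | 0.0 | 9.0 | ... | 0.0 | 9.0 | 16.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | Odd |
| 5 | 0.0 | 8.0 | 16.0 | 12.0 | 15.0 | 16.0 | 7.0 | 0.0 | 0.0 | 13.0 | ... | 0.0 | 7.0 | 16.0 | 16.0 | 10.0 | 0.0 | 0.0 | 0.0 | 5 | Odd |
| 6 | 0.0 | 8.0 | 13.0 | 15.0 | 16.0 | 16.0 | 8.0 | 0.0 | 0.0 | 9.0 | ... | 0.0 | 9.0 | 16.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | Odd |
| 7 | 0.0 | 7.0 | 12.0 | 14.0 | 16.0 | 8.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 9.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | Odd |
| 8 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 | Even |
| 9 | 0.0 | 0.0 | 1.0 | 9.0 | 15.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 10.0 | 13.0 | 3.0 | 0.0 | 0.0 | 0 | Even |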


### Prepare Prediction Request

In [91]:
pred.loc[:0]

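| | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | ... | p56 | p57 | p58 | p59 | p60 | p61 | p62 | p63 | target | target_OE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 5.0 | 16.0 | 15.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 16.0 | 7.0 | 0.0 | 2 | Even |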


In [84]:
newob = pred.loc[:0,'p0':'p63'].to_dict(orient='records')[0]

In [85]:
instances = [json_format.ParseDict(newob, Value())]
parameters = json_format.ParseDict({}, Value())
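
The request above sends only the pixel columns `p0` through `p63`, leaving out `target` and `target_OE`. A plain-Python sketch of that selection step, without pandas or protobuf, is shown below. `row_to_instance` is a hypothetical helper, and the row values are placeholders rather than real table data.

```python
import json


def row_to_instance(row, n_pixels=64):
    """Keep only the pixel features p0..p{n_pixels-1} as floats."""
    keys = ["p{}".format(i) for i in range(n_pixels)]
    return {key: float(row[key]) for key in keys}


# Placeholder row: all zeros except two pixels, plus label columns to drop.
row = {"p{}".format(i): 0.0 for i in range(64)}
row.update({"p1": 5.0, "p2": 16.0, "target": 2, "target_OE": "Even"})

instance = row_to_instance(row)
# Serializing confirms the instance contains only JSON-friendly values.
payload = json.dumps({"instances": [instance]})
print(len(instance), "features")
```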

### Get Prediction

In [86]:
prediction = endpoint.predict(instances=instances,parameters=parameters)

In [87]:
prediction

Prediction(predictions=[[5.49534143e-08, 0.000343886757, 0.999521, 4.42351052e-07, 5.04839974e-08, 7.08414373e-05, 4.84852626e-06, 8.81201672e-08, 5.87819486e-05, 4.02276124e-09]], deployed_model_id='2039435739850080256', explanations=None)

In [89]:
prediction.predictions[0]

[5.49534143e-08,
 0.000343886757,
 0.999521,
 4.42351052e-07,
 5.04839974e-08,
 7.08414373e-05,
 4.84852626e-06,
 8.81201672e-08,
 5.87819486e-05,
 4.02276124e-09]

In [90]:
import numpy as np

np.argmax(prediction.predictions[0])

2
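
The predicted class is the index of the largest probability. The same lookup in plain Python, also returning the probability itself, can be sketched as follows; `top_class` is a hypothetical helper, and the probabilities are copied from the prediction output above.

```python
probabilities = [
    5.49534143e-08, 0.000343886757, 0.999521, 4.42351052e-07, 5.04839974e-08,
    7.08414373e-05, 4.84852626e-06, 8.81201672e-08, 5.87819486e-05, 4.02276124e-09,
]


def top_class(probs):
    """Return (index, probability) of the most likely class."""
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index]


label, confidence = top_class(probabilities)
print(label, confidence)  # prints: 2 0.999521
```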

---
## Remove Resources
See notebook "XX - Cleanup".