# Vertex AI - Training > Hyperparameter Tuning Job
## Using A Python Package
## With Vertex AI - Experiments > Experiments (Tensorboard)

This notebook creates a Python package for a TensorFlow training project that uses the `<PROJECT_ID>.digits.digits_prepped` BigQuery table. Vertex AI clients are used to set up a Vertex AI custom training job that runs the training package. That custom job is then wrapped in a Vertex AI hyperparameter tuning job, which searches over learning rate and momentum and logs each trial to a Vertex AI Tensorboard instance.

**Prerequisites**
- `00 - Initial Setup`
- `01 - BigQuery - Data`
- `05 - Vertex AI - Tensorboard`
- `05b - Vertex AI - Training Job from Python Package` (helpful to understand)
    - Contains the Python package that this notebook adapts for its hyperparameter tuning job
    
**Resources**
- Adapted From:
    - https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/custom_job_image_classification_model_for_online_prediction.ipynb
- Using Google Cloud Client Libraries (Python for Vertex AI):
    - https://googleapis.dev/python/aiplatform/latest/index.html

**Overview**

<img src="architectures/statmike-mlops-05c.png">

---
## Setup

Set up the environment:

In [1]:
from google.cloud import aiplatform

from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

from datetime import datetime

Define Parameters:

In [2]:
# Locations
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
BUCKET_NAME='gs://{}/digits/model/05e_aip_train_job'.format(PROJECT_ID)
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = '05E_AIP_DIGITS_'
JOB_NAME = EXPERIMENT_NAME+TIMESTAMP
MODEL_DIR = '{}/{}'.format(BUCKET_NAME, JOB_NAME)
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION

# files
PACKAGE = 'custom_5e'

# Resources
TRAIN_IMAGE='us-docker.pkg.dev/cloud-aiplatform/training/tf-cpu.2-4:latest'
DEPLOY_IMAGE ='us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-3:latest'
TRAIN_COMPUTE='n1-standard-4'
DEPLOY_COMPUTE='n1-standard-4'

# TF Parameters to pass
EPOCHS = 25
BATCH_SIZE = 30

# Tensorboard Info
TENSORBOARD_ID = 'digits_tensorboard'
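
The naming scheme in the cell above can be wrapped in a small function so the timestamp is injectable, which makes the resulting paths easy to check. This is only a sketch: `build_job_paths` is a hypothetical helper that is not part of the notebook, and the fixed date passed in is made up for illustration.

```python
from datetime import datetime


def build_job_paths(project_id, experiment_name, now=None):
    """Compose the job name and output directory the same way the parameter cell does."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d%H%M%S")
    bucket_name = 'gs://{}/digits/model/05e_aip_train_job'.format(project_id)
    job_name = experiment_name + timestamp
    model_dir = '{}/{}'.format(bucket_name, job_name)
    return job_name, model_dir


# A fixed (illustrative) timestamp keeps the example deterministic.
job_name, model_dir = build_job_paths(
    'statmike-mlops', '05E_AIP_DIGITS_', now=datetime(2021, 9, 2, 19, 0, 0))
print(job_name)
print(model_dir)
```

Because the timestamp is part of `JOB_NAME`, every run of the notebook writes to a new folder under the same bucket path.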

Get Service Account email:

In [3]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()

service = discovery.build('iam', 'v1', credentials=credentials)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 33, in <module>
    from oauth2client.contrib.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.contrib.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 37, in <module>
    from oauth2client.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/__init__.py", line 44, in autodetect
    from . import file_cache
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    "file_cach

In [4]:
request = service.projects().serviceAccounts().list(name="projects/"+PROJECT_ID)
response = request.execute()

INFO:oauth2client.transport:Attempting refresh to obtain initial access_token


In [5]:
SERVICE_ACCOUNT = response['accounts'][0]['email']
SERVICE_ACCOUNT

'691911073727-compute@developer.gserviceaccount.com'
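
Taking the first entry of `response['accounts']` works when the project has a single service account, but the list order is not something to rely on. A more defensive sketch, assuming the goal is the Compute Engine default service account, matches on the email suffix instead. The helper name `pick_compute_default_account` and the first entry in the sample response are hypothetical.

```python
def pick_compute_default_account(response):
    """Return the email of the Compute Engine default service account in a list response."""
    suffix = '-compute@developer.gserviceaccount.com'
    for account in response.get('accounts', []):
        email = account.get('email', '')
        if email.endswith(suffix):
            return email
    raise LookupError('No Compute Engine default service account found')


# Stand-in for the IAM list response; the first entry is invented for illustration.
sample_response = {
    'accounts': [
        {'email': 'example-sa@statmike-mlops.iam.gserviceaccount.com'},
        {'email': '691911073727-compute@developer.gserviceaccount.com'},
    ]
}
selected = pick_compute_default_account(sample_response)
print(selected)
```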

Set up the Vertex AI Python clients:
- https://googleapis.dev/python/aiplatform/latest/index.html

In [6]:
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
client_options = {"api_endpoint": API_ENDPOINT}
clients = {}

In [7]:
aiplatform.init(project=PROJECT_ID, location=REGION)

---
## Get Tensorboard Instance Name
The training job will show up as an experiment for the Tensorboard instance and have the same name as the training job ID.

In [8]:
tb = aiplatform.Tensorboard.list(filter='display_name={}'.format(TENSORBOARD_ID))

In [9]:
TENSORBOARD_NAME = tb[0].resource_name

In [10]:
TENSORBOARD_NAME

'projects/691911073727/locations/us-central1/tensorboards/3436703912520843264'

In [11]:
tb[0].display_name

'digits_tensorboard'
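
The resource name returned above follows a `projects/.../locations/.../tensorboards/...` pattern. As a rough sketch, it can be split into its parts with plain string handling, for example to confirm that the Tensorboard instance lives in the same region as the job. `parse_tensorboard_name` is a hypothetical helper, not a Vertex AI SDK function.

```python
def parse_tensorboard_name(resource_name):
    """Split a Tensorboard resource name into project, location, and tensorboard IDs."""
    parts = resource_name.split('/')
    expected_keys = ['projects', 'locations', 'tensorboards']
    if len(parts) != 6 or parts[0::2] != expected_keys:
        raise ValueError('Unexpected resource name: {}'.format(resource_name))
    return dict(zip(['project', 'location', 'tensorboard'], parts[1::2]))


parsed = parse_tensorboard_name(
    'projects/691911073727/locations/us-central1/tensorboards/3436703912520843264')
print(parsed)
```

Note that `tb[0]` in the cells above raises an `IndexError` if no Tensorboard instance matches the display name, so it may be worth checking that the list is non-empty first.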

---
## Training

### Assemble Python Package for Training

Make a subdirectory named `custom_5e` (the value of `PACKAGE`). If the directory already exists, delete it first.

In [12]:
!rm -rf {PACKAGE}
!mkdir {PACKAGE}

Create the `setup.py` file in the subdirectory:

In [13]:
%%writefile {PACKAGE}/setup.py
import setuptools
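# Declared for reference only: this list is not passed to setup() via install_requires,
# so the training image is relied on to provide tensorflow_io.
REQUIRED_PACKAGES = ['tensorflow_io']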
setuptools.setup(
    name='trainer',
    version='0.1',
    packages=['trainer'],
    description='Digit Training Package')

Make a subdirectory, called `trainer`.  Include a `__init__.py` file in the subdirectory.

In [14]:
!mkdir {PACKAGE}/trainer
!touch {PACKAGE}/trainer/__init__.py

Create the main Python trainer file as `trainer/task.py`:

In [15]:
%%writefile {PACKAGE}/trainer/task.py
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp 
from google.cloud import bigquery
import argparse
import os
import sys
import hypertune

parser = argparse.ArgumentParser()
# add_argument fields: the flag name, dest: attribute name on args, default: value used when the flag is absent, required: fail if the flag is absent, type: type to convert to, help: description of the argument
parser.add_argument('--epochs',dest='epochs', default=10, type=int, help='Number of Epochs')
parser.add_argument('--batch_size',dest='batch_size', default=32, type=int, help='Batch Size')
parser.add_argument('--lr',dest='learning_rate', required=True, type=float, help='Learning Rate')
parser.add_argument('--m',dest='momentum', required=True, type=float, help='Momentum')
#parser.add_argument('',dest='', default=, type=, help='')
args = parser.parse_args()

# built in parameters for data source:
PROJECT_ID='statmike-mlops'
BQDATASET_ID='digits'
BQTABLE_ID='digits_prepped'

# 64 pixel columns (p0 ... p63) followed by the label column
selected_fields = ['p{}'.format(i) for i in range(64)] + ['target']
# pixel columns are FLOAT64, the label column is INT64
output_types = ['FLOAT64'] * 64 + ['INT64']

feature_columns = []
feature_layer_inputs = {}
for header in selected_fields:
    if header != 'target':
        feature_columns.append(tf.feature_column.numeric_column(header))
        feature_layer_inputs[header] = tf.keras.Input(shape=(1,),name=header)

from tensorflow.python.framework import dtypes
output_types = [dtypes.float64 if x=='FLOAT64' else dtypes.int64 for x in output_types]

def transTable(row_dict):
    target=row_dict.pop('target')
    target = tf.one_hot(tf.cast(target,tf.int64),10)
    target = tf.cast(target,tf.float32)
    return(row_dict,target)

client = BigQueryClient()
session = client.read_session("projects/"+PROJECT_ID,PROJECT_ID,BQTABLE_ID,BQDATASET_ID,selected_fields,output_types,row_restriction="SPLITS='TRAIN'",requested_streams=3)
table = session.parallel_read_rows()
table = table.map(transTable)
train = table.shuffle(100000).batch(args.batch_size)

client = BigQueryClient()
session = client.read_session("projects/"+PROJECT_ID,PROJECT_ID,BQTABLE_ID,BQDATASET_ID,selected_fields,output_types,row_restriction="SPLITS='TEST'",requested_streams=3)
table = session.parallel_read_rows()
table = table.map(transTable)
test = table.batch(args.batch_size)

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
feature_layer_outputs = feature_layer(feature_layer_inputs)
opt = tf.keras.optimizers.SGD(learning_rate=args.learning_rate, momentum=args.momentum)
model = tf.keras.Model(inputs=[v for v in feature_layer_inputs.values()],outputs=tf.keras.layers.Dense(10,activation=tf.nn.softmax)(feature_layer_outputs))
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
tf.keras.utils.plot_model(model,show_shapes=True, show_dtype=True)

# setup tensorboard logs
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'], histogram_freq=1) #(log_dir=model_dir+'/logs', histogram_freq=1)

history = model.fit(train,epochs=args.epochs, callbacks=[tensorboard_callback])

# report hypertune info back to Vertex AI Training > Hyperparameter Tuning Job
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag = 'loss',
    metric_value = history.history['loss'][-1],
    global_step = 1)

#model.save(os.getenv("AIP_MODEL_DIR"))

Writing custom_5e/trainer/task.py
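
The trainer marks `--lr` and `--m` as required because the hyperparameter tuning job is expected to supply them for every trial as command-line flags named after the parameter IDs, alongside the fixed arguments from the job definition. The sketch below reproduces only the argument parsing from `task.py` to show how those flags map onto `args.learning_rate` and `args.momentum`. The `build_parser` wrapper and the trial values in `trial_argv` are hypothetical.

```python
import argparse


def build_parser():
    """Recreate the argument definitions used in trainer/task.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', dest='epochs', default=10, type=int, help='Number of Epochs')
    parser.add_argument('--batch_size', dest='batch_size', default=32, type=int, help='Batch Size')
    parser.add_argument('--lr', dest='learning_rate', required=True, type=float, help='Learning Rate')
    parser.add_argument('--m', dest='momentum', required=True, type=float, help='Momentum')
    return parser


# Fixed arguments from the job definition plus one made-up trial's values.
trial_argv = ['--epochs=25', '--batch_size=30', '--lr=0.05', '--m=0.04']
args = build_parser().parse_args(trial_argv)
print(args.epochs, args.batch_size, args.learning_rate, args.momentum)
```

The flag names (`lr`, `m`) must match the keys used in the tuning job's parameter specification, while `dest` only controls the attribute name inside the script.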


Store the training package in a Cloud Storage Bucket:

In [16]:
!rm -f {PACKAGE}.tar {PACKAGE}.tar.gz
!tar cvf {PACKAGE}.tar {PACKAGE}
!gzip {PACKAGE}.tar
!gsutil cp {PACKAGE}.tar.gz $BUCKET_NAME/{PACKAGE}.tar.gz

custom_5e/
custom_5e/setup.py
custom_5e/trainer/
custom_5e/trainer/task.py
custom_5e/trainer/__init__.py
Copying file://custom_5e.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  1.8 KiB/  1.8 KiB]                                                
Operation completed over 1 objects/1.8 KiB.                                      
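
The `tar` and `gzip` shell commands above can also be expressed with Python's standard `tarfile` module, which may be convenient where shell magics are unavailable. This is a sketch only: `make_source_archive` is a hypothetical helper, the demo builds a throwaway package layout with placeholder files in a temporary directory, and uploading the archive to Cloud Storage is still left to `gsutil` as in the cell above.

```python
import os
import tarfile
import tempfile


def make_source_archive(package_dir, archive_path):
    """Write package_dir into a gzip-compressed tar archive, keeping the top-level folder name."""
    top_level = os.path.basename(os.path.normpath(package_dir))
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(package_dir, arcname=top_level)
    return archive_path


# Build a placeholder package layout that mirrors the one created in this notebook.
workdir = tempfile.mkdtemp()
package_dir = os.path.join(workdir, 'custom_5e')
os.makedirs(os.path.join(package_dir, 'trainer'))
for relative_path in ['setup.py', 'trainer/__init__.py', 'trainer/task.py']:
    with open(os.path.join(package_dir, relative_path), 'w') as handle:
        handle.write('# placeholder\n')

archive = make_source_archive(package_dir, os.path.join(workdir, 'custom_5e.tar.gz'))
with tarfile.open(archive) as tar:
    members = sorted(tar.getnames())
print(members)
```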


### Set Up Training Job

In [17]:
MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}


CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE)
]

WORKER_POOL_SPEC = [
    {
        "replica_count": 3,
        "machine_spec": MACHINE_SPEC,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [BUCKET_NAME + "/{}.tar.gz".format(PACKAGE)],
            "python_module": "trainer.task",
            "args": CMDARGS
        }
    }
]

In [18]:
customJob = aiplatform.CustomJob(
    display_name = JOB_NAME,
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = MODEL_DIR,
    staging_bucket = MODEL_DIR
)

### Set Up Hyperparameter Tuning Job

In [22]:
METRIC_SPEC = {
    "loss": "minimize"
}

PARAMETER_SPEC = {
    "lr": aiplatform.hyperparameter_tuning.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
    "m": aiplatform.hyperparameter_tuning.DoubleParameterSpec(min=1e-7, max=0.9, scale="linear")
}
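
The `scale` argument controls how the range between `min` and `max` is treated during the search. The sketch below illustrates the difference for a random search: a log scale spreads samples evenly across orders of magnitude, while a linear scale spreads them evenly across the raw range. It is an illustration of the idea only and does not claim to be the sampler Vertex AI uses. `sample_double` is a hypothetical helper.

```python
import math
import random


def sample_double(min_value, max_value, scale, rng):
    """Draw one value from [min_value, max_value] on a linear or log scale."""
    if scale == 'log':
        # Uniform in log space, then mapped back to the original range.
        return math.exp(rng.uniform(math.log(min_value), math.log(max_value)))
    if scale == 'linear':
        return rng.uniform(min_value, max_value)
    raise ValueError('Unknown scale: {}'.format(scale))


rng = random.Random(0)  # seeded so repeated runs draw the same values
sampled_trials = [
    {
        'lr': sample_double(0.001, 0.1, 'log', rng),
        'm': sample_double(1e-7, 0.9, 'linear', rng),
    }
    for _ in range(12)
]
for trial in sampled_trials:
    print(trial)
```

With the log scale, roughly half of the learning rate samples are expected to fall below 0.01, whereas a linear scale over the same range would place about 9% of them there.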

In [23]:
htJob = aiplatform.HyperparameterTuningJob(
    display_name = JOB_NAME,
    custom_job = customJob,
    metric_spec = METRIC_SPEC,
    parameter_spec = PARAMETER_SPEC,
    max_trial_count = 12,
    parallel_trial_count = 3,
    search_algorithm = 'random'
)

### Run Hyperparameter Tuning Job

In [24]:
htJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = TENSORBOARD_NAME
)

INFO:google.cloud.aiplatform.jobs:Creating HyperparameterTuningJob
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob created. Resource name: projects/691911073727/locations/us-central1/hyperparameterTuningJobs/4116538549044510720
INFO:google.cloud.aiplatform.jobs:To use this HyperparameterTuningJob in another session:
INFO:google.cloud.aiplatform.jobs:hpt_job = aiplatform.HyperparameterTuningJob.get('projects/691911073727/locations/us-central1/hyperparameterTuningJobs/4116538549044510720')
INFO:google.cloud.aiplatform.jobs:View HyperparameterTuningJob:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/4116538549044510720?project=691911073727
INFO:google.cloud.aiplatform.jobs:View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+691911073727+locations+us-central1+tensorboards+3436703912520843264+experiments+4116538549044510720
INFO:google.cloud.aiplatform.jobs:HyperparameterTuningJob projects/691911073727/locations/

In [37]:
htJob.trials[0].final_measurement.metrics[0].value

0.09431399405002594

In [43]:
losses = [trial.final_measurement.metrics[0].value for trial in htJob.trials]
htJob.trials[losses.index(min(losses))]

id: "11"
state: SUCCEEDED
parameters {
  parameter_id: "lr"
  value {
    number_value: 0.052648029934358465
  }
}
parameters {
  parameter_id: "m"
  value {
    number_value: 0.036914641886908954
  }
}
final_measurement {
  step_count: 1
  metrics {
    metric_id: "loss"
    value: 0.0048503512516617775
  }
}
start_time {
  seconds: 1630609701
  nanos: 256115055
}
end_time {
  seconds: 1630610103
}
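
The selection above indexes `metrics[0]`, which works because the job reports a single metric. A slightly more explicit sketch looks the metric up by its ID and returns the trial's parameters as a dictionary. To keep it runnable without a Vertex AI job, it uses simplified stand-in objects built with `SimpleNamespace` rather than real trial objects. `best_trial` and `make_trial` are hypothetical helpers, and the ID and parameter values of the first stand-in trial are invented.

```python
from types import SimpleNamespace


def best_trial(trials, metric_id='loss'):
    """Return the trial with the lowest final value for metric_id."""
    def metric_value(trial):
        for metric in trial.final_measurement.metrics:
            if metric.metric_id == metric_id:
                return metric.value
        raise KeyError(metric_id)
    return min(trials, key=metric_value)


def make_trial(trial_id, lr, m, loss):
    """Build a simplified stand-in with the attributes used by best_trial."""
    return SimpleNamespace(
        id=trial_id,
        parameters=[
            SimpleNamespace(parameter_id='lr', value=lr),
            SimpleNamespace(parameter_id='m', value=m),
        ],
        final_measurement=SimpleNamespace(
            metrics=[SimpleNamespace(metric_id='loss', value=loss)]),
    )


stand_in_trials = [
    make_trial('1', 0.002, 0.5, 0.09431399405002594),  # ID and parameters are invented
    make_trial('11', 0.052648029934358465, 0.036914641886908954, 0.0048503512516617775),
]
best = best_trial(stand_in_trials)
best_parameters = {p.parameter_id: p.value for p in best.parameters}
print(best.id, best_parameters)
```

Keep in mind that the metric reported by `task.py` is the final training loss, so the "best" trial here is the one that fit the training split most closely.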

---
## Remove Resources
See notebook `XX - Cleanup`.