# Vertex AI - Pipeline for Training and Serving

This notebook uses the Python package created in notebook 05 for a TensorFlow training project that reads the `<PROJECT_ID>.digits.digits_prepped` BigQuery table. Vertex AI clients set up a custom training pipeline that runs the training job and uploads the resulting model. Vertex AI clients then deploy the model to an endpoint for online predictions.

**Prerequisites**
- `00 - Initial Setup`
- `01 - BigQuery - Data`
- `05 - Vertex AI - Training Job and Serving` 
    - this notebook uses the python package created in 05
    
**Resources**
- Based on:
    - https://cloud.google.com/ai-platform-unified/docs/training/create-training-pipeline#custom-job-model-upload
- Using PipelineService:
    - https://googleapis.dev/python/aiplatform/latest/aiplatform_v1beta1/pipeline_service.html

**Overview**

<img src="architectures/statmike-mlops-06.png">

---
## Setup

Set up the environment:

In [1]:
from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
from datetime import datetime

Define Parameters:

In [2]:
# Locations
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
BUCKET_NAME='gs://{}/digits/model/05_aip_train_job'.format(PROJECT_ID)
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
JOB_NAME='06_AIP_DIGITS_'+TIMESTAMP
MODEL_DIR = '{}/{}'.format(BUCKET_NAME, JOB_NAME)
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION

# Resources
TRAIN_IMAGE='us-docker.pkg.dev/cloud-aiplatform/training/tf-cpu.2-4:latest'
DEPLOY_IMAGE ='us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-3:latest'
TRAIN_COMPUTE='n1-standard-4'
DEPLOY_COMPUTE='n1-standard-4'

# TF Parameters to pass
EPOCHS = 25
BATCH_SIZE = 30

Set Up AI Platform Python Clients:
- https://googleapis.dev/python/aiplatform/latest/index.html

In [3]:
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
client_options = {"api_endpoint": API_ENDPOINT}
clients = {}
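
Both `API_ENDPOINT` and `PARENT` are derived from the same `REGION`, so the clients talk to the regional service that owns the resources they name. The sketch below only illustrates that string relationship. The helper names (`regional_endpoint`, `parent_path`, `region_of`) are hypothetical and are not part of the Vertex AI SDK.

```python
def regional_endpoint(region):
    """Hostname of the regional API endpoint, built the same way as API_ENDPOINT."""
    return "{}-aiplatform.googleapis.com".format(region)


def parent_path(project_id, region):
    """Parent resource path, in the same form as PARENT in the parameters cell."""
    return "projects/{}/locations/{}".format(project_id, region)


def region_of(parent):
    """Extract the region from a 'projects/<id>/locations/<region>' path."""
    parts = parent.split("/")
    if len(parts) != 4 or parts[0] != "projects" or parts[2] != "locations":
        raise ValueError("unexpected parent path: {!r}".format(parent))
    return parts[3]


# Illustrative values copied from the parameters cell.
example_parent = parent_path("statmike-mlops", "us-central1")
example_api_endpoint = regional_endpoint(region_of(example_parent))
print(example_parent)
print(example_api_endpoint)
```

Deriving the endpoint from the parent path, rather than typing both by hand, is one way to keep the two from drifting apart if `REGION` changes.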

---
## Custom Pipeline Job

Create a client for the AIP Pipeline Service:

In [4]:
clients['pipeline'] = aiplatform.gapic.PipelineServiceClient(client_options=client_options)

Define the training job parameters to match the YAML spec in:
- `gs://google-cloud-aiplatform/schema/trainingjob/definition/custom_task_1.0.0.yaml`

In [5]:
MACHINE_SPEC = {
    "machineType": TRAIN_COMPUTE,
    "acceleratorCount": 0
}


CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE)
]

WORKER_POOL_SPEC = [
    {
        "replicaCount": 1,
        "machineSpec": MACHINE_SPEC,
        "pythonPackageSpec": {
            "executorImageUri": TRAIN_IMAGE,
            "packageUris": [BUCKET_NAME + "/trainer_cifar.tar.gz"],
            "pythonModule": "trainer.task",
            "args": CMDARGS
        }
    }
]

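JOB_SPEC = {
    "workerPoolSpecs": WORKER_POOL_SPEC,
    "baseOutputDirectory": {"outputUriPrefix": MODEL_DIR}
}

# CUSTOM_JOB is not referenced again in this notebook; the training pipeline
# below passes JOB_SPEC directly as its training task inputs.
CUSTOM_JOB = {
    "display_name": JOB_NAME,
    "job_spec": JOB_SPEC
}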

Define Training Pipeline Parameters:

In [6]:
training_task_definition = "gs://google-cloud-aiplatform/schema/trainingjob/definition/custom_task_1.0.0.yaml"
training_task_inputs = json_format.ParseDict(JOB_SPEC, Value())

training_pipeline = {
    "display_name": JOB_NAME,
    "training_task_definition": training_task_definition,
    "training_task_inputs": training_task_inputs,
    "model_to_upload": {
        "display_name": JOB_NAME,
        "container_spec": {"image_uri": DEPLOY_IMAGE},
    },
}

Submit the pipeline job:
- this creates a pipeline job
    - which creates a custom training job
    - and, when complete, uploads a model

In [7]:
pipeline = clients['pipeline'].create_training_pipeline(parent=PARENT, training_pipeline=training_pipeline)

In [10]:
clients['pipeline'].get_training_pipeline(name=pipeline.name).state

<PipelineState.PIPELINE_STATE_SUCCEEDED: 4>
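
The state above was read after the pipeline had already finished. When running the notebook top to bottom, the pipeline may still be pending or running at this point, so one option is to poll until it reaches a terminal state. The following is a minimal sketch, not the notebook's own method: `wait_for_terminal_state` is a hypothetical helper, and the demonstration uses a scripted sequence of state names instead of a real client.

```python
import time

# State names from the PipelineState enum after which the pipeline stops changing.
TERMINAL_STATES = {
    "PIPELINE_STATE_SUCCEEDED",
    "PIPELINE_STATE_FAILED",
    "PIPELINE_STATE_CANCELLED",
}


def wait_for_terminal_state(get_state_name, poll_seconds=60,
                            timeout_seconds=3600, sleep=time.sleep):
    """Call get_state_name() until it returns a terminal state or time runs out.

    get_state_name is any zero-argument callable returning a state name string.
    """
    waited = 0
    while True:
        state = get_state_name()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(
                "pipeline still in state {} after {}s".format(state, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration: scripted states stand in for calls to the service.
_scripted = iter([
    "PIPELINE_STATE_PENDING",
    "PIPELINE_STATE_RUNNING",
    "PIPELINE_STATE_SUCCEEDED",
])
final_state = wait_for_terminal_state(
    lambda: next(_scripted), poll_seconds=0, sleep=lambda seconds: None)
print(final_state)
```

In this notebook the callable could plausibly be `lambda: clients['pipeline'].get_training_pipeline(name=pipeline.name).state.name`, assuming the returned state is an enum member with a `name` attribute, as the output above suggests.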

In [11]:
clients['pipeline'].get_training_pipeline(name=pipeline.name).model_to_upload.name

'projects/691911073727/locations/us-central1/models/6801429395840958464'

---
## Endpoint Creation

Create a client to the endpoint service:

In [12]:
clients['endpoint'] = aiplatform.gapic.EndpointServiceClient(client_options=client_options)

Create the endpoint:

In [13]:
ENDPOINT_NAME = 'ENDPOINT_'+JOB_NAME
endpoint = clients['endpoint'].create_endpoint(parent=PARENT, endpoint={"display_name": ENDPOINT_NAME})

In [14]:
endpoint_info = clients['endpoint'].get_endpoint(name=endpoint.result(timeout=180).name)
endpoint_info.name

'projects/691911073727/locations/us-central1/endpoints/2200764881970397184'

---
## Deploy Model to Endpoint

Set Up Deployment Parameters:

In [15]:
# Note: this rebinds MACHINE_SPEC, replacing the training machine spec defined earlier.
MACHINE_SPEC = {
    "machine_type": DEPLOY_COMPUTE,
    "accelerator_count": 0,
}
DMODEL = {
    "model": clients['pipeline'].get_training_pipeline(name=pipeline.name).model_to_upload.name,
    "display_name": 'DEPLOYED_'+JOB_NAME,
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 2,
        "machine_spec": MACHINE_SPEC
    }
}
TRAFFIC = {
    '0' : 100
}
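
In `TRAFFIC`, the key `'0'` is a placeholder for the model being deployed by this request. The service substitutes the new deployed model ID, which is why the endpoint description further down shows the real ID as the `traffic_split` key. The percentages across all keys are expected to add up to 100. A small pre-flight check can be sketched as follows; `check_traffic_split` is a hypothetical helper, not an SDK function.

```python
def check_traffic_split(traffic_split):
    """Validate a mapping of deployed-model ID (string) to traffic percentage."""
    for model_id, percent in traffic_split.items():
        if not isinstance(model_id, str):
            raise TypeError("model IDs must be strings, got {!r}".format(model_id))
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError("invalid percentage for {}: {!r}".format(model_id, percent))
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError("traffic percentages sum to {}, expected 100".format(total))
    return traffic_split


# The split used in this notebook: all traffic to the model being deployed.
check_traffic_split({'0': 100})

# Illustrative only: a gradual rollout that keeps 80% on an existing deployed
# model (the ID is an example) and sends 20% to the new one.
rollout = check_traffic_split({'0': 20, '2683564835773349888': 80})
print(rollout)
```

Checking the split locally gives a clearer error than waiting for the deploy request to be rejected.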

Deploy the Model to the Endpoint:

In [16]:
dmodel = clients['endpoint'].deploy_model(endpoint=endpoint_info.name, deployed_model=DMODEL, traffic_split=TRAFFIC)

In [17]:
dmodel_info = dmodel.result().deployed_model
dmodel_info.id

'2683564835773349888'

In [18]:
clients['endpoint'].get_endpoint(name=endpoint_info.name)

name: "projects/691911073727/locations/us-central1/endpoints/2200764881970397184"
display_name: "ENDPOINT_06_AIP_DIGITS_20210714132709"
deployed_models {
  id: "2683564835773349888"
  model: "projects/691911073727/locations/us-central1/models/6801429395840958464"
  display_name: "DEPLOYED_06_AIP_DIGITS_20210714132709"
  create_time {
    seconds: 1626270734
    nanos: 535303000
  }
  dedicated_resources {
    machine_spec {
      machine_type: "n1-standard-4"
    }
    min_replica_count: 1
    max_replica_count: 2
  }
}
traffic_split {
  key: "2683564835773349888"
  value: 100
}
etag: "AMEw9yNyz-a2CsrpjehCfY0Hp3fuIu00B1vQZG1N37_80Qj8LBQZS8mUPMiLCh0mUJjr"
create_time {
  seconds: 1626270730
  nanos: 130632000
}
update_time {
  seconds: 1626270980
  nanos: 328027000
}

---
## Prediction

Create a client to the prediction service:

In [19]:
clients['prediction'] = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

Set up an observation for prediction:

In [20]:
%%bigquery pred
SELECT *
FROM `digits.digits_prepped`
WHERE splits='TEST'

Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 594.77query/s] 
Downloading: 100%|██████████| 373/373 [00:01<00:00, 342.97rows/s]


In [21]:
pred.head(1)

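| | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | ... | p57 | p58 | p59 | p60 | p61 | p62 | p63 | target | target_OE | SPLITS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 12.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 10.0 | 16.0 | 11.0 | 0.0 | 0.0 | 0 | Even | TEST |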


In [22]:
newob = pred.loc[:0,'p0':'p63'].to_dict(orient='records')[0]
#newob

In [23]:
# json_format and Value were already imported in the Setup section.
response = clients['prediction'].predict(
    endpoint=endpoint_info.name,
    instances=[json_format.ParseDict(newob, Value())],
    parameters=json_format.ParseDict({}, Value()),
)

In [24]:
response.predictions

[[0.999546707, 1.32121215e-06, 3.21994804e-08, 3.34853048e-11, 3.60508267e-07, 7.07400574e-08, 0.000447585859, 7.27099825e-10, 3.68008546e-06, 1.82077656e-07]]

In [25]:
import numpy as np
np.argmax(response.predictions[0])

0
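
The response holds one list of ten scores per instance, and `np.argmax` picks the position of the largest score. The sketch below does the same with plain Python and also reports the winning score. It assumes that position `i` corresponds to digit `i`, which is consistent with the `target` of 0 for this row. `top_class` is a hypothetical helper name.

```python
# Scores copied from the response.predictions output above.
predictions = [[0.999546707, 1.32121215e-06, 3.21994804e-08, 3.34853048e-11,
                3.60508267e-07, 7.07400574e-08, 0.000447585859, 7.27099825e-10,
                3.68008546e-06, 1.82077656e-07]]


def top_class(scores):
    """Return (index, score) of the largest score in a non-empty list."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]


digit, confidence = top_class(predictions[0])
print(digit, round(confidence, 4))  # prints: 0 0.9995
```

Reporting the score next to the class makes low-confidence predictions easier to spot when more than one row is sent to the endpoint.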

---

## Remove Resources
- undeploy model
- delete endpoint
- remove model
- delete training > custom job
- delete training > training pipeline
- delete python training application
    - these could be used by later notebooks in this project
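
The order of this list matters, and each call in the cells that follow returns a long-running operation rather than a finished result. One cautious approach is to wait for each operation before starting the next. The following is a minimal sketch of that idea: `run_cleanup` and `_FakeOperation` are hypothetical names, and the fake operation stands in for the real objects so the example runs offline.

```python
def run_cleanup(steps, timeout=600):
    """Run (label, start) pairs in order, blocking on each before the next.

    start is a zero-argument callable that returns an object with a
    result(timeout=...) method, like the Operation objects returned by
    the cleanup calls in this section.
    """
    finished = []
    for label, start in steps:
        operation = start()
        operation.result(timeout=timeout)  # block until this step is done
        finished.append(label)
    return finished


class _FakeOperation:
    """Stand-in for a long-running operation; for offline demonstration only."""

    def __init__(self, log, label):
        self._log = log
        self._label = label

    def result(self, timeout=None):
        self._log.append(self._label)


log = []
order = run_cleanup([
    ("undeploy model", lambda: _FakeOperation(log, "undeploy model")),
    ("delete endpoint", lambda: _FakeOperation(log, "delete endpoint")),
    ("delete model", lambda: _FakeOperation(log, "delete model")),
])
print(order)
```

In the notebook, a step could be written as `("undeploy model", lambda: clients['endpoint'].undeploy_model(endpoint=endpoint_info.name, deployed_model_id=dmodel))`, using the same calls as the cells below.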

Get Pipeline Info:

In [34]:
pipe_info = clients['pipeline'].get_training_pipeline(name=pipeline.name)

Undeploy Model:

In [26]:
dmodel = clients['endpoint'].get_endpoint(name=endpoint_info.name).deployed_models[0].id
clients['endpoint'].undeploy_model(endpoint=endpoint_info.name, deployed_model_id=dmodel)

<google.api_core.operation.Operation at 0x7f294c6d86d0>

Delete Endpoint:

In [27]:
clients['endpoint'].delete_endpoint(name=endpoint_info.name)

<google.api_core.operation.Operation at 0x7f294c69fad0>

Remove Model:

In [36]:
clients['model'] = aiplatform.gapic.ModelServiceClient(client_options=client_options)
clients['model'].delete_model(name=pipe_info.model_to_upload.name)

<google.api_core.operation.Operation at 0x7f294c188650>

Delete Training > Custom Job:

In [46]:
clients['job'] = aiplatform.gapic.JobServiceClient(client_options=client_options)
clients['job'].delete_custom_job(name=pipe_info.training_task_metadata._pb['backingCustomJob'].string_value)

<google.api_core.operation.Operation at 0x7f294c6ff710>

Delete Training > Training Pipeline:

In [49]:
clients['pipeline'].delete_training_pipeline(name=pipeline.name)

<google.api_core.operation.Operation at 0x7f294c18e8d0>

Delete Model Files:
- This includes the training package and the saved model
- These could be useful for other notebooks

In [50]:
from google.cloud import storage
gcs = storage.Client()

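path = gcs.bucket(PROJECT_ID)
# Note: BUCKET_NAME above points at 'digits/model/05_aip_train_job'; adjust
# this prefix to match the location you want to delete.
blobs = path.list_blobs(prefix='digits/aip_train_job')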
for blob in blobs:
    print(blob)
    blob.delete()

<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/, 1618327171620676>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/model/, 1618327171918285>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/model/assets/, 1618327175253828>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/model/saved_model.pb, 1618327175719390>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/model/variables/, 1618327172067587>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/model/variables/variables.data-00000-of-00001, 1618327174237522>
<Blob: statmike-mlops, digits/aip_train_job/AIP_DIGITS_20210413150956/model/variables/variables.index, 1618327174494912>
<Blob: statmike-mlops, digits/aip_train_job/trainer_cifar.tar.gz, 1618322304382464>
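
Because the delete loop takes a bucket and a prefix separately, one way to keep them in sync with `BUCKET_NAME` is to derive both from that URI instead of typing the prefix by hand. A minimal sketch, where `split_gcs_uri` is a hypothetical helper and the URI is the value `BUCKET_NAME` takes with the parameters defined earlier:

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError("not a gs:// URI: {!r}".format(uri))
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name in {!r}".format(uri))
    return bucket, prefix.rstrip("/")


bucket_name, prefix = split_gcs_uri("gs://statmike-mlops/digits/model/05_aip_train_job")
print(bucket_name)
print(prefix)
```

The two values could then be passed to `gcs.bucket(bucket_name)` and `list_blobs(prefix=prefix)`. Printing the matching blobs once without calling `blob.delete()` is a simple dry run before removing anything.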
