# Python Training - Vertex AI Training Custom Jobs
### IN ACTIVE DEVELOPMENT - not complete

ML Training with Python code as a Vertex AI Training Custom Job

Why?  This notebook is an IDE that also happens to have:
- **compute**: CPU, Memory, GPU
- **software**: container running with Python and loaded packages like TensorFlow, PyTorch, ...
- **code**: user-written instruction for ML training

But scaling this notebook instance to run our ML training code has limitations:
- paying `$$$$` while typing and troubleshooting
- running training code multiple times with different data sources
- running training code with multiple configurations of hyperparameters for tuning
- automating training code execution based on time or events

Rather than scaling this notebook up to larger **compute**, we want to launch a fit-for-purpose job that runs our training **code** using the **software** of choice on the **compute** needed to handle the size of our training data.  Vertex AI Training Custom Jobs make that simple.

Our training code can be in many locations and forms:
- local files
    - single script
    - folders/modules
    - Python Package Distribution
- GCS Bucket
    - single script
    - folders/modules
    - Python Package Distribution
- GitHub
    - single script
    - folders/modules
    - Python Package Distribution
- Repository
    - Python Package hosted on Artifact Registry
    
Vertex AI Training Custom Jobs can use training code from:
- local files: single script
- GCS Bucket: Python Source Distribution
- Custom Container
    - Built with code originating at any of the locations and forms above!
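
As a rough sketch of how these options surface when a job is defined, the helpers below build the two `worker_pool_specs` shapes used later in this notebook: a `container_spec` for a custom container and a `python_package_spec` for a source distribution in GCS. The function names and the placeholder image and package values are hypothetical; only the dictionary keys are taken from the job definitions in this notebook.

```python
# Hypothetical helpers: function names and example values are illustrative.
# The dictionary keys mirror the worker pool specs used later in this notebook.

def container_worker_pool(image_uri, args, machine_type="n1-standard-4"):
    """Worker pool spec that runs a custom container image."""
    return [{
        "replica_count": 1,
        "machine_spec": {"machine_type": machine_type, "accelerator_count": 0},
        "container_spec": {"image_uri": image_uri, "command": [], "args": list(args)},
    }]


def package_worker_pool(executor_image_uri, package_uri, python_module, args,
                        machine_type="n1-standard-4"):
    """Worker pool spec that runs a module from a Python source distribution in GCS."""
    return [{
        "replica_count": 1,
        "machine_spec": {"machine_type": machine_type, "accelerator_count": 0},
        "python_package_spec": {
            "executor_image_uri": executor_image_uri,
            "package_uris": [package_uri],
            "python_module": python_module,
            "args": list(args),
        },
    }]


# Placeholder values, not real resources
container_pool = container_worker_pool(
    "REGION-docker.pkg.dev/PROJECT/REPO/IMAGE", ["--epochs=10"]
)
package_pool = package_worker_pool(
    "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest",
    "gs://BUCKET/path/trainer-0.1.tar.gz",
    "trainer.train",
    ["--epochs=10"],
)
print(sorted(container_pool[0]))
print(sorted(package_pool[0]))
```

Either list can be passed as `worker_pool_specs` when defining an `aiplatform.CustomJob`, as the examples below do with hand-written dictionaries.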

---

**Prerequisites:**

The examples below use:
- the code in various formats created in the [Python Packages](./Python%20Packages.ipynb) notebook
- the custom containers created in multiple workflows by the [Python Custom Containers](./Python%20Custom%20Containers.ipynb) notebook



---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'training'
SERIES = 'tips'

packages:

In [3]:
import os, shutil
import pkg_resources
from datetime import datetime

from google.cloud import aiplatform

clients:

In [4]:
aiplatform.init(project = PROJECT_ID, location = REGION)

parameters:

In [5]:
DIR = f'temp/{EXPERIMENT}'

In [6]:
# Give service account roles/storage.objectAdmin permissions
# Console > IAM > Select Account <projectnumber>-compute@developer.gserviceaccount.com > edit - give role
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'

environment:

In [7]:
# remove directory named DIR if exists
shutil.rmtree(DIR, ignore_errors = True)

# create directory DIR
os.makedirs(DIR)

# check for existence of DIR
print('DIR exists? ', os.path.exists(DIR))

# list contents of directory one level higher than DIR
os.listdir(DIR + '/../')

DIR exists?  True


['job-parms', 'gcs', 'containers', 'multiprocess', 'packages', 'training']

---
## Common Prep for Examples

### Inputs & Parameters

In [9]:
# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters
EPOCHS = 10
BATCH_SIZE = 100

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Experiment Tracking
FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'dnn'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'

# Resources
TRAIN_COMPUTE = 'n1-standard-4'
REPOSITORY = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PROJECT_ID}-docker"

# parameters
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"

### TensorBoard

The example jobs below are based on jobs in the `05 - TensorFlow` series and take advantage of Vertex AI Experiments and managed TensorBoard.  This section retrieves an existing TensorBoard instance, or creates one if none exists, and gets other inputs for the jobs:

In [10]:
tb = aiplatform.Tensorboard.list(filter=f"labels.series={SERIES}")
if tb:
    tb = tb[0]
else: 
    tb = aiplatform.Tensorboard.create(display_name = SERIES, labels = {'series' : f'{SERIES}'})

In [11]:
tb.resource_name

'projects/1026793852137/locations/us-central1/tensorboards/7360834523774320640'

### Experiment

The code in this section initializes the experiment that represents this notebook.  Throughout the notebook sections, the model training and evaluation information is logged to the experiment as an experiment run using:

- [.log_params](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_log_params)
- [.log_metrics](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_log_metrics)
- [.log_time_series_metrics](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform#google_cloud_aiplatform_log_time_series_metrics)

In [12]:
aiplatform.init(experiment = EXPERIMENT_NAME, experiment_tensorboard = tb.resource_name)

---
## Usage Examples
Vertex AI Training Custom Jobs that use:
- a local script
- a source distribution stored in GCS
- custom containers
    - all the workflows from the Python Custom Containers notebook

This section shows examples of running Vertex AI Custom Jobs.

---
### Custom Job With Custom Container - Workflow 1 - Copy Script To Container
<a id = 'workflow1'></a>

The custom container used here was created by [Python Custom Containers - Workflow 1](./Python%20Custom%20Containers.ipynb#workflow1).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).


Job Parameters:

In [15]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
WORKFLOW = 'workflow_1'
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"
WORKFLOW_IMAGE = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": WORKFLOW_IMAGE,
            "command": [],
            "args": CMDARGS
        }
    }
]
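
The `CMDARGS` list is handed to the container as command-line arguments. The training script itself lives in the Python Packages notebook and is not shown here, so the following is only a minimal sketch of how an entrypoint might parse these flags with `argparse`. The function name `parse_training_args` and the defaults are hypothetical; the flag names are copied from `CMDARGS`, and the space-delimited handling of `--var_omit` follows the comment on `VAR_OMIT` above.

```python
import argparse


def parse_training_args(argv):
    """Parse flags shaped like CMDARGS (hypothetical entrypoint helper)."""
    parser = argparse.ArgumentParser(description="Sketch of training arguments")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--var_target", type=str)
    parser.add_argument("--var_omit", type=str, default="")
    # The remaining flags are plain strings
    for name in ("project_id", "bq_project", "bq_dataset", "bq_table", "region",
                 "experiment", "series", "experiment_name", "run_name"):
        parser.add_argument(f"--{name}", type=str)
    args = parser.parse_args(argv)
    # VAR_OMIT is a single space-delimited string; split it into a list of names
    args.var_omit = args.var_omit.split()
    return args


# Example values only; "splits" is an illustrative second column to omit
example_args = [
    "--epochs=10",
    "--batch_size=100",
    "--var_target=Class",
    "--var_omit=transaction_id splits",
    "--run_name=run-workflow-1-20220922193615",
]
parsed = parse_training_args(example_args)
print(parsed.epochs, parsed.batch_size, parsed.var_omit)
```

Using the `--flag=value` form, as `CMDARGS` does, keeps a value that contains spaces inside a single list element.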

Define the `aiplatform.CustomJob`:

In [17]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [18]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/3255733919315656704
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/3255733919315656704')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3255733919315656704?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+3255733919315656704
CustomJob projects/1026793852137/locations/us-central1/customJobs/3255733919315656704 current state:
JobState.JOB_STATE_QUEUED
CustomJob projects/1026793852137/locations/us-central1/customJobs/3255733919315656704 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3255733919315656704 current state:
JobState.JOB_STATE_PENDING

Review the Job:

In [19]:
customJob.display_name

'training_tips_workflow_1_20220922193615'

In [20]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/3255733919315656704'

In [21]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

In [22]:
print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/3255733919315656704/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+3255733919315656704
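
The two link cells above repeat in every workflow section. A small helper can build both links from the resource names; this is a sketch that reuses the same URL patterns as the cells above, and the function name `job_review_links` is hypothetical.

```python
def job_review_links(job_resource_name, tensorboard_resource_name, region, project_id):
    """Build the console and TensorBoard links using the URL patterns from this notebook."""
    # The job ID is the last segment of the CustomJob resource name
    job_id = job_resource_name.split("/")[-1]
    job_link = (
        f"https://console.cloud.google.com/vertex-ai/locations/{region}/training/"
        f"{job_id}/cpu?cloudshell=false&project={project_id}"
    )
    # The TensorBoard experiment path replaces "/" with "+" in the resource name
    board_link = (
        f"https://{region}.tensorboard.googleusercontent.com/experiment/"
        f"{tensorboard_resource_name.replace('/', '+')}+experiments+{job_id}"
    )
    return job_link, board_link


# Values copied from the Workflow 1 output above
job_link, board_link = job_review_links(
    "projects/1026793852137/locations/us-central1/customJobs/3255733919315656704",
    "projects/1026793852137/locations/us-central1/tensorboards/7360834523774320640",
    "us-central1",
    "statmike-mlops-349915",
)
print(f"Review the Job here:\n{job_link}")
print(f"Review the TensorBoard From the Job here:\n{board_link}")
```

In the notebook the call would be `job_review_links(customJob.resource_name, tb.resource_name, REGION, PROJECT_ID)`.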


---
### Custom Job With Custom Container - Workflow 2 - Copy Folder To Container
<a id = 'workflow2'></a>

The custom container used here was created by [Python Custom Containers - Workflow 2](./Python%20Custom%20Containers.ipynb#workflow2).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

Job Parameters:

In [23]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
WORKFLOW = 'workflow_2'
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"
WORKFLOW_IMAGE = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": WORKFLOW_IMAGE,
            "command": [],
            "args": CMDARGS
        }
    }
]

Define the `aiplatform.CustomJob`:

In [27]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [28]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/1381392049399398400
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/1381392049399398400')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1381392049399398400?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+1381392049399398400
CustomJob projects/1026793852137/locations/us-central1/customJobs/1381392049399398400 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/1381392049399398400 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/1381392049399398400 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [29]:
customJob.display_name

'training_tips_workflow_2_20220922212304'

In [30]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/1381392049399398400'

In [31]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

In [32]:
print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/1381392049399398400/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+1381392049399398400


---
### Custom Job With Custom Container - Workflow 3 - Copy Package To Container
<a id = 'workflow3'></a>

The custom container used here was created by [Python Custom Containers - Workflow 3](./Python%20Custom%20Containers.ipynb#workflow3).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

Job Parameters:

In [33]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
WORKFLOW = 'workflow_3'
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"
WORKFLOW_IMAGE = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": WORKFLOW_IMAGE,
            "command": [],
            "args": CMDARGS
        }
    }
]

Define the `aiplatform.CustomJob`:

In [34]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [35]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/4519934796746457088
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/4519934796746457088')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/4519934796746457088?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+4519934796746457088
CustomJob projects/1026793852137/locations/us-central1/customJobs/4519934796746457088 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/4519934796746457088 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/4519934796746457088 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [40]:
customJob.display_name

'training_tips_workflow_3_20220923184924'

In [41]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/4519934796746457088'

In [42]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

In [43]:
print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/4519934796746457088/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+4519934796746457088


---
### Custom Job With Custom Container - Workflow 4 - pip install package from GCS to container
<a id = 'workflow4'></a>

The custom container used here was created by [Python Custom Containers - Workflow 4](./Python%20Custom%20Containers.ipynb#workflow4).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

---
### Custom Job With Custom Container - Workflow 5 - pip install package from GitHub to container
<a id = 'workflow5'></a>

The custom container used here was created by [Python Custom Containers - Workflow 5](./Python%20Custom%20Containers.ipynb#workflow5).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

In [44]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
WORKFLOW = 'workflow_5'
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"
WORKFLOW_IMAGE = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": WORKFLOW_IMAGE,
            "command": [],
            "args": CMDARGS
        }
    }
]

Define the `aiplatform.CustomJob`:

In [45]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [46]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/1618490736813015040
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/1618490736813015040')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1618490736813015040?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+1618490736813015040
CustomJob projects/1026793852137/locations/us-central1/customJobs/1618490736813015040 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/1618490736813015040 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/1618490736813015040 current state:
JobState.JOB_STATE_PENDING


Review the Job:

In [51]:
customJob.display_name

'training_tips_workflow_5_20220923191307'

In [52]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/1618490736813015040'

In [53]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

In [54]:
print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/1618490736813015040/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+1618490736813015040


---
### Custom Job With Custom Container - Workflow 6 - pip install package from Artifact Registry to container
<a id = 'workflow6'></a>

The custom container used here was created by [Python Custom Containers - Workflow 6](./Python%20Custom%20Containers.ipynb#workflow6).

> This is a modified version of notebook [05c - Vertex AI Custom Model - TensorFlow - Custom Job With Custom Container](../05%20-%20TensorFlow/05c%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Custom%20Container.ipynb).

In [55]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
WORKFLOW = 'workflow_6'
RUN_NAME = f"run-{WORKFLOW.replace('_', '-')}-{TIMESTAMP}"
WORKFLOW_IMAGE = f"{REPOSITORY}/tips_trainer_{WORKFLOW}"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": WORKFLOW_IMAGE,
            "command": [],
            "args": CMDARGS
        }
    }
]

Define the `aiplatform.CustomJob`:

In [56]:
customJob = aiplatform.CustomJob(
    display_name = f'{EXPERIMENT}_{SERIES}_{WORKFLOW}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{WORKFLOW}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Run the job:

In [57]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/2791256227277963264
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/2791256227277963264')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2791256227277963264?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+2791256227277963264
CustomJob projects/1026793852137/locations/us-central1/customJobs/2791256227277963264 current state:
JobState.JOB_STATE_QUEUED
CustomJob projects/1026793852137/locations/us-central1/customJobs/2791256227277963264 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/2791256227277963264 current state:
JobState.JOB_STATE_PENDING

Review the Job:

In [58]:
customJob.display_name

'training_tips_workflow_6_20220924020627'

In [59]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/2791256227277963264'

In [60]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.resource_name.split('/')[-1]}"

In [61]:
print(f'Review the Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/2791256227277963264/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+2791256227277963264
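
The workflow sections above differ only in the value of `WORKFLOW`. As a sketch of how the per-workflow names could be generated in a loop rather than repeated cell by cell, the hypothetical helper `workflow_job_names` below derives the run name, image URI, and display name with the same f-string patterns used above. The repository value is a placeholder, and `workflow_4` is left out because this notebook does not run a job for it.

```python
from datetime import datetime


def workflow_job_names(workflow, repository, experiment="training", series="tips",
                       timestamp=None):
    """Derive per-workflow names using the patterns from the cells above."""
    timestamp = timestamp or datetime.now().strftime("%Y%m%d%H%M%S")
    return {
        "timestamp": timestamp,
        # Run names use hyphens in place of underscores, as in RUN_NAME above
        "run_name": f"run-{workflow.replace('_', '-')}-{timestamp}",
        "image_uri": f"{repository}/tips_trainer_{workflow}",
        "display_name": f"{experiment}_{series}_{workflow}_{timestamp}",
    }


# Placeholder repository path, following the REPOSITORY pattern in this notebook
repository = "us-central1-docker.pkg.dev/PROJECT_ID/PROJECT_ID-docker"

all_names = {}
for workflow in ["workflow_1", "workflow_2", "workflow_3", "workflow_5", "workflow_6"]:
    # A fixed timestamp keeps the example reproducible; omit it to use the current time
    all_names[workflow] = workflow_job_names(workflow, repository, timestamp="20220922193615")
    print(all_names[workflow]["display_name"], all_names[workflow]["run_name"])
```

Each dictionary could then feed the `CMDARGS`, `WORKER_POOL_SPEC`, and `aiplatform.CustomJob` definitions shown in the sections above.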


---
# CONTENT IN DEVELOPMENT

- also show a local run of the code on the notebook instance

---
## Vertex AI Training - Custom Jobs

Vertex AI Training has Custom Jobs that can be launched with:
- single files/modules from local disk
- source distributions from GCS URIs
- custom containers

Below are tests/examples for the single file and source distribution created in this notebook.  The custom container workflows will be examined further in the [Python Client for Cloud Build](./Python%20Client%20for%20Cloud%20Build.ipynb) notebook.

### Custom Job: From Local Script

This is a modified version of notebook [05a - Vertex AI Custom Model - TensorFlow - Custom Job With Python File](../05%20-%20TensorFlow/05a%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Python%20File.ipynb) that uses the local script for this project.

Notes:
- This uses a single `file.py` from the local directory, not a GCS URI
- When you run `aiplatform.CustomJob.from_local_script()` it responds with a message confirming the local script was copied to the GCS URI provided in the parameter `staging_bucket = `.

In [209]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-tf-classification-dnn'
RUN_NAME = f'run-{TIMESTAMP}'

TRAIN_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest'
TRAIN_COMPUTE = 'n1-standard-4'
URI = f'gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models'


CMDARGS = [
    "--epochs=5",
    "--batch_size=100",
    "--var_target=Class",
    "--var_omit=transaction_id",
    f"--project_id={PROJECT_ID}",
    f"--bq_project={PROJECT_ID}",
    "--bq_dataset=fraud",
    "--bq_table=fraud_prepped",
    f"--region={REGION}",
    f"--experiment={EXPERIMENT}",
    f"--series={SERIES}",
    f"--experiment_name={EXPERIMENT_NAME}",
    f"--run_name={RUN_NAME}"
]

In [211]:
customJob = aiplatform.CustomJob.from_local_script(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    script_path = f"{DIR}/trainer/src/trainer/train.py",
    container_uri = TRAIN_IMAGE,
    args = CMDARGS,
    requirements = ['tensorflow_io', 'google-cloud-aiplatform==1.16.0'],
    replica_count = 1,
    machine_type = TRAIN_COMPUTE,
    accelerator_count = 0,
    base_output_dir = f"{URI}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Training script copied to:
gs://statmike-mlops-349915/tips/packages/models/20220921105723/aiplatform-2022-09-21-10:57:25.432-aiplatform_custom_trainer_script-0.1.tar.gz.


In [212]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/3836663086874361856
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/3836663086874361856')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3836663086874361856?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+3836663086874361856
CustomJob projects/1026793852137/locations/us-central1/customJobs/3836663086874361856 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3836663086874361856 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3836663086874361856 current state:
JobState.JOB_STATE_PENDING


In [213]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Job here:\n{job_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/3836663086874361856/cpu?cloudshell=false&project=statmike-mlops-349915


In [214]:
print(f'Review the model output here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models/{TIMESTAMP}?project={PROJECT_ID}')

Review the model output here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/packages/models/20220921105723?project=statmike-mlops-349915
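
The console link above is built by hand from the pieces of the output path. A minimal sketch of deriving it directly from the `gs://` URI passed as `base_output_dir` follows; the function name `storage_browser_url` is hypothetical, and the URL pattern is the one used in the cell above.

```python
def storage_browser_url(gcs_uri, project_id):
    """Convert a gs:// URI into the Cloud Console storage browser URL used above."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {gcs_uri!r}")
    # Everything after gs:// is "<bucket>/<object prefix>"
    path = gcs_uri[len(prefix):].rstrip("/")
    return f"https://console.cloud.google.com/storage/browser/{path}?project={project_id}"


# Values copied from the output above
output_url = storage_browser_url(
    "gs://statmike-mlops-349915/tips/packages/models/20220921105723",
    "statmike-mlops-349915",
)
print(f"Review the model output here:\n{output_url}")
```

In the notebook the call would be `storage_browser_url(f"{URI}/{TIMESTAMP}", PROJECT_ID)`.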


### Custom Job: From Python Source Distribution

This is a modified version of notebook [05b - Vertex AI Custom Model - TensorFlow - Custom Job With Python Source Distribution](../05%20-%20TensorFlow/05b%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Python%20Source%20Distribution.ipynb) that uses the source distribution stored in GCS for this project.

Notes:

In [215]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-tf-classification-dnn'
RUN_NAME = f'run-{TIMESTAMP}'

TRAIN_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest'
TRAIN_COMPUTE = 'n1-standard-4'
URI = f'gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models'

CMDARGS = [
    "--epochs=5",
    "--batch_size=100",
    "--var_target=Class",
    "--var_omit=transaction_id",
    f"--project_id={PROJECT_ID}",
    f"--bq_project={PROJECT_ID}",
    "--bq_dataset=fraud",
    "--bq_table=fraud_prepped",
    f"--region={REGION}",
    f"--experiment={EXPERIMENT}",
    f"--series={SERIES}",
    f"--experiment_name={EXPERIMENT_NAME}",
    f"--run_name={RUN_NAME}"
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [f"gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/trainer/dist/trainer-0.1.tar.gz"],
            "python_module": "trainer.train",
            "args": CMDARGS
        }
    }
]

In [216]:
aiplatform.init(experiment = EXPERIMENT_NAME, experiment_tensorboard = tb.resource_name)

In [217]:
customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

In [218]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/3756442718511824896
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/3756442718511824896')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3756442718511824896?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+3756442718511824896
CustomJob projects/1026793852137/locations/us-central1/customJobs/3756442718511824896 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3756442718511824896 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3756442718511824896 current state:
JobState.JOB_STATE_PENDING


In [219]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Job here:\n{job_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/3756442718511824896/cpu?cloudshell=false&project=statmike-mlops-349915


In [220]:
print(f'Review the model output here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models/{TIMESTAMP}?project={PROJECT_ID}')

Review the model output here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/packages/models/20220921110542?project=statmike-mlops-349915
