# Python Training - Vertex AI Training Custom Jobs
### IN ACTIVE DEVELOPMENT - not complete

ML Training with Python code as a Vertex AI Training Custom Job

Why? This notebook is an IDE that also happens to have:
- **compute**: CPU, Memory, GPU
- **software**: container running with Python and loaded packages like TensorFlow, PyTorch, ...
- **code**: user-written instruction for ML training

But scaling this notebook instance to run our ML training code has limitations:
- paying `$$$$` while typing and troubleshooting
- running training code multiple times with different data sources
- running training code with multiple configurations of hyperparameters for tuning
- automating training code execution based on time or events

Rather than scaling this notebook up to larger **compute**, we want to launch a fit-for-purpose job that runs our training **code** using the **software** of choice on the **compute** needed to handle the size of our training data. Vertex AI Training Custom Jobs make that simple.
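
As a minimal sketch of the hyperparameter point in the list above: each configuration can be expressed as its own list of command-line arguments, and each list could then be handed to a separate job. The flag names mirror the `--epochs` and `--batch_size` arguments used later in this notebook; the search-space values and the helper name `build_arg_sets` are hypothetical.

```python
from itertools import product

# Hypothetical search space; the values are for illustration only.
SEARCH_SPACE = {
    "epochs": [5, 10],
    "batch_size": [100, 500],
}

def build_arg_sets(space):
    """Return one list of command-line arguments per hyperparameter combination."""
    names = list(space)
    return [
        [f"--{name}={value}" for name, value in zip(names, combo)]
        for combo in product(*(space[name] for name in names))
    ]

arg_sets = build_arg_sets(SEARCH_SPACE)
for args in arg_sets:
    print(args)
```

Each printed list has the same shape as the `args` passed to a job, so one job per list is one way to cover the grid without keeping the notebook instance busy.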

Our training code can be in many forms:
- local files
    - single script
    - folders/modules
    - Python Package Distribution
- GCS Bucket
    - single script
    - folders/modules
    - Python Package Distribution
- GitHub
    - single script
    - folders/modules
    - Python Package Distribution
- Repository
    - Python Package hosted on Artifact Registry
    
Vertex AI Training Custom Jobs can use training code from:
- local files: single script
- GCS Bucket: Python Source Distribution
- Custom Container
    - Built with code originating at any of the locations above!
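
For the custom container option, a worker pool spec swaps the Python package fields for a `container_spec`. The sketch below only builds the dictionary and does not submit anything. The image URI and the helper name `container_worker_pool_spec` are hypothetical, and the image is assumed to define its own entrypoint.

```python
# Hypothetical image URI; replace with an image you have built and pushed.
CUSTOM_IMAGE = "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest"

def container_worker_pool_spec(image_uri, args, machine_type="n1-standard-4"):
    """Build a single-replica worker pool spec that runs a custom container."""
    return [
        {
            "replica_count": 1,
            "machine_spec": {"machine_type": machine_type},
            # container_spec takes the place of python_package_spec
            "container_spec": {"image_uri": image_uri, "args": list(args)},
        }
    ]

spec = container_worker_pool_spec(CUSTOM_IMAGE, ["--epochs=5", "--batch_size=100"])
print(spec)
```

A list like this could be passed as `worker_pool_specs` when constructing a `aiplatform.CustomJob`.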

note: link to Python Custom Container here and talk about workflows



---
# Usage Examples
Vertex AI Training Jobs that use:
- a local script
- GCS housed source distribution
- custom containers
    - all the workflows from Python Custom Container notebook
    
re-engineer this section to not use tensorboard, not setup experiments, and create naming in GCS that matches the repository - try not to use timestamps for clarity

---
## Vertex AI Training - Custom Jobs

Vertex AI Training has Custom Jobs that can be launched with:
- single files/modules from local disk
- source distributions from GCS URIs
- custom containers

Below are tests/examples for the single file and source distribution created in this notebook.  The custom container workflows will be examined further in the [Python Client for Cloud Build](./Python%20Client%20for%20Cloud%20Build.ipynb) notebook.

### Job Inputs

The example test jobs below are based on jobs in the `05 - TensorFlow` series and take advantage of Vertex AI Experiments and managed TensorBoard. This section creates a TensorBoard instance and gets other inputs for the jobs:

In [206]:
tb = aiplatform.Tensorboard.list(filter=f"labels.series={SERIES}")
if tb:
    tb = tb[0]
else: 
    tb = aiplatform.Tensorboard.create(display_name = SERIES, labels = {'series' : f'{SERIES}'})

In [207]:
tb.resource_name

'projects/1026793852137/locations/us-central1/tensorboards/7360834523774320640'

In [208]:
# Grant the service account the roles/storage.objectAdmin role:
# Console > IAM > select <projectnumber>-compute@developer.gserviceaccount.com > edit > add the role
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'
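
The `gcloud config list` call above returns whichever account is currently active. In a notebook instance that is typically the Compute Engine default service account, but elsewhere it could be a user account. The sketch below, with hypothetical helper names, builds the expected default address from a project number and checks that a value looks like a service account before it is passed to a job.

```python
def default_compute_service_account(project_number):
    """Return the Compute Engine default service account for a project number."""
    return f"{project_number}-compute@developer.gserviceaccount.com"

def looks_like_service_account(account):
    """Return True when the address is in a gserviceaccount.com domain."""
    return account.endswith(".gserviceaccount.com")

expected = default_compute_service_account("1026793852137")
print(expected)
print(looks_like_service_account(expected))
```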

### Custom Job: From Local Script

This is a modified version of notebook [05a - Vertex AI Custom Model - TensorFlow - Custom Job With Python File](../05%20-%20TensorFlow/05a%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Python%20File.ipynb) that uses the local script for this project.

Notes:
- This uses a single `file.py` from the local directory, not a GCS URI.
- When you run `aiplatform.CustomJob.from_local_script()`, it responds with a message confirming that the local script was copied to the GCS URI provided in the `staging_bucket` parameter.

In [209]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-tf-classification-dnn'
RUN_NAME = f'run-{TIMESTAMP}'

TRAIN_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest'
TRAIN_COMPUTE = 'n1-standard-4'
URI = f'gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models'


CMDARGS = [
    "--epochs=5",
    "--batch_size=100",
    "--var_target=Class",
    "--var_omit=transaction_id",
    f"--project_id={PROJECT_ID}",
    f"--bq_project={PROJECT_ID}",
    "--bq_dataset=fraud",
    "--bq_table=fraud_prepped",
    f"--region={REGION}",
    f"--experiment={EXPERIMENT}",
    f"--series={SERIES}",
    f"--experiment_name={EXPERIMENT_NAME}",
    f"--run_name={RUN_NAME}"
]

In [210]:
aiplatform.init(experiment = EXPERIMENT_NAME, experiment_tensorboard = tb.resource_name)

In [211]:
customJob = aiplatform.CustomJob.from_local_script(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    script_path = f"{DIR}/trainer/src/trainer/train.py",
    container_uri = TRAIN_IMAGE,
    args = CMDARGS,
    requirements = ['tensorflow_io', 'google-cloud-aiplatform==1.16.0'],
    replica_count = 1,
    machine_type = TRAIN_COMPUTE,
    accelerator_count = 0,
    base_output_dir = f"{URI}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

Training script copied to:
gs://statmike-mlops-349915/tips/packages/models/20220921105723/aiplatform-2022-09-21-10:57:25.432-aiplatform_custom_trainer_script-0.1.tar.gz.


In [212]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/3836663086874361856
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/3836663086874361856')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3836663086874361856?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+3836663086874361856
CustomJob projects/1026793852137/locations/us-central1/customJobs/3836663086874361856 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3836663086874361856 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3836663086874361856 current state:
JobState.JOB_STATE_PENDING


In [213]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Job here:\n{job_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/3836663086874361856/cpu?cloudshell=false&project=statmike-mlops-349915
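
The job link above relies on `resource_name.split('/')[-1]` to get the job ID. A slightly more defensive sketch, using a hypothetical helper named `parse_custom_job_name`, checks the expected `projects/.../locations/.../customJobs/...` layout before pulling out the parts:

```python
def parse_custom_job_name(resource_name):
    """Split a CustomJob resource name into project, location and job ID."""
    parts = resource_name.split("/")
    # Expected layout: projects/<project>/locations/<location>/customJobs/<job_id>
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "customJobs"]:
        raise ValueError(f"Unexpected CustomJob resource name: {resource_name}")
    return {"project": parts[1], "location": parts[3], "job_id": parts[5]}

# Resource name copied from the job output above
name = "projects/1026793852137/locations/us-central1/customJobs/3836663086874361856"
parsed = parse_custom_job_name(name)
print(parsed)
```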


In [214]:
print(f'Review the model output here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models/{TIMESTAMP}?project={PROJECT_ID}')

Review the model output here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/packages/models/20220921105723?project=statmike-mlops-349915


### Custom Job: From Python Source Distribution

This is a modified version of notebook [05b - Vertex AI Custom Model - TensorFlow - Custom Job With Python Source Distribution](../05%20-%20TensorFlow/05b%20-%20Vertex%20AI%20Custom%20Model%20-%20TensorFlow%20-%20Custom%20Job%20With%20Python%20Source%20Distribution.ipynb) that uses the source distribution stored in GCS for this project.

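Notes:
- This uses a source distribution (`trainer-0.1.tar.gz`) referenced by GCS URI in `package_uris`, not a local file.
- The entry point is the module named in `python_module`, here `trainer.train`.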

In [215]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-tf-classification-dnn'
RUN_NAME = f'run-{TIMESTAMP}'

TRAIN_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest'
TRAIN_COMPUTE = 'n1-standard-4'
URI = f'gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models'

CMDARGS = [
    "--epochs=5",
    "--batch_size=100",
    "--var_target=Class",
    "--var_omit=transaction_id",
    f"--project_id={PROJECT_ID}",
    f"--bq_project={PROJECT_ID}",
    "--bq_dataset=fraud",
    "--bq_table=fraud_prepped",
    f"--region={REGION}",
    f"--experiment={EXPERIMENT}",
    f"--series={SERIES}",
    f"--experiment_name={EXPERIMENT_NAME}",
    f"--run_name={RUN_NAME}"
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [f"gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/trainer/dist/trainer-0.1.tar.gz"],
            "python_module": "trainer.train",
            "args": CMDARGS
        }
    }
]
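
Before launching, it can help to confirm that the module named in `python_module` actually exists inside the source distribution. The sketch below assumes a local copy of the `.tar.gz` (for example the file that was built before uploading to GCS); the helper name `sdist_contains_module` is hypothetical.

```python
import tarfile

def sdist_contains_module(sdist_path, python_module):
    """Return True if the source distribution holds a file for python_module.

    "trainer.train" is looked up as a path ending in "trainer/train.py",
    which covers both flat and src/ layouts inside the archive.
    """
    suffix = "/" + python_module.replace(".", "/") + ".py"
    with tarfile.open(sdist_path, "r:gz") as archive:
        return any(("/" + name).endswith(suffix) for name in archive.getnames())
```

For example, `sdist_contains_module("trainer-0.1.tar.gz", "trainer.train")` would be expected to return `True` for a local copy of the archive referenced in `package_uris`, if it was built from the `trainer` package used in this notebook.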

In [216]:
aiplatform.init(experiment = EXPERIMENT_NAME, experiment_tensorboard = tb.resource_name)

In [217]:
customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/{TIMESTAMP}",
    staging_bucket = f"{URI}/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

In [218]:
customJob.run(
    service_account = SERVICE_ACCOUNT,
    tensorboard = tb.resource_name
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/3756442718511824896
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/3756442718511824896')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3756442718511824896?project=1026793852137
View Tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7360834523774320640+experiments+3756442718511824896
CustomJob projects/1026793852137/locations/us-central1/customJobs/3756442718511824896 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3756442718511824896 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/3756442718511824896 current state:
JobState.JOB_STATE_PENDING


In [219]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Job here:\n{job_link}')

Review the Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/3756442718511824896/cpu?cloudshell=false&project=statmike-mlops-349915


In [220]:
print(f'Review the model output here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID}/{SERIES}/{EXPERIMENT}/models/{TIMESTAMP}?project={PROJECT_ID}')

Review the model output here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/tips/packages/models/20220921110542?project=statmike-mlops-349915
