# 08f - Vertex AI > Training > Training Pipelines - With Custom Container

# <IN ACTIVE DEVELOPMENT - NOT COMPLETE>

Dev Notes:
- Python Kernel
- Orchestrates Vertex AI Services Sequentially

Workflow:
- code to script
- script to container
- model to Vertex AI Model Registry
- predictions with Vertex AI: Batch and Online

Next Steps:
- Pipeline to conduct steps
- use [reticulate](https://rstudio.github.io/reticulate/) for an R centric workflow

---
## Setup

### Package Installs (if needed)

This notebook uses the Python Clients for
- Google Service Usage
    - to enable APIs (Artifact Registry and Cloud Build)
- Artifact Registry
    - to create repositories for Python packages and Docker containers
- Cloud Build
    - to build custom Docker containers

The cells below check whether the required Python libraries are installed.  If one is missing, the cell prints a message naming the package and runs the matching `pip install` command.  These installs must complete before continuing with this notebook.
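
The three checks follow one pattern, so they can also be written as a single helper. This is a minimal sketch, not part of the original notebook: `REQUIRED` and `missing_packages` are names introduced here for illustration, and the helper only reports missing packages rather than installing them.

```python
import importlib.util

# import name -> pip package name (the three pairs used in this notebook)
REQUIRED = {
    "google.cloud.service_usage_v1": "google-cloud-service-usage",
    "google.cloud.artifactregistry_v1": "google-cloud-artifact-registry",
    "google.cloud.devtools.cloudbuild": "google-cloud-build",
}

def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    missing = []
    for module_name, pip_name in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # raised when a parent package (for example `google.cloud`) is absent
            found = False
        if not found:
            missing.append(pip_name)
    return missing

for pip_name in missing_packages(REQUIRED):
    print(f"You need to pip install {pip_name}")
```

If the loop prints nothing, all three client libraries are importable in the current kernel.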

In [2]:
try:
    import google.cloud.service_usage_v1
except ImportError:
    print('You need to pip install google-cloud-service-usage')
    !pip install google-cloud-service-usage -q

In [3]:
try:
    import google.cloud.artifactregistry_v1
except ImportError:
    print('You need to pip install google-cloud-artifact-registry')
    !pip install google-cloud-artifact-registry -q

In [4]:
try:
    import google.cloud.devtools.cloudbuild
except ImportError:
    print('You need to pip install google-cloud-build')
    !pip install google-cloud-build

### Environment

inputs:

In [26]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [27]:
REGION = 'us-central1'
EXPERIMENT = '08f'
SERIES = '08'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Resources
TRAIN_COMPUTE = 'n1-standard-4'
DEPLOY_COMPUTE = 'n1-standard-4'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with comma delimiters (train.R inserts it directly into a SQL EXCEPT list)

packages:

In [49]:
from google.cloud import aiplatform
from datetime import datetime
import os, shutil, glob
import pkg_resources
from IPython.display import Markdown as md
from google.cloud import service_usage_v1
from google.cloud.devtools import cloudbuild_v1
from google.cloud import artifactregistry_v1
from google.cloud import storage
from google.cloud import bigquery
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import json
import numpy as np
import pandas as pd

clients:

In [29]:
aiplatform.init(project=PROJECT_ID, location=REGION)
bq = bigquery.Client()
gcs = storage.Client()
su_client = service_usage_v1.ServiceUsageClient()
ar_client = artifactregistry_v1.ArtifactRegistryClient()
cb_client = cloudbuild_v1.CloudBuildClient()

parameters:

In [30]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"

In [31]:
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'

List the service account's current roles:

In [32]:
!gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$SERVICE_ACCOUNT" --format='table(bindings.role)' --flatten="bindings[].members"

ROLE
roles/bigquery.admin
roles/owner
roles/run.admin
roles/storage.objectAdmin


>Note: If the resulting list is missing [roles/storage.objectAdmin](https://cloud.google.com/storage/docs/access-control/iam-roles) then [revisit the setup notebook](../00%20-%20Setup/00%20-%20Environment%20Setup.ipynb#permissions) and add this permission to the service account with the provided instructions.

environment:

In [33]:
!rm -rf {DIR}
!mkdir -p {DIR}

Experiment Tracking:

In [34]:
FRAMEWORK = 'r'
TASK = 'classification'
MODEL_TYPE = 'logistic_regression'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'
RUN_NAME = f'run-{TIMESTAMP}'

### Enable APIs

Using Cloud Build and Artifact Registry requires enabling these APIs for the Google Cloud Project.

Options for enabling these APIs.  This notebook uses option 2.
 1. Use the APIs & Services page in the console: https://console.cloud.google.com/apis
     - `+ Enable APIs and Services`
     - Search for Cloud Build and Enable
     - Search for Artifact Registry and Enable
 2. Use [Google Service Usage](https://cloud.google.com/service-usage/docs) API from Python
     - [Python Client For Service Usage](https://github.com/googleapis/python-service-usage)
     - [Python Client Library Documentation](https://cloud.google.com/python/docs/reference/serviceusage/latest)
     
The following code cells use the Service Usage Client to:
- get the state of the service
- if 'DISABLED':
    - Try enabling the service and return the state after trying
- if 'ENABLED' print the state for confirmation
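
The steps in this list can be collected into one reusable function. The sketch below is an illustration under assumptions: `ensure_service_enabled` is a hypothetical name, and it passes the request as a plain dictionary, which the generated Google Cloud clients accept in place of a request object.

```python
def ensure_service_enabled(client, project_id, service):
    """Enable `service` for `project_id` if it is disabled; return the final state name."""
    name = f"projects/{project_id}/services/{service}"
    # read the current state of the service
    state = client.get_service(request={"name": name}).state.name
    if state == "DISABLED":
        # enabling is a long-running operation; result() blocks until it finishes
        operation = client.enable_service(request={"name": name})
        state = operation.result().service.state.name
    return state
```

With the client from the setup section, a call such as `ensure_service_enabled(su_client, PROJECT_ID, 'cloudbuild.googleapis.com')` would return `'ENABLED'` once the service is on.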

#### Artifact Registry

In [35]:
artifactregistry = su_client.get_service(
    request = service_usage_v1.GetServiceRequest(
        name = f'projects/{PROJECT_ID}/services/artifactregistry.googleapis.com'
    )
).state.name


if artifactregistry == 'DISABLED':
    print(f'Artifact Registry is currently {artifactregistry} for project: {PROJECT_ID}')
    print(f'Trying to Enable...')
    operation = su_client.enable_service(
        request = service_usage_v1.EnableServiceRequest(
            name = f'projects/{PROJECT_ID}/services/artifactregistry.googleapis.com'
        )
    )
    response = operation.result()
    if response.service.state.name == 'ENABLED':
        print(f'Artifact Registry is now enabled for project: {PROJECT_ID}')
    else:
        print(response)
else:
    print(f'Artifact Registry already enabled for project: {PROJECT_ID}')

Artifact Registry already enabled for project: statmike-mlops-349915


#### Cloud Build

In [36]:
cloudbuild = su_client.get_service(
    request = service_usage_v1.GetServiceRequest(
        name = f'projects/{PROJECT_ID}/services/cloudbuild.googleapis.com'
    )
).state.name


if cloudbuild == 'DISABLED':
    print(f'Cloud Build is currently {cloudbuild} for project: {PROJECT_ID}')
    print(f'Trying to Enable...')
    operation = su_client.enable_service(
        request = service_usage_v1.EnableServiceRequest(
            name = f'projects/{PROJECT_ID}/services/cloudbuild.googleapis.com'
        )
    )
    response = operation.result()
    if response.service.state.name == 'ENABLED':
        print(f'Cloud Build is now enabled for project: {PROJECT_ID}')
    else:
        print(response)
else:
    print(f'Cloud Build already enabled for project: {PROJECT_ID}')

Cloud Build already enabled for project: statmike-mlops-349915


---
## Training & Serving

### R Script for Training

This notebook trains the same R model as [08 - Vertex AI Custom Model - R - in Notebook](./08%20-%20Vertex%20AI%20Custom%20Model%20-%20R%20-%20in%20Notebook.ipynb). The training code was first modified and saved as an R script, as shown in [08 - Vertex AI Custom Model - R - Notebook to Script](08%20-%20Vertex%20AI%20Custom%20Model%20-%20R%20-%20Notebook%20to%20Script.ipynb), which stores the script in [`./code/train.R`](./code/train.R).

**Review the script:**

In [37]:
SCRIPT_PATH = './code/train.R'

with open(SCRIPT_PATH, 'r') as file:
    data = file.read()
md(f"```R\n\n{data}\n```")

```R


# library import
library(bigrquery)
library(dplyr)

# inputs
args <- commandArgs(trailingOnly = TRUE)
project_id <- args[1]
region <- args[2]
experiment <- args[3]
series <- args[4]
bq_project <- args[5]
bq_dataset <- args[6]
bq_table <- args[7]
var_target <- args[8]
var_omit <- args[9]

# data source
get_data <- function(s){
    query = sprintf('SELECT * EXCEPT(%s, splits) FROM `%s.%s.%s` WHERE splits = "%s"', var_omit, bq_project, bq_dataset, bq_table, s)
    table <- bq_project_query(bq_project, query)
    ds <- bq_table_download(table)
    return(ds)
}
train <- get_data("TRAIN")
test <- get_data("TEST")

# logistic regression model
model <- glm(
    Class ~ .,
    data = train,
    family = binomial)

# predictions for evaluation
preds <- predict(model, test, type = "response")

# evaluate
actual <- test[, var_target]
names(actual) <- 'actual'
pred <- tibble(round(preds))
names(pred) <- 'pred'
results <- cbind(actual, pred)
cm <- table(results)

# save model to file
saveRDS(model, "model.rds")

```
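
The `get_data` function in the script builds its query with `sprintf`, placing `var_omit` verbatim inside `EXCEPT(...)`. The Python mirror below is only a sketch for inspecting the resulting SQL from the notebook kernel. `build_query` is a hypothetical helper and `my-project` is a placeholder project ID.

```python
def build_query(var_omit, bq_project, bq_dataset, bq_table, split):
    """Python mirror of the sprintf() call inside get_data() in train.R."""
    return (
        f"SELECT * EXCEPT({var_omit}, splits) "
        f"FROM `{bq_project}.{bq_dataset}.{bq_table}` "
        f'WHERE splits = "{split}"'
    )

# placeholder project ID; dataset and table names are the ones used in this notebook
print(build_query("transaction_id", "my-project", "fraud", "fraud_prepped", "TRAIN"))
```

Because the value is inserted as-is, anything passed for `var_omit` has to be valid column-list syntax for BigQuery at that position.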

Make a copy of the script in the notebook's temp folder and append code that saves the model to the GCS model directory:

In [41]:
shutil.copyfile(SCRIPT_PATH, f'./{DIR}/train.R')

'./temp/08f/train.R'

In [43]:
%%writefile -a './temp/08f/train.R'

# use Vertex AI Training Pre-Defined Environment Variables to Write to GCS
Sys.getenv()
system2('gsutil', c('cp', 'model.rds', Sys.getenv('AIP_MODEL_DIR')))

Appending to ./temp/08f/train.R


### R Script for Serving

To serve the model, another script that uses [plumber](https://www.rplumber.io/) is created:

**Review the script:**

In [44]:
%%writefile './temp/08f/serve.R'

# library import
library(plumber)

# use Vertex AI Prediction Pre-Defined Environment Variables to Read from GCS
Sys.getenv()
system2('gsutil', c('cp', '-r', Sys.getenv('AIP_STORAGE_URI'), '.'))

# import model
model <- readRDS('artifacts/model.rds')

# prediction route function
predict_route <- function(req, res){
    print("Processing Prediction Request...")
    df <- as.data.frame(req$body$instances)
    preds <- predict(model, df, type = "response")
    return(list(predictions = preds))
}

# serving
print("Start Serving Process...")
pr() %>%
    pr_get(Sys.getenv("AIP_HEALTH_ROUTE"), function() "OK") %>%
    pr_post(Sys.getenv("AIP_PREDICT_ROUTE"), predict_route) %>%
    pr_run(host = "0.0.0.0", port=as.integer(Sys.getenv("AIP_HTTP_PORT", 8080)))

Writing ./temp/08f/serve.R
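
The prediction route reads `req$body$instances` and returns a list named `predictions`, so a client exchanges JSON bodies with those two keys. The sketch below shows that envelope from the Python side. The helper names and the feature names are hypothetical, and the row-per-object layout of `instances` is an assumption: how plumber converts it into a data frame has not been verified here.

```python
import json

def build_predict_body(rows):
    """Wrap feature rows in the {"instances": [...]} envelope read by predict_route."""
    return json.dumps({"instances": rows})

def read_predictions(response_text):
    """Pull the list stored under "predictions" out of a JSON response body."""
    return json.loads(response_text)["predictions"]

# hypothetical feature names and values, for illustration only
body = build_predict_body([{"feature_a": 0.1, "feature_b": 2.5}])
print(body)
```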


### Creating a Custom Container with Cloud Build

Cloud Build creates and manages the build on GCP.  A build is created through the API by providing:
- location of the source
- instructions
- location to store the built artifacts

The instruction part of Cloud Build has options:
- Dockerfile
- Build Config file (YAML or JSON)
- Cloud Native Buildpacks

This notebook uses the approach of using the Python Client for Cloud Build and not referencing any local files.  For that reason, the first step is creating a Dockerfile for the workflow and storing it in GCS. The next step is running Cloud Build and using the client to specify the Build config rather than a config file.  The steps of the build config start with getting the code (git clone, or copy from GCS) and copying the Dockerfile.  

There are many workflows for creating containers with ML training code.  Many of the most common ones are explored in the tips notebook [Python Custom Containers](../Tips/Python%20Custom%20Containers.ipynb).  The method used here is the simplest: copy the training code directly into the container.  The other methods include packaging the training code as a Python distribution and using `pip install` from GCS, GitHub, or even Artifact Registry as a private repository.

#### Create the Dockerfile
A basic Dockerfile that takes the base image, copies the code in, and adds additional installs:

In [45]:
%%writefile './temp/08f/Dockerfile'
FROM gcr.io/deeplearning-platform-release/r-cpu.4-1:latest

WORKDIR /root

# copy the training and serving scripts
COPY train.R /root/train.R
COPY serve.R /root/serve.R

# install system and R dependencies
RUN apt-get update && apt-get install -y gfortran
RUN R -e 'install.packages(c("plumber"))'

EXPOSE 8080

Writing ./temp/08f/Dockerfile


#### Store Resources in Cloud Storage

In [57]:
os.listdir(DIR)

['train.R', 'serve.R', '.ipynb_checkpoints', 'Dockerfile']

In [58]:
bucket = gcs.lookup_bucket(PROJECT_ID)
SOURCEPATH = f'{SERIES}/{EXPERIMENT}'

In [59]:
for file in [f for f in os.listdir(DIR) if not f.startswith('.')]:
    print(file)
    blob = bucket.blob(f'{SOURCEPATH}/{file}')
    blob.upload_from_filename(f'{DIR}/{file}')

train.R
serve.R
Dockerfile


In [62]:
list(bucket.list_blobs(prefix = SOURCEPATH))

[<Blob: statmike-mlops-349915, 08/08f/Dockerfile, 1665948395160783>,
 <Blob: statmike-mlops-349915, 08/08f/serve.R, 1665948395117160>,
 <Blob: statmike-mlops-349915, 08/08f/train.R, 1665948395032856>]

In [63]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID}/{SOURCEPATH};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/08/08f;tab=objects&project=statmike-mlops-349915


#### Setup Artifact Registry

Artifact Registry organizes artifacts with repositories.  Each repository contains packages and is designated to hold a particular format of package: Docker images, Python packages, and [others](https://cloud.google.com/artifact-registry/docs/supported-formats#package).

##### List Repositories

This list may be empty if no repositories have been created for this project.

In [64]:
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    print(repo.name)

projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-docker
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python


##### Create Docker Image Repository

Create an Artifact Registry repository to hold Docker images created by this notebook.  First, check whether a previous run already created it and retrieve it if so.  Otherwise, create it.

In [65]:
docker_repo = None
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    if f'{PROJECT_ID}-docker' in repo.name:
        docker_repo = repo
        print(f'Retrieved existing repo: {docker_repo.name}')

if not docker_repo:
    operation = ar_client.create_repository(
        request = artifactregistry_v1.CreateRepositoryRequest(
            parent = f'projects/{PROJECT_ID}/locations/{REGION}',
            repository_id = f'{PROJECT_ID}-docker',
            repository = artifactregistry_v1.Repository(
                description = f'A repository for the {EXPERIMENT} experiment that holds docker images.',
                name = f'{PROJECT_ID}-docker',
                format_ = artifactregistry_v1.Repository.Format.DOCKER,
                labels = {'series': SERIES, 'experiment': EXPERIMENT}
            )
        )
    )
    print('Creating Repository ...')
    docker_repo = operation.result()
    print(f'Completed creating repo: {docker_repo.name}')

Retrieved existing repo: projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-docker


In [66]:
docker_repo.name, docker_repo.format_.name

('projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-docker',
 'DOCKER')

In [67]:
REPOSITORY = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{docker_repo.name.split('/')[-1]}"
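
The `REPOSITORY` string follows the pattern `<location>-docker.pkg.dev/<project>/<repository>`, with image names appended after it. A small sketch of that mapping from a repository resource name is shown below. `docker_image_uri` is a hypothetical helper, not part of the Artifact Registry client.

```python
def docker_image_uri(repo_resource_name, image, tag=None):
    """Build a pkg.dev image URI from an Artifact Registry repository resource name."""
    parts = repo_resource_name.split("/")
    # expected layout: projects/<project>/locations/<location>/repositories/<repo>
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "repositories"]:
        raise ValueError(f"unexpected repository name: {repo_resource_name}")
    project, location, repo = parts[1], parts[3], parts[5]
    uri = f"{location}-docker.pkg.dev/{project}/{repo}/{image}"
    return f"{uri}:{tag}" if tag else uri
```

For this notebook, `docker_image_uri(docker_repo.name, f'{EXPERIMENT}_trainer')` would give the same value as `f'{REPOSITORY}/{EXPERIMENT}_trainer'`.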

#### Build Custom Container
Use the Cloud Build client to construct and run the build instructions.  Here the files collected in GCS are copied to the build instance, then the Docker build is run in the folder with the `Dockerfile`.  The resulting image is pushed to Artifact Registry (set up above).

In [28]:
# setup the build config with empty list of steps - these will be added sequentially
build = cloudbuild_v1.Build(
    steps = []
)
# retrieve the source
build.steps.append(
    {
        'name': 'gcr.io/cloud-builders/gsutil',
        'args': ['cp', '-r', f'gs://{PROJECT_ID}/{SOURCEPATH}/*', '/workspace']
    }
)
# docker build
build.steps.append(
    {
        'name': 'gcr.io/cloud-builders/docker',
        'args': ['build', '-t', f'{REPOSITORY}/{EXPERIMENT}_trainer', '/workspace']
    }    
)
# docker push
build.images = [f"{REPOSITORY}/{EXPERIMENT}_trainer"]

In [29]:
build

steps {
  name: "gcr.io/cloud-builders/gsutil"
  args: "cp"
  args: "-r"
  args: "gs://statmike-mlops-349915/05/05f/training/*"
  args: "/workspace"
}
steps {
  name: "gcr.io/cloud-builders/docker"
  args: "build"
  args: "-t"
  args: "us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/05f_trainer"
  args: "/workspace"
}
images: "us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/05f_trainer"

In [30]:
operation = cb_client.create_build(
    project_id = PROJECT_ID,
    build = build
)

In [31]:
response = operation.result()
response.status, response.artifacts

(<Status.SUCCESS: 3>,
 images: "us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-docker/05f_trainer")

In [32]:
print(f"Review the Custom Container with Artifact Registry in the Google Cloud Console:\nhttps://console.cloud.google.com/artifacts/docker/{PROJECT_ID}/{REGION}/{PROJECT_ID}-docker?project={PROJECT_ID}")

Review the Custom Container with Artifact Registry in the Google Cloud Console:
https://console.cloud.google.com/artifacts/docker/statmike-mlops-349915/us-central1/statmike-mlops-349915-docker?project=statmike-mlops-349915


### Setup Training Job

In [33]:
# train.R reads these by position with commandArgs(trailingOnly = TRUE),
# so the order must match args[1] through args[9] in the script
CMDARGS = [
    PROJECT_ID,
    REGION,
    EXPERIMENT,
    SERIES,
    BQ_PROJECT,
    BQ_DATASET,
    BQ_TABLE,
    VAR_TARGET,
    VAR_OMIT
]
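
Since `train.R` reads its inputs by position (`args[1]` through `args[9]`), the order of the argument list matters more than the names. One way to guard against ordering mistakes is to build the list from a mapping, as in this sketch. `ARG_ORDER` and `positional_args` are hypothetical names introduced for illustration.

```python
# order in which train.R reads args[1] through args[9]
ARG_ORDER = [
    "project_id", "region", "experiment", "series",
    "bq_project", "bq_dataset", "bq_table",
    "var_target", "var_omit",
]

def positional_args(values):
    """Return the values as a list ordered the way train.R indexes them."""
    missing = [name for name in ARG_ORDER if name not in values]
    if missing:
        raise KeyError(f"missing arguments: {missing}")
    return [str(values[name]) for name in ARG_ORDER]
```

The returned list can be passed as the `args` of the training job, and a forgotten input fails early in the notebook instead of inside the container.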

In [34]:
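trainingJob = aiplatform.CustomContainerTrainingJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    container_uri = f"{REPOSITORY}/{EXPERIMENT}_trainer",
    # assumption: the Dockerfile sets no ENTRYPOINT, so the training script is run explicitly
    command = ['Rscript', 'train.R'],
    # assumption: the same image serves the model because serve.R is copied into it
    model_serving_container_image_uri = f"{REPOSITORY}/{EXPERIMENT}_trainer",
    model_serving_container_command = ['Rscript', 'serve.R'],
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)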

### Run Training Job AND Upload The Model
The training job will automatically upload the model to the Vertex AI Model Registry and return the link to the model.

In [35]:
modelmatch = aiplatform.Model.list(filter = f'display_name={SERIES}_{EXPERIMENT} AND labels.series={SERIES} AND labels.experiment={EXPERIMENT}')
if modelmatch:
    print("Model Already in Registry:")
    if RUN_NAME in modelmatch[0].version_aliases:
        print("This version already loaded, no action taken.")
        model = aiplatform.Model(model_name = modelmatch[0].resource_name)
    else:
        print('Loading model as new default version.')
        model = trainingJob.run(
            model_display_name = f'{SERIES}_{EXPERIMENT}',
            model_labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'},
            model_id = f'model_{SERIES}_{EXPERIMENT}',
            parent_model = modelmatch[0].resource_name,
            is_default_version = True,
            model_version_aliases = [RUN_NAME],
            model_version_description = RUN_NAME,
            base_output_dir = f"{URI}/models/{TIMESTAMP}",
            service_account = SERVICE_ACCOUNT,
            args = CMDARGS,
            replica_count = 1,
            machine_type = TRAIN_COMPUTE,
            accelerator_count = 0,
            tensorboard = tb.resource_name
        )
else:
    print('This is a new model, creating in model registry')
    model = trainingJob.run(
        model_display_name = f'{SERIES}_{EXPERIMENT}',
        model_labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'},
        model_id = f'model_{SERIES}_{EXPERIMENT}',
        is_default_version = True,
        model_version_aliases = [RUN_NAME],
        model_version_description = RUN_NAME,
        base_output_dir = f"{URI}/models/{TIMESTAMP}",
        service_account = SERVICE_ACCOUNT,
        args = CMDARGS,
        replica_count = 1,
        machine_type = TRAIN_COMPUTE,
        accelerator_count = 0,
        tensorboard = tb.resource_name
    )

This is a new model, creating in model registry
Training Output directory:
gs://statmike-mlops-349915/05/05f/models/20220927190441 
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2939962975911411712?project=1026793852137
CustomContainerTrainingJob projects/1026793852137/locations/us-central1/trainingPipelines/2939962975911411712 current state:
PipelineState.PIPELINE_STATE_RUNNING
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7560497863919140864?project=1026793852137
View tensorboard:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7179142426307592192+experiments+7560497863919140864
CustomContainerTrainingJob projects/1026793852137/locations/us-central1/trainingPipelines/2939962975911411712 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContainerTrainingJob projects/1026793852137/locations/us-central1/tra
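
The branching in the cell above reduces to three cases: no matching model, a matching model that already carries this run's alias, and a matching model without it. The sketch below isolates that decision as a pure function so it can be checked without calling Vertex AI. `upload_action` and its return labels are hypothetical names.

```python
def upload_action(existing_aliases, run_name):
    """Classify a run as 'create', 'skip', or 'new_version'.

    existing_aliases is None when no model matched the registry filter,
    otherwise the version aliases of the first match.
    """
    if existing_aliases is None:
        return "create"        # new model in the registry
    if run_name in existing_aliases:
        return "skip"          # this version is already loaded
    return "new_version"       # upload under the existing parent model
```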

Get the backing Custom Job for the Training Pipeline:

In [36]:
clientPL = aiplatform.gapic.PipelineServiceClient(client_options = {'api_endpoint': f'{REGION}-aiplatform.googleapis.com'})

In [37]:
from google.protobuf.json_format import MessageToDict

backingCustomJob = MessageToDict(clientPL.get_training_pipeline(name = trainingJob.resource_name)._pb)['trainingTaskMetadata']['backingCustomJob']

In [38]:
customJob = aiplatform.CustomJob.get(backingCustomJob)
customJob.resource_name, customJob.display_name

('projects/1026793852137/locations/us-central1/customJobs/7560497863919140864',
 '05_05f_20220927190441-custom-job')
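
Resource names such as the one returned above are alternating collection/ID pairs, which is why the notebook takes the last segment with `split('/')[-1]` when building links. A minimal sketch that splits the whole name into a dictionary follows. `parse_resource_name` is a hypothetical helper.

```python
def parse_resource_name(resource_name):
    """Split 'collection/id/collection/id/...' into a {collection: id} dict."""
    parts = resource_name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected collection/id pairs: {resource_name}")
    return dict(zip(parts[0::2], parts[1::2]))

job = parse_resource_name(
    "projects/1026793852137/locations/us-central1/customJobs/7560497863919140864"
)
print(job["customJobs"])  # prints 7560497863919140864
```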

Create hyperlinks to job and tensorboard here:

In [39]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
board_link = f"https://{REGION}.tensorboard.googleusercontent.com/experiment/{tb.resource_name.replace('/', '+')}+experiments+{customJob.name.split('/')[-1]}"

print(f'Review the Training Pipeline here:\nhttps://console.cloud.google.com/vertex-ai/training/training-pipelines?project={PROJECT_ID}')
print(f'Review the Custom Job here:\n{job_link}')
print(f'Review the TensorBoard From the Job here:\n{board_link}')
print(f'Review the model in the Vertex AI Model Registry:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/models/{model.name}?project={PROJECT_ID}')

Review the Training Pipeline here:
https://console.cloud.google.com/vertex-ai/training/training-pipelines?project=statmike-mlops-349915
Review the Custom Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/7560497863919140864/cpu?cloudshell=false&project=statmike-mlops-349915
Review the TensorBoard From the Job here:
https://us-central1.tensorboard.googleusercontent.com/experiment/projects+1026793852137+locations+us-central1+tensorboards+7179142426307592192+experiments+7560497863919140864
Review the model in the Vertex AI Model Registry:
https://console.cloud.google.com/vertex-ai/locations/us-central1/models/model_05_05f?project=statmike-mlops-349915
