<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/08%20-%20R/R%20-%20Vertex%20AI%20Custom%20Training%20Jobs.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/08%20-%20R/R%20-%20Vertex%20AI%20Custom%20Training%20Jobs.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/R%20-%20Vertex%20AI%20Custom%20Training%20Jobs.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/08%20-%20R/R%20-%20Vertex%20AI%20Custom%20Training%20Jobs.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# R - Vertex AI Custom Training Jobs

Running an **R** script as a job using [Vertex AI Custom Training](https://cloud.google.com/vertex-ai/docs/training/overview).  Custom training jobs are fully managed and are specified in three parts: code, container, and compute.

Why?
- The same script that is being developed in an IDE can be ramped up with much larger data and compute by launching it as a custom training job.
- Easily automate the running of jobs by schedule or event triggers.
- Use larger compute while controlling cost - only pay for the runtime of the job.

>This notebook uses a Python kernel in order to use the Vertex AI SDK Python Client. To have a complete **R** based workflow, the code in this workflow could be adapted to run in **R** with the [reticulate](https://cran.r-project.org/web/packages/reticulate/vignettes/calling_python.html) package, which provides an **R** interface to Python.

---
Part of the series of [**R**](https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/readme.md) workflows:

A series of workflows focused on using **R** in Vertex AI as well as other Google Cloud services to run R code, train models with R, and serve predictions with R.

---

**Prerequisites:**

- This notebook running in Vertex AI Workbench Instance as described in the series [readme](./readme.md)
- Run the workflow: [R - Notebook Based Workflow](./R%20-%20Notebook%20Based%20Workflow.ipynb)
    - This prepares the data source used by the custom job in this workflow

---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found, then the install name is used to install the package quietly for the current user.

In [46]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.artifactregistry_v1', 'google-cloud-artifact-registry'),
    ('google.cloud.devtools', 'google-cloud-build')
]

import importlib.util
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
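
One caveat with `importlib.util.find_spec`: for a dotted name such as `google.cloud.aiplatform` it imports the parent packages first and raises `ModuleNotFoundError` when a parent is missing, rather than returning `None`. The following is a minimal sketch of a more defensive check. The helper name `is_installed` is hypothetical and not part of the original notebook.

```python
import importlib.util

def is_installed(import_name: str) -> bool:
    """Return True if the module can be located, False otherwise."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised when a parent package of a dotted name is missing
        return False

print(is_installed('json'))
print(is_installed('not_a_real_parent_pkg.child'))
```

A check like this could replace the `if not importlib.util.find_spec(...)` condition in the loop above on a fresh environment where none of the `google.cloud` packages are present yet.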

### Enable APIs

In [47]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable artifactregistry.googleapis.com
!gcloud services enable cloudbuild.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [48]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [49]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [50]:
REGION = 'us-central1'
EXPERIMENT = 'bigquery-data'
SERIES = 'r'

# BigQuery Parameters
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2]

# GCS Parameters: Give bucket name
GCS_BUCKET = PROJECT_ID

# key columns in the data:
VAR_TARGET = 'Class'
VAR_OMIT = ['transaction_id', 'splits']

packages:

In [51]:
from google.cloud import aiplatform
from google.cloud import storage
from google.cloud.devtools import cloudbuild_v1
from google.cloud import artifactregistry_v1
from IPython.display import Markdown as md
from datetime import datetime
import os

parameters:

In [52]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
URI = f"gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"

clients:

In [53]:
aiplatform.init(project = PROJECT_ID, location = REGION)
gcs = storage.Client(project = PROJECT_ID)
ar_client = artifactregistry_v1.ArtifactRegistryClient()
cb_client = cloudbuild_v1.CloudBuildClient()

environment:

In [54]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Prepare Training Code: **R** Script

The prior workflow in this series, [R - Notebook Based Workflow](./R%20-%20Notebook%20Based%20Workflow.ipynb), did the model training work in a notebook using an **R** kernel.  The first step to making this a training job is converting the notebook into an actual **R** script.  

The steps from the notebook workflow have been replicated in the **R** script included with this repository.  The cell below loads and shows this script.  
- review directly in GitHub with [this link](https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/code/train.R)

**Notes On Script**
- The steps are replicated identically with the following additions:
    - The parameters are defined as inputs at the top of the code
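    - The last step of the code uses `saveRDS` to save the model to a local file and then uses `system2` to copy that file to GCS for further use.  The script runs `gsutil cp` to the location provided in the `AIP_MODEL_DIR` environment variable; a commented-out alternative uses a plain `cp` to the automatically mounted local path for GCS.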
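
The path handling at the end of the script can be illustrated in Python. Vertex AI custom training exposes Cloud Storage through a Cloud Storage FUSE mount under `/gcs/`, so a `gs://` URI maps to a local path by swapping the prefix. This is a minimal sketch; the function name and the example URI are hypothetical.

```python
def gcs_uri_to_fuse_path(uri: str) -> str:
    """Map gs://bucket/path to the mounted path /gcs/bucket/path."""
    prefix = 'gs://'
    if not uri.startswith(prefix):
        raise ValueError(f'not a GCS URI: {uri}')
    return '/gcs/' + uri[len(prefix):]

# placeholder URI, shaped like a value of AIP_MODEL_DIR
print(gcs_uri_to_fuse_path('gs://my-bucket/r/bigquery-data/models/model/'))
```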

In [102]:
# load and view the script:
SCRIPT_PATH = './code/train.R'

with open(SCRIPT_PATH, 'r') as file:
    data = file.read()
md(f"```R\n\n{data}\n```")

```R

# library import
library(bigrquery)
library(dplyr)

# inputs
args <- commandArgs(trailingOnly = TRUE)
bq_project <- args[1]
bq_dataset <- args[2]
bq_table <- args[3]
var_target <- args[4]
var_omit <- args[5]

# data source
get_data <- function(s){
    
    # query for table
    query <- sprintf('
        SELECT * EXCEPT(%s)
        FROM `%s.%s.%s`
        WHERE splits = "%s"
    ', var_omit, bq_project, bq_dataset, bq_table, s)
    
    # connect to table
    table <- bq_project_query(bq_project, query)
    
    # load table to dataframe
    return(bq_table_download(table, n_max = Inf))

}
train <- get_data("TRAIN")
test <- get_data("TEST")

# logistic regression model
model_exp = paste0(var_target, "~ .")

model <- glm(
    as.formula(model_exp),
    data = train,
    family = binomial)

# predictions for evaluation
preds <- predict(model, test, type = "response")

# evaluate
actual <- test[, var_target]
names(actual) <- 'actual'
pred <- tibble(round(preds))
names(pred) <- 'pred'
results <- cbind(actual, pred)
cm <- table(results)

# save model to file
saveRDS(model, "model.rds")

# get GCS fusemount location to save file to:
path <- sub('gs://', '/gcs/', Sys.getenv('AIP_MODEL_DIR'))
#system2('cp', c('model.rds', path))

# copy model file to GCS
system2('gsutil', c('cp', 'model.rds', Sys.getenv('AIP_MODEL_DIR')))

```
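
To make the query construction in `get_data` concrete, the sketch below mirrors the script's `sprintf` template in Python. The project name `my-project` is a placeholder; the other values follow the parameters defined in the Setup section. Because `var_omit` arrives as a single comma-separated string, it can be dropped directly into `EXCEPT(...)`.

```python
def build_query(var_omit, bq_project, bq_dataset, bq_table, split):
    """Render the same query text the R script builds with sprintf."""
    return (
        f'SELECT * EXCEPT({var_omit})\n'
        f'FROM `{bq_project}.{bq_dataset}.{bq_table}`\n'
        f'WHERE splits = "{split}"'
    )

# 'my-project' is a placeholder project id
query = build_query('transaction_id,splits', 'my-project', 'r', 'bigquery-data', 'TRAIN')
print(query)
```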

---
## Create Custom Training Job

There are many ways to create a custom training job by combining code with a container.  Check out [this comprehensive review](../Tips/Python%20Training.ipynb) of the possible workflows.

This workflow uses a local script, reviewed above, along with a container and compute specification. 

### Choose Computing Environment

When using [custom training on Vertex AI](https://cloud.google.com/vertex-ai/docs/training/custom-training-methods) the compute environment is specified as parameters.  At a minimum this will include the [compute resources](https://cloud.google.com/vertex-ai/docs/training/configure-compute) and [container](https://cloud.google.com/vertex-ai/docs/training/configure-container-settings) URIs.

This example uses minimal compute with a single node and no accelerators (GPU).

Google Cloud provides prebuilt containers for machine learning that can be used directly or used as the base to build [custom containers](https://cloud.google.com/vertex-ai/docs/training/containers-overview) by adding packages and code with Docker.

**Pre-Built Containers Sources**

The sources for pre-built containers can be found in these locations:
- [Deep Learning Containers With Vertex AI](https://cloud.google.com/vertex-ai/docs/general/deep-learning)
    - list these with `gcloud container images list --repository="gcr.io/deeplearning-platform-release"`
- [Vertex AI Pre-Built Containers for Custom Training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers)
    - list these with `gcloud container images list --repository="us-docker.pkg.dev/vertex-ai/training"`
- [Vertex AI Pre-Built Containers for Prediction](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers)
    - list these with `gcloud container images list --repository="us-docker.pkg.dev/vertex-ai/prediction"`

In [103]:
# Resources
TRAIN_COMPUTE = 'n1-standard-4'
TRAIN_IMAGE = 'gcr.io/deeplearning-platform-release/r-cpu.4-3'

### Creating a Custom Container with Cloud Build

Cloud Build creates and manages the build on GCP.  The API creates a build by providing:
- location of the source
- instructions
- location to store the built artifacts

The instruction part of Cloud Build has options:
- Dockerfile
- Build Config file (YAML or JSON)
- Cloud Native Buildpacks

This notebook uses the Python Client for Cloud Build and does not reference any local files.  For that reason, the first step is creating a Dockerfile for the workflow and storing it in GCS. The next step is running Cloud Build, using the client to specify the build config rather than a config file.  The build config steps start by retrieving the code (git clone, or copy from GCS) along with the Dockerfile.  

There are many workflows for creating containers with ML training code.  Many of the most common ones are explored in the tips notebook [Python Custom Containers](../Tips/Python%20Custom%20Containers.ipynb).  The method used here is the simplest: copy the training code directly into the container.  The other methods include packaging the training code as a Python Distribution and using `pip install` from GCS, GitHub, and even Artifact Registry as a private repository.

#### Store Resources in Cloud Storage

In [104]:
bucket = gcs.lookup_bucket(GCS_BUCKET)
SOURCEPATH = f'{SERIES}/{EXPERIMENT}/models/{TIMESTAMP}'

#### Copy Training Code

In [105]:
blob = bucket.blob(f'{SOURCEPATH}/train.R')
blob.upload_from_filename(SCRIPT_PATH)

#### Create the Dockerfile
A basic Dockerfile that takes the base image and copies the code in.  It does not define an entrypoint; the command that runs the **R** script is supplied later in the training job's container spec.  Add `RUN` entries to install additional packages.

In [106]:
dockerfile = f"""
FROM {TRAIN_IMAGE}

WORKDIR /root

# copy code into /code folder:
COPY ./*.R ./code/
"""
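
If the training script needed **R** packages that are not already in the base image, `RUN` entries could be generated alongside the rest of the Dockerfile. This is a minimal sketch under stated assumptions: the helper `make_dockerfile` is hypothetical, and `glmnet` is only an example package, not something the original script requires.

```python
def make_dockerfile(base_image, r_packages=()):
    """Build Dockerfile text, optionally installing extra R packages."""
    lines = [f'FROM {base_image}', '', 'WORKDIR /root', '']
    if r_packages:
        quoted = ', '.join(f"'{p}'" for p in r_packages)
        install_cmd = "install.packages(c({}), repos='https://cloud.r-project.org')".format(quoted)
        lines.append(f'RUN Rscript -e "{install_cmd}"')
        lines.append('')
    lines += ['# copy code into /code folder:', 'COPY ./*.R ./code/']
    return '\n'.join(lines) + '\n'

# example package chosen for illustration only
text = make_dockerfile('gcr.io/deeplearning-platform-release/r-cpu.4-3', ['glmnet'])
print(text)
```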

In [107]:
blob = bucket.blob(f'{SOURCEPATH}/Dockerfile')
blob.upload_from_string(dockerfile)

#### Setup Artifact Registry

Artifact Registry organizes artifacts with repositories.  Each repository contains packages and is designated to hold a particular format of package: Docker images, Python packages, and [others](https://cloud.google.com/artifact-registry/docs/supported-formats#package).

##### List Repositories

This may be empty if no repositories have been created for this project.

In [108]:
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    print(repo.name)

projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-docker
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python


#### Create Docker Image Repository

Create an Artifact Registry Repository to hold Docker Images created by this notebook.  First, check to see if it is already created by a previous run and retrieve it if it has.  Otherwise, create!

In [109]:
docker_repo = None
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    if f'{PROJECT_ID}' == repo.name.split('/')[-1]:
        docker_repo = repo
        print(f'Retrieved existing repo: {docker_repo.name}')

if not docker_repo:
    operation = ar_client.create_repository(
        request = artifactregistry_v1.CreateRepositoryRequest(
            parent = f'projects/{PROJECT_ID}/locations/{REGION}',
            repository_id = f'{PROJECT_ID}',
            repository = artifactregistry_v1.Repository(
                description = f'A repository for the {SERIES} series that holds docker images.',
                name = f'{PROJECT_ID}',
                format_ = artifactregistry_v1.Repository.Format.DOCKER,
                labels = {'series': SERIES}
            )
        )
    )
    print('Creating Repository ...')
    docker_repo = operation.result()
    print(f'Completed creating repo: {docker_repo.name}')

Retrieved existing repo: projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915


In [110]:
docker_repo.name, docker_repo.format_.name

('projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915',
 'DOCKER')

In [111]:
REPOSITORY = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{docker_repo.name.split('/')[-1]}"
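
The `REPOSITORY` value follows the Artifact Registry Docker host format `<region>-docker.pkg.dev/<project>/<repository>`, and the image name is appended to it later in the build. A minimal sketch of that composition, using a hypothetical helper `image_uri` and the repository resource name shown in the output above:

```python
def image_uri(repo_resource_name, image_name):
    """Compose a Docker image URI from an Artifact Registry repository name."""
    # expected shape: projects/<project>/locations/<region>/repositories/<repo>
    parts = repo_resource_name.split('/')
    project, region, repo = parts[1], parts[3], parts[5]
    return f'{region}-docker.pkg.dev/{project}/{repo}/{image_name}'

uri = image_uri(
    'projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915',
    'r_bigquery-data_trainer'
)
print(uri)
```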

#### Build Custom Container
Use the Cloud Build client to construct and run the build instructions.  Here the files collected in GCS are copied to the build instance, then the Docker build is run in the folder with the `Dockerfile`.  The resulting image is pushed to Artifact Registry (set up above).

In [112]:
# setup the build config with empty list of steps - these will be added sequentially
build = cloudbuild_v1.Build(
    steps = []
)
# retrieve the source
build.steps.append(
    {
        'name': 'gcr.io/cloud-builders/gsutil',
        'args': ['cp', '-r', f'gs://{GCS_BUCKET}/{SOURCEPATH}/*', '/workspace']
    }
)
# docker build
build.steps.append(
    {
        'name': 'gcr.io/cloud-builders/docker',
        'args': ['build', '-t', f'{REPOSITORY}/{SERIES}_{EXPERIMENT}_trainer', '/workspace']
    }    
)
# docker push
build.images = [f"{REPOSITORY}/{SERIES}_{EXPERIMENT}_trainer"]

In [113]:
build

steps {
  name: "gcr.io/cloud-builders/gsutil"
  args: "cp"
  args: "-r"
  args: "gs://statmike-mlops-349915/r/bigquery-data/models/20240127154845/*"
  args: "/workspace"
}
steps {
  name: "gcr.io/cloud-builders/docker"
  args: "build"
  args: "-t"
  args: "us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915/r_bigquery-data_trainer"
  args: "/workspace"
}
images: "us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915/r_bigquery-data_trainer"
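
For comparison with the build config file option mentioned earlier, the same two steps can be written as a plain dictionary and serialized to JSON. This is only a sketch: the bucket, source path, and image values are placeholders, and the notebook itself passes the build through the Python client rather than a file.

```python
import json

# placeholder values for illustration
bucket = 'my-bucket'
source_path = 'r/bigquery-data/models/20240101000000'
image = 'us-central1-docker.pkg.dev/my-project/my-repo/r_bigquery-data_trainer'

config = {
    'steps': [
        {
            # copy the training code and Dockerfile from GCS to the build workspace
            'name': 'gcr.io/cloud-builders/gsutil',
            'args': ['cp', '-r', f'gs://{bucket}/{source_path}/*', '/workspace']
        },
        {
            # build the image from the Dockerfile in the workspace
            'name': 'gcr.io/cloud-builders/docker',
            'args': ['build', '-t', image, '/workspace']
        }
    ],
    # images listed here are pushed after the steps complete
    'images': [image]
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

Saved to a file, a config of this shape should be usable with `gcloud builds submit --no-source --config=<file>`, although that path is not exercised in this notebook.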

In [114]:
operation = cb_client.create_build(
    project_id = PROJECT_ID,
    build = build
)

In [115]:
response = operation.result()
response.status, response.artifacts

(<Status.SUCCESS: 3>,
 images: "us-central1-docker.pkg.dev/statmike-mlops-349915/statmike-mlops-349915/r_bigquery-data_trainer")

In [116]:
print(f"Review the Custom Container with Artifact Registry in the Google Cloud Console:\nhttps://console.cloud.google.com/artifacts/docker/{PROJECT_ID}/{REGION}/{PROJECT_ID}?project={PROJECT_ID}")

Review the Custom Container with Artifact Registry in the Google Cloud Console:
https://console.cloud.google.com/artifacts/docker/statmike-mlops-349915/us-central1/statmike-mlops-349915?project=statmike-mlops-349915


### Setup Training Job

In [117]:
CMDARGS = [
    BQ_PROJECT,
    BQ_DATASET,
    BQ_TABLE,
    VAR_TARGET,
    ','.join(VAR_OMIT)
]

MACHINE_SPEC = {
    "machine_type": TRAIN_COMPUTE,
    "accelerator_count": 0
}

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": f"{REPOSITORY}/{SERIES}_{EXPERIMENT}_trainer",
            "command": ["Rscript", "./code/train.R"],
            "args": CMDARGS
        }
    }
]
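
The `command` and `args` entries of the container spec are concatenated to form the process that runs in the container, so the values in `CMDARGS` arrive in the **R** script as `args[1]` through `args[5]`. The sketch below shows the resulting command line, with `my-project` as a placeholder for the project id:

```python
import shlex

command = ['Rscript', './code/train.R']
args = [
    'my-project',                           # BQ_PROJECT (placeholder)
    'r',                                    # BQ_DATASET
    'bigquery-data',                        # BQ_TABLE
    'Class',                                # VAR_TARGET
    ','.join(['transaction_id', 'splits'])  # VAR_OMIT joined into one argument
]

cmdline = shlex.join(command + args)
print(cmdline)
# Rscript ./code/train.R my-project r bigquery-data Class transaction_id,splits
```

Joining `VAR_OMIT` with commas keeps it a single positional argument, which is why the script can insert it directly into the query's `EXCEPT(...)` clause.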

In [118]:
customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/models/{TIMESTAMP}",
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'timestamp' : f'{TIMESTAMP}'}
)
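
Setting `base_output_dir` is what makes the `AIP_MODEL_DIR` environment variable available to the training script: Vertex AI derives it, along with the checkpoint and TensorBoard log locations, from the base output directory. A minimal sketch of that mapping follows, with a hypothetical helper name and a placeholder bucket:

```python
def vertex_output_dirs(base_output_dir):
    """Environment variables Vertex AI derives from the base output directory."""
    base = base_output_dir.rstrip('/')
    return {
        'AIP_MODEL_DIR': f'{base}/model/',
        'AIP_CHECKPOINT_DIR': f'{base}/checkpoints/',
        'AIP_TENSORBOARD_LOG_DIR': f'{base}/logs/'
    }

# placeholder bucket name
dirs = vertex_output_dirs('gs://my-bucket/r/bigquery-data/models/20240127154845')
print(dirs['AIP_MODEL_DIR'])
```

This matches the location of `model.rds` in the GCS listing at the end of this workflow, where the file lands in a `model/` folder under the timestamped path.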

### Run Training Job

In [119]:
customJob.run()

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/2444232421768429568
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/2444232421768429568')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2444232421768429568?project=1026793852137
CustomJob projects/1026793852137/locations/us-central1/customJobs/2444232421768429568 current state:
JobState.JOB_STATE_QUEUED
CustomJob projects/1026793852137/locations/us-central1/customJobs/2444232421768429568 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/2444232421768429568 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/2444232421768429568 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs/

In [120]:
customJob.display_name

'r_bigquery-data_20240127154845'

In [121]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/2444232421768429568'

Create a hyperlink to the job in the Google Cloud Console:

In [122]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"

print(f'Review the Custom Job here:\n{job_link}')

Review the Custom Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/2444232421768429568/cpu?cloudshell=false&project=statmike-mlops-349915


### Review Files In GCS

In [123]:
print(f'Review the files in GCS here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/{SOURCEPATH}?project={PROJECT_ID}')

Review the files in GCS here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/r/bigquery-data/models/20240127154845?project=statmike-mlops-349915


In [124]:
list(bucket.list_blobs(prefix = SOURCEPATH))

[<Blob: statmike-mlops-349915, r/bigquery-data/models/20240127154845/Dockerfile, 1706373769111426>,
 <Blob: statmike-mlops-349915, r/bigquery-data/models/20240127154845/model/model.rds, 1706374211615633>,
 <Blob: statmike-mlops-349915, r/bigquery-data/models/20240127154845/train.R, 1706373768502733>]