# Train Merlin TwoTower model

### Notebook Steps
* Build custom Vertex training container based on NVIDIA NGC Merlin Training container
* Configure and submit Vertex custom training job
* Configure and submit hyperparameter tuning job
* Evaluate results of hyperparameter tuning job

### Negative Sampling

* Merlin provides scalable negative sampling algorithms for the Item Retrieval Task 
* This example uses the in-batch sampling algorithm, which treats the items that other users in the same mini-batch interacted with as negatives
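
To make the in-batch idea concrete, here is a minimal sketch in plain Python. It is not Merlin's implementation: it assumes each row of a mini-batch is a positive (user, item) pair whose embeddings are already computed, and the function name `in_batch_softmax_loss` and the toy embeddings are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_vecs, item_vecs):
    """Mean softmax cross-entropy where, for user i, item i is the positive
    and every other item in the batch serves as a negative."""
    losses = []
    for i, user in enumerate(user_vecs):
        logits = [dot(user, item) for item in item_vecs]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: 3 users and 3 items with 2-d embeddings (hypothetical values)
users = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
items_aligned = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]   # row i matches user i
items_shuffled = [[0.0, 2.0], [2.0, 2.0], [2.0, 0.0]]  # positives misaligned

loss_aligned = in_batch_softmax_loss(users, items_aligned)
loss_shuffled = in_batch_softmax_loss(users, items_shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is lower when each user's own item scores highest among the items in the batch, which is the signal the retrieval model trains on without needing an explicit set of negative examples.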

## Training Strategy

* `MirroredStrategy`: Train on a single VM with multiple GPUs.
* `MultiWorkerMirroredStrategy`: Train on multiple VMs with automatic setup of replicas.
* `MultiWorkerMirroredStrategy`: Train on multiple VMs with fine-grained control of replicas.
* `ReductionServer`: Train on multiple VMs and sync updates across VMs with Vertex AI Reduction Server.
* `TPUTraining`: Train with multiple Cloud TPUs.

### Mirrored Strategy
When training on a single VM, you can train with either a single compute device or multiple compute devices on the same VM. With Vertex AI Distributed Training, you can specify both the number and the type of compute devices for the VM instance: CPU or GPU.

Vertex AI Distributed Training supports `tf.distribute.MirroredStrategy` for TensorFlow models.

To enable training across multiple compute devices on the same VM, you do the following additional steps in your Python training script:

1. Create a `tf.distribute.MirroredStrategy`.
2. Compile the model within the scope of the `tf.distribute.MirroredStrategy`. Note: this tells `MirroredStrategy` which variables to mirror across your compute devices.
3. Increase the global batch size to `num_devices * batch_size`, where `batch_size` is the per-device batch size.

During training, the distribution of batches is synchronized across devices, as are the updates to the model parameters.
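
The batch-size arithmetic from the steps above can be sketched in a few lines. This assumes synchronous data parallelism in which every device processes the same per-device batch; the helper name `global_batch_size` is hypothetical, and the device counts are illustrative.

```python
def global_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Batch size to hand to the input pipeline under synchronous data parallelism."""
    if per_device_batch_size < 1 or num_devices < 1:
        raise ValueError("both arguments must be positive integers")
    return per_device_batch_size * num_devices


# 2048 matches PER_GPU_BATCH_SIZE used later in this notebook; device counts are examples
for n in (1, 2, 4):
    print(f"{n} device(s): global batch = {global_batch_size(2048, n)}")
```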

### Setup

In [307]:
import json
import os
import time

from google.cloud import aiplatform as vertex_ai
from google.cloud.aiplatform import hyperparameter_tuning as hpt

from pprint import pprint

In [308]:
# TODO: Project definitions
PROJECT_ID = 'hybrid-vertex' # Change to your project ID.
REGION = 'us-central1' # Change to your region.

# TODO: Service Account address
VERTEX_SA = '934903580331-compute@developer.gserviceaccount.com' # Change to your service account with Vertex AI Admin permissions.

### For HugeCTR data access

* Data paths must use the `/gcs/BUCKET_NAME/...` form so that they resolve through GCSFuse
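
As a small illustration of the path convention, the sketch below converts a `gs://` URI into the corresponding `/gcs/` path. The helper name `gcs_uri_to_fuse_path` and the example object path are hypothetical; only the `/gcs/BUCKET_NAME/...` layout comes from the note above.

```python
def gcs_uri_to_fuse_path(uri: str) -> str:
    """Map gs://BUCKET/path to /gcs/BUCKET/path (the GCSFuse mount layout)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    return "/gcs/" + uri[len(prefix):]


# Hypothetical object path inside the notebook's bucket
print(gcs_uri_to_fuse_path("gs://spotify-merlin-v1/some/dir/train"))
```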

In [309]:
# using GCSFuse file lists
# TRAIN_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train/_gcs_file_list.txt'
# VALID_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid/_gcs_file_list.txt'

# Schema used by the training pipeline
# SCHEMA_PATH = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v25-subset/nvt-defined/train/schema.pbtxt'

# Merlin Datasets
# train = MerlinDataset(output_train_dir + "/*.parquet", schema=schema, part_size="500MB")
# valid = MerlinDataset(output_valid_dir + "/*.parquet", schema=schema, part_size="500MB")

In [310]:
# Bucket definitions
BUCKET = 'spotify-merlin-v1'

VERSION = 'v13' # changed merlin image from "..:07" to "...:06"
MODEL_NAME = 'twotower'
FRAMEWORK = 'merlin-tf'
MODEL_DISPLAY_NAME = f'vertex-{FRAMEWORK}-{MODEL_NAME}-{VERSION}'
WORKSPACE = f'gs://{BUCKET}/{MODEL_DISPLAY_NAME}'

# Docker definitions for training
IMAGE_NAME = f'{FRAMEWORK}-{MODEL_NAME}-training-{VERSION}'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
# DOCKERNAME = 'hugectr'
DOCKERNAME = 'merlintf'
MACHINE_TYPE = 'e2-highcpu-32' # machine type for the Cloud Build step
FILE_LOCATION = './src'

### Initialize Vertex AI SDK

In [311]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=os.path.join(WORKSPACE, 'staging')
)

### Create Train Image

In [312]:
!pwd

/home/jupyter/spotify-merlin


In [313]:
REPO_DOCKER_PATH_PREFIX = 'src'

> `RUN pip install merlin-models==0.6.0`

In [314]:
%%writefile {REPO_DOCKER_PATH_PREFIX}/Dockerfile.{DOCKERNAME}

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07

WORKDIR /src

RUN pip install -U pip
RUN pip install google-cloud-bigquery gcsfs cloudml-hypertune
RUN pip install google-cloud-aiplatform kfp
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg  add - && apt-get update -y && apt-get install google-cloud-sdk -y

COPY training/* ./

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib

Overwriting src/Dockerfile.merlintf


In [315]:
print(f"DOCKERNAME: {DOCKERNAME}")
print(f"IMAGE_URI: {IMAGE_URI}")
print(f"FILE_LOCATION: {FILE_LOCATION}")
print(f"MACHINE_TYPE: {MACHINE_TYPE}")

DOCKERNAME: merlintf
IMAGE_URI: gcr.io/hybrid-vertex/merlin-tf-twotower-training-v13
FILE_LOCATION: ./src
MACHINE_TYPE: e2-highcpu-32


### Submit a Vertex custom training job

In [316]:
!tree /home/jupyter/spotify-merlin/src/training

/home/jupyter/spotify-merlin/src/training
├── __init__.py
├── train_task.py
├── training.py
├── two_tower_model.py
└── utils.py

0 directories, 5 files


In [317]:
os.chdir('/home/jupyter/spotify-merlin')
os.getcwd()

'/home/jupyter/spotify-merlin'

In [318]:
FILE_LOCATION = './src'
! gcloud builds submit --config src/cloudbuild.yaml --substitutions _DOCKERNAME=$DOCKERNAME,_IMAGE_URI=$IMAGE_URI,_FILE_LOCATION=$FILE_LOCATION --timeout=2h --machine-type=$MACHINE_TYPE

Creating temporary tarball archive of 81 file(s) totalling 1.7 MiB before compression.
Uploading tarball of [.] to [gs://hybrid-vertex_cloudbuild/source/1663106031.143978-8a29bbba89d949f1a8d2ec5cb8c03e4e.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/hybrid-vertex/locations/global/builds/fab486f3-ccbc-4666-8ef0-fcc0454f1dde].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/fab486f3-ccbc-4666-8ef0-fcc0454f1dde?project=934903580331].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "fab486f3-ccbc-4666-8ef0-fcc0454f1dde"

FETCHSOURCE
Fetching storage object: gs://hybrid-vertex_cloudbuild/source/1663106031.143978-8a29bbba89d949f1a8d2ec5cb8c03e4e.tgz#1663106031670083
Copying gs://hybrid-vertex_cloudbuild/source/1663106031.143978-8a29bbba89d949f1a8d2ec5cb8c03e4e.tgz#1663106031670083...
/ [1 files][215.7 KiB/215.7 KiB]                                                
Operation completed over 1 objects/215.7 

## Vertex Training

* See [here](https://cloud.google.com/vertex-ai/docs/training/configure-compute#specifying_gpus) for GPU config options

## Prepare Worker Pool Specs 

#### Artifact Directories

In [319]:
# full dataset - output of preprocessing pipeline
TRAIN_DATA = f'/gcs/{BUCKET}/nvt-preprocessing-spotify-v32-subset/nvt-processed/train'
VALID_DATA = f'/gcs/{BUCKET}/nvt-preprocessing-spotify-v32-subset/nvt-processed/valid'
WORKFLOW_DIR = f'gs://{BUCKET}/nvt-preprocessing-spotify-v32-subset/nvt-analyzed'
SCHEMA_PATH = f'/gcs/{BUCKET}/nvt-preprocessing-spotify-v32-subset/nvt-defined/train/schema.pbtxt' # Schema used by the training pipeline

# smaller dataset for testing
# TRAIN_DATA = '/gcs/spotify-builtin-2t/merlin-processed/train/'
# VALID_DATA = '/gcs/spotify-builtin-2t/merlin-processed/valid/'
# WORKFLOW_DIR = 'gs://spotify-builtin-2t/merlin-processed/workflow/2t-spotify-workflow'

# location to save trained model artifacts
MODEL_DIR = f'gs://{BUCKET}/model-dir/{VERSION}'

print(f'TRAIN_DATA: {TRAIN_DATA}')
print(f'VALID_DATA: {VALID_DATA}')
print(f'WORKFLOW_DIR: {WORKFLOW_DIR}')
print(f'SCHEMA_PATH: {SCHEMA_PATH}')
print(f'MODEL_DIR: {MODEL_DIR}')

TRAIN_DATA: /gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/train
VALID_DATA: /gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/valid
WORKFLOW_DIR: gs://spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-analyzed
SCHEMA_PATH: /gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-defined/train/schema.pbtxt
MODEL_DIR: gs://spotify-merlin-v1/model-dir/v13


#### Configure worker pools

**Reduction Server**

* Consider the network bandwidth supported by a reducer replica’s machine type:
  * In GCP, a VM’s machine type defines its maximum possible egress bandwidth.
  * For example, the egress bandwidth of the `n1-highcpu-16` machine type is capped at 32 Gbps.
  * See [Network bandwidths and GPUs](https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth) for details.
* Reduction servers **do not** use GPUs.
* For the maximum available bandwidth of each node in the third worker pool, see the "Maximum egress bandwidth (Gbps)" columns in [General-purpose machine family](https://cloud.google.com/compute/docs/general-purpose-machines).

Because reducers perform a very limited function (aggregating blocks of gradients), they can run on relatively low-powered, cost-effective machines.
Even with a large number of gradients, this computation does not require accelerated hardware or high CPU or memory resources.

**To avoid network bottlenecks, the total aggregate bandwidth of all replicas in the reducer worker pool must be greater than or equal to the total aggregate bandwidth of all replicas in worker pools 0 and 1, which host the GPU workers.**
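
The bandwidth rule can be turned into a quick sizing calculation. The sketch below is a back-of-the-envelope estimate, not an official sizing tool: the helper name `min_reducer_count` is hypothetical, the 32 Gbps reducer figure is the `n1-highcpu-16` limit quoted above, and the worker count and worker bandwidth are made-up example values that should be replaced with the figures for your machine types.

```python
import math


def min_reducer_count(num_gpu_workers: int,
                      worker_bandwidth_gbps: float,
                      reducer_bandwidth_gbps: float) -> int:
    """Smallest reducer pool whose aggregate bandwidth covers that of the GPU workers."""
    total_worker_bandwidth = num_gpu_workers * worker_bandwidth_gbps
    return math.ceil(total_worker_bandwidth / reducer_bandwidth_gbps)


# Example values: 4 GPU workers at an assumed 50 Gbps each, reducers at 32 Gbps each
reducers = min_reducer_count(num_gpu_workers=4,
                             worker_bandwidth_gbps=50,
                             reducer_bandwidth_gbps=32)
print(f"reducers needed: {reducers}")
```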

#### Hardware Accelerators

In [339]:
# WORKER_MACHINE_TYPE = 'a2-highgpu-4g'
WORKER_MACHINE_TYPE = 'a2-highgpu-1g'
REPLICA_COUNT = 1
ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
# PER_MACHINE_ACCELERATOR_COUNT = 4
PER_MACHINE_ACCELERATOR_COUNT = 1

DISTRIBUTE_STRATEGY = 'single' # one of: single, mirrored, multiworker, tpu


# For single-node training, set the Reduction Server count to 0
REDUCTION_SERVER_COUNT = 0
REDUCTION_SERVER_MACHINE_TYPE = "n1-highcpu-16"

gpus = json.dumps([list(range(PER_MACHINE_ACCELERATOR_COUNT))]).replace(' ','')


print(f'WORKER_MACHINE_TYPE: {WORKER_MACHINE_TYPE}')
print(f'REPLICA_COUNT: {REPLICA_COUNT}')
print(f'ACCELERATOR_TYPE: {ACCELERATOR_TYPE}')
print(f'PER_MACHINE_ACCELERATOR_COUNT: {PER_MACHINE_ACCELERATOR_COUNT}')
print(f'REDUCTION_SERVER_COUNT: {REDUCTION_SERVER_COUNT}')
print(f'REDUCTION_SERVER_MACHINE_TYPE: {REDUCTION_SERVER_MACHINE_TYPE}')
print(f'gpus: {gpus}')

WORKER_MACHINE_TYPE: a2-highgpu-1g
REPLICA_COUNT: 1
ACCELERATOR_TYPE: NVIDIA_TESLA_A100
PER_MACHINE_ACCELERATOR_COUNT: 1
REDUCTION_SERVER_COUNT: 0
REDUCTION_SERVER_MACHINE_TYPE: n1-highcpu-16
gpus: [[0]]
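
For reference, the `gpus` expression above serializes the device indices as a compact nested list. The sketch below wraps the same expression in a helper (the name `gpus_flag` is hypothetical) to show what it yields for other accelerator counts.

```python
import json


def gpus_flag(accelerator_count: int) -> str:
    """Same expression as the notebook cell: device indices as a space-free nested list."""
    return json.dumps([list(range(accelerator_count))]).replace(" ", "")


for n in (1, 2, 4):
    print(n, gpus_flag(n))
```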


#### Training parameters

In [340]:
EXPERIMENT_NAME = f"nb-vtt-{VERSION}-{DISTRIBUTE_STRATEGY}"
RUN_NAME = f'run-{time.strftime("%Y%m%d-%H%M%S")}'

print(f'EXPERIMENT_NAME: {EXPERIMENT_NAME}')
print(f'RUN_NAME: {RUN_NAME}')

EXPERIMENT_NAME: nb-vtt-v13-single
RUN_NAME: run-20220913-223315


In [344]:

NUM_EPOCHS = 5
MAX_ITERATIONS = 25000
# EVAL_INTERVAL = 1000
# EVAL_BATCHES = 500
# EVAL_BATCHES_FINAL = 2500
# DISPLAY_INTERVAL = 200
# SNAPSHOT_INTERVAL = 0
PER_GPU_BATCH_SIZE = 2048
# LR = 0.001
# DROPOUT_RATE = 0.5
# NUM_WORKERS = 12
# LAYER_SIZES='[1024,512,256]'


WORKER_CMD = [
    'sh',
    '-euc',
    f'''python -m train_task --per_gpu_batch_size={PER_GPU_BATCH_SIZE} \
    --model_name={MODEL_NAME} --train_dir={TRAIN_DATA} \
    --valid_dir={VALID_DATA} \
    --schema={SCHEMA_PATH} \
    --workflow_dir={WORKFLOW_DIR} \
    --max_iter={MAX_ITERATIONS} --num_epochs={NUM_EPOCHS} --gpus={gpus} \
    --model_dir={MODEL_DIR} --distribute={DISTRIBUTE_STRATEGY} \
    --experiment_name={EXPERIMENT_NAME} --experiment_run={RUN_NAME}'''
]    

pprint(WORKER_CMD)

['sh',
 '-euc',
 'python -m train_task --per_gpu_batch_size=2048     --model_name=twotower '
 '--train_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/train     '
 '--valid_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/valid     '
 '--schema=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-defined/train/schema.pbtxt     '
 '--workflow_dir=gs://spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-analyzed     '
 '--max_iter=25000 --num_epochs=5 --gpus=[[0]]     '
 '--model_dir=gs://spotify-merlin-v1/model-dir/v13 --distribute=single     '
 '--experiment_name=nb-vtt-v13-single --experiment_run=run-20220913-223315']
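
The pretty-printed command contains runs of spaces left over from the indented continuation lines of the f-string. The shell tolerates them, but one alternative, sketched here as an option rather than the notebook's method, is to build the flags as a list and join them with `shlex.join`. The helper name `build_worker_cmd` is hypothetical, and only a subset of the flags is shown.

```python
import shlex


def build_worker_cmd(module: str, flags: dict) -> list:
    """Return ['sh', '-euc', '<command>'] with every flag safely quoted for the shell."""
    argv = ["python", "-m", module] + [f"--{name}={value}" for name, value in flags.items()]
    return ["sh", "-euc", shlex.join(argv)]


cmd = build_worker_cmd(
    "train_task",
    {
        "per_gpu_batch_size": 2048,
        "num_epochs": 5,
        "gpus": "[[0]]",        # brackets get quoted so the shell passes them through literally
        "distribute": "single",
    },
)
print(cmd[2])
```

Because `shlex.join` quotes each argument, values containing characters such as `[` or spaces reach `train_task` unchanged.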


### Create a custom training job

A custom training job can specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify.

* A running job on a given node is called a `replica`.
* A group of `replicas` with the same configuration is called a `worker_pool`.
* Vertex Training provides 4 `worker pools` to cover the different types of machine tasks.

To use the Reduction Server, you need 3 of the 4 available worker pools.

In [345]:
def prepare_worker_pool_specs(
    image_uri,
    # args,
    cmd,
    replica_count=1,
    machine_type="n1-standard-16",
    accelerator_count=1,
    accelerator_type="ACCELERATOR_TYPE_UNSPECIFIED",
    reduction_server_count=0,
    reduction_server_machine_type="n1-highcpu-16",
    reduction_server_image_uri="us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest",
):

    if accelerator_count > 0:
        machine_spec = {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        }
    else:
        machine_spec = {"machine_type": machine_type}

    container_spec = {
        "image_uri": image_uri,
        # "args": args,
        "command": cmd,
    }

    chief_spec = {
        "replica_count": 1,
        "machine_spec": machine_spec,
        "container_spec": container_spec,
    }

    worker_pool_specs = [chief_spec]
    if replica_count > 1:
        workers_spec = {
            "replica_count": replica_count - 1,
            "machine_spec": machine_spec,
            "container_spec": container_spec,
        }
        worker_pool_specs.append(workers_spec)
    # Add the reducer pool whenever at least one Reduction Server replica is requested
    if reduction_server_count > 0:
        workers_spec = {
            "replica_count": reduction_server_count,
            "machine_spec": {
                "machine_type": reduction_server_machine_type,
            },
            "container_spec": {"image_uri": reduction_server_image_uri},
        }
        worker_pool_specs.append(workers_spec)

    return worker_pool_specs

In [346]:
WORKER_POOL_SPECS = prepare_worker_pool_specs(
    image_uri=IMAGE_URI,
    # args=WORKER_ARGS,
    cmd=WORKER_CMD,
    replica_count=REPLICA_COUNT,
    machine_type=WORKER_MACHINE_TYPE,
    accelerator_count=PER_MACHINE_ACCELERATOR_COUNT,
    accelerator_type=ACCELERATOR_TYPE,
    reduction_server_count=REDUCTION_SERVER_COUNT,
    reduction_server_machine_type=REDUCTION_SERVER_MACHINE_TYPE,
)

from pprint import pprint
pprint(WORKER_POOL_SPECS)

[{'container_spec': {'command': ['sh',
                                 '-euc',
                                 'python -m train_task '
                                 '--per_gpu_batch_size=2048     '
                                 '--model_name=twotower '
                                 '--train_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/train     '
                                 '--valid_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/valid     '
                                 '--schema=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-defined/train/schema.pbtxt     '
                                 '--workflow_dir=gs://spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-analyzed     '
                                 '--max_iter=25000 --num_epochs=5 '
                                 '--gpus=[[0]]     '
                                 '--model_dir=gs://spotify-merlin-v1/model-dir/v13 '
 

### Submit and monitor train job

In [347]:
job_name = f'merlin_towers_{VERSION}_{time.strftime("%Y%m%d_%H%M%S")}'
base_output_dir =  os.path.join(WORKSPACE, job_name)

job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=WORKER_POOL_SPECS,
    base_output_dir=base_output_dir
)
job.run(
    sync=False,
    service_account=VERTEX_SA,
    restart_job_on_worker_restart=False,
    enable_web_access=True,
)

Creating CustomJob
CustomJob created. Resource name: projects/934903580331/locations/us-central1/customJobs/2727438923134402560
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/934903580331/locations/us-central1/customJobs/2727438923134402560')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2727438923134402560?project=934903580331
CustomJob projects/934903580331/locations/us-central1/customJobs/2727438923134402560 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2727438923134402560 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2727438923134402560 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2727438923134402560 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2727438

### Submit and monitor train job

In [None]:
# # full dataset - output of preprocessing pipeline
# TRAIN_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/train'
# VALID_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/valid'
# WORKFLOW_DIR = f'gs://{BUCKET}/nvt-preprocessing-spotify-v32-subset/nvt-analyzed'
# SCHEMA_PATH = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-defined/train/schema.pbtxt' # Schema used by the training pipeline

# # smaller dataset for testing
# # TRAIN_DATA = '/gcs/spotify-builtin-2t/merlin-processed/train/'
# # VALID_DATA = '/gcs/spotify-builtin-2t/merlin-processed/valid/'
# # WORKFLOW_DIR = 'gs://spotify-builtin-2t/merlin-processed/workflow/2t-spotify-workflow'

# # location to save trained model artifacts
# MODEL_DIR = f'gs://{BUCKET}/model-dir/{VERSION}'

# # # Single A100 GPU config
# # MACHINE_TYPE = 'a2-highgpu-1g'
# # ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
# # ACCELERATOR_NUM = 1

# # Multi A100 GPU config
# MACHINE_TYPE = 'a2-highgpu-2g'
# ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
# ACCELERATOR_NUM = 2

# # # Smaller GPU config
# # MACHINE_TYPE = "n1-standard-16"
# # ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
# # ACCELERATOR_NUM = 1

# gpus = json.dumps([list(range(ACCELERATOR_NUM))]).replace(' ','')

In [186]:
                 
# worker_pool_specs =  [
#     {
#         "machine_spec": {
#             "machine_type": MACHINE_TYPE,
#             "accelerator_type": ACCELERATOR_TYPE,
#             "accelerator_count": ACCELERATOR_NUM,
#         },
#         "replica_count": 1,
#         "container_spec": {
#             "image_uri": IMAGE_URI,
#             "command": ["python", "-m", "train_task"],
#             "args": [
#                 f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
#                 f'--model_name={MODEL_NAME}',
#                 f'--train_dir={TRAIN_DATA}',
#                 f'--valid_dir={VALID_DATA}',
#                 f'--schema={SCHEMA_PATH}',
#                 f'--workflow_dir={WORKFLOW_DIR}',
#                 # f'--layer_sizes={LAYER_SIZES}',
#                 # f'--slot_size_array={cardinalities}',
#                 f'--max_iter={MAX_ITERATIONS}',
#                 # f'--max_eval_batches={EVAL_BATCHES}',
#                 # f'--eval_batches={EVAL_BATCHES_FINAL}',
#                 # f'--dropout_rate={DROPOUT_RATE}',
#                 # f'--lr={LR}',
#                 # f'--num_workers={NUM_WORKERS}',
#                 f'--num_epochs={NUM_EPOCHS}',
#                 # f'--eval_interval={EVAL_INTERVAL}',
#                 # f'--snapshot={SNAPSHOT_INTERVAL}',
#                 # f'--display_interval={DISPLAY_INTERVAL}',
#                 f'--gpus={gpus}',
#                 # f'--train_dir, --valid_dir, --layer_sizes
#             ],
#         },
#     }
# ]
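# NOTE: MACHINE_TYPE, ACCELERATOR_TYPE and ACCELERATOR_NUM come from the GPU config cell above,
# which is currently commented out; uncomment one config there before running this cell.
worker_pool_specs =  [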
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            'command': ['sh','-euc',f'''
                    python -m train_task --per_gpu_batch_size={PER_GPU_BATCH_SIZE} \
                    --model_name={MODEL_NAME} --train_dir={TRAIN_DATA} \
                    --valid_dir={VALID_DATA} \
                    --schema={SCHEMA_PATH} \
                    --workflow_dir={WORKFLOW_DIR} \
                    --max_iter={MAX_ITERATIONS} --num_epochs={NUM_EPOCHS} --gpus={gpus} \
                    --model_dir={MODEL_DIR}
                    '''
            ]
        }
    }
]


In [187]:
from pprint import pprint

pprint(worker_pool_specs)

[{'container_spec': {'command': ['sh',
                                 '-euc',
                                 '\n'
                                 '                    python -m train_task '
                                 '--per_gpu_batch_size=2048                     '
                                 '--model_name=twotower '
                                 '--train_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/train                     '
                                 '--valid_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-processed/valid                     '
                                 '--schema=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-defined/train/schema.pbtxt                     '
                                 '--workflow_dir=gs://spotify-merlin-v1/nvt-preprocessing-spotify-v32-subset/nvt-analyzed                     '
                                 '--max_iter=25000 --num_epochs=10 '


In [188]:
job_name = 'merlin_towers_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
base_output_dir =  os.path.join(WORKSPACE, job_name)

job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=worker_pool_specs,
    base_output_dir=base_output_dir
)
job.run(
    sync=False,
    service_account=VERTEX_SA,
    restart_job_on_worker_restart=False,
    enable_web_access=True,
)

Creating CustomJob


## Archive

In [None]:
# worker_pool_specs =  [
#     {
#         "machine_spec": {
#             "machine_type": "a2-highgpu-1g",
#             "accelerator_type": "NVIDIA_TESLA_A100",
#             "accelerator_count": 1,
#         },
#         "replica_count": 1,
#         "container_spec": {
#             "image_uri": IMAGE_URI,
#             'command': ['sh','-euc','''
#                     python -m train_task --per_gpu_batch_size=2048 \
#                     --model_name=twotower --train_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train \
#                     --valid_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid \
#                     --schema=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-defined/train/schema.pbtxt \
#                     --workflow_dir=gs://spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-analyzed \
#                     --max_iter=25000 --num_epochs=2 --gpus=[[0]] --model_dir={MODEL_DIR}
#                     '''
#             ]
#         }
#     }
# ]
# spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed
# spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train