# Train Merlin TwoTower model

### Notebook Steps
* Build custom Vertex training container based on NVIDIA NGC Merlin Training container
* Configure and submit Vertex custom training job
* Configure and submit hyperparameter tuning job
* Evaluate results of hyperparameter tuning job

### Negative Sampling

* Merlin provides scalable negative sampling algorithms for the Item Retrieval Task 
* This example uses the in-batch sampling algorithm, which treats the items that other users in the same mini-batch interacted with as negatives
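To make the in-batch idea concrete, here is a minimal pure-Python sketch of a softmax loss with in-batch negatives. It is not Merlin's implementation; the function name `in_batch_softmax_loss` and the toy embeddings are assumptions made for illustration only.

```python
import math

def in_batch_softmax_loss(user_emb, item_emb):
    """Illustrative only: mean softmax cross-entropy with in-batch negatives.

    Row i of user_emb is paired with row i of item_emb (the positive).
    Every other item in the batch serves as a negative for user i.
    """
    batch = len(user_emb)
    total = 0.0
    for i in range(batch):
        # Dot-product score of user i against every item in the batch.
        scores = [
            sum(u * v for u, v in zip(user_emb[i], item_emb[j]))
            for j in range(batch)
        ]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the positive item (index i).
        total += log_norm - scores[i]
    return total / batch

# Toy batch of two user/item pairs (hypothetical values).
users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
loss = in_batch_softmax_loss(users, items)
```

No separate negative sampler is needed here: the negatives come for free from the other rows of the batch, which is why the approach scales well with batch size.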

### Setup

In [5]:
import json
import os
import time

from google.cloud import aiplatform as vertex_ai
from google.cloud.aiplatform import hyperparameter_tuning as hpt

In [6]:
# TODO: Project definitions
PROJECT_ID = 'hybrid-vertex' # Change to your project ID.
REGION = 'us-central1' # Change to your region.

# TODO: Service Account address
VERTEX_SA = '934903580331-compute@developer.gserviceaccount.com' # Change to your service account with Vertex AI Admin permissions.

### For HugeCTR data access

* Data paths must use the `/gcs/BUCKET_NAME/...` form so that they resolve through GCSFuse
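If your paths start out as `gs://` URIs, a small helper can rewrite them into the `/gcs/...` form. The sketch below is a hypothetical convenience function (`gs_to_gcsfuse` is not part of any SDK) and only does string manipulation; it does not check that the object exists.

```python
def gs_to_gcsfuse(uri):
    """Convert gs://bucket/path to /gcs/bucket/path (hypothetical helper)."""
    prefix = 'gs://'
    if not uri.startswith(prefix):
        raise ValueError(f'expected a gs:// URI, got {uri!r}')
    return '/gcs/' + uri[len(prefix):]

# Example with the bucket used in this notebook.
print(gs_to_gcsfuse('gs://spotify-merlin-v1/nvt-processed/train'))
# /gcs/spotify-merlin-v1/nvt-processed/train
```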

In [7]:
# using GCSFuse file lists
TRAIN_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train/_gcs_file_list.txt'
VALID_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid/_gcs_file_list.txt'

# Schema used by the training pipeline
SCHEMA_PATH = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-defined/train/schema.pbtxt'

# Merlin Datasets
# train = MerlinDataset(output_train_dir + "/*.parquet", schema=schema, part_size="500MB")
# valid = MerlinDataset(output_valid_dir + "/*.parquet", schema=schema, part_size="500MB")

In [34]:
# Bucket definitions
BUCKET = 'spotify-merlin-v1'

VERSION = 'v02'
MODEL_NAME = 'twotower'
MODEL_DISPLAY_NAME = f'vertex-{MODEL_NAME}-{VERSION}'
WORKSPACE = f'gs://{BUCKET}/{MODEL_DISPLAY_NAME}'

# Docker definitions for training
IMAGE_NAME = f'{MODEL_NAME}-training-{VERSION}'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
DOCKERNAME = 'hugectr'

### Initialize Vertex AI SDK

In [35]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=os.path.join(WORKSPACE, 'stg')
)

### Submit a Vertex custom training job

In [42]:
os.chdir('/home/jupyter/spotify-merlin')
os.getcwd()

'/home/jupyter/spotify-merlin'

In [43]:
FILE_LOCATION = './src'
! gcloud builds submit --config src/cloudbuild.yaml --substitutions _DOCKERNAME=$DOCKERNAME,_IMAGE_URI=$IMAGE_URI,_FILE_LOCATION=$FILE_LOCATION --timeout=2h --machine-type=e2-highcpu-8

Creating temporary tarball archive of 63 file(s) totalling 952.5 KiB before compression.
Uploading tarball of [.] to [gs://hybrid-vertex_cloudbuild/source/1659095745.612645-31576f0d950b41ffbe805eeab68a4bbc.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/hybrid-vertex/locations/global/builds/6e2cac91-8728-4da5-b23b-89ee0d12b974].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/6e2cac91-8728-4da5-b23b-89ee0d12b974?project=934903580331].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "6e2cac91-8728-4da5-b23b-89ee0d12b974"

FETCHSOURCE
Fetching storage object: gs://hybrid-vertex_cloudbuild/source/1659095745.612645-31576f0d950b41ffbe805eeab68a4bbc.tgz#1659095745976092
Copying gs://hybrid-vertex_cloudbuild/source/1659095745.612645-31576f0d950b41ffbe805eeab68a4bbc.tgz#1659095745976092...
/ [1 files][122.6 KiB/122.6 KiB]                                                
Operation completed over 1 objects/122.

In [44]:
# Training parameters
NUM_EPOCHS = 1
MAX_ITERATIONS = 25000
EVAL_INTERVAL = 1000
EVAL_BATCHES = 500
EVAL_BATCHES_FINAL = 2500
DISPLAY_INTERVAL = 200
SNAPSHOT_INTERVAL = 0
PER_GPU_BATCH_SIZE = 2048
LR = 0.001
DROPOUT_RATE = 0.5
NUM_WORKERS = 12

## Vertex Training

In [45]:
gpus

'[[0]]'

In [46]:
MACHINE_TYPE = 'a2-highgpu-1g'
ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
ACCELERATOR_NUM = 1

gpus = json.dumps([list(range(ACCELERATOR_NUM))]).replace(' ','')

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["python", "-m", "train_task"],
            "args": [
                f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
                f'--model_name={MODEL_NAME}',
                f'--train_data={TRAIN_DATA}', 
                f'--valid_data={VALID_DATA}',
                f'--schema={SCHEMA_PATH}',
                # f'--slot_size_array={cardinalities}',
                f'--max_iter={MAX_ITERATIONS}',
                # f'--max_eval_batches={EVAL_BATCHES}',
                # f'--eval_batches={EVAL_BATCHES_FINAL}',
                # f'--dropout_rate={DROPOUT_RATE}',
                # f'--lr={LR}',
                # f'--num_workers={NUM_WORKERS}',
                f'--num_epochs={NUM_EPOCHS}',
                # f'--eval_interval={EVAL_INTERVAL}',
                # f'--snapshot={SNAPSHOT_INTERVAL}',
                f'--display_interval={DISPLAY_INTERVAL}',
                f'--gpus={gpus}',
            ],
        },
    }
]
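The `--gpus` argument above is a nested list serialized as JSON with the spaces removed. The sketch below wraps the same expression in a hypothetical helper, `format_gpus`, to show what the string looks like for other accelerator counts. How `train_task` interprets the value is not shown in this notebook.

```python
import json

def format_gpus(accelerator_num):
    """Build the --gpus string the same way the cell above does."""
    return json.dumps([list(range(accelerator_num))]).replace(' ', '')

print(format_gpus(1))  # [[0]]
print(format_gpus(4))  # [[0,1,2,3]]
```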

In [47]:
from pprint import pprint

pprint(worker_pool_specs)

[{'container_spec': {'args': ['--per_gpu_batch_size=2048',
                              '--model_name=twotower',
                              '--train_data=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train/_gcs_file_list.txt',
                              '--valid_data=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid/_gcs_file_list.txt',
                              '--schema=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-defined/train/schema.pbtxt',
                              '--max_iter=25000',
                              '--num_epochs=1',
                              '--display_interval=200',
                              '--gpus=[[0]]'],
                     'command': ['python', '-m', 'train_task'],
                     'image_uri': 'gcr.io/hybrid-vertex/twotower-training-v02'},
  'machine_spec': {'accelerator_count': 1,
                   'accelerator_type': 'NVIDIA_TESLA_A100',
                   'machine_type': 

### Submit and monitor train job

In [48]:
job_name = 'merlin_towers_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
base_output_dir = os.path.join(WORKSPACE, job_name)

job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=worker_pool_specs,
    base_output_dir=base_output_dir
)
job.run(
    sync=True,
    service_account=VERTEX_SA,
    restart_job_on_worker_restart=False
)

Creating CustomJob
CustomJob created. Resource name: projects/934903580331/locations/us-central1/customJobs/2296024645255561216
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/934903580331/locations/us-central1/customJobs/2296024645255561216')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/2296024645255561216?project=934903580331
CustomJob projects/934903580331/locations/us-central1/customJobs/2296024645255561216 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2296024645255561216 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2296024645255561216 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2296024645255561216 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/2296024

RuntimeError: Job failed with:
code: 3
message: "The replica workerpool0-0 exited with a non-zero status of 2. To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=934903580331&resource=ml_job%2Fjob_id%2F2296024645255561216&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%222296024645255561216%22"
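The linked job logs are the authoritative source for why the replica exited. One possibility, which the output here does not confirm, is worth knowing about: a Python entry point that parses its flags with `argparse` exits with status 2 when it receives an unrecognized or malformed flag. The sketch below reproduces that behaviour locally with a hypothetical stand-in parser. The real `train_task` flags are not shown in this notebook, so the argument names are assumptions.

```python
import argparse
import contextlib
import io

def exit_code_for(parser, argv):
    """Return 0 if argv parses cleanly, else the exit code argparse raises."""
    try:
        # Suppress the usage message that argparse writes to stderr.
        with contextlib.redirect_stderr(io.StringIO()):
            parser.parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return 0

# Hypothetical stand-in for the training script's parser.
parser = argparse.ArgumentParser(prog='train_task')
parser.add_argument('--max_iter', type=int)
parser.add_argument('--gpus', type=str)

ok = exit_code_for(parser, ['--max_iter=25000', '--gpus=[[0]]'])
bad = exit_code_for(parser, ['--display_interval=200'])  # flag the parser does not define
print(ok, bad)  # 0 2
```

Running the container's argument list through the real parser in this way, before submitting, can catch a flag mismatch without waiting for a job to provision and fail.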
