# Train Merlin TwoTower model

### Notebook Steps
* Build custom Vertex training container based on NVIDIA NGC Merlin Training container
* Configure and submit Vertex custom training job
* Configure and submit hyperparameter tuning job
* Evaluate results of hyperparameter tuning job

### Negative Sampling

* Merlin provides scalable negative sampling algorithms for the item retrieval task
* This example uses the in-batch sampling algorithm, which treats the items that other users in the same mini-batch interacted with as negatives
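
To make the in-batch idea concrete, here is a minimal pure-Python sketch of a softmax loss with in-batch negatives. It is an illustration only, not Merlin's implementation: the function name `in_batch_softmax_loss` and the toy embeddings are assumptions, and a production sampler may add refinements (for example popularity corrections) that are omitted here.

```python
import math

def in_batch_softmax_loss(user_emb, item_emb):
    """Mean softmax cross-entropy where row i's own item is the positive
    and the items of all other rows in the batch act as negatives.
    (Hypothetical helper for illustration, not a Merlin API.)"""
    n = len(user_emb)
    total = 0.0
    for i in range(n):
        # Dot-product score of user i against every item in the batch
        scores = [
            sum(u * v for u, v in zip(user_emb[i], item_emb[j]))
            for j in range(n)
        ]
        # Log-sum-exp with the row max subtracted for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the positive item (index i)
        total += log_z - scores[i]
    return total / n

# Uninformative embeddings: every item scores the same, so the loss is log(batch size)
zeros = [[0.0] * 8 for _ in range(4)]
loss_uniform = in_batch_softmax_loss(zeros, zeros)

# Each user strongly matches only its own item, so the loss is close to 0
aligned = [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
loss_aligned = in_batch_softmax_loss(aligned, aligned)
```

Because the negatives come from the batch itself, no extra item lookups are needed, which is why the approach scales well with batch size.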

### Setup

In [255]:
import json
import os
import time

from google.cloud import aiplatform as vertex_ai
from google.cloud.aiplatform import hyperparameter_tuning as hpt

In [256]:
# TODO: Project definitions
PROJECT_ID = 'hybrid-vertex' # Change to your project ID.
REGION = 'us-central1' # Change to your region.

# TODO: Service Account address
VERTEX_SA = '934903580331-compute@developer.gserviceaccount.com' # Change to your service account with Vertex AI Admin permissions.

### For HugeCTR data access

* must be a `/gcs/BUCKET_NAME/...` path for GCSFuse 
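
Vertex AI custom training exposes Cloud Storage buckets under `/gcs/` through Cloud Storage FUSE, so a `gs://` URI maps to a local path by swapping the prefix. A small sketch of that conversion follows; the helper name `gcs_uri_to_fuse_path` is hypothetical and not part of any SDK.

```python
def gcs_uri_to_fuse_path(uri: str) -> str:
    """Convert 'gs://BUCKET/path' to the '/gcs/BUCKET/path' form used by GCSFuse mounts.
    (Hypothetical helper for illustration.)"""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    return "/gcs/" + uri[len(prefix):]

# Bucket and folder names taken from this notebook
fuse_path = gcs_uri_to_fuse_path(
    "gs://spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train"
)
```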

In [257]:
# using GCSFuse file lists
TRAIN_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train/_gcs_file_list.txt'
VALID_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid/_gcs_file_list.txt'

# Schema used by the training pipeline
SCHEMA_PATH = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-defined/train/schema.pbtxt'

# Merlin Datasets
# train = MerlinDataset(output_train_dir + "/*.parquet", schema=schema, part_size="500MB")
# valid = MerlinDataset(output_valid_dir + "/*.parquet", schema=schema, part_size="500MB")

In [258]:
# Bucket definitions
BUCKET = 'spotify-merlin-v1'

VERSION = 'v8' # changed merlin image from "..:07" to "...:06"
MODEL_NAME = 'twotower'
FRAMEWORK = 'merlin-tf'
MODEL_DISPLAY_NAME = f'vertex-{FRAMEWORK}-{MODEL_NAME}-{VERSION}'
WORKSPACE = f'gs://{BUCKET}/{MODEL_DISPLAY_NAME}'

# Docker definitions for training
IMAGE_NAME = f'{FRAMEWORK}-{MODEL_NAME}-training-{VERSION}'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
# DOCKERNAME = 'hugectr'
DOCKERNAME = 'merlintf'
MACHINE_TYPE = 'e2-highcpu-8' # machine type for the Cloud Build image build; redefined later for training

### Initialize Vertex AI SDK

In [259]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=os.path.join(WORKSPACE, 'staging')
)

### Create Train Image

In [260]:
!pwd

/home/jupyter/spotify-merlin


In [261]:
REPO_DOCKER_PATH_PREFIX = 'src'

> `RUN pip install merlin-models==0.6.0`

In [262]:
%%writefile {REPO_DOCKER_PATH_PREFIX}/Dockerfile.{DOCKERNAME}

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:22.07

WORKDIR /src

RUN pip install -U pip
RUN pip install google-cloud-bigquery gcsfs cloudml-hypertune
RUN pip install google-cloud-aiplatform kfp
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg  add - && apt-get update -y && apt-get install google-cloud-sdk -y

COPY training/* ./

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/compat/lib.real:/usr/local/hugectr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib

Overwriting src/Dockerfile.merlintf


In [263]:
FILE_LOCATION = './src' # build context for Cloud Build; must be defined before it is printed
print(f"DOCKERNAME: {DOCKERNAME}")
print(f"IMAGE_URI: {IMAGE_URI}")
print(f"FILE_LOCATION: {FILE_LOCATION}")
print(f"MACHINE_TYPE: {MACHINE_TYPE}")

DOCKERNAME: merlintf
IMAGE_URI: gcr.io/hybrid-vertex/merlin-tf-twotower-training-v8
FILE_LOCATION: ./src
MACHINE_TYPE: e2-highcpu-8


### Build the training image with Cloud Build

In [264]:
os.chdir('/home/jupyter/spotify-merlin')
os.getcwd()

'/home/jupyter/spotify-merlin'

In [265]:
FILE_LOCATION = './src'
! gcloud builds submit --config src/cloudbuild.yaml --substitutions _DOCKERNAME=$DOCKERNAME,_IMAGE_URI=$IMAGE_URI,_FILE_LOCATION=$FILE_LOCATION --timeout=2h --machine-type=$MACHINE_TYPE

Creating temporary tarball archive of 65 file(s) totalling 952.0 KiB before compression.
Uploading tarball of [.] to [gs://hybrid-vertex_cloudbuild/source/1660959501.628554-a5fca4751d6d4322918dd47dacf52ff4.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/hybrid-vertex/locations/global/builds/91aa1773-68c9-40df-be03-879781ae8de6].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/91aa1773-68c9-40df-be03-879781ae8de6?project=934903580331].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "91aa1773-68c9-40df-be03-879781ae8de6"

FETCHSOURCE
Fetching storage object: gs://hybrid-vertex_cloudbuild/source/1660959501.628554-a5fca4751d6d4322918dd47dacf52ff4.tgz#1660959501988104
Copying gs://hybrid-vertex_cloudbuild/source/1660959501.628554-a5fca4751d6d4322918dd47dacf52ff4.tgz#1660959501988104...
/ [1 files][123.4 KiB/123.4 KiB]                                                
Operation completed over 1 objects/123.

In [266]:
# Training parameters
NUM_EPOCHS = 2
MAX_ITERATIONS = 25000
EVAL_INTERVAL = 1000
EVAL_BATCHES = 500
EVAL_BATCHES_FINAL = 2500
DISPLAY_INTERVAL = 200
SNAPSHOT_INTERVAL = 0
PER_GPU_BATCH_SIZE = 2048
LR = 0.001
DROPOUT_RATE = 0.5
NUM_WORKERS = 12
LAYER_SIZES='[1024,512,256]'

In [267]:
# Parse the string into a list first; list(LAYER_SIZES) would split it into single characters
layers = json.dumps(json.loads(LAYER_SIZES)).replace(' ','')
layers

'[1024,512,256]'

## Vertex Training

* See [here](https://cloud.google.com/vertex-ai/docs/training/configure-compute#specifying_gpus) for GPU config options

In [268]:
TRAIN_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train' #/_gcs_file_list.txt'
VALID_DATA = '/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid' #/_gcs_file_list.txt'
WORKFLOW_DIR = f'gs://{BUCKET}/nvt-preprocessing-spotify-v24/nvt-analyzed'

In [269]:
# MACHINE_TYPE = 'a2-highgpu-1g'
# ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
# ACCELERATOR_NUM = 1

# Smaller GPU config
MACHINE_TYPE = "n1-standard-16"
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
ACCELERATOR_NUM = 1

# JSON-encoded nested list of GPU indices, e.g. '[[0]]' for a single GPU
gpus = json.dumps([list(range(ACCELERATOR_NUM))]).replace(' ','')

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": ACCELERATOR_TYPE,
            "accelerator_count": ACCELERATOR_NUM,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": ["python", "-m", "train_task"],
            "args": [
                f'--per_gpu_batch_size={PER_GPU_BATCH_SIZE}',
                f'--model_name={MODEL_NAME}',
                f'--train_dir={TRAIN_DATA}',
                f'--valid_dir={VALID_DATA}',
                f'--schema={SCHEMA_PATH}',
                f'--workflow_dir={WORKFLOW_DIR}',
                # f'--layer_sizes={LAYER_SIZES}',
                # f'--slot_size_array={cardinalities}',
                f'--max_iter={MAX_ITERATIONS}',
                # f'--max_eval_batches={EVAL_BATCHES}',
                # f'--eval_batches={EVAL_BATCHES_FINAL}',
                # f'--dropout_rate={DROPOUT_RATE}',
                # f'--lr={LR}',
                # f'--num_workers={NUM_WORKERS}',
                f'--num_epochs={NUM_EPOCHS}',
                # f'--eval_interval={EVAL_INTERVAL}',
                # f'--snapshot={SNAPSHOT_INTERVAL}',
                # f'--display_interval={DISPLAY_INTERVAL}',
                f'--gpus={gpus}',
                # f'--train_dir, --valid_dir, --layer_sizes
            ],
        },
    }
]
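
The `args` list above is passed to `python -m train_task` inside the container. The `train_task` module is not shown in this notebook, so the following is only a sketch of how such a module might parse these flags with `argparse`; the types and defaults are assumptions.

```python
import argparse
import json

def parse_args(argv):
    """Parse the flags built in worker_pool_specs (hypothetical sketch of train_task's CLI)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--per_gpu_batch_size", type=int, default=2048)
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--train_dir", type=str)
    parser.add_argument("--valid_dir", type=str)
    parser.add_argument("--schema", type=str)
    parser.add_argument("--workflow_dir", type=str)
    parser.add_argument("--max_iter", type=int)
    parser.add_argument("--num_epochs", type=int)
    # '--gpus=[[0]]' arrives as a JSON string and is decoded into a nested list
    parser.add_argument("--gpus", type=json.loads, default=[[0]])
    return parser.parse_args(argv)

# Same flag format as the container args in this notebook
args = parse_args([
    "--per_gpu_batch_size=2048",
    "--model_name=twotower",
    "--max_iter=25000",
    "--num_epochs=2",
    "--gpus=[[0]]",
])
```

Decoding `--gpus` with `json.loads` keeps the notebook side simple: it only has to `json.dumps` the list and strip spaces so the value survives as a single argument.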

In [270]:
from pprint import pprint

pprint(worker_pool_specs)

[{'container_spec': {'args': ['--per_gpu_batch_size=2048',
                              '--model_name=twotower',
                              '--train_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/train',
                              '--valid_dir=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-processed/valid',
                              '--schema=/gcs/spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-defined/train/schema.pbtxt',
                              '--workflow_dir=gs://spotify-merlin-v1/nvt-preprocessing-spotify-v24/nvt-analyzed',
                              '--max_iter=25000',
                              '--num_epochs=2',
                              '--gpus=[[0]]'],
                     'command': ['python', '-m', 'train_task'],
                     'image_uri': 'gcr.io/hybrid-vertex/merlin-tf-twotower-training-v8'},
  'machine_spec': {'accelerator_count': 1,
                   'accelerator_type': 'NVIDIA_TESLA_T4',
          

### Submit and monitor train job

In [271]:
job_name = 'merlin_towers_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
base_output_dir =  os.path.join(WORKSPACE, job_name)

job = vertex_ai.CustomJob(
    display_name=job_name,
    worker_pool_specs=worker_pool_specs,
    base_output_dir=base_output_dir
)
job.run(
    sync=True,
    service_account=VERTEX_SA,
    restart_job_on_worker_restart=False
)

Creating CustomJob
CustomJob created. Resource name: projects/934903580331/locations/us-central1/customJobs/7059943546106675200
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/934903580331/locations/us-central1/customJobs/7059943546106675200')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7059943546106675200?project=934903580331
CustomJob projects/934903580331/locations/us-central1/customJobs/7059943546106675200 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/7059943546106675200 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/7059943546106675200 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/7059943546106675200 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/934903580331/locations/us-central1/customJobs/7059943

RuntimeError: Job failed with:
code: 3
message: "The replica workerpool0-0 exited with a non-zero status of 1. To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=934903580331&resource=ml_job%2Fjob_id%2F7059943546106675200&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%227059943546106675200%22"
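
The error message only says that the replica exited with status 1; the actual traceback is in Cloud Logging. As a small aid, the sketch below decodes the log filter embedded in the URL from the message above, using only the standard library. The helper name `log_filter_from_url` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def log_filter_from_url(url: str) -> str:
    """Return the decoded 'advancedFilter' query parameter of a Cloud Logging viewer URL.
    (Hypothetical helper for illustration.)"""
    query = parse_qs(urlparse(url).query)
    return query["advancedFilter"][0]

# URL copied from the RuntimeError message above
url = (
    "https://console.cloud.google.com/logs/viewer?project=934903580331"
    "&resource=ml_job%2Fjob_id%2F7059943546106675200"
    "&advancedFilter=resource.type%3D%22ml_job%22%0A"
    "resource.labels.job_id%3D%227059943546106675200%22"
)
log_filter = log_filter_from_url(url)
print(log_filter)
```

The decoded filter has two lines, one restricting the resource type to `ml_job` and one selecting the job ID. It can be pasted into the Logs Explorer query box to see the container's stderr for this job.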
