In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Guide to Building End-to-End Reinforcement Learning Application Pipelines using Vertex AI

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/tree/master/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/mlops_pipeline_tf_agents_bandits_movie_recommendation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/master/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/mlops_pipeline_tf_agents_bandits_movie_recommendation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview

This demo showcases the use of [TF-Agents](https://www.tensorflow.org/agents), [Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/components/pipelines/overview/pipelines-overview/) and [Vertex AI](https://cloud.google.com/vertex-ai), particularly [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines), in building an end-to-end reinforcement learning (RL) pipeline for a movie recommendation system. The demo is intended for developers who want to create RL applications using TensorFlow, TF-Agents and Vertex AI services, and for those who want to build end-to-end production pipelines using KFP and Vertex Pipelines. Familiarity with RL, the contextual bandits formulation, and the TF-Agents interface is recommended.

### Dataset

This demo uses the [MovieLens 100K](https://www.kaggle.com/prajitdatta/movielens-100k-dataset) dataset to simulate an environment with users and their respective preferences. It is available at `gs://cloud-samples-data/vertex-ai/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/u.data`.

### Objective

In this notebook, you will learn how to build an end-to-end RL pipeline for a movie recommendation system based on TF-Agents (particularly its bandits module), using [KFP](https://www.kubeflow.org/docs/components/pipelines/overview/pipelines-overview/) and [Vertex AI](https://cloud.google.com/vertex-ai), particularly [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines), which is fully managed and highly scalable.

This Vertex Pipeline includes the following components:
1. *Generator* to generate MovieLens simulation data
2. *Ingester* to ingest data
3. *Trainer* to train the RL policy
4. *Deployer* to deploy the trained policy to a Vertex AI endpoint

After pipeline construction, you (1) create the *Simulator* (which utilizes Cloud Functions, Cloud Scheduler and Pub/Sub) to send simulated MovieLens prediction requests, (2) create the *Logger* to asynchronously log prediction inputs and results (which utilizes Cloud Functions, Pub/Sub and a hook in the prediction code), and (3) create the *Trigger* to trigger recurrent re-training.

A more general ML pipeline is demonstrated in [MLOps on Vertex AI](https://github.com/ksalama/ucaip-labs).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Build
* Cloud Functions
* Cloud Scheduler
* Cloud Storage
* Pub/Sub

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [Cloud Build pricing](https://cloud.google.com/build/pricing), [Cloud Functions pricing](https://cloud.google.com/functions/pricing), [Cloud Scheduler pricing](https://cloud.google.com/scheduler/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [Pub/Sub pricing](https://cloud.google.com/pubsub/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Google Cloud Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages

Install additional package dependencies not installed in your notebook environment, such as the Kubeflow Pipelines (KFP) SDK.

In [1]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [2]:
# pip3 install --user google-cloud-aiplatform
# pip3 install --user google-cloud-pipeline-components
# pip3 install --user --upgrade kfp
# pip3 install --user numpy
# pip3 install --user --upgrade tensorflow
# pip3 install --user --upgrade pillow
# pip3 install --user --upgrade tf-agents
# pip3 install --user --upgrade fastapi

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [3]:
# Automatically restart kernel after installs
import os

# if not os.getenv("IS_TESTING"):
#     # Automatically restart kernel after installs
#     import IPython

#     app = IPython.Application.instance()
#     app.kernel.do_shutdown(True)

## Before you begin

### Select a GPU runtime

**Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select "Runtime > Change runtime type > GPU".**

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API, BigQuery API, Cloud Build, Cloud Functions, Cloud Scheduler, Cloud Storage, and Pub/Sub API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,bigquery.googleapis.com,build.googleapis.com,functions.googleapis.com,scheduler.googleapis.com,storage.googleapis.com,pubsub.googleapis.com).

1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [4]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Project ID:  hybrid-vertex


Otherwise, set your project ID here.

In [5]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.

In [6]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
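
The timestamp can be appended to any resource name that must be unique per session. The following is a minimal sketch of that idea; the helper `timestamped_name` and the example resource name are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import datetime


def timestamped_name(base_name: str, timestamp: str) -> str:
    """Append a session timestamp to a resource name, separated by a hyphen."""
    return f"{base_name}-{timestamp}"


# A fixed datetime is used here so that the result is reproducible;
# in the notebook you would pass TIMESTAMP instead.
example_ts = datetime(2023, 7, 3, 18, 29, 26).strftime("%Y%m%d%H%M%S")
example_name = timestamped_name("movielens-endpoint", example_ts)
print(example_name)  # movielens-endpoint-20230703182926
```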

### Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [7]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

In this tutorial, a Cloud Storage bucket holds the MovieLens dataset files to be used for model training. Vertex AI also saves the trained model that results from your training job in the same bucket. Using this model artifact, you can then create Vertex AI model and endpoint resources in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI. Also note that Vertex
Pipelines is currently only supported in select regions such as "us-central1" ([reference](https://cloud.google.com/vertex-ai/docs/general/locations)).

In [8]:
VERSION='v3'

In [9]:
BUCKET_NAME = f"gs://tf-agents-bandits-{VERSION}"  # @param {type:"string"} The bucket should be in the same region as Vertex AI, and it must not be multi-regional for custom training jobs to work.
REGION = "us-central1"  # @param {type:"string"}

In [10]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP
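
Before creating the bucket, it can help to check the name locally. The sketch below covers only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, hyphens, underscores and dots; starting and ending with a letter or digit). It is an assumption-laden convenience check, not a replacement for the service-side validation, and the helper name `is_valid_bucket_uri` is hypothetical.

```python
import re

# Basic naming rules only: 3-63 characters, lowercase letters, digits,
# '-', '_' and '.', starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_valid_bucket_uri(bucket_uri: str) -> bool:
    """Return True if the URI looks like gs://<valid-bucket-name>."""
    prefix = "gs://"
    if not bucket_uri.startswith(prefix):
        return False
    return bool(_BUCKET_NAME_RE.match(bucket_uri[len(prefix):]))


print(is_valid_bucket_uri("gs://tf-agents-bandits-v3"))  # True
print(is_valid_bucket_uri("gs://[your-bucket-name]"))    # False
```

A name that passes this check can still be rejected by `gsutil mb`, for example because it is already taken by another project.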

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [11]:
! gsutil mb -l $REGION $BUCKET_NAME

Creating gs://tf-agents-bandits-v3/...
ServiceException: 409 A Cloud Storage bucket named 'tf-agents-bandits-v3' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [12]:
! gsutil ls -al $BUCKET_NAME

### Import libraries and define constants

In [13]:
# pip install --upgrade 'kfp<2'
# pip install --upgrade 'google-cloud-pipeline-components<2'

In [14]:
import os
import sys

# # kfp
# import kfp
# import kfp.v2.dsl
# from kfp.v2.google import client as pipelines_client
# from kfp.v2.dsl import (
#     Artifact, Dataset, 
#     Input, InputPath, 
#     Model, Output,
#     OutputPath, component, 
#     Metrics
# )

from google.cloud import aiplatform
# from google_cloud_pipeline_components import aiplatform as gcc_aip
# from google_cloud_pipeline_components.v1.endpoint import (EndpointCreateOp,
#                                                           ModelDeployOp)
# from google_cloud_pipeline_components.v1.model import ModelUploadOp

import kfp
from kfp.v2 import compiler, dsl
# from kfp.v2.google.client import AIPlatformClient

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

In [14]:
print(f'kfp version: {kfp.__version__}')
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
print(f'vertex_ai SDK version: {aiplatform.__version__}')
# print(f'bigquery SDK version: {bigquery.__version__}')

kfp version: 2.0.1
google_cloud_pipeline_components version: 2.0.0
vertex_ai SDK version: 1.26.0
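
The versions printed above (kfp 2.0.1 and google_cloud_pipeline_components 2.0.0) fall outside the `<2` range that the commented-out `pip install` commands in this section ask for. A minimal sketch of a guard that makes such a mismatch visible early is shown below. It assumes plain dotted version strings, and the helper names are hypothetical.

```python
def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string such as '2.0.1'."""
    return int(version.split(".")[0])


def is_below_major(version: str, upper_major: int) -> bool:
    """Return True if the version satisfies a '<upper_major' style pin."""
    return major_version(version) < upper_major


print(is_below_major("2.0.1", 2))   # False
print(is_below_major("1.8.22", 2))  # True
```

In the notebook, you could call `is_below_major(kfp.__version__, 2)` after the import cell and decide whether to reinstall before continuing.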


#### Fill out the following configurations

In [15]:
# BigQuery parameters (used for the Generator, Ingester, Logger)
BIGQUERY_DATASET_ID = f"{PROJECT_ID}.movielens_dataset"  # @param {type:"string"} BigQuery dataset ID as `project_id.dataset_id`.
BIGQUERY_LOCATION = "us"  # @param {type:"string"} BigQuery dataset region.
BIGQUERY_TABLE_ID = f"{BIGQUERY_DATASET_ID}.training_dataset"  # @param {type:"string"} BigQuery table ID as `project_id.dataset_id.table_id`.
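
The table ID above is a dotted string of the form `project_id.dataset_id.table_id`. The following sketch splits such a string back into its parts, which can be handy for checking the configuration before running the pipeline. It assumes that none of the three parts contains a dot (so it does not handle domain-scoped project IDs), and `TableRef` and `parse_table_id` are hypothetical names.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    project_id: str
    dataset_id: str
    table_id: str


def parse_table_id(full_table_id: str) -> TableRef:
    """Split 'project_id.dataset_id.table_id' into its three parts."""
    parts = full_table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Expected 'project_id.dataset_id.table_id', got {full_table_id!r}"
        )
    return TableRef(*parts)


ref = parse_table_id("my-project.movielens_dataset.training_dataset")
print(ref.dataset_id)  # movielens_dataset
```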

#### Set additional configurations

You may use the default values below as is.

In [16]:
# Dataset parameters
RAW_DATA_PATH = f"{BUCKET_NAME}/raw_data/u.data"  # @param {type:"string"}

In [17]:
# Download the sample data into your RAW_DATA_PATH
! gsutil cp "gs://cloud-samples-data/vertex-ai/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/u.data" $RAW_DATA_PATH

Copying gs://cloud-samples-data/vertex-ai/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/u.data [Content-Type=application/octet-stream]...
/ [1 files][  1.9 MiB/  1.9 MiB]                                                
Operation completed over 1 objects/1.9 MiB.                                      


In [18]:
# Pipeline parameters
PIPELINE_NAME = "movielens-pipeline"  # Pipeline display name.
ENABLE_CACHING = False  # Whether to enable execution caching for the pipeline.
PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline"  # Root directory for pipeline artifacts.
PIPELINE_SPEC_PATH = "metadata_pipeline.json"  # Path to pipeline specification file.
OUTPUT_COMPONENT_SPEC = "output-component.yaml"  # Output component specification file.

# BigQuery parameters (used for the Generator, Ingester, Logger)
BIGQUERY_TMP_FILE = (
    "tmp.json"  # Temporary file for storing data to be loaded into BigQuery.
)
BIGQUERY_MAX_ROWS = 5  # Maximum number of rows of data in BigQuery to ingest.

# Dataset parameters
TFRECORD_FILE = (
    f"{BUCKET_NAME}/trainer_input_path/*"  # TFRecord file to be used for training.
)

# Logger parameters (also used for the Logger hook in the prediction container)
LOGGER_PUBSUB_TOPIC = "logger-pubsub-topic"  # Pub/Sub topic name for the Logger.
LOGGER_CLOUD_FUNCTION = "logger-cloud-function"  # Cloud Functions name for the Logger.

## Create the RL pipeline components

This section consists of the following steps:
1. Create the *Generator* to generate MovieLens simulation data
2. Create the *Ingester* to ingest data
3. Create the *Trainer* to train the RL policy
4. Create the *Deployer* to deploy the trained policy to a Vertex AI endpoint

After pipeline construction, create the *Simulator* to send simulated MovieLens prediction requests, create the *Logger* to asynchronously log prediction inputs and results, and create the *Trigger* to trigger re-training.

Here's the entire workflow:
1. The startup pipeline has the following components: Generator --> Ingester --> Trainer --> Deployer. This pipeline only runs once.
2. Then, the Simulator generates prediction requests (e.g. every 5 minutes), and the Logger is invoked at each prediction request and asynchronously logs it to BigQuery. The Trigger runs the re-training pipeline (e.g. every 30 minutes) with the following components: Ingester --> Trainer --> Deployer.
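
To make the timing of this workflow concrete, the sketch below lists which events would fire over a period of time, using the example intervals from the list above (5 minutes for the Simulator, 30 minutes for the Trigger). It is only an illustration of the schedule. The function name and event labels are hypothetical, and the real scheduling is done by Cloud Scheduler.

```python
def schedule_events(total_minutes, simulator_every=5, trigger_every=30):
    """Return (minute, event) pairs for the recurring parts of the workflow."""
    events = []
    for minute in range(1, total_minutes + 1):
        if minute % simulator_every == 0:
            # Each simulated prediction request is also logged by the Logger.
            events.append((minute, "simulate_and_log"))
        if minute % trigger_every == 0:
            # The Trigger re-runs Ingester --> Trainer --> Deployer.
            events.append((minute, "retrain"))
    return events


events = schedule_events(60)
num_requests = sum(1 for _, name in events if name == "simulate_and_log")
num_retrains = sum(1 for _, name in events if name == "retrain")
print(num_requests, num_retrains)  # 12 2
```

With these intervals, each re-training run sees roughly six new batches of logged prediction requests.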

You can find the KFP SDK documentation [here](https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/).

### Create the *Generator* to generate MovieLens simulation data

Create the Generator component to generate the initial set of training data using a MovieLens simulation environment and a random data-collecting policy. Store the generated data in BigQuery.

The Generator source code is [`src/generator/generator_component.py`](src/generator/generator_component.py).
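
As a rough illustration of what "a random data-collecting policy" means, the sketch below samples a user, picks a movie uniformly at random, and records the resulting reward. It is a toy stand-in under stated assumptions, not the Generator's actual code: the `reward_table` lookup and the function name are hypothetical, and the real component uses the TF-Agents MovieLens environment.

```python
import random


def collect_random_data(reward_table, num_steps, seed=0):
    """Collect (observation, action, reward) records with a uniform random policy.

    reward_table[user][movie] is a toy stand-in for the simulated environment.
    """
    rng = random.Random(seed)
    num_users = len(reward_table)
    num_actions = len(reward_table[0])
    records = []
    for _ in range(num_steps):
        user = rng.randrange(num_users)      # the environment presents a user
        action = rng.randrange(num_actions)  # the random policy ignores the user
        records.append(
            {"observation": user, "action": action, "reward": reward_table[user][action]}
        )
    return records


toy_rewards = [[5.0, 1.0, 3.0], [2.0, 4.0, 0.0]]
records = collect_random_data(toy_rewards, num_steps=10)
print(len(records))
```

Because the policy ignores the observation, the collected data covers all actions roughly evenly, which is useful for bootstrapping the first training run.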

#### Run unit tests on the Generator component

Before running the command, you should update the `RAW_DATA_PATH` in [`src/generator/test_generator_component.py`](src/generator/test_generator_component.py).

In [21]:
! python3 -m unittest src.generator.test_generator_component

2023-07-03 18:29:26.452283: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
Implementing implicit namespace packages (as specified in PEP 420) 

### Create the *Ingester* to ingest data

Create the Ingester component to ingest data from BigQuery, package them as `tf.train.Example` objects, and output TFRecord files.

Read more about `tf.train.Example` and TFRecord [here](https://www.tensorflow.org/tutorials/load_data/tfrecord).

The Ingester component source code is in [`src/ingester/ingester_component.py`](src/ingester/ingester_component.py).

#### Run unit tests on the Ingester component

In [23]:
! python3 -m unittest src.ingester.test_ingester_component

2023-07-03 18:30:13.388724: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
.2023-07-03 18:30:15.205458: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentatio

### Create the *Trainer* to train the RL policy

Create the Trainer component to train an RL policy on the training dataset, and then submit a remote custom training job to Vertex AI. This component trains a policy using the TF-Agents LinUCB agent on the MovieLens simulation dataset, and saves the trained policy as a SavedModel.

The Trainer component source code is in [`src/trainer/trainer_component.py`](src/trainer/trainer_component.py). You use additional Vertex AI platform code in pipeline construction to submit the training code defined in Trainer as a custom training job to Vertex AI. (The additional code is similar to what [`kfp.v2.google.experimental.run_as_aiplatform_custom_job`](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/v2/google/experimental/custom_job.py) does. You can find an example notebook [here](https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/official/pipelines/google_cloud_pipeline_components_model_train_upload_deploy.ipynb) for how to use that first-party Trainer component.)

The Trainer performs off-policy training, where you train a policy on a static set of pre-collected data records containing information including observation, action and reward. For a data record, the policy in training might not output the same action given the observation in that data record.
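
The sketch below illustrates off-policy training with a small, pure-Python LinUCB-style learner. It is not the TF-Agents implementation: it keeps one ridge-regression model per action, uses a Sherman-Morrison update to maintain the inverse matrix, and all class and function names are hypothetical. The parameters `tikhonov_weight` and `alpha` play the roles of the regularization weight and the exploration strength.

```python
import math


class LinUCBArm:
    """Per-action ridge-regression state, stored as A^{-1} and b."""

    def __init__(self, dim, tikhonov_weight):
        # A starts as tikhonov_weight * I, so A^{-1} = I / tikhonov_weight.
        self.a_inv = [
            [1.0 / tikhonov_weight if i == j else 0.0 for j in range(dim)]
            for i in range(dim)
        ]
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def update(self, x, reward):
        # Sherman-Morrison rank-1 update of A^{-1} for A <- A + x x^T.
        n = len(x)
        v = self._a_inv_times(x)
        denom = 1.0 + sum(x[i] * v[i] for i in range(n))
        self.a_inv = [
            [self.a_inv[i][j] - v[i] * v[j] / denom for j in range(n)]
            for i in range(n)
        ]
        self.b = [self.b[i] + reward * x[i] for i in range(n)]

    def ucb(self, x, alpha):
        # Estimated reward plus alpha times the confidence width.
        n = len(x)
        theta = self._a_inv_times(self.b)
        mean = sum(theta[i] * x[i] for i in range(n))
        v = self._a_inv_times(x)
        width = math.sqrt(sum(x[i] * v[i] for i in range(n)))
        return mean + alpha * width


def train_off_policy(records, num_actions, dim, tikhonov_weight=0.01):
    """Fit one model per action from logged (observation, action, reward) records."""
    arms = [LinUCBArm(dim, tikhonov_weight) for _ in range(num_actions)]
    for observation, action, reward in records:
        arms[action].update(observation, reward)
    return arms


def choose_action(arms, observation, alpha):
    scores = [arm.ucb(observation, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


# Logged data: the last record took action 1, which earned no reward.
logged = [([1.0, 0.0], 0, 1.0)] * 3 + [([1.0, 0.0], 1, 0.0)]
arms = train_off_policy(logged, num_actions=2, dim=2)
print(choose_action(arms, [1.0, 0.0], alpha=0.0))   # 0
print(choose_action(arms, [1.0, 0.0], alpha=10.0))  # 1
```

For the last logged record, the trained policy with `alpha=0.0` picks action 0 even though the record contains action 1, which is the mismatch described above. With a large `alpha`, the less-explored action wins because its confidence width is larger.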

If you're interested in pipeline metrics, read about [KFP Pipeline Metrics](https://www.kubeflow.org/docs/components/pipelines/sdk/pipelines-metrics/).

In [24]:
# Trainer parameters
TRAINING_ARTIFACTS_DIR = (
    f"{BUCKET_NAME}/artifacts"  # Root directory for training artifacts.
)
TRAINING_REPLICA_COUNT = 1  # Number of replicas to run the custom training job.
TRAINING_MACHINE_TYPE = (
    "n1-standard-4"  # Type of machine to run the custom training job.
)
TRAINING_ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"  # Type of accelerators to run the custom training job.
TRAINING_ACCELERATOR_COUNT = 0  # Number of accelerators for the custom training job.

#### Run unit tests on the Trainer component

In [25]:
! python3 -m unittest src.trainer.test_trainer_component

2023-07-03 18:30:28.812488: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
2023-07-03 18:30:30.944132: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation

### Create the *Deployer* to deploy the trained policy to a Vertex AI endpoint

Use [`google_cloud_pipeline_components.aiplatform`](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline#google-cloud-components) components during pipeline construction to:
1. Upload the trained policy
2. Create a Vertex AI endpoint
3. Deploy the uploaded trained policy to the endpoint

These three components form the Deployer. They support flexible configurations; for instance, if you want to set up traffic splitting for the endpoint to run A/B testing, you may pass in your configurations to [google_cloud_pipeline_components.aiplatform.ModelDeployOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-0.1.3/google_cloud_pipeline_components.aiplatform.html#google_cloud_pipeline_components.aiplatform.ModelDeployOp).
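
A traffic split maps deployed model IDs to integer percentages that must add up to 100, where the key `"0"` stands for the model being deployed by the current operation. This notebook uses `{"0": 100}`, which sends all traffic to the newly deployed policy. The sketch below checks a split before it is passed to the deploy step; the helper name and the ID `"1234567890"` are hypothetical.

```python
def validate_traffic_split(traffic_split):
    """Check that percentages are integers in [0, 100] and sum to 100."""
    for model_id, percentage in traffic_split.items():
        if not isinstance(percentage, int) or not 0 <= percentage <= 100:
            raise ValueError(f"Invalid percentage for model {model_id!r}: {percentage!r}")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"Traffic percentages must sum to 100, got {total}")
    return traffic_split


# All traffic goes to the model being deployed ("0").
full_split = validate_traffic_split({"0": 100})

# Hypothetical A/B setup: 20% to the new model, 80% to an existing deployed model.
ab_split = validate_traffic_split({"0": 20, "1234567890": 80})
print(ab_split)
```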

In [26]:
# Deployer parameters
TRAINED_POLICY_DISPLAY_NAME = (
    "movielens-trained-policy"  # Display name of the uploaded and deployed policy.
)
TRAFFIC_SPLIT = {"0": 100}
ENDPOINT_DISPLAY_NAME = "movielens-endpoint"  # Display name of the prediction endpoint.
ENDPOINT_MACHINE_TYPE = "n1-standard-4"  # Type of machine of the prediction endpoint.
ENDPOINT_REPLICA_COUNT = 1  # Number of replicas of the prediction endpoint.
ENDPOINT_ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"  # Type of accelerators for the prediction endpoint.
ENDPOINT_ACCELERATOR_COUNT = 0  # Number of accelerators for the prediction endpoint.

### Create a custom prediction container using Cloud Build

Before setting up the Deployer, define and build a custom prediction container that serves predictions using the trained policy. The source code, Cloud Build YAML configuration file and Dockerfile are in `src/prediction_container`.

This prediction container is the serving container for the deployed, trained policy. See a more detailed guide on building prediction custom containers [here](https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/master/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/step_by_step_sdk_tf_agents_bandits_movie_recommendation/step_by_step_sdk_tf_agents_bandits_movie_recommendation.ipynb).

In [27]:
# Prediction container parameters
PREDICTION_CONTAINER = "prediction-container"  # Name of the container image.
PREDICTION_CONTAINER_DIR = "src/prediction_container"

#### Create a Cloud Build YAML file using Kaniko build

Note: For this application, it is recommended to use E2_HIGHCPU_8 or another high-resource machine configuration instead of the standard machine type listed [here](https://cloud.google.com/build/docs/api/reference/rest/v1/projects.builds#Build.MachineType) to prevent out-of-memory errors.

In [29]:
cloudbuild_yaml = """steps:
- name: "gcr.io/kaniko-project/executor:latest"
  args: ["--destination=gcr.io/{PROJECT_ID}/{PREDICTION_CONTAINER}:latest",
         "--cache=false",
         "--cache-ttl=99h"]
  env: ["AIP_STORAGE_URI={ARTIFACTS_DIR}",
        "PROJECT_ID={PROJECT_ID}",
        "LOGGER_PUBSUB_TOPIC={LOGGER_PUBSUB_TOPIC}"]
options:
  machineType: "E2_HIGHCPU_8"
""".format(
    PROJECT_ID=PROJECT_ID,
    PREDICTION_CONTAINER=PREDICTION_CONTAINER,
    ARTIFACTS_DIR=TRAINING_ARTIFACTS_DIR,
    LOGGER_PUBSUB_TOPIC=LOGGER_PUBSUB_TOPIC,
)

with open(f"{PREDICTION_CONTAINER_DIR}/cloudbuild.yaml", "w") as fp:
    fp.write(cloudbuild_yaml)

#### Run unit tests on the prediction code

In [30]:
! python3 -m unittest src.prediction_container.test_main

2023-07-03 18:30:58.904139: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
..2023-07-03 18:31:01.349576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentati

#### Build custom prediction container

In [32]:
! gcloud builds submit --config $PREDICTION_CONTAINER_DIR/cloudbuild.yaml $PREDICTION_CONTAINER_DIR

Creating temporary tarball archive of 12 file(s) totalling 25.6 KiB before compression.
Uploading tarball of [src/prediction_container] to [gs://hybrid-vertex_cloudbuild/source/1688409225.69673-1d2291650db64c788a1b204d8307f389.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/hybrid-vertex/locations/global/builds/28cafa2c-c2fc-4b6c-880a-3d4559621182].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds/28cafa2c-c2fc-4b6c-880a-3d4559621182?project=934903580331 ].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "28cafa2c-c2fc-4b6c-880a-3d4559621182"

FETCHSOURCE
Fetching storage object: gs://hybrid-vertex_cloudbuild/source/1688409225.69673-1d2291650db64c788a1b204d8307f389.tgz#1688409225893761
Copying gs://hybrid-vertex_cloudbuild/source/1688409225.69673-1d2291650db64c788a1b204d8307f389.tgz#1688409225893761...
/ [1 files][  8.0 KiB/  8.0 KiB]                                                
Operation complete

## Author and run the RL pipeline

You author the pipeline using the custom KFP components built in the previous section, and [create a pipeline run](https://cloud.google.com/vertex-ai/docs/pipelines/run-pipeline#kubeflow-pipelines-sdk) using Vertex Pipelines. You can read more about whether to enable execution caching [here](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline#caching). You can also configure the worker pool spec for training, for instance to train at a larger scale or at a higher speed, by adjusting the replica count, machine type, accelerator type and count, and other specifications.
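
As an illustration of the worker pool settings mentioned above, the sketch below assembles a single worker pool spec as a plain dictionary from the replica count, machine type, and accelerator settings. The field names follow the Vertex AI custom job worker pool layout as far as this sketch assumes. The helper name, the image URI, and the argument list are hypothetical and are not the values used by this pipeline.

```python
def build_worker_pool_spec(
    image_uri,
    args,
    replica_count=1,
    machine_type="n1-standard-4",
    accelerator_type="ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count=0,
):
    """Assemble one worker pool spec dictionary for a custom training job."""
    machine_spec = {"machine_type": machine_type}
    # Only attach accelerator fields when accelerators are actually requested.
    if accelerator_count > 0:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return {
        "machine_spec": machine_spec,
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri, "args": list(args)},
    }


# Hypothetical image and arguments, for illustration only.
cpu_spec = build_worker_pool_spec(
    image_uri="gcr.io/my-project/trainer:latest",
    args=["--num-epochs=5"],
)
print(cpu_spec["machine_spec"])  # {'machine_type': 'n1-standard-4'}
```

To scale up, you would raise `replica_count`, pick a larger `machine_type`, or request accelerators, and pass the resulting list of specs when submitting the custom job.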

Here, you build a "startup" pipeline that generates randomly sampled training data (with the Generator) as the first step. This pipeline runs only once.

In [40]:
!pwd

/home/jupyter/vertex-ai-samples/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation


In [57]:
from google_cloud_pipeline_components.v1.custom_job import utils
from kfp.components import load_component_from_url, load_component_from_file

# generate_op = load_component_from_url(
#     "https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/62a2a7611499490b4b04d731d48a7ba87c2d636f/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/src/generator/component.yaml"
# )
# ingest_op = load_component_from_url(
#     "https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/62a2a7611499490b4b04d731d48a7ba87c2d636f/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/src/ingester/component.yaml"
# )
# train_op = load_component_from_url(
#     "https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/62a2a7611499490b4b04d731d48a7ba87c2d636f/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/src/trainer/component.yaml"
# )

generate_op = load_component_from_file("./src/generator/component.yaml")
ingest_op = load_component_from_file("./src/ingester/component.yaml")
train_op = load_component_from_file("./src/trainer/component.yaml")

@dsl.pipeline(pipeline_root=PIPELINE_ROOT, name=f"{PIPELINE_NAME}-startup")
def pipeline(
    # Pipeline configs
    project_id: str,
    raw_data_path: str,
    training_artifacts_dir: str,
    # BigQuery configs
    bigquery_dataset_id: str,
    bigquery_location: str,
    bigquery_table_id: str,
    bigquery_max_rows: int = 10000,
    # TF-Agents RL configs
    batch_size: int = 8,
    rank_k: int = 20,
    num_actions: int = 20,
    driver_steps: int = 3,
    num_epochs: int = 5,
    tikhonov_weight: float = 0.01,
    agent_alpha: float = 10,
) -> None:
    """Authors an RL pipeline for MovieLens movie recommendation system.

    Integrates the Generator, Ingester, Trainer and Deployer components. This
    pipeline generates initial training data with a random policy and runs once
    as the initiation of the system.

    Args:
      project_id: GCP project ID. This is required because otherwise the BigQuery
        client will use the ID of the tenant GCP project created as a result of
        KFP, which doesn't have proper access to BigQuery.
      raw_data_path: Path to MovieLens 100K's "u.data" file.
      training_artifacts_dir: Path to store the Trainer artifacts (trained policy).

      bigquery_dataset_id: A string of the BigQuery dataset ID in the format of
        "project.dataset".
      bigquery_location: A string of the BigQuery dataset location.
      bigquery_table_id: A string of the BigQuery table ID in the format of
        "project.dataset.table".
      bigquery_max_rows: Optional; maximum number of rows to ingest.

      batch_size: Optional; batch size of environment-generated quantities, e.g.
        rewards.
      rank_k: Optional; rank for matrix factorization in the MovieLens environment;
        also the observation dimension.
      num_actions: Optional; number of actions (movie items) to choose from.
      driver_steps: Optional; number of steps to run per batch.
      num_epochs: Optional; number of training epochs.
      tikhonov_weight: Optional; LinUCB Tikhonov regularization weight of the
        Trainer.
      agent_alpha: Optional; LinUCB exploration parameter that multiplies the
        confidence intervals of the Trainer.
    """
    # Run the Generator component.
    generate_task = generate_op(
        project_id=project_id,
        raw_data_path=raw_data_path,
        batch_size=batch_size,
        rank_k=rank_k,
        num_actions=num_actions,
        driver_steps=driver_steps,
        bigquery_tmp_file=BIGQUERY_TMP_FILE,
        bigquery_dataset_id=bigquery_dataset_id,
        bigquery_location=bigquery_location,
        bigquery_table_id=bigquery_table_id,
    )
    
    # Run the Ingester component.
    ingest_task = ingest_op(
        project_id=project_id,
        bigquery_table_id=generate_task.outputs["bigquery_table_id"],
        bigquery_max_rows=bigquery_max_rows,
        tfrecord_file=TFRECORD_FILE,
    )

    # Run the Trainer component and submit a custom job to Vertex AI.
    # Convert the train_op component into a Vertex AI Custom Job pre-built component.
    custom_job_training_op = utils.create_custom_training_job_from_component(
        display_name='custom-training-job',
        component_spec=train_op,
        replica_count=TRAINING_REPLICA_COUNT,
        machine_type=TRAINING_MACHINE_TYPE,
        accelerator_type=TRAINING_ACCELERATOR_TYPE,
        accelerator_count=TRAINING_ACCELERATOR_COUNT,
    )

    train_task = custom_job_training_op(
        training_artifacts_dir=training_artifacts_dir,
        tfrecord_file=ingest_task.outputs["tfrecord_file"],
        num_epochs=num_epochs,
        rank_k=rank_k,
        num_actions=num_actions,
        tikhonov_weight=tikhonov_weight,
        agent_alpha=agent_alpha,
        project=PROJECT_ID,
        location=REGION,
    )

    # Run the Deployer components.
    # Upload the trained policy as a model.
    model_upload_op = gcc_aip.ModelUploadOp(
        project=project_id,
        display_name=TRAINED_POLICY_DISPLAY_NAME,
        artifact_uri=train_task.outputs["training_artifacts_dir"],
        serving_container_image_uri=f"gcr.io/{PROJECT_ID}/{PREDICTION_CONTAINER}:latest",
    )
    # Create a Vertex AI endpoint. (This operation can occur in parallel with
    # the Generator, Ingester, Trainer components.)
    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project=project_id, display_name=ENDPOINT_DISPLAY_NAME
    )
    # Deploy the uploaded, trained policy to the created endpoint. (This operation
    # has to occur after both model uploading and endpoint creation complete.)
    gcc_aip.ModelDeployOp(
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=TRAINED_POLICY_DISPLAY_NAME,
        traffic_split=TRAFFIC_SPLIT,
        dedicated_resources_machine_type=ENDPOINT_MACHINE_TYPE,
        dedicated_resources_accelerator_type=ENDPOINT_ACCELERATOR_TYPE,
        dedicated_resources_accelerator_count=ENDPOINT_ACCELERATOR_COUNT,
        dedicated_resources_min_replica_count=ENDPOINT_REPLICA_COUNT,
    )

KeyError: 'custom-training-job'

In [54]:
train_op

<kfp.components.yaml_component.YamlComponent at 0x7fa3050a1fc0>

In [55]:
# Compile the authored pipeline.
compiler.Compiler().compile(pipeline_func=pipeline, package_path=PIPELINE_SPEC_PATH)

# Create a pipeline run job.
job = aiplatform.PipelineJob(
    display_name=f"{PIPELINE_NAME}-startup",
    template_path=PIPELINE_SPEC_PATH,
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        # Pipeline configs
        "project_id": PROJECT_ID,
        "raw_data_path": RAW_DATA_PATH,
        "training_artifacts_dir": TRAINING_ARTIFACTS_DIR,
        # BigQuery configs
        "bigquery_dataset_id": BIGQUERY_DATASET_ID,
        "bigquery_location": BIGQUERY_LOCATION,
        "bigquery_table_id": BIGQUERY_TABLE_ID,
    },
    enable_caching=ENABLE_CACHING,
)

job.run(sync=False)

Creating PipelineJob
PipelineJob created. Resource name: projects/934903580331/locations/us-central1/pipelineJobs/movielens-pipeline-startup-20230703191348
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/934903580331/locations/us-central1/pipelineJobs/movielens-pipeline-startup-20230703191348')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/movielens-pipeline-startup-20230703191348?project=934903580331
PipelineJob projects/934903580331/locations/us-central1/pipelineJobs/movielens-pipeline-startup-20230703191348 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/934903580331/locations/us-central1/pipelineJobs/movielens-pipeline-startup-20230703191348 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/934903580331/locations/us-central1/pipelineJobs/movielens-pipeline-startup-20230703191348 current state:
PipelineState.PIPELINE_STATE_RUNNING

## Create the *Simulator* to send simulated MovieLens prediction requests

Create the Simulator, which [obtains observations](https://github.com/tensorflow/agents/blob/v0.8.0/tf_agents/bandits/environments/movielens_py_environment.py#L118-L125) from the MovieLens simulation environment, formats them, and sends prediction requests to the Vertex AI endpoint.

The workflow is: Cloud Scheduler --> Pub/Sub --> Cloud Functions --> Endpoint

In production, you can replace this Simulator logic with code that gathers real-world input features as observations, gets prediction results from the endpoint, and communicates those results to real-world users.

The Simulator source code is [`src/simulator/main.py`](src/simulator/main.py).
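
To make the Pub/Sub to Cloud Functions step of the workflow more concrete, here is a minimal sketch of a Pub/Sub-triggered function. It is not the code in `src/simulator/main.py`: the function names, the random observations that stand in for the MovieLens environment, and the `{"observation": ...}` instance format are all assumptions for illustration. The sketch relies only on the fact that a Pub/Sub message payload arrives base64-encoded under the event's `data` key.

```python
import base64
import random
from typing import Any, Dict, List


def make_observations(batch_size: int, rank_k: int, seed: int = 0) -> List[List[float]]:
    """Stand-in for environment observations: random vectors of length rank_k."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(rank_k)] for _ in range(batch_size)]


def simulate_sketch(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Decodes the Pub/Sub message and builds a prediction request body."""
    # Pub/Sub delivers the message payload base64-encoded under "data".
    message = base64.b64decode(event["data"]).decode("utf-8")
    observations = make_observations(batch_size=8, rank_k=20)
    # Hypothetical request format: one instance per observation.
    return {
        "message": message,
        "instances": [{"observation": obs} for obs in observations],
    }


# Simulate what Cloud Scheduler would publish through Pub/Sub.
fake_event = {"data": base64.b64encode(b"simulator-message").decode("utf-8")}
request = simulate_sketch(fake_event)
print(request["message"], len(request["instances"]))
```

A real Simulator would then send the instances to the Vertex AI endpoint instead of returning them.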

In [None]:
# Simulator parameters
SIMULATOR_PUBSUB_TOPIC = (
    "simulator-pubsub-topic"  # Pub/Sub topic name for the Simulator.
)
SIMULATOR_CLOUD_FUNCTION = (
    "simulator-cloud-function"  # Cloud Functions name for the Simulator.
)
SIMULATOR_SCHEDULER_JOB = (
    "simulator-scheduler-job"  # Cloud Scheduler cron job name for the Simulator.
)
SIMULATOR_SCHEDULE = "*/5 * * * *"  # Cloud Scheduler cron job schedule for the Simulator. E.g. "*/5 * * * *" means every 5 minutes.
SIMULATOR_SCHEDULER_MESSAGE = (
    "simulator-message"  # Cloud Scheduler message for the Simulator.
)
# TF-Agents RL configs
BATCH_SIZE = 8
RANK_K = 20
NUM_ACTIONS = 20

### Run unit tests on the Simulator

In [None]:
! python3 -m unittest src.simulator.test_main

### Create a Pub/Sub topic

- Read more about creating Pub/Sub topics [here](https://cloud.google.com/functions/docs/tutorials/pubsub).

In [None]:
! gcloud pubsub topics create $SIMULATOR_PUBSUB_TOPIC

### Set up a recurring Cloud Scheduler job for the Pub/Sub topic

- Read more about possible ways to create cron jobs [here](https://cloud.google.com/scheduler/docs/creating#gcloud).
- Read about the cron job schedule format [here](https://man7.org/linux/man-pages/man5/crontab.5.html).
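
To illustrate how the minute field of a schedule such as `*/5 * * * *` is interpreted, the sketch below expands that field into the minutes of each hour at which the job fires. It is a deliberately simplified parser (the function name `expand_minute_field` is made up here) that handles only `*`, `*/N`, and a single number, not the full crontab syntax.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expands a cron minute field of the form "*", "*/N", or "M" (simplified)."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        # "*/N" fires at every minute divisible by N, starting from 0.
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute < 60:
        raise ValueError("minute must be in [0, 59]")
    return [minute]


schedule = "*/5 * * * *"
minute_field = schedule.split()[0]
print(expand_minute_field(minute_field))
```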

In [None]:
scheduler_job_args = " ".join(
    [
        SIMULATOR_SCHEDULER_JOB,
        f"--schedule='{SIMULATOR_SCHEDULE}'",
        f"--topic={SIMULATOR_PUBSUB_TOPIC}",
        f"--message-body={SIMULATOR_SCHEDULER_MESSAGE}",
    ]
)

! echo $scheduler_job_args

In [None]:
! gcloud scheduler jobs create pubsub $scheduler_job_args

### Define the *Simulator* logic in a Cloud Function to be triggered periodically, and deploy this Function

- Specify dependencies of the Function in [`src/simulator/requirements.txt`](src/simulator/requirements.txt).
- Read more about the available configurable arguments for deploying a Function [here](https://cloud.google.com/sdk/gcloud/reference/functions/deploy). For instance, based on the complexity of your Function, you may want to adjust its memory and timeout.
- Note that the environment variables in `ENV_VARS` must be comma-separated, with no spaces or other characters in between. Read more about setting, updating, and deleting environment variables [here](https://cloud.google.com/functions/docs/env-var).
- Read more about sending predictions to Vertex endpoints [here](https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-custom-models).
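
Because a stray space or comma in `ENV_VARS` is easy to miss, one option is to build the string with a small validating helper. The sketch below is a hypothetical utility (`format_env_vars` is not part of this notebook or of `gcloud`) that joins key-value pairs with commas and rejects keys or values that contain spaces or commas.

```python
from typing import Mapping


def format_env_vars(env_vars: Mapping[str, object]) -> str:
    """Joins KEY=VALUE pairs with commas, rejecting spaces and commas."""
    parts = []
    for key, value in env_vars.items():
        text = str(value)
        for item in (key, text):
            # Spaces and commas would break the comma-separated flag value.
            if " " in item or "," in item:
                raise ValueError(f"Invalid character in {item!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


print(format_env_vars({"BATCH_SIZE": 8, "RANK_K": 20, "NUM_ACTIONS": 20}))
```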

In [None]:
endpoints = ! gcloud ai endpoints list \
    --region=$REGION \
    --filter=display_name=$ENDPOINT_DISPLAY_NAME
print("\n".join(endpoints), "\n")

ENDPOINT_ID = endpoints[2].split(" ")[0]
print(f"ENDPOINT_ID={ENDPOINT_ID}")

In [None]:
ENV_VARS = ",".join(
    [
        f"PROJECT_ID={PROJECT_ID}",
        f"REGION={REGION}",
        f"ENDPOINT_ID={ENDPOINT_ID}",
        f"RAW_DATA_PATH={RAW_DATA_PATH}",
        f"BATCH_SIZE={BATCH_SIZE}",
        f"RANK_K={RANK_K}",
        f"NUM_ACTIONS={NUM_ACTIONS}",
    ]
)

! echo $ENV_VARS

In [None]:
! gcloud functions deploy $SIMULATOR_CLOUD_FUNCTION \
    --region=$REGION \
    --trigger-topic=$SIMULATOR_PUBSUB_TOPIC \
    --runtime=python37 \
    --memory=512MB \
    --timeout=200s \
    --source=src/simulator \
    --entry-point=simulate \
    --stage-bucket=$BUCKET_NAME \
    --update-env-vars=$ENV_VARS

## Create the *Logger* to asynchronously log prediction inputs and results

Create the Logger to get environment feedback as rewards from the MovieLens simulation environment based on prediction observations and predicted actions, formulate trajectory data, and store said data back to BigQuery. The Logger closes the RL feedback loop from prediction to training data, and allows re-training of the policy on new training data.

The Logger is triggered by a hook in the prediction code. At each prediction request, the prediction code messages a Pub/Sub topic, which triggers the Logger code.

The workflow is: prediction container code (at prediction request) --> Pub/Sub --> Cloud Functions (logging predictions back to BigQuery)

In production, you can replace this Logger logic with code that gathers real-world feedback (rewards) based on observations and predicted actions.

The Logger source code is [`src/logger/main.py`](src/logger/main.py).
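
The step where the Logger turns observations, predicted actions, and rewards into trajectory data can be sketched as follows. This is not the code in `src/logger/main.py`: the row field names (`observation`, `action`, `reward`), the helper `build_trajectory_rows`, and the toy reward function are assumptions chosen for illustration, and the sketch stops short of writing to BigQuery.

```python
from typing import Callable, Dict, List, Sequence


def build_trajectory_rows(
    observations: Sequence[Sequence[float]],
    predicted_actions: Sequence[int],
    reward_fn: Callable[[Sequence[float], int], float],
) -> List[Dict[str, object]]:
    """Pairs each observation with its predicted action and resulting reward."""
    if len(observations) != len(predicted_actions):
        raise ValueError("observations and predicted_actions must align")
    rows = []
    for observation, action in zip(observations, predicted_actions):
        rows.append(
            {
                "observation": list(observation),
                "action": int(action),
                # The environment (here a stand-in function) supplies the reward.
                "reward": float(reward_fn(observation, action)),
            }
        )
    return rows


def toy_reward(observation: Sequence[float], action: int) -> float:
    """Toy stand-in for environment feedback: reads one observation entry."""
    return observation[action % len(observation)]


rows = build_trajectory_rows([[0.1, 0.9], [0.5, 0.2]], [1, 0], toy_reward)
print(rows)
```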

### Run unit tests on the Logger

In [None]:
! python3 -m unittest src.logger.test_main

### Create a Pub/Sub topic

- Read more about creating Pub/Sub topics [here](https://cloud.google.com/functions/docs/tutorials/pubsub).

In [None]:
! gcloud pubsub topics create $LOGGER_PUBSUB_TOPIC

### Define the *Logger* logic in a Cloud Function to be triggered by a Pub/Sub topic, which is triggered by the prediction code at each prediction request

- Specify dependencies of the Function in [`src/logger/requirements.txt`](src/logger/requirements.txt).
- Read more about the available configurable arguments for deploying a Function [here](https://cloud.google.com/sdk/gcloud/reference/functions/deploy). For instance, based on the complexity of your Function, you may want to adjust its memory and timeout.
- Note that the environment variables in `ENV_VARS` must be comma-separated, with no spaces or other characters in between. Read more about setting, updating, and deleting environment variables [here](https://cloud.google.com/functions/docs/env-var).

In [None]:
ENV_VARS = ",".join(
    [
        f"PROJECT_ID={PROJECT_ID}",
        f"RAW_DATA_PATH={RAW_DATA_PATH}",
        f"BATCH_SIZE={BATCH_SIZE}",
        f"RANK_K={RANK_K}",
        f"NUM_ACTIONS={NUM_ACTIONS}",
        f"BIGQUERY_TMP_FILE={BIGQUERY_TMP_FILE}",
        f"BIGQUERY_DATASET_ID={BIGQUERY_DATASET_ID}",
        f"BIGQUERY_LOCATION={BIGQUERY_LOCATION}",
        f"BIGQUERY_TABLE_ID={BIGQUERY_TABLE_ID}",
    ]
)

! echo $ENV_VARS

In [None]:
! gcloud functions deploy $LOGGER_CLOUD_FUNCTION \
    --region=$REGION \
    --trigger-topic=$LOGGER_PUBSUB_TOPIC \
    --runtime=python37 \
    --memory=512MB \
    --timeout=200s \
    --source=src/logger \
    --entry-point=log \
    --stage-bucket=$BUCKET_NAME \
    --update-env-vars=$ENV_VARS

## Create the *Trigger* to trigger re-training

Create the Trigger to periodically re-run the pipeline and re-train the policy on new training data, using `kfp.v2.google.client.AIPlatformClient.create_schedule_from_job_spec`. You create a pipeline for orchestration on Vertex Pipelines, and a Cloud Scheduler job that triggers the pipeline on a recurring schedule. The method also automatically creates a Cloud Function that acts as an intermediary between the Scheduler and Pipelines. You can find the source code [here](https://github.com/kubeflow/pipelines/blob/v1.7.0-alpha.3/sdk/python/kfp/v2/google/client/client.py#L347-L391).

When the Simulator sends prediction requests to the endpoint, the hook in the prediction code triggers the Logger, which logs prediction results to BigQuery as new training data. Because this pipeline runs on a recurring schedule, it uses the new training data to train a new policy, thereby closing the feedback loop. In theory, the more frequently you schedule the pipeline, the closer you get to real-time, continuous training.
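
As a back-of-the-envelope check on how much new data each re-training run might see, the sketch below counts the rows logged between two consecutive pipeline triggers. It assumes, purely for illustration, that every Simulator invocation results in exactly one batch of `batch_size` logged rows; the actual volume depends on the Simulator and Logger implementations.

```python
def new_rows_per_retraining(
    simulator_period_min: int, trigger_period_min: int, batch_size: int
) -> int:
    """Estimates rows logged between two consecutive re-training runs."""
    if simulator_period_min <= 0 or trigger_period_min <= 0:
        raise ValueError("periods must be positive")
    # Number of Simulator invocations that fit in one trigger period.
    simulator_runs = trigger_period_min // simulator_period_min
    return simulator_runs * batch_size


# Simulator every 5 minutes, pipeline every 30 minutes, batch size of 8.
print(new_rows_per_retraining(5, 30, 8))  # 6 runs * 8 rows = 48
```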

In [None]:
TRIGGER_SCHEDULE = "*/30 * * * *"  # Schedule to trigger the pipeline. E.g. "*/30 * * * *" means every 30 minutes.

In [None]:
ingest_op = load_component_from_url(
    "https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/62a2a7611499490b4b04d731d48a7ba87c2d636f/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/src/ingester/component.yaml"
)
train_op = load_component_from_url(
    "https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/62a2a7611499490b4b04d731d48a7ba87c2d636f/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/mlops_pipeline_tf_agents_bandits_movie_recommendation/src/trainer/component.yaml"
)


@dsl.pipeline(pipeline_root=PIPELINE_ROOT, name=f"{PIPELINE_NAME}-retraining")
def pipeline(
    # Pipeline configs
    project_id: str,
    training_artifacts_dir: str,
    # BigQuery configs
    bigquery_table_id: str,
    bigquery_max_rows: int = 10000,
    # TF-Agents RL configs
    rank_k: int = 20,
    num_actions: int = 20,
    num_epochs: int = 5,
    tikhonov_weight: float = 0.01,
    agent_alpha: float = 10,
) -> None:
    """Authors a re-training pipeline for MovieLens movie recommendation system.

    Integrates the Ingester, Trainer and Deployer components.

    Args:
      project_id: GCP project ID. This is required because otherwise the BigQuery
        client will use the ID of the tenant GCP project created as a result of
        KFP, which doesn't have proper access to BigQuery.
      training_artifacts_dir: Path to store the Trainer artifacts (trained policy).

      bigquery_table_id: A string of the BigQuery table ID in the format of
        "project.dataset.table".
      bigquery_max_rows: Optional; maximum number of rows to ingest.

      rank_k: Optional; rank for matrix factorization in the MovieLens environment;
        also the observation dimension.
      num_actions: Optional; number of actions (movie items) to choose from.
      num_epochs: Optional; number of training epochs.
      tikhonov_weight: Optional; LinUCB Tikhonov regularization weight of the
        Trainer.
      agent_alpha: Optional; LinUCB exploration parameter that multiplies the
        confidence intervals of the Trainer.
    """
    # Run the Ingester component.
    ingest_task = ingest_op(
        project_id=project_id,
        bigquery_table_id=bigquery_table_id,
        bigquery_max_rows=bigquery_max_rows,
        tfrecord_file=TFRECORD_FILE,
    )

    # Run the Trainer component and submit custom job to Vertex AI.
    # Convert the train_op component into a Vertex AI Custom Job pre-built component
    custom_job_training_op = utils.create_custom_training_job_op_from_component(
        component_spec=train_op,
        replica_count=TRAINING_REPLICA_COUNT,
        machine_type=TRAINING_MACHINE_TYPE,
        accelerator_type=TRAINING_ACCELERATOR_TYPE,
        accelerator_count=TRAINING_ACCELERATOR_COUNT,
    )

    train_task = custom_job_training_op(
        training_artifacts_dir=training_artifacts_dir,
        tfrecord_file=ingest_task.outputs["tfrecord_file"],
        num_epochs=num_epochs,
        rank_k=rank_k,
        num_actions=num_actions,
        tikhonov_weight=tikhonov_weight,
        agent_alpha=agent_alpha,
        project=PROJECT_ID,
        location=REGION,
    )

    # Run the Deployer components.
    # Upload the trained policy as a model.
    model_upload_op = gcc_aip.ModelUploadOp(
        project=project_id,
        display_name=TRAINED_POLICY_DISPLAY_NAME,
        artifact_uri=train_task.outputs["training_artifacts_dir"],
        serving_container_image_uri=f"gcr.io/{PROJECT_ID}/{PREDICTION_CONTAINER}:latest",
    )
    # Create a Vertex AI endpoint. (This operation can occur in parallel with
    # the Ingester and Trainer components.)
    endpoint_create_op = gcc_aip.EndpointCreateOp(
        project=project_id, display_name=ENDPOINT_DISPLAY_NAME
    )
    # Deploy the uploaded, trained policy to the created endpoint. (This operation
    # has to occur after both model uploading and endpoint creation complete.)
    gcc_aip.ModelDeployOp(
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=TRAINED_POLICY_DISPLAY_NAME,
        dedicated_resources_machine_type=ENDPOINT_MACHINE_TYPE,
        dedicated_resources_accelerator_type=ENDPOINT_ACCELERATOR_TYPE,
        dedicated_resources_accelerator_count=ENDPOINT_ACCELERATOR_COUNT,
        dedicated_resources_min_replica_count=ENDPOINT_REPLICA_COUNT,
    )

In [None]:
# Compile the authored pipeline.
compiler.Compiler().compile(pipeline_func=pipeline, package_path=PIPELINE_SPEC_PATH)

# Create a Vertex AI client.
api_client = AIPlatformClient(project_id=PROJECT_ID, region=REGION)

# Schedule a recurring pipeline.
response = api_client.create_schedule_from_job_spec(
    job_spec_path=PIPELINE_SPEC_PATH,
    schedule=TRIGGER_SCHEDULE,
    parameter_values={
        # Pipeline configs
        "project_id": PROJECT_ID,
        "training_artifacts_dir": TRAINING_ARTIFACTS_DIR,
        # BigQuery config
        "bigquery_table_id": BIGQUERY_TABLE_ID,
    },
)
response["name"]

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial. You also need to clean up other resources that are difficult to delete here, such as all or part of the data in BigQuery, the recurring pipeline and its Scheduler job, and the uploaded policy/model:

In [None]:
# Delete endpoint resource.
! gcloud ai endpoints delete $ENDPOINT_ID --quiet --region $REGION

# Delete Pub/Sub topics.
! gcloud pubsub topics delete $SIMULATOR_PUBSUB_TOPIC --quiet
! gcloud pubsub topics delete $LOGGER_PUBSUB_TOPIC --quiet

# Delete Cloud Functions.
! gcloud functions delete $SIMULATOR_CLOUD_FUNCTION --quiet
! gcloud functions delete $LOGGER_CLOUD_FUNCTION --quiet

# Delete Scheduler job.
! gcloud scheduler jobs delete $SIMULATOR_SCHEDULER_JOB --quiet

# Delete Cloud Storage objects that were created.
! gsutil -m rm -r $PIPELINE_ROOT
! gsutil -m rm -r $TRAINING_ARTIFACTS_DIR