# Forecasting with Vertex AI Tabular Workflows

> Tabular Workflows is a set of integrated, fully managed, and scalable pipelines for end-to-end ML with tabular data. It leverages Google's technology for model development and provides you with customization options to fit your needs.

See [Tabular Workflows: Overview](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/overview) for more details.

## Overview

In this tutorial, you will use several Vertex AI Tabular Workflows pipelines to train AutoML forecasting models with different configurations. You will see:
- how `get_learn_to_learn_forecasting_pipeline_and_parameters` lets you customize the default AutoML forecasting pipeline
- how the same function lets you reduce the training time and cost of an AutoML model by reusing the tuning results from a previous pipeline run
- how `get_time_series_dense_encoder_forecasting_pipeline_and_parameters` lets you train a TiDE (a.k.a. FastNN) model
- how to enable probabilistic inference for forecasting training pipelines
- how to run batch prediction with a forecasting model trained with Tabular Workflows

Learn more about [Tabular Workflow for E2E AutoML](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/e2e-automl).

## Data

In this tutorial, we'll use the [Iowa Liquor dataset](https://www.kaggle.com/datasets/residentmario/iowa-liquor-sales), which is available in BigQuery Public datasets:

```
bq://bigquery-public-data.iowa_liquor_sales_forecasting.2020_sales_train
```

# Notebook setup

In [1]:
!pwd

/home/jupyter/vertex-forecas-repo/02-vf-sdk-examples


In [1]:
# pip install google-cloud-pipeline-components --upgrade --user

In [2]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"

KFP SDK version: 2.1.2
google_cloud_pipeline_components version: 2.1.0


In [3]:
# naming convention for all cloud resources
VERSION        = "v1"              # TODO
PREFIX         = f'tab-forecast-{VERSION}'   # TODO

print(f"PREFIX = {PREFIX}")

PREFIX = tab-forecast-v1


In [4]:
import os

GCP_PROJECTS             = !gcloud config get-value project
PROJECT_ID               = GCP_PROJECTS[0]

PROJECT_NUM              = !gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
PROJECT_NUM              = PROJECT_NUM[0]

# locations / regions for cloud resources
LOCATION                 = 'us-central1'                                         # TODO
REGION                   = LOCATION                                              # TODO
BQ_LOCATION              = 'US'                                                  # TODO

# TODO: Service Account address
VERTEX_SA                = f'{PROJECT_NUM}-compute@developer.gserviceaccount.com'  # TODO

print(f"PROJECT_ID       : {PROJECT_ID}")
print(f"PROJECT_NUM      : {PROJECT_NUM}")
print(f"LOCATION         : {LOCATION}")
print(f"REGION           : {REGION}")
print(f"BQ_LOCATION      : {BQ_LOCATION}")
print(f"VERTEX_SA        : {VERTEX_SA}")

PROJECT_ID       : hybrid-vertex
PROJECT_NUM      : 934903580331
LOCATION         : us-central1
REGION           : us-central1
BQ_LOCATION      : US
VERTEX_SA        : 934903580331-compute@developer.gserviceaccount.com


### create GCS bucket

In [5]:
# GCS bucket and paths
BUCKET_NAME              = f'{PREFIX}-{PROJECT_ID}-bucket'
BUCKET_URI               = f'gs://{BUCKET_NAME}'

print(f"BUCKET_URI : {BUCKET_URI}")

BUCKET_URI : gs://tab-forecast-v1-hybrid-vertex-bucket


In [9]:
# create bucket
! gsutil mb -l $REGION $BUCKET_URI

Creating gs://tab-forecast-v1-hybrid-vertex-bucket/...


In [6]:
! gsutil ls $BUCKET_URI

In [7]:
! gsutil iam ch serviceAccount:{VERTEX_SA}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{VERTEX_SA}:roles/storage.objectViewer $BUCKET_URI

No changes made to gs://tab-forecast-v1-hybrid-vertex-bucket/
No changes made to gs://tab-forecast-v1-hybrid-vertex-bucket/


## Imports

In [15]:
import os
import sys
import uuid
import time

from pprint import pprint

# Import required modules
import json
from typing import Any, Dict, List

from google.cloud import aiplatform, storage
from google_cloud_pipeline_components.preview.automl.forecasting import \
    utils as automl_forecasting_utils

In [16]:
aiplatform.init(project=PROJECT_ID, location=REGION)

## VPC config

In [17]:
# Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used.
# Fully qualified subnetwork name is in the form of
# https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
# reference: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_subnetwork = None  # @param {type:"string"}
# Specifies whether Dataflow workers use public IP addresses.
dataflow_use_public_ips = True  # @param {type:"boolean"}
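
The comments above describe the fully qualified subnetwork format that Dataflow expects. As a minimal sketch, the string could be assembled as follows; the helper name `build_subnetwork_url` and the example project and subnet names are hypothetical and not part of this notebook.

```python
def build_subnetwork_url(host_project_id: str, region: str, subnetwork: str) -> str:
    """Assemble a fully qualified subnetwork URL in the format quoted above."""
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )


# Hypothetical values, for illustration only.
example_subnetwork = build_subnetwork_url("my-host-project", "us-central1", "my-subnet")
print(example_subnetwork)
```

The result would be assigned to `dataflow_subnetwork` in place of `None` when the default subnetwork is not suitable.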

# Prepare for training

## Define helper functions

In [18]:
def get_bucket_name_and_path(uri):
    no_prefix_uri = uri[len("gs://") :]
    splits = no_prefix_uri.split("/")
    return splits[0], "/".join(splits[1:])


def download_from_gcs(uri):
    bucket_name, path = get_bucket_name_and_path(uri)
    storage_client = storage.Client(project=PROJECT_ID)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    # download_as_string() is a deprecated alias of download_as_bytes()
    return blob.download_as_bytes()


def write_to_gcs(uri: str, content: str):
    bucket_name, path = get_bucket_name_and_path(uri)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    blob.upload_from_string(content)


def generate_auto_transformation(column_names: List[str]) -> List[Dict[str, Any]]:
    transformations = []
    for column_name in column_names:
        transformations.append({"auto": {"column_name": column_name}})
    return transformations


def write_auto_transformations(uri: str, column_names: List[str]):
    transformations = generate_auto_transformation(column_names)
    write_to_gcs(uri, json.dumps(transformations))


def get_task_detail(task_details: List[Any], task_name: str) -> Any:
    """Return the first task detail named `task_name`, or None if it is absent."""
    for task_detail in task_details:
        if task_detail.task_name == task_name:
            return task_detail
    return None


def get_deployed_model_uri(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-upload")
    return ensemble_task.outputs["model"].artifacts[0].uri


def get_no_custom_ops_model_uri(task_details):
    ensemble_task = get_task_detail(task_details, "automl-tabular-ensemble")
    return download_from_gcs(
        ensemble_task.outputs["model_without_custom_ops"].artifacts[0].uri
    )


def get_feature_attributions(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-evaluation-2")
    return download_from_gcs(
        ensemble_task.outputs["evaluation_metrics"]
        .artifacts[0]
        .metadata["explanation_gcs_path"]
    )


def get_evaluation_metrics(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-evaluation")
    return download_from_gcs(
        ensemble_task.outputs["evaluation_metrics"].artifacts[0].uri
    )


def load_and_print_json(s):
    parsed = json.loads(s)
    print(json.dumps(parsed, indent=2, sort_keys=True))
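
To make the pure-Python helpers above easier to follow without touching Cloud Storage, here is a small sketch that parses a `gs://` URI and builds the "auto" transformation list. The names `split_gcs_uri` and `auto_transformations` are illustrative variants rather than the notebook's own functions, and the prefix check is an added assumption about how invalid input should be handled.

```python
from typing import Any, Dict, List, Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/blob' into ('bucket', 'path/to/blob')."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {uri!r}")
    bucket, _, path = uri[len(prefix):].partition("/")
    return bucket, path


def auto_transformations(column_names: List[str]) -> List[Dict[str, Any]]:
    """Build one 'auto' transformation entry per column."""
    return [{"auto": {"column_name": name}} for name in column_names]


# Hypothetical bucket and path, for illustration only.
bucket, path = split_gcs_uri("gs://my-bucket/some/dir/transform_config.json")
print(bucket, path)
print(auto_transformations(["date", "sale_dollars"]))
```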

## Define training specification

### liquor dataset

In [19]:
# root_dir = os.path.join(BUCKET_URI, f"automl_forecasting_pipeline/run-{uuid.uuid4()}")
optimization_objective = "minimize-mae"
time_column = "date"
time_series_identifier_column = "store_name"
target_column = "sale_dollars"
data_source_csv_filenames = None
data_source_bigquery_table_path = (
    "bq://bigquery-public-data.iowa_liquor_sales_forecasting.2020_sales_train"
)

training_fraction = 0.8
validation_fraction = 0.1
test_fraction = 0.1

predefined_split_key = None
if predefined_split_key:
    training_fraction = None
    validation_fraction = None
    test_fraction = None

weight_column = None

features = [
    time_column,
    target_column,
    "city",
    "zip_code",
    "county",
]

available_at_forecast_columns = ",".join([time_column])
unavailable_at_forecast_columns = ",".join([target_column])
time_series_attribute_columns = ",".join(["city", "zip_code", "county"])
forecast_horizon = 30
context_window = 30

# transformations = generate_auto_transformation(features)
# transform_config_path = os.path.join(root_dir, f"transform_config_{uuid.uuid4()}.json")
# write_to_gcs(transform_config_path, json.dumps(transformations))
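
Before submitting a pipeline, it can help to sanity-check the specification locally. The sketch below assumes two conventions taken from the cell above: the three split fractions sum to 1, and every feature column is assigned to exactly one of the available, unavailable, or attribute groups. The function names are hypothetical, and the fraction check only applies when no `predefined_split_key` is used.

```python
import math
from typing import List


def check_split_fractions(training: float, validation: float, test: float) -> None:
    """Raise if the three fractions do not sum to 1 (within float tolerance)."""
    total = training + validation + test
    if not math.isclose(total, 1.0, rel_tol=0.0, abs_tol=1e-9):
        raise ValueError(f"Split fractions sum to {total}, expected 1.0")


def check_column_roles(
    features: List[str],
    available: List[str],
    unavailable: List[str],
    attributes: List[str],
) -> None:
    """Raise if a feature has no role or a column appears in more than one role."""
    roles = available + unavailable + attributes
    duplicated = sorted({c for c in roles if roles.count(c) > 1})
    unassigned = sorted(set(features) - set(roles))
    if duplicated or unassigned:
        raise ValueError(f"duplicated={duplicated}, unassigned={unassigned}")


check_split_fractions(0.8, 0.1, 0.1)
check_column_roles(
    features=["date", "sale_dollars", "city", "zip_code", "county"],
    available=["date"],
    unavailable=["sale_dollars"],
    attributes=["city", "zip_code", "county"],
)
print("training specification looks consistent")
```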

# Set Vertex AI Experiment

In [20]:
EXPERIMENT_NAME   = f'forecast-wrkfws-{PREFIX}'

aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    experiment=EXPERIMENT_NAME
)

print(f"EXPERIMENT_NAME   : {EXPERIMENT_NAME}")

EXPERIMENT_NAME   : forecast-wrkfws-tab-forecast-v1


# L2L training & customize search space and change training configuration

We will create an AutoML Forecasting pipeline that skips model evaluation, with the following customizations:
- Limit the hyperparameter search space
- Change machine type and tuning / training parallelism
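
The machine-type customization is expressed as a four-element list, one entry per TensorFlow role (chief, worker, parameter server, evaluator), as in the commented-out `worker_pool_specs_override` in the job configuration cell of this section. A minimal sketch of building that list follows; the helper name is hypothetical, and the machine types are the ones shown in that commented-out cell.

```python
from typing import Any, Dict, List


def build_worker_pool_specs_override(
    chief_machine_type: str, evaluator_machine_type: str
) -> List[Dict[str, Any]]:
    """Return overrides in the order: chief, worker, parameter server, evaluator."""
    return [
        {"machine_spec": {"machine_type": chief_machine_type}},      # TF chief node
        {},                                                          # TF worker node (unused)
        {},                                                          # TF ps node (unused)
        {"machine_spec": {"machine_type": evaluator_machine_type}},  # TF evaluator node
    ]


worker_pool_specs_override = build_worker_pool_specs_override(
    "n1-standard-8", "n1-standard-4"
)
print(worker_pool_specs_override)
```

The resulting list would be passed as `stage_1_tuner_worker_pool_specs_override`, which this notebook leaves commented out.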

### experiment run

In [31]:
# new experiment
RUN_PREFIX        = "l2l"
invoke_time       = time.strftime("%Y%m%d-%H%M%S")
RUN_NAME          = f'{RUN_PREFIX}-run-{invoke_time}'

ROOT_DIR          = f"{BUCKET_URI}/{EXPERIMENT_NAME}/{RUN_NAME}"
LOG_DIR           = f"{ROOT_DIR}/logs"
ARTIFACTS_DIR     = f"{ROOT_DIR}/artifacts"  # Where the trained model will be saved and restored.

print(f"RUN_NAME          : {RUN_NAME}")
print(f"ROOT_DIR          : {ROOT_DIR}")
print(f"LOG_DIR           : {LOG_DIR}")
print(f"ARTIFACTS_DIR     : {ARTIFACTS_DIR}")

RUN_NAME          : l2l-run-20230802-045547
ROOT_DIR          : gs://tab-forecast-v1-hybrid-vertex-bucket/forecast-wrkfws-tab-forecast-v1/l2l-run-20230802-045547
LOG_DIR           : gs://tab-forecast-v1-hybrid-vertex-bucket/forecast-wrkfws-tab-forecast-v1/l2l-run-20230802-045547/logs
ARTIFACTS_DIR     : gs://tab-forecast-v1-hybrid-vertex-bucket/forecast-wrkfws-tab-forecast-v1/l2l-run-20230802-045547/artifacts


### define & submit job config

* See the [source code](https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/preview/automl/forecasting/utils.py#L177) of `get_learn_to_learn_forecasting_pipeline_and_parameters`.

In [50]:
transformations = generate_auto_transformation(features)
transform_config_path = f"{ROOT_DIR}/transform_config.json"  # GCS URIs always use "/"

write_to_gcs(transform_config_path, json.dumps(transformations))

print(f"transform_config_path    : {transform_config_path}\n")
pprint(f"transformations         : {transformations}")

transform_config_path    : gs://tab-forecast-v1-hybrid-vertex-bucket/forecast-wrkfws-tab-forecast-v1/l2l-run-20230802-045547/transform_config.json

("transformations         : [{'auto': {'column_name': 'date'}}, {'auto': "
 "{'column_name': 'sale_dollars'}}, {'auto': {'column_name': 'city'}}, "
 "{'auto': {'column_name': 'zip_code'}}, {'auto': {'column_name': 'county'}}]")


In [67]:
transformations_v = {k: [dic[k] for dic in transformations] for k in transformations[0]}
transformations_v

{'auto': [{'column_name': 'date'}, {'column_name': 'sale_dollars'}, {'column_name': 'city'}, {'column_name': 'zip_code'}, {'column_name': 'county'}]}
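
The dict comprehension above reads its keys from the first entry only, so it assumes every transformation uses the same type. As a hedged sketch, a more general reshaping that groups entries by type could look like this; the function name is hypothetical, and the `"numeric"` entry is there only to show the grouping, since this notebook uses `"auto"` throughout.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_transformations(
    transformations: List[Dict[str, Any]]
) -> Dict[str, List[Any]]:
    """Turn [{'auto': spec}, ...] into {'auto': [spec, ...], ...}, preserving order."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for entry in transformations:
        for kind, spec in entry.items():
            grouped[kind].append(spec)
    return dict(grouped)


example = [
    {"auto": {"column_name": "date"}},
    {"numeric": {"column_name": "sale_dollars"}},  # illustrative mixed type
    {"auto": {"column_name": "city"}},
]
print(group_transformations(example))
```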


In [69]:
# worker_pool_specs_override = [
#     {"machine_spec": {"machine_type": "n1-standard-8"}},  # override for TF chief node
#     {},  # override for TF worker node, since it's not used, leave it empty
#     {},  # override for TF ps node, since it's not used, leave it empty
#     {
#         "machine_spec": {
#             "machine_type": "n1-standard-4"  # override for TF evaluator node
#         }
#     },
# ]

# Number of weak models in the final ensemble model.
num_selected_trials = 5

train_budget_milli_node_hours = 250  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_learn_to_learn_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=ROOT_DIR,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations_v, #transform_config_path,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_column=time_series_identifier_column,
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    # stage_1_tuner_worker_pool_specs_override=worker_pool_specs_override,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    # quantile forecast, L2L without probabilistic inference requires `minimize-quantile-loss`
    # quantiles=",".join(map(lambda x: str(x), [0.25, 0.5, 0.9])),
)

job_id = f"{RUN_NAME}"

print(f"job_id           : {job_id}")
print(f"template_path    : {template_path}")
pprint(f"parameter_values : {parameter_values}")

job_id           : l2l-run-20230802-045547
template_path    : /home/jupyter/.local/lib/python3.7/site-packages/google_cloud_pipeline_components/preview/automl/forecasting/learn_to_learn_forecasting_pipeline.yaml
("parameter_values : {'project': 'hybrid-vertex', 'location': 'us-central1', "
 "'root_dir': "
 "'gs://tab-forecast-v1-hybrid-vertex-bucket/forecast-wrkfws-tab-forecast-v1/l2l-run-20230802-045547', "
 "'target_column': 'sale_dollars', 'optimization_objective': 'minimize-mae', "
 "'transformations': {'auto': [{'column_name': 'date'}, {'column_name': "
 "'sale_dollars'}, {'column_name': 'city'}, {'column_name': 'zip_code'}, "
 "{'column_name': 'county'}]}, 'train_budget_milli_node_hours': 250, "
 "'time_column': 'date', 'time_series_identifier_column': 'store_name', "
 "'time_series_attribute_columns': 'city,zip_code,county', "
 "'available_at_forecast_columns': 'date', 'unavailable_at_forecast_columns': "
 "'sale_dollars', 'forecast_horizon': 30, 'context_window': 30, "
 "'num_s

### submit pipeline job

In [75]:
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=ROOT_DIR,
    parameter_values=parameter_values,
    enable_caching=False,
)

# job.run(
#     sync=False,
#     service_account=SERVICE_ACCOUNT,
# )
job.submit(
    service_account=VERTEX_SA,
    experiment=EXPERIMENT_NAME,
)

Creating PipelineJob
PipelineJob created. Resource name: projects/934903580331/locations/us-central1/pipelineJobs/l2l-run-20230802-045547
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/934903580331/locations/us-central1/pipelineJobs/l2l-run-20230802-045547')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/l2l-run-20230802-045547?project=934903580331
Associating projects/934903580331/locations/us-central1/pipelineJobs/l2l-run-20230802-045547 to Experiment: forecast-wrkfws-tab-forecast-v1


In [None]:
# Block until the pipeline finishes so that the task details are populated.
job.wait()

pipeline_task_details = job.gca_resource.job_detail.task_details

pprint(pipeline_task_details)

### Skip architecture search
Instead of running architecture search every time, we can reuse an existing architecture search result. This helps:
1. reduce the variation of the output model
2. reduce training cost

The existing architecture search result is stored in the `tuning_result_output` output of the `automl-forecasting-stage-1-tuner` component. We can manually input it or get it programmatically.

In [None]:
stage_1_tuner_task = get_task_detail(
    pipeline_task_details, "automl-forecasting-stage-1-tuner"
)

stage_1_tuning_result_artifact_uri = (
    stage_1_tuner_task.outputs["tuning_result_output"].artifacts[0].uri
)

upload_model_task = get_task_detail(
    pipeline_task_details, "model-upload-2"
)

forecasting_mp_model_artifact = (
    upload_model_task.outputs["model"].artifacts[0]
)

forecasting_mp_model = aiplatform.Model(forecasting_mp_model_artifact.metadata['resourceName'])
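
The lookups above fail with an `AttributeError` on `None` if a task name such as `model-upload-2` is not present in the run. The sketch below shows a stricter lookup that reports the available task names instead. It uses `SimpleNamespace` stand-ins rather than real pipeline task details, and the name `find_task` is hypothetical.

```python
from types import SimpleNamespace
from typing import Any, List


def find_task(task_details: List[Any], task_name: str) -> Any:
    """Return the task detail named `task_name`, or raise listing what exists."""
    for task_detail in task_details:
        if task_detail.task_name == task_name:
            return task_detail
    available = sorted(t.task_name for t in task_details)
    raise KeyError(f"No task named {task_name!r}; available tasks: {available}")


# Stand-in objects for illustration; real task details come from the pipeline job.
fake_task_details = [
    SimpleNamespace(task_name="automl-forecasting-stage-1-tuner"),
    SimpleNamespace(task_name="model-upload-2"),
]

tuner_task = find_task(fake_task_details, "automl-forecasting-stage-1-tuner")
print(tuner_task.task_name)
```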

#### Run the skip architecture search pipeline

In [None]:
# new experiment
RUN_PREFIX        = "l2l-skip-arch"
invoke_time       = time.strftime("%Y%m%d-%H%M%S")
RUN_NAME          = f'{RUN_PREFIX}-run-{invoke_time}'

ROOT_DIR          = f"{BUCKET_URI}/{EXPERIMENT_NAME}/{RUN_NAME}"
LOG_DIR           = f"{ROOT_DIR}/logs"
ARTIFACTS_DIR     = f"{ROOT_DIR}/artifacts"  # Where the trained model will be saved and restored.

print(f"RUN_NAME          : {RUN_NAME}")
print(f"ROOT_DIR          : {ROOT_DIR}")
print(f"LOG_DIR           : {LOG_DIR}")
print(f"ARTIFACTS_DIR     : {ARTIFACTS_DIR}")

In [None]:
# Number of weak models in the final ensemble model.
num_selected_trials = 5

train_budget_milli_node_hours = 250  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_learn_to_learn_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=ROOT_DIR,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations_v,  # same dict form as the executed L2L call above
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_column=time_series_identifier_column,
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    # same argument names as the executed L2L call above
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    stage_1_tuning_result_artifact_uri=stage_1_tuning_result_artifact_uri,
)

job_id = f"{RUN_NAME}"

print(f"job_id           : {job_id}")
print(f"template_path    : {template_path}")
pprint(f"parameter_values : {parameter_values}")

### submit pipeline job

In [None]:
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=ROOT_DIR,
    parameter_values=parameter_values,
    enable_caching=False,
)

# job.run(
#     sync=False,
#     service_account=SERVICE_ACCOUNT,
# )
job.submit(
    service_account=VERTEX_SA,
    experiment=EXPERIMENT_NAME,
)

In [None]:
# Wait for the run to finish, then fetch its task details
job.wait()

skip_architecture_search_pipeline_task_details = (
    job.gca_resource.job_detail.task_details
)

pprint(skip_architecture_search_pipeline_task_details)

# TiDE (a.k.a. FastNN) training

### experiment run

In [None]:
# new experiment
RUN_PREFIX        = "tide"
invoke_time       = time.strftime("%Y%m%d-%H%M%S")
RUN_NAME          = f'{RUN_PREFIX}-run-{invoke_time}'

ROOT_DIR          = f"{BUCKET_URI}/{EXPERIMENT_NAME}/{RUN_NAME}"
LOG_DIR           = f"{ROOT_DIR}/logs"
ARTIFACTS_DIR     = f"{ROOT_DIR}/artifacts"  # Where the trained model will be saved and restored.

print(f"RUN_NAME          : {RUN_NAME}")
print(f"ROOT_DIR          : {ROOT_DIR}")
print(f"LOG_DIR           : {LOG_DIR}")
print(f"ARTIFACTS_DIR     : {ARTIFACTS_DIR}")

### define & submit job config

In [None]:
transformations = generate_auto_transformation(features)
transform_config_path = f"{ROOT_DIR}/transform_config.json"  # GCS URIs always use "/"

write_to_gcs(transform_config_path, json.dumps(transformations))

print(f"transformations          : {transformations}")
print(f"transform_config_path    : {transform_config_path}")

In [None]:
# Number of weak models in the final ensemble model.
num_selected_trials = 5

train_budget_milli_node_hours = 250  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=ROOT_DIR,  # `root_dir` is never defined in this notebook; use ROOT_DIR
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations_v,  # same dict form as the executed L2L call above
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_column=time_series_identifier_column,
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    # same argument names as the executed L2L call above
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    # enable_probabilistic_inference=True,
    # quantile forecast, TiDE without probabilistic inference requires `minimize-quantile-loss`
    # quantiles=",".join(map(lambda x: str(x), [0.25, 0.5, 0.9])),
)

job_id = f"{RUN_NAME}"

print(f"job_id           : {job_id}")
print(f"template_path    : {template_path}")
pprint(f"parameter_values : {parameter_values}")

In [None]:
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=ROOT_DIR,
    parameter_values=parameter_values,
    enable_caching=False,
)

# job.run(
#     sync=False,
#     service_account=SERVICE_ACCOUNT,
# )
job.submit(
    service_account=VERTEX_SA,
    experiment=EXPERIMENT_NAME,
)

In [None]:
# Wait for the run to finish, then fetch its task details
job.wait()

tide_pipeline_task_details = (
    job.gca_resource.job_detail.task_details
)

pprint(tide_pipeline_task_details)

# TODO 

* probabilistic forecast