In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Tabular Workflow for Forecasting

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_tabular_on_vertex_pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

NOTE: This notebook has been tested in the following environment:

- Python version = 3.9

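## Overview

This tutorial demonstrates how to use Vertex AI Tabular Workflow for Forecasting to train an AutoML model. You can choose from the following model types: Time series Dense Encoder (TiDE), Learn-to-learn (L2L), Sequence-to-sequence (Seq2Seq+), and Temporal Fusion Transformer (TFT).

Learn more about [Tabular Workflow for Forecasting](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/forecasting).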

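Tabular Workflow for Forecasting has the following advantages over the Vertex AI Forecasting managed service:
1. Composite time series ID columns are supported. You can use a combination of multiple columns as the time series ID; for example, you can pass `['sku_id']` or `['sku_id', 'store_id']` as the time series ID columns.
2. Model architecture search can be skipped. You can reuse the tuning result of a previous model architecture search to train a model directly.
3. Hardware is customizable. You can override the machine specs of the tuning and training steps to adjust training speed. You can also control the parallelism of the training process and the number of trials selected for the final ensemble step.
4. A single time series can have an unlimited number of time steps. The training dataset isn't subject to the limit of 3,000 time steps.
5. The training dataset has no size cap. It isn't subject to the 100 million row limit or the 100 GB limit.
6. You can use all the advanced features of Vertex AI Pipelines.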
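
To illustrate what a composite time series ID means in item 1, here is a minimal sketch in plain Python. It does not call any Vertex AI API; the sample rows and the `group_by_series` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample rows; the column names follow the example in the list above.
rows = [
    {"sku_id": "A", "store_id": "s1", "date": "2020-01-01", "sales": 10.0},
    {"sku_id": "A", "store_id": "s2", "date": "2020-01-01", "sales": 7.0},
    {"sku_id": "A", "store_id": "s1", "date": "2020-01-02", "sales": 12.0},
]


def group_by_series(
    rows: List[dict], id_columns: List[str]
) -> Dict[Tuple[str, ...], List[dict]]:
    """Group rows into time series keyed by the combined ID column values."""
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in id_columns)
        series[key].append(row)
    return dict(series)


# One ID column: every row with the same sku_id lands in one series.
single = group_by_series(rows, ["sku_id"])
# Two ID columns: each (sku_id, store_id) pair is its own series.
composite = group_by_series(rows, ["sku_id", "store_id"])
print(len(single), len(composite))
```

With a single ID column, the two stores are merged into one series that has duplicate timestamps; the composite key keeps them apart.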

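### Objective

In this tutorial, you learn how to create AutoML forecasting models using [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) templates downloaded from [Google Cloud Pipeline Components](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction) (GCPC). These pipelines are Vertex AI Tabular Workflow pipelines maintained by Google. They showcase different ways to customize the Vertex AI Tabular training process.

This tutorial uses the following Google Cloud ML services:

- AutoML training
- Vertex AI Pipelines

The steps performed include:

- Create a training pipeline that uses the TiDE (Time series Dense Encoder) algorithm with specified machine types.
- Create a TiDE training pipeline that reuses the architecture search result from the previous pipeline to save time.
- Create a training pipeline that uses the Learn-to-learn (L2L) algorithm.
- Create a training pipeline that uses the Seq2seq (Sequence-to-sequence) algorithm.
- Create a training pipeline that uses the TFT (Temporal Fusion Transformer) algorithm.
- Perform batch prediction using the models trained in the steps above.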

### Dataset

This tutorial uses the [Iowa liquor sales dataset](https://www.kaggle.com/datasets/residentmario/iowa-liquor-sales), which is used here to forecast liquor sales in the Midwest.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery
* Dataflow

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [BigQuery pricing](https://cloud.google.com/bigquery), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

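## Install additional packages

Install the Google Cloud Pipeline Components (GCPC) SDK, version `2.3.0` or later. The cell below pins version `2.3.0`.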

In [None]:
!pip3 install --upgrade --quiet google-cloud-pipeline-components==2.3.0 \
                                google-cloud-aiplatform

Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager).

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,dataflow.googleapis.com,compute_component,storage-component.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

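## Notes about service accounts and permissions

For details on setting up permissions, see https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/service-accounts

**No configuration is required by default.** If you run into any permission-related issues, make sure that the following service accounts have the required roles:

|Service account email|Description|Roles|
|---|---|---|
|PROJECT_NUMBER-compute@developer.gserviceaccount.com|Compute Engine default service account|Dataflow Developer, Dataflow Worker, Storage Admin, BigQuery Data Editor, Vertex AI User, Service Account User|
|service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com|AI Platform Service Agent|Vertex AI Service Agent|

1. Go to https://console.cloud.google.com/iam-admin/iam.
2. Select the "Include Google-provided role grants" checkbox.
3. Find the emails listed above.
4. Grant the corresponding roles.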
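
As a convenience when searching the IAM page, the two email addresses from the table can be derived from a project number. This is a minimal sketch; the `default_service_accounts` helper and the project number are hypothetical, and only the two email patterns come from the table above.

```python
from typing import Dict


def default_service_accounts(project_number: str) -> Dict[str, str]:
    """Build the two service account emails from the table above."""
    return {
        "compute_engine_default": (
            f"{project_number}-compute@developer.gserviceaccount.com"
        ),
        "ai_platform_service_agent": (
            f"service-{project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com"
        ),
    }


# Hypothetical project number, used only for illustration.
accounts = default_service_accounts("123456789012")
for name, email in accounts.items():
    print(f"{name}: {email}")
```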

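### Using a data source from a different project
- For a BigQuery data source, grant both service accounts the "BigQuery Data Viewer" role.
- For a CSV data source, grant both service accounts the "Storage Object Viewer" role.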

### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

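### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to authenticate manually. Follow the relevant instructions below.

1. Vertex AI Workbench
* Do nothing, as you are already authenticated.

2. Local JupyterLab instance, uncomment and run: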

In [None]:
# ! gcloud auth login

3. Colab, uncomment and run:

In [None]:
# from google.colab import auth
# auth.authenticate_user()

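See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets, TF model checkpoints, and TensorBoard files.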

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

If your bucket doesn't already exist, run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

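#### Service Account

You use a service account to create Vertex AI Pipeline jobs. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.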

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    import sys
    IS_COLAB = 'google.colab' in sys.modules

    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

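#### Set service account access for Vertex AI Pipelines
Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.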

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Import libraries and define constants

In [None]:
# Import required modules
import json
import os
import uuid
from typing import Any, Dict, List, Optional

from google.cloud import aiplatform, storage
from google_cloud_pipeline_components.preview.automl.forecasting import \
    utils as automl_forecasting_utils

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)

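## VPC related config

If you need to use a custom Dataflow subnetwork, you can set it through the `dataflow_subnetwork` parameter. The requirements are:
1. `dataflow_subnetwork` must be a fully qualified subnetwork name.
   ([Example network and subnetwork specifications](https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications))
1. The following service accounts must have the [Compute Network User role](https://cloud.google.com/compute/docs/access/iam#compute.networkUser) assigned on the specified Dataflow subnetwork:
    1. Compute Engine default service account: PROJECT_NUMBER-compute@developer.gserviceaccount.com
    1. Dataflow service account: service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com

If your project has VPC-SC enabled, make sure that:

1. The Dataflow subnetwork used in VPC-SC is configured properly for Dataflow.
   [[Reference](https://cloud.google.com/dataflow/docs/guides/routes-firewall)]
1. `dataflow_use_public_ips` is set to False.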

In [None]:
# Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used.
# Fully qualified subnetwork name is in the form of
# https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
# reference: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_subnetwork = None  # @param {type:"string"}
# Specifies whether Dataflow workers use public IP addresses.
dataflow_use_public_ips = True  # @param {type:"boolean"}
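
The fully qualified subnetwork name follows the URL pattern quoted in the comment of the cell above. The sketch below assembles it from its three parts; the `fully_qualified_subnetwork` helper and the sample project, region, and subnetwork names are hypothetical.

```python
def fully_qualified_subnetwork(
    host_project_id: str, region: str, subnetwork: str
) -> str:
    """Assemble a subnetwork URL in the form quoted in the cell above."""
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )


# Hypothetical values, used only for illustration.
example_subnetwork = fully_qualified_subnetwork(
    "my-host-project", "us-central1", "my-subnet"
)
print(example_subnetwork)
```

The resulting string is what you would assign to `dataflow_subnetwork` instead of `None`.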

## Prepare for training

### Define helper functions

In [None]:
# Below functions will serve as the utility functions.


# Fetch the tuple of GCS bucket and object URI.
def get_bucket_name_and_path(uri: str):
    no_prefix_uri = uri[len("gs://") :]
    splits = no_prefix_uri.split("/")
    return splits[0], "/".join(splits[1:])


# Fetch the content from a GCS object URI.
def download_from_gcs(uri: str):
    bucket_name, path = get_bucket_name_and_path(uri)
    storage_client = storage.Client(project=PROJECT_ID)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob.download_as_string()


# Upload the string content as a GCS object.
def write_to_gcs(uri: str, content: str):
    bucket_name, path = get_bucket_name_and_path(uri)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    blob.upload_from_string(content)


# This is the example to set non-auto transformations.
# For more details about the transformations, please check:
# https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular#transformations
def generate_transformation(
    auto_column_names: Optional[List[str]] = None,
    numeric_column_names: Optional[List[str]] = None,
    categorical_column_names: Optional[List[str]] = None,
    text_column_names: Optional[List[str]] = None,
    timestamp_column_names: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    if auto_column_names is None:
        auto_column_names = []
    if numeric_column_names is None:
        numeric_column_names = []
    if categorical_column_names is None:
        categorical_column_names = []
    if text_column_names is None:
        text_column_names = []
    if timestamp_column_names is None:
        timestamp_column_names = []
    return {
        "auto": auto_column_names,
        "numeric": numeric_column_names,
        "categorical": categorical_column_names,
        "text": text_column_names,
        "timestamp": timestamp_column_names,
    }


# Retrieve the data given a task name.
def get_task_detail(
    task_details: List[Any], task_name: str
) -> Optional[Any]:
    for task_detail in task_details:
        if task_detail.task_name == task_name:
            return task_detail


# Retrieve the URI of the model.
def get_deployed_model_uri(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-upload")
    return ensemble_task.outputs["model"].artifacts[0].uri


# Retrieve the feature importance details from GCS.
def get_feature_attributions(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-evaluation-2")
    return download_from_gcs(
        ensemble_task.outputs["evaluation_metrics"]
        .artifacts[0]
        .metadata["explanation_gcs_path"]
    )


# Retrieve the evaluation metrics from GCS.
def get_evaluation_metrics(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-evaluation")
    return download_from_gcs(
        ensemble_task.outputs["evaluation_metrics"].artifacts[0].uri
    )


# Pretty print the JSON string.
def load_and_print_json(s):
    parsed = json.loads(s)
    print(json.dumps(parsed, indent=2, sort_keys=True))
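
The URI parsing done by `get_bucket_name_and_path` can be tried without any Cloud Storage access. The following sketch is a variant that also rejects strings without the `gs://` prefix; the `split_gcs_uri` name and the sample URIs are hypothetical.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split gs://bucket/path/to/object into (bucket, path/to/object)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    # partition() splits on the first "/" only, so the object path keeps
    # its own slashes; a bare bucket URI yields an empty path.
    bucket, _, path = uri[len(prefix):].partition("/")
    return bucket, path


print(split_gcs_uri("gs://my-bucket/dir/metrics.json"))
print(split_gcs_uri("gs://my-bucket"))
```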

### Define training specification

In [None]:
root_dir = os.path.join(BUCKET_URI, f"automl_forecasting_pipeline/run-{uuid.uuid4()}")
optimization_objective = "minimize-mae"
time_column = "date"
time_series_identifier_column = "store_name"
target_column = "sale_dollars"
data_source_csv_filenames = None
data_source_bigquery_table_path = (
    "bq://bigquery-public-data.iowa_liquor_sales_forecasting.2020_sales_train"
)

training_fraction = 0.8
validation_fraction = 0.1
test_fraction = 0.1

predefined_split_key = None
if predefined_split_key:
    training_fraction = None
    validation_fraction = None
    test_fraction = None

weight_column = None

features = [
    time_column,
    target_column,
    "city",
    "zip_code",
    "county",
]

available_at_forecast_columns = [time_column]
unavailable_at_forecast_columns = [target_column]
time_series_attribute_columns = ["city", "zip_code", "county"]
forecast_horizon = 150
context_window = 150

transformations = generate_transformation(auto_column_names=features)

# Create a Vertex managed dataset artifact.
vertex_dataset = aiplatform.TimeSeriesDataset.create(
    bq_source=data_source_bigquery_table_path
)
vertex_dataset_artifact_id = vertex_dataset.gca_resource.metadata_artifact.split("/")[
    -1
]
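
The cell above either uses the three split fractions or clears them when a predefined split column is given. A small sanity check along these lines can catch a misconfigured split before a pipeline is launched. This is a sketch only: the `check_split_config` helper is hypothetical, and the rule that the fractions are cleared when `predefined_split_key` is set simply mirrors the cell above.

```python
import math
from typing import Optional


def check_split_config(
    predefined_split_key: Optional[str],
    training_fraction: Optional[float],
    validation_fraction: Optional[float],
    test_fraction: Optional[float],
) -> None:
    """Raise ValueError if the split settings contradict each other."""
    fractions = (training_fraction, validation_fraction, test_fraction)
    if predefined_split_key:
        # Mirrors the cell above: fractions are cleared for a predefined split.
        if any(f is not None for f in fractions):
            raise ValueError("Set all fractions to None with a predefined split.")
        return
    if any(f is None for f in fractions):
        raise ValueError("All three fractions are needed for a fraction split.")
    # isclose() tolerates floating-point rounding in 0.8 + 0.1 + 0.1.
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError(f"Fractions must sum to 1, got {sum(fractions)}.")


check_split_config(None, 0.8, 0.1, 0.1)  # passes silently
```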

## Supported APIs

Currently, four model types are supported in the APIs/SDK, each with its own utility function:
1. `time_series_dense_encoder` (`TiDE`): `get_time_series_dense_encoder_forecasting_pipeline_and_parameters`
2. `learn_to_learn` (`L2L`): `get_learn_to_learn_forecasting_pipeline_and_parameters`
3. `sequence_to_sequence` (`seq2seq`): `get_sequence_to_sequence_forecasting_pipeline_and_parameters`
4. `temporal_fusion_transformer` (`TFT`): `get_temporal_fusion_transformer_forecasting_pipeline_and_parameters`

### High-level workflow

The following code shows the general pattern for using the API:
```python
# Use the utility function to get the parameters needed to create a Vertex Pipeline job.
template_path, parameter_values = automl_forecasting_utils.get_${MODEL_TYPE}_forecasting_pipeline_and_parameters(
  ...
)

# Construct the Vertex Pipeline job.
job = aiplatform.PipelineJob(
    ...
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    ...
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    ...
)

# Start the Vertex Pipeline job.
job.run()
```
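
The `${MODEL_TYPE}` placeholder in the pattern above stands for one of the four model type names. The sketch below spells out that substitution; the `MODEL_TYPES` mapping and the `utility_function_name` helper are hypothetical, while the resulting function names are the four listed earlier in this section.

```python
# Short model names mapped to the model type used in the function name.
MODEL_TYPES = {
    "TiDE": "time_series_dense_encoder",
    "L2L": "learn_to_learn",
    "seq2seq": "sequence_to_sequence",
    "TFT": "temporal_fusion_transformer",
}


def utility_function_name(model: str) -> str:
    """Return the name of the utility function for a short model name."""
    return f"get_{MODEL_TYPES[model]}_forecasting_pipeline_and_parameters"


for model in MODEL_TYPES:
    print(model, "->", utility_function_name(model))

# In this notebook, the function object could then be looked up with, e.g.:
# get_params = getattr(automl_forecasting_utils, utility_function_name("TiDE"))
```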

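### Utility function parameters

The utility functions for all model types take largely the same parameters; model-specific exceptions are noted in the code comments of the training sections.

The following uses `get_time_series_dense_encoder_forecasting_pipeline_and_parameters` as an example: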

```python
def get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    *,
    project: str,
    location: str,
    root_dir: str,
    target_column: str,
    optimization_objective: str,
    transformations: Dict[str, List[str]],
    train_budget_milli_node_hours: float,
    time_column: str,
    time_series_identifier_columns: List[str],
    time_series_attribute_columns: Optional[List[str]] = None,
    available_at_forecast_columns: Optional[List[str]] = None,
    unavailable_at_forecast_columns: Optional[List[str]] = None,
    forecast_horizon: Optional[int] = None,
    context_window: Optional[int] = None,
    evaluated_examples_bigquery_path: Optional[str] = None,
    window_predefined_column: Optional[str] = None,
    window_stride_length: Optional[int] = None,
    window_max_count: Optional[int] = None,
    holiday_regions: Optional[List[str]] = None,
    stage_1_num_parallel_trials: Optional[int] = None,
    stage_1_tuning_result_artifact_uri: Optional[str] = None,
    stage_2_num_parallel_trials: Optional[int] = None,
    num_selected_trials: Optional[int] = None,
    data_source_csv_filenames: Optional[str] = None,
    data_source_bigquery_table_path: Optional[str] = None,
    predefined_split_key: Optional[str] = None,
    training_fraction: Optional[float] = None,
    validation_fraction: Optional[float] = None,
    test_fraction: Optional[float] = None,
    weight_column: Optional[str] = None,
    dataflow_service_account: Optional[str] = None,
    dataflow_subnetwork: Optional[str] = None,
    dataflow_use_public_ips: bool = True,
    feature_transform_engine_bigquery_staging_full_dataset_id: str = '',
    feature_transform_engine_dataflow_machine_type: str = 'n1-standard-16',
    feature_transform_engine_dataflow_max_num_workers: int = 10,
    feature_transform_engine_dataflow_disk_size_gb: int = 40,
    evaluation_batch_predict_machine_type: str = 'n1-standard-16',
    evaluation_batch_predict_starting_replica_count: int = 25,
    evaluation_batch_predict_max_replica_count: int = 25,
    evaluation_dataflow_machine_type: str = 'n1-standard-16',
    evaluation_dataflow_max_num_workers: int = 25,
    evaluation_dataflow_disk_size_gb: int = 50,
    study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None,
    stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None,
    stage_2_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None,
    enable_probabilistic_inference: bool = False,
    quantiles: Optional[List[float]] = None,
    encryption_spec_key_name: Optional[str] = None,
    model_display_name: Optional[str] = None,
    model_description: Optional[str] = None,
    run_evaluation: bool = True,
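) -> Tuple[str, Dict[str, Any]]:
  """Returns the time_series_dense_encoder forecasting pipeline and formatted parameters.

  Args:
    project: The GCP project that runs the pipeline components.
    location: The GCP region that runs the pipeline components.
    root_dir: The root GCS directory for the pipeline components.
    target_column: The target column name.
    optimization_objective: "minimize-rmse", "minimize-mae", "minimize-rmsle",
      "minimize-rmspe", "minimize-wape-mae", "minimize-mape", or
      "minimize-quantile-loss".
    transformations: Dict mapping auto and/or type resolutions to feature
      columns. The supported types are: auto, categorical, numeric, text, and
      timestamp.
    train_budget_milli_node_hours: The training budget for creating this
      model, expressed in milli node hours, i.e. a value of 1,000 in this
      field means 1 node hour.
    time_column: The column that indicates the time.
    time_series_identifier_columns: The columns that distinguish the different
      time series.
    time_series_attribute_columns: The columns that are invariant within the
      same time series.
    available_at_forecast_columns: The columns that are available at forecast
      time.
    unavailable_at_forecast_columns: The columns that are unavailable at
      forecast time.
    forecast_horizon: The length of the forecast horizon.
    context_window: The length of the context window.
    evaluated_examples_bigquery_path: The existing BigQuery dataset to write
      the predicted examples into for evaluation, in the format
      `bq://project.dataset`. The dataset must be created beforehand.
    window_predefined_column: The column that indicates the start of each
      window.
    window_stride_length: The stride length used to generate the windows.
    window_max_count: The maximum number of windows that will be generated.
    holiday_regions: The geographical regions where the holiday effect is
      applied.
    stage_1_num_parallel_trials: The number of parallel trials in stage 1.
    stage_1_tuning_result_artifact_uri: The GCS URI where the stage 1 tuning
      result is stored.
    stage_2_num_parallel_trials: The number of parallel trials in stage 2.
    num_selected_trials: The number of selected trials.
    data_source_csv_filenames: A string that represents a comma-separated list
      of CSV filenames.
    data_source_bigquery_table_path: The BigQuery table path, in the format
      bq://bq_project.bq_dataset.bq_table.
    predefined_split_key: The predefined split column name.
    training_fraction: The training fraction.
    validation_fraction: The validation fraction.
    test_fraction: The test fraction.
    weight_column: The weight column name.
    dataflow_service_account: The full service account name.
    dataflow_subnetwork: The Dataflow subnetwork.
    dataflow_use_public_ips: `True` to enable Dataflow public IPs.
    feature_transform_engine_bigquery_staging_full_dataset_id: The full ID of
      the feature transform engine staging dataset.
    feature_transform_engine_dataflow_machine_type: The Dataflow machine type
      of the feature transform engine.
    feature_transform_engine_dataflow_max_num_workers: The maximum number of
      Dataflow workers of the feature transform engine.
    feature_transform_engine_dataflow_disk_size_gb: The disk size of the
      Dataflow workers of the feature transform engine.
    evaluation_batch_predict_machine_type: The machine type of the batch
      prediction job in evaluation, such as 'n1-standard-16'.
    evaluation_batch_predict_starting_replica_count: The number of replicas to
      use in the batch prediction cluster at startup time.
    evaluation_batch_predict_max_replica_count: The maximum number of replicas
      that the distributed prediction job can scale to.
    evaluation_dataflow_machine_type: The machine type of the Dataflow job in
      evaluation, such as 'n1-standard-16'.
    evaluation_dataflow_max_num_workers: The maximum number of Dataflow
      workers.
    evaluation_dataflow_disk_size_gb: The disk size of Dataflow, in GB.
    study_spec_parameters_override: The list for overriding the study spec.
    stage_1_tuner_worker_pool_specs_override: The dictionary for overriding
      the stage 1 tuner worker pool spec.
    stage_2_trainer_worker_pool_specs_override: The dictionary for overriding
      the stage 2 trainer worker pool spec.
    enable_probabilistic_inference: If probabilistic inference is enabled, the
      model fits a distribution that captures the uncertainty of a prediction.
      If quantiles are specified, the quantiles of the distribution are also
      returned.
    quantiles: The quantiles to use for probabilistic inference. Up to 5
      quantiles with values between 0 and 1 are allowed, representing the
      quantiles to use for that objective. Quantiles must be unique.
    encryption_spec_key_name: The KMS key name.
    model_display_name: An optional display name for the model.
    model_description: An optional description.
    run_evaluation: `True` to evaluate the ensembled model on the test split.
  """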
  ...
```
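
The constraints on `quantiles` described in the docstring above can be checked locally before a pipeline is launched. This is a minimal sketch, not part of the GCPC API: the `check_quantiles` helper is hypothetical, and treating the bounds 0 and 1 as exclusive is an assumption, since the docstring only says "between 0 and 1". The requirement that quantile forecasts use `minimize-quantile-loss` comes from the comments in the training cells of this notebook.

```python
from typing import List, Optional


def check_quantiles(
    quantiles: Optional[List[float]], optimization_objective: str
) -> None:
    """Raise ValueError if the quantile settings break the documented rules."""
    if not quantiles:
        return  # No quantile forecast requested.
    if optimization_objective != "minimize-quantile-loss":
        raise ValueError(
            "Quantile forecasts require the 'minimize-quantile-loss' objective."
        )
    if len(quantiles) > 5:
        raise ValueError("At most 5 quantiles are allowed.")
    if len(set(quantiles)) != len(quantiles):
        raise ValueError("Quantiles must be unique.")
    # Assumption: the bounds are exclusive.
    if any(not 0.0 < q < 1.0 for q in quantiles):
        raise ValueError("Each quantile must lie between 0 and 1.")


check_quantiles([0.25, 0.5, 0.9], "minimize-quantile-loss")  # passes silently
```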


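### Use holiday regions

For some use cases, regional holidays can affect the forecast. For more information about the holiday regions supported for forecasting, see https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/forecasting-train#holiday-regions.

Pass a list of strings as `holiday_regions` to the pipeline parameter generator to incorporate holiday data into your training pipeline.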

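## Customize training configuration

You can create a forecasting pipeline with the following customizations:
- Change the machine type and the tuning / training parallelism
- Skip evaluation
- Skip model architecture search

Instead of running an architecture search every time, you can reuse an existing model architecture search result. This can reduce both the variation in the output model and the training cost. The existing search result is stored in the `tuning_result_output` output of the `automl-forecasting-stage-1-tuner` component. You can load it programmatically with the API.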

```python
stage_1_tuner_task = get_task_detail(
    pipeline_task_details, "automl-forecasting-stage-1-tuner"
)

stage_1_tuning_result_artifact_uri = (
    stage_1_tuner_task.outputs["tuning_result_output"].artifacts[0].uri
)
```

Use the following snippet to customize the training configuration:

In [None]:
# Customize the work pool for each trial during tuning.
# Only the chief node and the evaluator node are used.
# You can change the machine spec for these two nodes.
worker_pool_specs_override = [
    {"machine_spec": {"machine_type": "n1-standard-8"}},  # override for TF chief node
    {},  # override for TF worker node, since it's not used, leave it empty
    {},  # override for TF ps node, since it's not used, leave it empty
    {
        "machine_spec": {"machine_type": "n1-standard-4"}
    },  # override for TF evaluator node
]

# Number of weak models in the final ensemble model.
num_selected_trials = 5

# Specify the evaluation setup.
run_evaluation = False
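
The override list in the cell above is positional: chief, worker, parameter server, evaluator, in that order, with empty dictionaries for the unused nodes. The sketch below wraps that layout in a small builder so the order can't be mixed up; the `build_worker_pool_specs_override` helper is hypothetical, and the machine types are the ones used above.

```python
from typing import Any, Dict, List


def build_worker_pool_specs_override(
    chief_machine_type: str, evaluator_machine_type: str
) -> List[Dict[str, Any]]:
    """Return the four-entry override list in the order used above."""
    return [
        {"machine_spec": {"machine_type": chief_machine_type}},  # TF chief node
        {},  # TF worker node (not used)
        {},  # TF ps node (not used)
        {"machine_spec": {"machine_type": evaluator_machine_type}},  # TF evaluator
    ]


example_specs = build_worker_pool_specs_override("n1-standard-8", "n1-standard-4")
print(example_specs)
```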

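You can export the evaluated examples from training to BigQuery by setting the `evaluated_examples_bigquery_path` training parameter. The BigQuery path must point to an existing BigQuery dataset, in the format `bq://project.dataset`.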

In [None]:
# This is ONLY available when `run_evaluation` is set to `True`.
evaluated_examples_bigquery_path = f"bq://{PROJECT_ID}.eval"
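
Because the path must name a dataset rather than a table, a quick format check can catch a stray table suffix. The sketch below is a simplification under stated assumptions: the `parse_bq_dataset_path` helper and its regular expression are hypothetical, the pattern does not cover every legal project ID (for example, domain-scoped projects), and it does not verify that the dataset exists.

```python
import re
from typing import Tuple

# Simplified pattern for bq://project.dataset (an assumption, not an official rule).
_BQ_DATASET_PATH = re.compile(
    r"^bq://(?P<project>[^.\s/]+)\.(?P<dataset>[A-Za-z0-9_]+)$"
)


def parse_bq_dataset_path(path: str) -> Tuple[str, str]:
    """Split bq://project.dataset into (project, dataset) or raise ValueError."""
    match = _BQ_DATASET_PATH.match(path)
    if match is None:
        raise ValueError(f"Expected bq://project.dataset, got {path!r}")
    return match.group("project"), match.group("dataset")


print(parse_bq_dataset_path("bq://my-project.eval"))
```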

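## TiDE training

Time series Dense Encoder (TiDE) is an optimized dense DNN-based encoder-decoder model that offers strong model quality with fast training and inference, especially for long contexts and horizons.

For more details, see https://ai.googleblog.com/2023/04/recent-advances-in-deep-long-horizon.html

In this tutorial, you run the TiDE training pipeline twice:
1. With model architecture search
2. Without model architecture search

### Run TiDE pipeline with model architecture search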

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    # `minimize-quantile-loss`
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    # Do not set `data_source_csv_filenames` and
    # `data_source_bigquery_table_path` if you want to use Vertex managed
    # dataset by commenting out the following two lines.
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    stage_1_tuner_worker_pool_specs_override=worker_pool_specs_override,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile forecast requires `minimize-quantile-loss` as optimization objective.
    # quantiles=[0.25, 0.5, 0.9],
    # holiday_regions=["US", "AE"],
)

job_id = "tide-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
    # Uncomment the following line if you want to use Vertex managed dataset.
    # input_artifacts={'vertex_dataset': vertex_dataset_artifact_id},
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details
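
The `# 15 minutes` comment next to `train_budget_milli_node_hours = 250.0` follows from the unit: 1,000 milli node hours equal 1 node hour. The conversion can be sketched as follows; the `minutes_to_milli_node_hours` helper is hypothetical, and the figure refers to node time in the budget, not to the wall-clock duration of the whole pipeline.

```python
def minutes_to_milli_node_hours(minutes: float) -> float:
    """Convert node minutes to milli node hours (1,000 = 1 node hour)."""
    return minutes / 60.0 * 1000.0


print(minutes_to_milli_node_hours(15))  # 250.0
print(minutes_to_milli_node_hours(60))  # 1000.0
```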

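### Run TiDE pipeline without model architecture search

After you retrieve the tuning result from the stage 1 tuner, you can use it to skip the model architecture search.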

In [None]:
# Retrieve the tuning result output from the previous training pipeline.
stage_1_tuner_task = get_task_detail(
    pipeline_task_details, "automl-forecasting-stage-1-tuner"
)

stage_1_tuning_result_artifact_uri = (
    stage_1_tuner_task.outputs["tuning_result_output"].artifacts[0].uri
)

train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    stage_1_tuning_result_artifact_uri=stage_1_tuning_result_artifact_uri,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
)

job_id = "tide-forecasting-skip-architecture-search-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)

# Keep the task details of the skip-architecture-search run.
skip_architecture_search_pipeline_task_details = (
    job.gca_resource.job_detail.task_details
)

## L2L training

Learn-to-learn (L2L) is a good choice for a wide range of time series forecasting use cases.

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_learn_to_learn_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile forecast requires `minimize-quantile-loss` as optimization objective.
    # quantiles=[0.25, 0.5, 0.9],
)

job_id = "l2l-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details

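## Seq2seq training

Sequence-to-sequence (seq2seq) is a good choice for experimentation. The algorithm is likely to converge faster than AutoML because its architecture is simpler and it uses a smaller search space.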

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_sequence_to_sequence_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile prediction is NOT supported by Seq2seq.
)

job_id = "seq2seq-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details

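## TFT training

TFT stands for Temporal Fusion Transformer, an attention-based DNN model designed to deliver high accuracy and interpretability by aligning the model with the general multi-horizon forecasting task.

With this model, you don't need to explicitly enable explainability support at serving time to get the feature importance of each feature column.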

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_temporal_fusion_transformer_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    # Please note that TFT model will ONLY ensemble the model from
    # the top one trial, so `num_selected_trials` can not be set for TFT model.
    # num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile prediction is NOT supported by TFT.
)

job_id = "tft-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details
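
The training budget in the cell above is given in milli node hours, where 1,000 milli node hours equal one node hour. The short sketch below makes the conversion behind the `# 15 minutes` comment explicit. The helper name `milli_node_hours_to_minutes` is hypothetical and is not part of the Vertex AI SDK.

```python
def milli_node_hours_to_minutes(milli_node_hours: float) -> float:
    """Convert a budget in milli node hours to minutes on a single node."""
    # 1,000 milli node hours make up one node hour.
    node_hours = milli_node_hours / 1000.0
    return node_hours * 60.0


print(milli_node_hours_to_minutes(250.0))  # 15.0
```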

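## Batch prediction and explanation

To enable batch explanation, set `generate_explanation=True` in the `batch_predict` API.

Use the following code to retrieve the trained forecasting model from the pipeline: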

In [None]:
upload_model_task = get_task_detail(pipeline_task_details, "model-upload-2")

forecasting_mp_model_artifact = upload_model_task.outputs["model"].artifacts[0]

forecasting_mp_model = aiplatform.Model(
    forecasting_mp_model_artifact.metadata["resourceName"]
)

Once you have retrieved the Vertex AI model, you can start running batch prediction.

In [None]:
print(f"Running Batch prediction for model: {forecasting_mp_model.display_name}")

batch_predict_bq_output_uri_prefix = f"bq://{PROJECT_ID}"

PREDICTION_DATASET_BQ_PATH = (
    "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2021_sales_predict"
)

batch_prediction_job = forecasting_mp_model.batch_predict(
    job_display_name="forecasting_iowa_liquor_sales_forecasting_predictions",
    bigquery_source=PREDICTION_DATASET_BQ_PATH,
    instances_format="bigquery",
    bigquery_destination_prefix=batch_predict_bq_output_uri_prefix,
    predictions_format="bigquery",
    # Uncomment the following line to run batch explain:
    # generate_explanation=True,
    sync=True,
)

print(batch_prediction_job)
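
Both BigQuery arguments in the cell above are `bq://` URIs: the source names a full table, while the destination prefix names only a project. As a rough sanity check before submitting a job, the sketch below splits such a URI into its parts. `parse_bq_uri` and `BigQueryPath` are hypothetical names, not part of the Vertex AI SDK, and the pattern assumes plain project IDs without a domain prefix.

```python
import re
from typing import NamedTuple, Optional


class BigQueryPath(NamedTuple):
    project: str
    dataset: Optional[str]
    table: Optional[str]


# Assumed shapes: bq://project, bq://project.dataset, bq://project:dataset.table
_BQ_URI = re.compile(
    r"bq://(?P<project>[^:.]+)"
    r"(?:[:.](?P<dataset>[^.]+)(?:\.(?P<table>.+))?)?"
)


def parse_bq_uri(uri: str) -> BigQueryPath:
    """Split a bq:// URI into project, dataset, and table (None if absent)."""
    match = _BQ_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"Not a recognized bq:// URI: {uri!r}")
    return BigQueryPath(match["project"], match["dataset"], match["table"])


source = parse_bq_uri(
    "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2021_sales_predict"
)
print(source.dataset)  # iowa_liquor_sales_forecasting
```

A destination prefix such as `bq://my-project` parses with `dataset` and `table` set to `None`.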

You can also retrieve the uploaded Vertex AI model by using the Vertex AI pipeline job ID:

In [None]:
# Example format of pipeline_job_id: projects/{your-project-id}/locations/us-central1/pipelineJobs/{pipeline-job-id}
pipeline_job_id = ""  # @param {type:"string"}
if pipeline_job_id:
    job = aiplatform.PipelineJob.get(pipeline_job_id)
    pipeline_task_details = job.gca_resource.job_detail.task_details
    upload_model_task = get_task_detail(pipeline_task_details, "model-upload-2")

    forecasting_mp_model_artifact = upload_model_task.outputs["model"].artifacts[0]
    forecasting_mp_model = aiplatform.Model(
        forecasting_mp_model_artifact.metadata["resourceName"]
    )
    print(forecasting_mp_model)
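
The cell above passes a full pipeline job resource name in the format shown in its first comment. The following sketch checks that format up front, so a mistyped value fails with a clear message before any API call is made. `parse_pipeline_job_name` is a hypothetical helper, not an SDK function.

```python
import re

_PIPELINE_JOB_NAME = re.compile(
    r"projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/pipelineJobs/(?P<job_id>[^/]+)"
)


def parse_pipeline_job_name(resource_name: str) -> dict:
    """Return the project, location, and job_id parts of a resource name."""
    match = _PIPELINE_JOB_NAME.fullmatch(resource_name)
    if match is None:
        raise ValueError(
            "Expected projects/{project}/locations/{location}/"
            f"pipelineJobs/{{job_id}}, got {resource_name!r}"
        )
    return match.groupdict()


parts = parse_pipeline_job_name(
    "projects/my-project/locations/us-central1/pipelineJobs/tide-forecasting-123"
)
print(parts["location"])  # us-central1
```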

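## Upload a different model version by using a parent model

To upload this model as a new version of a parent Vertex AI model, you need the parent model's resource name. Set it in `parent_model_resource_name`.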

In [None]:
# The model resource name can be something like: "projects/{your-project-id}/locations/us-central1/models/{model-id}"
parent_model_resource_name = ""  # @param {type:"string"}

if parent_model_resource_name:
    parent_model_artifact = aiplatform.Artifact.get_with_uri(
        "https://us-central1-aiplatform.googleapis.com/v1/" + parent_model_resource_name
    )
    parent_model_artifact_id = str(
        parent_model_artifact.gca_resource.name.split("artifacts/")[1]
    )

    train_budget_milli_node_hours = 250.0  # 15 minutes

    (
        template_path,
        parameter_values,
    ) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
        project=PROJECT_ID,
        location=REGION,
        root_dir=root_dir,
        target_column=target_column,
        optimization_objective=optimization_objective,
        transformations=transformations,
        train_budget_milli_node_hours=train_budget_milli_node_hours,
        # Do not set `data_source_csv_filenames` and
        # `data_source_bigquery_table_path` if you want to use Vertex managed
        # dataset by commenting out the following two lines.
        data_source_csv_filenames=data_source_csv_filenames,
        data_source_bigquery_table_path=data_source_bigquery_table_path,
        weight_column=weight_column,
        predefined_split_key=predefined_split_key,
        training_fraction=training_fraction,
        validation_fraction=validation_fraction,
        test_fraction=test_fraction,
        num_selected_trials=5,
        time_column=time_column,
        time_series_identifier_columns=[time_series_identifier_column],
        time_series_attribute_columns=time_series_attribute_columns,
        available_at_forecast_columns=available_at_forecast_columns,
        unavailable_at_forecast_columns=unavailable_at_forecast_columns,
        forecast_horizon=forecast_horizon,
        context_window=context_window,
        dataflow_subnetwork=dataflow_subnetwork,
        dataflow_use_public_ips=dataflow_use_public_ips,
        run_evaluation=run_evaluation,
        # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
        dataflow_service_account=SERVICE_ACCOUNT,
        # Quantile forecast requires `minimize-quantile-loss` as optimization objective.
        # quantiles=[0.25, 0.5, 0.9],
    )

    job_id = "tide-forecasting-with-parent-model-{}".format(uuid.uuid4())
    job = aiplatform.PipelineJob(
        display_name=job_id,
        location=REGION,  # launches the pipeline job in the specified region
        template_path=template_path,
        job_id=job_id,
        pipeline_root=root_dir,
        parameter_values=parameter_values,
        enable_caching=False,
        input_artifacts={"parent_model": parent_model_artifact_id},
    )

    job.run(service_account=SERVICE_ACCOUNT)
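
The cell above hardcodes the `us-central1` endpoint when it builds the artifact URI. If the parent model lives in another region, one option is to derive the endpoint from the resource name itself. The sketch below works under that assumption and only mirrors the URI pattern used in the cell above; `model_artifact_uri` is a hypothetical helper, not an SDK function.

```python
def model_artifact_uri(model_resource_name: str) -> str:
    """Build the metadata artifact URI for a model resource name."""
    parts = model_resource_name.split("/")
    # Expected shape: projects/{project}/locations/{location}/models/{model}
    well_formed = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "models"
        and all(parts)
    )
    if not well_formed:
        raise ValueError(f"Unexpected model resource name: {model_resource_name!r}")
    location = parts[3]
    return f"https://{location}-aiplatform.googleapis.com/v1/{model_resource_name}"


print(model_artifact_uri("projects/my-project/locations/europe-west4/models/123"))
```

The result could then be passed to `aiplatform.Artifact.get_with_uri` in place of the concatenated string.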

## Integrate Tabular Workflow for Forecasting into an existing KFP pipeline

This is done through KFP's [pipeline-as-component](https://www.kubeflow.org/docs/components/pipelines/v2/load-and-share-components/) feature.

In [None]:
from kfp import compiler, components, dsl

train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=False,
    dataflow_service_account=SERVICE_ACCOUNT,
)

# Load the forecasting pipeline as a sub-pipeline/components which can be used
# in a larger KFP pipeline.
forecasting_pipeline = components.load_component_from_file(template_path)


@dsl.component
def print_message(msg: str):
    print("message:", msg)


# Define a pipeline that follows the below steps:
# step_1(print_message) -> step_2(print_message) -> forecasting_pipeline
@dsl.pipeline
def outer_pipeline(msg_1: str, msg_2: str, ds: dsl.Artifact):
    step_1 = print_message(msg=msg_1)
    step_2 = print_message(msg=msg_2).after(step_1)
    # `vertex_dataset` argument needs to be set/forwarded here to avoid the
    # "missing-argument" error in KFP pipeline.
    forecasting_pipeline(**parameter_values, vertex_dataset=ds).after(step_2)


# Compile and save the outer/larger pipeline template.
outer_pipeline_template_path = "./outer_pipeline.yaml"
compiler.Compiler().compile(outer_pipeline, outer_pipeline_template_path)


job_id = "run-forecasting-pipeline-inside-pipeline-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=outer_pipeline_template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values={"msg_1": "step 1", "msg_2": "step 2"},
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)

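## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial: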

In [None]:
import os

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI