In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/pipelines/google_cloud_pipeline_components_bqml_pipeline_demand_forecasting.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/pipelines/google_cloud_pipeline_components_bqml_pipeline_demand_forecasting.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/pipelines/google_cloud_pipeline_components_bqml_pipeline_demand_forecasting.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

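## Overview

This notebook shows how to use `Vertex AI Pipelines` and `BigQuery ML pipeline components` to train and evaluate a demand forecasting model.

### Objective

In this tutorial, you learn how to use Vertex AI Pipelines and the BigQuery ML pipeline components to train and evaluate a BigQuery ML model.

This tutorial uses the following Google Cloud ML services and resources:

- `Vertex AI Pipelines`
- `BigQuery ML pipeline components`

The steps performed include:

- Define a custom evaluation component.
- Define a pipeline that does the following:
  - Gets the BigQuery training data
  - Trains a BigQuery ARIMA PLUS model
  - Evaluates the BigQuery ARIMA PLUS model
  - Plots the evaluation results
  - Checks the model performance
  - Generates the ARIMA PLUS forecasts
  - Generates the ARIMA PLUS forecast explanations
- Compile the pipeline.
- Execute the pipeline.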

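### Dataset

The dataset is a modified version of the dataset used in the solution architecture [Build and visualize demand forecast predictions using Datastream, Dataflow, BigQuery ML, and Looker](https://cloud.google.com/architecture/build-visualize-demand-forecast-prediction-datastream-dataflow-bigqueryml-looker).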

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery

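### Set up your local development environment

If you are using Colab or a Vertex AI Workbench notebook, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.

4. To install Jupyter, run `pip3 install jupyter` on the command line in a terminal shell.

5. To launch Jupyter, run `jupyter notebook` on the command line in a terminal shell.

6. Open this notebook in the Jupyter Notebook dashboard.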

### Install additional packages

Install the following packages to run this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"
    
! pip3 install --upgrade "kfp" \
                         "google-cloud-aiplatform" \
                         "google_cloud_pipeline_components" \
                         "google-cloud-bigquery" {USER_FLAG} -q

! pip3 install --upgrade "tensorflow<2.8.0" {USER_FLAG} -q

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so that it can find them.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

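## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute and storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=cloudresourcemanager.googleapis.com,aiplatform.googleapis.com,notebooks.googleapis.com).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get it using `gcloud`.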

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project]"  # @param {type:"string"}

In [None]:
! gcloud config set project $PROJECT_ID

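#### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not be able to use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).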

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

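### UUID
If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between the resources that different users create, you generate a UUID for each session and append it to the names of the resources you create in this tutorial.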

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
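To show how the suffix is meant to be used, here is a minimal sketch that appends it to resource names. The `unique_name` helper and the example prefixes are illustrative assumptions and are not part of the notebook. Note that the value is a short random alphanumeric string rather than an RFC 4122 UUID, so collisions are unlikely but not impossible.

```python
import random
import string


def generate_uuid(length: int = 8) -> str:
    """Return a random lowercase alphanumeric string of the given length."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


def unique_name(prefix: str, suffix: str, sep: str = "_") -> str:
    """Hypothetical helper: join a resource prefix and the session suffix."""
    return f"{prefix}{sep}{suffix}"


session_suffix = generate_uuid()

# Placeholder prefixes, for illustration only
table_name = unique_name("orders", session_suffix)
bucket_name = unique_name("my-project-aip", session_suffix, sep="-")
print(table_name, bucket_name)
```

Underscores suit BigQuery table names, while hyphens are the safer separator for bucket names, which is why the separator is a parameter in this sketch.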

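### Authenticate your Google Cloud account

**If you are using a Vertex AI Workbench notebook**, your environment is already authenticated. Skip this step.

If you are using Colab, run the cell below and follow the prompts to authenticate your account through OAuth.

Otherwise, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and then click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type each of the following roles into the filter box and select it:

  - BigQuery Data Editor
  - BigQuery Job User
  - Service Account User
  - Storage Object Admin
  - Storage Admin
  - Vertex AI Administrator

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below, and run the cell.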

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

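### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.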

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents.

In [None]:
! gsutil ls -al $BUCKET_URI

### Get your service account

If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)
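The Colab branch above assumes that the project number is on the last line of the `gcloud projects describe` output. As a slightly more defensive sketch, the field can be looked up by its key instead. The helper names and the sample output lines below are hypothetical, and the `key: value` layout is the same one the cell above already relies on.

```python
from typing import List, Optional


def parse_project_number(describe_lines: List[str]) -> Optional[str]:
    """Find the projectNumber field in `key: value` style output lines."""
    for line in describe_lines:
        key, sep, value = line.partition(":")
        if sep and key.strip() == "projectNumber":
            # Drop surrounding whitespace and quotes around the number
            return value.strip().strip("'\"")
    return None


def default_compute_service_account(project_number: str) -> str:
    """Build the default Compute Engine service account address."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


# Hypothetical output lines, invented for illustration only
sample_output = [
    "createTime: '2022-01-01T00:00:00.000Z'",
    "lifecycleState: ACTIVE",
    "name: my-project",
    "projectId: my-project",
    "projectNumber: '123456789012'",
]

number = parse_project_number(sample_output)
if number is None:
    raise ValueError("projectNumber not found in the output")
print(default_compute_service_account(number))
```

Looking the field up by name keeps the parsing correct even if the order of the output lines changes.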

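### Set service account access

Run the following commands to grant your service account access (the `storage.objectCreator` and `storage.objectViewer` roles) to the bucket that you created in the previous step. You only need to run this step once per service account.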

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator {BUCKET_URI}

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer {BUCKET_URI}

### Import libraries and define variables

Next, import the libraries and set some variables that are used throughout the tutorial.

In [None]:
from pathlib import Path as path
from typing import NamedTuple
# General
from urllib.parse import urlparse

import google.cloud.aiplatform as vertex_ai
# Check components
import tensorflow as tf
# Simulate operations
from google.cloud import bigquery
# ML pipeline
from google_cloud_pipeline_components.v1.bigquery import (
    BigqueryCreateModelJobOp, BigqueryEvaluateModelJobOp,
    BigqueryExplainForecastModelJobOp, BigqueryForecastModelJobOp,
    BigqueryMLArimaEvaluateJobOp, BigqueryQueryJobOp)
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import HTML, Artifact, Condition, Input, Output, component

### Initialize the Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and the corresponding bucket.

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Initialize the BigQuery SDK for Python

Initialize the BigQuery SDK for Python for your project.

In [None]:
bq_client = bigquery.Client(project=PROJECT_ID, location=REGION)

### Create local directories

Next, you create some local directories that are used in this tutorial.

In [None]:
DATA_PATH = "data"
KFP_COMPONENTS_PATH = "components"
PIPELINES_PATH = "pipelines"

! mkdir -m 777 -p {DATA_PATH}
! mkdir -m 777 -p {KFP_COMPONENTS_PATH}
! mkdir -m 777 -p {PIPELINES_PATH}

### Prepare the training data

Next, you copy the CSV training data to your Cloud Storage bucket, and then load it into a BigQuery table for training.

In [None]:
PUBLIC_DATA_URI = "gs://cloud-samples-data/vertex-ai/pipeline-deployment/datasets/oracle_retail/orders.csv"
RAW_DATA_URI = f"{BUCKET_URI}/{DATA_PATH}/raw/orders.csv"

! gsutil -m cp -R $PUBLIC_DATA_URI $RAW_DATA_URI

#### Take a quick look at the CSV data

In [None]:
! gsutil cat {RAW_DATA_URI} | head

In [None]:
LOCATION = REGION.split('-')[0]

! bq mk --location={LOCATION} --dataset {PROJECT_ID}:fast_fresh 

! bq load \
  --location={LOCATION} \
  --source_format=CSV \
  --skip_leading_rows=1 \
  fast_fresh.orders_{UUID} \
  {RAW_DATA_URI} \
  time_of_sale:DATETIME,order_id:INTEGER,product_name:STRING,price:NUMERIC,quantity:NUMERIC,payment_method:STRING,store_id:INTEGER,user_id:INTEGER
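The cell above derives the BigQuery location from the first token of `REGION`, which gives `us` for `us-central1` and matches BigQuery's `US` multi-region. For other regions the prefix may not be a valid BigQuery location: BigQuery's documented multi-regions are `US` and `EU`, so a prefix such as `europe` or `asia` would likely be rejected. The sketch below shows one possible mapping. The helper name and the fallback behavior are assumptions, so check the BigQuery locations documentation before relying on them.

```python
def bigquery_location_for(region: str) -> str:
    """Map a Vertex AI region to a BigQuery location (illustrative assumption)."""
    prefix = region.split("-")[0].lower()
    # Only two multi-regions are assumed here: US and EU
    multi_regions = {"us": "US", "europe": "EU"}
    # Fall back to the single region itself when no multi-region matches
    return multi_regions.get(prefix, region)


for region in ("us-central1", "europe-west4", "asia-east1"):
    print(region, "->", bigquery_location_for(region))
```

Whichever location you choose, the dataset, the load job, and the pipeline's BigQuery jobs all need to use the same one.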

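In the following cells, you build the components and the pipeline that train and evaluate the BQML demand forecasting model.

### Set the pipeline variables

Set some variables that are specific to the pipeline.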

In [None]:
# BQML pipeline job configuration
PIPELINE_NAME = "bqml-forecast-pipeline"
PIPELINE_ROOT = urlparse(BUCKET_URI)._replace(path="pipeline_root").geturl()
PIPELINE_PACKAGE = str(path(PIPELINES_PATH) / f"{PIPELINE_NAME}.json")

# BQML pipeline component configuration
BQ_DATASET = "fast_fresh"
BQ_ORDERS_TABLE_PREFIX = "orders"
BQ_TRAINING_TABLE_PREFIX = "orders_training"
BQ_MODEL_TABLE_PREFIX = "orders_forecast_arima"
BQ_EVALUATE_TS_TABLE_PREFIX = "orders_arima_time_series_evaluate"
BQ_EVALUATE_MODEL_TABLE_PREFIX = "orders_arima_model_evaluate"
BQ_FORECAST_TABLE_PREFIX = "orders_arima_forecast"
BQ_EXPLAIN_FORECAST_TABLE_PREFIX = "orders_arima_explain_forecast"
BQ_ORDERS_TABLE = f"{BQ_ORDERS_TABLE_PREFIX}_{UUID}"
BQ_TRAINING_TABLE = f"{BQ_TRAINING_TABLE_PREFIX}_{UUID}"
BQ_MODEL_TABLE = f"{BQ_MODEL_TABLE_PREFIX}_{UUID}"
BQ_EVALUATE_TS_TABLE = f"{BQ_EVALUATE_TS_TABLE_PREFIX}_{UUID}"
BQ_EVALUATE_MODEL_TABLE = f"{BQ_EVALUATE_MODEL_TABLE_PREFIX}_{UUID}"
BQ_FORECAST_TABLE = f"{BQ_FORECAST_TABLE_PREFIX}_{UUID}"
BQ_EXPLAIN_FORECAST_TABLE = f"{BQ_EXPLAIN_FORECAST_TABLE_PREFIX}_{UUID}"

BQ_TRAIN_CONFIGURATION = {
    "destinationTable": {
        "projectId": PROJECT_ID,
        "datasetId": BQ_DATASET,
        "tableId": BQ_TRAINING_TABLE,
    },
    "writeDisposition": "WRITE_TRUNCATE",
}

BQ_EVALUATE_TS_CONFIGURATION = {
    "destinationTable": {
        "projectId": PROJECT_ID,
        "datasetId": BQ_DATASET,
        "tableId": BQ_EVALUATE_TS_TABLE,
    },
    "writeDisposition": "WRITE_TRUNCATE",
}
BQ_EVALUATE_MODEL_CONFIGURATION = {
    "destinationTable": {
        "projectId": PROJECT_ID,
        "datasetId": BQ_DATASET,
        "tableId": BQ_EVALUATE_MODEL_TABLE,
    },
    "writeDisposition": "WRITE_TRUNCATE",
}
BQ_FORECAST_CONFIGURATION = {
    "destinationTable": {
        "projectId": PROJECT_ID,
        "datasetId": BQ_DATASET,
        "tableId": BQ_FORECAST_TABLE,
    },
    "writeDisposition": "WRITE_TRUNCATE",
}
BQ_EXPLAIN_FORECAST_CONFIGURATION = {
    "destinationTable": {
        "projectId": PROJECT_ID,
        "datasetId": BQ_DATASET,
        "tableId": BQ_EXPLAIN_FORECAST_TABLE,
    },
    "writeDisposition": "WRITE_TRUNCATE",
}
PERF_THRESHOLD = 3000
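The five job configurations above differ only in their destination table, and `PIPELINE_ROOT` is built by setting a path on the parsed bucket URI. The following sketch illustrates both ideas with placeholder values. The `make_query_job_config` helper is hypothetical and only meant to show how the repetition could be factored out.

```python
from urllib.parse import urlparse


def make_query_job_config(project_id: str, dataset_id: str, table_id: str) -> dict:
    """Hypothetical helper: build a job configuration that overwrites one table."""
    return {
        "destinationTable": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        # WRITE_TRUNCATE replaces the table contents on every run
        "writeDisposition": "WRITE_TRUNCATE",
    }


# Placeholder values, for illustration only
bucket_uri = "gs://my-bucket"

# urlparse keeps the bucket as the network location; geturl() adds the "/" separator
pipeline_root = urlparse(bucket_uri)._replace(path="pipeline_root").geturl()
print(pipeline_root)

forecast_config = make_query_job_config(
    "my-project", "fast_fresh", "orders_arima_forecast_ab12cd34"
)
print(forecast_config["destinationTable"]["tableId"])
```

Because every configuration uses `WRITE_TRUNCATE`, rerunning the pipeline overwrites the previous output tables instead of appending to them.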

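### Create a location to store the component definitions

Next, you create a local directory to store the YAML component definitions of the custom components that you create in this tutorial.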

In [None]:
! mkdir -m 777 -p {KFP_COMPONENTS_PATH}/custom_components

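### Create a custom component to read the model evaluation metrics

Build a custom component with the Kubeflow Pipelines SDK visualization API that consumes the model evaluation metrics and visualizes them in the Vertex AI Pipelines UI. Vertex AI lets you view the rendered HTML on the artifact's output page in the Google Cloud console.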

In [None]:
@component(
    base_image="python:3.8-slim",
    packages_to_install=["jinja2", "pandas", "matplotlib"],
    output_component_file=f"{KFP_COMPONENTS_PATH}/custom_components/build_bq_evaluate_metrics.yaml",
)
def get_model_evaluation_metrics(
    metrics_in: Input[Artifact], metrics_out: Output[HTML]
) -> NamedTuple("Outputs", [("avg_mean_absolute_error", float)]):
    """
    Get the average mean absolute error from the metrics
    Args:
        metrics_in: metrics artifact
        metrics_out: metrics artifact
    Returns:
        avg_mean_absolute_error: average mean absolute error
    """

    import pandas as pd

    # Helpers
    def prettyfier(styler):
        """
        Helper function to prettify the metrics table.
        Args:
            styler: Styler object
        Returns:
            Styler object
        """
        caption = {
            "selector": "caption",
            "props": [
                ("caption-side", "top"),
                ("font-size", "150%"),
                ("font-weight", "bold"),
                ("font-family", "arial"),
            ],
        }
        headers = {
            "selector": "th",
            "props": [("color", "black"), ("font-family", "arial")],
        }
        rows = {
            "selector": "td",
            "props": [("text-align", "center"), ("font-family", "arial")],
        }
        styler.set_table_styles([caption, headers, rows])
        styler.set_caption("Forecasting accuracy report <br><br>")
        styler.hide(axis="index")
        styler.format(precision=2)
        styler.background_gradient(cmap="Blues")
        return styler

    def get_column_names(header):
        """
        Helper function to get the column names from the metrics table.
        Args:
            header: header
        Returns:
            column_names: column names
        """
        header_clean = header.replace("_", " ")
        header_abbrev = "".join([h[0].upper() for h in header_clean.split()])
        header_prettied = f"{header_clean} ({header_abbrev})"
        return header_prettied

    # Extract rows and schema from metrics artifact
    rows = metrics_in.metadata["rows"]
    schema = metrics_in.metadata["schema"]

    # Convert into a tabular format
    columns = [metrics["name"] for metrics in schema["fields"] if "name" in metrics]
    records = []
    for row in rows:
        records.append([dl["v"] for dl in row["f"]])
    metrics = (
        pd.DataFrame.from_records(records, columns=columns, index="product_name")
        .astype(float)
        .round(3)
    )
    metrics = metrics.reset_index()

    # Create the HTML artifact for the metrics
    pretty_columns = list(
        map(
            lambda h: get_column_names(h)
            if h != columns[0]
            else h.replace("_", " ").capitalize(),
            columns,
        )
    )
    pretty_metrics = metrics.copy()
    pretty_metrics.columns = pretty_columns
    html_metrics = pretty_metrics.style.pipe(prettyfier).to_html()
    with open(metrics_out.path, "w") as f:
        f.write(html_metrics)

    # Create metrics dictionary for the model
    avg_mean_absolute_error = round(float(metrics.mean_absolute_error.mean()), 0)
    component_outputs = NamedTuple("Outputs", [("avg_mean_absolute_error", float)])

    return component_outputs(avg_mean_absolute_error)
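The component reads the evaluation metrics from the artifact metadata, where each row is a list of cells under `"f"` and each cell holds its value under `"v"`. As a dependency-free sketch of that conversion, the example below uses only the standard library instead of pandas. The sample payload and the product names in it are invented for illustration, and `rows_to_records` and `pretty_header` are hypothetical names.

```python
from statistics import mean
from typing import Dict, List


def rows_to_records(rows: List[dict], schema: dict) -> List[Dict[str, str]]:
    """Pair each cell value ("v") in a row's field list ("f") with its column name."""
    columns = [field["name"] for field in schema["fields"] if "name" in field]
    return [dict(zip(columns, (cell["v"] for cell in row["f"]))) for row in rows]


def pretty_header(header: str) -> str:
    """Mirror get_column_names: append the capitalized initials to the header."""
    words = header.replace("_", " ")
    abbrev = "".join(word[0].upper() for word in words.split())
    return f"{words} ({abbrev})"


# Hypothetical metadata payload, invented for illustration only
schema = {"fields": [{"name": "product_name"}, {"name": "mean_absolute_error"}]}
rows = [
    {"f": [{"v": "product_a"}, {"v": "1200.0"}]},
    {"f": [{"v": "product_b"}, {"v": "2800.0"}]},
]

records = rows_to_records(rows, schema)

# Average the per-product MAE, as the component does before returning it
avg_mae = round(mean(float(r["mean_absolute_error"]) for r in records), 0)
print(pretty_header("mean_absolute_error"), avg_mae)
print("passes threshold:", avg_mae < 3000)
```

The values arrive as strings, which is why both this sketch and the component convert them to floats before averaging.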

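### Build the BigQuery ML training pipeline

Define your workflow using the Kubeflow Pipelines DSL package.

The pipeline workflow consists of the following steps:

1. Get the BigQuery training data
2. Train a BigQuery ARIMA PLUS model
3. Evaluate the BigQuery ARIMA PLUS model
4. Plot the evaluation results
5. Check the model performance
6. Generate the ARIMA PLUS forecasts
7. Generate the ARIMA PLUS forecast explanations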

In [None]:
@dsl.pipeline(
    name=PIPELINE_NAME,
    description="A batch pipeline to train ARIMA PLUS using BQML",
)
def pipeline(
    bq_dataset: str = BQ_DATASET,
    bq_orders_table: str = BQ_ORDERS_TABLE,
    bq_training_table: str = BQ_TRAINING_TABLE,
    bq_train_configuration: dict = BQ_TRAIN_CONFIGURATION,
    bq_model_table: str = BQ_MODEL_TABLE,
    bq_evaluate_time_series_configuration: dict = BQ_EVALUATE_TS_CONFIGURATION,
    bq_evaluate_model_configuration: dict = BQ_EVALUATE_MODEL_CONFIGURATION,
    performance_threshold: float = PERF_THRESHOLD,
    bq_forecast_configuration: dict = BQ_FORECAST_CONFIGURATION,
    bq_explain_forecast_configuration: dict = BQ_EXPLAIN_FORECAST_CONFIGURATION,
    project: str = PROJECT_ID,
    location: str = LOCATION,
):

    # Create the training dataset
    create_training_dataset_op = BigqueryQueryJobOp(
        query=f"""
        -- create the training table
        WITH 
        -- get 90% percentile for time series splitting
        get_split AS (
          SELECT APPROX_QUANTILES(DATETIME_TRUNC(time_of_sale, HOUR), 100)[OFFSET(90)] as split
          FROM `{project}.{bq_dataset}.{bq_orders_table}`
        ),
        -- get train table
        get_train AS (
          SELECT
            DATETIME_TRUNC(time_of_sale, HOUR) as hourly_timestamp,
            product_name,
            SUM(quantity) AS total_sold,
            FROM `{project}.{bq_dataset}.{bq_orders_table}`
        GROUP BY hourly_timestamp, product_name
        )
        SELECT
          *,
          CASE WHEN hourly_timestamp < (SELECT split FROM get_split) THEN 'TRAIN' ELSE 'TEST' END AS split
        FROM get_train
        ORDER BY hourly_timestamp
        """,
        job_configuration_query=bq_train_configuration,
        project=project,
        location=location,
    ).set_display_name("get train data")

    # Run an ARIMA PLUS experiment
    bq_arima_model_exp_op = (
        BigqueryCreateModelJobOp(
            query=f"""
        -- create model table
        CREATE OR REPLACE MODEL `{project}.{bq_dataset}.{bq_model_table}`
        OPTIONS(
        MODEL_TYPE = \'ARIMA_PLUS\',
        TIME_SERIES_TIMESTAMP_COL = \'hourly_timestamp\',
        TIME_SERIES_DATA_COL = \'total_sold\',
        TIME_SERIES_ID_COL = [\'product_name\']
        ) AS
        SELECT
          hourly_timestamp,
          product_name,
          total_sold
        FROM `{project}.{bq_dataset}.{bq_training_table}`
        WHERE split='TRAIN';
        """,
            project=project,
            location=location,
        )
        .set_display_name("run arima+ model experiment")
        .after(create_training_dataset_op)
    )

    # Evaluate ARIMA PLUS time series
    _ = (
        BigqueryMLArimaEvaluateJobOp(
            project=project,
            location=location,
            model=bq_arima_model_exp_op.outputs["model"],
            show_all_candidate_models=False,
            job_configuration_query=bq_evaluate_time_series_configuration,
        )
        .set_display_name("evaluate arima plus time series")
        .after(bq_arima_model_exp_op)
    )

    # Evaluate ARIMA Plus model
    bq_arima_evaluate_model_op = (
        BigqueryEvaluateModelJobOp(
            project=project,
            location=location,
            model=bq_arima_model_exp_op.outputs["model"],
            query_statement=f"""SELECT * FROM `{project}.{bq_dataset}.{bq_training_table}` WHERE split='TEST'""",
            job_configuration_query=bq_evaluate_model_configuration,
        )
        .set_display_name("evaluate arima plus model")
        .after(bq_arima_model_exp_op)
    )

    # Plot model metrics
    get_evaluation_model_metrics_op = (
        get_model_evaluation_metrics(
            bq_arima_evaluate_model_op.outputs["evaluation_metrics"]
        )
        .after(bq_arima_evaluate_model_op)
        .set_display_name("plot evaluation metrics")
    )

    # Check the model performance: continue only if the ARIMA_PLUS average MAE
    # is below the performance threshold pipeline parameter
    with Condition(
        get_evaluation_model_metrics_op.outputs["avg_mean_absolute_error"]
        < performance_threshold,
        name="avg. mae good",
    ):
        # Train the ARIMA PLUS model
        bq_arima_model_op = (
            BigqueryCreateModelJobOp(
                query=f"""
        -- create model table
        CREATE OR REPLACE MODEL `{project}.{bq_dataset}.{bq_model_table}`
        OPTIONS(
        MODEL_TYPE = \'ARIMA_PLUS\',
        TIME_SERIES_TIMESTAMP_COL = \'hourly_timestamp\',
        TIME_SERIES_DATA_COL = \'total_sold\',
        TIME_SERIES_ID_COL = [\'product_name\'],
        MODEL_REGISTRY = \'vertex_ai\',
        VERTEX_AI_MODEL_ID = \'order_demand_forecasting\',
        VERTEX_AI_MODEL_VERSION_ALIASES = [\'staging\']
        ) AS
        SELECT
          DATETIME_TRUNC(time_of_sale, HOUR) as hourly_timestamp,
          product_name,
          SUM(quantity) AS total_sold,
          FROM `{project}.{bq_dataset}.{bq_orders_table}`
        GROUP BY hourly_timestamp, product_name;
        """,
                project=project,
                location=location,
            )
            .set_display_name("train arima+ model")
            .after(get_evaluation_model_metrics_op)
        )

        # Generate the ARIMA PLUS forecasts
        bq_arima_forecast_op = (
            BigqueryForecastModelJobOp(
                project=project,
                location=location,
                model=bq_arima_model_op.outputs["model"],
                horizon=1,  # 1 hour
                confidence_level=0.9,
                job_configuration_query=bq_forecast_configuration,
            )
            .set_display_name("generate hourly forecasts")
            .after(get_evaluation_model_metrics_op)
        )

        # Generate the ARIMA PLUS forecast explanations
        _ = (
            BigqueryExplainForecastModelJobOp(
                project=project,
                location=location,
                model=bq_arima_model_op.outputs["model"],
                horizon=1,  # 1 hour
                confidence_level=0.9,
                job_configuration_query=bq_explain_forecast_configuration,
            )
            .set_display_name("explain hourly forecasts")
            .after(bq_arima_forecast_op)
        )
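The first pipeline step splits the data by time: rows before the 90th percentile timestamp are labeled `TRAIN` and the rest `TEST`, so the model is always evaluated on the most recent hours. The sketch below reproduces that idea in plain Python with an exact cut, whereas `APPROX_QUANTILES` in the query is approximate. The function name and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def time_based_split(
    timestamps: List[datetime], train_percent: int = 90
) -> List[Tuple[datetime, str]]:
    """Label each timestamp TRAIN or TEST around the given percentile cut."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    # Index of the split timestamp; integer math avoids rounding surprises
    cut_index = min(len(ordered) * train_percent // 100, len(ordered) - 1)
    split_point = ordered[cut_index]
    return [(ts, "TRAIN" if ts < split_point else "TEST") for ts in ordered]


# Twenty hourly timestamps, invented for illustration only
start = datetime(2021, 1, 1)
hours = [start + timedelta(hours=h) for h in range(20)]

labeled = time_based_split(hours)
train_count = sum(1 for _, label in labeled if label == "TRAIN")
print("train:", train_count, "test:", len(labeled) - train_count)
```

Splitting on time rather than at random avoids leaking future observations into the training set, which matters for forecasting models.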

### Compile the pipeline into a JSON file

Next, you compile the pipeline, which generates a JSON specification for it.

In [None]:
compiler.Compiler().compile(pipeline_func=pipeline, package_path=PIPELINE_PACKAGE)

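### Execute your pipeline

Next, you execute the pipeline. It takes the following parameters, which are set to their default values:

- `bq_dataset`: The BigQuery dataset used for training.
- `bq_orders_table`: The BigQuery table that holds the raw data.
- `bq_training_table`: The BigQuery table that holds the preprocessed training data.
- `bq_train_configuration`: The job configuration of the training data component.
- `bq_model_table`: The BigQuery table of the trained model.
- `bq_evaluate_time_series_configuration`: The job configuration of the ARIMA time series evaluation.
- `bq_evaluate_model_configuration`: The job configuration of the ARIMA model evaluation.
- `performance_threshold`: The threshold value for the average MAE.
- `bq_forecast_configuration`: The job configuration of the forecast component.
- `bq_explain_forecast_configuration`: The job configuration of the forecast explanation component.
- `project`: The project ID.
- `location`: The BigQuery location used by the jobs.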

In [None]:
bqml_pipeline = vertex_ai.PipelineJob(
    display_name=f"{PIPELINE_NAME}-job",
    template_path=PIPELINE_PACKAGE,
    pipeline_root=PIPELINE_ROOT,
    enable_caching=False,
)

bqml_pipeline.run()
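The run above uses the default value of every parameter. To override some of them, `vertex_ai.PipelineJob` accepts a `parameter_values` dictionary. The sketch below only builds and validates such a dictionary, so it runs without any cloud access. The `build_parameter_values` helper and the override value are hypothetical.

```python
# Parameter names taken from the pipeline function signature above
PIPELINE_PARAMETERS = {
    "bq_dataset",
    "bq_orders_table",
    "bq_training_table",
    "bq_train_configuration",
    "bq_model_table",
    "bq_evaluate_time_series_configuration",
    "bq_evaluate_model_configuration",
    "performance_threshold",
    "bq_forecast_configuration",
    "bq_explain_forecast_configuration",
    "project",
    "location",
}


def build_parameter_values(**overrides) -> dict:
    """Hypothetical helper: reject names that the pipeline does not define."""
    unknown = sorted(set(overrides) - PIPELINE_PARAMETERS)
    if unknown:
        raise ValueError(f"Unknown pipeline parameters: {unknown}")
    return dict(overrides)


# Placeholder override, for illustration only
parameter_values = build_parameter_values(bq_model_table="orders_forecast_arima_demo")
print(parameter_values)

# The dictionary could then be passed when creating the job, for example:
# vertex_ai.PipelineJob(..., parameter_values=parameter_values)
```

Checking the names up front catches typos locally, before the job is submitted.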

### View the BigQuery ML training pipeline results

Finally, you can view the artifact outputs of each task in the pipeline.

In [None]:
PROJECT_NUMBER = bqml_pipeline.gca_resource.name.split("/")[1]
print("PROJECT NUMBER: ", PROJECT_NUMBER)
print("\n\n")


def print_pipeline_output(job, output_task_name):
    JOB_ID = job.name
    print(JOB_ID)
    for _ in range(len(job.gca_resource.job_detail.task_details)):
        TASK_ID = job.gca_resource.job_detail.task_details[_].task_id
        EXECUTE_OUTPUT = (
            PIPELINE_ROOT
            + "/"
            + PROJECT_NUMBER
            + "/"
            + JOB_ID
            + "/"
            + output_task_name
            + "_"
            + str(TASK_ID)
            + "/executor_output.json"
        )
        GCP_RESOURCES = (
            PIPELINE_ROOT
            + "/"
            + PROJECT_NUMBER
            + "/"
            + JOB_ID
            + "/"
            + output_task_name
            + "_"
            + str(TASK_ID)
            + "/gcp_resources"
        )
        EVAL_METRICS = (
            PIPELINE_ROOT
            + "/"
            + PROJECT_NUMBER
            + "/"
            + JOB_ID
            + "/"
            + output_task_name
            + "_"
            + str(TASK_ID)
            + "/evaluation_metrics"
        )
        if tf.io.gfile.exists(EXECUTE_OUTPUT):
            ! gsutil cat $EXECUTE_OUTPUT
            return EXECUTE_OUTPUT
        elif tf.io.gfile.exists(GCP_RESOURCES):
            ! gsutil cat $GCP_RESOURCES
            return GCP_RESOURCES
        elif tf.io.gfile.exists(EVAL_METRICS):
            ! gsutil cat $EVAL_METRICS
            return EVAL_METRICS

    return None


print("bigquery-create-model-job")
artifacts = print_pipeline_output(bqml_pipeline, "bigquery-create-model-job")
print("\n\n")
print("bigquery-ml-arima-evaluate-job")
artifacts = print_pipeline_output(bqml_pipeline, "bigquery-ml-arima-evaluate-job")
print("\n\n")
print("bigquery-evaluate-model-job")
artifacts = print_pipeline_output(bqml_pipeline, "bigquery-evaluate-model-job")
print("\n\n")
print("bigquery-forecast-model-job")
artifacts = print_pipeline_output(bqml_pipeline, "bigquery-forecast-model-job")
print("\n\n")
print("bigquery-explain-forecast-model-job")
artifacts = print_pipeline_output(bqml_pipeline, "bigquery-explain-forecast-model-job")
print("\n\n")
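The function above locates each task's output by concatenating the pipeline root, the project number, the job ID, and a `<task name>_<task ID>` folder. The following sketch isolates that path logic so that it can be checked without a running job. The helper names and all sample values are hypothetical, and the folder layout is the one assumed by the cell above.

```python
def project_number_from_resource_name(resource_name: str) -> str:
    """Extract the number from 'projects/<number>/locations/<region>/pipelineJobs/<job>'."""
    return resource_name.split("/")[1]


def task_output_uri(
    pipeline_root: str,
    project_number: str,
    job_id: str,
    task_name: str,
    task_id: int,
    filename: str,
) -> str:
    """Compose the Cloud Storage path of one task output file."""
    return f"{pipeline_root}/{project_number}/{job_id}/{task_name}_{task_id}/{filename}"


# Placeholder values, for illustration only
resource_name = "projects/123456789012/locations/us-central1/pipelineJobs/demo-job-1"
number = project_number_from_resource_name(resource_name)

uri = task_output_uri(
    "gs://my-bucket/pipeline_root",
    number,
    "demo-job-1",
    "bigquery-create-model-job",
    42,
    "executor_output.json",
)
print(uri)
```

The same helper covers the `gcp_resources` and `evaluation_metrics` files by changing only the `filename` argument.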

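## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) that you used for the tutorial.

Otherwise, you can delete the individual resources that you created in this tutorial.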

In [None]:
# delete pipeline
vertex_ai_pipeline_jobs = vertex_ai.PipelineJob.list(
    filter=f'pipeline_name="{PIPELINE_NAME}"'
)
for pipeline_job in vertex_ai_pipeline_jobs:
    pipeline_job.delete()

# delete model
DELETE_MODEL_SQL = f"DROP MODEL {BQ_DATASET}.{BQ_MODEL_TABLE}"
try:
    delete_model_query_job = bq_client.query(DELETE_MODEL_SQL)
    delete_model_query_result = delete_model_query_job.result()
except Exception as e:
    print(e)

# delete dataset
try:
    delete_dataset_result = bq_client.delete_dataset(
        BQ_DATASET, delete_contents=True, not_found_ok=True
    )
    print(delete_dataset_result)
except Exception as e:
    print(e)

# delete bucket
delete_bucket = True
if os.getenv("IS_TESTING") or delete_bucket:
    ! gsutil -m rm -r $BUCKET_URI


# Remove local resources
! rm -rf {KFP_COMPONENTS_PATH}
! rm -rf {PIPELINES_PATH}
! rm -rf {DATA_PATH}