In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Experimenting with BQML and AutoML on Vertex AI

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/rapid_prototyping_bqml_automl.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>  

  <td>
<a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/rapid_prototyping_bqml_automl.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/pipelines/rapid_prototyping_bqml_automl.ipynb" target='_blank'>
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>  
</table>

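## Overview

This tutorial demonstrates how to use Vertex AI Pipelines to rapidly prototype a model. It trains an AutoML model and a BQML model side by side and compares their evaluation metrics, so that you have a baseline before you move on to a custom model.

Learn more about [AutoML components](https://cloud.google.com/vertex-ai/docs/pipelines/vertex-automl-component) and [BigQuery ML components](https://cloud.google.com/vertex-ai/docs/pipelines/bigqueryml-component).

### Objective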

In this tutorial, you learn how to use `Vertex AI Pipelines` to rapidly prototype a model.

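This tutorial uses the following Google Cloud ML services:

- `Vertex AI Pipelines`
- `Vertex AI AutoML`
- `Vertex AI BigQuery ML`
- `Google Cloud Pipeline Components`

The steps performed include:

- Create a BigQuery and Vertex AI training dataset.
- Train a BigQuery ML model and an AutoML model.
- Extract evaluation metrics from the BigQuery ML and AutoML models.
- Select the best trained model.
- Deploy the best trained model.
- Test the deployed model infrastructure.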

## Dataset

#### The Abalone Dataset

<img src="https://storage.googleapis.com/rafacarv-public-bucket-do-not-delete/abalone/dataset.png" />

<p>Dataset credits</p>
<p>Dua, D. and Graff, C. (2019). UCI Machine Learning Repository <a href="http://archive.ics.uci.edu/ml">http://archive.ics.uci.edu/ml</a>. Irvine, CA: University of California, School of Information and Computer Science.</p>

<p><a href="https://archive.ics.uci.edu/ml/datasets/abalone">Direct link</a></p>

#### Attribute information:

<p>The table gives each attribute's name, data type, unit of measurement, and a brief description. The number of rings is the value to predict, either as a continuous value or as a classification problem.</p>

<body>
	<table>
		<tr>
			<th>Name</th>
			<th>Data type</th>
			<th>Measurement unit</th>
			<th>Description</th>
		</tr>
		<tr>
			<td>Sex</td>
			<td>nominal</td>
			<td>--</td>
			<td>M, F, and I (infant)</td>
		</tr>
		<tr>
			<td>Length</td>
			<td>continuous</td>
			<td>mm</td>
			<td>Longest shell measurement</td>
		</tr>
		<tr>
			<td>Diameter</td>
			<td>continuous</td>
			<td>mm</td>
			<td>Perpendicular to length</td>
		</tr>
		<tr>
			<td>Height</td>
			<td>continuous</td>
			<td>mm</td>
			<td>With meat in shell</td>
		</tr>
		<tr>
			<td>Whole weight</td>
			<td>continuous</td>
			<td>grams</td>
			<td>Whole abalone</td>
		</tr>
		<tr>
			<td>Shucked weight</td>
			<td>continuous</td>
			<td>grams</td>
			<td>Weight of meat</td>
		</tr>
		<tr>
			<td>Viscera weight</td>
			<td>continuous</td>
			<td>grams</td>
			<td>Gut weight (after bleeding)</td>
		</tr>
		<tr>
			<td>Shell weight</td>
			<td>continuous</td>
			<td>grams</td>
			<td>Weight after being dried</td>
		</tr>
		<tr>
			<td>Rings</td>
			<td>integer</td>
			<td>--</td>
			<td>+1.5 gives the age in years</td>
		</tr>
	</table>
</body>
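The last row of the table implies a simple conversion from the label to an age. As a minimal sketch (the function name `rings_to_age_years` is illustrative and not part of this notebook):

```python
def rings_to_age_years(rings: int) -> float:
    """Estimate abalone age in years from the ring count (rings + 1.5)."""
    return rings + 1.5


# An abalone with 9 rings is estimated to be about 10.5 years old.
print(rings_to_age_years(9))  # 10.5
```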


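## Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.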

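### Set up your local development environment

If you are using Colab or Vertex AI Workbench, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

- The Google Cloud SDK
- Git
- Python 3
- virtualenv
- Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the SDK](https://cloud.google.com/sdk/docs/).

2. [Install Python 3](https://cloud.google.com/python/setup#installing_python).

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.

4. To install Jupyter, run `pip3 install jupyter` on the command line in a terminal shell.

5. To launch Jupyter, run `jupyter notebook` on the command line in a terminal shell.

6. Open this notebook in the Jupyter Notebook Dashboard.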

### Install additional packages

Install the following packages, which this notebook requires.

In [None]:
# Install Python package dependencies.
! pip3 install --quiet google-cloud-pipeline-components kfp
! pip3 install --quiet --upgrade google-cloud-aiplatform google-cloud-bigquery

### Colab only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"

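#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a UUID for each session and append it to the names of the resources you create in this tutorial.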

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

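### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to authenticate manually. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing, as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**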

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.


In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

#### Service Account

**If you don't know your service account**, run the second cell below, which uses `gcloud` commands to look it up.
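On Colab, the lookup cell derives the Compute Engine default service account from the project number printed by `gcloud projects describe`. The sketch below isolates that string handling so it can be tried without `gcloud`. The helper name and the sample output lines are illustrative assumptions, and, like the notebook, it assumes the `projectNumber` line comes last.

```python
from typing import Sequence


def default_compute_service_account(describe_output: Sequence[str]) -> str:
    """Build the Compute Engine default service account from describe output lines."""
    # Assumes the last line looks like: projectNumber: '123456789012'
    project_number = describe_output[-1].split(":")[1].strip().replace("'", "")
    return f"{project_number}-compute@developer.gserviceaccount.com"


# Made-up output lines, for illustration only.
sample_output = [
    "name: My Project",
    "projectId: my-project",
    "projectNumber: '123456789012'",
]
print(default_compute_service_account(sample_output))
```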

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
import sys

IS_COLAB = "google.colab" in sys.modules
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    if IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

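#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run these commands once per service account.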

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Required imports

In [None]:
import sys
from typing import NamedTuple

from google.cloud import aiplatform as vertex
from google_cloud_pipeline_components.v1 import bigquery as bq_components
from google_cloud_pipeline_components.v1.automl.training_job import \
    AutoMLTabularTrainingJobRunOp
from google_cloud_pipeline_components.v1.dataset import TabularDatasetCreateOp
from google_cloud_pipeline_components.v1.endpoint import (EndpointCreateOp,
                                                          ModelDeployOp)
from google_cloud_pipeline_components.v1.model import ModelUploadOp
from kfp import compiler, dsl
from kfp.dsl import Artifact, Input, Metrics, Output, component

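Define some project and pipeline variables.

Instructions:
- Make sure the GCS bucket and the BigQuery dataset do not already exist. This script may **delete** any existing content.
- Your bucket must be in the same region as your Vertex AI resources.
- The BQ location can be US or EU.
- Make sure your preferred Vertex AI region is supported [[link]](https://cloud.google.com/vertex-ai/docs/general/locations#americas_1).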
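Several of the variables in the next cell are derived by plain string manipulation, which is easy to get wrong with an unexpected input. The following is a minimal sketch of the same derivations with explicit checks. The helper names and the sample values are hypothetical and not part of this notebook.

```python
def strip_gs_scheme(bucket_uri: str) -> str:
    """Drop the leading 'gs://' (5 characters), as BUCKET_URI[5:] does."""
    if not bucket_uri.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got {bucket_uri!r}")
    return bucket_uri[len("gs://"):]


def normalize_bq_location(bq_location: str) -> str:
    """Upper-case the BigQuery location and accept only US or EU."""
    location = bq_location.upper()
    if location not in {"US", "EU"}:
        raise ValueError(f"BQ location must be US or EU, got {bq_location!r}")
    return location


# Hypothetical values, for illustration only.
print(strip_gs_scheme("gs://example-bucket"))  # example-bucket
print(normalize_bq_location("us"))             # US
print("us-central1".split("-")[0])             # us (prefix of the image host)
```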

In [None]:
PIPELINE_YAML_PKG_PATH = "rapid_prototyping.yaml"
PIPELINE_ROOT = f"{BUCKET_URI}/pipeline_root"
DATA_FOLDER = f"{BUCKET_URI[5:]}/data"

RAW_INPUT_DATA = f"gs://{DATA_FOLDER}/abalone.csv"
BQ_DATASET = "vertex_ai_dev_dataset_" + UUID  # @param {type:"string"}
BQ_LOCATION = "US"  # @param {type:"string"}
BQ_LOCATION = BQ_LOCATION.upper()
BQML_EXPORT_LOCATION = f"{BUCKET_URI}/artifacts/bqml"

DISPLAY_NAME = "rapid-prototyping"
ENDPOINT_DISPLAY_NAME = f"{DISPLAY_NAME}_endpoint"

image_prefix = REGION.split("-")[0]
BQML_SERVING_CONTAINER_IMAGE_URI = (
    f"{image_prefix}-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest"
)

In [None]:
!gcloud config set project $PROJECT_ID
!gcloud config set ai/region $REGION

### Download the data

The cell below copies the dataset, a CSV file, to your GCS bucket.

In [None]:
! gsutil cp gs://cloud-samples-data/vertex-ai/community-content/datasets/abalone/abalone.data {RAW_INPUT_DATA}

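## Pipeline components

### Import to BQ

This component imports the CSV file into a BigQuery table. If the dataset does not exist, it is created. If a table with the same name already exists, it is deleted and recreated.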
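To make the table layout concrete, here is a small local sketch that parses CSV text against the same nine columns the component declares, without calling BigQuery. It assumes the file starts with a header row, as the component's `skip_leading_rows=1` implies. The function name and the sample text are illustrative.

```python
import csv
import io
from decimal import Decimal

# Column order used by the load job schema: one STRING, then NUMERIC columns.
COLUMNS = [
    "Sex", "Length", "Diameter", "Height", "Whole_weight",
    "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings",
]


def parse_abalone_csv(csv_text: str) -> list:
    """Parse CSV text into dicts, skipping the header and converting numbers."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row, mirroring skip_leading_rows=1
    rows = []
    for record in reader:
        if len(record) != len(COLUMNS):
            raise ValueError(f"Expected {len(COLUMNS)} columns, got {len(record)}")
        row = {"Sex": record[0]}
        for name, cell in zip(COLUMNS[1:], record[1:]):
            row[name] = Decimal(cell)  # exact decimal, similar in spirit to NUMERIC
        rows.append(row)
    return rows


# Illustrative input: an assumed header plus one sample record.
sample_csv = ",".join(COLUMNS) + "\nM,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
print(parse_abalone_csv(sample_csv))
```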

In [None]:
@component(base_image="python:3.9", packages_to_install=["google-cloud-bigquery"])
def import_data_to_bigquery(
    project: str,
    bq_location: str,
    bq_dataset: str,
    gcs_data_uri: str,
    raw_dataset: Output[Artifact],
    table_name_prefix: str = "abalone",
):
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client(project=project, location=bq_location)

    def load_dataset(gcs_uri, table_id):
        job_config = bigquery.LoadJobConfig(
            schema=[
                bigquery.SchemaField("Sex", "STRING"),
                bigquery.SchemaField("Length", "NUMERIC"),
                bigquery.SchemaField("Diameter", "NUMERIC"),
                bigquery.SchemaField("Height", "NUMERIC"),
                bigquery.SchemaField("Whole_weight", "NUMERIC"),
                bigquery.SchemaField("Shucked_weight", "NUMERIC"),
                bigquery.SchemaField("Viscera_weight", "NUMERIC"),
                bigquery.SchemaField("Shell_weight", "NUMERIC"),
                bigquery.SchemaField("Rings", "NUMERIC"),
            ],
            skip_leading_rows=1,
            # The source format defaults to CSV, so the line below is optional.
            source_format=bigquery.SourceFormat.CSV,
        )
        print(f"Loading {gcs_uri} into {table_id}")
        load_job = client.load_table_from_uri(
            gcs_uri, table_id, job_config=job_config
        )  # Make an API request.

        load_job.result()  # Waits for the job to complete.
        destination_table = client.get_table(table_id)  # Make an API request.
        print("Loaded {} rows.".format(destination_table.num_rows))

    def create_dataset_if_not_exist(bq_dataset_id, bq_location):
        print(
            "Checking for existence of bq dataset. If it does not exist, it creates one"
        )
        dataset = bigquery.Dataset(bq_dataset_id)
        dataset.location = bq_location
        dataset = client.create_dataset(dataset, exists_ok=True, timeout=300)
        print(f"Created dataset {dataset.full_dataset_id} @ {dataset.location}")

    bq_dataset_id = f"{project}.{bq_dataset}"
    create_dataset_if_not_exist(bq_dataset_id, bq_location)

    raw_table_name = f"{table_name_prefix}_raw"
    table_id = f"{project}.{bq_dataset}.{raw_table_name}"
    print("Deleting any tables that might have the same name on the dataset")
    client.delete_table(table_id, not_found_ok=True)
    print("will load data to table")
    load_dataset(gcs_data_uri, table_id)

    raw_dataset_uri = f"bq://{table_id}"
    raw_dataset.uri = raw_dataset_uri

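### Split datasets

This component splits the dataset into three slices:
- Training set (`TRAIN`)
- Validation set (`VALIDATE`)
- Test set (`TEST`)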
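The `split_datasets` component makes this assignment in SQL by hashing each row with `FARM_FINGERPRINT` and taking the result modulo 10: bucket 9 becomes `TEST`, bucket 8 becomes `VALIDATE`, and everything else becomes `TRAIN`. The sketch below shows the same idea in plain Python. It substitutes SHA-256 for `FARM_FINGERPRINT`, so the individual assignments will not match BigQuery's, and the rows are synthetic.

```python
import hashlib
import json
from collections import Counter


def assign_split(row: dict) -> str:
    """Deterministically assign a row to TRAIN, VALIDATE, or TEST (about 80/10/10)."""
    # Stand-in for FARM_FINGERPRINT(TO_JSON_STRING(row)); any stable hash works here.
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10
    if bucket == 9:
        return "TEST"
    if bucket == 8:
        return "VALIDATE"
    return "TRAIN"


# Synthetic rows, for illustration only.
rows = [{"Length": i / 1000, "Rings": i % 20} for i in range(1000)]
counts = Counter(assign_split(r) for r in rows)
print(counts)
```

Because the split depends only on the row's content, re-running it gives every row the same slice, which keeps the BQML and AutoML models comparable across pipeline runs.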

AutoML and BigQuery ML use different terminology for data splits:

#### BQML
How BQML splits the data: [link](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#data_split)

#### AutoML
How AutoML splits the data: [link](https://cloud.google.com/vertex-ai/docs/general/ml-use?hl=da&skip_cache=false)

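<ul>
    <li>Model trials
<p>The training set is used to train models with different combinations of preprocessing, architecture, and hyperparameter options. These models are evaluated on the validation set for quality, which guides the exploration of additional option combinations. The best parameters and architectures found in the parallel tuning phase are used to train the two ensemble models described below.</p></li>

<li>Model evaluation
<p>
Vertex AI trains an evaluation model, using the training and validation sets as training data. Vertex AI then generates the final model evaluation metrics on this model, using the test set. This is the first time in the process that the test set is used. This approach ensures that the final evaluation metrics are an unbiased reflection of how well the final trained model will perform in production.</p></li>

<li>Serving model
<p>A model is trained with the training, validation, and test sets, to maximize the amount of training data. This model is the one that you use to request predictions.</p></li>
</ul>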

In [None]:
@component(
    base_image="python:3.9",
    packages_to_install=["google-cloud-bigquery"],
)  # pandas, pyarrow and fsspec required to export bq data to csv
def split_datasets(
    raw_dataset: Input[Artifact],
    bq_location: str,
) -> NamedTuple(
    "bqml_split",
    [
        ("dataset_uri", str),
        ("dataset_bq_uri", str),
        ("test_dataset_uri", str),
    ],
):

    from collections import namedtuple

    from google.cloud import bigquery

    raw_dataset_uri = raw_dataset.uri
    table_name = raw_dataset_uri.split("bq://")[-1]
    print(table_name)
    raw_dataset_uri = table_name.split(".")
    print(raw_dataset_uri)
    project = raw_dataset_uri[0]
    bq_dataset = raw_dataset_uri[1]
    bq_raw_table = raw_dataset_uri[2]

    client = bigquery.Client(project=project, location=bq_location)

    def split_dataset(table_name_dataset):
        training_dataset_table_name = f"{project}.{bq_dataset}.{table_name_dataset}"
        split_query = f"""
        CREATE OR REPLACE TABLE
            `{training_dataset_table_name}`
           AS
        SELECT
          Sex,
          Length,
          Diameter,
          Height,
          Whole_weight,
          Shucked_weight,
          Viscera_weight,
          Shell_weight,
          Rings,
            CASE(ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(f)), 10)))
              WHEN 9 THEN 'TEST'
              WHEN 8 THEN 'VALIDATE'
              ELSE 'TRAIN' END AS split_col
        FROM
          `{project}.{bq_dataset}.{bq_raw_table}` f
        """
        dataset_uri = f"{project}.{bq_dataset}.{bq_raw_table}"
        print("Splitting the dataset")
        query_job = client.query(split_query)  # Make an API request.
        query_job.result()
        print(dataset_uri)
        print(split_query.replace("\n", " "))
        return training_dataset_table_name

    def create_test_view(training_dataset_table_name, test_view_name="dataset_test"):
        view_uri = f"{project}.{bq_dataset}.{test_view_name}"
        query = f"""
             CREATE OR REPLACE VIEW `{view_uri}` AS SELECT
          Sex,
          Length,
          Diameter,
          Height,
          Whole_weight,
          Shucked_weight,
          Viscera_weight,
          Shell_weight,
          Rings 
          FROM `{training_dataset_table_name}`  f
          WHERE 
          f.split_col = 'TEST'
          """
        print(f"Creating view for --> {test_view_name}")
        print(query.replace("\n", " "))
        query_job = client.query(query)  # Make an API request.
        query_job.result()
        return view_uri

    table_name_dataset = "dataset"

    dataset_uri = split_dataset(table_name_dataset)
    test_dataset_uri = create_test_view(dataset_uri)
    dataset_bq_uri = "bq://" + dataset_uri

    print(f"dataset: {dataset_uri}")

    result_tuple = namedtuple(
        "bqml_split",
        ["dataset_uri", "dataset_bq_uri", "test_dataset_uri"],
    )
    return result_tuple(
        dataset_uri=str(dataset_uri),
        dataset_bq_uri=str(dataset_bq_uri),
        test_dataset_uri=str(test_dataset_uri),
    )

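### Train BQML model

For this demo, we use a simple linear regression model on BQML. However, you can also try other model architectures, such as deep neural networks, XGBoost, and logistic regression.

For the full list of models supported by BQML, see [End-to-end user journey for each model](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey).

As noted earlier, BQML and AutoML use different split terminology, so we adapt the <i>split_col</i> column directly in the SELECT portion of the CREATE MODEL query:

> When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. Rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.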
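The `CASE` expression in the query below collapses the three-way label into that BOOL. A minimal Python sketch of the same mapping (the function name is illustrative) makes one consequence visible: `VALIDATE` rows map to FALSE, so BQML trains on them along with the `TRAIN` rows.

```python
def to_bqml_eval_flag(split_value: str) -> bool:
    """Map a split label to the BOOL expected by DATA_SPLIT_METHOD='CUSTOM'."""
    # TRUE marks evaluation rows; FALSE marks training rows.
    return split_value == "TEST"


for label in ("TRAIN", "VALIDATE", "TEST"):
    print(label, to_bqml_eval_flag(label))
```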

In [None]:
def _query_create_model(
    project_id: str,
    bq_dataset: str,
    training_data_uri: str,
    model_name: str = "linear_regression_model_prototyping",
):
    model_uri = f"{project_id}.{bq_dataset}.{model_name}"

    model_options = """OPTIONS
      ( MODEL_TYPE='LINEAR_REG',
        input_label_cols=['Rings'],
         DATA_SPLIT_METHOD='CUSTOM',
        DATA_SPLIT_COL='split_col'
        )
        """
    query = f"""
    CREATE OR REPLACE MODEL
      `{model_uri}`
      {model_options}
     AS
    SELECT
      Sex,
      Length,
      Diameter,
      Height,
      Whole_weight,
      Shucked_weight,
      Viscera_weight,
      Shell_weight,
      Rings,
      CASE(split_col)
        WHEN 'TEST' THEN TRUE
      ELSE
      FALSE
    END
      AS split_col
    FROM
      `{training_data_uri}`;
    """

    print(query.replace("\n", " "))

    return query

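### Interpret BQML model evaluation

When you do hyperparameter tuning on the model creation query, the output of the pre-built component [BigqueryEvaluateModelJobOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.0/google_cloud_pipeline_components.experimental.bigquery.html#google_cloud_pipeline_components.experimental.bigquery.BigqueryEvaluateModelJobOp) is a table with the metrics that BQML obtained when training the model. You can also inspect these metrics in the BigQuery console, but we need to access them programmatically so that we can compare them with the AutoML model's metrics.

The cell below shows an example of how this can be done. BQML does not include a root mean squared error in its list of metrics, so we add it manually to the metrics dictionary. For more information about the output, see the [BQML documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_output).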
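The component reads the metrics from the artifact's metadata, which holds a `schema` with the column names and `rows` with the cell values. The sketch below runs the same flattening on a hand-written dictionary of that shape, so the logic can be checked without running a pipeline. The function name and the metric values are made up for illustration.

```python
import math


def parse_bqml_metrics(metadata: dict) -> dict:
    """Flatten the schema/rows structure into {metric_name: value} and add RMSE."""
    names = [field["name"] for field in metadata["schema"]["fields"]]
    output = {}
    for row in metadata["rows"]:
        # Each row keeps its cells under "f", one {"v": value} per column.
        for name, cell in zip(names, row["f"]):
            output[name] = float(cell["v"])
    if "mean_squared_error" in output:
        # RMSE is the square root of MSE.
        output["root_mean_squared_error"] = math.sqrt(output["mean_squared_error"])
    return output


# Made-up values in the same shape the component expects.
sample_metadata = {
    "schema": {
        "fields": [{"name": "mean_absolute_error"}, {"name": "mean_squared_error"}]
    },
    "rows": [{"f": [{"v": "1.5"}, {"v": "4.0"}]}],
}
print(parse_bqml_metrics(sample_metadata))
```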

In [None]:
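@component(base_image="python:3.9")
def interpret_bqml_evaluation_metrics(
    bqml_evaluation_metrics: Input[Artifact], metrics: Output[Metrics]
) -> dict:
    import math

    metadata = bqml_evaluation_metrics.metadata
    # Column names of the evaluation table, in column order.
    schema = metadata["schema"]["fields"]

    # Initialize before the loop so the dict exists even if there are no rows.
    output = {}
    for r in metadata["rows"]:
        # Each row stores its cell values under "f", one {"v": value} per column.
        rows = r["f"]
        for metric, value in zip(schema, rows):
            metric_name = metric["name"]
            val = float(value["v"])
            output[metric_name] = val
            metrics.log_metric(metric_name, val)
            if metric_name == "mean_squared_error":
                # BQML reports MSE only; derive RMSE to compare with AutoML.
                rmse = math.sqrt(val)
                output["root_mean_squared_error"] = rmse
                metrics.log_metric("root_mean_squared_error", rmse)

    metrics.log_metric("framework", "BQML")

    print(output)
    # Return the dict promised by the function's return annotation.
    return output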

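### Interpret AutoML model evaluation

Similar to BQML, AutoML also generates metrics during model creation. You can view them in the UI, as shown below:

<img src="https://storage.googleapis.com/rafacarv-public-bucket-do-not-delete/abalone/automl-evaluate.png" />

Since there is no pre-built component for accessing these metrics programmatically, we can use the Vertex AI GAPIC (Generated API Client) library, which provides auto-generated low-level gRPC interfaces to the service.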

In [None]:
# Inspired by Andrew Ferlitsch's work on https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/get_started_with_automl_pipeline_components.ipynb


@component(
    base_image="python:3.9",
    packages_to_install=[
        "google-cloud-aiplatform",
    ],
)
def interpret_automl_evaluation_metrics(
    region: str, model: Input[Artifact], metrics: Output[Metrics]
):
    """
    For a list of available regression metrics, go here: gs://google-cloud-aiplatform/schema/modelevaluation/regression_metrics_1.0.0.yaml.

    More information on available metrics for different types of models: https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-automl
    """

    import google.cloud.aiplatform.gapic as gapic

    # Get a reference to the Model Service client
    client_options = {"api_endpoint": f"{region}-aiplatform.googleapis.com"}

    model_service_client = gapic.ModelServiceClient(client_options=client_options)

    model_resource_name = model.metadata["resourceName"]

    model_evaluations = model_service_client.list_model_evaluations(
        parent=model_resource_name
    )
    model_evaluation = list(model_evaluations)[0]

    available_metrics = [
        "meanAbsoluteError",
        "meanAbsolutePercentageError",
        "rSquared",
        "rootMeanSquaredError",
        "rootMeanSquaredLogError",
    ]
    output = dict()
    for x in available_metrics:
        val = model_evaluation.metrics.get(x)
        output[x] = val
        metrics.log_metric(str(x), float(val))

    metrics.log_metric("framework", "AutoML")
    print(output)

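### Model selection

Now that we have evaluated the models independently, we move forward with only one of them. The selection is based on the model evaluation metrics gathered in the previous steps.

Bear in mind that BigQuery ML and AutoML use different names for their evaluation metrics, so we need to map between the two naming schemes.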
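One way to express that mapping is a small alias table keyed by a short reference name. The sketch below is an illustrative alternative to the lookup loop in the component, not the notebook's own code. The names `METRIC_ALIASES` and `lookup_metric` and the metric values are assumptions.

```python
# Reference name -> the name each framework uses for the same metric.
METRIC_ALIASES = {
    "mae": {"bqml": "mean_absolute_error", "automl": "meanAbsoluteError"},
    "rmse": {"bqml": "root_mean_squared_error", "automl": "rootMeanSquaredError"},
}


def lookup_metric(metrics: dict, framework: str, reference_metric_name: str) -> float:
    """Return the metric under the framework's own name, or infinity if missing."""
    key = METRIC_ALIASES[reference_metric_name][framework]
    return float(metrics.get(key, float("inf")))


# Made-up metric values, for illustration only.
bqml_metrics = {"root_mean_squared_error": 2.2}
automl_metrics = {"rootMeanSquaredError": 2.1}

candidates = [
    ("bqml", lookup_metric(bqml_metrics, "bqml", "rmse")),
    ("automl", lookup_metric(automl_metrics, "automl", "rmse")),
]
# Lower is better for both MAE and RMSE; on a tie, min() keeps the first entry.
best = min(candidates, key=lambda pair: pair[1])
print(best)  # ('automl', 2.1)
```

Defaulting a missing metric to infinity means a model without the reference metric can never win the comparison, which matches how the component initializes both values.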

In [None]:
@component(base_image="python:3.9")
def select_best_model(
    metrics_bqml: Input[Metrics],
    metrics_automl: Input[Metrics],
    thresholds_dict_str: str,
    best_metrics: Output[Metrics],
    reference_metric_name: str = "rmse",
) -> NamedTuple(
    "Outputs",
    [
        ("deploy_decision", str),
        ("best_model", str),
        ("metric", float),
        ("metric_name", str),
    ],
):
    import json
    from collections import namedtuple

    best_metric = float("inf")
    best_model = None

    # BQML and AutoML use different metric names.
    metric_possible_names = []

    if reference_metric_name == "mae":
        metric_possible_names = ["meanAbsoluteError", "mean_absolute_error"]
    elif reference_metric_name == "rmse":
        metric_possible_names = ["rootMeanSquaredError", "root_mean_squared_error"]

    metric_bqml = float("inf")
    metric_automl = float("inf")
    print(metrics_bqml.metadata)
    print(metrics_automl.metadata)
    for x in metric_possible_names:

        try:
            metric_bqml = metrics_bqml.metadata[x]
            print(f"Metric bqml: {metric_bqml}")
        except KeyError:
            print(f"{x} does not exist in the BQML dictionary")

        try:
            metric_automl = metrics_automl.metadata[x]
            print(f"Metric automl: {metric_automl}")
        except KeyError:
            print(f"{x} does not exist in the AutoML dictionary")

    # Change condition if higher is better.
    print(f"Comparing BQML ({metric_bqml}) vs AutoML ({metric_automl})")
    if metric_bqml <= metric_automl:
        best_model = "bqml"
        best_metric = metric_bqml
        best_metrics.metadata = metrics_bqml.metadata
    else:
        best_model = "automl"
        best_metric = metric_automl
        best_metrics.metadata = metrics_automl.metadata

    thresholds_dict = json.loads(thresholds_dict_str)
    deploy = False

    # Change condition if higher is better.
    if best_metric < thresholds_dict[reference_metric_name]:
        deploy = True

    if deploy:
        deploy_decision = "true"
    else:
        deploy_decision = "false"

    print(f"Which model is best? {best_model}")
    print(f"What metric is being used? {reference_metric_name}")
    print(f"What is the best metric? {best_metric}")
    print(f"What is the threshold to deploy? {thresholds_dict_str}")
    print(f"Deploy decision: {deploy_decision}")

    Outputs = namedtuple(
        "Outputs", ["deploy_decision", "best_model", "metric", "metric_name"]
    )

    return Outputs(
        deploy_decision=deploy_decision,
        best_model=best_model,
        metric=best_metric,
        metric_name=reference_metric_name,
    )

### Validate infrastructure

Once the best model has been deployed, you validate the endpoint by sending it a simple prediction request.
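The validation component has to handle two response shapes, because the AutoML model and the exported BQML model return predictions differently: a mapping with a `value` key in one case and a list in the other. The sketch below isolates that normalization step so it can be tested without an endpoint. The function name and the sample predictions are illustrative, and unlike the component it raises an error for any shape it does not recognize.

```python
def extract_prediction(pred) -> float:
    """Return a single number from either prediction format."""
    if isinstance(pred, dict) and "value" in pred:
        # AutoML regression predictions carry the number under "value".
        return float(pred["value"])
    if isinstance(pred, (list, tuple)) and pred:
        # The exported BQML model returns a list whose first item is the number.
        return float(pred[0])
    raise ValueError(f"Unrecognized prediction format: {pred!r}")


# Made-up predictions, for illustration only.
print(extract_prediction({"value": 9.7, "lower_bound": 8.1, "upper_bound": 11.2}))
print(extract_prediction([9.4]))
```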

In [None]:
@component(base_image="python:3.9", packages_to_install=["google-cloud-aiplatform"])
def validate_infrastructure(
    endpoint: Input[Artifact],
) -> NamedTuple(
    "validate_infrastructure_output", [("instance", str), ("prediction", float)]
):
    import json
    from collections import namedtuple

    from google.cloud import aiplatform
    from google.protobuf import json_format
    from google.protobuf.struct_pb2 import Value

    def treat_uri(uri):
        return uri[uri.find("projects/") :]

    def request_prediction(endp, instance):
        instance = json_format.ParseDict(instance, Value())
        instances = [instance]
        parameters_dict = {}
        parameters = json_format.ParseDict(parameters_dict, Value())
        response = endp.predict(instances=instances, parameters=parameters)
        print("deployed_model_id:", response.deployed_model_id)
        print("predictions: ", response.predictions)
        # The predictions are a google.protobuf.Value representation of the model's predictions.
        predictions = response.predictions

        for pred in predictions:
            if type(pred) is dict and "value" in pred.keys():
                # AutoML predictions
                prediction = pred["value"]
            elif type(pred) is list:
                # BQML Predictions return different format
                prediction = pred[0]
            return prediction

    endpoint_uri = endpoint.uri
    treated_uri = treat_uri(endpoint_uri)

    instance = {
        "Sex": "M",
        "Length": 0.33,
        "Diameter": 0.255,
        "Height": 0.08,
        "Whole_weight": 0.205,
        "Shucked_weight": 0.0895,
        "Viscera_weight": 0.0395,
        "Shell_weight": 0.055,
    }
    instance_json = json.dumps(instance)
    print("Will use the following instance: " + instance_json)

    endpoint = aiplatform.Endpoint(treated_uri)
    prediction = request_prediction(endpoint, instance)
    result_tuple = namedtuple(
        "validate_infrastructure_output", ["instance", "prediction"]
    )

    return result_tuple(instance=str(instance_json), prediction=float(prediction))

## The pipeline

In [None]:
pipeline_params = {
    "project": PROJECT_ID,
    "region": REGION,
    "gcs_input_file_uri": RAW_INPUT_DATA,
    "bq_dataset": BQ_DATASET,
    "bq_location": BQ_LOCATION,
    "bqml_model_export_location": BQML_EXPORT_LOCATION,
    "bqml_serving_container_image_uri": BQML_SERVING_CONTAINER_IMAGE_URI,
    "endpoint_display_name": ENDPOINT_DISPLAY_NAME,
    "thresholds_dict_str": '{"rmse": 2.5}',
}

In [None]:
@dsl.pipeline(name=DISPLAY_NAME, description="Rapid Prototyping")
def train_pipeline(
    project: str,
    gcs_input_file_uri: str,
    region: str,
    bq_dataset: str,
    bq_location: str,
    bqml_model_export_location: str,
    bqml_serving_container_image_uri: str,
    endpoint_display_name: str,
    thresholds_dict_str: str,
):
    from google_cloud_pipeline_components.types import artifact_types
    from kfp.dsl import importer_node

    # Imports data to BigQuery using a custom component.
    import_data_to_bigquery_op = import_data_to_bigquery(
        project=project,
        bq_location=bq_location,
        bq_dataset=bq_dataset,
        gcs_data_uri=gcs_input_file_uri,
    )
    raw_dataset = import_data_to_bigquery_op.outputs["raw_dataset"]

    # Splits the BQ dataset using a custom component.
    split_datasets_op = split_datasets(raw_dataset=raw_dataset, bq_location=bq_location)

    # Generates the query to create a BQML using a static function.
    create_model_query = _query_create_model(
        project_id=project,
        bq_dataset=bq_dataset,
        training_data_uri=split_datasets_op.outputs["dataset_uri"],
    )

    # Builds BQML model using pre-built-component.
    bqml_create_op = bq_components.BigqueryCreateModelJobOp(
        project=project, location=bq_location, query=create_model_query
    )
    bqml_model = bqml_create_op.outputs["model"]

    # Gathers BQML evaluation metrics using a pre-built-component.
    bqml_evaluate_op = bq_components.BigqueryEvaluateModelJobOp(
        project=project, location=bq_location, model=bqml_model
    )
    bqml_eval_metrics_raw = bqml_evaluate_op.outputs["evaluation_metrics"]

    # Analyzes evaluation BQML metrics using a custom component.
    interpret_bqml_evaluation_metrics_op = interpret_bqml_evaluation_metrics(
        bqml_evaluation_metrics=bqml_eval_metrics_raw
    )
    bqml_eval_metrics = interpret_bqml_evaluation_metrics_op.outputs["metrics"]

    # Exports the BQML model to a GCS bucket using a pre-built-component.
    bqml_export_op = bq_components.BigqueryExportModelJobOp(
        project=project,
        location=bq_location,
        model=bqml_model,
        model_destination_path=bqml_model_export_location,
    ).after(bqml_evaluate_op)
    bqml_exported_gcs_path = bqml_export_op.outputs["exported_model_path"]

    unmanaged_model_importer = importer_node.importer(
        artifact_uri=bqml_exported_gcs_path,
        artifact_class=artifact_types.UnmanagedContainerModel,
        metadata={
            "containerSpec": {
                "imageUri": "us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-3:latest"
            }
        },
    )

    # Uploads the exported BQML model from GCS into Vertex AI using a pre-built component.
    bqml_model_upload_op = ModelUploadOp(
        project=project,
        location=region,
        display_name=DISPLAY_NAME + "_bqml",
        unmanaged_container_model=unmanaged_model_importer.outputs["artifact"],
    )
    bqml_vertex_model = bqml_model_upload_op.outputs["model"]

    # Creates a Vertex AI Tabular dataset using a pre-built-component.
    dataset_create_op = TabularDatasetCreateOp(
        project=project,
        location=region,
        display_name=DISPLAY_NAME,
        bq_source=split_datasets_op.outputs["dataset_bq_uri"],
    )

    # Trains an AutoML Tables model using a pre-built-component.
    automl_training_op = AutoMLTabularTrainingJobRunOp(
        project=project,
        location=region,
        display_name=f"{DISPLAY_NAME}_automl",
        optimization_prediction_type="regression",
        optimization_objective="minimize-rmse",
        predefined_split_column_name="split_col",
        dataset=dataset_create_op.outputs["dataset"],
        target_column="Rings",
        column_transformations=[
            {"categorical": {"column_name": "Sex"}},
            {"numeric": {"column_name": "Length"}},
            {"numeric": {"column_name": "Diameter"}},
            {"numeric": {"column_name": "Height"}},
            {"numeric": {"column_name": "Whole_weight"}},
            {"numeric": {"column_name": "Shucked_weight"}},
            {"numeric": {"column_name": "Viscera_weight"}},
            {"numeric": {"column_name": "Shell_weight"}},
            {"numeric": {"column_name": "Rings"}},
        ],
    )
    automl_model = automl_training_op.outputs["model"]

    # Analyzes evaluation AutoML metrics using a custom component.
    automl_eval_op = interpret_automl_evaluation_metrics(
        region=region, model=automl_model
    )
    automl_eval_metrics = automl_eval_op.outputs["metrics"]

    # 1) Decides which model is best (AutoML vs BQML);
    # 2) Determines if the best model meets the deployment condition.
    best_model_task = select_best_model(
        metrics_bqml=bqml_eval_metrics,
        metrics_automl=automl_eval_metrics,
        thresholds_dict_str=thresholds_dict_str,
    )

    # If the deploy condition is True, then deploy the best model.
    with dsl.Condition(
        best_model_task.outputs["deploy_decision"] == "true",
        name="deploy_decision",
    ):
        # Creates a Vertex AI endpoint using a pre-built-component.
        endpoint_create_op = EndpointCreateOp(
            project=project,
            location=region,
            display_name=endpoint_display_name,
        )
        endpoint_create_op.after(best_model_task)

        # In case the BQML model is the best...
        with dsl.Condition(
            best_model_task.outputs["best_model"] == "bqml",
            name="deploy_bqml",
        ):
            # Deploys the BQML model (now on Vertex AI) to the recently created endpoint using a pre-built component.
            model_deploy_bqml_op = ModelDeployOp(  # noqa: F841
                endpoint=endpoint_create_op.outputs["endpoint"],
                model=bqml_vertex_model,
                deployed_model_display_name=DISPLAY_NAME + "_best_bqml",
                dedicated_resources_machine_type="n1-standard-2",
                dedicated_resources_min_replica_count=2,
                dedicated_resources_max_replica_count=2,
                traffic_split={
                    "0": 100
                },  # newly deployed model gets 100% of the traffic
            ).set_caching_options(False)

            # Sends an online prediction request to the recently deployed model using a custom component.
            validate_infrastructure(
                endpoint=endpoint_create_op.outputs["endpoint"]
            ).set_caching_options(False).after(model_deploy_bqml_op)

        # In case the AutoML model is the best...
        with dsl.Condition(
            best_model_task.outputs["best_model"] == "automl",
            name="deploy_automl",
        ):
            # Deploys the AutoML model to the recently created endpoint using a pre-built component.
            model_deploy_automl_op = ModelDeployOp(  # noqa: F841
                endpoint=endpoint_create_op.outputs["endpoint"],
                model=automl_model,
                deployed_model_display_name=DISPLAY_NAME + "_best_automl",
                dedicated_resources_machine_type="n1-standard-2",
                dedicated_resources_min_replica_count=2,
                dedicated_resources_max_replica_count=2,
                traffic_split={
                    "0": 100
                },  # newly deployed model gets 100% of the traffic
            ).set_caching_options(False)

            # Sends an online prediction request to the recently deployed model using a custom component.
            validate_infrastructure(
                endpoint=endpoint_create_op.outputs["endpoint"]
            ).set_caching_options(False).after(model_deploy_automl_op)
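
The nested `dsl.Condition` blocks above compare the outputs of `select_best_model` against the strings `"true"`, `"bqml"`, and `"automl"`, so that component has to return strings rather than booleans. The sketch below illustrates that two-stage decision in plain Python. It is not the notebook's component: the function name `pick_best_model`, the plain-dict metric inputs, and the lower-is-better `rmse` default are assumptions made for illustration.

```python
import json
from typing import Dict, Tuple


def pick_best_model(
    metrics_bqml: Dict[str, float],
    metrics_automl: Dict[str, float],
    thresholds_dict_str: str,
    metric_name: str = "rmse",  # assumption: an error metric where lower is better
) -> Tuple[str, str]:
    """Return (deploy_decision, best_model) as strings, as dsl.Condition expects."""
    thresholds = json.loads(thresholds_dict_str)
    bqml_value = metrics_bqml[metric_name]
    automl_value = metrics_automl[metric_name]

    # Stage 1: pick the model with the lower error.
    if bqml_value <= automl_value:
        best_model, best_value = "bqml", bqml_value
    else:
        best_model, best_value = "automl", automl_value

    # Stage 2: deploy only if the winner beats the threshold.
    deploy_decision = "true" if best_value < thresholds[metric_name] else "false"
    return deploy_decision, best_model
```

With these hypothetical inputs, `pick_best_model({"rmse": 2.0}, {"rmse": 1.5}, '{"rmse": 2.5}')` selects the AutoML model and approves deployment, which corresponds to the `deploy_automl` branch of the pipeline.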

### Run the pipeline

In [None]:
compiler.Compiler().compile(
    pipeline_func=train_pipeline,
    package_path=PIPELINE_YAML_PKG_PATH,
)


vertex.init(project=PROJECT_ID, location=REGION)

pipeline_job = vertex.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path=PIPELINE_YAML_PKG_PATH,
    pipeline_root=PIPELINE_ROOT,
    parameter_values=pipeline_params,
    enable_caching=False,
)

# submit() starts the run asynchronously and returns None, so there is nothing to capture.
pipeline_job.submit()

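### Wait for the pipeline to complete

Because you started the pipeline with the `submit()` method, it is currently running asynchronously. To run it synchronously, call the `run()` method instead.

In this last step, you call the `wait()` method to block until the asynchronous job completes.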

In [None]:
pipeline_job.wait()
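
Once `wait()` returns, it can be useful to confirm how the run ended before moving on to cleanup. The helper below is a minimal sketch: `summarize_job` is a hypothetical name, and it assumes only that the job object exposes `display_name` and `state` attributes, as `PipelineJob` does.

```python
def summarize_job(job) -> str:
    """Return '<display name>: <state>' for a pipeline job-like object."""
    state = job.state
    # PipelineJob.state is an enum; fall back to str() for anything else (e.g. None).
    state_name = getattr(state, "name", str(state))
    return f"{job.display_name}: {state_name}"
```

Calling `print(summarize_job(pipeline_job))` after the cell above prints the job's display name and its final state.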

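## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project you used for the tutorial](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects).

Otherwise, you can delete the individual resources you created in this tutorial.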

In [None]:
import os

vertex.init(project=PROJECT_ID, location=REGION)

delete_bucket = False

print("Will delete endpoint")
endpoints = vertex.Endpoint.list(
    filter=f"display_name={DISPLAY_NAME}_endpoint", order_by="create_time"
)
# The endpoint exists only if the pipeline's deploy condition was met.
if endpoints:
    endpoint = endpoints[0]
    endpoint.undeploy_all()
    endpoint.delete()
    print("Deleted endpoint:", endpoint)
else:
    print("No endpoint found; skipping.")

print("Will delete models")
suffix_list = ["bqml", "automl"]
for suffix in suffix_list:
    try:
        model_display_name = f"{DISPLAY_NAME}_{suffix}"
        print("Will delete model with name " + model_display_name)
        models = vertex.Model.list(
            filter=f"display_name={model_display_name}", order_by="create_time"
        )

        model = models[0]
        vertex.Model.delete(model)
        print("Deleted model:", model)
    except Exception as e:
        print(e)


print("Will delete Vertex dataset")
datasets = vertex.TabularDataset.list(
    filter=f"display_name={DISPLAY_NAME}", order_by="create_time"
)

dataset = datasets[0]
vertex.TabularDataset.delete(dataset)
print("Deleted Vertex dataset:", dataset)


pipelines = vertex.PipelineJob.list(
    filter=f"pipeline_name={DISPLAY_NAME}", order_by="create_time"
)
pipeline = pipelines[0]
vertex.PipelineJob.delete(pipeline)
print("Deleted pipeline:", pipeline)

delete_dataset = True

# Delete the BigQuery dataset, and report the deletion only when it actually ran.
if delete_dataset or os.getenv("IS_TESTING"):
    ! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET
    dataset_id = f"{PROJECT_ID}.{BQ_DATASET}"
    print(f"Deleted BQ dataset '{dataset_id}' from location {BQ_LOCATION}.")

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI
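
The cleanup cell indexes the first element of each `list()` result, which raises an `IndexError` when nothing matches the filter (for example, when the pipeline did not deploy a model). One way to make this more tolerant is a small helper like the sketch below. `delete_first` is a hypothetical name, and the helper assumes only that each resource object has a `delete()` method, as Vertex AI SDK resources do.

```python
from typing import Any, Sequence


def delete_first(resources: Sequence[Any], label: str) -> bool:
    """Delete the first resource in the list, if any; return True when one was deleted."""
    if not resources:
        print(f"No {label} found; nothing to delete.")
        return False
    resource = resources[0]
    resource.delete()
    print(f"Deleted {label}:", resource)
    return True
```

For instance, `delete_first(vertex.TabularDataset.list(filter=f"display_name={DISPLAY_NAME}", order_by="create_time"), "Vertex dataset")` would skip the deletion quietly if the dataset was already removed.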