In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Pipelines: AutoML tabular pipelines using google-cloud-pipeline-components

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/automl_tabular_classification_beans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fpipelines%2Fautoml_tabular_classification_beans.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/pipelines/automl_tabular_classification_beans.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/automl_tabular_classification_beans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook shows how to use the components from the [*google_cloud_pipeline_components*](https://github.com/kubeflow/pipelines/tree/master/components/google-cloud) SDK to build an AutoML tabular classification workflow on [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines).

You build the following pipeline in this notebook:

<a href="https://storage.googleapis.com/amy-jo/images/mp/beans.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/beans.png" width="95%"/></a>

Learn more about [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) and [AutoML components](https://cloud.google.com/vertex-ai/docs/pipelines/vertex-automl-component). Learn more about [Classification for tabular data](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview).

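### Objective

In this tutorial, you learn how to use **Vertex AI Pipelines** and **Google Cloud Pipeline Components** to build an AutoML tabular classification model.

This tutorial uses the following Vertex AI services: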

- Vertex AI Pipelines
- Google Cloud Pipeline Components
- Vertex AutoML
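- Vertex AI Model
- Vertex AI Endpoint

The steps performed include:

- Create a KFP pipeline that:
    - Creates a Vertex AI dataset.
    - Trains an AutoML tabular classification model resource.
    - Creates a Vertex AI endpoint resource.
    - Deploys the model resource to the endpoint resource.
- Compile the KFP pipeline.
- Execute the KFP pipeline using Vertex AI Pipelines.

Learn more about the [Vertex AI components in the Google Cloud Pipeline Components SDK](https://google-cloud-pipeline-components.readthedocs.io/en/latest/google_cloud_pipeline_components.aiplatform.html#module-google_cloud_pipeline_components.aiplatform).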

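### Dataset

This tutorial uses the UCI Machine Learning "Dry Bean Dataset", from KOKLU, M. and OZKAN, I.A. (2020), "Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques", published in Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507.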

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Get started

### Install Vertex AI SDK for Python and other required packages

In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform \
                                 google-cloud-storage \
                                 kfp \
                                 google-cloud-pipeline-components

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it has finished before continuing to the next step. ⚠️</b>
</div>

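### Check the package versions

Check the versions of the packages you installed. The KFP SDK version should be 2.x, because the pipeline in this notebook uses `dsl.If`, which is not available in KFP 1.x.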

In [None]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
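The printed version can also be compared against a minimum programmatically. The following is a minimal sketch, assuming plain dotted version strings; `parse_version` and `meets_minimum` are hypothetical helpers written for this illustration, not part of KFP, and pre-release suffixes such as `rc1` are ignored.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2.4.0' into a 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits of each piece ('0rc1' -> 0).
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as '2.0' so that tuples compare consistently.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.4.0", "2.0.0"))
print(meets_minimum("1.8.22", "2.0.0"))
```

In a notebook, you could pass `kfp.__version__` as the first argument.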

### Authenticate your notebook environment (Colab only)

Authenticate your environment on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

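### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).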

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

If your bucket doesn't already exist, run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}
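Bucket creation fails if a placeholder such as `[your-project-id]` is still in the URI, because square brackets are not allowed in bucket names. The following is a simplified sketch of a pre-flight check. It assumes the common naming rules for bucket names without dots (3 to 63 characters; lowercase letters, digits, dashes, and underscores; starting and ending with a letter or digit) and does not cover every Cloud Storage rule. `looks_like_valid_bucket_uri` is a hypothetical helper.

```python
import re

# Simplified pattern: "gs://" followed by a 3-63 character bucket name made of
# lowercase letters, digits, dashes, and underscores, starting and ending with
# a letter or digit. Dotted names and reserved prefixes are not handled.
_BUCKET_URI_PATTERN = re.compile(r"gs://[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_uri(uri: str) -> bool:
    """Return True if the URI matches the simplified bucket naming rules."""
    return _BUCKET_URI_PATTERN.fullmatch(uri) is not None


# The unreplaced placeholder is rejected because of the square brackets.
print(looks_like_valid_bucket_uri("gs://your-bucket-name-[your-project-id]-unique"))
print(looks_like_valid_bucket_uri("gs://your-bucket-name-my-project-unique"))
```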

### Service account

**If you don't know your service account**, try to get it by running the second cell below, which uses the `gcloud` command.

In [None]:
SERVICE_ACCOUNT = ""  # @param {type:"string"}

In [None]:
import sys

IS_COLAB = "google.colab" in sys.modules
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    if IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)
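The Colab branch above assumes that the project number is on the last line of the `gcloud projects describe` output. The following sketch looks the field up by name instead, so it does not depend on line order. `project_number_from_describe` is a hypothetical helper, and the sample lines are invented for illustration rather than taken from real output.

```python
def project_number_from_describe(lines):
    """Return the value of the `projectNumber` field, or None if it is absent."""
    for line in lines:
        # Split on the first colon only, so values containing colons stay intact.
        key, _, value = line.partition(":")
        if key.strip() == "projectNumber":
            return value.strip().strip("'\"")
    return None


# Invented sample output, one list item per line as IPython's `!` returns it.
sample_output = [
    "createTime: '2021-01-01T00:00:00.000Z'",
    "lifecycleState: ACTIVE",
    "name: my-project",
    "projectId: my-project",
    "projectNumber: '123456789012'",
]

project_number = project_number_from_describe(sample_output)
service_account = f"{project_number}-compute@developer.gserviceaccount.com"
print(service_account)
```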

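#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.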

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Import libraries and define constants

In [None]:
from typing import NamedTuple

import google.cloud.aiplatform as aiplatform
import kfp
from google.cloud import bigquery
from kfp import compiler, dsl
from kfp.dsl import (Artifact, ClassificationMetrics, Input, Metrics, Output,
                     component)

#### Vertex AI constants

Set the following constants for Vertex AI Pipelines:
- `PIPELINE_NAME`: The name of the pipeline.
- `PIPELINE_ROOT`: The Cloud Storage path where the pipeline artifacts are stored.

In [None]:
# set path for storing the pipeline artifacts
PIPELINE_NAME = "automl-tabular-beans-training"
PIPELINE_ROOT = "{}/pipeline_root/beans".format(BUCKET_URI)

### Initialize Vertex AI SDK for Python

To get started using Vertex AI, you must [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Initialize the Vertex AI SDK for Python for your project and the corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

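## Define a custom component for metrics evaluation

In this tutorial, you define one custom pipeline component. The remaining components are prebuilt components for Vertex AI services.

The custom pipeline component you define is a Python-function-based component. Python-function-based components make it easier to iterate quickly, because you write the component code as a Python function and the component specification is generated for you.

Note the `@component` decorator. When you evaluate the `classification_model_eval_metrics` function, the component is compiled into what is essentially a task factory function that can be used in the pipeline definition.

In addition, the `compiler.Compiler().compile(...)` call at the end of the cell generates a **tabular_eval_component.yaml** component definition file. This component **yaml** file can be shared and placed under version control, and used later to define pipeline steps.

The component definition specifies the base image for the component to use, and specifies that the `google-cloud-aiplatform` package should be installed. When not specified, the base image defaults to Python 3.7.

The custom pipeline component you create in this step retrieves the classification model evaluation metrics generated by the AutoML tabular training process. It then parses the evaluation data and renders the ROC curve and confusion matrix for the model. It also checks the generated metrics against the thresholds you set, to determine whether the model is accurate enough to deploy.

**Note**: This custom component is specific to an AutoML tabular classification task.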
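The ROC parsing step amounts to pulling three parallel lists out of the `confidenceMetrics` entries of the evaluation. The following standalone sketch mirrors that logic outside the pipeline. The field names are the ones the component reads, while `roc_points` is a hypothetical helper and the sample values are invented for illustration.

```python
def roc_points(confidence_metrics):
    """Collect false positive rates, recalls, and thresholds as parallel lists."""
    fpr, tpr, thresholds = [], [], []
    for item in confidence_metrics:
        # Missing fields default to 0.0, matching the component's behavior.
        fpr.append(item.get("falsePositiveRate", 0.0))
        tpr.append(item.get("recall", 0.0))
        thresholds.append(item.get("confidenceThreshold", 0.0))
    return fpr, tpr, thresholds


# Invented sample entries; the last one has no falsePositiveRate field.
sample_confidence_metrics = [
    {"confidenceThreshold": 0.1, "recall": 1.0, "falsePositiveRate": 0.4},
    {"confidenceThreshold": 0.5, "recall": 0.9, "falsePositiveRate": 0.1},
    {"confidenceThreshold": 0.9, "recall": 0.6},
]

fpr, tpr, thresholds = roc_points(sample_confidence_metrics)
print(fpr)         # [0.4, 0.1, 0.0]
print(tpr)         # [1.0, 0.9, 0.6]
print(thresholds)  # [0.1, 0.5, 0.9]
```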

In [None]:
@component(
    base_image="gcr.io/deeplearning-platform-release/tf2-cpu.2-6:latest",
    packages_to_install=["google-cloud-aiplatform"],
)
def classification_model_eval_metrics(
    project: str,
    location: str,
    thresholds_dict_str: str,
    model: Input[Artifact],
    metrics: Output[Metrics],
    metricsc: Output[ClassificationMetrics],
) -> NamedTuple("Outputs", [("dep_decision", str)]):  # Return parameter.

    import json
    import logging

    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)

    # Fetch model eval info
    def get_eval_info(model):
        response = model.list_model_evaluations()
        metrics_list = []
        metrics_string_list = []
        for evaluation in response:
            evaluation = evaluation.to_dict()
            print("model_evaluation")
            print(" name:", evaluation["name"])
            print(" metrics_schema_uri:", evaluation["metricsSchemaUri"])
            metrics = evaluation["metrics"]
            for metric in metrics.keys():
                logging.info("metric: %s, value: %s", metric, metrics[metric])
            metrics_str = json.dumps(metrics)
            metrics_list.append(metrics)
            metrics_string_list.append(metrics_str)

        return (
            evaluation["name"],
            metrics_list,
            metrics_string_list,
        )

    # Use the given metrics threshold(s) to determine whether the model is
    # accurate enough to deploy.
    def classification_thresholds_check(metrics_dict, thresholds_dict):
        for k, v in thresholds_dict.items():
            logging.info("k {}, v {}".format(k, v))
            if k in ["auRoc", "auPrc"]:  # higher is better
                if metrics_dict[k] < v:  # if under threshold, don't deploy
                    logging.info("{} < {}; returning False".format(metrics_dict[k], v))
                    return False
        logging.info("threshold checks passed.")
        return True

    def log_metrics(metrics_list, metricsc):
        test_confusion_matrix = metrics_list[0]["confusionMatrix"]
        logging.info("rows: %s", test_confusion_matrix["rows"])

        # log the ROC curve
        fpr = []
        tpr = []
        thresholds = []
        for item in metrics_list[0]["confidenceMetrics"]:
            fpr.append(item.get("falsePositiveRate", 0.0))
            tpr.append(item.get("recall", 0.0))
            thresholds.append(item.get("confidenceThreshold", 0.0))
        print(f"fpr: {fpr}")
        print(f"tpr: {tpr}")
        print(f"thresholds: {thresholds}")
        metricsc.log_roc_curve(fpr, tpr, thresholds)

        # log the confusion matrix
        annotations = []
        for item in test_confusion_matrix["annotationSpecs"]:
            annotations.append(item["displayName"])
        logging.info("confusion matrix annotations: %s", annotations)
        metricsc.log_confusion_matrix(
            annotations,
            test_confusion_matrix["rows"],
        )

        # log textual metrics info as well
        for metric in metrics_list[0].keys():
            if metric != "confidenceMetrics":
                val_string = json.dumps(metrics_list[0][metric])
                metrics.log_metric(metric, val_string)

    logging.getLogger().setLevel(logging.INFO)

    # extract the model resource name from the input Model Artifact
    model_resource_path = model.metadata["resourceName"]
    logging.info("model path: %s", model_resource_path)

    # Get the trained model resource
    model = aiplatform.Model(model_resource_path)

    # Get model evaluation metrics from the trained model
    eval_name, metrics_list, metrics_str_list = get_eval_info(model)
    logging.info("got evaluation name: %s", eval_name)
    logging.info("got metrics list: %s", metrics_list)
    log_metrics(metrics_list, metricsc)

    thresholds_dict = json.loads(thresholds_dict_str)
    deploy = classification_thresholds_check(metrics_list[0], thresholds_dict)
    if deploy:
        dep_decision = "true"
    else:
        dep_decision = "false"
    logging.info("deployment decision is %s", dep_decision)

    return (dep_decision,)


compiler.Compiler().compile(
    classification_model_eval_metrics, "tabular_eval_component.yaml"
)
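The deployment decision inside the component can be tried out locally without running a pipeline. The following sketch reproduces the threshold check as a plain function. `passes_thresholds` is a hypothetical name and the metric values are invented. As in the component, only `auRoc` and `auPrc` are compared, so a threshold for any other key is silently ignored.

```python
import json

# Metrics where a higher value is better; other threshold keys are not checked.
HIGHER_IS_BETTER = ("auRoc", "auPrc")


def passes_thresholds(metrics: dict, thresholds: dict) -> bool:
    """Return False as soon as a checked metric falls below its threshold."""
    for name, minimum in thresholds.items():
        if name in HIGHER_IS_BETTER and metrics[name] < minimum:
            return False
    return True


# The thresholds arrive as a JSON string, as in the pipeline parameter.
thresholds = json.loads('{"auRoc": 0.95}')

print(passes_thresholds({"auRoc": 0.97, "auPrc": 0.90}, thresholds))  # True
print(passes_thresholds({"auRoc": 0.91, "auPrc": 0.90}, thresholds))  # False
```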

## Define the pipeline

Define the pipeline for AutoML tabular classification using the components from the `google_cloud_pipeline_components` package.

In [None]:
@kfp.dsl.pipeline(name=PIPELINE_NAME, pipeline_root=PIPELINE_ROOT)
def pipeline(
    bq_source: str,
    DATASET_DISPLAY_NAME: str,
    TRAINING_DISPLAY_NAME: str,
    MODEL_DISPLAY_NAME: str,
    ENDPOINT_DISPLAY_NAME: str,
    MACHINE_TYPE: str,
    project: str,
    gcp_region: str,
    thresholds_dict_str: str,
):
    from google_cloud_pipeline_components.v1.automl.training_job import \
        AutoMLTabularTrainingJobRunOp
    from google_cloud_pipeline_components.v1.dataset.create_tabular_dataset.component import \
        tabular_dataset_create as TabularDatasetCreateOp
    from google_cloud_pipeline_components.v1.endpoint.create_endpoint.component import \
        endpoint_create as EndpointCreateOp
    from google_cloud_pipeline_components.v1.endpoint.deploy_model.component import \
        model_deploy as ModelDeployOp

    dataset_create_op = TabularDatasetCreateOp(
        project=project,
        location=gcp_region,
        display_name=DATASET_DISPLAY_NAME,
        bq_source=bq_source,
    )

    training_op = AutoMLTabularTrainingJobRunOp(
        project=project,
        location=gcp_region,
        display_name=TRAINING_DISPLAY_NAME,
        optimization_prediction_type="classification",
        optimization_objective="minimize-log-loss",
        budget_milli_node_hours=1000,
        model_display_name=MODEL_DISPLAY_NAME,
        column_specs={
            "Area": "numeric",
            "Perimeter": "numeric",
            "MajorAxisLength": "numeric",
            "MinorAxisLength": "numeric",
            "AspectRation": "numeric",
            "Eccentricity": "numeric",
            "ConvexArea": "numeric",
            "EquivDiameter": "numeric",
            "Extent": "numeric",
            "Solidity": "numeric",
            "roundness": "numeric",
            "Compactness": "numeric",
            "ShapeFactor1": "numeric",
            "ShapeFactor2": "numeric",
            "ShapeFactor3": "numeric",
            "ShapeFactor4": "numeric",
            "Class": "categorical",
        },
        dataset=dataset_create_op.outputs["dataset"],
        target_column="Class",
    )

    model_eval_task = classification_model_eval_metrics(
        project=project,
        location=gcp_region,
        thresholds_dict_str=thresholds_dict_str,
        model=training_op.outputs["model"],
    )

    with dsl.If(
        model_eval_task.outputs["dep_decision"] == "true",
        name="deploy_decision",
    ):

        endpoint_op = EndpointCreateOp(
            project=project,
            location=gcp_region,
            display_name=ENDPOINT_DISPLAY_NAME,
        )

        ModelDeployOp(
            model=training_op.outputs["model"],
            endpoint=endpoint_op.outputs["endpoint"],
            dedicated_resources_min_replica_count=1,
            dedicated_resources_max_replica_count=1,
            dedicated_resources_machine_type=MACHINE_TYPE,
        )
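The `column_specs` dictionary in the training step repeats the same value for every feature column. One way to keep it easier to maintain is to build it from a list of column names, as in the following sketch. The column names are copied from the pipeline definition, including the dataset's own spelling `AspectRation`, while the `NUMERIC_COLUMNS` and `TARGET_COLUMN` constants are hypothetical names introduced here.

```python
# Feature columns of the beans table, spelled exactly as in the pipeline above.
NUMERIC_COLUMNS = [
    "Area",
    "Perimeter",
    "MajorAxisLength",
    "MinorAxisLength",
    "AspectRation",
    "Eccentricity",
    "ConvexArea",
    "EquivDiameter",
    "Extent",
    "Solidity",
    "roundness",
    "Compactness",
    "ShapeFactor1",
    "ShapeFactor2",
    "ShapeFactor3",
    "ShapeFactor4",
]
TARGET_COLUMN = "Class"

# Every feature is numeric; the target is the only categorical column.
column_specs = {name: "numeric" for name in NUMERIC_COLUMNS}
column_specs[TARGET_COLUMN] = "categorical"

print(len(column_specs))  # 17
```

The resulting dictionary has the same keys and values as the literal passed to `AutoMLTabularTrainingJobRunOp`.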

## Compile the pipeline

Next, compile the pipeline to a yaml file.

In [None]:
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path="tabular_classification_pipeline.yaml",
)

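## Run the pipeline

Pass the required input parameters to the pipeline and run it. The pipeline you defined takes the following parameters:

- `bq_source`: The BigQuery source for the tabular dataset.
- `DATASET_DISPLAY_NAME`: The display name for the Vertex AI managed dataset.
- `TRAINING_DISPLAY_NAME`: The display name for the AutoML training job.
- `MODEL_DISPLAY_NAME`: The display name for the Vertex AI model generated by the training job.
- `ENDPOINT_DISPLAY_NAME`: The display name for the Vertex AI endpoint where the model is deployed.
- `MACHINE_TYPE`: The machine type for the serving container.
- `project`: The ID of the project where the pipeline runs.
- `gcp_region`: The region to set as the pipeline location.
- `thresholds_dict_str`: The dictionary of thresholds, serialized as a JSON string, that determines whether the model is deployed.
- `pipeline_root`: An argument of `PipelineJob` rather than of the pipeline function. To override the pipeline root path specified in the pipeline job definition, specify a path that your pipeline job can access, such as a Cloud Storage bucket URI.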
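Because `thresholds_dict_str` is passed as a string, it can be convenient to write the thresholds as a regular dictionary and serialize it with `json.dumps`, which avoids quoting mistakes in a hand-written JSON literal. A minimal sketch, where the dictionary value is an example threshold:

```python
import json

# Example threshold: require an area under the ROC curve of at least 0.95.
thresholds = {"auRoc": 0.95}

# Serialize for the pipeline parameter; the component parses it with json.loads.
thresholds_dict_str = json.dumps(thresholds)

print(thresholds_dict_str)  # {"auRoc": 0.95}
```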

In [None]:
# Set the display-names for Vertex AI resources
PIPELINE_DISPLAY_NAME = "[your-pipeline-display-name]"  # @param {type:"string"}
DATASET_DISPLAY_NAME = "[your-dataset-display-name]"  # @param {type:"string"}
MODEL_DISPLAY_NAME = "[your-model-display-name]"  # @param {type:"string"}
TRAINING_DISPLAY_NAME = "[your-training-job-display-name]"  # @param {type:"string"}
ENDPOINT_DISPLAY_NAME = "[your-endpoint-display-name]"  # @param {type:"string"}

# Otherwise, use the default display-names
if PIPELINE_DISPLAY_NAME == "[your-pipeline-display-name]":
    PIPELINE_DISPLAY_NAME = "pipeline_beans-unique"

if DATASET_DISPLAY_NAME == "[your-dataset-display-name]":
    DATASET_DISPLAY_NAME = "dataset_beans-unique"

if MODEL_DISPLAY_NAME == "[your-model-display-name]":
    MODEL_DISPLAY_NAME = "model_beans-unique"

if TRAINING_DISPLAY_NAME == "[your-training-job-display-name]":
    TRAINING_DISPLAY_NAME = "automl_training_beans-unique"

if ENDPOINT_DISPLAY_NAME == "[your-endpoint-display-name]":
    ENDPOINT_DISPLAY_NAME = "endpoint_beans-unique"

# Set machine type
MACHINE_TYPE = "n1-standard-4"
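The default display names above end in a fixed `-unique` suffix, so repeated runs reuse the same names. If you want names that differ between runs, one option is to append a short random suffix, as in the following sketch. `with_unique_suffix` is a hypothetical helper and is not part of the Vertex AI SDK.

```python
import uuid


def with_unique_suffix(base_name: str, length: int = 8) -> str:
    """Append a short random hexadecimal suffix to a display name."""
    suffix = uuid.uuid4().hex[:length]
    return f"{base_name}-{suffix}"


# Each call produces a different name, for example "dataset_beans-3f9c1a7e".
print(with_unique_suffix("dataset_beans"))
```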

In [None]:
# Validate region of the given source (BigQuery) against region of the pipeline
bq_source = "aju-dev-demos.beans.beans1"

# Pass the project explicitly so the client does not depend on environment defaults
client = bigquery.Client(project=PROJECT_ID)
bq_region = client.get_table(bq_source).location.lower()
if bq_region in LOCATION:
    print(f"Region validated: {LOCATION}")
else:
    print(
        "Please make sure the region of BigQuery (source) and that of the pipeline are the same."
    )

# Configure the pipeline
job = aiplatform.PipelineJob(
    display_name=PIPELINE_DISPLAY_NAME,
    template_path="tabular_classification_pipeline.yaml",
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "project": PROJECT_ID,
        "gcp_region": LOCATION,
        "bq_source": f"bq://{bq_source}",
        "thresholds_dict_str": '{"auRoc": 0.95}',
        "DATASET_DISPLAY_NAME": DATASET_DISPLAY_NAME,
        "TRAINING_DISPLAY_NAME": TRAINING_DISPLAY_NAME,
        "MODEL_DISPLAY_NAME": MODEL_DISPLAY_NAME,
        "ENDPOINT_DISPLAY_NAME": ENDPOINT_DISPLAY_NAME,
        "MACHINE_TYPE": MACHINE_TYPE,
    },
    enable_caching=True,
)
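The region validation above relies on a substring test, which accepts some mismatched pairs: for example, `"us" in "australia-southeast1"` is `True`. The following sketch shows a stricter comparison. It assumes that only the `US` and `EU` BigQuery multi-regions need special handling and that they pair with `us-` and `europe-` pipeline regions respectively. `regions_compatible` and `MULTI_REGION_PREFIXES` are hypothetical names.

```python
# Assumption: map each BigQuery multi-region to the prefix of matching regions.
MULTI_REGION_PREFIXES = {"us": "us-", "eu": "europe-"}


def regions_compatible(bq_location: str, pipeline_location: str) -> bool:
    """Return True for an exact region match or a matching multi-region."""
    bq = bq_location.lower()
    location = pipeline_location.lower()
    if bq == location:
        return True
    prefix = MULTI_REGION_PREFIXES.get(bq)
    return prefix is not None and location.startswith(prefix)


print("us" in "australia-southeast1")                    # True (substring test)
print(regions_compatible("US", "australia-southeast1"))  # False
print(regions_compatible("US", "us-central1"))           # True
```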

Run the pipeline job. Click the generated link to see your run in the Cloud Console.

In [None]:
# Run the job
job.run()

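## Check the parameters and metrics of the pipeline run from their tracked metadata

Next, you use the Vertex AI SDK for Python to check the parameters and metrics of the pipeline run. Wait for the pipeline run to finish before running the next cell.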

In [None]:
pipeline_df = aiplatform.get_pipeline_df(pipeline=PIPELINE_NAME)
print(pipeline_df.head(2))

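## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.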

In [None]:
# Delete the Vertex AI Pipeline Job
job.delete()

# List and filter the Vertex AI Endpoint
endpoints = aiplatform.Endpoint.list(
    filter=f"display_name={ENDPOINT_DISPLAY_NAME}", order_by="create_time"
)
# Delete the Vertex AI Endpoint
if len(endpoints) > 0:
    endpoint = endpoints[0]
    endpoint.delete(force=True)

# List and filter the Vertex AI model
models = aiplatform.Model.list(
    filter=f"display_name={MODEL_DISPLAY_NAME}", order_by="create_time"
)
# Delete the Vertex AI model
if len(models) > 0:
    model = models[0]
    model.delete()

# List and filter the Vertex AI Dataset
datasets = aiplatform.TabularDataset.list(
    filter=f"display_name={DATASET_DISPLAY_NAME}", order_by="create_time"
)
# Delete the Vertex AI Dataset
if len(datasets) > 0:
    dataset = datasets[0]
    dataset.delete()

# Delete the Cloud Storage bucket
delete_bucket = False  # Set True for deletion
if delete_bucket:
    ! gsutil rm -r $BUCKET_URI
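The endpoint, model, and dataset cleanup steps share one pattern: list the matching resources, then delete the first one if there is any. The following sketch factors that pattern into a small function. `delete_first` is a hypothetical helper, and `FakeResource` is a stand-in used here only so that the example runs without Google Cloud access.

```python
def delete_first(resources, **delete_kwargs):
    """Delete the first resource in the list, if any; report whether one was deleted."""
    if not resources:
        return False
    resources[0].delete(**delete_kwargs)
    return True


class FakeResource:
    """Stand-in for a Vertex AI resource; it only records the delete call."""

    def __init__(self):
        self.deleted_with = None

    def delete(self, **kwargs):
        self.deleted_with = kwargs


fake_endpoint = FakeResource()
print(delete_first([fake_endpoint], force=True))  # True
print(delete_first([]))                           # False
```

With the real SDK objects, the equivalent calls would be `delete_first(endpoints, force=True)`, `delete_first(models)`, and `delete_first(datasets)`.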