In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Train an acquisition prediction model using Swivel, BigQuery ML, and Vertex AI Pipelines

<div align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/google_cloud_pipeline_components_bqml_text.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/google_cloud_pipeline_components_bqml_text.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/pipelines/google_cloud_pipeline_components_bqml_text.ipynb" target='_blank'>
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</div>

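## Overview

This notebook demonstrates how to use `DataflowPythonJobOp` and the BigQuery ML components by building a text classification model and running it on Vertex AI Pipelines.

The pipeline performs the following steps:

1. Read the raw text (HTML) documents stored in Cloud Storage.
2. Use Dataflow to extract the title, content, and topics of each (HTML) document and ingest them into BigQuery.
3. Apply the Swivel model to generate embeddings of the document content.
4. Train a logistic regression model that classifies whether an article is about a corporate acquisition (the `acq` category).
5. Evaluate the model.
6. Apply the model to a dataset to generate predictions.

Learn more about [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) and the [BigQuery ML components](https://cloud.google.com/vertex-ai/docs/pipelines/bigqueryml-component).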

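### Objective

In this notebook, you learn how to use Vertex AI Pipelines to build a simple BigQuery ML pipeline that computes text embeddings of article content and classifies the articles into the "corporate acquisitions" category.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Pipelines
- BigQuery ML

The steps performed include:

- Create a component for the Dataflow job that ingests data into BigQuery.
- Create a component that runs the preprocessing steps on the data in BigQuery.
- Create a component that trains a logistic regression model with BigQuery ML.
- Build and configure a Kubeflow DSL pipeline with all of the created components.
- Compile and run the pipeline on Vertex AI Pipelines.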

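### Dataset

The dataset used in this notebook is the [Reuters-21578 Text Categorization Collection](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection). It is a collection of publicly available news articles that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. and Carnegie Group, Inc. in 1987.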

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery
* Dataflow

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), and [Dataflow pricing](https://cloud.google.com/dataflow/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installation

Install the packages required to run this notebook.

In [None]:
# Install dependencies
! pip3 install --upgrade --quiet    google-cloud-aiplatform \
                                    google_cloud_pipeline_components \
                                    google-api-core \
                                    google-auth

! pip3 install --upgrade --quiet    tensorflow==2.8.0 \
                                    tensorflow-hub==0.12.0 \
                                    kfp

### Colab only: uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

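## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. Enable the Vertex AI API.

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)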

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

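#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you generate a short random identifier (called a UUID here) for each session and append it to the names of the resources you create in this tutorial.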

In [None]:
import random
import string


# Generate a uuid of length 8
def generate_uuid():
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=8))


UUID = generate_uuid()

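### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to authenticate manually. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing, as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**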

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples for how to grant Cloud Storage permissions to your service account.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

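#### Service account

You use a service account to create the Vertex AI Pipelines job. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.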

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
import os
import sys

IS_COLAB = "google.colab" in sys.modules

if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

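#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.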

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Set up the project template

In [None]:
DATA_PATH = "data"
KFP_COMPONENTS_PATH = "components"
SRC = "src"
BUILD = "build"

In [None]:
!mkdir -m 777 -p {DATA_PATH} {KFP_COMPONENTS_PATH} {SRC} {BUILD}

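### Prepare the input data

In the following cells, you:

1. Download the dataset from the UCI archive.
2. Extract the dataset.
3. Copy the dataset to a Cloud Storage location.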

In [None]:
!wget --no-parent https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz --directory-prefix={DATA_PATH}/raw
!mkdir -m 777 -p {DATA_PATH}/raw/temp {DATA_PATH}/raw
!tar -zxvf {DATA_PATH}/raw/reuters21578.tar.gz -C {DATA_PATH}/raw/temp/
!mv {DATA_PATH}/raw/temp/*.sgm {DATA_PATH}/raw && rm -rf {DATA_PATH}/raw/temp && rm -f {DATA_PATH}/raw/reuters21578.tar.gz

In [None]:
!gsutil -m cp -R {DATA_PATH}/raw $BUCKET_URI/{DATA_PATH}/raw

### Import libraries

In [None]:
from pathlib import Path as path
from urllib.parse import urlparse

import tensorflow_hub as hub
from google.cloud import aiplatform as vertex_ai
from kfp import compiler, dsl
from kfp.dsl import component

os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"

### Define constants

For the model used in preprocessing, you use the 20-dimensional [Swivel](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1) embedding trained on the English Google News 130GB corpus.

In [None]:
JOB_NAME = f"reuters-ingest-{UUID}"
SETUP_FILE_URI = urlparse(BUCKET_URI)._replace(path="setup.py").geturl()
RUNNER = "DataflowRunner"
STAGING_LOCATION_URI = urlparse(BUCKET_URI)._replace(path="staging").geturl()
TMP_LOCATION_URI = urlparse(BUCKET_URI)._replace(path="temp").geturl()
INPUTS_URI = urlparse(BUCKET_URI)._replace(path=f"{DATA_PATH}/raw/*.sgm").geturl()
BQ_DATASET = "mlops_bqml_text_analysis"
BQ_TABLE = "reuters_ingested"
MODEL_NAME = "swivel_text_embedding_model"
EMBEDDINGS_TABLE = f"reuters_text_embeddings_{UUID}"
MODEL_PATH = (
    f'{hub.resolve("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1")}/*'
)
PREPROCESSED_TABLE = f"reuters_text_preprocessed_{UUID}"
CLASSIFICATION_MODEL_NAME = "logistic_reg"
PREDICT_TABLE = f"reuters_text_predict_{UUID}"
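
Several of the constants above derive Cloud Storage URIs from `BUCKET_URI` by replacing the path component of the parsed URI. The following is a minimal sketch of that pattern using only the standard library; the bucket name `example-bucket` and the helper `bucket_path` are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical bucket name, used only for illustration.
bucket_uri = "gs://example-bucket"


def bucket_path(bucket_uri: str, path: str) -> str:
    """Return bucket_uri with its path component replaced by `path`."""
    return urlparse(bucket_uri)._replace(path=path).geturl()


staging = bucket_path(bucket_uri, "staging")
inputs = bucket_path(bucket_uri, "data/raw/*.sgm")
print(staging)
print(inputs)
```

Note that `_replace(path=...)` overwrites the whole path, so with this approach a prefix already present in the bucket URI (for example `gs://example-bucket/prefix`) would be discarded rather than extended.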

### Initialize the Vertex AI SDK client

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

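## Pipeline formalization

In this step, you create the individual components and build the final pipeline.

### Data ingestion component

#### Create the Dataflow Python module

The following module contains a Dataflow pipeline that:

1. Reads the files from Cloud Storage.
2. Extracts the articles from the files and generates the title, topics, and content of each.
3. Loads the structured data into BigQuery.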

In [None]:
!touch {SRC}/__init__.py 

In [None]:
%%writefile src/ingest_pipeline.py
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# General imports
from __future__ import absolute_import
import argparse
import logging
import os
import string

# Preprocessing imports
import tensorflow as tf
import bs4
import nltk

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


# Helpers ---------------------------------------------------------------------

def get_args():
    """
    Get command line arguments.
    Returns:
      args: The parsed arguments.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--inputs', dest='inputs', default='data/raw/reuters/*.sgm',
                        help='A directory location of input data')
    parser.add_argument('--bq-dataset', dest='bq_dataset', required=False,
                        default='reuters_dataset', help='Dataset name used in BigQuery.')
    parser.add_argument('--bq-table', dest='bq_table', required=False,
                        default='reuters_ingested_table', help='Table name used in BigQuery.')
    args, pipeline_args = parser.parse_known_args()
    return args, pipeline_args

def get_paths(data_pattern):
    """
    A function to get all the paths of the files in the data directory.
    Args:
        data_pattern: A directory location of input data.
    Returns:
        A list of file paths.
    """
    data_paths = tf.io.gfile.glob(data_pattern)
    return data_paths


def get_title(article):
    """
    A function to get the title of an article.
    Args:
        article: A BeautifulSoup object of an article.
    Returns:
        A string of the title of the article.
    """
    title = article.find('text').title
    if title is not None:
        title = ''.join(filter(lambda x: x in set(string.printable), title.text))
        title = title.encode('ascii', 'ignore')
    return title


def get_content(article):
    """
    A function to get the content of an article.
    Args:
        article: A BeautifulSoup object of an article.
    Returns:
        A string of the content of the article.
    """
    content = article.find('text').body
    if content is not None:
        content = ''.join(filter(lambda x: x in set(string.printable), content.text))
        content = ' '.join(content.split())
        try:
            content = '\n'.join(nltk.sent_tokenize(content))
        except LookupError:
            nltk.download('punkt')
            content = '\n'.join(nltk.sent_tokenize(content))
        content = content.encode('ascii', 'ignore')
    return content


def get_topics(article):
    """
    A function to get the topics of an article.
    Args:
        article: A BeautifulSoup object of an article.
    Returns:
        A list of strings of the topics of the article.
    """
    topics = []
    for topic in article.topics.children:
        topic = ''.join(filter(lambda x: x in set(string.printable), topic.text))
        topics.append(topic.encode('ascii', 'ignore'))
    return topics


def get_articles(data_paths):
    """
    Args:
        data_paths: A list of file paths.
    Returns:
        A list of articles.
    """
    data = tf.io.gfile.GFile(data_paths, 'rb').read()
    soup = bs4.BeautifulSoup(data, "html.parser")
    articles = []
    for raw_article in soup.find_all('reuters'):
        article = {
            'title': get_title(raw_article),
            'content': get_content(raw_article),
            'topics': get_topics(raw_article)
        }
        if None not in article.values():
            if [] not in article.values():
                articles.append(article)
    return articles


def get_bigquery_schema():
    """
    A function to get the BigQuery schema.
    Returns:
        A list of BigQuery schema.
    """

    table_schema = bigquery.TableSchema()
    columns = (('topics', 'string', 'repeated'),
               ('title', 'string', 'nullable'),
               ('content', 'string', 'nullable'))

    for column in columns:
        column_schema = bigquery.TableFieldSchema()
        column_schema.name = column[0]
        column_schema.type = column[1]
        column_schema.mode = column[2]
        table_schema.fields.append(column_schema)

    return table_schema


# Pipeline runner
def run(args, pipeline_args=None):
    """
    A function to run the pipeline.
    Args:
        args: The parsed arguments.
        pipeline_args: The remaining command-line arguments, forwarded to PipelineOptions.
    Returns:
        None
    """

    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = True

    pipeline = beam.Pipeline(options=options)
    articles = (
            pipeline
            | 'Get Paths' >> beam.Create(get_paths(args.inputs))
            | 'Get Articles' >> beam.Map(get_articles)
            | 'Get Article' >> beam.FlatMap(lambda x: x)
    )
    if options.get_all_options()['runner'] == 'DirectRunner':
        articles | 'Dry run' >> beam.io.WriteToText('data/processed/reuters', file_name_suffix=".jsonl")
    else:
        (articles
         | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                    project=options.get_all_options()['project'],
                    dataset=args.bq_dataset,
                    table=args.bq_table,
                    schema=get_bigquery_schema(),
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
         )
    job = pipeline.run()

    if options.get_all_options()['runner'] == 'DirectRunner':
        job.wait_until_finish()


if __name__ == '__main__':
    args, pipeline_args = get_args()
    logging.getLogger().setLevel(logging.INFO)
    run(args, pipeline_args)
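
The extraction logic of the module can be tried locally without Beam or Cloud Storage. The sketch below is not the module's implementation: it swaps BeautifulSoup for standard-library regular expressions, and the two-article `SAMPLE_SGML` text and the helper names are invented for illustration. It keeps the same filtering rule as `get_articles`, which drops articles with a missing title or body or an empty topic list.

```python
import re

# Invented two-article sample in the style of the Reuters .sgm files.
SAMPLE_SGML = """
<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>acq</D></TOPICS>
<TEXT>
<TITLE>EXAMPLE CORP TO BUY SAMPLE INC</TITLE>
<BODY>Example Corp said it agreed to
acquire Sample Inc. Terms were not disclosed.</BODY>
</TEXT>
</REUTERS>
<REUTERS TOPICS="NO" NEWID="2">
<TOPICS></TOPICS>
<TEXT>
<TITLE>ARTICLE WITHOUT TOPICS</TITLE>
<BODY>This article has no topic labels.</BODY>
</TEXT>
</REUTERS>
"""

FLAGS = re.DOTALL | re.IGNORECASE


def first_group(pattern, text):
    """Return the first captured group of `pattern` in `text`, or None."""
    match = re.search(pattern, text, FLAGS)
    return match.group(1) if match else None


def extract_articles(sgml):
    """Extract title, content, and topics; drop incomplete articles."""
    articles = []
    for block in re.findall(r"<REUTERS\b.*?</REUTERS>", sgml, FLAGS):
        title = first_group(r"<TITLE>(.*?)</TITLE>", block)
        body = first_group(r"<BODY>(.*?)</BODY>", block)
        topics_block = first_group(r"<TOPICS>(.*?)</TOPICS>", block) or ""
        topics = re.findall(r"<D>(.*?)</D>", topics_block, FLAGS)
        # Same filter as get_articles: skip missing fields and empty topics.
        if title is None or body is None or not topics:
            continue
        articles.append(
            {
                "title": title.strip(),
                # Collapse runs of whitespace, as get_content does.
                "content": " ".join(body.split()),
                "topics": topics,
            }
        )
    return articles


articles = extract_articles(SAMPLE_SGML)
print(articles)
```

Only the first sample article survives the filter, because the second one has no topics. Regular expressions are enough for this tidy sample; the module's HTML parser is the more robust choice for the real files.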

#### Create the requirements file

Next, create the `requirements.txt` file that lists the Python modules required by the Apache Beam pipeline.

In [None]:
%%writefile requirements.txt
apache-beam[gcp]==2.36.0
bs4==0.0.1
nltk==3.7
tensorflow==2.8.0

Create the setup file, which declares the Python modules that the Dataflow workers need to run the pipeline.

In [None]:
%%writefile setup.py
# !/usr/bin/python

# Copyright 2022 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#      http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import setuptools

REQUIRED_PACKAGES = [
    'bs4==0.0.1',
    'nltk==3.7',
    'tensorflow==2.8.0']

setuptools.setup(
    name='ingest',
    version='0.0.1',
    author='author',
    author_email='author@google.com',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages())

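#### Copy the setup file, module, and requirements file to Cloud Storage

Finally, copy the Python module, requirements file, and setup file to your Cloud Storage bucket.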

In [None]:
# !gsutil cp -R {SRC}/preprocess_pipeline.py {BUCKET_URI}/preprocess_pipeline.py
!gsutil cp -R {SRC} {BUCKET_URI}/{SRC}
!gsutil cp requirements.txt {BUCKET_URI}/requirements.txt
!gsutil cp setup.py {BUCKET_URI}/setup.py

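### BigQuery ML components

In the next step of building the pipeline, you define a set of queries to:

1. Create the BigQuery dataset.
2. Preprocess your text data and generate embeddings with the Swivel model.
3. Train the BigQuery ML logistic regression model.
4. Evaluate the model.
5. Run batch prediction.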

In [None]:
!mkdir -m 777 -p {KFP_COMPONENTS_PATH}/bq_dataset_component
!mkdir -m 777 -p {KFP_COMPONENTS_PATH}/bq_preprocess_component
!mkdir -m 777 -p {KFP_COMPONENTS_PATH}/bq_model_component
!mkdir -m 777 -p {KFP_COMPONENTS_PATH}/bq_prediction_component

#### Create the BigQuery dataset query

With this query, you create the BigQuery dataset (schema) that holds the data used to train the model.

In [None]:
create_bq_dataset_query = f"""
CREATE SCHEMA IF NOT EXISTS {BQ_DATASET}
"""

# The with block closes the file, so no explicit close() is needed.
with open(
    f"{KFP_COMPONENTS_PATH}/bq_dataset_component/create_bq_dataset.sql", "w"
) as q:
    q.write(create_bq_dataset_query)

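#### Create the BigQuery preprocessing query

The following query uses the TF Hub Swivel model to generate embeddings of your text data and splits the dataset into training and serving subsets.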

In [None]:
create_bq_preprocess_query = f"""
-- create the embedding model
CREATE OR REPLACE MODEL
  `{PROJECT_ID}.{BQ_DATASET}.{MODEL_NAME}` OPTIONS(model_type='tensorflow',
    model_path='{MODEL_PATH}');

-- create the preprocessed table
CREATE OR REPLACE TABLE `{PROJECT_ID}.{BQ_DATASET}.{PREPROCESSED_TABLE}`
AS (
  WITH
    -- Apply the model for embedding generation
    get_embeddings AS (
      SELECT
        title,
        sentences,
        output_0 as content_embeddings,
        topics
      FROM ML.PREDICT(MODEL `{PROJECT_ID}.{BQ_DATASET}.{MODEL_NAME}`,(
        SELECT topics, title, content AS sentences
        FROM `{PROJECT_ID}.{BQ_DATASET}.{BQ_TABLE}`
      ))),
    -- Get label
    get_label AS (
        SELECT
            *,
            STRUCT( CASE WHEN 'acq' in UNNEST(topics) THEN 1 ELSE 0 END as acq ) AS label,
        FROM get_embeddings
    ),
    -- Train-serve splitting
    get_split AS (
        SELECT
            *,
            CASE WHEN ABS(MOD(FARM_FINGERPRINT(title), 10)) < 8 THEN 'TRAIN' ELSE 'PREDICT' END AS split
        FROM get_label
    )
    -- create training table
    SELECT
        title,
        sentences,
        STRUCT( content_embeddings[OFFSET(0)] AS content_embed_0,
                content_embeddings[OFFSET(1)] AS content_embed_1,
                content_embeddings[OFFSET(2)] AS content_embed_2,
                content_embeddings[OFFSET(3)] AS content_embed_3,
                content_embeddings[OFFSET(4)] AS content_embed_4,
                content_embeddings[OFFSET(5)] AS content_embed_5,
                content_embeddings[OFFSET(6)] AS content_embed_6,
                content_embeddings[OFFSET(7)] AS content_embed_7,
                content_embeddings[OFFSET(8)] AS content_embed_8,
                content_embeddings[OFFSET(9)] AS content_embed_9,
                content_embeddings[OFFSET(10)] AS content_embed_10,
                content_embeddings[OFFSET(11)] AS content_embed_11,
                content_embeddings[OFFSET(12)] AS content_embed_12,
                content_embeddings[OFFSET(13)] AS content_embed_13,
                content_embeddings[OFFSET(14)] AS content_embed_14,
                content_embeddings[OFFSET(15)] AS content_embed_15,
                content_embeddings[OFFSET(16)] AS content_embed_16,
                content_embeddings[OFFSET(17)] AS content_embed_17,
                content_embeddings[OFFSET(18)] AS content_embed_18,
                content_embeddings[OFFSET(19)] AS content_embed_19) AS feature,
        label.acq as label,
        split
    FROM
      get_split)
"""

# The with block closes the file, so no explicit close() is needed.
with open(
    f"{KFP_COMPONENTS_PATH}/bq_preprocess_component/bq_preprocess_query.sql", "w"
) as q:
    q.write(create_bq_preprocess_query)
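
The `get_label` and `get_split` steps of the preprocessing query can be mirrored in plain Python to see what they do: an article is labeled 1 when `acq` is among its topics, and a hash of the title sends roughly 8 out of 10 articles to `TRAIN` and the rest to `PREDICT`. This is a sketch under stated assumptions: SHA-256 stands in for BigQuery's `FARM_FINGERPRINT`, so the bucket of any given title differs from the query's, and the function names and headlines are hypothetical.

```python
import hashlib


def acq_label(topics):
    """Return 1 if the article carries the 'acq' topic, else 0."""
    return 1 if "acq" in topics else 0


def assign_split(title, train_buckets=8, total_buckets=10):
    """Deterministically map a title to 'TRAIN' or 'PREDICT'."""
    # Assumption: SHA-256 stands in for FARM_FINGERPRINT, which is not in the
    # Python standard library. Bucket assignments differ from BigQuery's.
    digest = hashlib.sha256(title.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total_buckets
    return "TRAIN" if bucket < train_buckets else "PREDICT"


# Hypothetical headlines, used only to estimate the split ratio.
titles = [f"HYPOTHETICAL HEADLINE {i}" for i in range(1000)]
train_share = sum(assign_split(t) == "TRAIN" for t in titles) / len(titles)

print(acq_label(["acq", "earn"]), acq_label(["grain"]))
print(f"train share: {train_share:.2f}")
```

Because the split depends only on the title, rerunning the query assigns each article to the same subset. The query wraps `MOD` in `ABS` because `FARM_FINGERPRINT` returns a signed integer, so the remainder can be negative.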

#### Create the BigQuery model query

The following simple query builds a BigQuery ML logistic regression classifier for article topic classification.

In [None]:
create_bq_model_query = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.{BQ_DATASET}.{CLASSIFICATION_MODEL_NAME}`
  OPTIONS (
      model_type='logistic_reg',
      input_label_cols=['label']) AS
  SELECT
      label,
      feature.*
  FROM
     `{PROJECT_ID}.{BQ_DATASET}.{PREPROCESSED_TABLE}`
  WHERE split = 'TRAIN';
"""

# The with block closes the file, so no explicit close() is needed.
with open(f"{KFP_COMPONENTS_PATH}/bq_model_component/create_bq_model.sql", "w") as q:
    q.write(create_bq_model_query)
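
Conceptually, a logistic regression classifier scores an article by passing a weighted sum of its 20 embedding features through the logistic (sigmoid) function. The sketch below illustrates only that scoring formula. The weights, bias, and feature values are hypothetical, and it ignores any feature preprocessing or regularization that BigQuery ML applies during training.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_acq_probability(features, weights, bias):
    """Return the modeled probability that an article belongs to 'acq'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical values for a 20-dimensional embedding and a fitted model.
features = [0.1] * 20
weights = [0.5] * 20
bias = -1.0

probability = predict_acq_probability(features, weights, bias)
print(f"P(acq) = {probability:.3f}")
```

Turning the probability into a 0/1 label requires a decision threshold; 0.5 is a common choice and is assumed here rather than taken from the notebook.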

#### Create the BigQuery prediction query

With the following query, you run a prediction job on the table produced by the preprocessing query.

In [None]:
create_bq_prediction_query = f"""SELECT title, sentences, feature.* FROM `{PROJECT_ID}.{BQ_DATASET}.{PREPROCESSED_TABLE}` WHERE split = 'PREDICT' """

# The with block closes the file, so no explicit close() is needed.
with open(
    f"{KFP_COMPONENTS_PATH}/bq_prediction_component/create_bq_prediction_query.sql", "w"
) as q:
    q.write(create_bq_prediction_query)

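### Build the pipeline

In this step, you assemble the pipeline from the individual components.

Define `JOB_NAME` and `JOB_CONFIG` below. `JOB_CONFIG` contains the following parameters for the destination table:
- `PROJECT_ID`: The ID of the project.
- `BQ_DATASET`: The ID of the BigQuery dataset.
- `PREDICT_TABLE`: The ID of the BigQuery table that stores the prediction results.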

In [None]:
ID = random.randint(1, 10000)
JOB_NAME = f"reuters-preprocess-{UUID}-{ID}"
JOB_CONFIG = {
    "destinationTable": {
        "projectId": PROJECT_ID,
        "datasetId": BQ_DATASET,
        "tableId": PREDICT_TABLE,
    }
}

#### Create a custom component for the arguments

Next, you create a component that passes the arguments to the `DataflowPythonJobOp` component.

In [None]:
@component(base_image="python:3.8-slim")
def build_dataflow_args(
    # destination_table: Input[Artifact],
    bq_dataset: str,
    bq_table: str,
    job_name: str,
    setup_file_uri: str,
    runner: str,
    inputs_uri: str,
) -> list:
    return [
        "--job_name",
        job_name,
        "--setup_file",
        setup_file_uri,
        "--runner",
        runner,
        "--inputs",
        inputs_uri,
        "--bq-dataset",
        bq_dataset,
        "--bq-table",
        bq_table,
    ]
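
The flag list returned by this component ends up as the command line of `ingest_pipeline.py`, where `parse_known_args` keeps the module's own flags and leaves the rest for Beam's `PipelineOptions`. The following sketch reproduces that hand-off with the standard library only. The plain function stands in for the KFP component, the parser repeats the flags defined in `get_args`, and all argument values are hypothetical.

```python
import argparse


def build_dataflow_args(bq_dataset, bq_table, job_name, setup_file_uri, runner, inputs_uri):
    """Plain-Python stand-in for the KFP component of the same name."""
    return [
        "--job_name", job_name,
        "--setup_file", setup_file_uri,
        "--runner", runner,
        "--inputs", inputs_uri,
        "--bq-dataset", bq_dataset,
        "--bq-table", bq_table,
    ]


def get_args(argv):
    """Parse the module's own flags; return the rest untouched."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--inputs", dest="inputs", default="data/raw/reuters/*.sgm")
    parser.add_argument("--bq-dataset", dest="bq_dataset", default="reuters_dataset")
    parser.add_argument("--bq-table", dest="bq_table", default="reuters_ingested_table")
    return parser.parse_known_args(argv)


# Hypothetical values, used only for illustration.
flags = build_dataflow_args(
    bq_dataset="example_dataset",
    bq_table="reuters_ingested",
    job_name="reuters-ingest-demo",
    setup_file_uri="gs://example-bucket/setup.py",
    runner="DataflowRunner",
    inputs_uri="gs://example-bucket/data/raw/*.sgm",
)
args, pipeline_args = get_args(flags)
print(args.inputs, args.bq_dataset, args.bq_table)
print(pipeline_args)
```

The unrecognized flags (`--job_name`, `--setup_file`, `--runner`) and their values are returned in `pipeline_args`, which is what the module passes to `PipelineOptions`.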

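#### Create the pipeline

Define the workflow of the pipeline and build it. The parameters passed to the pipeline are:

- `create_bq_dataset_query`: The SQL query that creates the dataset in BigQuery.
- `job_name`: The name of the Cloud Dataflow job, configured in `PipelineOptions`.
- `inputs_uri`: The directory location of the input data.
- `bq_dataset`: The name of the dataset used in BigQuery.
- `bq_table`: The name of the BigQuery table used for ingestion.
- `requirements_file_path`: The Cloud Storage path to the pip requirements file.
- `python_file_path`: The Cloud Storage path to the Python file to run.
- `setup_file_uri`: The path to the Python setup file that declares the package dependencies.
- `temp_location`: The Cloud Storage path that Dataflow uses for temporary job files created during pipeline execution.
- `runner`: The pipeline runner that executes the workflow.
- `create_bq_preprocess_query`: The SQL query that preprocesses the data in BigQuery.
- `create_bq_model_query`: The SQL query that creates the BigQuery ML model.
- `create_bq_prediction_query`: The SQL query used for prediction.
- `job_config`: The job configuration, passed as a JSON-style dictionary. For more information, see this [page](https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationQuery).
- `project`: The project ID.
- `region`: The region in which to run the Dataflow job.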

In [None]:
@dsl.pipeline(
    name="mlops-bqml-text-generate-embeddings",
    description="A batch pipeline to generate embeddings",
)
def pipeline(
    create_bq_dataset_query: str,
    job_name: str,
    inputs_uri: str,
    bq_dataset: str,
    bq_table: str,
    requirements_file_path: str,
    python_file_path: str,
    setup_file_uri: str,
    temp_location: str,
    runner: str,
    create_bq_preprocess_query: str,
    create_bq_model_query: str,
    create_bq_prediction_query: str,
    job_config: dict,
    project: str = PROJECT_ID,
    region: str = REGION,
):

    from google_cloud_pipeline_components.v1.bigquery import (
        BigqueryCreateModelJobOp, BigqueryEvaluateModelJobOp,
        BigqueryPredictModelJobOp, BigqueryQueryJobOp)
    from google_cloud_pipeline_components.v1.dataflow import \
        DataflowPythonJobOp
    from google_cloud_pipeline_components.v1.wait_gcp_resources import \
        WaitGcpResourcesOp

    # create the dataset
    bq_dataset_op = BigqueryQueryJobOp(
        query=create_bq_dataset_query,
        project=project,
        location="US",
    )
    # instantiate dataflow args
    build_dataflow_args_op = build_dataflow_args(
        job_name=job_name,
        inputs_uri=inputs_uri,
        # destination_table = bq_dataset_op.outputs['destination_table'],
        bq_dataset=bq_dataset,
        bq_table=bq_table,
        setup_file_uri=setup_file_uri,
        runner=runner,
    ).after(bq_dataset_op)

    # run dataflow job
    dataflow_python_op = DataflowPythonJobOp(
        requirements_file_path=requirements_file_path,
        python_module_path=python_file_path,
        args=build_dataflow_args_op.output,
        project=project,
        location=region,
        temp_location=temp_location,
    ).after(build_dataflow_args_op)

    dataflow_wait_op = WaitGcpResourcesOp(
        gcp_resources=dataflow_python_op.outputs["gcp_resources"]
    ).after(dataflow_python_op)

    # run preprocessing job
    bq_preprocess_op = BigqueryQueryJobOp(
        query=create_bq_preprocess_query,
        project=project,
        location="US",
    ).after(dataflow_wait_op)

    # create the logistic regression
    bq_model_op = BigqueryCreateModelJobOp(
        query=create_bq_model_query,
        project=project,
        location="US",
    ).after(bq_preprocess_op)

    # evaluate the logistic regression
    bq_evaluate_op = BigqueryEvaluateModelJobOp(
        project=project, location="US", model=bq_model_op.outputs["model"]
    ).after(bq_model_op)

    # simulate prediction
    BigqueryPredictModelJobOp(
        model=bq_model_op.outputs["model"],
        query_statement=create_bq_prediction_query,
        job_configuration_query=job_config,
        project=project,
        location="US",
    ).after(bq_evaluate_op)
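
Each `.after()` call adds an ordering constraint, so the pipeline above runs as a single linear chain. The sketch below models only those constraints with the standard library's `graphlib` (Python 3.9 or later) to show the resulting execution order; it does not use KFP. The step names follow the variables in the pipeline function, except `bq_predict_op`, which is a hypothetical name for the unnamed prediction step.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it must wait for.
dependencies = {
    "bq_dataset_op": set(),
    "build_dataflow_args_op": {"bq_dataset_op"},
    "dataflow_python_op": {"build_dataflow_args_op"},
    "dataflow_wait_op": {"dataflow_python_op"},
    "bq_preprocess_op": {"dataflow_wait_op"},
    "bq_model_op": {"bq_preprocess_op"},
    "bq_evaluate_op": {"bq_model_op"},
    "bq_predict_op": {"bq_evaluate_op"},  # hypothetical name
}

order = list(TopologicalSorter(dependencies).static_order())
for position, step in enumerate(order, start=1):
    print(position, step)
```

Steps that consume another step's output, such as the model passed to evaluation and prediction, are already ordered by that data dependency. The explicit `.after()` matters most where no output is passed, for example between the wait step and the preprocessing query.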

#### Compile and run the pipeline

Define the remaining constants and compile the pipeline into a YAML file.

In [None]:
PIPELINE_ROOT = urlparse(BUCKET_URI)._replace(path="pipeline_root").geturl()
PIPELINE_PACKAGE = str(path(BUILD) / "mlops_bqml_text_analysis_pipeline.yaml")
REQUIREMENTS_URI = urlparse(BUCKET_URI)._replace(path="requirements.txt").geturl()
PYTHON_FILE_URI = urlparse(BUCKET_URI)._replace(path="src/ingest_pipeline.py").geturl()
MODEL_URI = urlparse(BUCKET_URI)._replace(path="swivel_text_embedding_model").geturl()

compiler.Compiler().compile(pipeline_func=pipeline, package_path=PIPELINE_PACKAGE)

Using the compiled YAML file, create the Vertex AI pipeline job with the parameter values defined earlier, then run it with the `SERVICE_ACCOUNT` you configured.

In [None]:
pipeline = vertex_ai.PipelineJob(
    display_name=f"data_preprocess_{UUID}",
    template_path=PIPELINE_PACKAGE,
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "create_bq_dataset_query": create_bq_dataset_query,
        "bq_dataset": BQ_DATASET,
        "job_name": JOB_NAME,
        "inputs_uri": INPUTS_URI,
        "bq_table": BQ_TABLE,
        "requirements_file_path": REQUIREMENTS_URI,
        "python_file_path": PYTHON_FILE_URI,
        "setup_file_uri": SETUP_FILE_URI,
        "temp_location": PIPELINE_ROOT,
        "runner": RUNNER,
        "create_bq_preprocess_query": create_bq_preprocess_query,
        "create_bq_model_query": create_bq_model_query,
        "create_bq_prediction_query": create_bq_prediction_query,
        "job_config": JOB_CONFIG,
    },
    enable_caching=False,
)

pipeline.run(service_account=SERVICE_ACCOUNT)

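Once the pipeline job completes successfully, you can find the trained model in the BigQuery dataset. Run the following cell to see the model listed in the output.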

In [None]:
! bq ls $PROJECT_ID:$BQ_DATASET

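## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial in the following cell. Set `delete_bucket` and `delete_dataset` to **True** to delete the Cloud Storage bucket and the BigQuery dataset used in this notebook, respectively.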

In [None]:
# delete the pipeline job
pipeline.delete()

delete_bucket = False
delete_dataset = False

# delete bucket
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

# delete dataset
if delete_dataset or os.getenv("IS_TESTING"):
    ! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET