In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Model versioning with Vertex AI Model Registry


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

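## Overview

This notebook demonstrates the model versioning capabilities of Vertex AI Model Registry with an AutoML model.

### Objective

In this tutorial, you learn how to use the Vertex AI SDK and Vertex AI Model Registry to manage your models.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI AutoML
- Vertex AI Model Registry

The steps performed include:

- Preprocess the data with Spark NLP and load it into BigQuery
- Train and register an AutoML classifier with Vertex AI AutoML
- Nominate a champion and approve it for production by assigning the `production` alias to that model version
- Deploy the default/production version of the model resource

Dataset

The [BBC](http://mlg.ucd.ie/datasets/bbc.html) dataset consists of 2,225 articles from the BBC News website, covering five topical areas (business, entertainment, politics, sport, tech) from 2004 to 2005. Each article is stored in its own .txt file.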

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Dataproc
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [Dataproc pricing](https://cloud.google.com/dataproc/pricing), and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

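### Set up your local development environment

**If you are using Colab or Vertex AI Workbench notebooks**, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide [Setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.

4. To install Jupyter, run `pip3 install jupyter` on the command line in a terminal window.

5. To launch Jupyter, run `jupyter notebook` on the command line in a terminal window.

6. Open this notebook in the Jupyter Notebook dashboard.

## Installation

Install the packages required to run this notebook.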

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade tensorflow google-cloud-bigquery google-cloud-aiplatform {USER_FLAG} -q

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so that it can find them.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

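## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=iam.googleapis.com,aiplatform.googleapis.com,artifactregistry.googleapis.com,dataproc.googleapis.com,cloudbuild.googleapis.com).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

Set your project ID

**If you don't know your project ID**, you may be able to get it using `gcloud`.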

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

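### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You cannot use a multi-regional bucket for training with Vertex AI. Not all regions support all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).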

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

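UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you generate a UUID for each session and append it to the names of the resources you create in this tutorial.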

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

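### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type each of the following roles into the filter box and select it:

    *   BigQuery Admin
    *   Dataproc Administrator
    *   Dataproc Worker
    *   Storage Admin
    *   Storage Object Admin
    *   Vertex AI Administrator

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.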

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

Get your project number

Now that the project ID is set, you can get the corresponding project number.

In [None]:
shell_output = ! gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = shell_output[0]
print("Project Number:", PROJECT_NUMBER)

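### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.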

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"
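
Before creating the bucket, it can help to check that the name is at least well formed. The sketch below covers only the basic naming rules (3 to 63 characters, lowercase letters, digits, dashes, underscores, and dots, starting and ending with a letter or digit). `looks_like_valid_bucket_name` is a hypothetical helper, not part of any Google library, and it cannot tell you whether the name is already taken.

```python
import re

# Basic Cloud Storage naming rules only; this does not cover every rule
# (for example reserved prefixes or dotted names longer than 63 characters).
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic format check (hypothetical helper)."""
    return bool(_BUCKET_RE.fullmatch(name))


print(looks_like_valid_bucket_name("my-project-aip-ab12cd34"))
print(looks_like_valid_bucket_name("[your-bucket-name]"))
```

The placeholder value fails the check because of the square brackets, which is the case the fallback cell above is guarding against.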

Only if your bucket doesn't already exist: run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents.

In [None]:
! gsutil ls -al $BUCKET_URI

Service account

If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

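Set service account access

Run the following commands to grant your service account access to the bucket that you created in the previous step. You only need to run this step once per service account.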

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

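Enable Private Google Access for Dataproc Serverless

To execute serverless Spark workloads, the VPC subnetwork must meet the [requirements](https://cloud.google.com/dataproc-serverless/docs/concepts/network) listed in the Dataproc Serverless for Spark network configuration. In this tutorial, you use the default subnetwork and enable Private Google Access on it.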

In [None]:
SUBNETWORK = "default"  # @param {type:"string"}

In [None]:
!gcloud compute networks subnets list --regions=$REGION --filter=$SUBNETWORK

In [None]:
!gcloud compute networks subnets update $SUBNETWORK \
--region=$REGION \
--enable-private-ip-google-access

In [None]:
!gcloud compute networks subnets describe $SUBNETWORK \
--region=$REGION \
--format="get(privateIpGoogleAccess)"

### Create and configure a Docker repository

You create a Docker repository in Artifact Registry for the custom Dataproc image that you build for NLP data preprocessing.

In [None]:
REPO_NAME = "vertex-ai-model-registry-demo"

In [None]:
!gcloud artifacts repositories create $REPO_NAME \
    --repository-format=docker \
    --location=$REGION \
    --description="vertex ai model registry spark docker repository"

### Set up the project template

You create a set of local directories to organize your project.

In [None]:
DATA_PATH = "data"
SRC_PATH = "src"
BUILD_PATH = "build"
CONFIG_PATH = "config"

In [None]:
!mkdir -m 777 -p $DATA_PATH $SRC_PATH $BUILD_PATH $CONFIG_PATH

### Get the input data

In the following cells, you download and extract the tutorial dataset.

In [None]:
RAW_DATA_URI = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"

In [None]:
!rm -Rf {DATA_PATH}/raw 
!wget --no-parent {RAW_DATA_URI} --directory-prefix={DATA_PATH}/raw 
!unzip -qo {DATA_PATH}/raw/bbc-fulltext.zip -d {DATA_PATH}/raw && mv {DATA_PATH}/raw/bbc/* {DATA_PATH}/raw/
!rm -Rf {DATA_PATH}/raw/bbc-fulltext.zip {DATA_PATH}/raw/bbc

Set up the BigQuery dataset

You create the BigQuery dataset for this tutorial.

In [None]:
LOCATION = REGION.split("-")[0]
BQ_DATASET = "bcc_sport"

! bq mk --location={LOCATION} --dataset {PROJECT_ID}:{BQ_DATASET}
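
The cell above passes the first token of `REGION` to `bq mk`. BigQuery names its multi-regions `US` and `EU`, so that token lines up for `us-*` regions but may not for others. A minimal sketch of a more explicit mapping follows. `bigquery_location` is a hypothetical helper, and falling back to the regional location for everything else is an assumption you should check against the BigQuery locations documentation.

```python
def bigquery_location(region: str) -> str:
    """Map a Vertex AI region to a BigQuery location (illustrative rule only)."""
    prefix = region.split("-")[0].lower()
    # BigQuery multi-regions are called "US" and "EU".
    multi_regions = {"us": "US", "europe": "EU"}
    # Assumption: for any other prefix, use the region itself (e.g. "asia-east1").
    return multi_regions.get(prefix, region)


for region in ("us-central1", "europe-west4", "asia-east1"):
    print(region, "->", bigquery_location(region))
```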

### Import libraries

In [None]:
# General
import csv
import glob
import os
import sys

import pandas as pd

pd.set_option("display.max_colwidth", 3000)

# Model Training
import tensorflow as tf
from google.cloud import aiplatform as vertex_ai
from google.cloud import bigquery

In [None]:
print("BigQuery library version:", bigquery.__version__)
print("Vertex AI library version:", vertex_ai.__version__)

Set variables

In [None]:
# General
STAGING_BUCKET = f"{BUCKET_URI}/jobs"
RAW_PATH = os.path.join(DATA_PATH, "raw")
DATAPROC_IMAGE_BUILD_PATH = os.path.join(BUILD_PATH, "dataproc_image")
PREPROCESS_DOCKERFILE_PATH = os.path.join(DATAPROC_IMAGE_BUILD_PATH, "Dockerfile")
DATAPROC_RUNTIME_IMAGE = "dataproc_serverless_custom_runtime"
IMAGE_TAG = "1.0.0"
DATAPROC_RUNTIME_CONTAINER_IMAGE = (
    f"gcr.io/{PROJECT_ID}/{DATAPROC_RUNTIME_IMAGE}:{IMAGE_TAG}"
)
INIT_PATH = os.path.join(SRC_PATH, "__init__.py")
MODULE_URI = f"{BUCKET_URI}/{SRC_PATH}"
VERTEX_AI_MODEL_ID = "text-classifier-model"

# Ingest
PREPARED_PATH = os.path.join(DATA_PATH, "prepared")
PREPARED_FILE = "prepared_data.csv"
PREPARED_FILE_PATH = os.path.join(PREPARED_PATH, PREPARED_FILE)
PREPARED_FILE_URI = f"{BUCKET_URI}/{PREPARED_FILE_PATH}"

# Preprocess
PREPROCESS_MODULE_PATH = os.path.join(SRC_PATH, "preprocess.py")
LEMMA_DICTIONARY_PATH = os.path.join(CONFIG_PATH, "lemmas.txt")
LEMMA_DICTIONARY_URI = f"{BUCKET_URI}/{CONFIG_PATH}/lemmas.txt"
PROCESS_PYTHON_FILE_URI = f"{MODULE_URI}/preprocess.py"
PROCESS_DATA_PATH = os.path.join(DATA_PATH, "processed")
BQ_OUTPUT_TABLE_URI = f"{BQ_DATASET}.news_processed_{UUID}"
PROCESS_DATA_URI = f"{BUCKET_URI}/{PROCESS_DATA_PATH}"
PROCESS_FILE_URI = f"{PROCESS_DATA_URI}/*.parquet"
PREPROCESS_BATCH_ID = f"nlp-preprocess-{UUID}"

# Training
AUTOML_BQ_TABLE_URI = f"{BQ_DATASET}.news_automl_dataset_table_{UUID}"
AUTOML_BQ_SOURCE = f"bq://{PROJECT_ID}.{AUTOML_BQ_TABLE_URI}"
AUTOML_TEXT_DATASET = f"sport_news_dataset_{UUID}"
AUTOML_BQ_EVALUATION_TABLE = (
    f"bq://{PROJECT_ID}.{BQ_DATASET}.news_automl_eval_table_{UUID}"
)

# Deployment
ENDPOINT_NAME = "text-classifier-endpoint"
DEPLOYED_MODEL_NAME = "naive-bayes-text-classifier"

### Initialize the Vertex AI SDK for Python

Initialize the Vertex AI SDK for your project and the corresponding bucket.

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

### Helpers

A set of helper functions that simplify some of the tasks.

In [None]:
def prepare_data(input_path: str, output_path: str, file_name: str):
    """
    This function prepares the data for the model registry demo.
    Args:
        input_path: The directory where the raw data is stored.
        output_path: The directory where the prepared data will be stored.
        file_name: The name of the file to be prepared.
    Returns:
        None
    """
    # Read folder names
    categories = [f.name for f in os.scandir(input_path) if f.is_dir()]

    # Create output directory if it doesn't exist
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    # Create output file (newline="" lets the csv module control line endings)
    with open(os.path.join(output_path, file_name), "w", newline="") as output_file:
        csv_writer = csv.writer(output_file)
        csv_writer.writerow(["category", "text"])

        # For each category, read all files and write to output file
        for category in categories:
            # Read all files in category
            for filename in glob.glob(os.path.join(input_path, category, "*.txt")):
                # Read file; the `with` block closes it automatically
                with open(filename, "r") as input_file:
                    output_text = "".join([line.rstrip() for line in input_file])
                    # Write to output file
                    csv_writer.writerow([category, output_text])


def run_query(query):

    """
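    This function runs a query in BigQuery and waits for it to finish.
    Args:
        query: The query to be run.
    Returns:
        table: The target table of the DDL statement, if any.
        result: The iterator over the query result rows.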
    """

    # Construct a BigQuery client object.
    client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

    # Run the query_job
    query_job = client.query(query)

    # Wait for the query to finish
    result = query_job.result()

    # Return table
    table = query_job.ddl_target_table

    return table, result


def read_metrics_file(metrics_file_uri):
    """
    This function reads metrics file on bucket
    Args:
      metrics_file_uri: The uri of the metrics file
    Returns:
      metrics_str: metrics string
    """

    # The `with` block closes the file, so no explicit close() is needed
    with tf.io.gfile.GFile(metrics_file_uri, "r") as metrics_file:
        metrics = metrics_file.read().replace("'", '"')
    return metrics

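Data engineering with Dataproc Serverless

Before you build an NLP machine learning model, there are some common preprocessing steps:

1. Preliminary steps, such as sentence segmentation and word tokenization
2. Frequent steps, such as stop word removal, stemming and lemmatization, removal of digits and punctuation, lowercasing, and so on

Other steps include normalization, language detection, and syntactic analysis that goes beyond part-of-speech tagging.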
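
To make these steps concrete, here is a minimal pure-Python sketch of sentence segmentation, tokenization, lowercasing, stop word removal, and stemming. It is only an illustration: the stop word list and the suffix-stripping rule are toy assumptions, and this is not how Spark NLP implements the annotators used later in the notebook.

```python
import re

# Tiny illustrative stop word list (an assumption, not Spark NLP's list).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Keep alphabetic runs only, which also drops digits and punctuation."""
    return re.findall(r"[A-Za-z]+", sentence)


def naive_stem(token):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            token = token.lower()
            if token not in STOPWORDS:
                tokens.append(naive_stem(token))
    return tokens


print(preprocess("The markets are rising. Investors cheered!"))
```

Note how crude the toy stemmer is: it turns "rising" into "ris". That is one reason the pipeline in this notebook follows stemming with dictionary-based lemmatization.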

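In the following sections, you ingest the dataset, then build and run a simple NLP preprocessing pipeline using Spark NLP on Dataproc Serverless. To do so, you need to:

1. Upload the data to a Cloud Storage bucket
2. Create a custom Dataproc Serverless image
3. Create the `preprocess` module and upload it, with its dependencies, to the Cloud Storage bucket

Then you run the Dataproc Serverless job and load the resulting data into BigQuery.

Ingest the data

Next, you do the following:

1. Prepare the data by extracting the news articles from the directories and creating the corresponding CSV file.
2. Upload the data to the Cloud Storage bucket.

Prepare the data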

In [None]:
prepare_data(RAW_PATH, PREPARED_PATH, PREPARED_FILE)

Take a quick look at the CSV data

In [None]:
! head $PREPARED_FILE_PATH
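
Beyond eyeballing the first rows, you may want to confirm that every category made it into the file. The sketch below counts rows per category with the standard library. `count_categories` is a hypothetical helper, and the demo runs on a tiny throwaway file that only mimics the `category,text` header written by `prepare_data`.

```python
import csv
import os
import tempfile
from collections import Counter


def count_categories(csv_path):
    """Count rows per value of the `category` column (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return Counter(row["category"] for row in reader)


# Demo on a throwaway file with the same header as the prepared CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["category", "text"])
        writer.writerow(["sport", "A late goal, and a win."])
        writer.writerow(["tech", "New phone launched."])
        writer.writerow(["sport", "Title race tightens."])
    demo_counts = count_categories(demo_path)

print(demo_counts)
```

In the notebook you would call `count_categories(PREPARED_FILE_PATH)` instead of using the demo file.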

#### Upload the data to the bucket

In [None]:
! gsutil cp $PREPARED_FILE_PATH $PREPARED_FILE_URI

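### Basic data and feature engineering

In this scenario, you use a Spark pipeline built with Spark NLP to cover the following steps:

1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stop word removal
5. Stemming
6. Lemmatization

Finally, you create a bag of words (BOW) using the `CountVectorizer` object.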
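
As a rough illustration of what a bag of words is, the following pure-Python sketch builds a vocabulary capped at a fixed size and turns each tokenized document into a vector of term counts. It is not Spark's `CountVectorizer`; the helper names and the alphabetical tie-breaking rule are assumptions chosen to keep the output deterministic. The preprocessing module in this notebook caps the vocabulary at 30 terms, while the demo uses 3.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size):
    """Keep the `vocab_size` most frequent terms across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:vocab_size]]


def to_bow(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


docs = [["goal", "match", "goal"], ["market", "share", "market", "goal"]]
vocab = fit_vocabulary(docs, vocab_size=3)
vectors = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(vectors)
```

Terms that fall outside the vocabulary, such as "share" here, simply do not contribute to the vector.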

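Build a custom Dataproc Serverless image

`DataprocPySparkBatchOp` lets you pass a custom image when the provided Dataproc Serverless runtime versions do not meet your requirements. In this case, you need an image that includes the Spark NLP library.

Download the dependencies required by the Spark job

You download the Spark dependencies needed to run the NLP preprocessing pipeline.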

In [None]:
! rm -rf $DATAPROC_IMAGE_BUILD_PATH
! mkdir $DATAPROC_IMAGE_BUILD_PATH

In [None]:
!gsutil cp gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar $DATAPROC_IMAGE_BUILD_PATH
!wget -P $DATAPROC_IMAGE_BUILD_PATH https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar
!wget -P $DATAPROC_IMAGE_BUILD_PATH https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh

Define the Dataproc Serverless custom runtime image

You define a Dockerfile to build the custom image.

In [None]:
dataproc_serverless_custom_runtime_image = """
# Debian 11 is recommended.
FROM debian:11-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# (Optional) Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"
COPY spark-nlp-assembly-4.0.2.jar "${SPARK_EXTRA_JARS_DIR}"

# (Optional) Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py38_4.9.2-Linux-x86_64.sh .
RUN bash Miniconda3-py38_4.9.2-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# (Optional) Install Conda packages.
#
# The following packages are installed in the default image, it is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-dataproc \
      numpy \
      pandas \
      python \
      pyspark \
      findspark

# Use conda to install spark-nlp
RUN ${CONDA_HOME}/bin/conda install -n base -c johnsnowlabs spark-nlp

# Add lemma dictionary
# ENV CONFIG_DIR='/home/app/build'
# RUN mkdir -p "${CONFIG_DIR}"
# COPY lemmas.txt "${CONFIG_DIR}"

# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
"""

# The `with` block closes the file, so no explicit close() is needed
with open(PREPROCESS_DOCKERFILE_PATH, "w") as f:
    f.write(dataproc_serverless_custom_runtime_image)

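##### Build the Dataproc Serverless custom runtime with Google Cloud Build

You use Cloud Build to build the container image and register it in Artifact Registry.

Note that `<PROJECT_NUMBER>@cloudbuild.gserviceaccount.com` needs `storage.objects.get` access to the Cloud Storage objects.

**Note**: This step takes about 5 minutes.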

In [None]:
CLOUD_BUILD_SERVICE_ACCOUNT = f"{PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"

! gsutil iam ch serviceAccount:{CLOUD_BUILD_SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{CLOUD_BUILD_SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

In [None]:
!gcloud builds submit --tag $DATAPROC_RUNTIME_CONTAINER_IMAGE $DATAPROC_IMAGE_BUILD_PATH --machine-type=N1_HIGHCPU_32 --timeout=900s --verbosity=info

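#### Prepare the `preprocess` module

Create the preprocessing module

The module preprocesses the data with the following steps:

1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stop word removal
5. Stemming
6. Lemmatization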

In [None]:
with open(INIT_PATH, "w") as init_file:
    pass

In [None]:
process_module = """
#!/usr/bin/env python3

'''
This is a simple module to preprocess the data for the model registry demo.
Steps:
1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stopword removal
5. Stemming
6. Lemmatization
'''

# Libraries
import logging
import argparse

from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import col, concat_ws, rand
from pyspark.ml.functions import vector_to_array
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml.feature import CountVectorizer
from pyspark.ml import Pipeline

# Variables ------------------------------------------------------------------------------------------------------------
DATA_SCHEMA = (StructType()
               .add("category", StringType(), True)
               .add("text", StringType(), True))
SEED=8

# Helper functions -----------------------------------------------------------------------------------------------------
def get_logger():
    '''
    This function returns a logger object.
    Returns:
        logger: The logger object.
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


def get_args():
    '''
    This function returns the arguments from the command line.
    Returns:
        args: The arguments from the command line.
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_path', type=str, help='The input path uri without bucket prefix')
    parser.add_argument('--lemmas_path', type=str, help='The lemma dictionary path without bucket prefix')
    parser.add_argument('--gcs_output_path', type=str, help='The gcs path for preprocessed data without bucket prefix')
    parser.add_argument('--bq_output_table_uri', type=str, help='The Bigquery output table URI')
    parser.add_argument('--bucket', type=str, help='The staging bucket')
    parser.add_argument('--project', type=str, help='The project id')
    args = parser.parse_args()
    return args


def build_preliminary_steps():
    '''
    This function builds the preliminary steps for the preprocessing.
    Returns:
        preliminary_steps: The preliminary steps for the preprocessing.
    '''

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document").setCleanupMode('shrink_full')
    sentence_detector = SentenceDetector().setInputCols("document").setOutputCol("sentence")
    tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")
    preliminary_steps = [document_assembler, sentence_detector, tokenizer]
    return preliminary_steps


def build_common_preprocess_steps(lemma_uri):
    '''
    This function builds the common preprocessing steps.
    Args:
        lemma_uri: The uri of lemma dictionary
    Returns:
        common_preprocess_steps: The common preprocessing steps.
    '''

    normalizer = Normalizer().setInputCols("token").setOutputCol("normalized_token").setLowercase(True)
    stopwords_cleaner = StopWordsCleaner().setInputCols("normalized_token").setOutputCol(
        "cleaned_tokens").setCaseSensitive(False)
    stemmer = Stemmer().setInputCols("cleaned_tokens").setOutputCol("stem")
    lemmatizer = Lemmatizer().setInputCols("stem").setOutputCol("lemma").setDictionary(lemma_uri, "->", "\t")
    finisher = Finisher().setInputCols("lemma").setOutputCols(["lemma_features"]).setIncludeMetadata(
        False).setOutputAsArray(True)
    common_preprocess_steps = [normalizer, stopwords_cleaner, stemmer, lemmatizer, finisher]
    return common_preprocess_steps


def build_feature_extraction_steps():
    '''
    This function builds the feature extraction steps.
    Returns:
        feature_extraction_steps: The feature extraction steps.
    '''

    count_vectorizer = CountVectorizer().setInputCol("lemma_features").setOutputCol("features").setVocabSize(30)
    feature_extraction_steps = [count_vectorizer]
    return feature_extraction_steps


def read_data(spark_session, data_schema, input_dir):
    '''
    This function reads the data from the input directory.
    Args:
        spark_session: The SparkSession object.
        data_schema: The data schema.
        input_dir: The input directory.
    Returns:
        raw_df: The raw dataframe.
    '''

    raw_df = (spark_session.read.option("header", True)
              .option("delimiter", ',')
              .schema(data_schema)
              .csv(input_dir))
    return raw_df


def prepare_train_df(df):
    '''
    This function prepares the training dataframe.
    Args:
        df: The dataframe.
    Returns:
        train_df: The training dataframe.
    '''
    train_df = (df.withColumn("bow_col", vector_to_array("features"))
                .withColumn("lemmas", concat_ws(" ", col("lemma_features")))
                .select(["text"] + ["lemmas"] + [col("bow_col")[i] for i in range(30)] + ["category"]))

    return train_df


def save_data(data, bucket, gcs_path, bigquery_uri):
    '''
    This function saves the data to Bigquery.
    Args:
        data: The data to save.
        bucket: The bucket.
        gcs_path: The path to store processed data.
        bigquery_uri: The URI of the Bigquery table.
    Returns:
        None
    '''
    # df_sample = data.sample(withReplacement=False, fraction=0.7, seed=SEED)
    df_sample = data.orderBy(rand(SEED)).limit(1000)
    df_sample.write.format('bigquery') \
        .mode("overwrite") \
        .option("persistentGcsBucket", bucket) \
        .option("persistentGcsPath", gcs_path) \
        .save(bigquery_uri)


# Main function --------------------------------------------------------------------------------------------------------
def preprocess(args):
    '''
    preprocess function.
    Args:
        args: The arguments from the command line.
    Returns:
        None
    '''
    # Get logger
    logger = get_logger()

    # Initialize variables
    input_path = args.input_path
    lemma_path = args.lemmas_path
    gcs_output_path = args.gcs_output_path
    bq_output_table_uri = args.bq_output_table_uri
    bucket = args.bucket
    project = args.project
    lemma_uri = f'gs://{bucket}/{lemma_path}'
    input_uri = f'gs://{bucket}/{input_path}'

    # Initialize SparkSession
    logger.info('Starting preprocessing')
    spark = sparknlp.start()
    print(f"Spark NLP version: {sparknlp.version()}")
    print(f"Spark version: {spark.version}")

    # Build pipeline steps
    logger.info('Building pipeline steps')
    preliminary_steps = build_preliminary_steps()
    common_preprocess_steps = build_common_preprocess_steps(lemma_uri)
    feature_extraction_steps = build_feature_extraction_steps()
    pipeline = Pipeline(stages=preliminary_steps + common_preprocess_steps + feature_extraction_steps)

    # Read data
    logger.info('Reading data')
    raw_df = read_data(spark, DATA_SCHEMA, input_uri)

    # Preprocess data
    logger.info('Preprocessing data')
    processed_pipeline = pipeline.fit(raw_df)
    preprocessed_df = processed_pipeline.transform(raw_df)
    preprocessed_df.show(10, truncate=False)

    # Save data to Bigquery
    logger.info('Saving data to Bigquery')
    train_df = prepare_train_df(preprocessed_df)
    save_data(train_df, bucket, gcs_output_path, bq_output_table_uri)
    logger.info('done.')
    spark.stop()


if __name__ == '__main__':
    # Get args
    args = get_args()
    preprocess(args)
"""

# The `with` block closes the file, so no explicit close() is needed
with open(PREPROCESS_MODULE_PATH, "w") as process_file:
    process_file.write(process_module)

Upload the module to the bucket.

In [None]:
!gsutil cp $SRC_PATH/__init__.py $MODULE_URI/__init__.py
!gsutil cp $SRC_PATH/preprocess.py $MODULE_URI/preprocess.py

Upload the configuration file

You use the lemma dictionary referenced in the Spark NLP documentation and upload it to the Cloud Storage bucket.

In [None]:
!wget https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt -O $LEMMA_DICTIONARY_PATH
!gsutil cp $LEMMA_DICTIONARY_PATH $LEMMA_DICTIONARY_URI
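
The preprocessing module passes `"->"` as the key delimiter and a tab as the value delimiter when it loads this dictionary. Assuming each line has the shape `lemma -> form1<TAB>form2...` (an assumption inferred from those delimiters, so inspect the downloaded file to confirm), a line can be read as a mapping from word forms to their lemma. `parse_lemma_line` is a hypothetical helper for illustration and is not part of Spark NLP.

```python
def parse_lemma_line(line, key_delimiter="->", value_delimiter="\t"):
    """Turn one dictionary line into a {word form: lemma} mapping."""
    lemma, _, forms = line.partition(key_delimiter)
    lemma = lemma.strip()
    # Skip empty pieces caused by delimiters at the start or end of the list.
    return {
        form.strip(): lemma
        for form in forms.split(value_delimiter)
        if form.strip()
    }


mapping = parse_lemma_line("pick\t->\tpick\tpicks\tpicking\tpicked")
print(mapping)
```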

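Run a preprocessing Spark job with Dataproc Serverless

Now that everything is in place, you can submit the Dataproc Serverless preprocessing job. Explaining this CLI command is out of scope here; the official documentation lists all the available options.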

In [None]:
! gcloud beta dataproc batches submit pyspark $PROCESS_PYTHON_FILE_URI \
  --batch=$PREPROCESS_BATCH_ID \
  --container-image=$DATAPROC_RUNTIME_CONTAINER_IMAGE \
  --region=$REGION \
  --subnet=$SUBNETWORK \
  --properties spark.executor.instances=2,spark.driver.cores=4,spark.executor.cores=4,spark.app.name=spark_preprocessing_job \
  -- --input_path=$PREPARED_FILE_PATH --lemmas_path=$LEMMA_DICTIONARY_PATH --gcs_output_path=$PROCESS_DATA_PATH --bq_output_table_uri=$BQ_OUTPUT_TABLE_URI --bucket=$BUCKET_NAME --project=$PROJECT_ID

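## Model training for text classification

According to [Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/), there are several approaches to training a text classifier.

For example, you can use:

- Traditional methods, such as logistic regression or a Naive Bayes classifier
- Neural embedding methods
- Deep learning methods
- Large, pre-trained language models

In the following sections, you use Vertex AI AutoML and see how Vertex AI Model Registry manages the resulting model.

Training a deep text classifier model with Vertex AI AutoML is straightforward.

Prepare the text data in BigQuery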

In [None]:
automl_table_query = f"""
CREATE OR REPLACE TABLE {AUTOML_BQ_TABLE_URI} AS
  SELECT text, category
  FROM `{PROJECT_ID}.{BQ_OUTPUT_TABLE_URI}`
"""

In [None]:
table, result = run_query(query=automl_table_query)

Create a tabular dataset

In [None]:
automl_bq_dataset = vertex_ai.TabularDataset.create(
    display_name=AUTOML_TEXT_DATASET, bq_source=AUTOML_BQ_SOURCE
)

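##### Train the AutoML text classifier

You train an AutoML classifier that minimizes log loss. When training finishes, the model is registered as a new version of the text classifier model.

Note that this takes **~4 hours**.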
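
For reference, log loss is the average negative log of the probability that the model assigned to the true class, so confident wrong predictions are penalized heavily. The sketch below computes it by hand for two made-up predictions. The labels, probabilities, and the `log_loss` helper are illustrative assumptions and do not reproduce how AutoML evaluates the model internally.

```python
import math


def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability of the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when the model assigns zero probability.
        p = min(max(probs[label], eps), 1.0)
        total += -math.log(p)
    return total / len(true_labels)


labels = ["sport", "tech"]
probs = [
    {"sport": 0.5, "tech": 0.5},
    {"sport": 0.25, "tech": 0.75},
]
loss = log_loss(labels, probs)
print(f"log loss: {loss:.4f}")
```

A perfect model that assigns probability 1 to every true class has a log loss of 0.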

In [None]:
automl_pipeline_job = vertex_ai.AutoMLTabularTrainingJob(
    display_name=f"deep_text_classifier_{UUID}",
    optimization_prediction_type="classification",
    optimization_objective="minimize-log-loss",
    column_specs={"text": "auto"},
)

In [None]:
automl_model = automl_pipeline_job.run(
    dataset=automl_bq_dataset,
    target_column="category",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    parent_model=VERTEX_AI_MODEL_ID,
    model_version_aliases=["automl", "deep_classifier"],
    model_version_description="A Vertex AI AutoML text classifier",
    model_labels={"created_by": "inardini", "team": "advocacy"},
    is_default_version=False,
    disable_early_stopping=False,
    export_evaluated_data_items=True,
    export_evaluated_data_items_bigquery_destination_uri=AUTOML_BQ_EVALUATION_TABLE,
    sync=True,
)
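
The split fractions passed to `run` must add up to 1. As a quick arithmetic check, the sketch below converts the fractions into approximate row counts for a table of 1,000 rows, which is the cap applied by `limit(1000)` in the preprocessing module. `split_counts` is a hypothetical helper, and the actual assignment of rows to splits is done by the service, so treat these numbers as expectations only.

```python
def split_counts(n_rows, train=0.8, validation=0.1, test=0.1):
    """Approximate number of rows per split for the given fractions."""
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    n_train = round(n_rows * train)
    n_validation = round(n_rows * validation)
    # Give the remainder to the test split so the counts always add up.
    return n_train, n_validation, n_rows - n_train - n_validation


print(split_counts(1000))
```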

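Formalize model management with Vertex AI Model Registry

Initialize Vertex AI Model Registry

To access the different versions of a Vertex AI Model resource, you initialize a Model Registry instance.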

In [None]:
registry = vertex_ai.models.ModelRegistry(VERTEX_AI_MODEL_ID)

#### Compare model versions

Evaluate the new candidate

You use the `list_model_evaluations` method to evaluate the new model.

In [None]:
automl_evaluations = automl_model.list_model_evaluations()

for model_evaluation in automl_evaluations:
    print(model_evaluation.to_dict())

Register the `champion` model version

Validate the new candidate

In [None]:
versions = registry.list_versions()

In [None]:
CANDIDATE_VERSION_ID = versions[-1].version_id

In [None]:
candidate_model_version_info = registry.get_version_info(CANDIDATE_VERSION_ID)
candidate_model_version_info_df = pd.DataFrame(
    candidate_model_version_info,
    columns=["model_version"],
    index=[
        "version_id",
        "created_at",
        "updated_at",
        "model_display_name",
        "model_resource_name",
        "version_aliases",
        "version_description",
    ],
)
candidate_model_version_info_df

Promote to production

In [None]:
registry.add_version_aliases(["default", "production"], version=CANDIDATE_VERSION_ID)
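
Conceptually, promotion works by pointing an alias at a version, and callers then resolve the alias instead of hard-coding a version ID. The toy class below models that idea in plain Python. It is not the Vertex AI SDK, and the rule that re-adding an alias moves it to the new version is my understanding of how the registry behaves, stated here as an assumption.

```python
class ToyAliasTable:
    """Toy model of version aliases: each alias points to exactly one version."""

    def __init__(self):
        self._alias_to_version = {}

    def add_aliases(self, aliases, version):
        for alias in aliases:
            # Assumption: assigning an existing alias re-points it.
            self._alias_to_version[alias] = version

    def resolve(self, alias):
        return self._alias_to_version[alias]

    def aliases_of(self, version):
        return sorted(a for a, v in self._alias_to_version.items() if v == version)


table = ToyAliasTable()
table.add_aliases(["default", "production"], version="1")
table.add_aliases(["automl"], version="2")
table.add_aliases(["production"], version="2")  # promote version 2

print(table.resolve("production"))
print(table.aliases_of("1"), table.aliases_of("2"))
```

The benefit of this indirection is that deployment code can keep asking for `production` while the version behind it changes.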

Deploy the new candidate

In [None]:
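# Fetch the version by the `production` alias assigned above; no `candidate`
# alias is ever created in this notebook.
candidate_model = registry.get_model(version="production")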

Create the endpoint

In [None]:
endpoint = vertex_ai.Endpoint.create(
    display_name=ENDPOINT_NAME,
    project=PROJECT_ID,
    location=REGION,
)

Deploy the champion model

In [None]:
endpoint.deploy(
    model=candidate_model,
    deployed_model_display_name=DEPLOYED_MODEL_NAME,
    machine_type="n1-standard-8",
)

Generate predictions

In [None]:
text = """The singer to headline the event halftime show: 'It's on"""  # @param {type:"string"}

In [None]:
instances = [{"text": text}]
predictions = endpoint.predict(instances)
print(predictions)
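
The printed object wraps a list of per-instance results in its `predictions` attribute. Assuming each result is a dictionary with parallel `classes` and `scores` lists (an assumption about the response shape for this kind of model, so check the printed output first), the most likely category can be picked as follows. `top_class` and the sample values are hypothetical.

```python
def top_class(prediction):
    """Return the (class, score) pair with the highest score."""
    classes = prediction["classes"]
    scores = prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores[best]


# Made-up result with the assumed shape, for illustration only.
sample = {"classes": ["business", "sport", "tech"], "scores": [0.1, 0.7, 0.2]}
label, score = top_class(sample)
print(label, score)
```

With a real response, you would call `top_class(predictions.predictions[0])`.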

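Final thoughts

As you might expect, you can also upload externally trained models. See the documentation and the [sample notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/get_started_with_model_registry.ipynb).

Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created.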

In [None]:
endpoint.undeploy_all()

endpoint.delete()

versions = registry.list_versions()
for version in versions:
    registry.delete_version(version=version.version_id)

automl_pipeline_job.delete()

automl_bq_dataset.delete()

!gcloud dataproc batches delete $PREPROCESS_BATCH_ID --region=$REGION --quiet

! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET

! gcloud artifacts repositories delete $REPO_NAME --location=$REGION --quiet

!rm -rf $DATA_PATH $SRC_PATH $BUILD_PATH $CONFIG_PATH