In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model management with Vertex AI Model Registry

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

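## Overview

This notebook demonstrates the model versioning features of Vertex AI Model Registry for BQML and custom models.

### Objective

In this tutorial, you learn how to use the Vertex AI SDK and Vertex AI Model Registry to manage your models.

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery
- Vertex AI Training
- Vertex AI Model Registry

The steps performed include:

- Preprocess data with Spark NLP and load it into BigQuery for BQML
- Train and register a logistic regression model with BQML
- Train and register a naive Bayes classifier with scikit-learn
- Review and validate the BQML and scikit-learn models
- Nominate a champion and approve it for production by moving the `production` alias to that model version
- Deploy the default/production version of the Model resource

### Dataset

The [BBC dataset](http://mlg.ucd.ie/datasets/bbc.html) contains 2,225 articles from the BBC News website, covering stories in five topical areas (business, entertainment, politics, sport, tech) from 2004 to 2005. Each article is stored in a .txt file.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Dataproc
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.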

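### Set up your local development environment

**If you are using Colab or Vertex AI Workbench notebooks**, your environment already meets all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements. You need the following:

- The Google Cloud SDK
- Git
- Python 3
- virtualenv
- Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.

4. To install Jupyter, run `pip3 install jupyter` in a terminal shell.

5. To launch Jupyter, run `jupyter notebook` in a terminal shell.

6. Open this notebook in the Jupyter Notebook dashboard.

## Installation

Install the following packages required to execute this notebook.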

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade tensorflow google-cloud-bigquery google-cloud-aiplatform "shapely<2" {USER_FLAG} -q --no-warn-conflicts

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so that it can find them.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

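## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=iam.googleapis.com,aiplatform.googleapis.com,cloudresourcemanager.googleapis.com,artifactregistry.googleapis.com,dataproc.googleapis.com,cloudbuild.googleapis.com).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.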
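
Outside a notebook the `!` shortcut is not available. As a rough sketch, the same idea (run a command with a Python value interpolated into it and capture the output lines) can be written with `subprocess`. The `run_cli` helper and the `my-project` value are hypothetical and are not part of this tutorial.

```python
import subprocess
import sys


def run_cli(args: list[str]) -> list[str]:
    """Run a command and return its stdout as a list of lines.

    Hypothetical helper that mimics what `output = ! command` returns in Jupyter.
    """
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout.splitlines()


project_id = "my-project"  # placeholder value, not a real project
# Passing the arguments as a list avoids shell quoting problems.
lines = run_cli([sys.executable, "-c", f"print('project: {project_id}')"])
print(lines)
```

The example calls the Python interpreter itself so that it runs anywhere; in practice the argument list would start with a CLI such as `gcloud`.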

#### Set your project ID

**If you don't know your project ID**, you may be able to get it using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

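#### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You might not be able to use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).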

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

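#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a UUID for each session and append it to the names of the resources you create in this tutorial.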

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

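### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench notebooks**, your environment is already authenticated.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type each of the following roles into the filter box and select it:

    * Artifact Registry Administrator
    * Artifact Registry Repository Administrator
    * BigQuery Admin
    * Compute Network Admin
    * Cloud Build Editor
    * Dataproc Administrator
    * Dataproc Worker
    * Service Account User
    * Service Usage Admin
    * Storage Admin
    * Storage Object Admin
    * Vertex AI Administrator

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. In the cell below, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable and run the cell.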

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

#### Get your project number

Now that the project ID is set, you get the corresponding project number.

In [None]:
shell_output = ! gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = shell_output[0]
print("Project Number:", PROJECT_NUMBER)

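### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.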

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"
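
Before creating the bucket, it can help to check that the generated name has a valid shape. The sketch below covers only a simplified subset of the documented Cloud Storage naming rules (length, allowed characters, first and last character, and the reserved `goog` prefix). It cannot tell whether the name is globally unique, and the `looks_like_valid_bucket_name` helper is hypothetical.

```python
import re

# Simplified subset of the Cloud Storage naming rules:
# 3-63 characters, lowercase letters, digits, dashes, underscores and dots,
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified checks above."""
    return bool(_BUCKET_RE.match(name)) and not name.startswith("goog")


print(looks_like_valid_bucket_name("my-project-aip-ab12cd34"))
print(looks_like_valid_bucket_name("[your-bucket-name]"))
```

The second call shows that the untouched placeholder is rejected, which is why the cell above swaps it for a name built from the project ID and the UUID.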

Run the following cell to create your Cloud Storage bucket only if it doesn't already exist.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents.

In [None]:
! gsutil ls -al $BUCKET_URI

#### Service account

If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

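#### Set service account access

Run the following commands to grant your service account access to the bucket that you created in the previous step. You only need to run this step once per service account.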

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

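#### Enable Private Google Access for Dataproc Serverless

To execute Serverless Spark workloads, the VPC subnetwork must meet the requirements listed in the Dataproc Serverless for Spark network configuration. In this tutorial, you use the `default` subnetwork and enable Private Google Access on it.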

In [None]:
SUBNETWORK = "default"  # @param {type:"string"}

In [None]:
!gcloud compute networks subnets list --regions=$REGION --filter=$SUBNETWORK

In [None]:
!gcloud compute networks subnets update $SUBNETWORK \
--region=$REGION \
--enable-private-ip-google-access

In [None]:
!gcloud compute networks subnets describe $SUBNETWORK \
--region=$REGION \
--format="get(privateIpGoogleAccess)"

### Create and configure a Docker repository

You create a Docker repository in Artifact Registry for the custom Dataproc image that you build for NLP data preprocessing.

In [None]:
REPO_NAME = "vertex-ai-model-registry-demo"

In [None]:
!gcloud artifacts repositories create $REPO_NAME \
    --repository-format=docker \
    --location=$REGION \
    --description="vertex ai model registry spark docker repository"

### Set up the project template

You create a set of folders to organize your project locally.

In [None]:
DATA_PATH = "data"
SRC_PATH = "src"
BUILD_PATH = "build"
CONFIG_PATH = "config"

In [None]:
!mkdir -m 777 -p $DATA_PATH $SRC_PATH $BUILD_PATH $CONFIG_PATH

#### Get the input data

In the following code, you download and extract the tutorial dataset.

In [None]:
RAW_DATA_URI = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"

In [None]:
!rm -Rf {DATA_PATH}/raw 
!wget --no-parent {RAW_DATA_URI} --directory-prefix={DATA_PATH}/raw 
!unzip -qo {DATA_PATH}/raw/bbc-fulltext.zip -d {DATA_PATH}/raw && mv {DATA_PATH}/raw/bbc/* {DATA_PATH}/raw/
!rm -Rf {DATA_PATH}/raw/bbc-fulltext.zip {DATA_PATH}/raw/bbc

#### Set up the BigQuery dataset

You create the BigQuery dataset for this tutorial.

In [None]:
LOCATION = REGION.split("-")[0]
BQ_DATASET = "bcc_sport"

! bq mk --location={LOCATION} --dataset {PROJECT_ID}:{BQ_DATASET}
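
The cell above derives the dataset location from the prefix of `REGION`. BigQuery documents two multi-regions, `US` and `EU`, so a prefix such as `asia` may not correspond to a documented dataset location. A minimal sketch of a more explicit mapping, assuming a hypothetical `bigquery_location` helper that falls back to the region itself, could look like this:

```python
def bigquery_location(region: str) -> str:
    """Map a Vertex AI region to a BigQuery dataset location (illustrative)."""
    prefix = region.split("-")[0]
    if prefix == "us":
        return "US"
    if prefix == "europe":
        return "EU"
    # No matching multi-region: keep the dataset in the region itself.
    return region


for region in ("us-central1", "europe-west4", "asia-east1"):
    print(region, "->", bigquery_location(region))
```

Whether a regional or multi-regional dataset is the better choice depends on where the rest of your resources live, so treat the mapping as a starting point.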

### Import libraries

In [None]:
# General
import csv
import datetime as dt
import glob
import json
import os
import sys

import pandas as pd

pd.set_option("display.max_colwidth", 3000)

# Model Training
import tensorflow as tf
from google.cloud import aiplatform as vertex_ai
from google.cloud import bigquery

In [None]:
print("BigQuery library version:", bigquery.__version__)
print("Vertex AI library version:", vertex_ai.__version__)

### Set variables

In [None]:
# General
STAGING_BUCKET = f"{BUCKET_URI}/jobs"
RAW_PATH = os.path.join(DATA_PATH, "raw")
DATAPROC_IMAGE_BUILD_PATH = os.path.join(BUILD_PATH, "dataproc_image")
PREPROCESS_DOCKERFILE_PATH = os.path.join(DATAPROC_IMAGE_BUILD_PATH, "Dockerfile")
DATAPROC_RUNTIME_IMAGE = "dataproc_serverless_custom_runtime"
IMAGE_TAG = "1.0.0"
DATAPROC_RUNTIME_CONTAINER_IMAGE = (
    f"gcr.io/{PROJECT_ID}/{DATAPROC_RUNTIME_IMAGE}:{IMAGE_TAG}"
)
INIT_PATH = os.path.join(SRC_PATH, "__init__.py")
MODULE_URI = f"{BUCKET_URI}/{SRC_PATH}"
VERTEX_AI_MODEL_ID = "text-classifier-model"

# Ingest
PREPARED_PATH = os.path.join(DATA_PATH, "prepared")
PREPARED_FILE = "prepared_data.csv"
PREPARED_FILE_PATH = os.path.join(PREPARED_PATH, PREPARED_FILE)
PREPARED_FILE_URI = f"{BUCKET_URI}/{PREPARED_FILE_PATH}"

# Preprocess
PREPROCESS_MODULE_PATH = os.path.join(SRC_PATH, "preprocess.py")
LEMMA_DICTIONARY_PATH = os.path.join(CONFIG_PATH, "lemmas.txt")
LEMMA_DICTIONARY_URI = f"{BUCKET_URI}/{CONFIG_PATH}/lemmas.txt"
PROCESS_PYTHON_FILE_URI = f"{MODULE_URI}/preprocess.py"
PROCESS_DATA_PATH = os.path.join(DATA_PATH, "processed")
BQ_OUTPUT_TABLE_URI = f"{BQ_DATASET}.news_processed_{UUID}"
PROCESS_DATA_URI = f"{BUCKET_URI}/{PROCESS_DATA_PATH}"
PROCESS_FILE_URI = f"{PROCESS_DATA_URI}/*.parquet"
PREPROCESS_BATCH_ID = f"nlp-preprocess-{UUID}"

# Training
TRAIN_NAIVE_MODULE_PATH = os.path.join(SRC_PATH, "train_naive.py")
NAIVE_TRAIN_JOB_NAME = f"naive_training_job_{UUID}"
TRAIN_VERSION = "scikit-learn-cpu.0-23"
NAIVE_TRAIN_CONTAINER_URI = (
    f"{REGION.split('-')[0]}-docker.pkg.dev/vertex-ai/training/{TRAIN_VERSION}:latest"
)
NAIVE_TRAIN_REQUIREMENTS = ["pyarrow", "fastparquet", "gcsfs"]
DEPLOY_VERSION = "sklearn-cpu.0-23"
NAIVE_DEPLOY_CONTAINER_URI = f"{REGION.split('-')[0]}-docker.pkg.dev/vertex-ai/prediction/{DEPLOY_VERSION}:latest"
NAIVE_MODEL_BASE_URI = f"{BUCKET_URI}/deliverables/naive"
NAIVE_MODEL_URI = f"{BUCKET_URI}/deliverables/naive/model"
NAIVE_METRICS_FILE_URI = f"{NAIVE_MODEL_URI}/metrics.json"

# Deployment
SERVING_BUILD_PATH = os.path.join(BUILD_PATH, "serving")
SERVING_APP_BUILD_PATH = os.path.join(SERVING_BUILD_PATH, "app")
SERVE_NAIVE_MODULE_PATH = os.path.join(SERVING_APP_BUILD_PATH, "main.py")
SERVE_REQUIREMENTS_PATH = os.path.join(SERVING_BUILD_PATH, "requirements.txt")
SERVE_DOCKERFILE_PATH = os.path.join(SERVING_BUILD_PATH, "Dockerfile")
SERVE_AUTH_PATH = os.path.join(SERVING_BUILD_PATH, "key.json")
SERVE_SCRIPT_PATH = os.path.join(SERVING_BUILD_PATH, "copy_model.sh")
SERVING_RUNTIME_IMAGE = "serving_custom_naive"
IMAGE_TAG = "1.0.0"
SERVING_NAIVE_RUNTIME_CONTAINER_IMAGE = (
    f"gcr.io/{PROJECT_ID}/{SERVING_RUNTIME_IMAGE}:{IMAGE_TAG}"
)
ENDPOINT_NAME = "text-classifier-endpoint"
DEPLOYED_MODEL_NAME = "naive-bayes-text-classifier"

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and the corresponding bucket.

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

### Helpers

A set of helpers to simplify some tasks.

In [None]:
def prepare_data(input_path: str, output_path: str, file_name: str):
    """
    This function prepares the data for the model registry demo.
    Args:
        input_path: The directory where the raw data is stored.
        output_path: The directory where the prepared data will be stored.
        file_name: The name of the file to be prepared.
    Returns:
        None
    """
    # Read folder names
    categories = [f.name for f in os.scandir(input_path) if f.is_dir()]

    # Create output directory if it doesn't exist
    os.makedirs(output_path, exist_ok=True)

    # Create output file; newline="" lets the csv module control line endings
    with open(os.path.join(output_path, file_name), "w", newline="") as output_file:
        csv_writer = csv.writer(output_file)
        csv_writer.writerow(["category", "text"])

        # For each category, read all files and write to output file
        for category in categories:
            # Read all files in category
            for filename in glob.glob(os.path.join(input_path, category, "*.txt")):
                # Read file; the with block closes it automatically
                with open(filename, "r") as input_file:
                    # Join non-empty lines with a space so that words on
                    # adjacent lines are not glued together
                    lines = (line.strip() for line in input_file)
                    output_text = " ".join(line for line in lines if line)
                    # Write to output file
                    csv_writer.writerow([category, output_text])


def run_query(query):
    """
    This function runs a query on BigQuery and waits for it to finish.
    Args:
        query: The query to be run.
    Returns:
        table: The DDL target table of the query job.
        result: The query result.
    """

    # Construct a BigQuery client object.
    client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

    # Run the query_job
    query_job = client.query(query)

    # Wait for the query to finish
    result = query_job.result()

    # Return table
    table = query_job.ddl_target_table

    return table, result


def read_metrics_file(metrics_file_uri):
    """
    This function reads metrics file on bucket
    Args:
      metrics_file_uri: The uri of the metrics file
    Returns:
      metrics_str: metrics string
    """

    with tf.io.gfile.GFile(metrics_file_uri, "r") as metrics_file:
        # Replace single quotes with double quotes so the text parses as JSON
        metrics = metrics_file.read().replace("'", '"')
    return metrics
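
The quote replacement in `read_metrics_file` suggests that the metrics file holds text shaped like a printed Python dictionary rather than strict JSON. The following sketch, which uses made-up metric values, shows why the replacement makes the text parseable and names one alternative from the standard library.

```python
import ast
import json

# Made-up metrics text, shaped like str() of a Python dict.
raw_metrics = "{'accuracy': 0.91, 'f1': 0.88}"

# Same quote replacement as read_metrics_file, followed by JSON parsing.
metrics = json.loads(raw_metrics.replace("'", '"'))
print(metrics["accuracy"])

# ast.literal_eval parses the dict literal directly, so it also copes with
# values that themselves contain quote characters.
same_metrics = ast.literal_eval(raw_metrics)
print(same_metrics == metrics)
```

The replacement approach is fine for simple numeric metrics; it would break on string values that contain an apostrophe.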

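## Data engineering with Dataproc Serverless

Before you build an NLP machine learning model, there are some common preprocessing steps you can apply:

1. Preliminary steps, such as sentence segmentation and word tokenization
2. Frequent steps, such as stop word removal, stemming and lemmatization, removing digits and punctuation, lowercasing, and so on

Other steps include normalization, language detection, and parsing beyond part-of-speech tagging.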
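
To make these steps concrete, here is a minimal pure-Python sketch of sentence segmentation, tokenization, lowercasing, stop word removal and stemming. It is not the Spark NLP pipeline used in this tutorial: the stop word list is a tiny illustrative sample, and `crude_stem` is a hypothetical stand-in for a real stemmer.

```python
import re

# Tiny illustrative stop word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "and", "to"}


def split_sentences(text: str) -> list[str]:
    """Naive sentence segmentation on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Lowercase and keep alphabetic runs only (drops digits and punctuation)."""
    return re.findall(r"[a-z]+", sentence.lower())


def crude_stem(token: str) -> str:
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Apply the steps in order and return the cleaned tokens."""
    tokens = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            if token not in STOP_WORDS:
                tokens.append(crude_stem(token))
    return tokens


print(preprocess("The markets are rising. Shares jumped 5% in London!"))
# ['market', 'ris', 'share', 'jump', 'london']
```

The stem `ris` shows why stemming is usually paired with lemmatization: a stemmer only trims suffixes, while a lemmatizer maps a word form to its dictionary entry.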

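In the following sections, you ingest your dataset, then build and run a simple NLP preprocessing pipeline with Spark NLP on Dataproc Serverless. To do that, you need to:

1. Upload the data to a Google Cloud Storage bucket
2. Create a custom Dataproc Serverless image
3. Create the `preprocess` module and upload it, with its dependencies, to a Google Cloud Storage bucket

Then you run the Dataproc Serverless job and load the resulting data into BigQuery.

### Ingest data

Next, you:

1.  Prepare the data by extracting the news articles from the directories and generating the corresponding CSV file.
2.  Upload the data to the Google Cloud Storage bucket.

#### Prepare the data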

In [None]:
prepare_data(RAW_PATH, PREPARED_PATH, PREPARED_FILE)

#### Take a quick look at the CSV data

In [None]:
! head $PREPARED_FILE_PATH

#### Upload the data to the bucket

In [None]:
! gsutil cp $PREPARED_FILE_PATH $PREPARED_FILE_URI

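### Basic data and feature engineering

In this scenario, you use a Spark pipeline with Spark NLP that covers the following steps:

1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stop word removal
5. Stemming
6. Lemmatization

Finally, you create a bag of words (BOW) using a `CountVectorizer` object.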
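
A bag of words turns each document into a vector of token counts over a fixed vocabulary. The sketch below illustrates the idea in plain Python with a capped vocabulary, similar in spirit to the vocabulary size limit set on `CountVectorizer` in the module. The `fit_vocabulary` and `to_bow` helpers are hypothetical and do not reproduce Spark's exact ordering or tie-breaking.

```python
from collections import Counter


def fit_vocabulary(docs: list[list[str]], vocab_size: int) -> list[str]:
    """Keep the `vocab_size` most frequent tokens across the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:vocab_size]]


def to_bow(doc: list[str], vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


docs = [["market", "share", "market"], ["match", "goal", "market"]]
vocabulary = fit_vocabulary(docs, vocab_size=3)
print(vocabulary)
# ['market', 'goal', 'match']
print([to_bow(doc, vocabulary) for doc in docs])
# [[2, 0, 0], [1, 1, 1]]
```

The token `share` falls outside the capped vocabulary, so it contributes nothing to the first vector. A small vocabulary keeps the feature table narrow at the cost of discarding rarer words.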

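#### Build a custom Dataproc Serverless image

`DataprocPySparkBatchOp` allows you to pass a custom image for cases where the [provided Dataproc Serverless runtime versions](https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions) don't meet your requirements.

In this case, you need an image that includes the Spark NLP library.

##### Download the dependencies required by the Spark job

You download the Spark dependencies required to run the NLP preprocessing pipeline.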

In [None]:
! rm -rf $DATAPROC_IMAGE_BUILD_PATH
! mkdir $DATAPROC_IMAGE_BUILD_PATH

In [None]:
!gsutil cp gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar $DATAPROC_IMAGE_BUILD_PATH
!wget -P $DATAPROC_IMAGE_BUILD_PATH https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar
!wget -P $DATAPROC_IMAGE_BUILD_PATH https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh

##### Define the Dataproc Serverless custom runtime image

You define the Dockerfile that creates the custom image.

In [None]:
dataproc_serverless_custom_runtime_image = """
# Debian 11 is recommended.
FROM debian:11-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# (Optional) Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"
COPY spark-nlp-assembly-4.0.2.jar "${SPARK_EXTRA_JARS_DIR}"

# (Optional) Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py38_4.9.2-Linux-x86_64.sh .
RUN bash Miniconda3-py38_4.9.2-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# (Optional) Install Conda packages.
#
# The following packages are installed in the default image, it is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-dataproc \
      numpy \
      pandas \
      python \
      pyspark \
      findspark 

# Use conda to install spark-nlp
RUN ${CONDA_HOME}/bin/conda install -n base -c johnsnowlabs 'spark-nlp=4.0.2'

# Add lemma dictionary
# ENV CONFIG_DIR='/home/app/build'
# RUN mkdir -p "${CONFIG_DIR}"
# COPY lemmas.txt "${CONFIG_DIR}"

# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
"""

# The with block closes the file automatically
with open(PREPROCESS_DOCKERFILE_PATH, "w") as f:
    f.write(dataproc_serverless_custom_runtime_image)

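##### Build the Dataproc Serverless custom runtime with Google Cloud Build

You use Cloud Build to create the container image and register it in Artifact Registry.

Note that `<PROJECT_NUMBER>@cloudbuild.gserviceaccount.com` needs `storage.objects.get` access to the Google Cloud Storage objects.

**Note**: This step can take about 5 minutes.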

In [None]:
CLOUD_BUILD_SERVICE_ACCOUNT = f"{PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"

! gsutil iam ch serviceAccount:{CLOUD_BUILD_SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{CLOUD_BUILD_SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

In [None]:
!gcloud builds submit --tag $DATAPROC_RUNTIME_CONTAINER_IMAGE $DATAPROC_IMAGE_BUILD_PATH --machine-type=N1_HIGHCPU_32 --timeout=900s --verbosity=info

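#### Prepare the `preprocess` module

##### Create the preprocess module

This module preprocesses the data with the following steps:

1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stop word removal
5. Stemming
6. Lemmatization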

In [None]:
with open(INIT_PATH, "w") as init_file:
    pass

In [None]:
process_module = """
#!/usr/bin/env python3

'''
This is a simple module to preprocess the data for the model registry demo.
Steps:
1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stopword removal
5. Stemming
6. Lemmatization
'''

# Libraries
import logging
import argparse

from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import col, concat_ws, rand
from pyspark.ml.functions import vector_to_array
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml.feature import CountVectorizer, SQLTransformer
from pyspark.ml import Pipeline

# Variables ------------------------------------------------------------------------------------------------------------
DATA_SCHEMA = (StructType()
               .add("category", StringType(), True)
               .add("text", StringType(), True))
SEED=8

# Helper functions -----------------------------------------------------------------------------------------------------
def get_logger():
    '''
    This function returns a logger object.
    Returns:
        logger: The logger object.
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


def get_args():
    '''
    This function returns the arguments from the command line.
    Returns:
        args: The arguments from the command line.
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_path', type=str, help='The input path uri without bucket prefix')
    parser.add_argument('--lemmas_path', type=str, help='The lemma dictionary path without bucket prefix')
    parser.add_argument('--gcs_output_path', type=str, help='The gcs path for preprocessed data without bucket prefix')
    parser.add_argument('--bq_output_table_uri', type=str, help='The Bigquery output table URI')
    parser.add_argument('--bucket', type=str, help='The staging bucket')
    parser.add_argument('--project', type=str, help='The project id')
    args = parser.parse_args()
    return args


def build_preliminary_steps():
    '''
    This function builds the preliminary steps for the preprocessing.
    Returns:
        preliminary_steps: The preliminary steps for the preprocessing.
    '''
    
    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document").setCleanupMode('shrink_full')
    sentence_detector = SentenceDetector().setInputCols("document").setOutputCol("sentence")
    tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")
    preliminary_steps = [document_assembler, sentence_detector, tokenizer]
    return preliminary_steps


def build_common_preprocess_steps(lemma_uri):
    '''
    This function builds the common preprocessing steps.
    Args:
        lemma_uri: The uri of lemma dictionary
    Returns:
        common_preprocess_steps: The common preprocessing steps.
    '''

    normalizer = Normalizer().setInputCols("token").setOutputCol("normalized_token").setLowercase(True)
    stopwords_cleaner = StopWordsCleaner().setInputCols("normalized_token").setOutputCol(
        "cleaned_tokens").setCaseSensitive(False)
    stemmer = Stemmer().setInputCols("cleaned_tokens").setOutputCol("stem")
    lemmatizer = Lemmatizer().setInputCols("stem").setOutputCol("lemma").setDictionary(lemma_uri, "->", "\t")
    finisher = Finisher().setInputCols("lemma").setOutputCols(["lemma_features"]).setIncludeMetadata(
        False).setOutputAsArray(True)
    common_preprocess_steps = [normalizer, stopwords_cleaner, stemmer, lemmatizer, finisher]
    return common_preprocess_steps


def build_feature_extraction_steps():
    '''
    This function builds the feature extraction steps.
    Returns:
        feature_extraction_steps: The feature extraction steps.
    '''

    count_vectorizer = CountVectorizer().setInputCol("lemma_features").setOutputCol("features").setVocabSize(30)
    feature_extraction_steps = [count_vectorizer]
    return feature_extraction_steps

def build_postprocessing_steps():
    '''
    This function builds the postprocessing steps.
    Returns:
        postprocessing_steps: The postprocessing steps (binary target conversion).
    '''

    sql_transformer = SQLTransformer(statement="SELECT CASE WHEN (category != 'business') THEN 'other' ELSE category END AS category, text, lemma_features, features  FROM __THIS__")
    postprocessing_steps = [sql_transformer]
    return postprocessing_steps

def read_data(spark_session, data_schema, input_dir):
    '''
    This function reads the data from the input directory.
    Args:
        spark_session: The SparkSession object.
        data_schema: The data schema.
        input_dir: The input directory.
    Returns:
        raw_df: The raw dataframe.
    '''

    raw_df = (spark_session.read.option("header", True)
              .option("delimiter", ',')
              .schema(data_schema)
              .csv(input_dir))
    return raw_df


def prepare_train_df(df):
    '''
    This function prepares the training dataframe.
    Args:
        df: The dataframe.
    Returns:
        train_df: The training dataframe.
    '''
    train_df = (df.withColumn("bow_col", vector_to_array("features"))
                .withColumn("lemmas", concat_ws(" ", col("lemma_features")))
                .select(["text"] + ["lemmas"] + [col("bow_col")[i] for i in range(30)] + ["category"]))

    return train_df


def save_data(data, bucket, gcs_path, bigquery_uri):
    '''
    This function saves the data to Bigquery.
    Args:
        data: The data to save.
        bucket: The bucket.
        gcs_path: The path to store processed data.
        bigquery_uri: The URI of the Bigquery table.
    Returns:
        None
    '''
    # df_sample = data.sample(withReplacement=False, fraction=0.7, seed=SEED)
    df_sample = data.orderBy(rand(SEED)).limit(1000)
    df_sample.write.format('bigquery') \
        .mode("overwrite") \
        .option("persistentGcsBucket", bucket) \
        .option("persistentGcsPath", gcs_path) \
        .save(bigquery_uri)


# Main function --------------------------------------------------------------------------------------------------------
def preprocess(args):
    '''
    preprocess function.
    Args:
        args: The arguments from the command line.
    Returns:
        None
    '''
    # Get logger
    logger = get_logger()

    # Initialize variables
    input_path = args.input_path
    lemma_path = args.lemmas_path
    gcs_output_path = args.gcs_output_path
    bq_output_table_uri = args.bq_output_table_uri
    bucket = args.bucket
    project = args.project
    lemma_uri = f'gs://{bucket}/{lemma_path}'
    input_uri = f'gs://{bucket}/{input_path}'

    # Initialize SparkSession
    logger.info('Starting preprocessing')
    spark = sparknlp.start()
    print(f"Spark NLP version: {sparknlp.version()}")
    print(f"Spark version: {spark.version}")

    # Build pipeline steps
    logger.info('Building pipeline steps')
    preliminary_steps = build_preliminary_steps()
    common_preprocess_steps = build_common_preprocess_steps(lemma_uri)
    feature_extraction_steps = build_feature_extraction_steps()
    postprocessing_steps = build_postprocessing_steps()
    pipeline = Pipeline(stages=preliminary_steps + common_preprocess_steps + feature_extraction_steps + postprocessing_steps)

    # Read data
    logger.info('Reading data')
    raw_df = read_data(spark, DATA_SCHEMA, input_uri)

    # Preprocess data
    logger.info('Preprocessing data')
    processed_pipeline = pipeline.fit(raw_df)
    preprocessed_df = processed_pipeline.transform(raw_df)
    preprocessed_df.show(10, truncate=False)

    # Save data to Bigquery
    logger.info('Saving data to Bigquery')
    train_df = prepare_train_df(preprocessed_df)
    save_data(train_df, bucket, gcs_output_path, bq_output_table_uri)
    logger.info('done.')
    spark.stop()


if __name__ == '__main__':
    # Get args
    args = get_args()
    preprocess(args)
"""

# The with block closes the file automatically
with open(PREPROCESS_MODULE_PATH, "w") as process_file:
    process_file.write(process_module)

##### Upload the module to the bucket

In [None]:
!gsutil cp $SRC_PATH/__init__.py $MODULE_URI/__init__.py
!gsutil cp $SRC_PATH/preprocess.py $MODULE_URI/preprocess.py

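##### Upload the configuration file

As described in the Spark NLP documentation, you use a lemma dictionary, which you upload to the Google Cloud Storage bucket.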

In [None]:
!wget https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt -O $LEMMA_DICTIONARY_PATH
!gsutil cp $LEMMA_DICTIONARY_PATH $LEMMA_DICTIONARY_URI
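
The `preprocess` module passes `"->"` as the key delimiter and a tab as the value delimiter when it loads this dictionary. Assuming each line has the layout those delimiters imply (a lemma, then `->`, then tab-separated word forms), a line could be parsed as in the sketch below. The sample line is made up and the `parse_lemma_line` helper is hypothetical, so check the downloaded file before relying on this layout.

```python
def parse_lemma_line(
    line: str, key_delimiter: str = "->", value_delimiter: str = "\t"
) -> tuple[str, list[str]]:
    """Split one dictionary line into (lemma, word forms)."""
    lemma, _, forms = line.partition(key_delimiter)
    return lemma.strip(), [f.strip() for f in forms.split(value_delimiter) if f.strip()]


# Made-up example line, not taken from the downloaded file.
sample_line = "go\t->\twent\tgoing\tgone"
lemma, forms = parse_lemma_line(sample_line)

# A lemmatizer needs the reverse lookup: from word form to lemma.
form_to_lemma = {form: lemma for form in forms}
print(lemma, forms)
print(form_to_lemma["went"])
```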

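#### Run the preprocessing Spark job with Dataproc Serverless

Now that everything is ready, you can submit the preprocessing Dataproc Serverless job. Explaining this CLI command is out of scope, but you can find all the options in the [official documentation](https://cloud.google.com/dataproc-serverless/docs/quickstarts/spark-batch).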

In [None]:
! gcloud beta dataproc batches submit pyspark $PROCESS_PYTHON_FILE_URI \
  --batch=$PREPROCESS_BATCH_ID \
  --container-image=$DATAPROC_RUNTIME_CONTAINER_IMAGE \
  --region=$REGION \
  --version='1.0.21' \
  --subnet='default' \
  --properties spark.executor.instances=2,spark.driver.cores=4,spark.executor.cores=4,spark.app.name=spark_preprocessing_job \
  -- --input_path=$PREPARED_FILE_PATH --lemmas_path=$LEMMA_DICTIONARY_PATH --gcs_output_path=$PROCESS_DATA_PATH --bq_output_table_uri=$BQ_OUTPUT_TABLE_URI --bucket=$BUCKET_NAME --project=$PROJECT_ID

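## Text classification model training

According to *Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems*, there are different approaches to training a text classifier. For example:

- Traditional methods, such as logistic regression or a naive Bayes classifier
- Neural embedding methods
- Deep learning methods
- Large, pre-trained language models

In the following sections, you use traditional methods and see how Vertex AI Model Registry manages all of them.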
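
As a reminder of what one of these traditional methods does, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier with Laplace smoothing on a made-up four-document corpus. It is an illustration of the technique only, not the scikit-learn training code used later, and the `train_multinomial_nb` and `predict` helpers are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods."""
    vocabulary = sorted({token for doc in docs for token in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Tokens unseen during training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Made-up toy corpus with the same binary target as this tutorial.
docs = [["market", "share", "profit"], ["goal", "match", "team"],
        ["profit", "bank", "market"], ["team", "coach", "goal"]]
labels = ["business", "other", "business", "other"]
model = train_multinomial_nb(docs, labels)
print(predict(["market", "profit"], *model))  # business
print(predict(["coach", "match"], *model))  # other
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the smoothing term keeps a token that never appeared in a class from forcing that class's probability to zero.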

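### Logistic regression with BQML

#### Train and register the model

To register a BigQuery ML model in Vertex AI Model Registry, you must set the `MODEL_REGISTRY='vertex_ai'` option in the `CREATE MODEL` statement.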

In [None]:
train_lr_query = f"""
CREATE OR REPLACE MODEL
  `{PROJECT_ID}.{BQ_DATASET}.text_logit_classifier`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    AUTO_CLASS_WEIGHTS=TRUE,
    DATA_SPLIT_METHOD='RANDOM',
    DATA_SPLIT_EVAL_FRACTION = .10,
    INPUT_LABEL_COLS=['category'],
    ENABLE_GLOBAL_EXPLAIN=TRUE,
    MODEL_REGISTRY='vertex_ai',
    VERTEX_AI_MODEL_ID='{VERTEX_AI_MODEL_ID}',
    VERTEX_AI_MODEL_VERSION_ALIASES=['experimental', 'baseline', 'BQML', 'logistic_regression']
  ) AS
    SELECT * EXCEPT(text, lemmas)
    FROM `{PROJECT_ID}.{BQ_OUTPUT_TABLE_URI}`
"""

In [None]:
model_table, result = run_query(query=train_lr_query)
print(f"The model {model_table.dataset_id}.{model_table.table_id} was successfully created!")

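### Naive Bayes classifier with scikit-learn

#### Create the naive training module

With this module, you train a simple scikit-learn naive Bayes estimator for text classification.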

In [None]:
train_naive_module = """
#!/usr/bin/env python3
'''
This is a simple module to train a naive bayes model.
'''

import logging
import argparse
import os
import glob

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, log_loss, roc_auc_score
import pickle



# Variables
RANDOM_STATE = 8
TEST_SIZE = 0.2
EVAL_SIZE = 0.25



# Helpers --------------------------------------------------------------------------------------------------------------

def get_logger():
    '''
    This function returns a logger object.
    Returns:
        logger: The logger object.
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


def get_args():
    '''
    This function parses and returns the arguments passed on the command line.
    Returns:
        args: Arguments list.
    '''

    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path",
                        type=str, help="The path of the training data.")
    parser.add_argument('--model_dir',
                        type=str, help='The path of the model directory.')
    args = parser.parse_args()
    return args


def read_data(data_path: str):
    '''
    This function reads the data from the provided data path.
    Args:
        data_path: The path of the data.
    Returns:
        x_train: The training data.
        y_train: The training labels.
        x_test: The test data.
        y_test: The test labels.
    '''
    # Read data
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if data_path.startswith(gs_prefix):
        data_path = data_path.replace(gs_prefix, gcsfuse_prefix)
    parquet_files = glob.glob(data_path)
    dataframes = []
    for parquet_file_path in parquet_files:
        parquet_file_path = parquet_file_path.replace(gcsfuse_prefix, gs_prefix)
        dataframes.append(pd.read_parquet(parquet_file_path, engine='fastparquet'))
    df = pd.concat(dataframes, axis=0)
    x = df.text
    # y = np.where(df.category == 'sport', 1, 0)
    y = df.category
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=RANDOM_STATE, test_size=TEST_SIZE)
    x_train, x_eval, y_train, y_eval = train_test_split(x_train, y_train, random_state=RANDOM_STATE, test_size=EVAL_SIZE)
    return x_train, y_train, x_test, y_test


def get_weights(y_train):
    '''
    This function returns balanced per-sample weights for the given labels.
    Args:
        y_train: The labels to compute the weights for.
    Returns:
        weights: One weight per sample.
    '''
    weights = compute_sample_weight('balanced', y_train)
    return weights


def build_model():
    '''
    This function builds the model.
    Returns:
        model: The model.
    '''
    model = Pipeline([
        ('count_vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB())
    ])
    return model


def train_model(x_train, y_train, model):
    '''
    This function trains the model.
    Args:
        x_train: The training data.
        y_train: The training labels.
        model: The model to train.
    Returns:
        model: The trained model.
    '''
    model = model.fit(x_train, y_train, classifier__sample_weight=get_weights(y_train))
    return model


def evaluate_model(model, x_test, y_test):
    '''
    This function evaluates the model on the test data.
    Parameters:
        model: The model to evaluate.
        x_test: The test data.
        y_test: The test labels.
    '''

    y_pred = model.predict(x_test)
    y_pred_proba = model.predict_proba(x_test)
    metrics = {
        "precision": round(precision_score(y_test, y_pred, sample_weight=get_weights(y_test), average="weighted"), 5),
        "recall": round(recall_score(y_test, y_pred, sample_weight=get_weights(y_test), average="weighted"), 5),
        "accuracy": round(accuracy_score(y_test, y_pred, sample_weight=get_weights(y_test)), 5),
        "f1_score": round(f1_score(y_test, y_pred, sample_weight=get_weights(y_test), average="weighted"), 5),
        "log_loss": round(log_loss(y_test, y_pred_proba, sample_weight=get_weights(y_test)), 5),
        "roc_auc": round(roc_auc_score(y_test, y_pred_proba[:,1], sample_weight=get_weights(y_test), average="weighted"), 5)
    }
    return metrics


def save_model(model, model_dir):
    '''
    This function saves the model to the provided model directory.
    Parameters:
        model: The model to save.
        model_dir: The directory to save the model to.
    '''

    # Create output directory if it doesn't exist
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if model_dir.startswith(gs_prefix):
        model_dir = model_dir.replace(gs_prefix, gcsfuse_prefix)
    model_dir = os.path.join(model_dir, 'model')
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    model_path = os.path.join(model_dir, 'model.pkl')
    with open(model_path, 'wb') as model_file:
        pickle.dump(model, model_file)

def save_metrics(metrics, model_dir):
    '''
    This function saves the metrics to the provided model directory.
    Parameters:
        metrics: The metrics to save.
        model_dir: The directory to save the metrics to.
    '''

    # Create output directory if it doesn't exist
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if model_dir.startswith(gs_prefix):
        model_dir = model_dir.replace(gs_prefix, gcsfuse_prefix)
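    metrics_path = os.path.join(model_dir, 'model', 'metrics.json')
    # Serialize as valid JSON so the file can later be parsed with json.loads
    import json
    with open(metrics_path, 'w') as f:
        f.write(json.dumps({key: float(value) for key, value in metrics.items()}))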


def train_naive(args):
    '''
    This function trains the model and saves it to the provided model directory.
    Parameters:
        args: The arguments from the command line.
    '''
    # Get logger
    logger = get_logger()
    logger.info('Starting model training...')

    # Initialize variables
    data_path = args.data_path
    model_dir = args.model_dir

    # Build model
    model = build_model()

    # Read data
    logger.info('Reading data')
    x_train, y_train, x_test, y_test = read_data(data_path)

    # Train model
    logger.info('Training model')
    model = train_model(x_train, y_train, model)

    # Evaluate model
    logger.info('Evaluating model')
    metrics = evaluate_model(model, x_test, y_test)
    for key, value in metrics.items():
        print(f'{key}: {value}')

    # Save model
    logger.info('Saving model')
    save_model(model, model_dir)

    # Save metrics
    logger.info('Saving metrics')
    save_metrics(metrics, model_dir)

    logger.info('Training complete.')


if __name__ == '__main__':
    # Get args
    args = get_args()
    train_naive(args)
"""

with open(TRAIN_NAIVE_MODULE_PATH, "w") as train_naive_file:
    train_naive_file.write(train_naive_module)

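Train and register the model with Vertex AI Training

To register a new version of an existing custom model when you train it with Vertex AI Training, you need to provide the following additional arguments:

* `parent_model`: The resource name of the existing model under which to register the new version.
* `model_version_aliases`: The aliases of the model version to create.
* `model_version_description`: The description of the model version.
* `is_default_version`: Whether this model version is the default version.

Once you start the training job, it takes **~5 minutes** to complete.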
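As a hedged sketch, the four arguments listed above could be gathered into a dictionary and unpacked into the job's `run` call. The helper name `build_version_kwargs`, the model ID, the aliases, and the description are illustrative assumptions. The final call is left as a comment because it needs a configured Google Cloud project.

```python
def build_version_kwargs(parent_model, aliases, description, is_default=False):
    """Collect the version-related keyword arguments described above."""
    return {
        "parent_model": parent_model,
        "model_version_aliases": list(aliases),
        "model_version_description": description,
        "is_default_version": is_default,
    }


# Hypothetical values, for illustration only.
version_kwargs = build_version_kwargs(
    parent_model="my-existing-model-id",
    aliases=["experimental", "challenger"],
    description="A Naive Bayes text classifier",
)
print(version_kwargs)

# With a training job object, the call might then look like:
# naive_model = naive_bayes_train_job.run(replica_count=1, **version_kwargs)
```

Registering a model directly from the job also assumes the job was created with a serving container image. This notebook instead builds the serving container and uploads the model in a later step.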

In [None]:
naive_bayes_train_job = vertex_ai.CustomTrainingJob(
    display_name=NAIVE_TRAIN_JOB_NAME,
    script_path=TRAIN_NAIVE_MODULE_PATH,
    container_uri=NAIVE_TRAIN_CONTAINER_URI,
    requirements=NAIVE_TRAIN_REQUIREMENTS,
)

In [None]:
naive_model = naive_bayes_train_job.run(
    args=["--data_path", PROCESS_FILE_URI, "--model_dir", NAIVE_MODEL_BASE_URI],
    replica_count=1,
    machine_type="n1-standard-4",
    base_output_dir=NAIVE_MODEL_BASE_URI,
)

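Model governance with Vertex AI Model Registry

#### Initialize the Vertex AI Model Registry

To access the different versions of a Vertex AI Model resource, you can initialize an instance of the Model Registry for that model.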

In [None]:
registry = vertex_ai.models.ModelRegistry(VERTEX_AI_MODEL_ID)

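Compare model versions

Next, you use `ML.EVALUATE` to generate the BQML model evaluation metrics and compare them with the same metrics you computed for the custom model.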

In [None]:
evaluation_query = f"""
SELECT *
FROM
  ML.EVALUATE(MODEL `{BQ_DATASET}.text_logit_classifier`)
ORDER BY  roc_auc desc
LIMIT 1
"""
_, result = run_query(query=evaluation_query)
evaluation_df = result.to_dataframe().rename(index={0: "bqml_text_logit_classifier"})
evaluation_df

In [None]:
naive_metrics = read_metrics_file(NAIVE_METRICS_FILE_URI)
metrics_dict = [json.loads(naive_metrics)]
naive_metrics_df = pd.DataFrame.from_dict(metrics_dict).rename(index={0: "naive_bayes"})
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the index labels
evaluation_df = pd.concat([evaluation_df, naive_metrics_df])
evaluation_df

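### Register the `champion` model version

Based on the model evaluation, the scikit-learn naive Bayes classifier outperforms the BQML logistic regression, so it becomes the production candidate.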
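The comparison can also be done programmatically. The sketch below picks the model with the highest value of a chosen metric. The metric values are made-up placeholders, not results from this notebook, and `pick_champion` is a hypothetical helper.

```python
def pick_champion(metrics_by_model, metric="roc_auc"):
    """Return the name of the model with the highest value for `metric`."""
    return max(metrics_by_model, key=lambda name: metrics_by_model[name][metric])


# Placeholder numbers for illustration only; substitute the values from evaluation_df.
example_metrics = {
    "bqml_text_logit_classifier": {"roc_auc": 0.90, "f1_score": 0.85},
    "naive_bayes": {"roc_auc": 0.95, "f1_score": 0.88},
}
champion_name = pick_champion(example_metrics)
print(champion_name)
```

For metrics where lower is better, such as `log_loss`, `min` would replace `max`.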

Build and push the custom serving container to Artifact Registry

Build the custom serving image

In [None]:
! rm -rf $SERVING_BUILD_PATH
! mkdir $SERVING_BUILD_PATH
! mkdir $SERVING_APP_BUILD_PATH

Build the serving application

In [None]:
serve_naive_module = """
'''
This is a simple web application to serve the naive bayes model.
'''

# Libraries ------------------------------------------------------------------------------------------------------------

import logging
import os
from flask import Flask, Response, request, jsonify
import pickle
import pandas as pd


# Helpers --------------------------------------------------------------------------------------------------------------

def get_probabilities(model_classes, probabilities):
    proba_classes = []
    for probabilities_list in probabilities:
      proba_classes.append({"classes": model_classes, "scores": probabilities_list})
    return proba_classes


# App ------------------------------------------------------------------------------------------------------------------

# Initialize the app
app = Flask(__name__)
app.logger.setLevel(logging.INFO)

# Load the model
app.logger.info("Loading model...")
with open('./model/model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)
app.logger.info("Model loaded.")

# classes = model.classes_
classes = model.classes_.tolist()


@app.route(os.environ['AIP_HEALTH_ROUTE'], methods=['GET'])
def health():
    '''
    A health check endpoint.
    '''
    app.logger.info("Health check")
    return Response(response='OK', status=200)


@app.route(os.environ['AIP_PREDICT_ROUTE'], methods=['POST'])
def predict():
    '''
    A predict endpoint.
    '''
    app.logger.info("Predict")

    # Get instances
    instances_dict = request.get_json()["instances"]

    # Generate predictions
    instances_df = pd.DataFrame.from_records(instances_dict)
    probabilities = model.predict_proba(instances_df.iloc[:, 0])

    # Format predictions
    fmt_probabilities = get_probabilities(classes, probabilities.tolist())

    return jsonify({"predictions": fmt_probabilities})


if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9999)
"""

with open(SERVE_NAIVE_MODULE_PATH, "w") as serve_naive_file:
    serve_naive_file.write(serve_naive_module)

#### Copy the model

In [None]:
!gsutil cp -r $NAIVE_MODEL_URI $SERVING_APP_BUILD_PATH

Create the `requirements` file

In [None]:
serve_requirements = """
flask==2.2.2
gunicorn==20.1.0
numpy==1.22.4
pandas==1.4.3
scikit-learn==0.23.1
"""

with open(SERVE_REQUIREMENTS_PATH, "w") as serve_requirements_file:
    serve_requirements_file.write(serve_requirements)

Create the `Dockerfile`

In [None]:
serve_dockerfile = """
FROM python:3.8-slim

# Update pip
RUN pip3 install --upgrade pip

# Install requirements
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Create app folder and copy app files
RUN mkdir /app
COPY app /app
WORKDIR /app

# Run app
EXPOSE 9999
CMD ["gunicorn", "main:app", "--timeout=0", "--preload", \
     "--workers=1", "--threads=4", "--bind=0.0.0.0:9999"]
"""

with open(SERVE_DOCKERFILE_PATH, "w") as serve_dockerfile_file:
    serve_dockerfile_file.write(serve_dockerfile)

Build and push the custom image

In [None]:
!gcloud builds submit --tag $SERVING_NAIVE_RUNTIME_CONTAINER_IMAGE $SERVING_BUILD_PATH --machine-type=N1_HIGHCPU_32 --timeout=900s --verbosity=info

Register the model

In [None]:
naive_model = vertex_ai.Model.upload(
    parent_model=VERTEX_AI_MODEL_ID,
    is_default_version=False,
    version_aliases=["experimental", "challenger", "custom-training", "naive-bayes"],
    version_description="A Naive Bayes text classifier",
    serving_container_image_uri=SERVING_NAIVE_RUNTIME_CONTAINER_IMAGE,
    serving_container_health_route="/health",
    serving_container_predict_route="/predict",
    serving_container_ports=[9999],
    labels={"created_by": "inardini", "team": "advocacy"},
)

List model versions

You can list all model versions with the `list_versions` method.

In [None]:
versions = registry.list_versions()
for version in versions:
    version_id = version.version_id
    version_created_time = dt.datetime.fromtimestamp(
        version.version_create_time.timestamp()
    ).strftime("%m/%d/%Y %H:%M:%S")
    version_aliases = version.version_aliases
    print(
        "\n",
        f"Model version {version_id} was created at {version_created_time} with aliases {version_aliases}",
    )

Get all information about the `champion` model version

To get all information about your `champion` model, you can use the `get_version_info` method.

In [None]:
CHAMPION_VERSION_ID = versions[-1].version_id

In [None]:
champion_model_version_info = registry.get_version_info(CHAMPION_VERSION_ID)
champion_model_version_info_df = pd.DataFrame(
    champion_model_version_info,
    columns=["model_version"],
    index=[
        "version_id",
        "created_at",
        "updated_at",
        "model_display_name",
        "model_resource_name",
        "version_aliases",
        "version_description",
    ],
)
champion_model_version_info_df

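### Set the champion model to `production` with `default` status

To update the aliases and move the model from `experimental` to `production`, the Vertex AI SDK provides the `add_version_aliases` and `remove_version_aliases` methods.

Note that these aliases follow the online experimentation phase described in the [Practitioners Guide to MLOps: A framework for continuous delivery and automation of machine learning](https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf) whitepaper.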
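To make the promotion step concrete, here is a plain-Python model of what the alias update does to a version's alias list. It only mimics the bookkeeping: `promote` is a hypothetical helper and does not call the Vertex AI SDK.

```python
def promote(aliases, remove, add):
    """Return a new alias list with `remove` dropped and `add` appended once."""
    to_remove = set(remove)
    kept = [alias for alias in aliases if alias not in to_remove]
    for alias in add:
        if alias not in kept:
            kept.append(alias)
    return kept


# Aliases given to the naive Bayes version when it was uploaded.
before = ["experimental", "challenger", "custom-training", "naive-bayes"]
after = promote(
    before,
    remove=["experimental", "challenger"],
    add=["default", "production"],
)
print(after)
```

Descriptive aliases such as `custom-training` survive the promotion, while the lifecycle aliases are swapped.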

In [None]:
registry.remove_version_aliases(
    ["experimental", "challenger"], version=CHAMPION_VERSION_ID
)
registry.add_version_aliases(["default", "production"], version=CHAMPION_VERSION_ID)

### Deploy the `champion` model

Finally, you retrieve the production-ready champion model and deploy it to a Vertex AI endpoint.

In [None]:
champion_model = registry.get_model(version="production")

Create the endpoint

In [None]:
endpoint = vertex_ai.Endpoint.create(
    display_name=ENDPOINT_NAME,
    project=PROJECT_ID,
    location=REGION,
)

Deploy the champion model

In [None]:
endpoint.deploy(
    model=champion_model,
    deployed_model_display_name=DEPLOYED_MODEL_NAME,
    machine_type="n1-standard-8",
)

### Generate predictions

In [None]:
text = """The singer to headline the event halftime show: 'It's on"""  # @param {type:"string"}

In [None]:
instances = [{"text": text}]
predictions = endpoint.predict(instances)
print(predictions)
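For each instance, the serving application returns a dictionary with `classes` and `scores`. A small helper such as the hypothetical `top_class` below could reduce one of those dictionaries to a single label. The sample payload, including its class names, is invented for illustration.

```python
def top_class(prediction):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(prediction["classes"], prediction["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Invented payload in the shape produced by the serving app's get_probabilities helper.
sample_prediction = {
    "classes": ["business", "sport", "tech"],
    "scores": [0.1, 0.7, 0.2],
}
label, score = top_class(sample_prediction)
print(label, score)
```

With the response above, `top_class(predictions.predictions[0])` would apply the same logic to the first returned instance.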

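### Final thoughts

You can also upload models trained outside this workflow. See the documentation examples and this [sample notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/get_started_with_model_registry.ipynb).

Cleaning up

To clean up all Google Cloud resources used in this project, you can delete the [Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created.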

In [None]:
endpoint.undeploy_all()

endpoint.delete()

drop_model_query = f"DROP MODEL `{PROJECT_ID}.{BQ_DATASET}.text_logit_classifier`"
run_query(drop_model_query)

versions = registry.list_versions()
for version in versions:
    if "default" not in version.version_aliases:
        registry.delete_version(version=version.version_id)
    else:
        model = registry.get_model(version="default")
        model.delete()

naive_bayes_train_job.delete()

! gcloud dataproc batches delete $PREPROCESS_BATCH_ID --region=$REGION --quiet

! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET

! gcloud artifacts repositories delete $REPO_NAME --location=$REGION --quiet

delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

! rm -rf $DATA_PATH $SRC_PATH $BUILD_PATH $CONFIG_PATH