In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Train, tune, and deploy a PyTorch text sentiment classification model on Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/model_based_llm_evaluation/autosxs_check_alignment_against_human_preference_data.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Ftraining%2Fpytorch-text-sentiment-classification-custom-train-deploy.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/pytorch-text-sentiment-classification-custom-train-deploy.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/pytorch-text-sentiment-classification-custom-train-deploy.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook demonstrates how to build and deploy a text sentiment classification model by fine-tuning a pretrained [BERT](https://huggingface.co/bert-base-cased) model with Vertex AI and PyTorch. The example is inspired by the Hugging Face [Token_Classification](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb) and [Run_Glue](https://github.com/huggingface/transformers/blob/v2.5.0/examples/run_glue.py) notebooks.

You can find more details about the model on the [Hugging Face Hub](https://huggingface.co/bert-base-cased). For more notebooks that use state-of-the-art PyTorch, TensorFlow, and JAX, explore [Hugging Face Notebooks](https://huggingface.co/transformers/notebooks.html).

Learn more about [custom training](https://cloud.google.com/vertex-ai/docs/training/custom-training).

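### Objective

In this tutorial, you learn how to build, train, tune, and deploy a PyTorch model on [Vertex AI](https://cloud.google.com/vertex-ai). You focus mainly on the Vertex AI support for custom model training and deployment.


This tutorial uses the following Google Cloud ML services:

- Vertex AI `Workbench`
- Vertex AI `Training` (custom Python package training)
- Vertex AI `Model Registry`
- Vertex AI `Endpoints`

The steps performed include:

- Create a training package for the text classification model.
- Train the model with custom training on Vertex AI.
- Check the created model artifacts.
- Create a custom container for predictions.
- Deploy the trained model to a Vertex AI endpoint using the custom container.
- Send online prediction requests to the deployed model and validate the responses.
- Clean up the resources created in this notebook.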

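### Dataset

The dataset used in this tutorial is the [Happy Moments dataset](https://www.kaggle.com/ritresearch/happydb) from [Kaggle Datasets](https://www.kaggle.com/ritresearch/happydb). The version of the dataset used in this tutorial is stored in a public Cloud Storage bucket.

More information about the dataset is available on the [HappyDB website](https://rit-public.github.io/HappyDB/).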

### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI
- Cloud Storage
- Cloud Build
- Artifact Registry

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [Cloud Build pricing](https://cloud.google.com/build/pricing), and [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Get started

### Install Vertex AI SDK for Python and other required packages

Install the packages required to run this notebook.

In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>

### Authenticate your notebook environment on Google Colab

Authenticate your environment on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

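### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources they create, you generate a UUID for each session and append it to the names of the resources you create in this tutorial.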

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

Only if your bucket doesn't already exist: run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $LOCATION -p $PROJECT_ID $BUCKET_URI

### Import libraries

Import the Vertex AI SDK for Python and the other required Python libraries.

In [None]:
import base64
import json
import os
import sys

from google.cloud import aiplatform
from google.protobuf.json_format import MessageToDict

### Define constants

Define the constants required for this tutorial.

In [None]:
# Name for the package application / model / repository
APP_NAME = "finetuned-bert-classifier"

# URI for the pre-built container for custom training
PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest"
)

# Name of the folder where the python package needs to be stored
PYTHON_PACKAGE_APPLICATION_DIR = "python_package"

# Path to the source distribution tar of the python package
source_package_file_name = f"{PYTHON_PACKAGE_APPLICATION_DIR}/dist/trainer-0.1.tar.gz"

# GCS path where the python package is stored
python_package_gcs_uri = (
    f"{BUCKET_URI}/pytorch-on-gcp/{APP_NAME}/train/python_package/trainer-0.1.tar.gz"
)

# Module name for training application
python_module_name = "trainer.task"

# Training job's display name
JOB_NAME = f"{APP_NAME}-pytorch-pkg-train-{UUID}"

# Set training job's machine-type
TRAIN_MACHINE_TYPE = "n1-standard-8"
# Set training job's accelerator type
TRAIN_ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
# Set no. of h/w accelerators needed for the training job
TRAIN_ACCELERATOR_COUNT = 1

# Set the name of the container image for prediction
CUSTOM_PREDICTOR_IMAGE_URI = f"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{APP_NAME}/pytorch_predict_{APP_NAME}:latest"

# Set the version for model-deployment
VERSION = 1
# Set the model display name
model_display_name = f"{APP_NAME}-v{VERSION}"
# Set the model description
model_description = "PyTorch based text classifier with custom container"

# Set the health route for prediction container
health_route = "/ping"
# Set the predict route for prediction container
predict_route = f"/predictions/{APP_NAME}"
# Set the serving container ports for prediction
serving_container_ports = [7080]

# Set the display name for endpoint
endpoint_display_name = f"{APP_NAME}-endpoint"
# Set the machine-type for deployment
DEPLOY_MACHINE_TYPE = "n1-standard-4"

### Initialize Vertex AI SDK for Python

In [None]:
# Pass the location explicitly so Vertex AI resources are created in LOCATION
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

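### Custom training on Vertex AI

__Recommended training application structure__

You can structure your training application in any way you like. However, the [following structure](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container#structure) is commonly used in Vertex AI samples, and organizing your project in a similar way makes it easier to follow the samples.

The following `python_package` directory structure shows a sample packaging approach.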

```
├── python_package
│   ├── setup.py
│   └── trainer
│       ├── __init__.py
│       ├── experiment.py
│       ├── metadata.py
│       ├── model.py
│       ├── task.py
│       └── utils.py
└── pytorch-text-sentiment-classification-custom-train-deploy.ipynb    --> This notebook
```

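* The main project directory contains your `setup.py` file with the dependencies.
* Inside the `trainer` directory:
   - `task.py` - The main application module, which initializes and parses the task arguments (hyperparameters). It also serves as the entry point to the trainer.
   - `model.py` - Includes a function that creates a model with a sequence classification head from a pretrained model.
   - `experiment.py` - Runs the model training and evaluation experiment, and exports the final model.
   - `metadata.py` - Defines the metadata for the classification task, such as the pretrained model, dataset names, and target labels.
   - `utils.py` - Includes utility functions, such as functions for reading data and saving the model to a Cloud Storage bucket.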
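
Before building the source distribution, it can be useful to confirm that the layout above is complete. The following is a minimal sketch, not part of the original tutorial: the helper name `missing_package_files` and the `EXPECTED_FILES` list are illustrative assumptions, and the demo runs against a throwaway temporary directory rather than the real `python_package` folder.

```python
import tempfile
from pathlib import Path

# Files the tutorial's training package is expected to contain.
EXPECTED_FILES = [
    "setup.py",
    "trainer/__init__.py",
    "trainer/experiment.py",
    "trainer/metadata.py",
    "trainer/model.py",
    "trainer/task.py",
    "trainer/utils.py",
]


def missing_package_files(package_dir):
    """Return the expected files that are absent under package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstrate on a temporary directory that only has part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "trainer").mkdir()
    (Path(tmp) / "setup.py").touch()
    (Path(tmp) / "trainer" / "__init__.py").touch()
    demo_missing = missing_package_files(tmp)

print(demo_missing)
```

Calling `missing_package_files("python_package")` after the `%%writefile` cells have run should return an empty list if every file was created.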

### Create the files required for the Python package

Create the directories for the Python package.

In [None]:
!mkdir -p python_package/trainer
!mkdir -p python_package/scripts
!touch ./python_package/trainer/__init__.py

Create the `model.py` file, which returns the specified pretrained model.

In [None]:
%%writefile ./python_package/trainer/model.py

from transformers import AutoModelForSequenceClassification
from trainer import metadata

def create(num_labels):
    """create the model by loading a pretrained model or define your 
    own

    Args:
      num_labels: number of target labels
    """
    # Create the model, loss function, and optimizer
    model = AutoModelForSequenceClassification.from_pretrained(
        metadata.PRETRAINED_MODEL_NAME,
        num_labels=num_labels
    )
    
    return model

Create the `utils.py` file, which defines utility functions for data loading, preprocessing, and saving the model.

In [None]:
%%writefile ./python_package/trainer/utils.py

import os
import datetime
import pandas as pd

from google.cloud import storage

from transformers import AutoTokenizer
# Import only the names that are used; `load_metric` is no longer available in recent `datasets` releases
from datasets import DatasetDict, Dataset
from trainer import metadata


def preprocess_function(examples):
    tokenizer = AutoTokenizer.from_pretrained(
        metadata.PRETRAINED_MODEL_NAME,
        use_fast=True,
    )
    
    # Tokenize the texts
    tokenizer_args = (
        (examples['text'],) 
    )
    result = tokenizer(*tokenizer_args, 
                       padding='max_length', 
                       max_length=metadata.MAX_SEQ_LENGTH, 
                       truncation=True)
    
    # We can extract this automatically but the unique() method of the dataset
    # is not reporting the label -1 which shows up in the pre-processing
    # hence the additional -1 term in the dictionary
    
    label_to_id = metadata.TARGET_LABELS
    
    # Map labels to IDs (not necessary for GLUE tasks)
    if label_to_id is not None and "label" in examples:
        result["label"] = [label_to_id[l] for l in examples["label"]]

    return result


def load_data(args):
    """Loads the data into two different data loaders. (Train, Test)

        Args:
            args: arguments passed to the python script
    """
    # dataset loading repeated here to make this cell idempotent
    # since we are over-writing datasets variable
    
    df_train = pd.read_csv(metadata.TRAIN_DATA)
    df_test = pd.read_csv(metadata.TEST_DATA)
    
    dataset = DatasetDict({"train": Dataset.from_pandas(df_train),"test": Dataset.from_pandas(df_test)})

    dataset = dataset.map(preprocess_function, 
                          batched=True, 
                          load_from_cache_file=True)

    train_dataset, test_dataset = dataset["train"], dataset["test"]

    return train_dataset, test_dataset


def save_model(args):
    """Saves the model to Google Cloud Storage or local file system

    Args:
      args: contains name for saved model.
    """
    scheme = 'gs://'
    if args.job_dir.startswith(scheme):
        job_dir = args.job_dir.split("/")
        bucket_name = job_dir[2]
        object_prefix = "/".join(job_dir[3:]).rstrip("/")

        if object_prefix:
            model_path = '{}/{}'.format(object_prefix, args.model_name)
        else:
            model_path = '{}'.format(args.model_name)

        bucket = storage.Client().bucket(bucket_name)    
        local_path = os.path.join("/tmp", args.model_name)
        files = [f for f in os.listdir(local_path) if os.path.isfile(os.path.join(local_path, f))]
        for file in files:
            local_file = os.path.join(local_path, file)
            blob = bucket.blob("/".join([model_path, file]))
            blob.upload_from_filename(local_file)
        print(f"Saved model files in gs://{bucket_name}/{model_path}")
    else:
        print(f"Saved model files at {os.path.join('/tmp', args.model_name)}")
        print(f"To save model files in GCS bucket, please specify job_dir starting with gs://")
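
The path handling inside `save_model` can be tried out in isolation. The sketch below mirrors how the function splits a `gs://` URI into a bucket name and an object path; the helper name `split_gcs_uri` and the example bucket are assumptions made for illustration, and no Cloud Storage call is made.

```python
def split_gcs_uri(uri, model_name):
    """Split a gs:// URI into (bucket_name, object_path_for_model)."""
    if not uri.startswith("gs://"):
        raise ValueError("expected a URI starting with gs://")
    parts = uri.split("/")  # ['gs:', '', '<bucket>', '<prefix parts>', ...]
    bucket_name = parts[2]
    object_prefix = "/".join(parts[3:]).rstrip("/")
    # Without a prefix, the model folder sits at the root of the bucket.
    model_path = f"{object_prefix}/{model_name}" if object_prefix else model_name
    return bucket_name, model_path


print(split_gcs_uri("gs://example-bucket/jobs/run-1/model/", "finetuned-bert-classifier"))
print(split_gcs_uri("gs://example-bucket", "finetuned-bert-classifier"))
```

A trailing slash in the job directory is stripped before the model name is appended, so both `.../model` and `.../model/` lead to the same object path.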


Create the `metadata.py` file, which defines the constants used in the training application.

In [None]:
%%writefile ./python_package/trainer/metadata.py

# Task type can be either 'classification', 'regression', or 'custom'.
# This is based on the target feature in the dataset.
TASK_TYPE = 'classification'

# Dataset paths
    
TRAIN_DATA = "gs://cloud-samples-data/ai-platform-unified/datasets/text/happydb/happydb_train.csv"
TEST_DATA = "gs://cloud-samples-data/ai-platform-unified/datasets/text/happydb/happydb_test.csv"

# pre-trained model name
PRETRAINED_MODEL_NAME = 'bert-base-cased'

# List of the class values (labels) in a classification dataset.
TARGET_LABELS = {"leisure": 0, "exercise":1, "enjoy_the_moment":2, "affection":3,"achievement":4, "nature":5, "bonding":6}


# maximum sequence length
MAX_SEQ_LENGTH = 128

Create the `experiment.py` file, which defines the functions for hyperparameter tuning and training.

In [None]:
%%writefile ./python_package/trainer/experiment.py

import os
import numpy as np
import hypertune

from transformers import (
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
    default_data_collator,
    TrainerCallback
)

from trainer import model, metadata, utils


class HPTuneCallback(TrainerCallback):
    """
    A custom callback class that reports a metric to hypertuner
    at the end of each epoch.
    """
    
    def __init__(self, metric_tag, metric_value):
        super(HPTuneCallback, self).__init__()
        self.metric_tag = metric_tag
        self.metric_value = metric_value
        self.hpt = hypertune.HyperTune()
        
    def on_evaluate(self, args, state, control, **kwargs):
        print(f"HP metric {self.metric_tag}={kwargs['metrics'][self.metric_value]}")
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag=self.metric_tag,
            metric_value=kwargs['metrics'][self.metric_value],
            global_step=state.epoch)


def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}


def train(args, model, train_dataset, test_dataset):
    """Create the training loop to load pretrained model and tokenizer and 
    start the training process

    Args:
      args: read arguments from the runner to set training hyperparameters
      model: The neural network that you are training
      train_dataset: The training dataset
      test_dataset: The test dataset for evaluation
    """
    
    # initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        metadata.PRETRAINED_MODEL_NAME,
        use_fast=True,
    )
    
    # set training arguments
    training_args = TrainingArguments(
        evaluation_strategy="epoch",
        learning_rate=args.learning_rate,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        num_train_epochs=args.num_epochs,
        weight_decay=args.weight_decay,
        output_dir=os.path.join("/tmp", args.model_name)
    )
    
    # initialize our Trainer
    trainer = Trainer(
        model,
        training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        data_collator=default_data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    
    # add hyperparameter tuning callback to report metrics when enabled
    if args.hp_tune == "y":
        trainer.add_callback(HPTuneCallback("accuracy", "eval_accuracy"))
    
    # training
    trainer.train()
    
    return trainer


def run(args):
    """Load the data, train, evaluate, and export the model for serving and
     evaluating.

    Args:
      args: experiment parameters.
    """
    # Open our dataset
    train_dataset, test_dataset = utils.load_data(args)

    label_list = train_dataset.unique("label")
    num_labels = len(label_list)
    
    # Create the model, loss function, and optimizer
    text_classifier = model.create(num_labels=num_labels)
    
    # Train / Test the model
    trainer = train(args, text_classifier, train_dataset, test_dataset)

    metrics = trainer.evaluate(eval_dataset=test_dataset)
    trainer.save_metrics("all", metrics)

    # Export the trained model
    trainer.save_model(os.path.join("/tmp", args.model_name))

    # Save the model to GCS
    if args.job_dir:
        utils.save_model(args)
    else:
        print(f"Saved model files at {os.path.join('/tmp', args.model_name)}")
        print(f"To save model files in GCS bucket, please specify job_dir starting with gs://")
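
To make the metric computed by `compute_metrics` concrete, here is a small dependency-free sketch of the same idea: take the highest-scoring class per row and compare it with the label. The helper names and the toy logits are illustrative assumptions; the tutorial's own version uses NumPy. The Hugging Face `Trainer` reports evaluation metrics with an `eval_` prefix, which is why the callback reads `eval_accuracy`.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)


# Three examples, three classes; the prediction matches the label for the first two.
toy_logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.9, 0.8, 0.1]]
toy_labels = [0, 1, 2]
print(accuracy(toy_logits, toy_labels))
```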


Create the `task.py` file, which is the main file that runs the training application.

In [None]:
%%writefile ./python_package/trainer/task.py

import argparse
import os

from trainer import experiment


def get_args():
    """Define the task arguments with the default values.

    Returns:
        experiment parameters
    """
    args_parser = argparse.ArgumentParser()


    # Experiment arguments
    args_parser.add_argument(
        '--batch-size',
        help='Batch size for each training and evaluation step.',
        type=int,
        default=16)
    args_parser.add_argument(
        '--num-epochs',
        help='Number of training epochs (full passes over the training data).',
        default=1,
        type=int,
    )
    args_parser.add_argument(
        '--seed',
        help='Random seed (default: 42)',
        type=int,
        default=42,
    )

    # Estimator arguments
    args_parser.add_argument(
        '--learning-rate',
        help='Learning rate value for the optimizers.',
        default=2e-5,
        type=float)
    args_parser.add_argument(
        '--weight-decay',
        help="""
      Weight decay coefficient applied by the AdamW optimizer that the
      Hugging Face Trainer uses. If set to 0, no weight decay is applied.
      """,
        default=0.01,
        type=float)

    # Enable hyperparameter tuning
    args_parser.add_argument(
        '--hp-tune',
        default="n",
        help='Enable hyperparameter tuning. Valid values are: "y" - enable, "n" - disable')
    
    # Saved model arguments
    args_parser.add_argument(
        '--job-dir',
        default=os.getenv('AIP_MODEL_DIR'),
        help='GCS location to export models')
    args_parser.add_argument(
        '--model-name',
        default="finetuned-bert-classifier",
        help='The name of your saved model')

    return args_parser.parse_args()


def main():
    """Setup / Start the experiment
    """
    args = get_args()
    print(args)
    experiment.run(args)


if __name__ == '__main__':
    main()
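
The way `task.py` turns command-line flags into attributes can be checked locally with a trimmed-down parser. This is a sketch that copies only a few of the flags defined above; `build_parser` and the example model name are hypothetical.

```python
import argparse


def build_parser():
    """A trimmed-down parser with a few of the flags that task.py defines."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--num-epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=2e-5)
    parser.add_argument("--model-name", default="finetuned-bert-classifier")
    return parser


# The same kind of list that is later passed as `args` to the training job.
cli_args = ["--num-epochs", "2", "--model-name", "my-classifier"]
parsed = build_parser().parse_args(cli_args)

# Hyphens in flag names become underscores in attribute names.
print(parsed.num_epochs, parsed.batch_size, parsed.model_name)
```

Flags that are not supplied fall back to their defaults, so a training job only needs to pass the values it wants to override.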


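The following is the `setup.py` file for the training application. The `find_packages()` function in `setup.py` includes the `trainer` directory in the package because it contains an `__init__.py` file, which tells [Python Setuptools](https://setuptools.readthedocs.io/en/latest/) to treat the directory as a package.

In `setup.py`, you also specify the Python packages that the training application depends on, such as `transformers`, `datasets`, `cloudml-hypertune`, and `tqdm`.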

In [None]:
%%writefile ./{PYTHON_PACKAGE_APPLICATION_DIR}/setup.py

from setuptools import find_packages
from setuptools import setup


REQUIRED_PACKAGES = [
    'transformers',
    'datasets',
    'tqdm',
    'cloudml-hypertune'
]

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Vertex AI | Training | PyTorch | Text Classification | Python Package'
)

Run the following command to create a source distribution.

In [None]:
!cd {PYTHON_PACKAGE_APPLICATION_DIR} && python3 setup.py sdist --formats=gztar
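
A source distribution is a gzipped tar archive, so its contents can be listed with the standard `tarfile` module. The sketch below builds a tiny stand-in archive in memory so that it runs without `setup.py`; the helper names and the placeholder member names are assumptions for illustration, not output from the real build.

```python
import io
import tarfile


def list_sdist_members(tar_bytes):
    """Return the member names of a gzipped tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        return archive.getnames()


def make_demo_sdist():
    """Build a tiny stand-in archive so the example runs on its own."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("trainer-0.1/setup.py", "trainer-0.1/trainer/task.py"):
            data = b"# placeholder\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


members = list_sdist_members(make_demo_sdist())
print(members)
```

For the real archive, `tarfile.open("python_package/dist/trainer-0.1.tar.gz").getnames()` gives the same kind of listing.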

Now upload the source distribution with the training application to the Cloud Storage bucket.

In [None]:
!gsutil cp {source_package_file_name} {python_package_gcs_uri}

Verify that the source distribution exists in the Cloud Storage bucket.

In [None]:
!gsutil ls -l {python_package_gcs_uri}

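### Run a custom job on Vertex AI using a prebuilt container

In this notebook, you use Hugging Face Datasets and PyTorch to fine-tune a transformer model from the Hugging Face Transformers library for a sentiment analysis task. You don't need to build a PyTorch environment from scratch to run the training application, because Vertex AI provides [prebuilt containers](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#available_container_images).

Vertex AI prebuilt containers are Docker container images that you can use for custom training. They include common dependencies used in training code, based on the machine learning framework and framework version.

You use the [prebuilt container for PyTorch](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch) and the packaged training application to run the training job on Vertex AI.

Configure a [Custom Job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job) with the [PyTorch prebuilt container](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch) and the training code packaged as a Python source distribution.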

In [None]:
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=JOB_NAME,
    python_package_gcs_uri=python_package_gcs_uri,
    python_module_name=python_module_name,
    container_uri=PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI,
)

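Run the custom training job with the following parameters:
- `machine_type`: The type of machine on which the job runs.
- `accelerator_type`: The type of hardware accelerator used to run the job. One of _ACCELERATOR_TYPE_UNSPECIFIED_, _NVIDIA_TESLA_K80_, _NVIDIA_TESLA_P100_, _NVIDIA_TESLA_V100_, _NVIDIA_TESLA_P4_, _NVIDIA_TESLA_T4_, _NVIDIA_TESLA_A100_.
- `accelerator_count`: The number of accelerators to attach to a worker replica.
- `replica_count`: The number of worker replicas.
- `args`: Command-line arguments to pass to the Python script.

Learn more about [custom Python package training](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.CustomPythonPackageTrainingJob) on Vertex AI.

*Note*: This training job may take more than 24 hours.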

In [None]:
if os.getenv("IS_TESTING"):
    sys.exit(0)

In [None]:
training_args = ["--num-epochs", "2", "--model-name", APP_NAME]

model = job.run(
    replica_count=1,
    machine_type=TRAIN_MACHINE_TYPE,
    accelerator_type=TRAIN_ACCELERATOR_TYPE,
    accelerator_count=TRAIN_ACCELERATOR_COUNT,
    args=training_args,
)

After the job completes successfully, verify that the training code wrote the model artifacts to Cloud Storage.

In [None]:
job_response = MessageToDict(job._gca_resource._pb)
GCS_MODEL_ARTIFACTS_URI = job_response["trainingTaskInputs"]["baseOutputDirectory"][
    "outputUriPrefix"
]
print(f"Model artifacts are available at {GCS_MODEL_ARTIFACTS_URI}")

In [None]:
!gsutil ls -lr $GCS_MODEL_ARTIFACTS_URI/

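## Deploy

Deploying a PyTorch model on Vertex AI requires a custom container that serves online predictions on a Vertex AI endpoint. You deploy a container running PyTorch's TorchServe tool to serve predictions from the fine-tuned transformer model for the sentiment analysis task. You can then use Vertex AI online prediction to classify the sentiment of input texts.

Deploying a model on Vertex AI with a custom container requires a Docker container image that runs an HTTP server application, which in this case is TorchServe. Learn more about the prediction container requirements on Vertex AI.

Essentially, deploying a PyTorch model on Vertex AI involves the following steps:

1. Package the trained model files, including a default or custom handler, into an archive file by using the Torch model archiver.
2. Build a custom container compatible with Vertex AI that serves the model with TorchServe.
3. Upload the model with the custom container image as a Vertex AI model resource to serve predictions.
4. Create a Vertex AI endpoint and deploy the model resource.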

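### Create a custom model handler to handle prediction requests

Before input text is passed to the fine-tuned transformer model, it needs to be preprocessed. After the model generates a prediction, the output also needs some post-processing to map it to the underlying class label and provide its probability (or confidence score).

To include such preprocessing and post-processing steps, you create a custom handler script that is packaged with the model files. Later, at deployment time, TorchServe runs this script.

The custom handler script does the following:

- Preprocesses the input text before it is sent to the model for inference
- Customizes how the model is invoked at inference time
- Post-processes the model output before sending back the response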
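
The post-processing step described above can be sketched without any model: apply a softmax to the raw logits, pick the most likely class, and attach its probability. This is an illustrative sketch only. The function `label_with_confidence` and the logits are made up, the label names mirror `TARGET_LABELS` in `metadata.py`, and the handler defined in this tutorial returns the class name without a confidence score.

```python
import math

# Index-to-name mapping for the seven HappyDB categories used in this tutorial.
INDEX_TO_NAME = {
    "0": "leisure",
    "1": "exercise",
    "2": "enjoy_the_moment",
    "3": "affection",
    "4": "achievement",
    "5": "nature",
    "6": "bonding",
}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Hypothetical post-processing step: top class name plus its probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {"label": INDEX_TO_NAME[str(best)], "confidence": probabilities[best]}


# Made-up logits in which class 4 ("achievement") scores highest.
example = label_with_confidence([0.1, -0.3, 0.2, 0.0, 2.5, -1.0, 0.4])
print(example["label"], round(example["confidence"], 3))
```

A handler could return a dictionary like this from `postprocess` if the client needs the score as well as the label.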

Learn more about defining custom handlers in the TorchServe documentation.

Create a directory for the handler script that processes predictions.

In [None]:
!mkdir -p predictor

Create a file named `custom_handler.py` that handles prediction requests at deployment time.

In [None]:
%%writefile predictor/custom_handler.py

import os
import json
import logging

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class TransformersClassifierHandler(BaseHandler):
    """
    The handler takes an input string and returns the classification text 
    based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(TransformersClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """ Loads the model.pt file and initialized the model object.
        Instantiates Tokenizer for preprocessor to use
        Loads labels to name mapping file for post-processing inference response
        """
        self.manifest = ctx.manifest

        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

        # Read model serialize/pt file
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.pt or pytorch_model.bin file")
        
        # Load model
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.to(self.device)
        self.model.eval()
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        
        # Ensure to use the same tokenizer used during training
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

        # Read the mapping file, index to object name
        mapping_file_path = os.path.join(model_dir, "index_to_name.json")

        if os.path.isfile(mapping_file_path):
            with open(mapping_file_path) as f:
                self.mapping = json.load(f)
        else:
            logger.warning('Missing the index_to_name.json file. Inference output defaults.')
            self.mapping = {"0": "Negative",  "1": "Positive"}

        self.initialized = True

    def preprocess(self, data):
        """ Preprocessing input request by tokenizing
            Extend with your own preprocessing steps as needed
        """
        text = data[0].get("data")
        if text is None:
            text = data[0].get("body")
        sentences = text.decode('utf-8')
        logger.info("Received text: '%s'", sentences)

        # Tokenize the texts
        tokenizer_args = ((sentences,))
        inputs = self.tokenizer(*tokenizer_args,
                                padding='max_length',
                                max_length=128,
                                truncation=True,
                                return_tensors = "pt")
        return inputs

    def inference(self, inputs):
        """ Predict the class of a text using a trained transformer model.
        """
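        # Pass every tokenizer output (including the attention mask) to the model,
        # because the inputs are padded to a fixed length
        inputs = {name: tensor.to(self.device) for name, tensor in inputs.items()}
        with torch.no_grad():
            prediction = self.model(**inputs)[0].argmax().item()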

        if self.mapping:
            prediction = self.mapping[str(prediction)]

        logger.info("Model predicted: '%s'", prediction)
        return [prediction]

    def postprocess(self, inference_output):
        return inference_output


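### Generate a file for the class names

For the custom handler, create the following mapping file (`index_to_name.json`), which associates the target labels with their meaningful names when formatting the prediction response.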

In [None]:
%%writefile ./predictor/index_to_name.json

{
    "0": "leisure",
    "1": "exercise",
    "2": "enjoy_the_moment",
    "3": "affection",
    "4": "achievement",
    "5": "nature",
    "6": "bonding"
}
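
The mapping file is the inverse of `TARGET_LABELS` in `metadata.py`, so it could also be generated rather than typed by hand, which keeps the two in sync. The following is a sketch under that assumption; it prints the JSON instead of writing `./predictor/index_to_name.json`.

```python
import json

# Same label-to-ID mapping as TARGET_LABELS in trainer/metadata.py.
TARGET_LABELS = {
    "leisure": 0,
    "exercise": 1,
    "enjoy_the_moment": 2,
    "affection": 3,
    "achievement": 4,
    "nature": 5,
    "bonding": 6,
}

# Invert the mapping; JSON object keys must be strings.
index_to_name = {str(index): name for name, index in TARGET_LABELS.items()}

index_to_name_json = json.dumps(index_to_name, indent=4)
print(index_to_name_json)
```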

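### Create a custom container image to serve predictions

Next, you use [Artifact Registry](https://cloud.google.com/artifact-registry) and [Cloud Build](https://cloud.google.com/build) to create the custom container image with the following steps:

#### Download model artifacts

Download the model artifacts that were saved as part of the training (or hyperparameter tuning) job from Cloud Storage to a local directory.

Verify the model artifact files in the Cloud Storage bucket.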

In [None]:
!gsutil ls -r $GCS_MODEL_ARTIFACTS_URI/model/

Copy the files from Cloud Storage to a local directory.

In [None]:
!gsutil -m cp -r $GCS_MODEL_ARTIFACTS_URI/model/ ./predictor/

In [None]:
!ls -ltrR ./predictor/model

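#### Create a Dockerfile for the image

Create a Dockerfile with TorchServe as the base image by following these steps:

- Install dependencies such as `transformers`.
- Add the model artifacts to the `/home/model-server/` directory in the container image.
- Add the custom handler script to the `/home/model-server/` directory in the container image.
- Create `/home/model-server/config.properties` to define the serving configuration (health and prediction listener ports).
- Run the [Torch model archiver](https://github.com/pytorch/serve/tree/master/model-archiver#creating-a-model-archive) to create a model archive file from the files copied into the `/home/model-server/` directory of the container image. The model archive is saved in the `/home/model-server/model-store/` directory with the name `<model-name>.mar`.
- Start the TorchServe HTTP server, which references the configuration properties and enables serving of the model.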

In [None]:
%%bash -s $APP_NAME

APP_NAME=$1

cat << EOF > ./predictor/Dockerfile

FROM pytorch/torchserve:latest-cpu

# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install transformers

USER model-server

# copy model artifacts, custom handler and other dependencies
COPY ./custom_handler.py /home/model-server/
COPY ./index_to_name.json /home/model-server/
COPY ./model/$APP_NAME/ /home/model-server/

# create torchserve configuration file
USER root
RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties
USER model-server

# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081

# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f \
  --model-name=$APP_NAME \
  --version=1.0 \
  --serialized-file=/home/model-server/pytorch_model.bin \
  --handler=/home/model-server/custom_handler.py \
  --extra-files "/home/model-server/config.json,/home/model-server/tokenizer.json,/home/model-server/training_args.bin,/home/model-server/tokenizer_config.json,/home/model-server/special_tokens_map.json,/home/model-server/vocab.txt,/home/model-server/index_to_name.json" \
  --export-path=/home/model-server/model-store

# run TorchServe HTTP server to respond to prediction requests
CMD ["torchserve", \
     "--start", \
     "--ts-config=/home/model-server/config.properties", \
     "--models", \
     "$APP_NAME=$APP_NAME.mar", \
     "--model-store", \
     "/home/model-server/model-store"]
EOF

echo "Writing ./predictor/Dockerfile"
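
The container settings in the Dockerfile have to agree with the routes and port that are passed to Vertex AI when the model is uploaded. The sketch below restates that relationship in plain Python using the values from this tutorial; the function names are hypothetical, and nothing here talks to TorchServe or Vertex AI.

```python
APP_NAME = "finetuned-bert-classifier"
INFERENCE_PORT = 7080  # port used in inference_address inside the container
MANAGEMENT_PORT = 7081


def torchserve_config(inference_port, management_port):
    """Return the config.properties entries that the Dockerfile appends."""
    return "\n".join(
        [
            "service_envelope=json",
            f"inference_address=http://0.0.0.0:{inference_port}",
            f"management_address=http://0.0.0.0:{management_port}",
        ]
    )


def vertex_serving_settings(app_name, inference_port):
    """Routes and ports that Vertex AI needs to reach the TorchServe server."""
    return {
        "health_route": "/ping",
        "predict_route": f"/predictions/{app_name}",
        "serving_container_ports": [inference_port],
    }


config_text = torchserve_config(INFERENCE_PORT, MANAGEMENT_PORT)
settings = vertex_serving_settings(APP_NAME, INFERENCE_PORT)
print(config_text)
print(settings)
```

The model name registered with `--models` in the `CMD` instruction determines the `/predictions/<model name>` path, which is why `predict_route` is built from `APP_NAME`.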

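#### Create a Docker repository

Create your own Docker repository in Artifact Registry, and push the Docker image to it to serve predictions.

1. Run the `gcloud artifacts repositories create` command to create a new Docker repository with your region and a description.

2. Run the `gcloud artifacts repositories list` command to verify that your repository was created.

`APP_NAME` is used as the name of your repository.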

In [None]:
# Create the repository in Artifact registry
! gcloud artifacts repositories create {APP_NAME} --repository-format=docker --location={LOCATION} --description="Docker repository"

# List all repositories and check your repository
! gcloud artifacts repositories list

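#### Build the Docker image with the image path as a tag

Next, you use Cloud Build to build the Docker image in the repository you created. Cloud Build locates the repository from the path provided in the tag.

Learn more about [building and pushing a Docker image with Cloud Build](https://cloud.google.com/build/docs/build-push-docker-image).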

In [None]:
!gcloud builds submit --region={LOCATION} --tag=$CUSTOM_PREDICTOR_IMAGE_URI ./predictor

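### Deploy the serving container to Vertex AI

Next, you create a model resource on Vertex AI and deploy the model to a Vertex AI endpoint. You must deploy a model to an endpoint to serve online predictions. The deployed model runs the custom container image to serve predictions.

#### Create a Vertex AI model resource

Create a Vertex AI model resource with the created model artifacts and container image.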

In [None]:
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=CUSTOM_PREDICTOR_IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

Create a Vertex AI endpoint to which you deploy the registered Vertex AI model.

In [None]:
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

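#### Deploy the model to the endpoint

Deploying a model associates physical resources with the model so that it can serve online predictions with low latency.

**Note:** Deploying the resources takes a few minutes.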

In [None]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=model_display_name,
    machine_type=DEPLOY_MACHINE_TYPE,
    sync=True,
)

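### Send online prediction requests

Now, use the Vertex AI SDK for Python to invoke the endpoint where the model is deployed and get predictions for some test instances.

#### Format the input for online prediction

This notebook uses [TorchServe's KServe-based inference API](https://pytorch.org/serve/inference_api.html#kserve-inference-api), which is also a [format compatible with Vertex AI prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#prediction). For online prediction requests, format the prediction input instances as JSON with base64 encoding, as follows: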

```
[
    {
        "data": {
            "b64": "<base64-encoded string>"
        }
    }
]
```
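
The encoding can be checked locally with a round trip: wrap the text in the structure shown above, then decode it again. The helper names are hypothetical. The `{"instances": [...]}` wrapper is the request body format that Vertex AI online prediction expects, and the decoding step only approximates what happens on the serving side before the handler calls `.decode('utf-8')` on the received bytes.

```python
import base64
import json


def build_instance(text):
    """Wrap UTF-8 text in the {"data": {"b64": ...}} structure."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("utf-8")
    return {"data": {"b64": encoded}}


def decode_instance(instance):
    """Recover the original text from a formatted instance."""
    return base64.b64decode(instance["data"]["b64"]).decode("utf-8")


instance = build_instance("I woke up this morning to birds chirping.")
request_body = json.dumps({"instances": [instance]})
print(request_body)
print(decode_instance(instance))
```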

In [None]:
test_instances = [
    b"I went to a meeting that went really well.",
    b"I ran four miles this morning with a good time.",
    b"Watching the storms we had yesterday.  The lightning was incredible!",
    b"The last night I said with her 'I love you '. And she said ' Yes'.",
    b"I had followed a complex recipe making roasted duck, which took me hours and I had successfully made it.",
    b"I woke up this morning to birds chirping.",
]

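#### Send an online prediction request

Format the input text strings, call the prediction endpoint with the formatted input, and get the response.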

In [None]:
# print the test instances and their responses
for instance in test_instances:
    print(f"Input text: \n\t{instance.decode('utf-8')}\n")
    b64_encoded = base64.b64encode(instance)
    test_instance = [{"data": {"b64": f"{str(b64_encoded.decode('utf-8'))}"}}]
    print(f"Formatted input: \n{json.dumps(test_instance, indent=4)}\n")
    prediction = endpoint.predict(instances=test_instance)
    print(f"Prediction response: \n\t{prediction}")
    print("=" * 100)

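## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Generated files and folders
- Training job
- Vertex AI model
- Vertex AI endpoint
- Cloud Storage bucket (set `delete_bucket` to **True** to delete the bucket)
- Docker repository (Artifact Registry)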

In [None]:
delete_bucket = False

import shutil

# Delete the generated directories and all files in them
# (os.rmdir fails on directories that still contain files)
shutil.rmtree("./python_package", ignore_errors=True)
shutil.rmtree("./predictor", ignore_errors=True)


# Delete the Custom training job
job.delete()

# Undeploy the model from the endpoint
endpoint.undeploy_all()
# Delete the endpoint
endpoint.delete()

# Delete the Vertex AI Model resource
model.delete()

# Delete the Cloud Storage bucket
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

# Delete artifact repository
! gcloud artifacts repositories delete $APP_NAME --location=$LOCATION --quiet