In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI - Distributed tuning of Gemma with LoRA on TPUv5e, served on L4 GPUs

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/tpuv5e_gemma_peft_finetuning_and_serving.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab (requires Colab Pro for higher memory)
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/tpuv5e_gemma_peft_finetuning_and_serving.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/notebooks/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/tpuv5e_gemma_peft_finetuning_and_serving.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a> (an e2-standard-8 CPU with a 250 GB disk is recommended)
  </td>
</table>

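## Overview

This notebook is based on the [LoRA tuning example on ai.google.dev](https://ai.google.dev/gemma/docs/distributed_tuning). It follows an existing [Model Garden example written for fine-tuning on GPUs](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma_finetuning_on_vertex.ipynb) and has been modified to train on the latest TPUv5e chips. It demonstrates fine-tuning and deploying Gemma models with a [Vertex AI Custom Training Job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job). A Vertex AI Custom Training Job allows a higher level of customization and control over the fine-tuning job. All examples in this notebook use the parameter-efficient fine-tuning method [PEFT](https://github.com/huggingface/peft) to reduce training and storage costs.

This notebook deploys the model using a [vLLM](https://github.com/vllm-project/vllm) Docker container.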

### Objective

- Fine-tune and deploy Gemma models with a Vertex AI Custom Training Job.
- Send prediction requests to your fine-tuned Gemma model.

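### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI (training, TPUv5e, L4 GPU)
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.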

### Dataset

In this example, you fine-tune the model on the IMDB reviews dataset from TensorFlow Datasets. Details about the dataset are available at https://www.tensorflow.org/datasets/catalog/imdb_reviews

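## Installation

Install the following packages required to execute this notebook.

Run the following commands to install the latest Google Cloud Platform libraries with TPUv5e support.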

In [None]:
import os

# (optional) update gcloud if needed
if os.getenv("IS_TESTING"):
    ! gcloud components update --quiet

! pip3 install --upgrade --quiet google-cloud-aiplatform

Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

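## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. [Select or create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs.

6. [Create a service account](https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console) with the `Vertex AI User` and `Storage Object Admin` roles for deploying the fine-tuned model to a Vertex AI endpoint.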

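### Kaggle authentication
Gemma models are hosted by Kaggle. To use Gemma, request access on Kaggle:

* Sign in or register at [kaggle.com](https://www.kaggle.com)
* Open the [Gemma model card](https://www.kaggle.com/models/google/gemma) and select "Request Access"
* Complete the consent form and accept the terms and conditions

Then, to use the Kaggle API, create an API token:

* Open the [Kaggle settings](https://www.kaggle.com/settings)
* Select "Create New Token"
* A `kaggle.json` file is downloaded. It contains your Kaggle credentials. Note the username and key, which you fill in later.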
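
The values in the downloaded token file can also be read programmatically instead of being copied by hand. The following is a minimal sketch, assuming the file is a JSON object with `username` and `key` fields; `load_kaggle_credentials` and the file path are hypothetical names used only for illustration.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path: str) -> tuple[str, str]:
    """Return (username, key) read from a Kaggle API token file.

    Hypothetical helper: assumes the file holds a JSON object with
    non-empty "username" and "key" fields.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [field for field in ("username", "key") if not data.get(field)]
    if missing:
        raise ValueError(f"Token file is missing fields: {missing}")
    return data["username"], data["key"]
```

With the token saved next to the notebook, `load_kaggle_credentials("kaggle.json")` would return the pair of values that the notebook later stores in `KAGGLE_USERNAME` and `KAGGLE_KEY`.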

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

TPUv5e is available in the regions [listed here](https://cloud.google.com/tpu/pricing).

In [None]:
REGION = "us-west1"  # @param {type: "string"}

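### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to authenticate manually. Follow the relevant instructions below.

1. Vertex AI Workbench
* Do nothing, as you are already authenticated.

2. Local JupyterLab instance, uncomment and run: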

In [None]:
# ! gcloud auth login

3. Colab, uncomment and run:

In [None]:
# from google.colab import auth
# auth.authenticate_user()

See https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples for how to grant Cloud Storage permissions to your service account.

### Import libraries

In [None]:
import os
from datetime import datetime, timedelta

from google.cloud import aiplatform

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

Set the folder paths for staging, environment, and model artifacts.

In [None]:
STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
MODEL_BUCKET = os.path.join(BUCKET_URI, "model")

# The service account looks like:
# '@.iam.gserviceaccount.com'
# Please go to https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console
# and create service account with `Vertex AI User` and `Storage Object Admin` roles.
# The service account for deploying fine tuned model.
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

### Initialize the Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

Choose the Gemma base model.

In [None]:
# The Gemma base model.
base_model = "google/gemma-2b"  # @param ["google/gemma-2b", "google/gemma-2b-it", "google/gemma-7b", "google/gemma-7b-it"]

### Create an Artifact Registry repository and set the custom Docker image URI

In [None]:
REPOSITORY = "tpuv5e-training-repository-unique"

In [None]:
image_name_train = "gemma-lora-tuning-tpuv5e"
hostname = f"{REGION}-docker.pkg.dev"
tag = "latest"

In [None]:
# Register gcloud as a Docker credential helper
!gcloud auth configure-docker $REGION-docker.pkg.dev --quiet

In [None]:
# One time or use an existing repository
!gcloud artifacts repositories create $REPOSITORY --repository-format=docker \
--location=$REGION --description="Vertex TPUv5e training repository"

In [None]:
# Define container image name
KERAS_TRAIN_DOCKER_URI = (
    f"{hostname}/{PROJECT_ID}/{REPOSITORY}/{image_name_train}:{tag}"
)

# Set the docker image uri for the vLLM serving container
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01"

# Set the docker image uri for the model conversion container that converts the fine-tuned model to HF format
KERAS_MODEL_CONVERSION_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/jax-keras-model-conversion:20240220_0936_RC01"

### Define common functions

In [None]:
def get_job_name_with_datetime(prefix: str) -> str:
    """Gets the job name with date time when triggering training or deployment
    jobs in Vertex AI.
    """
    return prefix + datetime.now().strftime("_%Y%m%d_%H%M%S")


def deploy_model_vllm(
    model_name: str,
    model_uri: str,
    service_account: str,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    max_model_len: int = 8192,
    dtype: str = "bfloat16",
) -> tuple[aiplatform.Model, aiplatform.Endpoint]:
    # Upload the model to "Model Registry"
    job_name = get_job_name_with_datetime(model_name)
    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.95",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        "--disable-log-stats",
    ]
    model = aiplatform.Model.upload(
        display_name=job_name,
        artifact_uri=model_uri,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    # Deploy the model to an endpoint to serve "Online predictions"
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        sync=False,
    )

    return model, endpoint

### Build the Docker container files

#### Create the trainer directory

In [None]:
import os

if not os.path.exists("trainer"):
    os.makedirs("trainer")

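#### Kaggle credentials are required for KerasNLP training and for Hex-LLM deployment on TPUs
Set `KAGGLE_USERNAME` and `KAGGLE_KEY` as environment variables for use in Vertex Training.
Generate the Kaggle username and key by following [these instructions](https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials).
You also need to review and accept the model license, as described in the instructions earlier in this notebook.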

In [None]:
KAGGLE_USERNAME = "your-kaggle-username"  # @param {type:"string", isTemplate:true}
KAGGLE_KEY = "your-kaggle-key"  # @param {type:"string", isTemplate:true}

Create the Dockerfile for the custom container. It installs JAX[TPU], Keras, and TensorFlow Datasets.

In [None]:
%%writefile trainer/Dockerfile
# This Dockerfile fine tunes the Gemma model using LoRA with JAX

FROM python:3.10

ENV DEBIAN_FRONTEND=noninteractive

# Install basic libs
RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
        cmake \
        curl \
        wget \
        sudo \
        gnupg \
        libsm6 \
        libxext6 \
        libxrender-dev \
        lsb-release \
        ca-certificates \
        build-essential \
        git \
        libgl1

# Copy Apache license.
RUN wget https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/LICENSE

# Install required libs
RUN pip install --upgrade pip
RUN pip install jax[tpu]==0.4.25 -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
RUN pip install tensorflow==2.15.0.post1
RUN pip install tensorflow-datasets==4.9.4
RUN pip install -q -U keras-nlp==0.8.2
RUN pip install keras==3.0.5

# Copy other licenses.
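# Use raw file URLs so that the license text is downloaded rather than the GitHub HTML page.
RUN wget -O MIT_LICENSE https://raw.githubusercontent.com/pytest-dev/pytest/main/LICENSE
RUN wget -O BSD_LICENSE https://raw.githubusercontent.com/pytorch/xla/master/LICENSE
RUN wget -O BSD-3_LICENSE https://raw.githubusercontent.com/pytorch/pytorch/main/LICENSE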

ENV KERAS_BACKEND=jax
ENV XLA_PYTHON_CLIENT_MEM_FRACTION=0.9
ENV TPU_LIBRARY_PATH=/lib/libtpu.so

# Copy install libtpu to PATH above
RUN find ./usr/local/lib -name 'libtpu.so' -exec cp {} /lib \;

WORKDIR /
COPY train.py train.py
ENV PYTHONPATH ./

ENTRYPOINT ["python", "train.py"]

Add the `__init__.py` file

In [None]:
!touch trainer/__init__.py

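#### Add the train.py file
This code comes from the LoRA distributed fine-tuning example: https://ai.google.dev/gemma/docs/distributed_tuning

The IMDB TensorFlow dataset is used to fine-tune the Gemma model. Additional logic was added to handle the TPU topology settings that TPUv5e requires.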
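
The topology handling can be looked at in isolation. The following is a minimal sketch of turning a topology string such as `2x4` into a mesh shape and checking it against a device count before building a device mesh. `parse_tpu_topology` and `check_mesh_fits` are hypothetical helpers, not part of Keras, and the check assumes the mesh is meant to cover every available device exactly once.

```python
import math


def parse_tpu_topology(topology: str) -> tuple[int, ...]:
    """Parse a topology string such as "2x4" into a tuple of positive ints."""
    parts = topology.lower().split("x")
    if len(parts) < 2 or not all(p.isdigit() and int(p) > 0 for p in parts):
        raise ValueError(f"Invalid TPU topology: {topology!r}")
    return tuple(int(p) for p in parts)


def check_mesh_fits(mesh_shape: tuple[int, ...], device_count: int) -> None:
    """Raise if the mesh shape does not use exactly `device_count` devices."""
    needed = math.prod(mesh_shape)
    if needed != device_count:
        raise ValueError(
            f"Mesh {mesh_shape} needs {needed} devices, found {device_count}"
        )
```

Validating the string up front gives a clear error message in the job logs, rather than a failure deep inside the distribution setup.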

In [None]:
%%writefile trainer/train.py
import os
import argparse
import shutil
import locale

# Model saving variables
_ENCODING_FOR_MODEL_SAVING = "UTF-8"
_VOCABULARY_FILENAME = "vocabulary.spm"
_TOKENIZER_FILENAME = "tokenizer.model"

import keras
import keras_nlp
import tensorflow
import tensorflow_datasets as tfds
print (keras.__version__)
print (tensorflow.__version__)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tpu_topology",
    help="Topology to use for the TPUv5e (1x1, 1x4, 2x2, etc.)",
    type=str,
    required=True,
)
parser.add_argument(
    "--model_name",
    help="Kaggle model name (gemma_2b_en, gemma_7b_en)",
    type=str,
    required=True,
)
parser.add_argument(
    "--output_folder",
    type=str,
    required=True,
    help="Path to the output folder.",
)
parser.add_argument(
    "--checkpoint_filename",
    type=str,
    default="fine_tuned.weights.h5",
    help="Checkpoint filename.",
)
args = parser.parse_args()

def main():
    x = args.tpu_topology.split("x")
    tpu_topology_x = int(x[0])
    tpu_topology_y = int(x[1])
    print (f'TPU topology is ({tpu_topology_x}, {tpu_topology_y})')
    print (f'Model name is {args.model_name}')

    device_mesh = keras.distribution.DeviceMesh(
        (tpu_topology_x, tpu_topology_y),
        ["batch", "model"],
        devices=keras.distribution.list_devices())

    model_dim = "model"

    layout_map = keras.distribution.LayoutMap(device_mesh)
    # Weights that match 'token_embedding/embeddings' are sharded along the
    # "model" axis of the device mesh
    layout_map["token_embedding/embeddings"] = (None, model_dim)
    # Regex to match against the query, key and value matrices in the decoder
    # attention layers
    layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (
        None, model_dim, None)
    layout_map["decoder_block.*attention_output.*kernel"] = (
        None, None, model_dim)
    layout_map["decoder_block.*ffw_gating.*kernel"] = (model_dim, None)
    layout_map["decoder_block.*ffw_linear.*kernel"] = (None, model_dim)
    model_parallel = keras.distribution.ModelParallel(device_mesh, layout_map,
                                                    batch_dim_name="batch")
    keras.distribution.set_distribution(model_parallel)
    model_name = args.model_name
    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(args.model_name)
    print (f'Running inference on the base {args.model_name} model')
    lm_output = gemma_lm.generate("Prompt: Return 3 things I ask for in this format. \
        Response: 1) item 1 2) item 2 3) item 3. \
        Prompt: List the 3 best comedy movies in the 90s Response: ", max_length=100)
    print (lm_output)

    # Start training
    imdb_train = tfds.load(
        "imdb_reviews",
        split="train",
        as_supervised=True,
        batch_size=2,
    )
    # Drop labels.
    imdb_train = imdb_train.map(lambda x, y: x)

    imdb_train.unbatch().take(1).get_single_element().numpy()

    gemma_lm.backbone.enable_lora(rank=4)

    # Fine-tune on the IMDb movie reviews dataset.

    # Limit the input sequence length to 128 to control memory usage.
    gemma_lm.preprocessor.sequence_length = 128
    # Use AdamW (a common optimizer for transformer models).
    optimizer = keras.optimizers.AdamW(learning_rate=5e-5,weight_decay=0.01,)

    # Exclude layernorm and bias terms from decay.
    optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

    gemma_lm.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    gemma_lm.summary()
    gemma_lm.fit(imdb_train, epochs=1)

    print (f'Running inference on the fine-tuned {args.model_name} model')
    lm_output = gemma_lm.generate("Prompt: Return 3 things I ask for in this format. \
        Response: 1) item 1 2) item 2 3) item 3. \
        Prompt: List the 3 best comedy movies in the 90s Response: ", max_length=100)
    print (lm_output) 

    # Save checkpoint and tokenizer.
    print("Saving checkpoint and tokenizer.")
    if not os.path.exists(args.output_folder):
        os.makedirs(args.output_folder)
    locale.getpreferredencoding = lambda: _ENCODING_FOR_MODEL_SAVING
    gemma_lm.save_weights(
        os.path.join(args.output_folder, args.checkpoint_filename)
    )
    gemma_lm.preprocessor.tokenizer.save_assets(args.output_folder)

    # Copy and rename the tokenizer file.
    print("Copying tokenizer file.")
    shutil.copy(
        os.path.join(args.output_folder, _VOCABULARY_FILENAME),
        os.path.join(args.output_folder, _TOKENIZER_FILENAME),
    )
    print ('Exiting job')

if __name__ == "__main__":
    main()

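## Fine-tune with a Vertex AI Custom Training Job

This section demonstrates how to fine-tune and deploy Gemma models with PEFT LoRA on Vertex AI Custom Training Jobs. LoRA (Low-Rank Adaptation) is one approach to PEFT (Parameter-Efficient Fine-Tuning), in which the pretrained model weights are frozen and rank-decomposition matrices representing the change in model weights are trained during fine-tuning. For more about LoRA, see the following publication: [Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. Lora: Low-rank adaptation of large language models. *arXiv preprint arXiv:2106.09685*](https://arxiv.org/abs/2106.09685).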
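
To make the low-rank idea concrete, here is a toy sketch in plain Python (no Keras or JAX) of a frozen weight matrix `W` plus a rank-1 update `B @ A`. The matrix values, the helper names, and the 2048 x 2048 layer size in the note below are illustrative assumptions, not values taken from Gemma.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Frozen pretrained weight W with shape d_out x d_in = 3 x 4 (toy values).
W = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Rank-1 factors: A is rank x d_in, B is d_out x rank.
A = [[0.5, -0.5, 0.5, -0.5]]
B_init = [[0.0], [0.0], [0.0]]   # zero start, so B @ A contributes nothing yet
B_tuned = [[1.0], [0.0], [2.0]]  # stand-in for values learned during training

W_before = add(W, matmul(B_init, A))  # identical to W
W_after = add(W, matmul(B_tuned, A))  # W plus a rank-1 update
```

Only `A` and `B` would be trained, while `W` stays fixed. For a hypothetical 2048 x 2048 layer with rank 4 (the rank used by `enable_lora` in `train.py`), `lora_trainable_params(2048, 2048, 4)` counts 16,384 trainable values, compared with 4,194,304 in the full matrix.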

Enable Docker to run as a non-root user

In [None]:
!sudo usermod -a -G docker ${USER}

#### Change to the trainer directory to build the Docker container

In [None]:
%cd trainer

Build the custom Docker container and push it to Artifact Registry

In [None]:
!docker build -t $KERAS_TRAIN_DOCKER_URI -f Dockerfile .

In [None]:
!docker push $KERAS_TRAIN_DOCKER_URI

Change back to the parent directory

In [None]:
%cd ..

#### Set the GCS folder locations and job configuration
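
The job configuration passes the output folder to the training container as a `/gcs/...` path, which is where Vertex AI custom training exposes Cloud Storage buckets through Cloud Storage FUSE. The following sketch shows that mapping with two hypothetical helpers; `join_gcs` uses `posixpath` so the separator stays `/` on any operating system, which is a design choice of this sketch rather than something the notebook requires.

```python
import posixpath


def join_gcs(base_uri: str, *parts: str) -> str:
    """Join path segments onto a gs:// URI using forward slashes."""
    return posixpath.join(base_uri, *parts)


def to_gcsfuse_path(gs_uri: str) -> str:
    """Map gs://bucket/path to the /gcs/bucket/path mount location."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got {gs_uri!r}")
    return "/gcs/" + gs_uri[len(prefix):]
```

For example, `to_gcsfuse_path(join_gcs("gs://my-bucket", "model", "run_1"))` yields `/gcs/my-bucket/model/run_1`, where `my-bucket` is a placeholder name.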

In [None]:
# Create a GCS folder to store the merged model with the base model and the
# fine-tuned LORA adapter.
merged_model_dir = get_job_name_with_datetime("gemma-lora-model-tpuv5")
merged_model_output_dir = os.path.join(MODEL_BUCKET, merged_model_dir)
merged_model_output_dir_gcsfuse = merged_model_output_dir.replace("gs://", "/gcs/")

# Set the checkpoint output filename
checkpoint_filename = "fine_tuned.weights.h5"

DISPLAY_NAME_PREFIX = "gemma-lora-train"  # @param {type:"string"}
tpuv5e_gemma_peft_job = {
    "display_name": get_job_name_with_datetime(DISPLAY_NAME_PREFIX),
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "ct5lp-hightpu-1t",
                    "tpu_topology": "1x1",
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": KERAS_TRAIN_DOCKER_URI,
                    "args": [
                        "--tpu_topology=1x1",
                        "--model_name=gemma_2b_en",
                        f"--output_folder={merged_model_output_dir_gcsfuse}",
                        f"--checkpoint_filename={checkpoint_filename}",
                    ],
                    "env": [
                        {"name": "KAGGLE_USERNAME", "value": KAGGLE_USERNAME},
                        {"name": "KAGGLE_KEY", "value": KAGGLE_KEY},
                    ],
                },
            },
        ],
    },
}

tpuv5e_gemma_peft_job

Create the job client and run the job

In [None]:
job_client = aiplatform.gapic.JobServiceClient(
    client_options=dict(api_endpoint=f"{REGION}-aiplatform.googleapis.com")
)

In [None]:
create_tpuv5e_gemma_peft_job_response = job_client.create_custom_job(
    parent="projects/{project}/locations/{location}".format(
        project=PROJECT_ID, location=REGION
    ),
    custom_job=tpuv5e_gemma_peft_job,
)
print(create_tpuv5e_gemma_peft_job_response)

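Check the job's progress.
Depending on the model size, this can take 20-60 minutes or longer. Run the status cell below several times to check progress.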
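
Instead of re-running the status check by hand, the same check can be wrapped in a polling loop. The following is a minimal sketch, assuming a caller-supplied `get_state` callable that returns the job state name as a string; `wait_for_job` is a hypothetical helper, and the set of terminal states lists only some of the state names used by the Vertex AI API.

```python
import time
from typing import Callable

# Assumption: a subset of Vertex AI job state names that mean the job is over.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
}


def wait_for_job(
    get_state: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 2 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_state` until it returns a terminal state or time runs out."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook, `get_state` could wrap the `job_client.get_custom_job(...)` call from the status cell and return the name of the job's `state` field, assuming that field exposes its enum name.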

In [None]:
get_tpuv5e_gemma_peft_job_response = job_client.get_custom_job(
    name=create_tpuv5e_gemma_peft_job_response.name
)
get_tpuv5e_gemma_peft_job_response

Click the console log URL printed by the cell below to view your logs.
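
The log URL is a Logs Explorer link with a percent-encoded query. As an alternative to writing `%22` and `%20` by hand, the following sketch builds the same kind of link with `urllib.parse.quote`. `build_job_logs_url` is a hypothetical helper, and the URL prefix and filter fields are copied from this notebook rather than from a documented URL scheme.

```python
from urllib.parse import quote

LOGS_BASE = "https://console.cloud.google.com/logs/query;query="


def build_job_logs_url(job_id: str, start_date: str) -> str:
    """Return a Logs Explorer URL filtered by job ID and start date."""
    query = f'resource.labels.job_id="{job_id}" timestamp>={start_date}'
    # Keep "=" readable; quotes, spaces, and ">" are percent-encoded.
    return LOGS_BASE + quote(query, safe="=")
```

Writing the filter as plain text first makes it easier to add or change conditions without miscounting escape sequences.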

In [None]:
job_id = create_tpuv5e_gemma_peft_job_response.name[
    create_tpuv5e_gemma_peft_job_response.name.rfind("/") + 1 :
]
startdate = datetime.today() - timedelta(days=1)
startdate = startdate.strftime("%Y-%m-%d")
print(
    f"https://console.cloud.google.com/logs/query;query=resource.labels.job_id=%22{job_id}%22%20timestamp%3E={startdate}"
)

### Convert the fine-tuned Keras checkpoint to HF format

#### Download the conversion script from the KerasNLP tools
The GitHub repository is at https://github.com/keras-team/keras-nlp

In [None]:
!wget -nv -nc https://raw.githubusercontent.com/keras-team/keras-nlp/master/tools/gemma/export_gemma_to_hf.py

Download the fine-tuned checkpoint files locally

In [None]:
!gcloud storage cp -r $merged_model_output_dir .

#### Install libraries for model conversion

In [None]:
!pip install torch==2.1
!pip install --upgrade keras-nlp
!pip install --upgrade "keras>=3"
!pip install --upgrade accelerate sentencepiece transformers

Run the model conversion script

In [None]:
os.environ["KERAS_BACKEND"] = "torch"
os.environ["KAGGLE_USERNAME"] = KAGGLE_USERNAME
os.environ["KAGGLE_KEY"] = KAGGLE_KEY
MODEL_SIZE="2b"
!KERAS_BACKEND=torch python export_gemma_to_hf.py \
  --weights_file ./$merged_model_dir/fine_tuned.weights.h5 \
  --size $MODEL_SIZE \
  --vocab_path ./$merged_model_dir/vocabulary.spm \
  --output_dir ./$merged_model_dir/fine_tuned_gg_hf

#### Copy the converted HF files to GCS

In [None]:
HUGGINGFACE_MODEL_DIR = os.path.join("./", merged_model_dir, "fine_tuned_gg_hf")
HUGGINGFACE_MODEL_DIR_GCS = os.path.join(merged_model_output_dir, "fine_tuned_gg_hf")
HUGGINGFACE_MODEL_DIR

In [None]:
!gcloud storage cp $HUGGINGFACE_MODEL_DIR/* $HUGGINGFACE_MODEL_DIR_GCS

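### Deploy the fine-tuned model
This section uploads the model to Model Registry and deploys it on an endpoint using [vLLM](https://github.com/vllm-project/vllm).

The model deployment step takes 15 minutes to 1 hour to complete, depending on the model size.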

In [None]:
MODEL_NAME_VLLM = get_job_name_with_datetime(prefix="gemma-vllm-serve")

# Start with a G2 Series cost-effective configuration
if MODEL_SIZE == "2b":
    machine_type = "g2-standard-8"
    accelerator_type = "NVIDIA_L4"
    accelerator_count = 1
elif MODEL_SIZE == "7b":
    machine_type = "g2-standard-12"
    accelerator_type = "NVIDIA_L4"
    accelerator_count = 1
else:
    raise ValueError(f"Unsupported MODEL_SIZE {MODEL_SIZE!r}; expected '2b' or '7b'")

# See supported machine/GPU configurations in chosen region:
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute

# For even more performance, consider V100 and A100 GPUs
# > Nvidia Tesla V100
# machine_type = "n1-standard-8"
# accelerator_type = "NVIDIA_TESLA_V100"
# > Nvidia Tesla A100
# machine_type = "a2-highgpu-1g"
# accelerator_type = "NVIDIA_TESLA_A100"

# Larger `max_model_len` values will require more GPU memory
max_model_len = 2048

model, endpoint = deploy_model_vllm(
    MODEL_NAME_VLLM,
    HUGGINGFACE_MODEL_DIR_GCS,
    SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    max_model_len=max_model_len,
)

Click the console log URL printed by the cell below to view your logs.

In [None]:
startdate = datetime.today() - timedelta(days=1)
startdate = startdate.strftime("%Y-%m-%d")
log_link = "https://console.cloud.google.com/logs/query;query=resource.type=%22aiplatform.googleapis.com%2FEndpoint%22"
log_link += f"%20resource.labels.endpoint_id=%22{endpoint.name}%22"
log_link += f"%20resource.labels.location={REGION}"
log_link += f"%20timestamp%3E={startdate}"
print(log_link)

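Note: Overall deployment can take 30-40 minutes or longer. After the deployment step succeeds (about 15-20 minutes), the fine-tuned model is downloaded from the GCS bucket used for training above. You therefore need to wait roughly another 15-20 minutes (depending on the model size) after the model deployment step succeeds and before you run the next step below. Otherwise, you might see a `ServiceUnavailable: 503 502:Bad Gateway` error when you send requests to the endpoint.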
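
If you would rather not time the wait by hand, the first request can be retried until the endpoint stops returning the transient error. The following is a minimal sketch of such a retry loop; `call_with_retry`, its default attempt count, and its default delay are hypothetical choices, and the caller decides which exceptions count as transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    attempts: int = 10,
    delay_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying after a pause while it raises transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as error:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == attempts or not is_transient(error):
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

For this notebook, `call` could wrap the `endpoint.predict(...)` request, and `is_transient` could check for the `ServiceUnavailable` exception named in the note above, which is an assumption about how the client library surfaces the 503 response.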

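### Send a prediction request

Once deployment succeeds, you can send requests to the endpoint with text prompts. Use the example prompt from earlier in the notebook.

Example: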

```
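Prompt: Return 3 things I ask for in this format and do not repeat my prompt. Response: 1) item 1 2) item 2 3) item 3. List the 3 best comedy movies in the 90s
Response: 1) The Cable Guy 2) Scooby-Doo 3) Beethoven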
```

In [None]:
PROMPT = "Prompt: Return 3 things I ask for in this format and do not repeat my prompt. \
Response: 1) item 1 2) item 2 3) item 3. \
Prompt: List the 3 best comedy movies in the 90s Response: "

# Each instance is one dictionary holding the prompt and its sampling parameters.
instances = [
    {
        "prompt": PROMPT,
        "max_tokens": 500,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1,
    },
]

response = endpoint.predict(instances=instances)

for prediction in response.predictions:
    print(prediction)

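## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial: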

In [None]:
# Delete the train job.
job_client.delete_custom_job(name=create_tpuv5e_gemma_peft_job_response.name)

# Undeploy model and delete endpoint.
endpoint.delete(force=True)

# Delete models.
model.delete()

import os

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI