In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Finetune Gemma using KerasNLP and deploy to Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_gemma_kerasnlp_to_vertexai.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma_kerasnlp_to_vertexai.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

This notebook has been tested in the following environment:
- Python 3.10
- Colab Enterprise with a `g2-standard-8` runtime:
  - 32 GB of system memory
  - 24 GB of GPU memory (NVIDIA L4)

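## Overview

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

This notebook demonstrates how to load, finetune, convert, and deploy Gemma to Vertex AI.

### Objectives

- Load Gemma using KerasNLP
- Finetune Gemma using KerasNLP
- Convert Gemma to Hugging Face Transformers
- Deploy Gemma to Vertex AI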

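### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI
- Cloud Storage

Learn about [Vertex AI](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage](https://cloud.google.com/storage/pricing) pricing,
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installation

Install the following packages, which are required to run this notebook: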

In [None]:
# Keras & KerasNLP
# Install Keras 3 last, see https://keras.io/getting_started
%pip install --upgrade --quiet keras-nlp
%pip install --upgrade --quiet keras

# Hugging Face Transformers
%pip install --upgrade --quiet accelerate sentencepiece transformers

# Vertex AI SDK
%pip install --upgrade --quiet google-cloud-aiplatform

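## Before you begin

### Kaggle credentials

Gemma models are hosted by Kaggle. To use Gemma, request access on Kaggle:

- Sign in or register at [kaggle.com](https://www.kaggle.com)
- Open the [Gemma model card](https://www.kaggle.com/models/google/gemma) and select _"Request Access"_
- Complete the consent form and accept the terms and conditions

Then, to use the Kaggle API, create an API token:

- Open the [Kaggle settings](https://www.kaggle.com/settings)
- Select _"Create New Token"_
- A `kaggle.json` file is downloaded. It contains your Kaggle credentials

Run the following cell and enter your Kaggle credentials.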

In [None]:
import kagglehub

kagglehub.login()

Note: If `kagglehub.login()` does not work for you, an alternative is to set the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.
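
The environment-variable alternative can be sketched as follows. This is a minimal illustration, not part of the original notebook: the `load_kaggle_credentials` helper and the `KAGGLE_JSON_PATH` location are hypothetical, and the sketch assumes the downloaded `kaggle.json` file stores its two fields under the `username` and `key` names.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path: str) -> None:
    """Export the credentials stored in a kaggle.json file as environment variables."""
    credentials = json.loads(Path(path).read_text())
    # Assumption: kaggle.json holds two fields, "username" and "key"
    os.environ["KAGGLE_USERNAME"] = credentials["username"]
    os.environ["KAGGLE_KEY"] = credentials["key"]


# Hypothetical location: adjust to wherever you saved the downloaded file
KAGGLE_JSON_PATH = "kaggle.json"
if Path(KAGGLE_JSON_PATH).exists():
    load_kaggle_credentials(KAGGLE_JSON_PATH)
```

Run this before loading the model, so that the variables are already set when the Gemma files are downloaded.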

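### Google Cloud setup

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

### Google Cloud authentication

If you are running this notebook from Colab Enterprise, the Cloud SDK, your code, and other libraries already run with your Google Cloud account.

Check your active account: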

In [None]:
!gcloud config get core/account

If your account is not defined, you need to authenticate:

In [None]:
# Authenticate the Cloud SDK with your credentials
# !gcloud auth login

# Authenticate code and libraries with your credentials
# !gcloud auth application-default login

### Google Cloud project

If you are running this notebook in Colab Enterprise, the default project is defined automatically:

In [None]:
res = !gcloud config get core/project
PROJECT_ID = res[0]

print(f"{PROJECT_ID=}")

Otherwise, list your projects and define the default project manually:

In [None]:
# List your projects
# !gcloud projects list

# Define the default project
# PROJECT_ID = ""  # @param {type:"string"}
# !gcloud config set core/project $PROJECT_ID

### Vertex AI region

Define your default Vertex AI region. See the available [Vertex AI locations](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

!gcloud config set ai/region $REGION

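Note: This notebook deploys the Gemma model to a single region. In production, you can deploy to multiple regions to serve your users worldwide with the lowest latency.

### Cloud Storage bucket

Create a storage bucket (or use an existing one) to store artifacts such as model weights or datasets.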

In [None]:
# Define a bucket related to your project
BUCKET_URI = f"gs://gemma-{PROJECT_ID}-unique"
# Or use an existing one
# BUCKET_URI = "gs://"  # @param {type:"string"}

res = !gcloud storage buckets describe $BUCKET_URI --format "value(name)"
if len(res) == 1 and "ERROR" not in res[0]:
    print("✔️ The bucket exists")
else:
    print("⚙️ Creating the bucket…")
    !gcloud storage buckets create $BUCKET_URI --project $PROJECT_ID --location $REGION

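### Service account

When Gemma is deployed to a Vertex AI endpoint, the model server needs a service account with the "Storage Object Admin" and "Vertex AI User" roles.

Create the service account (or use an existing one):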

In [None]:
# Create the service account for the Vertex AI endpoint
SERVICE_ACCOUNT_NAME = "gemma-vertexai"
SERVICE_ACCOUNT_DISPLAY_NAME = "Gemma Vertex AI endpoint"
SERVICE_ACCOUNT = f"{SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"
# Or use an existing one
# SERVICE_ACCOUNT = ""  # @param {type:"string"}
assert SERVICE_ACCOUNT.endswith(f"@{PROJECT_ID}.iam.gserviceaccount.com")

res = !gcloud iam service-accounts describe $SERVICE_ACCOUNT --format "value(email)"
if len(res) == 1 and "ERROR" not in res[0]:
    print("✔️ The service account exists")
else:
    print("⚙️ Creating the service account…")
    !gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME --display-name "$SERVICE_ACCOUNT_DISPLAY_NAME"
    # Grant "Storage Object Admin" role
    !gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" --role "roles/storage.objectAdmin"
    # Grant "Vertex AI User" role
    !gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" --role "roles/aiplatform.user"

### Dependencies

In [None]:
import datetime
import json
import locale

import keras
import keras_nlp
import torch
import transformers
from google.cloud import aiplatform
from numba import cuda

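### Model constants

Gemma models are available in several sizes and variants. This notebook uses the `gemma_2b_en` version, which has lower resource requirements. To learn more about Gemma, see the [Gemma Model Garden card](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma).

Define the model and related constants: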

In [None]:
MODEL_NAME = "gemma_2b_en"
# MODEL_NAME = "gemma_instruct_2b_en"
# MODEL_NAME = "gemma_7b_en"
# MODEL_NAME = "gemma_instruct_7b_en"

# Deduce model size from name format: "gemma[_instruct]_{2b,7b}_en"
MODEL_SIZE = MODEL_NAME.split("_")[-2]
assert MODEL_SIZE in ("2b", "7b")

# Dataset
DATASET_NAME = "databricks-dolly-15k"
DATASET_PATH = f"{DATASET_NAME}.jsonl"
DATASET_URL = f"https://huggingface.co/datasets/databricks/{DATASET_NAME}/resolve/main/{DATASET_PATH}"

# Finetuned model
FINETUNED_MODEL_DIR = f"./{MODEL_NAME}_{DATASET_NAME}"
FINETUNED_WEIGHTS_PATH = f"{FINETUNED_MODEL_DIR}/model.weights.h5"
FINETUNED_VOCAB_PATH = f"{FINETUNED_MODEL_DIR}/vocabulary.spm"

# Converted model
HUGGINGFACE_MODEL_DIR = f"./{MODEL_NAME}_huggingface"

# Deployed model
DEPLOYED_MODEL_URI = f"{BUCKET_URI}/{MODEL_NAME}"
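
The name-based deduction in the cell above can also be made stricter. The sketch below is one possible variant, assuming the name format stated in the cell's comment (`gemma[_instruct]_{2b,7b}_en`); the `parse_model_name` helper and `MODEL_NAME_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Assumed name format, taken from the comment in the cell above:
# "gemma[_instruct]_{2b,7b}_en"
MODEL_NAME_PATTERN = re.compile(r"^gemma(?P<instruct>_instruct)?_(?P<size>2b|7b)_en$")


def parse_model_name(model_name: str) -> tuple[str, bool]:
    """Return (size, is_instruction_tuned) for a Gemma preset name."""
    match = MODEL_NAME_PATTERN.match(model_name)
    if match is None:
        raise ValueError(f"Unexpected model name: {model_name!r}")
    return match["size"], match["instruct"] is not None


print(parse_model_name("gemma_2b_en"))
print(parse_model_name("gemma_instruct_7b_en"))
```

Compared with splitting on underscores, a pattern match rejects unexpected names early instead of failing later on the size assertion.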

### Dataset

To finetune Gemma, this notebook uses the [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset.

Download the dataset:

In [None]:
!wget -nv -nc -O $DATASET_PATH $DATASET_URL

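## Load Gemma

In this step, you configure the Keras precision settings and load Gemma with KerasNLP.

### Keras precision settings

When training on NVIDIA GPUs, mixed precision (`keras.mixed_precision.set_global_policy("mixed_bfloat16")`) can speed up training with minimal effect on training quality. In most cases, turning on mixed precision is recommended, as it saves both memory and time. Note, however, that at small batch sizes it can inflate memory usage by 1.5x (the weights are loaded twice, at half precision and at full precision).

For inference, half precision (`keras.config.set_floatx("bfloat16")`) works and saves memory, whereas mixed precision is not applicable.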
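
The 1.5x figure can be checked with some rough arithmetic. The sketch below is an estimate only: it counts the weights alone (no activations, gradients, or optimizer state), assumes about 2.5 billion parameters for the 2B model, and reads "loaded twice" as one full-precision copy plus one half-precision copy. The `weight_memory_gb` helper is a hypothetical name.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(param_count: float, policy: str) -> float:
    """Estimate the memory taken by the model weights alone, in GB (10^9 bytes)."""
    if policy == "mixed_bfloat16":
        # Assumption: weights are kept at both full and half precision
        bytes_per_param = BYTES_PER_PARAM["float32"] + BYTES_PER_PARAM["bfloat16"]
    else:
        bytes_per_param = BYTES_PER_PARAM[policy]
    return param_count * bytes_per_param / 1e9


PARAM_COUNT = 2.5e9  # Approximate parameter count of the 2B model
for policy in ("float32", "bfloat16", "mixed_bfloat16"):
    print(f"{policy:>15}: {weight_memory_gb(PARAM_COUNT, policy):.1f} GB")
```

Under these assumptions, the weights take about 10 GB at full precision, 5 GB at half precision, and 15 GB with mixed precision, which is 1.5 times the full-precision figure.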

Configure your precision settings:

In [None]:
# Run inferences at half precision
keras.config.set_floatx("bfloat16")

# Train at mixed precision (enable for large batch sizes)
# keras.mixed_precision.set_global_policy("mixed_bfloat16")

### Model summary

Load the Gemma model with the `GemmaCausalLM.from_preset()` method:

In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(MODEL_NAME)

Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/config.json...
100%|██████████| 555/555 [00:00<00:00, 634kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/model.weights.h5...
100%|██████████| 4.67G/4.67G [02:28<00:00, 33.7MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/tokenizer.json...
100%|██████████| 401/401 [00:00<00:00, 554kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/assets/tokenizer/vocabulary.spm...
100%|██████████| 4.04M/4.04M [00:00<00:00, 5.27MB/s]


Display the model summary:

In [None]:
gemma_lm.summary()

### Test examples

Define the test examples and prompts used to test the model before and after finetuning.

In [None]:
TEST_EXAMPLES = [
    "What are good activities for a toddler?",
    "What can we hope to see after rain and sun?",
    "What's the most famous painting by Monet?",
    "Who engineered the Statue of Liberty?",
    'Who were "The Lumières"?',
]

# Prompt template for the training data and the finetuning tests
PROMPT_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{response}"

TEST_PROMPTS = [
    PROMPT_TEMPLATE.format(instruction=example, response="")
    for example in TEST_EXAMPLES
]

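### Sampler

You can control how `GemmaCausalLM` generates tokens by calling the `compile()` method with the `sampler` argument.

For example:

- `greedy`: selects the next token with the highest probability
- `top_k`: randomly selects the next token among the K most probable tokens

To get deterministic outputs in this notebook, make sure you use the `greedy` sampler.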

In [None]:
gemma_lm.compile(sampler="greedy")

To learn more about the available samplers, see [Samplers](https://keras.io/api/keras_nlp/samplers).
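
The difference between the two strategies can be illustrated on a toy distribution. This is a plain-Python sketch, not the KerasNLP implementation: the helper names and the probability values are made up, and the top-k draw is assumed to be weighted by the candidates' probabilities.

```python
import random


def greedy_sample(probabilities: list[float]) -> int:
    """Return the index of the most probable token."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def top_k_sample(probabilities: list[float], k: int, rng: random.Random) -> int:
    """Draw a token index among the k most probable ones, weighted by probability."""
    ranked = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)
    candidates = ranked[:k]
    weights = [probabilities[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy next-token distribution over a 4-token vocabulary (made-up values)
probabilities = [0.1, 0.5, 0.3, 0.1]
rng = random.Random(0)

print(greedy_sample(probabilities))  # Always the same index
print([top_k_sample(probabilities, k=2, rng=rng) for _ in range(5)])  # Varies with the seed
```

Greedy selection always returns the same token for the same distribution, which is why it gives reproducible outputs. With `k=1`, top-k sampling reduces to greedy selection.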

### Inference before finetuning

Check how the model responds to the test examples:

In [None]:
for test_example in TEST_EXAMPLES:
    response = gemma_lm.generate(test_example, max_length=48)
    output = response[len(test_example) :]
    print(f"{test_example}\n{output!r}\n")

What are good activities for a toddler?
'\n\nWhat are the best activities for a toddler?\n\nWhat are the best activities for a toddler?\n\nWhat are the best activities for a toddler?\n\nWhat are the best activities for a toddler'

What can we hope to see after rain and sun?
'\n\nThe answer is: a lot.\n\nThe rain and sun are the two most important elements in the world of photography.\n\nThe rain is the most important element because it creates'

What's the most famous painting by Monet?
"\n\nWhat's the most famous painting by Van Gogh?\n\nWhat's the most famous painting by Picasso?\n\nWhat's the most famous painting by Dali?\n\nWhat'"

Who engineered the Statue of Liberty?
'\n\nA. George Washington\nB. Napoleon Bonaparte\nC. Robert Fulton\nD. Gustave Eiffel\n\nIn the following sentence, underline the correct modifier from the pair given in parentheses. Example 1'

Who were "The Lumières"?
' What did they invent?\n\nIn the following sentence, underline the correct modifier from the pair

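A pretrained model may generate text that deviates from the output you expect. Here are some examples:

- The output does not follow your output requirements.
- The output is too generic or not consistent enough.
- The output is factually incorrect or outdated.
- The output must comply with your specific safety policies.

More specific inputs (prompt engineering) can address some of these issues, at the cost of complexity and prompt length. If the expected output is not in the model's training data, LLMs still generate plausible text and produce so-called hallucinations.

You can finetune the model to improve its performance while keeping prompts simpler.

Finetune your Gemma model to improve its performance at answering questions, making it more consistent and accurate.

### Training data

Use the dataset to generate training examples.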

In [None]:
def generate_training_data(training_ratio: int = 100) -> list[str]:
    assert 0 < training_ratio <= 100
    data = []
    with open(DATASET_PATH) as file:
        for line in file.readlines():
            features = json.loads(line)
            # Skip examples with context, for simplicity
            if features["context"]:
                continue
            data.append(PROMPT_TEMPLATE.format(**features))
    total_data_count = len(data)
    training_data_count = total_data_count * training_ratio // 100
    print(f"Training examples: {training_data_count}/{total_data_count}")

    return data[:training_data_count]


# Limit to 10% for test purposes
training_data = generate_training_data(training_ratio=10)

Training examples: 1054/10544
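
The filtering and formatting step inside `generate_training_data()` can be tried on a couple of in-memory records. The sketch below is an illustration only: the two records are made up, and the `format_examples` helper is a hypothetical name. It assumes each JSON Lines record carries the `instruction`, `context`, `response`, and `category` fields.

```python
import json

PROMPT_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{response}"

# Two made-up records following the assumed JSON Lines layout of the dataset
SAMPLE_LINES = [
    '{"instruction": "Name a primary color.", "context": "", "response": "Red.", "category": "open_qa"}',
    '{"instruction": "Summarize the text.", "context": "Some passage.", "response": "A summary.", "category": "summarization"}',
]


def format_examples(lines: list[str]) -> list[str]:
    """Turn JSON Lines records into prompts, skipping the ones that carry a context."""
    examples = []
    for line in lines:
        features = json.loads(line)
        if features["context"]:
            continue
        # str.format() ignores the unused keys ("context", "category")
        examples.append(PROMPT_TEMPLATE.format(**features))
    return examples


for example in format_examples(SAMPLE_LINES):
    print(example)
```

Only the first record is kept, since the second one has a non-empty `context`. This filter is why the total count printed above is lower than the full dataset size.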


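### Low-Rank Adaptation (LoRA)

[Low-Rank Adaptation](https://arxiv.org/abs/2106.09685) (LoRA) is a finetuning technique that greatly reduces the number of trainable parameters for downstream tasks by freezing all the model weights and inserting a small number of new trainable weights into the model. This makes training faster and more memory-efficient.

Enable LoRA with a LoRA rank of 4: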

In [None]:
gemma_lm.backbone.enable_lora(rank=4)

Check that the number of trainable parameters has dropped significantly:

In [None]:
gemma_lm.summary()

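The number of trainable parameters drops from 2.5 billion to 1.4 million (about 1,800 times fewer), which makes it possible to finetune the model with reasonable GPU memory requirements.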
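
The source of this reduction can be sketched with a parameter count for a single weight matrix. In the LoRA paper, the update to a `d_in x d_out` matrix is factored into two matrices of rank `r`, so it adds `r * (d_in + d_out)` trainable parameters. The matrix dimensions below are hypothetical and chosen for illustration; they are not read from the Gemma configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by LoRA to one d_in x d_out weight matrix."""
    # LoRA learns two small matrices: A (d_in x rank) and B (rank x d_out)
    return rank * (d_in + d_out)


# Hypothetical square projection matrix, for illustration only
d_in, d_out, rank = 2048, 2048, 4
full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"Full matrix: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({full // lora}x fewer)")

# Ratio for the whole model, using the rounded counts quoted above
print(f"Whole model: about {2.5e9 / 1.4e6:,.0f}x fewer trainable parameters")
```

The whole-model ratio differs from the per-matrix ratio because LoRA is typically applied to selected layers only, while every original weight is frozen.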

### Finetuning

Finetune the model with the training data. This step can take a few minutes:

In [None]:
def finetune_gemma(model: keras_nlp.models.GemmaCausalLM, data: list[str]):
    # Reduce the input sequence length to limit memory usage
    model.preprocessor.sequence_length = 128

    # Use AdamW (a common optimizer for transformer models)
    optimizer = keras.optimizers.AdamW(
        learning_rate=5e-5,
        weight_decay=0.01,
    )

    # Exclude layernorm and bias terms from decay
    optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

    model.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
        sampler="greedy",
    )
    model.fit(data, epochs=1, batch_size=1)


finetune_gemma(gemma_lm, training_data)

1054/1054 ━━━━━━━━━━━━━━━━━━━━ 134s 77ms/step - loss: 19.3561 - sparse_categorical_accuracy: 0.5872


### Inference after finetuning

Test the finetuned model:

In [None]:
for prompt in TEST_PROMPTS:
    output = gemma_lm.generate(prompt, max_length=30)
    print(f"{output}\n{'- '*40}")

Instruction:
What are good activities for a toddler?

Response:
The best activities for a toddler are those that are fun and engaging.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What can we hope to see after rain and sun?

Response:
After rain and sun, we can see the rainbow.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What's the most famous painting by Monet?

Response:
The most famous painting by Monet is "Impression, Sunrise".
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who engineered the Statue of Liberty?

Response:
The Statue of Liberty was designed by a French sculptor, Frederic Auguste Bartholdi
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who were "The Lumières"?

Response:
The Lumières were the inventors of the first motion picture camera. They were
- - - - - - - - - - - - - - - 

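You should notice that the outputs are now more structured, more consistent, and more factual.

## Convert Gemma to Hugging Face Transformers

In the next step, the model is deployed to Vertex AI and served by a [vLLM](https://docs.vllm.ai) container image. vLLM is an optimized LLM serving library that supports Hugging Face [Transformers](https://huggingface.co/docs/transformers). To be loaded by vLLM, the finetuned model needs to be converted to the Hugging Face format. KerasNLP provides a conversion script for this purpose.

### Checkpoint

Save the finetuned model assets: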

In [None]:
# Make sure the directory exists
%mkdir -p $FINETUNED_MODEL_DIR

gemma_lm.save_weights(FINETUNED_WEIGHTS_PATH)

gemma_lm.preprocessor.tokenizer.save_assets(FINETUNED_MODEL_DIR)

List the checkpoint files:

In [None]:
!du -shc $FINETUNED_MODEL_DIR/*

4.7G	./gemma_2b_en_databricks-dolly-15k/model.weights.h5
4.1M	./gemma_2b_en_databricks-dolly-15k/vocabulary.spm
4.7G	total


Release resources to make sure the GPU is available for the next step:

In [None]:
del gemma_lm

device = cuda.get_current_device()
cuda.select_device(device.id)
cuda.close()

### Model conversion

Run the KerasNLP conversion script:

In [None]:
# Download the conversion script from KerasNLP tools
!wget -nv -nc https://raw.githubusercontent.com/keras-team/keras-nlp/master/tools/gemma/export_gemma_to_hf.py

# Run the conversion script
# Note: it uses the PyTorch backend of Keras (hence the KERAS_BACKEND env variable)
!KERAS_BACKEND=torch python export_gemma_to_hf.py \
    --weights_file $FINETUNED_WEIGHTS_PATH \
    --size $MODEL_SIZE \
    --vocab_path $FINETUNED_VOCAB_PATH \
    --output_dir $HUGGINGFACE_MODEL_DIR

### Inference with Transformers

Before deploying the converted model, test it with the `transformers` library.

Load the model and the tokenizer:

In [None]:
model = transformers.GemmaForCausalLM.from_pretrained(
    HUGGINGFACE_MODEL_DIR,
    local_files_only=True,
    device_map="auto",  # Library "accelerate" to auto-select GPU
)
tokenizer = transformers.GemmaTokenizer.from_pretrained(
    HUGGINGFACE_MODEL_DIR,
    local_files_only=True,
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Test the model:

In [None]:
def test_transformers_model(
    model: transformers.GemmaForCausalLM,
    tokenizer: transformers.GemmaTokenizer,
) -> None:
    for prompt in TEST_PROMPTS:
        inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_length=30)

        output = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"{output}\n{'- '*40}")


test_transformers_model(model, tokenizer)

Instruction:
What are good activities for a toddler?

Response:
Toddlers are very active and curious. They love to explore and learn
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What can we hope to see after rain and sun?

Response:
After rain and sun, we can see the rainbow.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What's the most famous painting by Monet?

Response:
The most famous painting by Monet is "Impression, Sunrise".
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who engineered the Statue of Liberty?

Response:
The Statue of Liberty was designed by a French sculptor, Frederic Auguste Bartholdi
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who were "The Lumières"?

Response:
The Lumières were the inventors of the first motion picture camera. They were
- - - - - - - - - - - - - - - - 

Release resources:

In [None]:
# Release resources
del model, tokenizer

# Free GPU RAM
torch.cuda.empty_cache()

# Restore the default encoding (current issue with the transformers library)
locale.getpreferredencoding = lambda: "UTF-8"

You are now ready to deploy your finetuned model to Vertex AI!

## Deploy Gemma to Vertex AI

Initialize Vertex AI:

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Model upload

Upload the model to your Cloud Storage bucket:

In [None]:
!gcloud storage rsync --recursive --verbosity error $HUGGINGFACE_MODEL_DIR $DEPLOYED_MODEL_URI

Check the objects in the bucket:

In [None]:
!gcloud storage du $DEPLOYED_MODEL_URI --readable-sizes

### Helper functions

Define helper functions to deploy the model in a vLLM container:

In [None]:
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01"


def get_job_name_with_datetime(prefix: str) -> str:
    suffix = datetime.datetime.now().strftime("_%Y%m%d_%H%M%S")
    return f"{prefix}{suffix}"


def deploy_model_vllm(
    model_name: str,
    model_uri: str,
    service_account: str,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    max_model_len: int = 8192,
    dtype: str = "bfloat16",
) -> tuple[aiplatform.Model, aiplatform.Endpoint]:
    # Upload the model to "Model Registry"
    job_name = get_job_name_with_datetime(model_name)
    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.95",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        "--disable-log-stats",
    ]
    model = aiplatform.Model.upload(
        display_name=job_name,
        artifact_uri=model_uri,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    # Deploy the model to an endpoint to serve "Online predictions"
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )

    return model, endpoint

### Model deployment

Deploy the model. This step can take more than 10 minutes.

In [None]:
MODEL_NAME_VLLM = f"{MODEL_NAME}-vllm"

# Start with a G2 Series cost-effective configuration
match MODEL_SIZE:
    case "2b":
        machine_type = "g2-standard-8"
        accelerator_type = "NVIDIA_L4"
        accelerator_count = 1
    case "7b":
        machine_type = "g2-standard-12"
        accelerator_type = "NVIDIA_L4"
        accelerator_count = 1
    case _:
        assert MODEL_SIZE in ("2b", "7b")

# See supported machine/GPU configurations in chosen region:
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute

# For even more performance, consider V100 and A100 GPUs
# > Nvidia Tesla V100
# machine_type = "n1-standard-8"
# accelerator_type = "NVIDIA_TESLA_V100"
# > Nvidia Tesla A100
# machine_type = "a2-highgpu-1g"
# accelerator_type = "NVIDIA_TESLA_A100"

# Larger `max_model_len` values will require more GPU memory
max_model_len = 2048

model, endpoint = deploy_model_vllm(
    MODEL_NAME_VLLM,
    DEPLOYED_MODEL_URI,
    SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    max_model_len=max_model_len,
)
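
The comment about `max_model_len` in the cell above can be made concrete with a rough key/value cache estimate. This sketch is an approximation under stated assumptions: the architecture values for the 2B model (18 layers, 1 key/value head, head dimension 256) are assumed rather than read from the model files, values are stored in bfloat16, and the `kv_cache_bytes` helper is a hypothetical name. The memory actually reserved by vLLM depends on its own allocation settings.

```python
def kv_cache_bytes(
    sequence_length: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Approximate key/value cache size for one sequence of the given length."""
    # One key vector and one value vector per layer, per KV head, per token
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * sequence_length


# Assumed architecture values for the 2B model (check the model config)
ARCH_2B = dict(num_layers=18, num_kv_heads=1, head_dim=256)

for length in (2048, 8192):
    size_mib = kv_cache_bytes(length, **ARCH_2B) / 2**20
    print(f"max_model_len={length}: ~{size_mib:.0f} MiB per sequence")
```

Under these assumptions, the cache for one full-length sequence grows linearly, from about 36 MiB at 2,048 tokens to about 144 MiB at 8,192 tokens. Serving many sequences concurrently multiplies that footprint.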

### Online inference

The model is deployed! Test the endpoint:

In [None]:
def test_vertexai_endpoint(endpoint: aiplatform.Endpoint):
    for question, prompt in zip(TEST_EXAMPLES, TEST_PROMPTS):
        instance = {
            "prompt": prompt,
            "max_tokens": 10,
            "temperature": 0.0,
            "top_p": 1.0,
            "top_k": 1,
            "raw_response": True,
        }
        response = endpoint.predict(instances=[instance])
        output = response.predictions[0]
        print(f"{question}\n{output}\n{'- '*40}")


test_vertexai_endpoint(endpoint)

What are good activities for a toddler?
The best activities for a toddler are those that are
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
What can we hope to see after rain and sun?
After rain and sun, we can see the rainbow
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
What's the most famous painting by Monet?
The most famous painting by Monet is "Impression,
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Who engineered the Statue of Liberty?
The Statue of Liberty was designed by a French sculptor
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Who were "The Lumières"?
The Lumières were the inventors of the first motion
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 


See [vLLM `SamplingParams`](https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py) for more details on the sampling parameters supported by vLLM.
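
To experiment with these parameters, the request payload can be built by a small helper. The sketch below reuses the field names from the test cell above. The `build_instance` helper is a hypothetical name, and the non-greedy values are arbitrary illustrations rather than recommended settings.

```python
import json


def build_instance(prompt: str, max_tokens: int = 10, greedy: bool = True) -> dict:
    """Build one prediction instance with the fields used in the test cell above."""
    instance = {"prompt": prompt, "max_tokens": max_tokens, "raw_response": True}
    if greedy:
        # Temperature 0 with top_k=1 keeps only the most probable token
        instance.update(temperature=0.0, top_p=1.0, top_k=1)
    else:
        # Hypothetical sampling values, for illustration only
        instance.update(temperature=0.8, top_p=0.95, top_k=40)
    return instance


prompt = "Instruction:\nWhat is a rainbow?\n\nResponse:\n"
print(json.dumps(build_instance(prompt), indent=2))
```

The resulting dictionary can be passed to `endpoint.predict(instances=[...])` in the same way as the instances built in `test_vertexai_endpoint()`.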

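## Cleaning up

To clean up all the Google Cloud resources used in this project, you can delete the [Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources created in this tutorial: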

In [None]:
delete_model = False
delete_objects = False
delete_bucket = False

if delete_model:
    endpoint.delete(force=True)
    model.delete()
if delete_objects:
    !gcloud storage rm --recursive $BUCKET_URI/**
if delete_bucket:
    !gcloud storage buckets delete $BUCKET_URI

## What's next?

- Explore the [Vertex AI Model Garden](https://console.cloud.google.com/vertex-ai/model-garden)
- See also how to [serve Gemma open models using GPUs on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-vllm)
- Learn more about [KerasNLP](https://keras.io/keras_nlp)
- Learn more about [vLLM](https://github.com/vllm-project/vllm)