In [None]:
# @title Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get tuned text embeddings on Vertex AI

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/tuned_text-embeddings.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo">
      Open in Colab
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/generative_ai/tuned_text-embeddings.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/tuned_text-embeddings.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

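## Overview

This notebook walks you through tuning a text embedding model. Adapting the model to your specific domain or task can yield better results. See also [Tune text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings).

### Objective

In this tutorial, you learn how to tune the text embedding model textembedding-gecko.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI
- Cloud Storage

The steps performed include:

- Run a text embedding model tuning job on Vertex AI Pipelines.
- Deploy the tuned text embedding model to an endpoint.
- Check quality metrics to evaluate the tuned model.
- Get embeddings from the tuned model for downstream tasks.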

### Dataset

This notebook uses a synthetic dataset (corpus, queries, and labels) derived from [Alphabet's annual financial report (Form 10-K)](https://abc.xyz/assets/ff/7c/06d6f493f6462caf08e8502ffc33/596de1b094c32cf0592a08edfe84ae74.html).

* [corpus.jsonl](https://storage.googleapis.com/cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/corpus.jsonl) (397Ki)
* [queries.jsonl](https://storage.googleapis.com/cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/queries.jsonl) (321Ki)
* [test.tsv](https://storage.googleapis.com/cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/test.tsv) (3.7Ki)
* [train.tsv](https://storage.googleapis.com/cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/train.tsv) (29Ki)
* [validation.tsv](https://storage.googleapis.com/cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/validation.tsv) (3.7Ki)
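Before pointing a tuning job at your own data, it can help to check that the files parse. The sketch below assumes the corpus and query files are JSON Lines with `_id`, `title`, and `text` fields, and that the label files are tab-separated with `query-id`, `corpus-id`, and `score` columns. These field names and the sample records are assumptions for illustration, not copied from the dataset above; check the tuning documentation for the authoritative format.

```python
import csv
import io
import json

# Hypothetical sample records; the real files are much larger.
corpus_jsonl = '{"_id": "doc1", "title": "Revenue", "text": "Revenues grew in 2023."}\n'
queries_jsonl = '{"_id": "q1", "text": "How did revenue change?"}\n'
labels_tsv = "query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n"


def read_jsonl(raw: str) -> dict:
    """Index JSON Lines records by their (assumed) `_id` field."""
    records = (json.loads(line) for line in raw.splitlines() if line.strip())
    return {record["_id"]: record for record in records}


def read_labels(raw: str) -> list:
    """Parse tab-separated (query-id, corpus-id, score) rows."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
    return [(row["query-id"], row["corpus-id"], int(row["score"])) for row in reader]


corpus = read_jsonl(corpus_jsonl)
queries = read_jsonl(queries_jsonl)
labels = read_labels(labels_tsv)

# Every label should point at a known query and a known document.
for query_id, corpus_id, score in labels:
    print(queries[query_id]["text"], "->", corpus[corpus_id]["title"], score)
```

A `KeyError` in the final loop would indicate a label that references a query or document missing from the other files.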

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing)
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

This tutorial requires the `google-cloud-aiplatform` package.

In [None]:
! pip3 install --upgrade --user --quiet google-cloud-aiplatform

### Restart the runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

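## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,documentai.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

See also [Set up permissions and resources for tuning text embedding models](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings#project-setup).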

### Initialize Vertex AI

Import and initialize the Vertex AI SDK for your project and region.

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

See also [Vertex AI locations](https://cloud.google.com/vertex-ai/docs/general/locations).
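A placeholder such as `[your-project-id]` is not an empty string, so a plain emptiness check does not catch it. As a minimal sketch, the hypothetical helper below (`looks_like_project_id` is not part of any Google library) checks a candidate against the commonly documented project ID format: 6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen. It does not cover legacy domain-scoped IDs and does not verify that the project exists.

```python
import re

# Assumed format rule: 6-30 characters, lowercase letters, digits, and hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(project_id: str) -> bool:
    """Return True if the string is shaped like a Google Cloud project ID."""
    return _PROJECT_ID_PATTERN.fullmatch(project_id.strip()) is not None


for candidate in ["[your-project-id]", "my-project-123", ""]:
    print(repr(candidate), looks_like_project_id(candidate))
```

Running a check like this before `vertexai.init` surfaces a forgotten placeholder immediately instead of as a later API error.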

In [None]:
# @title (Required) Set PROJECT_ID and REGION
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
if not PROJECT_ID.strip():
    raise ValueError("'PROJECT_ID' is required.")
if not REGION.strip():
    raise ValueError("'REGION' is required.")
!gcloud config set project {PROJECT_ID}

import vertexai
from google.cloud.aiplatform import pipeline_jobs
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project=PROJECT_ID, location=REGION)

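### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to authenticate manually. Follow the relevant instructions below.

1. Colab: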

In [None]:
# @title (Required on Colab) `authenticate_user()`
import builtins
import os
import sys

running_in_colab = "google.colab" in sys.modules and hasattr(builtins, "get_ipython")
if running_in_colab and not os.getenv("IS_TESTING"):
    from google.colab import auth

    auth.authenticate_user(project_id=PROJECT_ID)

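2. Vertex AI Workbench: Make sure that the Compute Engine default service account that runs the Workbench instance has the `iam.serviceAccounts.actAs` permission (most likely through `roles/iam.serviceAccountUser`). You can check this on the IAM & Admin page of the Cloud console. This permission lets the Workbench instance act as the service account when it interacts with other Google Cloud services.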

3. Local JupyterLab instance: uncomment and run:

In [None]:
# !gcloud auth login

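## Tune text embeddings

(Optional) To resume this tutorial from a previous tuning session, set **`TUNING_JOB_ID`** accordingly. Otherwise, clear **`TUNING_JOB_ID`** to start a fresh tuning session.

This tutorial creates a tuning job on Vertex AI Pipelines in your project to tune the text embedding model. For details, see [Create a tuning job with parameters and defaults](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings#create-embedding-tuning-job) and the latest [text embedding models and available tasks](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#api_changes_to_models_released_on_or_after_august_2023).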

In [None]:
# @title (Optional) Set `TUNING_JOB_ID` to resume an existing tuning session: { run: "auto" }
TUNING_JOB_ID = ""  # @param {type: "string"}

In [None]:
# @title (Required) Resume an existing or start a fresh tuning session (depending on TUNING_JOB_ID)
BASE_MODEL = "text-embedding-004"  # @param ["textembedding-gecko@003", "text-embedding-004", "textembedding-gecko-multilingual@001", "text-multilingual-embedding-002"]
TASK = "DEFAULT"  # @param ["DEFAULT", "RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT", "SEMANTIC_SIMILARITY", "CLASSIFICATION", "CLUSTERING", "QUESTION_ANSWERING", "FACT_VERIFICATION"]
CORPUS_DATA = "gs://cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/corpus.jsonl"  # @param {type: "string"}
QUERIES_DATA = "gs://cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/queries.jsonl"  # @param {type: "string"}
TRAINING_DATA = "gs://cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/train.tsv"  # @param{type: "string"}
VALIDATION_DATA = "gs://cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/validation.tsv"  # @param{type: "string"}
TEST_DATA = "gs://cloud-samples-data/ai-platform/embedding/goog-10k-2024/r11/test.tsv"  # @param{type: "string"}
BATCH_SIZE = 128  # @param {type: "number"}
TRAIN_STEPS = 1000  # @param {type: "number"}
OUTPUT_DIMENSIONALITY = 768  # @param {type: "number"}
LEARNING_RATE_MULTIPLIER = 1.0  # @param {type: "number"}

# Synchronously validate some edge cases that will cause async validation to fail.
if BASE_MODEL not in ["text-embedding-004", "text-multilingual-embedding-002"]:
    if TASK in ["QUESTION_ANSWERING", "FACT_VERIFICATION"]:
        raise ValueError(f"TASK '{TASK}' is not valid for model '{BASE_MODEL}'.")

    if OUTPUT_DIMENSIONALITY not in [-1, 768]:
        raise ValueError(f"Model '{BASE_MODEL}' does not support the output_dimensionality parameter.")

base_model = TextEmbeddingModel.from_pretrained(BASE_MODEL)
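if "TUNING_JOB_ID" in locals() and TUNING_JOB_ID:
    # Look up the existing pipeline job; avoid shadowing the built-in `filter`.
    job_filter = f'pipelineJobUserId="{TUNING_JOB_ID}"'
    tuning_job = next(iter(pipeline_jobs.PipelineJob.list(filter=job_filter)), None)
    if tuning_job is None:
        raise ValueError(f"No tuning job found for TUNING_JOB_ID '{TUNING_JOB_ID}'.")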
    print(
        f"Got an existing tuning job '{tuning_job.name}' (state: {tuning_job.state.name})."
    )
else:
    tuning_result = base_model.tune_model(
        task_type=TASK,
        corpus_data=CORPUS_DATA,
        queries_data=QUERIES_DATA,
        training_data=TRAINING_DATA,
        validation_data=VALIDATION_DATA,
        test_data=TEST_DATA,
        batch_size=BATCH_SIZE,
        train_steps=TRAIN_STEPS,
        tuned_model_location=REGION,
        learning_rate_multiplier=LEARNING_RATE_MULTIPLIER,
        output_dimensionality=OUTPUT_DIMENSIONALITY,
    )
    tuning_job = pipeline_jobs.PipelineJob.get(tuning_result.pipeline_job_name)
    print(
        f"Got a fresh tuning job '{tuning_job.name}' (state: {tuning_job.state.name})."
    )
    print(
        f"(OPTIONAL) Set 'TUNING_JOB_ID' to '{tuning_job.name}' when you want to resume this tuning session."
    )

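## Deploy the tuned text embedding model

After the tuning job completes, you can deploy the tuned model to an endpoint in either of the following ways:

- Call the `deploy_tuned_model` method on the tuning result object returned by the base model object's `tune_model` method.
- Call the class method `TextEmbeddingModel.deploy_tuned_model` with a tuned model resource name. You can retrieve the resource name of the uploaded, tuned model from the tuning job. The resource name is a string of the form `projects/{PROJECT_ID}/locations/{REGION}/models/{MODEL_ID}`.

**Important**: Because a tuned text embedding model is custom-trained, deploying it requires serving resources, such as machines and accelerators, to be allocated to your project. The first call to `deploy_tuned_model` deploys the tuned model to new serving resources. Subsequent calls only retrieve the existing deployment.

See also [Use the tuned model](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings#use-tuned-model).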
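If you pass a model resource name by hand, a quick shape check can catch copy-paste mistakes before a deployment call fails. The sketch below validates a string against the `projects/{PROJECT_ID}/locations/{REGION}/models/{MODEL_ID}` pattern described above. The helper `parse_model_resource_name` and the sample name are hypothetical and not part of the Vertex AI SDK.

```python
import re

# Pattern taken from the resource name format described in the text above.
MODEL_NAME_PATTERN = re.compile(
    r"projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/models/(?P<model_id>[^/]+)"
)


def parse_model_resource_name(name: str) -> dict:
    """Split a model resource name into its project, location, and model ID."""
    match = MODEL_NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Not a model resource name: {name!r}")
    return match.groupdict()


# Hypothetical resource name used only for illustration.
parts = parse_model_resource_name("projects/123456/locations/us-central1/models/987654")
print(parts)
```

Comparing `parts["location"]` with `REGION` is one way to notice a model that was uploaded to a different region than the one you initialized.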

In [None]:
MACHINE_TYPE = "a2-highgpu-1g"  # @param {type: "string"}
ACCELERATOR = "NVIDIA_TESLA_A100"  # @param {type: "string"}
ACCELERATOR_COUNT = 1  # @param {type: "number"}

# CAVEAT: Colab disruptions may cause 'tuning_result' to be undefined.
running_interactively = not os.getenv("IS_TESTING")
if "tuning_job" not in locals() and running_interactively:
    message = "[Action Required] Run the preceding code cells to define 'tuning_job'."
    raise RuntimeError(message)
if "tuning_job" in locals() and running_interactively:
    if "tuning_result" in locals():
        model = tuning_result.deploy_tuned_model(
            machine_type=MACHINE_TYPE,
            accelerator=ACCELERATOR,
            accelerator_count=ACCELERATOR_COUNT,
        )
        print(f"Got deployed, tuned {model=}")
    else:
        tuning_job.wait()
        tasks = tuning_job.task_details
        upload_task = next(t for t in tasks if "uploader" in t.task_name)
        upload_metadata = dict(upload_task.execution.metadata)
        tuned_model_name = upload_metadata["output:model_resource_name"]
        model = TextEmbeddingModel.deploy_tuned_model(
            tuned_model_name=tuned_model_name,
            machine_type=MACHINE_TYPE,
            accelerator=ACCELERATOR,
            accelerator_count=ACCELERATOR_COUNT,
        )
        print(f"Got deployed, tuned model '{tuned_model_name}'.")

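### Inspect quality metrics from the tuning job

The pipeline evaluates the base model and the tuned model during tuning by computing the NDCG@10 metric on the validation and test datasets. These metrics are available in the `metrics` artifact of the pipeline job.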
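For intuition about what NDCG@10 measures, here is a minimal sketch of one common formulation: linear gain with a log2 rank discount, normalized by the score of the ideal ordering. The exact variant used by the tuning pipeline is not stated here, so treat this as an illustration rather than a reimplementation. It also assumes the input list contains the relevance of every relevant document.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (rank 1 is index 0)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of retrieved documents, in ranked order (made-up values).
perfect = ndcg_at_k([1, 0, 0])   # the relevant document is ranked first
delayed = ndcg_at_k([0, 1, 0])   # the relevant document is ranked second
print(perfect, round(delayed, 4))
```

A score of 1.0 means the relevant documents appear at the top of the ranking; the score drops as they move further down, so a tuned model that scores higher than the base model is ranking relevant documents earlier.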

In [None]:
import pandas as pd

if "tuning_job" in locals() and tuning_job.done():
    tasks = tuning_job.task_details
    eval_task = next(t for t in tasks if "evaluator" in t.task_name)
    metrics = dict(eval_task.outputs["metrics"].artifacts[0].metadata)
    metrics_df = pd.DataFrame.from_dict(
        {"metric": metrics.keys(), "value": metrics.values()}
    )
    display(metrics_df.sort_values(by="metric", ignore_index=True))

Get tuned embeddings for downstream tasks.

In [None]:
import pandas as pd

if "model" in locals():
    texts = ["banana muffins?"]  # @param {type:"raw"}
    titles = ["none"]  # @param {type:"raw"}
    embedding_inputs = [
        TextEmbeddingInput(text=text, task_type=TASK, title=title)
        for text, title in zip(texts, titles)
    ]
    tuned_embeddings = [
        pd.Series(e.values) for e in model.get_embeddings(embedding_inputs)
    ]
    display(tuned_embeddings)
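A typical downstream task is retrieval: embed a query and a set of documents, then rank the documents by similarity to the query. The sketch below uses cosine similarity on small made-up vectors so that it runs without a deployed model. In practice the vectors would be the `values` of the embeddings returned by `model.get_embeddings`. The helper names and the sample vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return (document ID, score) pairs sorted by descending similarity."""
    scores = {
        doc_id: cosine_similarity(query_vector, vector)
        for doc_id, vector in document_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
query_vector = [1.0, 0.0, 0.0]
document_vectors = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.0, 1.0, 0.0]}
ranking = rank_documents(query_vector, document_vectors)
print(ranking)
```

Cosine similarity is a common choice because it ignores vector length and compares direction only; a dot product behaves the same way when the embeddings are normalized to unit length.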

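## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.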