In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/feature_store/gapic-feature-store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/feature_store/gapic-feature-store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

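## Overview

This Colab introduces Vertex AI Feature Store, a managed cloud service that lets machine learning engineers and data scientists store, serve, manage, and share machine learning features at scale.

This Colab assumes that you understand basic Google Cloud concepts such as [Project](https://cloud.google.com/storage/docs/projects), [Storage](https://cloud.google.com/storage), and [Vertex AI](https://cloud.google.com/vertex-ai/docs). Some machine learning knowledge is also helpful but not required.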

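### Dataset

This Colab uses a movie recommendation dataset as an example throughout all the sessions. The task is to train a model that predicts whether a user will watch a movie, and to serve that model online.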

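### Objective

In this notebook, you will learn how to:

* Import your features into Vertex AI Feature Store.
* Serve online prediction requests using the imported features.
* Access imported features in offline jobs, such as training jobs.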

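### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Cloud Bigtable

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.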

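### Set up your local development environment

**If you are using Colab or Google Cloud Notebooks**, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.

4. To install Jupyter, run `pip install jupyter` in a terminal shell.

5. To launch Jupyter, run `jupyter notebook` in a terminal shell.

6. Open this notebook in the Jupyter Notebook Dashboard.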

### Install additional packages

For this Colab, you need the Vertex SDK for Python.

In [None]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
! pip install {USER_FLAG} --upgrade google-cloud-aiplatform

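### Restart the kernel

After you install the SDK, you need to restart the notebook kernel so it can find the packages. You can restart the kernel from *Kernel -> Restart Kernel*, or run the following: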

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # do_shutdown(True) asks for a restart rather than a plain shutdown.
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

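## Before you begin

### Select a GPU runtime

**Make sure you're running this notebook in a GPU runtime if you can. In Colab, select "Runtime --> Change runtime type > GPU"**

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API and Compute Engine API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get it using `gcloud`.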

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "python-docs-samples-tests"  # @param {type:"string"}

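#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a UUID for each instance session and append it to the names of the resources you create in this tutorial.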

In [None]:
import random
import string


# Generate a random ID of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

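### Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.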

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

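## Prepare for output

### Step 1. Create a dataset for output

You need a BigQuery dataset hosted in `us-central1` to hold the output data. Enter the name of the dataset you want to create and the name of the table you want to store the output in. Both are used later in the notebook.

**Make sure that the table name does not already exist.**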

In [None]:
from datetime import datetime

from google.cloud import bigquery

In [None]:
# Output dataset
DESTINATION_DATA_SET = "movie_predictions"  # @param {type:"string"}
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
DESTINATION_DATA_SET = "{prefix}_{timestamp}".format(
    prefix=DESTINATION_DATA_SET, timestamp=TIMESTAMP
)

# Output table. Make sure that the table does NOT already exist; the BatchReadFeatureValues API cannot overwrite an existing table
DESTINATION_TABLE_NAME = "training_data"  # @param {type:"string"}

DESTINATION_PATTERN = "bq://{project}.{dataset}.{table}"
DESTINATION_TABLE_URI = DESTINATION_PATTERN.format(
    project=PROJECT_ID, dataset=DESTINATION_DATA_SET, table=DESTINATION_TABLE_NAME
)

In [None]:
# Create dataset
REGION = "us-central1"  # @param {type:"string"}
client = bigquery.Client(project=PROJECT_ID)
dataset_id = "{}.{}".format(client.project, DESTINATION_DATA_SET)
dataset = bigquery.Dataset(dataset_id)
dataset.location = REGION
dataset = client.create_dataset(dataset)
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

### Import libraries and define constants

In [None]:
# Besides the project ID and featurestore ID, the API endpoint and input file must be set.
API_ENDPOINT = "us-central1-aiplatform.googleapis.com"  # @param {type:"string"}
INPUT_CSV_FILE = "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/movie_prediction.csv"

In [None]:
from google.cloud.aiplatform_v1 import (FeaturestoreOnlineServingServiceClient,
                                        FeaturestoreServiceClient)
from google.cloud.aiplatform_v1.types import FeatureSelector, IdMatcher
from google.cloud.aiplatform_v1.types import entity_type as entity_type_pb2
from google.cloud.aiplatform_v1.types import feature as feature_pb2
from google.cloud.aiplatform_v1.types import featurestore as featurestore_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_online_service as featurestore_online_service_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_service as featurestore_service_pb2
from google.cloud.aiplatform_v1.types import io as io_pb2

# Create admin_client for CRUD and data_client for reading feature values.
admin_client = FeaturestoreServiceClient(client_options={"api_endpoint": API_ENDPOINT})
data_client = FeaturestoreOnlineServingServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

# Represents featurestore resource path.
BASE_RESOURCE_PATH = admin_client.common_location_path(PROJECT_ID, REGION)

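## Terminology and concepts

### Featurestore data model

Vertex AI Feature Store organizes data with the following three hierarchical concepts:
```
Featurestore -> EntityType -> Feature
```
* **Featurestore**: the place to store your features.
* **EntityType**: under a Featurestore, an *EntityType* describes an object to be modeled, real or virtual.
* **Feature**: under an EntityType, a *Feature* describes an attribute of the EntityType.

In the movie prediction example, you will create a featurestore called *movie_prediction*. This store has two entity types: *Users* and *Movies*. The Users entity type has the age, gender, and liked genres features. The Movies entity type has the title, genres, and average rating features.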
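The three-level hierarchy is also visible in the resource names that the client's `featurestore_path` and `entity_type_path` helpers produce. The following is a minimal sketch that rebuilds those names with plain string formatting; the helper functions and the project ID `my-project` are hypothetical and only illustrate the nesting.

```python
def featurestore_name(project: str, location: str, featurestore_id: str) -> str:
    """Top level: a featurestore lives under a project and a location."""
    return f"projects/{project}/locations/{location}/featurestores/{featurestore_id}"


def entity_type_name(project: str, location: str, featurestore_id: str,
                     entity_type_id: str) -> str:
    """Second level: an entity type is nested under a featurestore."""
    parent = featurestore_name(project, location, featurestore_id)
    return f"{parent}/entityTypes/{entity_type_id}"


def feature_name(project: str, location: str, featurestore_id: str,
                 entity_type_id: str, feature_id: str) -> str:
    """Third level: a feature is nested under an entity type."""
    parent = entity_type_name(project, location, featurestore_id, entity_type_id)
    return f"{parent}/features/{feature_id}"


# Placeholder IDs chosen for illustration; substitute your own.
name = feature_name("my-project", "us-central1", "movie_prediction", "users", "age")
print(name)
```

Reading the printed name from left to right walks the hierarchy from the featurestore down to a single feature.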

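## Create featurestore and define schemas

### Create featurestore

The method that creates a featurestore returns a [long-running operation](https://google.aip.dev/151) (LRO). An LRO starts an asynchronous job. Other API methods, such as updating or deleting a featurestore, also return LROs. Call `create_lro.result()` to wait for the LRO to complete.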

In [None]:
FEATURESTORE_ID = f"movie_prediction_{UUID}"
try:
    create_lro = admin_client.create_featurestore(
        featurestore_service_pb2.CreateFeaturestoreRequest(
            parent=BASE_RESOURCE_PATH,
            featurestore_id=FEATURESTORE_ID,
            featurestore=featurestore_pb2.Featurestore(
                online_serving_config=featurestore_pb2.Featurestore.OnlineServingConfig(
                    fixed_node_count=1
                ),
            ),
        )
    )
    # Wait for LRO to finish and get the LRO result.
    print(create_lro.result())
except Exception as e:
    print(e)
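The start-now, wait-later pattern of an LRO can be illustrated without any cloud calls. The following sketch is only an analogy built on the standard library's `concurrent.futures`; `create_resource` is a hypothetical stand-in for the server-side job, and the real operation object returned by the client is a different class.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def create_resource(resource_id: str) -> dict:
    """Hypothetical slow job standing in for server-side work (not a Vertex AI call)."""
    time.sleep(0.1)
    return {"name": resource_id, "done": True}


with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns a handle immediately, much as a create call returns an LRO.
    operation = executor.submit(create_resource, "movie_prediction_demo")
    # result() blocks until the job finishes (or the timeout expires).
    created = operation.result(timeout=5)

print(created)
```

The caller is free to do other work between `submit()` and `result()`, which is the main reason slow operations are exposed this way.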

You can use [GetFeaturestore](https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1#google.cloud.aiplatform.v1.FeaturestoreService.GetFeaturestore) or [ListFeaturestores](https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1#google.cloud.aiplatform.v1.FeaturestoreService.ListFeaturestores) to check whether the Featurestore was created successfully. The following example gets the details of the Featurestore.

In [None]:
admin_client.get_featurestore(
    name=admin_client.featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID)
)

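As of v1.11, autoscaling is available in v1. The following example builds a `CreateFeaturestoreRequest` with autoscaling enabled; the cell only constructs the request, so pass it to `aiplatform_v1.FeaturestoreServiceClient` to actually create the featurestore: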

In [None]:
from google.cloud.aiplatform_v1.types import \
    featurestore as v1_featurestore_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_service as v1_featurestore_service_pb2

create_featurestore_request = v1_featurestore_service_pb2.CreateFeaturestoreRequest(
    parent=BASE_RESOURCE_PATH,
    featurestore_id=FEATURESTORE_ID,
    featurestore=v1_featurestore_pb2.Featurestore(
        online_serving_config=v1_featurestore_pb2.Featurestore.OnlineServingConfig(
            scaling=v1_featurestore_pb2.Featurestore.OnlineServingConfig.Scaling(
                min_node_count=1, max_node_count=5
            )
        ),
    ),
)

### Create entity type

In [None]:
try:
    users_entity_type_lro = admin_client.create_entity_type(
        featurestore_service_pb2.CreateEntityTypeRequest(
            parent=admin_client.featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID),
            entity_type_id="users",
            entity_type=entity_type_pb2.EntityType(
                description="Users entity",
            ),
        )
    )
    # Similarly, wait for EntityType creation operation.
    print(users_entity_type_lro.result())
except Exception as e:
    print(e)

In [None]:
# Create movies entity type without a monitoring configuration.
try:
    movies_entity_type_lro = admin_client.create_entity_type(
        featurestore_service_pb2.CreateEntityTypeRequest(
            parent=admin_client.featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID),
            entity_type_id="movies",
            entity_type=entity_type_pb2.EntityType(description="Movies entity"),
        )
    )

    # Similarly, wait for EntityType creation operation.
    print(movies_entity_type_lro.result())
except Exception as e:
    print(e)

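Feature [monitoring](https://cloud.google.com/vertex-ai/docs/featurestore/monitoring) is currently in preview, so you need to use the v1 Python client. Import feature analysis can currently be configured only through the SDK.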

In [None]:
from google.cloud.aiplatform_v1 import \
    FeaturestoreServiceClient as v1_FeaturestoreServiceClient
from google.cloud.aiplatform_v1.types import entity_type as v1_entity_type_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_monitoring as v1_featurestore_monitoring_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_service as v1_featurestore_service_pb2

v1_admin_client = v1_FeaturestoreServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

# Enable import feature analysis for users entity type.
# All Features belonging to this EntityType will by default inherit the monitoring config.
v1_admin_client.update_entity_type(
    v1_featurestore_service_pb2.UpdateEntityTypeRequest(
        entity_type=v1_entity_type_pb2.EntityType(
            name=admin_client.entity_type_path(
                PROJECT_ID, REGION, FEATURESTORE_ID, "users"
            ),
            monitoring_config=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig(
                import_features_analysis=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ImportFeaturesAnalysis(
                    anomaly_detection_baseline=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ImportFeaturesAnalysis.Baseline.LATEST_STATS,
                    state=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ImportFeaturesAnalysis.State.ENABLED,
                ),
                numerical_threshold_config=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ThresholdConfig(
                    value=0.001,
                ),
                categorical_threshold_config=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ThresholdConfig(
                    value=0.001,
                ),
            ),
        ),
    )
)

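The easiest way to set up snapshot analysis is currently the [console UI](https://console.cloud.google.com/vertex-ai/features). For completeness, the following example does the same with the v1 SDK.

You can view the monitoring statistics in the [console UI](https://console.cloud.google.com/vertex-ai/features).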

In [None]:
from google.cloud.aiplatform_v1 import \
    FeaturestoreServiceClient as v1_FeaturestoreServiceClient
from google.cloud.aiplatform_v1.types import entity_type as v1_entity_type_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_monitoring as v1_featurestore_monitoring_pb2
from google.cloud.aiplatform_v1.types import \
    featurestore_service as v1_featurestore_service_pb2

v1_admin_client = v1_FeaturestoreServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

# Enable snapshot analysis for users entity type.
# All Features belonging to this EntityType will by default inherit the monitoring config.
v1_admin_client.update_entity_type(
    v1_featurestore_service_pb2.UpdateEntityTypeRequest(
        entity_type=v1_entity_type_pb2.EntityType(
            name=admin_client.entity_type_path(
                PROJECT_ID, REGION, FEATURESTORE_ID, "users"
            ),
            monitoring_config=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig(
                snapshot_analysis=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.SnapshotAnalysis(
                    monitoring_interval_days=1,  # 1 day
                    staleness_days=30,
                ),
                numerical_threshold_config=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ThresholdConfig(
                    value=0.001,
                ),
                categorical_threshold_config=v1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.ThresholdConfig(
                    value=0.001,
                ),
            ),
        ),
    )
)

### Create features

In [None]:
# Create features for the 'users' entity.
try:
    admin_client.batch_create_features(
        parent=admin_client.entity_type_path(
            PROJECT_ID, REGION, FEATURESTORE_ID, "users"
        ),
        requests=[
            featurestore_service_pb2.CreateFeatureRequest(
                feature=feature_pb2.Feature(
                    value_type=feature_pb2.Feature.ValueType.INT64,
                    description="User age",
                    disable_monitoring=False,
                ),
                feature_id="age",
            ),
            featurestore_service_pb2.CreateFeatureRequest(
                feature=feature_pb2.Feature(
                    value_type=feature_pb2.Feature.ValueType.STRING,
                    description="User gender",
                    # Default is False. If True, Feature 'gender' monitoring analysis is disabled.
                    disable_monitoring=True,
                ),
                feature_id="gender",
            ),
            featurestore_service_pb2.CreateFeatureRequest(
                feature=feature_pb2.Feature(
                    value_type=feature_pb2.Feature.ValueType.STRING_ARRAY,
                    description="An array of genres that this user liked",
                ),
                feature_id="liked_genres",
            ),
        ],
    ).result()
except Exception as e:
    print(e)

In [None]:
# Create features for movies type.
try:
    admin_client.batch_create_features(
        parent=admin_client.entity_type_path(
            PROJECT_ID, REGION, FEATURESTORE_ID, "movies"
        ),
        requests=[
            featurestore_service_pb2.CreateFeatureRequest(
                feature=feature_pb2.Feature(
                    value_type=feature_pb2.Feature.ValueType.STRING,
                    description="The title of the movie",
                ),
                feature_id="title",
            ),
            featurestore_service_pb2.CreateFeatureRequest(
                feature=feature_pb2.Feature(
                    value_type=feature_pb2.Feature.ValueType.STRING,
                    description="The genres of the movie",
                ),
                feature_id="genres",
            ),
            featurestore_service_pb2.CreateFeatureRequest(
                feature=feature_pb2.Feature(
                    value_type=feature_pb2.Feature.ValueType.DOUBLE,
                    description="The average rating for the movie, range is [1.0-5.0]",
                ),
                feature_id="average_rating",
            ),
        ],
    ).result()
except Exception as e:
    print(e)

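### Search created features

While the [ListFeatures](https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1#google.cloud.aiplatform.v1.FeaturestoreService.ListFeatures) method lets you easily view all features of a single entity type, the [SearchFeatures](https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1#google.cloud.aiplatform.v1.FeaturestoreService.SearchFeatures) method searches across all featurestores and entity types in a given location (such as `us-central1`). This can help you discover features that were created by someone else.

You can query based on feature properties, including feature ID, entity type ID, and feature description. You can also limit results by filtering on a specific featurestore, feature value type, and/or labels.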

In [None]:
# Search for all features across all featurestores.
list(admin_client.search_features(location=BASE_RESOURCE_PATH))

Now, narrow the search to features of type `DOUBLE`.

In [None]:
# Search for all features with value type `DOUBLE`
list(
    admin_client.search_features(
        featurestore_service_pb2.SearchFeaturesRequest(
            location=BASE_RESOURCE_PATH, query="value_type=DOUBLE"
        )
    )
)

Or, limit the search results to features that match a specific keyword in their ID and have a specific value type.

In [None]:
# Filter on feature value type and keywords.
list(
    admin_client.search_features(
        featurestore_service_pb2.SearchFeaturesRequest(
            location=BASE_RESOURCE_PATH, query="feature_id:title AND value_type=STRING"
        )
    )
)
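The query strings used in the search cells are plain text joined with `AND`. The following sketch shows one way to assemble them from individual clauses; `build_search_query` is a hypothetical helper, not part of the SDK, and it only reuses the two clause forms that appear in this notebook.

```python
def build_search_query(*clauses: str) -> str:
    """Join non-empty clauses with AND, the conjunction used in the queries above."""
    parts = [clause.strip() for clause in clauses if clause and clause.strip()]
    return " AND ".join(parts)


# The same two clauses as in the last search cell.
query = build_search_query("feature_id:title", "value_type=STRING")
print(query)
```

The resulting string can be passed as the `query` argument of `SearchFeaturesRequest`, as in the cells above.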

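## Import feature values

You need to import feature values before you can use them for online or offline serving. In this step, you learn how to import feature values by calling the ImportFeatureValues API through the Python SDK.

### Source data format and layout

BigQuery tables, Avro, and CSV are supported as sources. Whichever format you use, each imported entity *must* have an ID; each entity can also *optionally* carry a timestamp that specifies when the feature values were generated. This Colab uses Avro as the input, located in this public [bucket](https://pantheon.corp.google.com/storage/browser/cloud-samples-data/ai-platform-unified/datasets/featurestore;tab=objects?project=storage-samples&prefix=&forceOnObjectsSortingFiltering=false). The Avro schemas are as follows:

**For the Users entity**: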
```
schema = {
  "type": "record",
  "name": "User",
  "fields": [
      {
       "name":"user_id",
       "type":["null","string"]
      },
      {
       "name":"age",
       "type":["null","long"]
      },
      {
       "name":"gender",
       "type":["null","string"]
      },
      {
       "name":"liked_genres",
       "type":{"type":"array","items":"string"}
      },
      {
       "name":"update_time",
       "type":["null",{"type":"long","logicalType":"timestamp-micros"}]
      },
  ]
 }
```

**For the Movies entity**:
```
schema = {
 "type": "record",
 "name": "Movie",
 "fields": [
     {
      "name":"movie_id",
      "type":["null","string"]
     },
     {
      "name":"average_rating",
      "type":["null","double"]
     },
     {
      "name":"title",
      "type":["null","string"]
     },
     {
      "name":"genres",
      "type":["null","string"]
     },
     {
      "name":"update_time",
      "type":["null",{"type":"long","logicalType":"timestamp-micros"}]
     },
 ]
}
```
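Both schemas store `update_time` as an Avro `timestamp-micros` value, that is, a long that counts microseconds since the Unix epoch. The following sketch builds one record shaped like the Users schema using only the standard library; the field values are illustrative and are not taken from the actual input file.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_micros(moment: datetime) -> int:
    """Return microseconds since the Unix epoch for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    # Dividing two timedeltas with // yields an exact integer.
    return (moment - EPOCH) // timedelta(microseconds=1)


# Illustrative record that matches the field names of the Users schema.
user_record = {
    "user_id": "alice",
    "age": 55,
    "gender": "F",
    "liked_genres": ["Drama", "Comedy"],
    "update_time": to_timestamp_micros(datetime(2019, 11, 1, tzinfo=timezone.utc)),
}
print(user_record)
```

Writing such records to an `.avro` file requires an Avro library, which is outside the scope of this sketch.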

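### Import feature values for users

When importing, specify the following in your request:

* Data source format: BigQuery table, Avro, or CSV
* Data source URL
* Destination: the featurestore, entity type, and features to import into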
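The source format decides which source field of `ImportFeatureValuesRequest` you populate: `avro_source`, `csv_source`, or `bigquery_source`. The following sketch picks the field name from a URI. Inferring the format from the URI prefix and file extension is an assumption made for illustration; the API does not do this for you, and `pick_source_field` and the `bq://` URI are hypothetical.

```python
def pick_source_field(uri: str) -> str:
    """Guess which request source field fits a URI (illustrative convention only)."""
    if uri.startswith("bq://"):
        return "bigquery_source"
    if uri.startswith("gs://"):
        lowered = uri.lower()
        if lowered.endswith(".avro"):
            return "avro_source"
        if lowered.endswith(".csv"):
            return "csv_source"
    raise ValueError(f"cannot infer a source type from {uri!r}")


users_uri = "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/users.avro"
print(pick_source_field(users_uri))
# Hypothetical BigQuery table URI, shown only to exercise the other branch.
print(pick_source_field("bq://my-project.my_dataset.my_table"))
```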

In [None]:
import_users_request = featurestore_service_pb2.ImportFeatureValuesRequest(
    entity_type=admin_client.entity_type_path(
        PROJECT_ID, REGION, FEATURESTORE_ID, "users"
    ),
    avro_source=io_pb2.AvroSource(
        # Source
        gcs_source=io_pb2.GcsSource(
            uris=[
                "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/users.avro"
            ]
        )
    ),
    entity_id_field="user_id",
    feature_specs=[
        # Features
        featurestore_service_pb2.ImportFeatureValuesRequest.FeatureSpec(id="age"),
        featurestore_service_pb2.ImportFeatureValuesRequest.FeatureSpec(id="gender"),
        featurestore_service_pb2.ImportFeatureValuesRequest.FeatureSpec(
            id="liked_genres"
        ),
    ],
    feature_time_field="update_time",
    worker_count=1,
    # Default is False. If True, the import feature analysis won't happen for this specific operation.
    disable_ingestion_analysis=False,
)

In [None]:
# Start to import, will take a couple of minutes
ingestion_lro = admin_client.import_feature_values(import_users_request)

In [None]:
# Polls for the LRO status and prints when the LRO has completed
ingestion_lro.result()

### Import feature values for movies

Similarly, import the feature values for movies into the featurestore.

In [None]:
import_movie_request = featurestore_service_pb2.ImportFeatureValuesRequest(
    entity_type=admin_client.entity_type_path(
        PROJECT_ID, REGION, FEATURESTORE_ID, "movies"
    ),
    avro_source=io_pb2.AvroSource(
        gcs_source=io_pb2.GcsSource(
            uris=[
                "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/movies.avro"
            ]
        )
    ),
    entity_id_field="movie_id",
    feature_specs=[
        featurestore_service_pb2.ImportFeatureValuesRequest.FeatureSpec(id="title"),
        featurestore_service_pb2.ImportFeatureValuesRequest.FeatureSpec(id="genres"),
        featurestore_service_pb2.ImportFeatureValuesRequest.FeatureSpec(
            id="average_rating"
        ),
    ],
    feature_time_field="update_time",
    worker_count=1,
)

In [None]:
# Start to import, will take a couple of minutes
ingestion_lro = admin_client.import_feature_values(import_movie_request)

In [None]:
# Polls for the LRO status and prints when the LRO has completed
ingestion_lro.result()

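## Online serving

The [Online Serving APIs](https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1#featurestoreonlineservingservice) let you serve feature values for small batches of entities. They are designed for latency-sensitive services, such as online model prediction. For example, a movie service might want to quickly show the movies that the current user is most likely to watch by using online predictions.

The ReadFeatureValues API reads the feature values of one entity; hence its custom HTTP verb is `readFeatureValues`. By default, the API returns the latest value of each feature, that is, the value with the most recent timestamp.

To read feature values, specify the entity ID and the features to read. The response contains a `header` and an `entity_view`. Each row of data in the `entity_view` contains one feature value, in the same order as the features listed in the response header.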

In [None]:
# Fetch the following 3 features.
feature_selector = FeatureSelector(
    id_matcher=IdMatcher(ids=["age", "gender", "liked_genres"])
)

data_client.read_feature_values(
    featurestore_online_service_pb2.ReadFeatureValuesRequest(
        # Fetch from the following feature store/entity type
        entity_type=admin_client.entity_type_path(
            PROJECT_ID, REGION, FEATURESTORE_ID, "users"
        ),
        # Fetch the user features whose ID is "alice"
        entity_id="alice",
        feature_selector=feature_selector,
    )
)
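Because the data rows follow the header order, the two lists can be zipped into a dictionary. The following sketch does this on a hand-built stand-in object rather than a live response. The attribute names `feature_descriptors`, `data`, and `value` follow my reading of the v1 response message and should be treated as assumptions; in a real response each value is a typed message rather than a bare Python value.

```python
from types import SimpleNamespace as NS


def entity_view_to_dict(response) -> dict:
    """Pair each feature ID in the header with the value at the same position."""
    feature_ids = [descriptor.id for descriptor in response.header.feature_descriptors]
    values = [item.value for item in response.entity_view.data]
    return dict(zip(feature_ids, values))


# Stand-in that mimics the response shape; the values are illustrative.
fake_response = NS(
    header=NS(feature_descriptors=[NS(id="age"), NS(id="gender"), NS(id="liked_genres")]),
    entity_view=NS(
        entity_id="alice",
        data=[NS(value=55), NS(value="F"), NS(value=["Drama", "Comedy"])],
    ),
)
print(entity_view_to_dict(fake_response))
```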

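### Read multiple entities per request

To read feature values from multiple entities, use the StreamingReadFeatureValues API, which is almost identical to the previous ReadFeatureValues API. Because the API is latency sensitive, fetching only a small number of entities per call is recommended.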

In [None]:
# Read the same set of features as above, but for multiple entities.
response_stream = data_client.streaming_read_feature_values(
    featurestore_online_service_pb2.StreamingReadFeatureValuesRequest(
        entity_type=admin_client.entity_type_path(
            PROJECT_ID, REGION, FEATURESTORE_ID, "users"
        ),
        entity_ids=["alice", "bob"],
        feature_selector=feature_selector,
    )
)

In [None]:
# Iterate and process response. Note the first one is always the header only.
for response in response_stream:
    print(response)
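If you have more entity IDs than you want to send in one call, one option is to split them into small batches and issue one streaming read per batch. The following sketch shows only the splitting step; the batch size of 2 is arbitrary and the extra user IDs are made up for the example.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical IDs; only "alice" and "bob" appear in this notebook.
entity_ids = ["alice", "bob", "carol", "dave", "erin"]
batches = list(chunked(entity_ids, 2))
print(batches)
```

Each batch could then be passed as `entity_ids` in a `StreamingReadFeatureValuesRequest` like the one above.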

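Now that you have learned how to fetch imported feature values for online serving, the next step is learning how to use imported feature values for offline use cases.

## Batch serving

Batch serving fetches a large batch of feature values, typically for high-throughput use such as model training or batch prediction. In this section, you learn how to prepare training examples by calling the BatchReadFeatureValues API.

### Use case

**The task** is to prepare a training dataset to train a model that predicts whether a given user will watch a given movie. To achieve this, you need two sets of input:

* Features: you already imported them into the featurestore.
* Labels: the ground-truth data recording that user X has watched movie Y.

More specifically, the ground-truth observations are described in Table 1 and the desired training dataset is described in Table 2. Each row in Table 2 is the result of joining the imported feature values by the entity IDs and timestamps in Table 1. In this example, the `age`, `gender`, and `liked_genres` features from `users` and the `genres` and `average_rating` features from `movies` are chosen to train the model. Note that only positive examples are shown in these two tables; that is, you can imagine a label column whose values are all `True`.

The BatchReadFeatureValues API takes Table 1 as input, joins all required feature values from the featurestore, and returns Table 2 for training.

<h4 align="center">Table 1. Ground-truth data</h4>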

users | movies | timestamp
----- | -------- | --------------------
alice  | Cinema Paradiso | 2019-11-01T00:00:00Z
bob  | The Shining | 2019-11-15T18:09:43Z
... | ... | ...


<h4 align="center">Table 2. Expected training data generated by the Batch Read API (positive samples)</h4>

timestamp | entity_type_users | age | gender | liked_genres | entity_type_movies | genres | average_rating
-------------------- | ----------------- | --------------- | ---------------- | -------------------- | -------- | --------- | -----
2019-11-15T18:09:43Z | bob | 35 | M | [Action, Crime] | The Shining | Horror | 4.8
2019-11-01T00:00:00Z | alice | 55 | F | [Drama, Comedy] | Cinema Paradiso | Romance | 4.5
... | ... | ... | ... | ... | ... | ... | ...

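#### Why timestamp?

Note that Table 2 has a `timestamp` column. It indicates the time when the ground truth was observed. Recording it avoids data inconsistency.

For example, the `alice` row of Table 2 indicates that user `alice` watched the movie `Cinema Paradiso` at `2019-11-01T00:00:00Z`. The featurestore keeps feature values for all timestamps but, during batch serving, fetches only the values as of the given timestamp. On 2019-11-01, alice might have been 54 years old, whereas now she might be 56; the featurestore returns `age=54` rather than `age=56`, because that was the feature value at observation time. Other features can vary over time in the same way, such as liked genres.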
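The idea of returning the value as of the observation time can be sketched with a sorted history and a binary search. This is a conceptual illustration only: the source does not describe how the service performs the lookup, and the history below is invented to match the hypothetical ages in the alice example.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


def value_as_of(history, as_of):
    """Return the latest value whose timestamp is <= as_of, or None if there is none.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [timestamp for timestamp, _ in history]
    index = bisect_right(timestamps, as_of)
    if index == 0:
        return None
    return history[index - 1][1]


# Invented history of alice's `age` feature.
age_history = [
    (utc(2018, 6, 1), 53),
    (utc(2019, 6, 1), 54),
    (utc(2021, 6, 1), 56),
]
print(value_as_of(age_history, utc(2019, 11, 1)))  # 54, the value at observation time
```

Looking up with a later date would return 56, which is why the observation timestamp has to travel with each label.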

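### Batch read feature values

Assemble the request, which specifies the following information:

* Where the label data is, that is, Table 1.
* Which features to read, that is, the column names in Table 2.

The output is stored in a BigQuery table.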
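The label data is a CSV file laid out like Table 1: one column per entity type ID plus a `timestamp` column. The following sketch writes such a file into memory with the standard library. The rows mirror Table 1 for readability; the IDs in the actual sample file may differ.

```python
import csv
import io

# Rows that mirror Table 1; column names match the entity type IDs used above.
rows = [
    {"users": "alice", "movies": "Cinema Paradiso", "timestamp": "2019-11-01T00:00:00Z"},
    {"users": "bob", "movies": "The Shining", "timestamp": "2019-11-15T18:09:43Z"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["users", "movies", "timestamp"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To use such a file in a batch read, you would upload it to Cloud Storage and reference its `gs://` URI in `csv_read_instances`.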

In [None]:
batch_serving_request = featurestore_service_pb2.BatchReadFeatureValuesRequest(
    # featurestore info
    featurestore=admin_client.featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID),
    # URL for the label data, i.e., Table 1.
    csv_read_instances=io_pb2.CsvSource(
        gcs_source=io_pb2.GcsSource(uris=[INPUT_CSV_FILE])
    ),
    destination=featurestore_service_pb2.FeatureValueDestination(
        bigquery_destination=io_pb2.BigQueryDestination(
            # Output to BigQuery table created earlier
            output_uri=DESTINATION_TABLE_URI
        )
    ),
    entity_type_specs=[
        featurestore_service_pb2.BatchReadFeatureValuesRequest.EntityTypeSpec(
            # Read the 'age', 'gender' and 'liked_genres' features from the 'users' entity
            entity_type_id="users",
            feature_selector=FeatureSelector(
                id_matcher=IdMatcher(
                    ids=[
                        # features, use "*" if you want to select all features within this entity type
                        "age",
                        "gender",
                        "liked_genres",
                    ]
                )
            ),
        ),
        featurestore_service_pb2.BatchReadFeatureValuesRequest.EntityTypeSpec(
            # Read the 'average_rating' and 'genres' feature values of the 'movies' entity
            entity_type_id="movies",
            feature_selector=FeatureSelector(
                id_matcher=IdMatcher(ids=["average_rating", "genres"])
            ),
        ),
    ],
)

In [None]:
# Execute the batch read
batch_serving_lro = admin_client.batch_read_feature_values(batch_serving_request)

In [None]:
# This long-running operation blocks until the batch read finishes.
batch_serving_lro.result()

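After the LRO finishes, you should be able to see the results in the [BigQuery console](https://console.cloud.google.com/bigquery), in the dataset created earlier.

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

You can also keep the project but delete the featurestore and the BigQuery output dataset: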

In [None]:
admin_client.delete_featurestore(
    request=featurestore_service_pb2.DeleteFeaturestoreRequest(
        name=admin_client.featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID),
        force=True,
    )
).result()
client.delete_dataset(
    DESTINATION_DATA_SET, delete_contents=True, not_found_ok=True
)  # Make an API request.

print("Deleted dataset '{}'.".format(DESTINATION_DATA_SET))