In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/feature_store/mobile_gaming/mobile_gaming_feature_store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/feature_store/mobile_gaming/mobile_gaming_feature_store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/feature_store/mobile_gaming/mobile_gaming_feature_store.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

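## Overview

Imagine you are a member of a data science team working on the same mobile game application described in the blog post [Churn prediction for game developers using Google Analytics 4 (GA4) and BigQuery ML](https://cloud.google.com/blog/topics/developers-practitioners/churn-prediction-game-developers-using-google-analytics-4-ga4-and-bigquery-ml).

The business wants to use that information in real time inside the game so that it can intervene immediately and prevent churn. Specifically, for each player, they want to offer in-game incentives such as new items or bonus packs, based on the player's demographics, behavioral information, and propensity to return.

In 2021, Google Cloud announced Vertex AI, a managed machine learning (ML) platform that lets data science teams speed up the deployment and maintenance of ML models. One of the platform's building blocks is Vertex AI Feature Store, a managed service for low-latency, scalable feature serving. It is also a centralized feature repository with an easy API for searching and discovering features, plus feature monitoring to track quality issues such as drift.

In this notebook, you see how Vertex AI Feature Store fits into a production-ready scenario in which the game platform uses a user's activities in the 24 hours before their last engagement to improve the user experience. Below is a high-level picture of the system: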

<img src="./assets/mobile_gaming_architecture_1.png">

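### Dataset

The dataset is a public sample export from a real mobile game app called "Flood It!" (Android, iOS).

### Objective

In this notebook, you learn how Vertex AI Feature Store:

1. Provides a centralized feature repository with an easy API for searching and discovering features and for fetching them for training and serving.

2. Simplifies the deployment of models for online prediction through low-latency, scalable feature serving.

3. Mitigates training-serving skew and data leakage by performing point-in-time lookups to fetch historical data for training.

**Note that this notebook assumes you already know how to set up a Vertex AI Feature Store. If you don't, see [this detailed notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/feature_store/gapic-feature-store.ipynb).**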

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

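### Set up your local development environment

**If you are using Colab or Vertex AI Workbench notebooks**, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` in a terminal.

1. To launch Jupyter, run `jupyter notebook` in a terminal.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages

Install additional package dependencies that are not already installed in your notebook environment, such as XGBoost. Use the latest major GA version of each package.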

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
! pip3 install {USER_FLAG} --upgrade pip -q
! pip3 install {USER_FLAG} --upgrade google-cloud-aiplatform==1.11.0 -q --no-warn-conflicts
! pip3 install {USER_FLAG} git+https://github.com/googleapis/python-aiplatform.git@main # For features monitoring
! pip3 install {USER_FLAG} --upgrade google-cloud-bigquery==2.24.0 -q --no-warn-conflicts
! pip3 install {USER_FLAG} --upgrade xgboost==1.1.1 -q --no-warn-conflicts

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so that it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

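## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute and storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,notebooks.googleapis.com,).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get it using `gcloud`.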

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
! gcloud config set project $PROJECT_ID

#### Get your project number (optional)

Now that the project ID is set, you get the corresponding project number.

In [None]:
shell_output = ! gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = shell_output[0]
print("Project Number:", PROJECT_NUMBER)

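#### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).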

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

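#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.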
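As a minimal sketch of that idea, the snippet below appends a session timestamp to a base resource name. The helper `with_timestamp` and the base name `mobile_gaming` are illustrative assumptions and are not part of the notebook.

```python
from datetime import datetime


def with_timestamp(base_name: str, timestamp: str) -> str:
    """Return base_name with the session timestamp appended (hypothetical helper)."""
    return f"{base_name}_{timestamp}"


# Year, month, day, hour, minute, second: 14 digits in total.
session_timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
print(with_timestamp("mobile_gaming", session_timestamp))
```

Because the timestamp has one-second resolution, two sessions started within the same second would still collide; for a shared project that is usually an acceptable risk.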

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

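### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench notebooks**, your environment is already authenticated. Skip this step.

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

Otherwise, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list and add the following roles:
   - BigQuery Admin
   - Storage Admin
   - Storage Object Admin
   - Vertex AI Administrator
   - Vertex AI Feature Store Admin

5. Click *Create*. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.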

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

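### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may not use a multi-regional storage bucket for training with Vertex AI.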
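Global uniqueness can only be confirmed by Cloud Storage itself, but basic naming mistakes can be caught locally first. The sketch below builds a fallback name from a project ID and a timestamp and checks it against a subset of the Cloud Storage naming rules. The helper names are hypothetical, and the check deliberately ignores dotted domain-style names and reserved words.

```python
import re

# Subset of the Cloud Storage naming rules (assumption: 3-63 characters,
# lowercase letters, digits, dashes, underscores and dots, starting and
# ending with a letter or digit).
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def default_bucket_name(project_id: str, timestamp: str) -> str:
    """Build a fallback bucket name from the project ID and a session timestamp."""
    return f"{project_id}-aip-{timestamp}"


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic length and character checks."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


candidate = default_bucket_name("my-project", "20220616120000")
print(candidate, looks_like_valid_bucket_name(candidate))
print(looks_like_valid_bucket_name("[your-bucket-name]"))
```

A name that passes this check can still be rejected at creation time if another bucket already uses it.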

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + TIMESTAMP
    BUCKET_URI = f"gs://{BUCKET_NAME}"

Run the following cell to create your Cloud Storage bucket only if the bucket doesn't already exist.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents.

In [None]:
! gsutil ls -al $BUCKET_URI

#### Service account (optional)

If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

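#### Set service account access

Run the following commands to grant your service account access to the bucket. You only need to run this step once per service account.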

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Create a BigQuery dataset

You create the BigQuery dataset to store the data used in this demo.

In [None]:
BQ_DATASET = "Mobile_Gaming"  # @param {type:"string"}
LOCATION = "US"

In [None]:
!bq mk --location=$LOCATION --dataset $PROJECT_ID:$BQ_DATASET

### Import libraries

In [None]:
# General
import os
import random
import sys
import time

# Data Science
import pandas as pd
# Vertex AI and its Feature Store
from google.cloud import aiplatform as vertex_ai
from google.cloud import bigquery
from google.cloud.aiplatform import Feature, Featurestore

### Define constants

In [None]:
# Data Engineering and Feature Engineering
TODAY = "2022-06-16"
LABEL_TABLE = f"label_table_{TODAY}".replace("-", "")
FEATURES_TABLE = f"wide_features_table_{TODAY}"  # @param {type:"string"}
FEATURESTORE_ID = "mobile_gaming"  # @param {type:"string"}
ENTITY_TYPE_ID = "user"

# Vertex AI Feature store
ONLINE_STORE_NODES_COUNT = 5
ENTITY_ID = "user"
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
FEATURE_TIME = "timestamp"
ENTITY_ID_FIELD = "user_pseudo_id"
BQ_SOURCE_URI = f"bq://{PROJECT_ID}.{BQ_DATASET}.{FEATURES_TABLE}"
GCS_DESTINATION_PATH = f"data/features/train_features_{TODAY}".replace("-", "")
GCS_DESTINATION_OUTPUT_URI = f"{BUCKET_URI}/{GCS_DESTINATION_PATH}"
SERVING_FEATURE_IDS = {"user": ["*"]}
READ_INSTANCES_TABLE = f"ground_truth_{TODAY}".replace("-", "")
READ_INSTANCES_URI = f"bq://{PROJECT_ID}.{BQ_DATASET}.{READ_INSTANCES_TABLE}"

# Vertex AI Training
BASE_CPU_IMAGE = "us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest"
DATASET_NAME = f"churn_mobile_gaming_{TODAY}".replace("-", "")
TRAIN_JOB_NAME = f"xgb_classifier_training_{TODAY}".replace("-", "")
MODEL_NAME = f"churn_xgb_classifier_{TODAY}".replace("-", "")
MODEL_PACKAGE_PATH = "train_package"
TRAINING_MACHINE_TYPE = "n1-standard-4"
TRAINING_REPLICA_COUNT = 1
DATA_PATH = f"{GCS_DESTINATION_OUTPUT_URI}/000000000000.csv".replace("gs://", "/gcs/")
MODEL_PATH = f"model/{TODAY}".replace("-", "")
MODEL_DIR = f"{BUCKET_URI}/{MODEL_PATH}".replace("gs://", "/gcs/")

# Vertex AI Prediction
DESTINATION_URI = f"{BUCKET_URI}/{MODEL_PATH}"
VERSION = "v1"
SERVING_CONTAINER_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-23:latest"
)
ENDPOINT_NAME = "mobile_gaming_churn"
DEPLOYED_MODEL_NAME = f"churn_xgb_classifier_{VERSION}"
MODEL_DEPLOYED_NAME = "churn_xgb_classifier_v1"
SERVING_MACHINE_TYPE = "n1-highcpu-4"
MIN_NODES = 1
MAX_NODES = 1

In [None]:
# Sampling distributions for categorical features implemented in
# https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_monitoring/model_monitoring.ipynb

LANGUAGE = [
    "en-us",
    "en-gb",
    "ja-jp",
    "en-au",
    "en-ca",
    "de-de",
    "en-in",
    "en",
    "fr-fr",
    "pt-br",
    "es-us",
    "zh-tw",
    "zh-hans-cn",
    "es-mx",
    "nl-nl",
    "fr-ca",
    "en-za",
    "vi-vn",
    "en-nz",
    "es-es",
]

OS = ["IOS", "ANDROID", "null"]
COUNTRY = [
    "United States",
    "India",
    "Japan",
    "Canada",
    "Australia",
    "United Kingdom",
    "Germany",
    "Mexico",
    "France",
    "Brazil",
    "Taiwan",
    "China",
    "Saudi Arabia",
    "Pakistan",
    "Egypt",
    "Netherlands",
    "Vietnam",
    "Philippines",
    "South Africa",
]

USER_IDS = [
    "C8685B0DFA2C4B4E6E6EA72894C30F6F",
    "A976A39B8E08829A5BC5CD3827C942A2",
    "DD2269BCB7F8532CD51CB6854667AF51",
    "A8F327F313C9448DFD5DE108DAE66100",
    "8BE7BF90C971453A34C1FF6FF2A0ACAE",
    "8375B114AFAD8A31DE54283525108F75",
    "4AD259771898207D5869B39490B9DD8C",
    "51E859FD9D682533C094B37DC85EAF87",
    "8C33815E0A269B776AAB4B60A4F7BC63",
    "D7EA8E3645EFFBD6443946179ED704A6",
    "58F3D672BBC613680624015D5BC3ADDB",
    "FF955E4CA27C75CE0BEE9FC89AD275A3",
    "22DC6A6AE86C0AA33EBB8C3164A26925",
    "BC10D76D02351BD4C6F6F5437EE5D274",
    "19DEEA6B15B314DB0ED2A4936959D8F9",
    "C2D17D9066EE1EB9FAE1C8A521BFD4E5",
    "EFBDEC168A2BF8C727B060B2E231724E",
    "E43D3AB2F9B9055C29373523FAF9DB9B",
    "BBDCBE2491658165B7F20540DE652E3A",
    "6895EEFC23B59DB13A9B9A7EED6A766F",
]

### Helpers

In [None]:
def run_bq_query(query: str):
    """
    A helper function to run a BigQuery job and wait for it to finish
    Args:
        query: a formatted SQL query
    Returns:
        None
    """
    try:
        job = bq_client.query(query)
        _ = job.result()
    except RuntimeError as error:
        print(error)


def upload_model(
    display_name: str,
    serving_container_image_uri: str,
    artifact_uri: str,
    sync: bool = True,
) -> vertex_ai.Model:
    """

    A helper function to upload a model to Vertex AI
    Args:
        display_name: The name of Vertex AI Model artefact
        serving_container_image_uri: The uri of the serving image
        artifact_uri: The uri of artefact to import
        sync: Whether to execute the method synchronously

    Returns: Vertex AI Model

    """
    model = vertex_ai.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,
        serving_container_image_uri=serving_container_image_uri,
        sync=sync,
    )
    model.wait()
    print(model.display_name)
    print(model.resource_name)
    return model


def create_endpoint(display_name: str) -> vertex_ai.Endpoint:
    """
    A utility to create a Vertex AI Endpoint
    Args:
        display_name: The name of Endpoint

    Returns: Vertex AI Endpoint

    """
    endpoint = vertex_ai.Endpoint.create(display_name=display_name)

    print(endpoint.display_name)
    print(endpoint.resource_name)
    return endpoint


def deploy_model(
    model: vertex_ai.Model,
    machine_type: str,
    endpoint: vertex_ai.Endpoint = None,
    deployed_model_display_name: str = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    sync: bool = True,
) -> vertex_ai.Endpoint:
    """
    A helper function to deploy a Vertex AI Model to an Endpoint
    Args:
        model: A Vertex AI Model
        machine_type: The type of machine to serve the model
        endpoint: A Vertex AI Endpoint
        deployed_model_display_name: The name of the model
        min_replica_count: Minimum number of serving replicas
        max_replica_count: Max number of serving replicas
        sync: Whether to execute method synchronously

    Returns: vertex_ai.Endpoint

    """
    model_deployed = model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=deployed_model_display_name,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        sync=sync,
    )

    model_deployed.wait()

    print(model_deployed.display_name)
    print(model_deployed.resource_name)
    return model_deployed


def endpoint_predict_sample(
    instances: list, endpoint: vertex_ai.Endpoint
) -> vertex_ai.models.Prediction:
    """
    A helper function to get predictions from a Vertex AI Endpoint
    Args:
        instances: The list of instances to score
        endpoint: A Vertex AI Endpoint

    Returns:
        vertex_ai.models.Prediction

    """
    prediction = endpoint.predict(instances=instances)
    print(prediction)
    return prediction


def generate_online_sample() -> dict:
    """
    A helper function to generate a sample of online features
    Returns:
        online_sample: dict of online features
    """
    online_sample = {}
    online_sample["entity_id"] = random.choices(USER_IDS)
    online_sample["country"] = random.choices(COUNTRY)
    online_sample["operating_system"] = random.choices(OS)
    online_sample["language"] = random.choices(LANGUAGE)
    return online_sample


def simulate_prediction(endpoint: vertex_ai.Endpoint, n_requests: int, latency: int):
    """
    A helper function to simulate online prediction with the user entity type
        - format entities for prediction
        - retrieve static features with a lookup operation from Vertex AI Feature Store
        - run the prediction request and get back the result
    Args:
        endpoint: Vertex AI Endpoint object
        n_requests: number of requests to run
        latency: pause between consecutive requests, in seconds
    Returns:
        None; each prediction is printed
    """
    for i in range(n_requests):
        online_sample = generate_online_sample()
        online_features = pd.DataFrame.from_dict(online_sample)
        entity_ids = online_features["entity_id"].tolist()

        customer_aggregated_features = user_entity_type.read(
            entity_ids=entity_ids,
            feature_ids=[
                "cnt_user_engagement",
                "cnt_level_start_quickplay",
                "cnt_level_end_quickplay",
                "cnt_level_complete_quickplay",
                "cnt_level_reset_quickplay",
                "cnt_post_score",
                "cnt_spend_virtual_currency",
                "cnt_ad_reward",
                "cnt_challenge_a_friend",
                "cnt_completed_5_levels",
                "cnt_use_extra_steps",
            ],
        )

        prediction_sample_df = pd.merge(
            customer_aggregated_features.set_index("entity_id"),
            online_features.set_index("entity_id"),
            left_index=True,
            right_index=True,
        ).reset_index(drop=True)

        # prediction_sample = prediction_sample_df.to_dict("records")
        prediction_instance = prediction_sample_df.values.tolist()
        prediction = endpoint.predict(prediction_instance)
        print(
            f"Prediction request: user_id - {entity_ids} - values - {prediction_instance} - prediction - {prediction[0]}"
        )
        time.sleep(latency)

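# Setting the real-time scenario

To make real-time churn predictions, you need to:

1. Collect historical data about user events and behaviors.
2. Design your data model, build your features, and ingest them into the feature store so that they support both offline training and online serving.
3. Define churn and get the data to train a churn model.
4. Train the model at scale.
5. Deploy the model to an endpoint and generate prediction scores in real time.

The following sections cover these steps in detail.

### Initialize the Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.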

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Initialize the BigQuery SDK for Python

Initialize the BigQuery client for Python for your project and dataset location.

In [None]:
bq_client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

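## Identify users and build your features

In this section, you ingest static features into Vertex AI Feature Store. In particular, you cover the following steps:

1. Identify users and process demographic features and behavioral features over the last 24 hours using **BigQuery**.

2. Set up the feature store.

3. Register the features using **Vertex AI Feature Store** and the SDK.

The figure below shows the whole process.

The original dataset contains raw event data that can't be ingested into the feature store as it is. You need to preprocess the raw data to obtain user features.

**Note that these transformations are simulated at different points in time (today and tomorrow).**

### Labels, demographic and behavioral transformations

This section is based on the blog article [Churn prediction for game developers using Google Analytics 4 (GA4) and BigQuery ML](https://cloud.google.com/blog/topics/developers-practitioners/churn-prediction-game-developers-using-google-analytics-4-ga4-and-bigquery-ml?utm_source=linkedin&utm_medium=unpaidsoc&utm_campaign=FY21-Q2-Google-Cloud-Tech-Blog&utm_content=google-analytics-4&utm_term=-) by Minhaz Kazi and Polong Lin.

You adapt it to turn batch churn prediction (using features from the first 24 hours after a user's first engagement) into real-time churn prediction (using features from the 6 hours leading up to each event, as set by the rolling window in the query below).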
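The rolling window that the query below expresses with `RANGE BETWEEN 21600000000 PRECEDING AND CURRENT ROW` can be sketched in plain Python. For every event row, the sketch counts how many events of a given name the same user produced in the preceding 6 hours, including the current timestamp. The tuple layout and the function name `rolling_event_counts` are assumptions made for illustration; the notebook itself computes these counts in BigQuery.

```python
from collections import defaultdict

# 6 hours in microseconds, the same value as the SQL window (21600000000).
SIX_HOURS_MICROS = 6 * 60 * 60 * 1_000_000


def rolling_event_counts(events, event_name, window_micros=SIX_HOURS_MICROS):
    """Count, per event row, the user's `event_name` events in [ts - window, ts].

    events: iterable of (user_pseudo_id, event_timestamp_micros, event_name).
    Returns a list of (user_pseudo_id, event_timestamp_micros, count) tuples.
    """
    by_user = defaultdict(list)
    for user, ts, name in events:
        by_user[user].append((ts, name))

    results = []
    for user, rows in by_user.items():
        rows.sort(key=lambda row: row[0])
        for ts, _ in rows:
            # Quadratic scan per user: simple to read, fine for a small sample.
            count = sum(
                1
                for other_ts, other_name in rows
                if other_name == event_name and ts - window_micros <= other_ts <= ts
            )
            results.append((user, ts, count))
    return results


HOUR = 3_600_000_000
sample_events = [
    ("u1", 0, "user_engagement"),
    ("u1", 1 * HOUR, "post_score"),
    ("u1", 5 * HOUR, "user_engagement"),
    ("u1", 7 * HOUR, "user_engagement"),
    ("u2", 2 * HOUR, "user_engagement"),
]
for row in rolling_event_counts(sample_events, "user_engagement"):
    print(row)
```

Each output row plays the role of one feature value with its own timestamp, which is what makes a later point-in-time lookup possible.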

In [None]:
features_sql_query = f"""
CREATE OR REPLACE TABLE
  `{PROJECT_ID}.{BQ_DATASET}.{FEATURES_TABLE}` AS
WITH

  # query to extract demographic data for each user ---------------------------------------------------------
  get_demographic_data AS (
  SELECT * EXCEPT (row_num)
  FROM (
    SELECT
      user_pseudo_id,
      geo.country as country,
      device.operating_system as operating_system,
      device.language as language,
      ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS row_num
    FROM `firebase-public-project.analytics_153293282.events_*`)
  WHERE row_num = 1),

  # query to extract behavioral data for each user ----------------------------------------------------------
  get_behavioral_data AS (
  SELECT
    event_timestamp,
    user_pseudo_id,
    SUM(IF(event_name = 'user_engagement', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_user_engagement,
    SUM(IF(event_name = 'level_start_quickplay', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_level_start_quickplay,
    SUM(IF(event_name = 'level_end_quickplay', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_level_end_quickplay,
    SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_level_complete_quickplay,
    SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_level_reset_quickplay,
    SUM(IF(event_name = 'post_score', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_post_score,
    SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_spend_virtual_currency,
    SUM(IF(event_name = 'ad_reward', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_ad_reward,
    SUM(IF(event_name = 'challenge_a_friend', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_challenge_a_friend,
    SUM(IF(event_name = 'completed_5_levels', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_completed_5_levels,
    SUM(IF(event_name = 'use_extra_steps', 1, 0)) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp ASC RANGE BETWEEN 21600000000 PRECEDING
      AND CURRENT ROW ) AS cnt_use_extra_steps,
  FROM (
    SELECT
      e.*
    FROM
      `firebase-public-project.analytics_153293282.events_*` AS e
    )
)

SELECT
    -- PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', CONCAT('{TODAY}', ' ', STRING(TIME_TRUNC(CURRENT_TIME(), SECOND))), 'UTC') as timestamp,
    TIMESTAMP_ADD(PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MICROS(beh.event_timestamp))), INTERVAL 1351 DAY) AS timestamp,
    dem.*,
    CAST(IFNULL(beh.cnt_user_engagement, 0) AS FLOAT64)  AS cnt_user_engagement,
    CAST(IFNULL(beh.cnt_level_start_quickplay, 0) AS FLOAT64) AS cnt_level_start_quickplay,
    CAST(IFNULL(beh.cnt_level_end_quickplay, 0) AS FLOAT64) AS cnt_level_end_quickplay,
    CAST(IFNULL(beh.cnt_level_complete_quickplay, 0) AS FLOAT64) AS cnt_level_complete_quickplay,
    CAST(IFNULL(beh.cnt_level_reset_quickplay, 0) AS FLOAT64) AS cnt_level_reset_quickplay,
    CAST(IFNULL(beh.cnt_post_score, 0) AS FLOAT64) AS cnt_post_score,
    CAST(IFNULL(beh.cnt_spend_virtual_currency, 0) AS FLOAT64) AS cnt_spend_virtual_currency,
    CAST(IFNULL(beh.cnt_ad_reward, 0) AS FLOAT64) AS cnt_ad_reward,
    CAST(IFNULL(beh.cnt_challenge_a_friend, 0) AS FLOAT64) AS cnt_challenge_a_friend,
    CAST(IFNULL(beh.cnt_completed_5_levels, 0) AS FLOAT64) AS cnt_completed_5_levels,
    CAST(IFNULL(beh.cnt_use_extra_steps, 0) AS FLOAT64) AS cnt_use_extra_steps,
FROM
  get_demographic_data dem
LEFT OUTER JOIN 
  get_behavioral_data beh
ON
  dem.user_pseudo_id = beh.user_pseudo_id
"""

In [None]:
run_bq_query(features_sql_query)

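## Create a Vertex AI Feature Store and ingest your features

You now have a wide table of features. It is time to ingest them into the feature store.

Before moving on, you may have a question: why do I need a feature store in this scenario?

One reason is to make those features accessible across teams, so that they are computed once and reused many times. To make that possible, you also need to monitor the features over time to guarantee their freshness and, when needed, trigger a new feature engineering run to refresh them.

If that is not your case, later sections give more reasons to consider a feature store. For now, just keep following along.

One of the most important things about the feature store is its data model. As you can see in the picture below, Vertex AI Feature Store organizes resources hierarchically in the following order: `Featurestore -> EntityType -> Feature`. You must create these resources before you can ingest data into Vertex AI Feature Store.

In this case, you create a featurestore resource called **mobile_gaming** that contains a **user** entity type and all its associated **features**, such as the country or the number of times a user challenged a friend (`cnt_challenge_a_friend`).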
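The hierarchy is also visible in the fully qualified resource names, where each level nests under its parent. The sketch below composes those names with plain string helpers. The helper functions and the placeholder project `my-project` are assumptions made for illustration; in the notebook, the SDK objects expose the same strings through `resource_name`.

```python
def featurestore_name(project: str, location: str, featurestore_id: str) -> str:
    """Resource name of a featurestore (top level of the hierarchy)."""
    return f"projects/{project}/locations/{location}/featurestores/{featurestore_id}"


def entity_type_name(
    project: str, location: str, featurestore_id: str, entity_type_id: str
) -> str:
    """Resource name of an entity type nested under its featurestore."""
    parent = featurestore_name(project, location, featurestore_id)
    return f"{parent}/entityTypes/{entity_type_id}"


def feature_name(
    project: str,
    location: str,
    featurestore_id: str,
    entity_type_id: str,
    feature_id: str,
) -> str:
    """Resource name of a feature nested under its entity type."""
    parent = entity_type_name(project, location, featurestore_id, entity_type_id)
    return f"{parent}/features/{feature_id}"


print(
    feature_name(
        "my-project", "us-central1", "mobile_gaming", "user", "cnt_challenge_a_friend"
    )
)
```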

### Create featurestore, `mobile_gaming`

You need to create a `featurestore` resource to contain entity types, features, and feature values. In this case, you call it `mobile_gaming`.

In [None]:
try:
    mobile_gaming_feature_store = Featurestore.create(
        featurestore_id=FEATURESTORE_ID,
        online_store_fixed_node_count=ONLINE_STORE_NODES_COUNT,
        labels={"team": "dataoffice", "app": "mobile_gaming"},
        sync=True,
    )
except RuntimeError as error:
    print(error)
else:
    FEATURESTORE_RESOURCE_NAME = mobile_gaming_feature_store.resource_name
    print(f"Feature store created: {FEATURESTORE_RESOURCE_NAME}")

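### Create the `User` entity type and its features

You define your own entity types, which represent one or more levels at which you decide to refer to your features. In this case, there is a single `user` entity.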

In [None]:
try:
    user_entity_type = mobile_gaming_feature_store.create_entity_type(
        entity_type_id=ENTITY_ID, description="User Entity", sync=True
    )
except RuntimeError as error:
    print(error)
else:
    USER_ENTITY_RESOURCE_NAME = user_entity_type.resource_name
    print("Entity type name is", USER_ENTITY_RESOURCE_NAME)

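### Set feature monitoring

Notice that Vertex AI Feature Store has [feature monitoring capability](https://cloud.google.com/vertex-ai/docs/featurestore/monitoring). It is in preview, so you need to use the v1beta1 Python API, which is lower-level than the API used so far in this notebook.

The easiest way to set this up for now is the [console UI](https://console.cloud.google.com/vertex-ai/features). For completeness, below is an example of doing it with the v1beta1 SDK.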

In [None]:
from google.cloud.aiplatform_v1beta1 import \
    FeaturestoreServiceClient as v1beta1_FeaturestoreServiceClient
from google.cloud.aiplatform_v1beta1.types import \
    entity_type as v1beta1_entity_type_pb2
from google.cloud.aiplatform_v1beta1.types import \
    featurestore_monitoring as v1beta1_featurestore_monitoring_pb2
from google.cloud.aiplatform_v1beta1.types import \
    featurestore_service as v1beta1_featurestore_service_pb2
from google.protobuf.duration_pb2 import Duration

v1beta1_admin_client = v1beta1_FeaturestoreServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

In [None]:
v1beta1_admin_client.update_entity_type(
    v1beta1_featurestore_service_pb2.UpdateEntityTypeRequest(
        entity_type=v1beta1_entity_type_pb2.EntityType(
            name=v1beta1_admin_client.entity_type_path(
                PROJECT_ID, REGION, FEATURESTORE_ID, ENTITY_ID
            ),
            monitoring_config=v1beta1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig(
                snapshot_analysis=v1beta1_featurestore_monitoring_pb2.FeaturestoreMonitoringConfig.SnapshotAnalysis(
                    monitoring_interval=Duration(seconds=86400),  # 1 day
                ),
            ),
        ),
    )
)

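### Create features

To ingest features, you need to provide a feature configuration and create the features as featurestore resources.

#### Create feature configuration

For simplicity, the configuration is created declaratively here. A helper function could instead build it from the BigQuery schema.
Also notice that some features should be passed on the fly at prediction time. In this case, country, operating system, and language look like good candidates.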
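A minimal sketch of such a helper is shown here, assuming the schema is available as `(column_name, bigquery_type)` pairs. The function `build_feature_configs`, the type mapping, and the generated descriptions are illustrative assumptions. With the BigQuery client, the pairs could be read from the `name` and `field_type` attributes of a table's schema fields.

```python
# Assumed mapping from BigQuery column types to Feature Store value types.
BQ_TO_FEATURE_VALUE_TYPE = {
    "STRING": "STRING",
    "FLOAT": "DOUBLE",
    "FLOAT64": "DOUBLE",
    "INTEGER": "INT64",
    "INT64": "INT64",
    "BOOLEAN": "BOOL",
    "BOOL": "BOOL",
}


def build_feature_configs(schema, skip_columns=("user_pseudo_id", "timestamp")):
    """Build a feature_configs dict from (column_name, bigquery_type) pairs.

    Columns listed in skip_columns (entity ID and feature time) are not features.
    """
    configs = {}
    for column_name, bq_type in schema:
        if column_name in skip_columns:
            continue
        value_type = BQ_TO_FEATURE_VALUE_TYPE.get(bq_type.upper())
        if value_type is None:
            raise ValueError(f"Unsupported BigQuery type {bq_type!r} for {column_name!r}")
        configs[column_name] = {
            "value_type": value_type,
            "description": f"Feature generated from column {column_name}",
        }
    return configs


sample_schema = [
    ("user_pseudo_id", "STRING"),
    ("timestamp", "TIMESTAMP"),
    ("country", "STRING"),
    ("cnt_post_score", "FLOAT64"),
]
print(build_feature_configs(sample_schema))
```

The declarative dictionary in the next cell keeps hand-written descriptions and labels, which a generated configuration cannot supply on its own.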

In [None]:
feature_configs = {
    "country": {
        "value_type": "STRING",
        "description": "The country of customer",
        "labels": {"status": "passed"},
    },
    "operating_system": {
        "value_type": "STRING",
        "description": "The operating system of device",
        "labels": {"status": "passed"},
    },
    "language": {
        "value_type": "STRING",
        "description": "The language of device",
        "labels": {"status": "passed"},
    },
    "cnt_user_engagement": {
        "value_type": "DOUBLE",
        "description": "A variable of user engagement level",
        "labels": {"status": "passed"},
    },
    "cnt_level_start_quickplay": {
        "value_type": "DOUBLE",
        "description": "A variable of user engagement with start level",
        "labels": {"status": "passed"},
    },
    "cnt_level_end_quickplay": {
        "value_type": "DOUBLE",
        "description": "A variable of user engagement with end level",
        "labels": {"status": "passed"},
    },
    "cnt_level_complete_quickplay": {
        "value_type": "DOUBLE",
        "description": "A variable of user engagement with complete status",
        "labels": {"status": "passed"},
    },
    "cnt_level_reset_quickplay": {
        "value_type": "DOUBLE",
        "description": "A variable of user engagement with reset status",
        "labels": {"status": "passed"},
    },
    "cnt_post_score": {
        "value_type": "DOUBLE",
        "description": "A variable of user score",
        "labels": {"status": "passed"},
    },
    "cnt_spend_virtual_currency": {
        "value_type": "DOUBLE",
        "description": "A variable of user virtual amount",
        "labels": {"status": "passed"},
    },
    "cnt_ad_reward": {
        "value_type": "DOUBLE",
        "description": "A variable of user reward",
        "labels": {"status": "passed"},
    },
    "cnt_challenge_a_friend": {
        "value_type": "DOUBLE",
        "description": "A variable of user challenges with friends",
        "labels": {"status": "passed"},
    },
    "cnt_completed_5_levels": {
        "value_type": "DOUBLE",
        "description": "A variable of user level 5 completed",
        "labels": {"status": "passed"},
    },
    "cnt_use_extra_steps": {
        "value_type": "DOUBLE",
        "description": "A variable of user extra steps",
        "labels": {"status": "passed"},
    },
}

#### Create features using the `batch_create_features` method

Once you have the feature configuration, you can create feature resources using the `batch_create_features` method.

In [None]:
try:
    user_entity_type.batch_create_features(feature_configs=feature_configs, sync=True)
except RuntimeError as error:
    print(error)
else:
    for feature in user_entity_type.list_features():
        print("")
        print(f"The resource name of {feature.name} feature is", feature.resource_name)

#### Search features

Vertex AI Feature Store supports feature search. Below is a simple example that shows how to filter features by feature name.
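Search queries are plain strings, so several restrictions can be combined before the string is passed to the search call. The sketch below joins clauses with `AND`. The helper `build_feature_search_query` is hypothetical, and the `value_type = DOUBLE` filter is an assumption based on the search query syntax described in the Vertex AI documentation, so check it against the current reference before relying on it.

```python
def build_feature_search_query(*clauses: str) -> str:
    """Join non-empty search clauses into one conjunctive query string."""
    cleaned = [clause.strip() for clause in clauses if clause.strip()]
    return " AND ".join(cleaned)


# Restrict by feature ID and, as an assumed extra filter, by value type.
query = build_feature_search_query(
    "feature_id:cnt_user_engagement",
    "value_type = DOUBLE",
)
print(query)
```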

In [None]:
feature_query = "feature_id:cnt_user_engagement"
searched_features = Feature.search(query=feature_query)
searched_features

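#### Ingest features

At this point, you have created all the resources associated with the feature store. You only need to import the feature values before you can use them for online and offline serving.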
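Before starting an import, it can help to confirm that the source table has everything the import needs: a column for the entity ID, a column for the feature time, and, when no explicit column mapping is given, one column named after each feature ID. The sketch below performs that check on a list of column names. The helper `missing_source_columns` is an illustrative assumption and is not part of the SDK.

```python
def missing_source_columns(
    source_columns, feature_ids, entity_id_field, feature_time_field
):
    """Return the sorted list of required columns absent from the source table."""
    required = set(feature_ids) | {entity_id_field, feature_time_field}
    return sorted(required - set(source_columns))


table_columns = ["user_pseudo_id", "timestamp", "country", "cnt_post_score"]
missing = missing_source_columns(
    table_columns,
    feature_ids=["country", "language", "cnt_post_score"],
    entity_id_field="user_pseudo_id",
    feature_time_field="timestamp",
)
print(missing)
```

An empty list means the column names line up; any name returned would need to be added to the source table or mapped explicitly.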

In [None]:
FEATURES_IDS = [feature.name for feature in user_entity_type.list_features()]

In [None]:
try:
    user_entity_type.ingest_from_bq(
        feature_ids=FEATURES_IDS,
        feature_time=FEATURE_TIME,
        bq_source_uri=BQ_SOURCE_URI,
        entity_id_field=ENTITY_ID_FIELD,
        disable_online_serving=False,
        worker_count=10,
        sync=False,
    )
except RuntimeError as error:
    print(error)

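Train and deploy a real-time churn ML model using Vertex AI Training and Endpoints

Now that you have your features, you are almost ready to train the churn model.

Here is a high-level picture of the process: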

<img src="./assets/train_model_4.png">

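Let's dive into each step of this process.

Fetch training data with point-in-time queries using BigQuery and Vertex AI Feature Store

As mentioned above, in real-time churn prediction it is important to define the label you want your model to predict.

Suppose you decide to predict the probability of churn over the next hour. Now you have your label. The next step is to define your training samples, and that takes some thought.

In this real-time churn system, you have a large volume of transactions from which to compute features that change continuously and are collected over time. This means you always have fresh data to rebuild features. Depending on when you decide to compute one feature or another, you may end up with a set of features that are not aligned in time.

When labels become available, it is very hard to determine which set of features holds the most recent historical information relative to the label you want to predict. Without that guarantee, your model's performance suffers badly, because the features you train on are not representative of the data and labels the model sees once it is live. You therefore need a way to fetch the latest features computed before the label became available, to avoid this information skew.

**With Vertex AI Feature Store, you can use point-in-time lookups to fetch the feature values that correspond to a specific timestamp,** which in this case is the timestamp associated with the label you want your model to predict. In this way you avoid data leakage and get the most recent features to train your model.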
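
To make the idea of a point-in-time lookup concrete, here is a minimal sketch in plain Python. It illustrates the concept only and is not how the service is implemented: the `history` data, the integer timestamps, and the `point_in_time_lookup` helper are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (feature_timestamp, values), sorted by time.
history = [
    (100, {"cnt_user_engagement": 3}),
    (200, {"cnt_user_engagement": 7}),
    (300, {"cnt_user_engagement": 12}),
]


def point_in_time_lookup(history, read_timestamp):
    """Return the latest feature values recorded at or before read_timestamp."""
    timestamps = [ts for ts, _ in history]
    position = bisect_right(timestamps, read_timestamp)
    if position == 0:
        return None  # no feature value existed yet at that time
    return history[position - 1][1]


print(point_in_time_lookup(history, 250))  # values written at time 200
print(point_in_time_lookup(history, 50))   # nothing was known yet
```

A label observed at time 250 is paired with the values written at time 200, never with the later values from time 300, which is what prevents leakage from the future.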

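Let's see how to do that.

### Define the query for reading instances at a specific point in time

First, you need to define the set of read instances at a specific point in time in order to generate your training samples.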

In [None]:
read_instances_query = f"""
CREATE OR REPLACE TABLE
  `{PROJECT_ID}.{BQ_DATASET}.{READ_INSTANCES_TABLE}` AS
WITH

  # get training threshold ----------------------------------------------------------------------------------
  get_training_threshold AS (
  SELECT
    (MAX(event_timestamp) - 10800000000) AS training_thrs
  FROM
    `firebase-public-project.analytics_153293282.events_*`
  WHERE
    event_name="user_engagement"
    AND
    TIMESTAMP_ADD(PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MICROS(event_timestamp))), INTERVAL 1351 DAY) < '{TODAY}'),

  # query to create label -----------------------------------------------------------------------------------
  get_label AS (
  SELECT
    user_pseudo_id,
    user_last_engagement,
    # label = 1 (churned) if the last engagement is older than the training threshold
    # (3 hours before the latest event), else 0
  IF
    (user_last_engagement < (
      SELECT
        training_thrs
      FROM
        get_training_threshold),
      1,
      0 ) AS churned
  FROM (
    SELECT
      user_pseudo_id,
      MAX(event_timestamp) AS user_last_engagement
    FROM
      `firebase-public-project.analytics_153293282.events_*`
    WHERE
      event_name="user_engagement"
    AND
    TIMESTAMP_ADD(PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MICROS(event_timestamp))), INTERVAL 1351 DAY) < '{TODAY}'
    GROUP BY
      user_pseudo_id )
  GROUP BY
    1,
    2),

  # query to create class weights --------------------------------------------------------------------------------
  get_class_weights AS (
  SELECT
    CAST(COUNT(*) / (2*(COUNT(*) - SUM(churned))) AS STRING) AS class_weight_zero,
    CAST(COUNT(*) / (2*SUM(churned)) AS STRING) AS class_weight_one,
  FROM
    get_label )

SELECT
  user_pseudo_id as user,
  PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', CONCAT('{TODAY}', ' ', STRING(TIME_TRUNC(CURRENT_TIME(), SECOND))), 'UTC') as timestamp,
  churned AS churned,
  CASE
      WHEN churned = 0 THEN ( SELECT class_weight_zero FROM get_class_weights)
      ELSE ( SELECT class_weight_one
       FROM get_class_weights)
    END AS class_weights
FROM
  get_label 
"""
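
The labeling and class-weight arithmetic in the query can be checked on a toy example. The following is a minimal sketch in plain Python under stated assumptions: the user IDs and timestamps are invented, and the code mirrors only the threshold, label, and weight formulas, not the date filtering.

```python
# 10800000000 microseconds, the offset subtracted in the query, is three hours.
THREE_HOURS_MICROS = 3 * 60 * 60 * 1_000_000

# Hypothetical last-engagement timestamps in microseconds, one per user.
last_engagement = {
    "user_a": 20_000_000_000,
    "user_b": 5_000_000_000,
    "user_c": 19_000_000_000,
    "user_d": 18_500_000_000,
}

# Threshold: three hours before the most recent engagement event.
threshold = max(last_engagement.values()) - THREE_HOURS_MICROS

# churned = 1 when the user's last engagement is older than the threshold.
labels = {user: int(ts < threshold) for user, ts in last_engagement.items()}

# Balanced class weights: total / (2 * number of rows in the class).
total = len(labels)
positives = sum(labels.values())
class_weight_zero = total / (2 * (total - positives))
class_weight_one = total / (2 * positives)

print(labels)
print(class_weight_zero, class_weight_one)
```

With one churned user out of four, the rare class gets the larger weight. Both formulas divide by a class count, so they assume each class has at least one row.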

### Create the BigQuery table of read instances

You store these instances in a BigQuery table.

In [None]:
run_bq_query(read_instances_query)

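### Serve features for batch training

Then use `batch_serve_to_gcs` to generate your training samples and store them as CSV files in the destination Cloud Storage bucket.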

In [None]:
mobile_gaming_feature_store.batch_serve_to_gcs(
    gcs_destination_output_uri_prefix=GCS_DESTINATION_OUTPUT_URI,
    gcs_destination_type="csv",
    serving_feature_ids=SERVING_FEATURE_IDS,
    read_instances_uri=READ_INSTANCES_URI,
    pass_through_fields=["churned", "class_weights"],
)

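Train a custom model on Vertex AI using Training Pipelines

Now that the training samples are generated, you use the Vertex AI SDK to train a new version of the model with Vertex AI Training.

Create the training package and the training sample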

In [None]:
!rm -Rf train_package #if train_package already exist

In [None]:
!mkdir -m 777 -p trainer data/ingest data/raw model config
!gsutil -m cp -r $GCS_DESTINATION_OUTPUT_URI/*.csv data/ingest
!head -n 2000 data/ingest/*.csv > data/raw/sample.csv
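
When `head` is given several files, it prints a `==> name <==` banner before each one, which would end up inside `sample.csv` if the export produced more than one shard. The following is a minimal Python sketch of an alternative that keeps a single header row; `sample_csv_shards` is a hypothetical helper, it assumes every shard starts with the same header, and the demo files are created in a temporary directory purely for illustration.

```python
import csv
import glob
import os
import tempfile


def sample_csv_shards(pattern, output_path, max_rows=2000):
    """Write one header plus the first max_rows data rows found across CSV shards."""
    written = 0
    header_written = False
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for shard in sorted(glob.glob(pattern)):
            with open(shard, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty shards
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    if written >= max_rows:
                        return written
                    writer.writerow(row)
                    written += 1
    return written


# Demo on two tiny made-up shards.
with tempfile.TemporaryDirectory() as tmp_dir:
    shards = [("part-0.csv", ["u1,1", "u2,0"]), ("part-1.csv", ["u3,1"])]
    for name, rows in shards:
        with open(os.path.join(tmp_dir, name), "w") as shard_file:
            shard_file.write("user,churned\n" + "\n".join(rows) + "\n")
    sample_path = os.path.join(tmp_dir, "sample.out")
    demo_rows = sample_csv_shards(
        os.path.join(tmp_dir, "part-*.csv"), sample_path, max_rows=3
    )
    with open(sample_path) as sample_file:
        demo_lines = sample_file.read().splitlines()

print(demo_rows, demo_lines)
```

In the notebook layout, the equivalent call would read from `data/ingest/*.csv` and write to `data/raw/sample.csv`.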

Create the training script

You create the training script to train an XGBoost model.

In [None]:
!touch trainer/__init__.py

In [None]:
%%writefile trainer/task.py
import os
from pathlib import Path
import argparse
import yaml

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
import xgboost as xgb
import joblib
import warnings
warnings.filterwarnings("ignore")

def get_args():
    """
    Get arguments from command line.
    Returns:
        args: parsed arguments
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--data_path',
        required=False,
        default=os.getenv('AIP_TRAINING_DATA_URI'),
        type=str,
        help='path to read data')
    parser.add_argument(
        '--learning_rate',
        required=False,
        default=0.01,
        type=float,
        help='learning rate')
    parser.add_argument(
        '--model_dir',
        required=False,
        default=os.getenv('AIP_MODEL_DIR'),
        type=str,
        help='dir to store saved model')
    parser.add_argument(
        '--config_path',
        required=False,
        default='../config.yaml',
        type=str,
        help='path to read config file')
    args = parser.parse_args()
    return args


def ingest_data(data_path, data_model_params):
    """
    Ingest data
    Args:
        data_path: path to read data
        data_model_params: data model parameters
    Returns:
        df: dataframe
    """
    # read training data
    df = pd.read_csv(data_path, sep=',',
                     dtype={col: 'string' for col in data_model_params['categorical_features']})
    return df


def preprocess_data(df, data_model_params):
    """
    Preprocess data
    Args:
        df: dataframe
        data_model_params: data model parameters
    Returns:
        df: dataframe
    """

    # convert nan values because pd.NA is not supported by SimpleImputer
    # bug in sklearn 0.23.1 version: https://github.com/scikit-learn/scikit-learn/pull/17526
    # decided to skip NAN values for now
    df.replace({pd.NA: np.nan}, inplace=True)
    df.dropna(inplace=True)

    # get features and labels
    x = df[data_model_params['numerical_features'] + data_model_params['categorical_features'] + [
        data_model_params['weight_feature']]]
    y = df[data_model_params['target']]

    # train-test split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=data_model_params['train_test_split']['test_size'],
                                                        random_state=data_model_params['train_test_split'][
                                                            'random_state'])
    return x_train, x_test, y_train, y_test


def build_pipeline(learning_rate, model_params):
    """
    Build pipeline
    Args:
        learning_rate: learning rate
        model_params: model parameters
    Returns:
        pipeline: pipeline
    """
    # build pipeline
    pipeline = Pipeline([
        # ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ('model', xgb.XGBClassifier(learning_rate=learning_rate,
                                    use_label_encoder=False, #deprecated and breaks Vertex AI predictions
                                    **model_params))
    ])
    return pipeline


def main():
    print('Starting training...')
    args = get_args()
    data_path = args.data_path
    learning_rate = args.learning_rate
    model_dir = args.model_dir
    config_path = args.config_path

    # read config file (the with block closes the file automatically)
    with open(config_path, 'r') as f:
        config = yaml.load(f, Loader=yaml.FullLoader)
    data_model_params = config['data_model_params']
    model_params = config['model_params']

    # ingest data
    print('Reading data...')
    data_df = ingest_data(data_path, data_model_params)

    # preprocess data
    print('Preprocessing data...')
    x_train, x_test, y_train, y_test = preprocess_data(data_df, data_model_params)
    sample_weight = x_train.pop(data_model_params['weight_feature'])
    sample_weight_eval_set = x_test.pop(data_model_params['weight_feature'])

    # train xgb model
    print('Training model...')
    xgb_pipeline = build_pipeline(learning_rate, model_params)
    # need to use fit_transform to get the encoded eval data
    x_train_transformed = xgb_pipeline[:-1].fit_transform(x_train)
    x_test_transformed = xgb_pipeline[:-1].transform(x_test)
    xgb_pipeline[-1].fit(x_train_transformed, y_train,
                         sample_weight=sample_weight,
                         eval_set=[(x_test_transformed, y_test)],
                         sample_weight_eval_set=[sample_weight_eval_set],
                         eval_metric='error',
                         early_stopping_rounds=50,
                         verbose=True)
    # save model
    print('Saving model...')
    model_path = Path(model_dir)
    model_path.mkdir(parents=True, exist_ok=True)
    joblib.dump(xgb_pipeline, f'{model_dir}/model.joblib')


if __name__ == "__main__":
    main()
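
The training script fits the encoder on the training split and only transforms the evaluation split, so both splits share the same encoded columns before they reach the model's `eval_set`. The toy sketch below illustrates that fit-then-transform pattern in plain Python. `TinyOneHotEncoder` is a hypothetical stand-in and not scikit-learn's `OneHotEncoder`, and the country and operating-system values are made up.

```python
class TinyOneHotEncoder:
    """Toy one-hot encoder that encodes unknown categories as all zeros."""

    def fit(self, rows):
        n_columns = len(rows[0])
        # Learn the sorted set of categories seen in each column of the training rows.
        self.categories_ = [
            sorted({row[i] for row in rows}) for i in range(n_columns)
        ]
        return self

    def transform(self, rows):
        encoded = []
        for row in rows:
            vector = []
            for value, categories in zip(row, self.categories_):
                vector.extend(1 if value == category else 0 for category in categories)
            encoded.append(vector)
        return encoded


train_rows = [["US", "ANDROID"], ["IT", "IOS"]]
eval_rows = [["US", "IOS"], ["FR", "ANDROID"]]  # "FR" never appears in training

encoder = TinyOneHotEncoder().fit(train_rows)  # fit on training data only
train_encoded = encoder.transform(train_rows)
eval_encoded = encoder.transform(eval_rows)    # reuse the fitted categories

print(train_encoded)
print(eval_encoded)
```

Because the categories are learned once from the training rows, the evaluation vectors have the same width, and the unseen country contributes only zeros to its block of columns.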

Create the requirements.txt file

You need to write the requirements.txt file used to build the training container.

In [None]:
%%writefile requirements.txt
pip==22.0.4
PyYAML==5.3.1
joblib==0.15.1
numpy==1.18.5
pandas==1.0.4
scipy==1.4.1
scikit-learn==0.23.1
xgboost==1.1.1

### Create the training configuration

You create the training configuration from the data and model parameters.

In [None]:
%%writefile config/config.yaml
data_model_params:
  target: churned
  categorical_features:
    - country
    - operating_system
    - language
  numerical_features:
    - cnt_user_engagement
    - cnt_level_start_quickplay
    - cnt_level_end_quickplay
    - cnt_level_complete_quickplay
    - cnt_level_reset_quickplay
    - cnt_post_score
    - cnt_spend_virtual_currency
    - cnt_ad_reward
    - cnt_challenge_a_friend
    - cnt_completed_5_levels
    - cnt_use_extra_steps
  weight_feature: class_weights
  train_test_split:
    test_size: 0.2
    random_state: 8
model_params:
  booster: gbtree
  objective: binary:logistic
  max_depth: 80
  n_estimators: 100
  random_state: 8

Test the model locally with `local-run`. You can use the `local-run` command in the Vertex AI SDK to run the script.

In [None]:
test_job_script = f"""
gcloud ai custom-jobs local-run \
--executor-image-uri={BASE_CPU_IMAGE} \
--python-module=trainer.task \
--extra-dirs=config,data,model \
-- \
--data_path data/raw/sample.csv \
--model_dir model \
--config_path config/config.yaml
"""

# The with block closes the file automatically.
with open("local_train_job_run.sh", "w+") as s:
    s.write(test_job_script)
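
One detail of the cell above is worth knowing when you inspect the generated file: in a regular (non-raw) triple-quoted string, Python removes every backslash that is immediately followed by a newline, so the command is written to the script as one long line. The following minimal sketch shows the difference from a raw f-string; the `echo` command and the `image` value are placeholders for illustration only.

```python
image = "example-image:latest"  # hypothetical value, for illustration only

# Regular f-string: each backslash-newline pair is dropped by Python.
joined = f"""
echo run \
--image={image} \
--flag
"""

# Raw f-string: backslashes and newlines are kept exactly as typed.
preserved = rf"""
echo run \
--image={image} \
--flag
"""

print(repr(joined))
print(repr(preserved))
```

Either form gives the shell the same command, since the shell treats a trailing backslash as a line continuation. The only difference is how the script file looks.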

In [None]:
!chmod +x ./local_train_job_run.sh && ./local_train_job_run.sh

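### Create and launch the custom training job, using `autopackaging` to train the model.

You can use `autopackaging` in the Vertex AI SDK to:

1. Build a custom Docker training image.
2. Push the image to Container Registry.
3. Launch a Vertex AI CustomJob.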

In [None]:
!mkdir -m 777 -p {MODEL_PACKAGE_PATH} && mv -t {MODEL_PACKAGE_PATH} trainer requirements.txt config

In [None]:
train_job_script = f"""
gcloud ai custom-jobs create \
--region={REGION} \
--display-name={TRAIN_JOB_NAME} \
--worker-pool-spec=machine-type={TRAINING_MACHINE_TYPE},replica-count={TRAINING_REPLICA_COUNT},executor-image-uri={BASE_CPU_IMAGE},local-package-path={MODEL_PACKAGE_PATH},python-module=trainer.task,extra-dirs=config \
--args=--data_path={DATA_PATH},--model_dir={MODEL_DIR},--config_path=config/config.yaml \
--verbosity='info'
"""

# The with block closes the file automatically.
with open("train_job_run.sh", "w+") as s:
    s.write(train_job_script)

In [None]:
!chmod +x ./train_job_run.sh && ./train_job_run.sh

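### Check the status and results of the training job.

You can use the following commands to monitor the status of the job and, once training has run successfully, check the artifacts in the bucket.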

In [None]:
TRAIN_JOB_RESOURCE_NAME = "[your-train-job-resource-name]"  # @param {type:"string"}

In [None]:
!gcloud ai custom-jobs describe $TRAIN_JOB_RESOURCE_NAME

In [None]:
!gsutil ls $DESTINATION_URI

### Upload and deploy the model on a Vertex AI endpoint

You use a custom function to upload your model to Vertex AI Model Registry.

In [None]:
xgb_model = upload_model(
    display_name=MODEL_NAME,
    serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
    artifact_uri=DESTINATION_URI,
)

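### Deploy the model to the same endpoint with traffic splitting

Now that the model is registered in the Model Registry, you can deploy it to an endpoint. You first create the endpoint and then deploy your model.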

In [None]:
endpoint = create_endpoint(display_name=ENDPOINT_NAME)

In [None]:
deployed_model = deploy_model(
    model=xgb_model,
    machine_type=SERVING_MACHINE_TYPE,
    endpoint=endpoint,
    deployed_model_display_name=DEPLOYED_MODEL_NAME,
    min_replica_count=1,
    max_replica_count=1,
    sync=False,
)

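# Serve ML features at scale with low latency

At this point, you are ready to **deploy the simple model, which needs preprocessed attributes fetched in real time as input features**.

Here is how it works.

But think about those features.

The behavioral features used to train the model cannot be computed at the moment the model serves a request online.

How would you compute, on the fly, the number of times a user challenged a friend in the last 24 hours?

You need to compute this kind of feature on the server side and serve it with low latency. And because BigQuery is not optimized for this kind of read, you need a different service that supports singleton lookups, where the result is a single row with many columns.

Moreover, even when that is not the case, deploying a model that requires preprocessed data means you must apply the same preprocessing steps as at training time. If you cannot, skew arises between training and serving data, which seriously hurts your model's performance (and in the worst case breaks your serving system).

You need a way to mitigate this so that you do not have to implement those preprocessing steps online, and can simply serve the same aggregated features used for training to generate online predictions.

These are further good reasons to introduce Vertex AI Feature Store. It gives you a service that serves features at scale with low latency, in the same form they had at training time, which mitigates possible training-serving skew.

Now that you know **why you need a feature store**, let's close this journey by deploying your model with the feature store: retrieve features online, pass them to the endpoint, and generate predictions.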

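## Simulate online predictions

Once the model is ready to receive prediction requests, you can use the `simulate_prediction` function to generate predictions.

Specifically, the function:

- formats entities for prediction
- retrieves static features with a singleton lookup in Vertex AI Feature Store
- runs the prediction request and gets the result

It does this for the number of requests you define, with some latency between them.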
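
The steps listed above can be sketched locally without any cloud resources. The following is a hypothetical stand-in and not the notebook's `simulate_prediction` function: the in-memory `ONLINE_FEATURES` table replaces the feature store, `fake_predict` replaces the endpoint call, and all values are invented.

```python
import random
import time

# Hypothetical in-memory stand-in for the online store: one feature row per entity.
ONLINE_FEATURES = {
    "user_1": {"cnt_user_engagement": 12.0, "cnt_post_score": 3.0},
    "user_2": {"cnt_user_engagement": 1.0, "cnt_post_score": 0.0},
}
FEATURE_ORDER = ["cnt_user_engagement", "cnt_post_score"]


def lookup_features(entity_id):
    """Singleton lookup: return the single feature row stored for one entity."""
    return ONLINE_FEATURES[entity_id]


def fake_predict(instance):
    """Placeholder for the endpoint call; returns a made-up score in [0, 1]."""
    return 1.0 / (1.0 + instance[0])


def simulate(n_requests, latency, seed=0):
    """Look up features and score one random entity per request."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_requests):
        entity_id = rng.choice(sorted(ONLINE_FEATURES))
        row = lookup_features(entity_id)
        # Build the instance in the same column order used for training.
        instance = [row[name] for name in FEATURE_ORDER]
        results.append((entity_id, fake_predict(instance)))
        time.sleep(latency)  # pause between requests
    return results


predictions = simulate(n_requests=3, latency=0.01)
for entity_id, score in predictions:
    print(entity_id, round(score, 3))
```

The point of the sketch is the ordering: the feature row comes from a lookup keyed by entity ID, and nothing is recomputed from raw events at request time.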

In [None]:
simulate_prediction(endpoint=endpoint, n_requests=10, latency=1)

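Cleaning up

To clean up all Google Cloud resources used in this project, you can delete the [Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.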

In [None]:
# delete feature store
mobile_gaming_feature_store.delete(sync=True, force=True)

In [None]:
# delete Vertex AI resources: undeploy the model, then remove the endpoint and the model
endpoint.undeploy_all()
endpoint.delete()
xgb_model.delete()

In [None]:
# Delete bucket
delete_bucket = False
if (delete_bucket or os.getenv("IS_TESTING")) and "BUCKET_URI" in globals():
    ! gsutil -m rm -r $BUCKET_URI

In [None]:
# Delete the BigQuery Dataset
!bq rm -r -f -d $PROJECT_ID:$BQ_DATASET