<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Framework%20Workflows/CatBoost/CatBoost%20Prediction%20With%20Vertex%20AI%20Feature%20Store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FFramework%2520Workflows%2FCatBoost%2FCatBoost%2520Prediction%2520With%2520Vertex%2520AI%2520Feature%2520Store.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Framework%20Workflows/CatBoost/CatBoost%20Prediction%20With%20Vertex%20AI%20Feature%20Store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Framework%20Workflows/CatBoost/CatBoost%20Prediction%20With%20Vertex%20AI%20Feature%20Store.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# CatBoost Prediction With Vertex AI Feature Store

---
## Colab Setup

To run this notebook in Colab run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names, with an optional minimum version as a third element.  If the import name is not found then the install name is used to install quietly for the current user.

In [22]:
# tuples of (import name, install name, min_version)
packages = [
    ('numpy', 'numpy'),
    ('catboost', 'catboost'),
    ('docker', 'docker'),
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.artifactregistry_v1', 'google-cloud-artifact-registry'),
    ('google.cloud.devtools', 'google-cloud-build'),
    ('google.cloud.run_v2', 'google-cloud-run'),   
]

import importlib.util, importlib.metadata
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by install (distribution) name
        # note: this is a plain string comparison of version numbers
        if importlib.metadata.version(package[1]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user

### API Enablement

In [4]:
!gcloud services enable artifactregistry.googleapis.com
!gcloud services enable cloudbuild.googleapis.com
!gcloud services enable run.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

inputs:

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [19]:
REGION = 'us-central1'
SERIES = 'frameworks-catboost'
EXPERIMENT = 'prediction-feature-store'

# GCS Names
GCS_BUCKET = PROJECT_ID

# make this the BigQuery Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2]

# Vertex AI Feature Store names:
FS_NAME = PROJECT_ID.replace('-', '_')
FV_NAME = f"{SERIES}-{EXPERIMENT}".replace('-', '_')

packages:

In [46]:
import json, os
import time, datetime
import requests

import catboost 
import numpy as np
import docker

import google.auth
from google.cloud import storage
from google.cloud import artifactregistry_v1
from google.cloud.devtools import cloudbuild_v1
from google.cloud import run_v2
from google.cloud import bigquery

from google.cloud import aiplatform
from vertexai.resources.preview import feature_store

clients:

In [24]:
# gcs storage client
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

# cloud build client
cb = cloudbuild_v1.CloudBuildClient()

# artifact registry client
ar = artifactregistry_v1.ArtifactRegistryClient()

# cloud run client
cr = run_v2.ServicesClient()

# BigQuery client
bq = bigquery.Client(project = PROJECT_ID)

# vertex ai client
aiplatform.init(project = PROJECT_ID, location = REGION)

Parameters:

In [10]:
DIR = f"files/{EXPERIMENT}"

Environment:

In [11]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## CatBoost Model

Retrieve the model trained in the prior workflow, along with test records, and test the model directly in this environment.

### Check For Files

In [12]:
files = list(bucket.list_blobs(prefix = f'{SERIES}/notebook'))
if len(files) > 0:
    print('Found the files created by the prerequisite workflow:')
    for file in files:
        print(f'- gs://{bucket.name}/{file.name}')
else:
    print('Files not found - Please run the prerequisite notebook (listed at top of this workflow)')

Found the files created by the prerequisite workflow:
- gs://statmike-mlops-349915/frameworks-catboost/notebook/examples.json
- gs://statmike-mlops-349915/frameworks-catboost/notebook/model.cbm


### Load Model

In [13]:
model_blob = bucket.blob(f'{SERIES}/notebook/model.cbm')
model_bytes = model_blob.download_as_bytes()
model = catboost.CatBoostClassifier().load_model(blob = model_bytes)

### Load Inference Examples

In [14]:
examples_blob = bucket.blob(f'{SERIES}/notebook/examples.json')
examples_np = np.array(
    json.loads(examples_blob.download_as_string())
)

### Test Model With Examples

In [15]:
model.predict(examples_np)

array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1])

In [18]:
model.feature_names_

['Time',
 'V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V8',
 'V9',
 'V10',
 'V11',
 'V12',
 'V13',
 'V14',
 'V15',
 'V16',
 'V17',
 'V18',
 'V19',
 'V20',
 'V21',
 'V22',
 'V23',
 'V24',
 'V25',
 'V26',
 'V27',
 'V28',
 'Amount']

---
## BigQuery - The Offline Store For Vertex AI Feature Store

The offline store for [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview) is BigQuery.  This streamlines ML feature management prior to serving online with feature store.  The data used to train this model in [CatBoost In Notebook](./CatBoost%20In%20Notebook.ipynb) actually came from BigQuery already.

This section prepares a version of the BigQuery public table used in the training notebook as a table with an id column to identify individual rows - `transaction_id`.

### Create/Recall Dataset

In [25]:
dataset = bigquery.Dataset(f"{BQ_PROJECT}.{BQ_DATASET}")
dataset.location = BQ_REGION
bq_dataset = bq.create_dataset(dataset, exists_ok = True)

### Create/Recall Table

In [27]:
query = f"""
    CREATE TABLE IF NOT EXISTS `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` AS
    SELECT GENERATE_UUID() AS transaction_id, *
    FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`;
"""
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7fa919c827a0>

In [29]:
(job.ended - job.started).total_seconds()

9.254

### Review BigQuery Table

In [31]:
bq.query(f"SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` LIMIT 3").to_dataframe()

| | transaction_id | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3a21a7e6-4b33-4fce-b7da-289d16146df3 | 115285.0 | -0.992995 | 0.594899 | 1.61371 | 2.939088 | 1.675179 | 0.327479 | 0.328362 | 0.207276 | ... | 0.156509 | 0.445081 | -0.364915 | -0.640996 | 0.401665 | 0.180925 | 0.081706 | 0.134099 | 0.0 | 0 |
| 1 | cb06dca2-6573-4fa9-a51e-42ae1bbbaeba | 91543.0 | -0.929128 | 2.514791 | -2.878171 | -0.1975 | 0.709615 | 0.097873 | -3.285721 | -11.169795 | ... | -4.757274 | 2.759508 | 0.350877 | -0.008925 | -0.196056 | 0.320867 | 0.04583 | 0.503654 | 0.0 | 0 |
| 2 | 621d7cfc-278e-41d5-a533-c3cdb0e68b44 | 141273.0 | 2.147444 | 0.200913 | -2.664948 | 0.195083 | 1.234707 | -0.832371 | 0.870348 | -0.420298 | ... | 0.282831 | 0.953374 | -0.212726 | 0.384479 | 0.735806 | 0.696998 | -0.131764 | -0.100937 | 0.0 | 0 |


### Get A List of `transaction_id` Values For Testing

Get a list of `transaction_id` values to use later in this workflow:

In [50]:
transaction_ids = bq.query(f"SELECT transaction_id FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE Class = 1 LIMIT 10").to_dataframe()['transaction_id'].tolist()

In [51]:
transaction_ids

['ac53e6ca-f676-48fa-94cc-9a35972f838d',
 'dcadf588-4079-474f-b0b6-5872949ab2bf',
 'cff98a84-d59e-4782-a830-85719d1bc568',
 '342e315a-baad-469c-b940-68cdffe5b250',
 '920807da-79dc-408e-9aed-da8ce7da0ec7',
 '806c0df5-a785-4562-addd-2a0fcbd03273',
 '0953278e-e1cc-4895-91bf-dfb2e79e86e4',
 '1a380124-4ca5-41b4-afc0-18fb4478a76e',
 'fb15047a-d389-48c3-8b3b-c4faba32a418',
 'e12d9f95-f115-4d7f-8da8-437c584c5740']

---
## Understanding Vertex AI Feature Store

The next sections set up online serving for the BigQuery table with [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/latest/overview). This workflow takes the shortest path: synchronizing a BigQuery table directly to feature store.  There are more flexible paths as well, using the Feature Registry, where features across multiple tables and views can come together in a single serving structure called a feature view.  You can read more about this within the [MLOps](../../MLOps/readme.md) section of this repository, which includes a deep dive into [feature stores](../../MLOps/Feature%20Store/readme.md).

<p align="center" ><center>
    <img src="../../MLOps/resources/images/created/featurestore/overview.png" width="75%">
</center></p>

---
## Setup Vertex AI Feature Store

### Create/Retrieve Online Store

The first step is to create a Vertex AI Feature Store online store.  There are two serving types to choose from when setting up a feature store: Bigtable and Optimized.  This workflow uses Optimized online serving, which can also [provide vector similarity search](https://cloud.google.com/vertex-ai/docs/featurestore/latest/embeddings-search), functionality that Bigtable serving does not offer.
>**NOTE:** This can take around 10 minutes if creating a new feature store instance

**Reference:**
- [Create an Online Store Instance](https://cloud.google.com/vertex-ai/docs/featurestore/latest/create-onlinestore)
- [Online Serving Types](https://cloud.google.com/vertex-ai/docs/featurestore/latest/online-serving-types)

In [36]:
try:
    online_store = feature_store.FeatureOnlineStore(name = FS_NAME)
    print(f"Found the feature store:\n{online_store.resource_name}")
except Exception:
    print("Create the feature store...")
    online_store = feature_store.FeatureOnlineStore.create_optimized_store(
        name = FS_NAME
    )
    print(f"Created the feature store:\n{online_store.resource_name}")

Found the feature store:
projects/1026793852137/locations/us-central1/featureOnlineStores/statmike_mlops_349915


In [37]:
online_store.name

'statmike_mlops_349915'

### Create/Retrieve Feature View From BigQuery Source

There are two paths to [creating feature views](https://cloud.google.com/vertex-ai/docs/featurestore/latest/create-featureview) in feature store. The one used here syncs a BigQuery table or view directly to the online store. The alternative uses the feature registry, which gives greater control over selecting features (columns) from multiple BigQuery source tables and views.  Learn more about Vertex AI Feature Store in this repository's [MLOps](../../MLOps/readme.md) section, which includes a deep dive into [feature stores](../../MLOps/Feature%20Store/readme.md).

**Reference:**
- [Create a feature view instance](https://cloud.google.com/vertex-ai/docs/featurestore/latest/create-featureview)

In [39]:
try:
    feature_view = feature_store.FeatureView(
        name = FV_NAME,
        feature_online_store_id = online_store.resource_name
    )
    print(f"Found the feature view:\n{feature_view.resource_name}")
except Exception:
    print(f"Create the feature view...")
    feature_view = online_store.create_feature_view(
        name = FV_NAME,
        source = feature_store.utils.FeatureViewBigQuerySource(
            uri = f'bq://{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}',
            entity_id_columns = ['transaction_id'] # can be multiple columns 
        ),
        sync_config = 'TZ=America/New_York 0 22 * * *' # Ex: every day at 10PM, just once per day
    )   
    print(f"Created the feature view:\n{feature_view.resource_name}")

Found the feature view:
projects/1026793852137/locations/us-central1/featureOnlineStores/statmike_mlops_349915/featureViews/frameworks_catboost_prediction_feature_store


In [40]:
feature_view.name

'frameworks_catboost_prediction_feature_store'

### Managing Synchronization

Force a synchronization rather than wait for the next scheduled sync:

In [41]:
force_sync = feature_view.sync()

In [42]:
type(force_sync)

vertexai.resources.preview.feature_store.feature_view.FeatureView.FeatureViewSync

In [43]:
force_sync.to_dict()

{'name': 'projects/1026793852137/locations/us-central1/featureOnlineStores/statmike_mlops_349915/featureViews/frameworks_catboost_prediction_feature_store/featureViewSyncs/2189078868663468032',
 'createTime': '2024-11-19T16:35:33.210253Z',
 'runTime': {'startTime': '2024-11-19T16:35:33.210253Z'}}

Get updated information about the sync job:

In [44]:
force_sync = feature_view.get_sync(name = force_sync.name)
force_sync.to_dict()

{'name': 'projects/1026793852137/locations/us-central1/featureOnlineStores/statmike_mlops_349915/featureViews/frameworks_catboost_prediction_feature_store/featureViewSyncs/2189078868663468032',
 'createTime': '2024-11-19T16:35:33.210253Z',
 'runTime': {'startTime': '2024-11-19T16:35:33.210253Z'}}

Wait on the sync job to complete and report timing and rows synced:

In [47]:
waited = 0
while True:
    sync_status = feature_view.get_sync(name = force_sync.name).to_dict()
    if 'endTime' in list(sync_status['runTime'].keys()):
        seconds = (
            datetime.datetime.fromisoformat(sync_status['runTime']['endTime'].replace('Z', '+00:00'))
            -
            datetime.datetime.fromisoformat(sync_status['runTime']['startTime'].replace('Z', '+00:00'))
        ).total_seconds()
        rows = sync_status['syncSummary']['rowSynced']
        print(f"Sync completed in {seconds} seconds and synced {rows} rows.")
        break
    else:
        print(f"Waited {waited} seconds, Update again in 30 seconds...")
        time.sleep(30)
        waited += 30

Sync completed in 203.904381 seconds and synced 284807 rows.
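
The polling loop above runs until `endTime` appears and has no upper bound on how long it waits. A minimal sketch of the same pattern with a timeout follows, assuming the status dictionary has the shape shown in the outputs above (`runTime.startTime`, `runTime.endTime`, `syncSummary.rowSynced`). The names `wait_for_sync`, `parse_ts`, `get_status`, `poll_seconds`, and `timeout_seconds` are hypothetical and not part of the Vertex AI SDK; `get_status` stands for any zero-argument callable that returns the current status.

```python
import datetime
import time


def parse_ts(value):
    """Parse a timestamp such as '2024-11-19T16:35:33.210253Z'."""
    return datetime.datetime.fromisoformat(value.replace('Z', '+00:00'))


def wait_for_sync(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until the sync reports an endTime or the timeout elapses.

    Returns (duration_seconds, rows_synced); raises TimeoutError otherwise.
    All names here are hypothetical helpers for illustration.
    """
    waited = 0
    while True:
        status = get_status()
        run_time = status.get('runTime', {})
        if 'endTime' in run_time:
            seconds = (
                parse_ts(run_time['endTime']) - parse_ts(run_time['startTime'])
            ).total_seconds()
            rows = status.get('syncSummary', {}).get('rowSynced')
            return seconds, rows
        if waited >= timeout_seconds:
            raise TimeoutError(f"Sync did not finish within {timeout_seconds} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook the callable would plausibly be `lambda: feature_view.get_sync(name = force_sync.name).to_dict()`. Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.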


Get a list of sync jobs:

In [48]:
list_syncs = feature_view.list_syncs()

Print out the end time and rows synced for each job:

In [49]:
for sync in list_syncs:
    s = feature_view.get_sync(name = sync.name).to_dict()
    ended = datetime.datetime.fromisoformat(s['runTime']['endTime'].replace('Z', '+00:00')).strftime("%m/%d/%Y %H:%M:%S")
    rows = s['syncSummary']['rowSynced']
    print(f"Sync completed at {ended} and synced {rows} rows.")

Sync completed at 11/19/2024 16:38:57 and synced 284807 rows.


### Retrieve: Features For Entity


In [52]:
results = feature_view.read(key = [transaction_ids[0]]).to_dict()['features']

Public endpoint for the optimized online store statmike_mlops_349915 is 6457115130579648512.us-central1-1026793852137.featurestore.vertexai.goog


In [53]:
results

[{'name': 'Time', 'value': {'double_value': 84204.0}},
 {'name': 'V1', 'value': {'double_value': -0.937843305478391}},
 {'name': 'V2', 'value': {'double_value': 3.46288948991687}},
 {'name': 'V3', 'value': {'double_value': -6.44510395393435}},
 {'name': 'V4', 'value': {'double_value': 4.9321986662268005}},
 {'name': 'V5', 'value': {'double_value': -2.2339830698224503}},
 {'name': 'V6', 'value': {'double_value': -2.29156112129773}},
 {'name': 'V7', 'value': {'double_value': -5.69559392853253}},
 {'name': 'V8', 'value': {'double_value': 1.3388246336226102}},
 {'name': 'V9', 'value': {'double_value': -4.3223765532932905}},
 {'name': 'V10', 'value': {'double_value': -8.0991193981365}},
 {'name': 'V11', 'value': {'double_value': 7.18296700883659}},
 {'name': 'V12', 'value': {'double_value': -9.44594338249901}},
 {'name': 'V13', 'value': {'double_value': -0.31461996752022403}},
 {'name': 'V14', 'value': {'double_value': -12.9914655817567}},
 {'name': 'V15', 'value': {'double_value': -0.13635

---
## CatBoost Prediction With Feature Store

Predictions with CatBoost work the same as before: the input is just an array of feature values.  The value of feature store is that it provides the current value of these features through a simple API call that needs only the entity id.  This section does these steps separately and then builds a simple Python function to bring them together:
- request features for an entity id, `transaction_id`
- prepare the features for the model: ensure the order is correct and convert to a NumPy array for input to the model
- make the prediction request

### Step 1: Request Features Based On Entity ID

In [54]:
entity_id = transaction_ids[1]

In [55]:
features = feature_view.read(key = [entity_id]).to_dict()['features']

### Step 2: Prepare Features For Inference
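
The feature view response is a list of `{'name': ..., 'value': {...}}` entries, as in the output shown earlier, and its order is not guaranteed to match the model. One possible sketch of this step orders the values by the model's feature names. The helper `order_features` and the shortened sample response are hypothetical; the sketch assumes each `value` holds a single typed entry such as `double_value`.

```python
def order_features(features, feature_names):
    """Return feature values as a list ordered like feature_names.

    features:      list of {'name': str, 'value': {<type>_value: number}} entries
    feature_names: the order the model expects, e.g. model.feature_names_
    """
    # Map each feature name to its single typed value (e.g. 'double_value').
    lookup = {
        f['name']: next(iter(f.get('value', {}).values()), None)
        for f in features
    }
    missing = [name for name in feature_names if lookup.get(name) is None]
    if missing:
        raise KeyError(f"Features missing from response: {missing}")
    # Extra entries in the response (such as a label column) are ignored.
    return [float(lookup[name]) for name in feature_names]


# Shortened, made-up response and feature list for illustration:
response = [
    {'name': 'V1', 'value': {'double_value': -0.5}},
    {'name': 'Time', 'value': {'double_value': 100.0}},
    {'name': 'Class', 'value': {'int64_value': 1}},
]
row = order_features(response, ['Time', 'V1'])
print(row)  # [100.0, -0.5]
```

In this notebook the call would likely be `order_features(features, model.feature_names_)`, and wrapping the result as `np.array([row])` gives the two-dimensional input with a single row for the model.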

### Step 3: Make Prediction
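
With an ordered row of feature values, the prediction is a single call to the model's `predict` method on a batch of one row. The sketch below adds a feature-count check before that call. `predict_row` and `StandInModel` are hypothetical: the stand-in replaces the CatBoost model loaded earlier only so that the block runs on its own, and its rule is a toy, not the trained classifier.

```python
def predict_row(model, row):
    """Predict the class for one ordered feature row.

    model: any object with feature_names_ and predict(rows), such as the
           CatBoostClassifier loaded earlier in this notebook
    row:   feature values already ordered like model.feature_names_
    """
    expected = len(model.feature_names_)
    if len(row) != expected:
        raise ValueError(f"Expected {expected} features, got {len(row)}")
    # predict takes a batch, so wrap the single row and take the first result.
    return int(model.predict([row])[0])


class StandInModel:
    """Hypothetical stand-in so this sketch runs without the trained model."""
    feature_names_ = ['Time', 'V1']

    def predict(self, rows):
        # Toy rule for illustration only: flag rows with a negative V1.
        return [1 if r[1] < 0 else 0 for r in rows]


print(predict_row(StandInModel(), [100.0, -0.5]))  # 1
```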

### Combine Steps Into Function
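
One way to combine the three steps is a small factory that takes the feature lookup and the model call as arguments, so the same function can be exercised without GCP or the model file. This is a sketch under assumptions, not the notebook's final implementation: `make_predictor`, `read_features`, `predict`, and the stand-in data below are all hypothetical names.

```python
def make_predictor(read_features, feature_names, predict):
    """Build a function that maps an entity id to a prediction.

    read_features: callable(entity_id) -> list of feature entries
    feature_names: feature order expected by the model
    predict:       callable(rows) -> predictions for a batch of rows
    """
    def predict_for_entity(entity_id):
        # Step 1: request the features for this entity id.
        features = read_features(entity_id)
        # Step 2: order the values the way the model expects.
        lookup = {
            f['name']: next(iter(f.get('value', {}).values()), None)
            for f in features
        }
        row = [float(lookup[name]) for name in feature_names]
        # Step 3: predict on a batch of one row.
        return int(predict([row])[0])

    return predict_for_entity


# Stand-ins (hypothetical) so the sketch runs without GCP or the model:
fake_store = {
    'id-1': [
        {'name': 'V1', 'value': {'double_value': -0.5}},
        {'name': 'Time', 'value': {'double_value': 100.0}},
    ]
}
predictor = make_predictor(
    read_features=lambda entity_id: fake_store[entity_id],
    feature_names=['Time', 'V1'],
    predict=lambda rows: [1 if r[1] < 0 else 0 for r in rows],
)
print(predictor('id-1'))  # 1
```

In this notebook the wiring would plausibly be `read_features = lambda entity_id: feature_view.read(key = [entity_id]).to_dict()['features']`, `feature_names = model.feature_names_`, and `predict = model.predict`.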

- Idea:
    - ~~Store original data in BigQuery Table with transaction ID~~
    - ~~Sync to Feature Store Online~~
    - Build Function: Retrieve record (by entity_id), parse response, order columns, create array, predict
    - Create custom prediction container with FastAPI using the Function to retrieve based on input of entity_id
    - Deploy to local, Cloud Run, and Vertex AI Endpoint
    