# 07 - Vertex AI > Features - Feature Store

This is a demonstration of [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/overview). A feature store is a central repository for organizing, storing, and retrieving features. Vertex AI Feature Store is a fully managed service that scales the underlying compute and storage resources. It becomes the central location for serving features for training and prediction with low latency. Because it stores feature values together with the time they apply to, it supports:

-  Point-in-time lookups for retrieving features for model training. Retrieving feature values as they were before a prediction prevents data leakage.
-  Managing training-serving skew
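
To make the point-in-time idea concrete, here is a minimal sketch in plain Python. It is not the Feature Store API: the `history` data and the `value_as_of` helper are hypothetical names used only for illustration. Each value is stored with the time it became valid, and a lookup returns the latest value at or before the requested time.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history for one feature of one entity: (valid_from, value), sorted by time.
history = [
    (datetime(2021, 9, 1), 10.0),
    (datetime(2021, 9, 10), 12.5),
    (datetime(2021, 9, 20), 7.0),
]

def value_as_of(history, when):
    """Return the latest value whose timestamp is <= `when`, or None if none exists yet."""
    times = [t for t, _ in history]
    idx = bisect_right(times, when)
    return history[idx - 1][1] if idx else None

print(value_as_of(history, datetime(2021, 9, 15)))  # 12.5
print(value_as_of(history, datetime(2021, 8, 1)))   # None
```

A training row built for 15 September sees 12.5, never the later value 7.0, which is how a point-in-time lookup avoids leaking future information into training data.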

**Prerequisites:**

-  01 - BigQuery - Table Data Source
-  Any of 02-05 That Deploy A Model To An Endpoint
   -  Used to demonstrate online predictions with feature store serving features

**Overview:**

-  Create a Feature Store
-  Define an entity type
-  Define features for an entity type
   -  For this demonstration I use metadata from a BigQuery table to define features
      -  dataset.INFORMATION_SCHEMA.COLUMNS
-  Search Features
   -  Using FeaturestoreServiceClient.search_features()
-  Import feature values
   -  Using a BigQuery table as the source data
-  Serve feature values
   -  For one entity_id
   -  For multiple entity_id values
   -  Batch Feature request
-  Use online feature serving as input to online prediction with Vertex AI Endpoint

**Resources:**

-  [Python Client for Vertex AI](https://googleapis.dev/python/aiplatform/latest/aiplatform.html)
   -  Currently using the [v1beta1 services](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1beta1/services.html)
-  [Feature Store Overview](https://cloud.google.com/vertex-ai/docs/featurestore/overview)
-  [Data Model and Concepts](https://cloud.google.com/vertex-ai/docs/featurestore/concepts)
-  [Best Practices](https://cloud.google.com/vertex-ai/docs/featurestore/best-practices) including info on composite entity types

**Related Training:**

-  todo

---
## Vertex AI - Conceptual Flow

<img src="architectures/slides/slide_35.png">

---
## Vertex AI - Workflow

<img src="architectures/slides/slide_36.png">

---
## Setup

inputs:

In [43]:
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
DATANAME = 'fraud'
NOTEBOOK = '07'

ENTITYTYPE_ID = 'transaction'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters

packages:

In [44]:
from google.cloud.aiplatform_v1beta1 import (FeaturestoreOnlineServingServiceClient, FeaturestoreServiceClient, types)
from google.cloud import aiplatform

from google.protobuf.duration_pb2 import Duration
from google.protobuf.timestamp_pb2 import Timestamp
from google.protobuf.field_mask_pb2 import FieldMask

from google.cloud import bigquery
from google.cloud.aiplatform_v1beta1 import (PredictionServiceClient, EndpointServiceClient)
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import json
import numpy as np

clients:

In [45]:
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}

clients = {}
clients['fs'] = FeaturestoreServiceClient(client_options = client_options)
clients['fs_olserve'] = FeaturestoreOnlineServingServiceClient(client_options = client_options)

clients['bq'] = bigquery.Client()

aiplatform.init(project=PROJECT_ID, location=REGION)

parameters:

In [46]:
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"
DIR = f"temp/{NOTEBOOK}"

environment:

In [47]:
!rm -rf {DIR}
!mkdir -p {DIR}

---
## Feature Store Data model
Feature Store organizes data with the following 3 important hierarchical concepts:

Featurestore -> EntityType -> Feature

- **Featurestore**: the place to store your features
    - **EntityType**: under a Featurestore, an EntityType describes an object to be modeled, real or virtual.
        - **Feature**: under an EntityType, a feature describes an attribute of the EntityType

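For the fraud data used in this notebook, the feature store is named `fraud` (the value of `DATANAME`). The store has one entity type: `transaction`. The features are the columns of the `fraud_prepped` BigQuery table, excluding the target (`Class`), the `splits` column, and `transaction_id`, which serves as the entity ID.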
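
The hierarchy is visible in the resource names that appear in the outputs throughout this notebook. As a small sketch, the helper below composes such a name level by level; `feature_resource_name` is a hypothetical function written for illustration, not part of the client library.

```python
def feature_resource_name(project, location, featurestore, entity_type=None, feature=None):
    """Compose a resource name that follows Featurestore -> EntityType -> Feature."""
    name = f"projects/{project}/locations/{location}/featurestores/{featurestore}"
    if entity_type is not None:
        name += f"/entityTypes/{entity_type}"
        # A feature only exists under an entity type, so it is nested one level deeper.
        if feature is not None:
            name += f"/features/{feature}"
    return name

print(feature_resource_name("statmike-mlops", "us-central1", "fraud"))
print(feature_resource_name("statmike-mlops", "us-central1", "fraud", "transaction", "amount"))
```

The cells in this notebook use the client's own `featurestore_path` and `entity_type_path` methods for the first two levels.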

---
## Create Feature Store

In [48]:
FEATURESTORE_ID = DATANAME

In [49]:
featurestore_lro = clients['fs'].create_featurestore(
    types.featurestore_service.CreateFeaturestoreRequest(
        parent = PARENT,
        featurestore_id = FEATURESTORE_ID,
        featurestore=types.featurestore.Featurestore(
            display_name=f"Notebook {NOTEBOOK} demonstration of Vertex AI Features (feature store) using {DATANAME} data",
            online_serving_config=types.featurestore.Featurestore.OnlineServingConfig(
                fixed_node_count=2
            ),
        ),
    )
)

In [50]:
featurestore_lro.result()

name: "projects/691911073727/locations/us-central1/featurestores/fraud"

Use `get_featurestore` to see details of specified feature store:

In [51]:
clients['fs'].get_featurestore(name=clients['fs'].featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID))

name: "projects/691911073727/locations/us-central1/featurestores/fraud"
create_time {
  seconds: 1632489662
  nanos: 433944000
}
update_time {
  seconds: 1632489662
  nanos: 530980000
}
etag: "AMEw9yOB5uzhckBozEp_Sf5DHT8XNebqDrEj8wQR-FmQtrdvYe8vVyFIkKaflRo8Ms57"
online_serving_config {
  fixed_node_count: 2
}
state: STABLE

Use `list_featurestores` to see details of all feature stores:

In [52]:
clients['fs'].list_featurestores(parent=PARENT)

ListFeaturestoresPager<featurestores {
  name: "projects/691911073727/locations/us-central1/featurestores/fraud"
  create_time {
    seconds: 1632489662
    nanos: 433944000
  }
  update_time {
    seconds: 1632489662
    nanos: 530980000
  }
  etag: "AMEw9yNM8E8ZLIMiBFF9KgC8FUxjBsUL0e0rwyoHPaIrQGbSCZhVEwci5cxmZnTzwcfm"
  online_serving_config {
    fixed_node_count: 2
  }
  state: STABLE
}
>

---
## Create Entity Type

In [53]:
entitytype_lro = clients['fs'].create_entity_type(
    types.featurestore_service.CreateEntityTypeRequest(
        parent=clients['fs'].featurestore_path(PROJECT_ID, REGION, FEATURESTORE_ID),
        entity_type_id = ENTITYTYPE_ID,
        entity_type=types.entity_type.EntityType(
            description=f"Entity: {ENTITYTYPE_ID}, for data: {DATANAME}",
            monitoring_config=types.featurestore_monitoring.FeaturestoreMonitoringConfig(
                snapshot_analysis=types.featurestore_monitoring.FeaturestoreMonitoringConfig.SnapshotAnalysis(
                    monitoring_interval=Duration(seconds=86400),  # 1 day
                ),
            ),
        ),
    )
)

In [54]:
entitytype_lro.result()

name: "projects/691911073727/locations/us-central1/featurestores/fraud/entityTypes/transaction"
etag: "AMEw9yN873V75quSLZWa41VPozgukMtRHN3umbE3ovGcI6EdI0c9"

Use `list_entity_types` to see details of all entity types:

In [55]:
clients['fs'].list_entity_types(parent = f"{PARENT}/featurestores/{FEATURESTORE_ID}")

ListEntityTypesPager<entity_types {
  name: "projects/691911073727/locations/us-central1/featurestores/fraud/entityTypes/transaction"
  description: "Entity: transaction, for data: fraud"
  create_time {
    seconds: 1632489671
    nanos: 843915000
  }
  update_time {
    seconds: 1632489671
    nanos: 843915000
  }
  etag: "AMEw9yMk3l3r5gFK-06abAInTUUN4ifkZlHyCSqAf0vCcRguF1v5SQ5QD62jMOTorGFv"
  monitoring_config {
    snapshot_analysis {
      monitoring_interval {
        seconds: 86400
      }
    }
  }
}
>

---
## Create Features

Get the schema of the data source for new features:

In [56]:
schema = clients['bq'].query(query = f"SELECT * FROM {DATANAME}.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{DATANAME}_prepped'").to_dataframe()

In [57]:
schema

| | table_catalog | table_schema | table_name | column_name | ordinal_position | is_nullable | data_type | is_generated | generation_expression | is_stored | is_hidden | is_updatable | is_system_defined | is_partitioning_column | clustering_ordinal_position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | statmike-mlops | fraud | fraud_prepped | Time | 1 | YES | INT64 | NEVER | | | NO | | NO | NO | |
| 1 | statmike-mlops | fraud | fraud_prepped | V1 | 2 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 2 | statmike-mlops | fraud | fraud_prepped | V2 | 3 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 3 | statmike-mlops | fraud | fraud_prepped | V3 | 4 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 4 | statmike-mlops | fraud | fraud_prepped | V4 | 5 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 5 | statmike-mlops | fraud | fraud_prepped | V5 | 6 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 6 | statmike-mlops | fraud | fraud_prepped | V6 | 7 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 7 | statmike-mlops | fraud | fraud_prepped | V7 | 8 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 8 | statmike-mlops | fraud | fraud_prepped | V8 | 9 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |
| 9 | statmike-mlops | fraud | fraud_prepped | V9 | 10 | YES | FLOAT64 | NEVER | | | NO | | NO | NO | |


Prepare a request for `batch_create_features`:
- specification for the features, data type and descriptions ....
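
The type mapping in the request can be sketched on its own as a lookup table. This is a minimal, library-free illustration: `BQ_TO_FEATURE_TYPE` and `feature_value_type` are hypothetical names, the value type names are plain strings rather than the client's enum members, and the sample columns (including the type of `Amount`) are illustrative. Raising on an unmapped type avoids silently reusing the type of the previous column.

```python
# Assumed mapping from BigQuery column types to Feature Store value type names.
BQ_TO_FEATURE_TYPE = {
    "STRING": "STRING",
    "INT64": "INT64",
    "FLOAT64": "DOUBLE",
}

def feature_value_type(bq_type):
    """Return the value type name for a BigQuery type, failing loudly on unmapped types."""
    try:
        return BQ_TO_FEATURE_TYPE[bq_type]
    except KeyError:
        raise ValueError(f"No feature value type mapped for BigQuery type {bq_type!r}") from None

# Illustrative (column_name, data_type) pairs; feature ids are the lowercased column names.
columns = [("Time", "INT64"), ("V1", "FLOAT64"), ("Amount", "FLOAT64")]
specs = {name.lower(): feature_value_type(bq_type) for name, bq_type in columns}
print(specs)  # {'time': 'INT64', 'v1': 'DOUBLE', 'amount': 'DOUBLE'}
```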

In [60]:
REQUESTS = []
for i in range(schema.shape[0]):
    
    if schema['column_name'][i] in [VAR_TARGET, 'splits'] + VAR_OMIT.split():
        continue
    
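    if schema['data_type'][i] == 'STRING': value_type = types.feature.Feature.ValueType.STRING
    elif schema['data_type'][i] == 'INT64': value_type = types.feature.Feature.ValueType.INT64
    elif schema['data_type'][i] == 'FLOAT64': value_type = types.feature.Feature.ValueType.DOUBLE
    # fail on unmapped types instead of silently reusing the previous column's value_type
    else: raise ValueError(f"Unmapped BigQuery type {schema['data_type'][i]} for column {schema['column_name'][i]}")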
    
    description = f"Column named {schema['column_name'][i]} from BQ Table {PROJECT_ID}.{DATANAME}.{DATANAME}_prepped"
    
    REQUESTS.append(
        types.featurestore_service.CreateFeatureRequest(
            feature=types.feature.Feature(
                value_type = value_type,
                description = description,
                # optional, monitoring_config here as override, otherwise it inherits from entity_type
            ),
            feature_id = schema['column_name'][i].lower(),
        )    
    )

In [61]:
batchfeatures = clients['fs'].batch_create_features(
    parent = clients['fs'].entity_type_path(PROJECT_ID, REGION, FEATURESTORE_ID, ENTITYTYPE_ID),
    requests = REQUESTS,
)

In [62]:
#list(item.name for item in batchfeatures.result().features)

---
## Search Features
Search runs across all feature stores and entity types in the location.

To list every feature of a single entity type, use the `list_features` method instead.
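
The `query` argument used in the cells of this section is a plain string. As a rough sketch, the helper below assembles the two filter forms that appear in this notebook (`feature_id:` with a wildcard pattern and `value_type=`); `build_search_query` is a hypothetical name, and other query fields are not covered here.

```python
def build_search_query(feature_id_pattern=None, value_type=None):
    """Join the given filters with AND, using the two query forms shown in this notebook."""
    parts = []
    if feature_id_pattern:
        parts.append(f"feature_id:{feature_id_pattern}")
    if value_type:
        parts.append(f"value_type={value_type}")
    return " AND ".join(parts)

print(build_search_query(value_type="INT64"))
print(build_search_query(feature_id_pattern="V*9", value_type="DOUBLE"))
```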

In [63]:
# return the first feature:
list(clients['fs'].search_features(location = PARENT))[0]

name: "projects/691911073727/locations/us-central1/featurestores/fraud/entityTypes/transaction/features/amount"
description: "Column named Amount from BQ Table statmike-mlops.fraud.fraud_prepped"
create_time {
  seconds: 1632489732
  nanos: 380530000
}
update_time {
  seconds: 1632489732
  nanos: 380530000
}

In [64]:
# find all features with INT64 value type
list(clients['fs'].search_features(types.featurestore_service.SearchFeaturesRequest(location = PARENT, query = "value_type=INT64")))

[name: "projects/691911073727/locations/us-central1/featurestores/fraud/entityTypes/transaction/features/time"
 description: "Column named Time from BQ Table statmike-mlops.fraud.fraud_prepped"
 create_time {
   seconds: 1632489732
   nanos: 320983000
 }
 update_time {
   seconds: 1632489732
   nanos: 320983000
 }]

In [65]:
# find all features of the form V*9 with DOUBLE value type
list(clients['fs'].search_features(types.featurestore_service.SearchFeaturesRequest(location = PARENT, query = "feature_id:V*9 AND value_type=DOUBLE")))

[name: "projects/691911073727/locations/us-central1/featurestores/fraud/entityTypes/transaction/features/v19"
 description: "Column named V19 from BQ Table statmike-mlops.fraud.fraud_prepped"
 create_time {
   seconds: 1632489732
   nanos: 357729000
 }
 update_time {
   seconds: 1632489732
   nanos: 357729000
 },
 name: "projects/691911073727/locations/us-central1/featurestores/fraud/entityTypes/transaction/features/v9"
 description: "Column named V9 from BQ Table statmike-mlops.fraud.fraud_prepped"
 create_time {
   seconds: 1632489732
   nanos: 337463000
 }
 update_time {
   seconds: 1632489732
   nanos: 337463000
 }]

---
## Import Feature Values
- BigQuery (THIS DEMO)
- Avro
- CSV

Prepare a source table with a timestamp column (`update_time`) and a unique ID for each entity:

In [66]:
query = f"""
CREATE OR REPLACE TABLE {PROJECT_ID}.{DATANAME}.{DATANAME}_featurestore_import AS
WITH A AS 
    (SELECT *, CAST(FLOOR(10*RAND()) AS INT64) day_offset
    FROM {PROJECT_ID}.{DATANAME}.{DATANAME}_prepped)
SELECT * EXCEPT(day_offset),
        DATE_SUB(CURRENT_TIMESTAMP, INTERVAL day_offset DAY) AS update_time
FROM A
"""
bqjob = clients['bq'].query(query = query)

In [67]:
bqjob.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f2301599a10>

Create Feature specification for each feature in the input source:

In [68]:
FEATURE_SPECS = []
for i in range(schema.shape[0]):
    if schema['column_name'][i] in [VAR_TARGET, 'splits'] + VAR_OMIT.split():
        continue
    
    FEATURE_SPECS.append(
        types.featurestore_service.ImportFeatureValuesRequest.FeatureSpec(
            id = schema['column_name'][i].lower(),
            source_field = schema['column_name'][i]
        )
    )

In [69]:
import_request = types.featurestore_service.ImportFeatureValuesRequest(
    entity_type = clients['fs'].entity_type_path(PROJECT_ID, REGION, FEATURESTORE_ID, ENTITYTYPE_ID),
    bigquery_source = types.BigQuerySource(input_uri = f'bq://{PROJECT_ID}.{DATANAME}.{DATANAME}_featurestore_import'),
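    feature_time_field = "update_time",
    # feature_time (one timestamp for all rows) is the alternative to feature_time_field and is not set here;
    # note that Timestamp().GetCurrentTime() updates the object in place and returns None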
    entity_id_field = "transaction_id",
    feature_specs = FEATURE_SPECS,
    worker_count = 4,
)

In [70]:
importjob = clients['fs'].import_feature_values(import_request)

In [71]:
importjob.result()

imported_entity_count: 284807
imported_feature_value_count: 8544210

---
## Prediction with Feature Store for Serving Features

### Entity IDs
Retrieve a list of entity IDs from the source BigQuery table.  These are in the column `transaction_id`.

In [75]:
unique_id = clients['bq'].query(query = f"SELECT * FROM {DATANAME}.{DATANAME}_prepped WHERE splits='TEST' LIMIT 10").to_dataframe()

In [76]:
unique_id['transaction_id'][0]

'648731a9-54ff-4641-8d93-eda1fb480720'

### Data For Prediction: Single Entity Served by Vertex AI > Features (Feature Store)

In [77]:
feature_values = clients['fs_olserve'].read_feature_values(
    types.featurestore_online_service.ReadFeatureValuesRequest(
        entity_type = clients['fs'].entity_type_path(PROJECT_ID, REGION, FEATURESTORE_ID, ENTITYTYPE_ID),
        entity_id = unique_id['transaction_id'][0],
        feature_selector = types.FeatureSelector(id_matcher=types.IdMatcher(ids=['*'])),
    )
)

In [78]:
print(list(item.id for item in feature_values.header.feature_descriptors))

['v2', 'v28', 'v1', 'v13', 'v24', 'v15', 'v5', 'v10', 'v11', 'v23', 'amount', 'v18', 'v6', 'v17', 'v26', 'v7', 'v19', 'v9', 'v8', 'v16', 'v22', 'v12', 'v14', 'v3', 'time', 'v21', 'v27', 'v25', 'v20', 'v4']


In [79]:
print(list(item.value.double_value for item in feature_values.entity_view.data))

[-0.0107482915944279, 0.0104688128953261, 0.896724595112629, -0.263265135895566, -1.02092220324013, 0.10181862912285801, 0.0600891602875172, 0.475407568501481, 1.67558031412855, 0.162478000106847, 0.0, -1.63084524092967, 2.5108239146546296, 0.5451652678301121, 0.0681724437689762, -1.00659622377877, -2.10148754998933, -0.14496178003346, 0.920047141986182, -0.4413478524079651, 0.5017110088071359, 1.15188029738727, -0.0692698337502762, 1.4230359856442298, 0.0, 0.0829014821340028, 0.10960414350812302, -0.0514650738254434, -0.29859717953230297, 2.37905669888513]


### Prepare a record for prediction: instance and parameters lists

In [89]:
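newob = {}
features = list(item.id for item in feature_values.header.feature_descriptors)
for e, f in enumerate(features):
    value = feature_values.entity_view.data[e].value
    # `time` is the only INT64 feature, so its value is in int64_value; double_value reads as 0.0 for it
    newob[f.capitalize()] = value.int64_value if f == 'time' else value.double_value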

In [90]:
instances = [json_format.ParseDict(newob, Value())]
parameters = json_format.ParseDict({}, Value())

### Pick An Endpoint
A list index of [0] here retrieves the first endpoint in this project:

In [91]:
aiplatform.Endpoint.list()[0].display_name

'04a_fraud_20210924125209'

In [92]:
endpoint = aiplatform.Endpoint(endpoint_name = aiplatform.Endpoint.list()[0].name)

### Get Predictions: Python Client

In [93]:
prediction = endpoint.predict(instances = instances, parameters = parameters)

In [94]:
prediction

Prediction(predictions=[[0.998906374, 0.00109357678]], deployed_model_id='5211025408681574400', explanations=None)

In [95]:
prediction.predictions[0]

[0.998906374, 0.00109357678]

In [96]:
np.argmax(prediction.predictions[0])

0
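
The prediction holds one score per class, and `np.argmax` returns the position of the largest score. The same step in plain Python, as a small sketch using the two scores from the output above. Treating positions 0 and 1 as the two values of the `Class` target is an assumption, since the notebook does not state the label order.

```python
# Scores copied from the prediction output above.
probabilities = [0.998906374, 0.00109357678]

# Position of the largest score: the plain-Python equivalent of np.argmax for a flat list.
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_index)  # 0
```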

### Get Predictions: REST

In [97]:
with open(f'{DIR}/request.json','w') as file:
    file.write(json.dumps({"instances": [newob]}))

In [98]:
!curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @{DIR}/request.json \
https://{REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:predict

{
  "predictions": [
    [
      0.998906374,
      0.00109357678
    ]
  ],
  "deployedModelId": "5211025408681574400"
}
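
The REST call needs only two pieces: a URL built from the region and the endpoint's resource name, and a JSON body with an `instances` list. A minimal sketch of assembling both without sending anything; `build_predict_request` is a hypothetical helper, and the endpoint name and instance values below are made up for illustration.

```python
import json

def build_predict_request(region, endpoint_resource_name, instance):
    """Return the (url, body) pair for the REST predict call shown above."""
    url = f"https://{region}-aiplatform.googleapis.com/v1/{endpoint_resource_name}:predict"
    body = json.dumps({"instances": [instance]})
    return url, body

# Hypothetical endpoint resource name and a shortened, made-up instance.
url, body = build_predict_request(
    "us-central1",
    "projects/my-project/locations/us-central1/endpoints/1234567890",
    {"V1": 0.89, "Amount": 0.0},
)
print(url)
print(body)
```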


### Get Predictions: gcloud (CLI)

In [99]:
!gcloud beta ai endpoints predict {endpoint.name.rsplit('/',1)[-1]} --region={REGION} --json-request={DIR}/request.json

Using endpoint [https://us-central1-prediction-aiplatform.googleapis.com/]
[[0.998906374, 0.00109357678]]


### Data For Prediction: Multiple Entities Served by Vertex AI > Features (Feature Store)

In [100]:
unique_id['transaction_id']

0    648731a9-54ff-4641-8d93-eda1fb480720
1    e651b24c-a678-40c0-96b4-d151bf008432
2    2c30d9b6-9761-4184-973b-3a2ec07fb16a
3    26dcbbe1-2fa0-44f2-a5ae-a3a95dcea45e
4    1e2b5ef1-9141-410d-82a7-67cf3bc6dc26
5    6f6315fa-298a-45b0-8eb0-39d8c90d60c7
6    5c688cee-4490-4ea3-8e18-8a6d09eb037d
7    8d4e4170-1d76-4658-ae51-80cc8f287e66
8    e3a9c8e0-7e46-4c55-8581-e984d44cda27
9    58ff2375-e98f-40da-9f4e-d230dfc73009
Name: transaction_id, dtype: object

In [101]:
multi_feature_values = clients['fs_olserve'].streaming_read_feature_values(
    types.featurestore_online_service.StreamingReadFeatureValuesRequest(
        entity_type = clients['fs'].entity_type_path(PROJECT_ID, REGION, FEATURESTORE_ID, ENTITYTYPE_ID),
        entity_ids = unique_id['transaction_id'],
        feature_selector = types.FeatureSelector(id_matcher=types.IdMatcher(ids=['*'])),
    )
)

In [102]:
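for i in multi_feature_values:
    # the first response in the stream carries only the header, so its entity_id and data print as empty
    print(i.entity_view.entity_id)
    print(list(item.value.double_value for item in i.entity_view.data))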


[]
1e2b5ef1-9141-410d-82a7-67cf3bc6dc26
[0.0178310570385444, 0.15673410757831502, 0.139533476511716, 0.580525099921806, 0.0476565878703416, 0.436983825476803, 0.216634632508865, -1.48300166754983, -0.34190963302503896, -0.0761383782342794, -0.22374463132037195, 0.0, 2.5222071878835095, 0.671519420789799, -0.0925779950306399, 0.6954121042826129, -0.8121280800316691, 0.641141993499356, -0.179985039634078, 0.0193547006102132, -0.3457789151643039, 0.0524879466068395, -0.00148555498065213, 0.44277449238592104, 1.14147949549921, 0.38386608549225604, -0.0295474308731907, -0.7843151061108721, 0.0, -0.0894955489964147]
26dcbbe1-2fa0-44f2-a5ae-a3a95dcea45e
[0.144413644775616, 0.206504349021617, 0.279253081937816, 0.9745231602166771, -0.268562183744188, -0.6297075515639511, 0.703987073734131, -1.64988769762954, 0.309237229164127, 0.24454278087435502, -0.17885861261774105, 0.0, 1.62437481328574, -0.22974152028534603, -0.24799022077719898, -1.1279856031668898, -0.283090583580245, 2.84800224883177,

### Data For Training: Batch (For training or large scale prediction)

In [103]:
# get current timestamp (a protobuf Timestamp holds seconds since the epoch, 1970-01-01)
timestamp = Timestamp()
timestamp.GetCurrentTime()

# adjust timestamp to 2 days ago: 60*60*24*2 seconds
newtimestamp = Timestamp(seconds = timestamp.seconds - 60*60*24*2, nanos = timestamp.nanos)

batch_request = types.featurestore_service.ExportFeatureValuesRequest(
    entity_type = clients['fs'].entity_type_path(PROJECT_ID, REGION, FEATURESTORE_ID, ENTITYTYPE_ID),
    snapshot_export = types.ExportFeatureValuesRequest.SnapshotExport(snapshot_time = Timestamp(seconds=newtimestamp.seconds)),
    destination = types.FeatureValueDestination(bigquery_destination = types.BigQueryDestination(output_uri = f'bq://{PROJECT_ID}.{DATANAME}.{DATANAME}_fs_training')),
    feature_selector = types.FeatureSelector(id_matcher=types.IdMatcher(ids = ['*']))
)

In [104]:
batchjob = clients['fs'].export_feature_values(batch_request)

In [105]:
batchjob.result()



By adjusting the `snapshot_time` to 2 days ago, the batch request creates a BigQuery table that has all the original rows, one per entity, but the features are null for about 20% of the rows.  This is because the features were loaded with `feature_time_field = "update_time"`, and `update_time` was set to a random whole number of days (0 to 9) before the time the import table was created.  Rows whose `update_time` falls within the last 2 days had no feature values yet at the snapshot time.
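
The expected share of null rows can be checked with a little arithmetic. This sketch assumes the day offsets produced by `FLOOR(10*RAND())` are uniform over 0 to 9, and compares the expected share with the counts reported by the query that follows.

```python
from fractions import Fraction

# FLOOR(10*RAND()) in the import query yields a day offset of 0..9, each equally likely.
day_offsets = range(10)
snapshot_days_ago = 2

# A row has no value at the snapshot if its update_time is later than the snapshot,
# that is, if its offset is fewer than 2 days.
null_offsets = [d for d in day_offsets if d < snapshot_days_ago]
expected_null_share = Fraction(len(null_offsets), len(day_offsets))
print(expected_null_share)  # 1/5

# Observed counts reported by the query at the end of this section.
observed_null_share = 57143 / (57143 + 227664)
print(round(observed_null_share, 4))  # 0.2006
```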

In [106]:
query = f"""
SELECT CASE WHEN {list(newob.keys())[0]} is not null then False ELSE True END as Null_Rows, count(*) as counts
FROM {PROJECT_ID}.{DATANAME}.{DATANAME}_fs_training
GROUP BY Null_Rows
"""
clients['bq'].query(query = query).to_dataframe()

| | Null_Rows | counts |
|---|---|---|
| 0 | True | 57143 |
| 1 | False | 227664 |


---
## Remove Resources
see notebook "XX - Cleanup"