# Module 1: Introduction to SageMaker Feature Store

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Load and explore datasets](#Load-and-explore-datasets)
1. [Create feature definitions and groups](#Create-feature-definitions-and-groups)
1. [Ingest data into feature groups](#Ingest-data-into-feature-groups)
1. [Get feature record from the Online feature store](#Get-feature-record-from-the-Online-feature-store)
1. [List feature groups](#List-feature-groups)

# Background

In this notebook, you will learn how to create **3** feature groups for the `customers`, `products` and `orders` datasets in the SageMaker Feature Store. You will then learn how to ingest the feature columns into the created feature groups (both the Online and the Offline store) using the SageMaker Python SDK. You will also see how to get an ingested feature record from the Online store. Finally, you will learn how to list all the feature groups created within the Feature Store and delete them.

**Note:** The feature groups created in this notebook will be used in the upcoming modules.


# Setup

#### Imports

In [2]:
from sagemaker.feature_store.feature_group import FeatureGroup
from time import gmtime, strftime, sleep
from random import randint
import pandas as pd
import numpy as np
import subprocess
import sagemaker
import importlib
import logging
import time
import sys
from sagemaker.feature_store.inputs import TableFormatEnum

In [3]:
sm_version = sagemaker.__version__
# Compare (major, minor) as a tuple so that a newer major version is not treated as too old
major, minor = (int(part) for part in sm_version.split('.')[:2])
if (major, minor) < (2, 125):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.125.0'])
    importlib.reload(sagemaker)

In [4]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [5]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

Using SageMaker version: 2.145.0
Using Pandas version: 1.3.5


#### Essentials

In [6]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'

Default S3 bucket = sagemaker-us-east-1-161615149547


In [7]:
region = sagemaker_session.boto_region_name

# Load and explore datasets

In [8]:
customers_df = pd.read_csv('./data/transformed/customers.csv')
customers_df.head(5)

Unnamed: 0,customer_id,sex,is_married,event_time,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active
0,C1,0,0,2023-05-22T14:15:48.008Z,0,0,0,1,0,0,0.026027
1,C2,1,0,2023-05-22T14:15:48.010Z,1,0,0,0,0,0,0.077397
2,C3,0,1,2023-05-22T14:15:48.013Z,0,0,0,0,1,0,0.821233
3,C4,1,1,2023-05-22T14:15:48.015Z,0,0,0,1,0,0,0.887671
4,C5,0,1,2023-05-22T14:15:48.017Z,0,1,0,0,0,0,0.265753


In [9]:
customers_df.dtypes

customer_id       object
sex                int64
is_married         int64
event_time        object
age_18-29          int64
age_30-39          int64
age_40-49          int64
age_50-59          int64
age_60-69          int64
age_70-plus        int64
n_days_active    float64
dtype: object

In [10]:
customers_df['customer_id'] = customers_df['customer_id'].astype('string')
customers_df['event_time'] = customers_df['event_time'].astype('string')

In [11]:
customers_df.dtypes

customer_id       string
sex                int64
is_married         int64
event_time        string
age_18-29          int64
age_30-39          int64
age_40-49          int64
age_50-59          int64
age_60-69          int64
age_70-plus        int64
n_days_active    float64
dtype: object

In [12]:
products_df = pd.read_csv('./data/transformed/products.csv')
products_df.head(5)

Unnamed: 0,product_id,event_time,category_baby_food_formula,category_baking_ingredients,category_candy_chocolate,category_chips_pretzels,category_cleaning_products,category_coffee,category_cookies_cakes,category_crackers,...,category_hair_care,category_ice_cream_ice,category_juice_nectars,category_packaged_cheese,category_refrigerated,category_soup_broth_bouillon,category_spices_seasonings,category_tea,category_vitamins_supplements,category_yogurt
0,P1,2023-05-22T14:22:58.775Z,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,P2,2023-05-22T14:22:58.775Z,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,P3,2023-05-22T14:22:58.775Z,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,P4,2023-05-22T14:22:58.775Z,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,P5,2023-05-22T14:22:58.775Z,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [13]:
products_df['product_id'] = products_df['product_id'].astype('string')
products_df['event_time'] = products_df['event_time'].astype('string')

In [14]:
products_df.dtypes

product_id                       string
event_time                       string
category_baby_food_formula        int64
category_baking_ingredients       int64
category_candy_chocolate          int64
category_chips_pretzels           int64
category_cleaning_products        int64
category_coffee                   int64
category_cookies_cakes            int64
category_crackers                 int64
category_energy_granola_bars      int64
category_frozen_meals             int64
category_hair_care                int64
category_ice_cream_ice            int64
category_juice_nectars            int64
category_packaged_cheese          int64
category_refrigerated             int64
category_soup_broth_bouillon      int64
category_spices_seasonings        int64
category_tea                      int64
category_vitamins_supplements     int64
category_yogurt                   int64
dtype: object

In [15]:
orders_df = pd.read_csv('./data/transformed/orders.csv')
orders_df

Unnamed: 0,order_id,customer_id,product_id,purchase_amount,is_reordered,event_time,n_days_since_last_purchase
0,O1,C8,P8577,0.565446,1,2023-05-22T14:23:24.376Z,0.527132
1,O2,C8510,P12984,0.566832,0,2023-05-22T14:23:24.376Z,0.253876
2,O3,C6316,P13498,0.569802,1,2023-05-22T14:23:24.376Z,0.852713
3,O4,C8392,P11612,0.924653,1,2023-05-22T14:23:24.376Z,0.360465
4,O5,C1816,P6096,0.121188,1,2023-05-22T14:23:24.376Z,0.846899
...,...,...,...,...,...,...,...
99995,O99996,C3642,P6210,0.129109,1,2023-05-22T14:23:32.236Z,0.794574
99996,O99997,C6145,P5740,0.825050,1,2023-05-22T14:23:32.236Z,0.521318
99997,O99998,C7567,P14942,0.602772,1,2023-05-22T14:23:32.236Z,0.453488
99998,O99999,C3417,P1772,0.987921,1,2023-05-22T14:23:32.236Z,0.742248


In [16]:
orders_df['order_id'] = orders_df['order_id'].astype('string')
orders_df['customer_id'] = orders_df['customer_id'].astype('string')
orders_df['product_id'] = orders_df['product_id'].astype('string')
orders_df['event_time'] = orders_df['event_time'].astype('string')

In [17]:
orders_df.dtypes

order_id                       string
customer_id                    string
product_id                     string
purchase_amount               float64
is_reordered                    int64
event_time                     string
n_days_since_last_purchase    float64
dtype: object

In [18]:
customers_count = customers_df.shape[0]
%store customers_count
products_count = products_df.shape[0]
%store products_count
orders_count = orders_df.shape[0]
%store orders_count

Stored 'customers_count' (int)
Stored 'products_count' (int)
Stored 'orders_count' (int)


# Create feature definitions and groups

In [19]:
current_timestamp = strftime('%m-%d-%H-%M', gmtime())

In [20]:
# prefix to track all the feature groups created as part of feature store champions workshop (fscw)
fs_prefix = 'fscw-' 

In [21]:
customers_feature_group_name = f'{fs_prefix}customers-{current_timestamp}'
%store customers_feature_group_name
products_feature_group_name = f'{fs_prefix}products-{current_timestamp}'
%store products_feature_group_name
orders_feature_group_name = f'{fs_prefix}orders-{current_timestamp}'
%store orders_feature_group_name

Stored 'customers_feature_group_name' (str)
Stored 'products_feature_group_name' (str)
Stored 'orders_feature_group_name' (str)


In [22]:
logger.info(f'Customers feature group name = {customers_feature_group_name}')
logger.info(f'Products feature group name = {products_feature_group_name}')
logger.info(f'Orders feature group name = {orders_feature_group_name}')

Customers feature group name = fscw-customers-05-22-21-52
Products feature group name = fscw-products-05-22-21-52
Orders feature group name = fscw-orders-05-22-21-52


In [23]:
customers_feature_group = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)
products_feature_group = FeatureGroup(name=products_feature_group_name, sagemaker_session=sagemaker_session)
orders_feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)

In [24]:
customers_feature_group.load_feature_definitions(data_frame=customers_df)

[FeatureDefinition(feature_name='customer_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='sex', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='is_married', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='age_18-29', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_30-39', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_40-49', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_50-59', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_60-69', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_70-plus', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition

In [25]:
products_feature_group.load_feature_definitions(data_frame=products_df)

[FeatureDefinition(feature_name='product_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='category_baby_food_formula', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_baking_ingredients', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_candy_chocolate', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_chips_pretzels', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_cleaning_products', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_coffee', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_cookies_cakes', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinit

In [26]:
orders_feature_group.load_feature_definitions(data_frame=orders_df)

[FeatureDefinition(feature_name='order_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='customer_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='product_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='purchase_amount', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='is_reordered', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='n_days_since_last_purchase', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>)]
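
The outputs above suggest how `load_feature_definitions` maps pandas dtypes to Feature Store types: `string` columns become `String`, `int64` becomes `Integral`, and `float64` becomes `Fractional`. The sketch below mirrors that mapping with a plain dictionary. `DTYPE_TO_FEATURE_TYPE` and `infer_feature_type` are hypothetical names used for illustration, not SDK functions, and the behaviour for unmapped dtypes such as `object` is an assumption; it is the likely reason the ID and timestamp columns were cast to `string` earlier.

```python
# Illustrative mapping only, taken from the dtypes and feature types shown above.
DTYPE_TO_FEATURE_TYPE = {
    "string": "String",
    "int64": "Integral",
    "float64": "Fractional",
}


def infer_feature_type(dtype_name: str) -> str:
    """Return the Feature Store type name for a pandas dtype name."""
    try:
        return DTYPE_TO_FEATURE_TYPE[dtype_name]
    except KeyError:
        # Assumption: dtypes outside the mapping (e.g. 'object') need an explicit cast first.
        raise ValueError(f"Cannot infer a feature type for dtype {dtype_name!r}") from None


# A few columns of the orders dataset, with the dtypes reported earlier.
orders_dtypes = {
    "order_id": "string",
    "purchase_amount": "float64",
    "is_reordered": "int64",
}
for column, dtype_name in orders_dtypes.items():
    print(column, "->", infer_feature_type(dtype_name))
```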

Let's create the feature groups now.

Amazon SageMaker Feature Store supports the AWS Glue and Apache Iceberg table formats for the offline store. You can choose the table format when you’re creating a new feature group. 

In this notebook, we will be using the Iceberg table format. Using Apache Iceberg for storing features accelerates model development by enabling faster query performance when extracting ML training datasets, taking advantage of Iceberg table compaction. Depending on the design of your feature groups and their scale, you can experience training query performance improvements of 10x to 100x by using this new capability.

If you need to use the Glue table format, please update the variable below to `'GLUE'`. For more information on offline store formats, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-offline.html).

In [27]:
table_format_param = 'ICEBERG' # or 'GLUE'

In [28]:
if table_format_param == 'ICEBERG':
    table_format = TableFormatEnum.ICEBERG
else:
    table_format = TableFormatEnum.GLUE
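
The `if`/`else` above selects Glue for any value other than `'ICEBERG'`, so a misspelled value would silently produce a Glue table. Here is a minimal sketch of a stricter check, written with plain strings so that it runs without the SDK. `normalize_table_format` and `SUPPORTED_TABLE_FORMATS` are hypothetical names, and wiring the result into `TableFormatEnum` by member name is an assumption about how you might use it.

```python
SUPPORTED_TABLE_FORMATS = ("ICEBERG", "GLUE")


def normalize_table_format(value: str) -> str:
    """Upper-case the requested table format and reject unknown values."""
    normalized = value.strip().upper()
    if normalized not in SUPPORTED_TABLE_FORMATS:
        raise ValueError(
            f"Unsupported table format {value!r}; expected one of {SUPPORTED_TABLE_FORMATS}"
        )
    return normalized


print(normalize_table_format("Glue"))
print(normalize_table_format("iceberg"))
# In the notebook, the result could then be looked up by member name, for example
# TableFormatEnum[normalize_table_format(table_format_param)].
```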

In [29]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        raise SystemExit(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')
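
The helper above polls until the status leaves `Creating`, with no upper bound on how long it waits. The sketch below shows the same loop with a timeout. It accepts any zero-argument callable that returns the current status, so it can be exercised without AWS. `wait_for_status`, `timeout_s` and `poll_s` are hypothetical names, not part of the SageMaker SDK.

```python
import time


def wait_for_status(get_status, target="Created", pending="Creating",
                    timeout_s=600.0, poll_s=5.0):
    """Poll get_status() until it returns `target`; fail on timeout or any other status."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending!r} after {timeout_s} seconds")
        time.sleep(poll_s)
        status = get_status()
    if status != target:
        raise RuntimeError(f"Unexpected status: {status!r}")
    return status


# Stand-in for feature_group.describe().get('FeatureGroupStatus'):
statuses = iter(["Creating", "Creating", "Created"])
print(wait_for_status(lambda: next(statuses), poll_s=0.01))
```

In the notebook, `get_status` could be `lambda: feature_group.describe().get('FeatureGroupStatus')`.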

In [30]:
customers_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                               record_identifier_name='customer_id', 
                               event_time_feature_name='event_time', 
                               role_arn=role, 
                               enable_online_store=True,
                               table_format=table_format 
                              )

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-customers-05-22-21-52',
 'ResponseMetadata': {'RequestId': 'b20853de-d626-42e6-9127-dccfb6d510e0',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b20853de-d626-42e6-9127-dccfb6d510e0',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '103',
   'date': 'Mon, 22 May 2023 21:52:50 GMT'},
  'RetryAttempts': 0}}

In [31]:
wait_for_feature_group_creation_complete(customers_feature_group)

Waiting for feature group: fscw-customers-05-22-21-52 to be created ...


Initial status: Creating


Waiting for feature group: fscw-customers-05-22-21-52 to be created ...
Waiting for feature group: fscw-customers-05-22-21-52 to be created ...
Waiting for feature group: fscw-customers-05-22-21-52 to be created ...
FeatureGroup fscw-customers-05-22-21-52 was successfully created.


In [32]:
products_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                              record_identifier_name='product_id', 
                              event_time_feature_name='event_time', 
                              role_arn=role, 
                              enable_online_store=True,
                              table_format=table_format
                             )

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-products-05-22-21-52',
 'ResponseMetadata': {'RequestId': '3634d749-ba5e-49a5-8af8-9e34a0e80ac1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3634d749-ba5e-49a5-8af8-9e34a0e80ac1',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '102',
   'date': 'Mon, 22 May 2023 21:53:12 GMT'},
  'RetryAttempts': 0}}

In [33]:
wait_for_feature_group_creation_complete(products_feature_group)

Waiting for feature group: fscw-products-05-22-21-52 to be created ...


Initial status: Creating


Waiting for feature group: fscw-products-05-22-21-52 to be created ...
Waiting for feature group: fscw-products-05-22-21-52 to be created ...
Waiting for feature group: fscw-products-05-22-21-52 to be created ...
Waiting for feature group: fscw-products-05-22-21-52 to be created ...
FeatureGroup fscw-products-05-22-21-52 was successfully created.


In [34]:
orders_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                            record_identifier_name='order_id', 
                            event_time_feature_name='event_time', 
                            role_arn=role, 
                            enable_online_store=True,
                            table_format=table_format
                           )

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-orders-05-22-21-52',
 'ResponseMetadata': {'RequestId': 'f30bf1d2-fa56-44cd-8414-5c8ba8e0c101',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f30bf1d2-fa56-44cd-8414-5c8ba8e0c101',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '100',
   'date': 'Mon, 22 May 2023 21:53:40 GMT'},
  'RetryAttempts': 0}}

In [35]:
wait_for_feature_group_creation_complete(orders_feature_group)

Waiting for feature group: fscw-orders-05-22-21-52 to be created ...


Initial status: Creating


Waiting for feature group: fscw-orders-05-22-21-52 to be created ...
Waiting for feature group: fscw-orders-05-22-21-52 to be created ...
Waiting for feature group: fscw-orders-05-22-21-52 to be created ...
FeatureGroup fscw-orders-05-22-21-52 was successfully created.


# Ingest data into feature groups 

In [36]:
%%time

logger.info(f'Ingesting data into feature group: {customers_feature_group.name} ...')
customers_feature_group.ingest(data_frame=customers_df, max_processes=16, wait=True)
logger.info(f'{len(customers_df)} customer records ingested into feature group: {customers_feature_group.name}')

Ingesting data into feature group: fscw-customers-05-22-21-52 ...
10000 customer records ingested into feature group: fscw-customers-05-22-21-52


CPU times: user 175 ms, sys: 182 ms, total: 356 ms
Wall time: 17.4 s


In [37]:
%%time

logger.info(f'Ingesting data into feature group: {products_feature_group.name} ...')
products_feature_group.ingest(data_frame=products_df, max_processes=16, wait=True)
logger.info(f'{len(products_df)} product records ingested into feature group: {products_feature_group.name}')  

Ingesting data into feature group: fscw-products-05-22-21-52 ...
17001 product records ingested into feature group: fscw-products-05-22-21-52


CPU times: user 195 ms, sys: 182 ms, total: 377 ms
Wall time: 28.7 s


In [38]:
%%time

logger.info(f'Ingesting data into feature group: {orders_feature_group.name} ...')
orders_feature_group.ingest(data_frame=orders_df, max_processes=16, wait=True)
logger.info(f'{len(orders_df)} order records ingested into feature group: {orders_feature_group.name}')

Ingesting data into feature group: fscw-orders-05-22-21-52 ...
100000 order records ingested into feature group: fscw-orders-05-22-21-52


CPU times: user 1.58 s, sys: 231 ms, total: 1.81 s
Wall time: 2min 41s


# Get feature record from the Online feature store 

In [39]:
featurestore_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)

Retrieve a record from customers feature group

In [40]:
customer_id = f'C{randint(1, customers_count)}'
logger.info(f'customer_id={customer_id}') 

customer_id=C5943


In [41]:
feature_record = featurestore_runtime_client.get_record(FeatureGroupName=customers_feature_group_name, 
                                                        RecordIdentifierValueAsString=customer_id)
feature_record

{'ResponseMetadata': {'RequestId': 'dc4825fa-e224-465d-8d8d-87df9e3d1abb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dc4825fa-e224-465d-8d8d-87df9e3d1abb',
   'content-type': 'application/json',
   'content-length': '605',
   'date': 'Mon, 22 May 2023 21:58:40 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'customer_id', 'ValueAsString': 'C5943'},
  {'FeatureName': 'sex', 'ValueAsString': '1'},
  {'FeatureName': 'is_married', 'ValueAsString': '1'},
  {'FeatureName': 'event_time', 'ValueAsString': '2023-05-22T14:15:59.457Z'},
  {'FeatureName': 'age_18-29', 'ValueAsString': '0'},
  {'FeatureName': 'age_30-39', 'ValueAsString': '0'},
  {'FeatureName': 'age_40-49', 'ValueAsString': '0'},
  {'FeatureName': 'age_50-59', 'ValueAsString': '1'},
  {'FeatureName': 'age_60-69', 'ValueAsString': '0'},
  {'FeatureName': 'age_70-plus', 'ValueAsString': '0'},
  {'FeatureName': 'n_days_active', 'ValueAsString': '0.5541095890410959'}]}
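
`get_record` returns the features as a list of `FeatureName`/`ValueAsString` pairs, with every value encoded as a string. Here is a minimal sketch of flattening that list into a dictionary and casting values back to numbers. `record_to_dict` is a hypothetical helper, and the sample response is abbreviated from the output above.

```python
def record_to_dict(response):
    """Flatten a get_record response into {feature_name: value_as_string}."""
    # Assumption: the 'Record' key may be absent when no record matches, so default to [].
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }


# Abbreviated copy of the response shown above.
sample_response = {
    "Record": [
        {"FeatureName": "customer_id", "ValueAsString": "C5943"},
        {"FeatureName": "is_married", "ValueAsString": "1"},
        {"FeatureName": "n_days_active", "ValueAsString": "0.5541095890410959"},
    ]
}

features = record_to_dict(sample_response)
# Values arrive as strings, so cast them to the types the model expects.
is_married = int(features["is_married"])
n_days_active = float(features["n_days_active"])
print(features["customer_id"], is_married, n_days_active)
```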

# List feature groups 
Since we created all of our feature groups with a common name pattern, we can list just the ones whose names contain the current month and day (e.g., 05-22).
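
The name pattern makes this filter a simple substring match: `current_timestamp` is formatted as `%m-%d-%H-%M`, so its first five characters are the month and day. Below is a small sketch with a fixed timestamp so the result is repeatable. The list of names is for illustration (the third entry is hypothetical), and the actual `list_feature_groups` helper may filter differently.

```python
from time import gmtime, strftime

# Fixed epoch seconds instead of "now": 1684792320 is 2023-05-22 21:52:00 UTC.
current_timestamp = strftime("%m-%d-%H-%M", gmtime(1684792320))
month_day = current_timestamp[0:5]

names = [
    "fscw-customers-05-22-21-52",
    "fscw-orders-05-22-14-42",
    "fscw-orders-05-21-09-10",  # hypothetical group from a different day
]
# Keep only the names that contain the month-day fragment.
matching = [name for name in names if month_day in name]
print(month_day, matching)
```

A plain substring match can also hit the day-hour part of a name (for example `12-05-22-10` contains `05-22`), so a stricter filter may be worth considering if groups accumulate over many months.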

In [42]:
import sys
sys.path.append('..')
from utilities.feature_store_helper import FeatureStore
fs = FeatureStore()

In [43]:
fs.list_feature_groups(current_timestamp[0:5])

[{'FeatureGroupName': 'fscw-products-05-22-21-52',
  'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-products-05-22-21-52',
  'CreationTime': datetime.datetime(2023, 5, 22, 21, 53, 13, 378000, tzinfo=tzlocal()),
  'FeatureGroupStatus': 'Created'},
 {'FeatureGroupName': 'fscw-products-05-22-14-42',
  'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-products-05-22-14-42',
  'CreationTime': datetime.datetime(2023, 5, 22, 14, 56, 46, 573000, tzinfo=tzlocal()),
  'FeatureGroupStatus': 'Created',
  'OfflineStoreStatus': {'Status': 'Active'}},
 {'FeatureGroupName': 'fscw-orders-05-22-21-52',
  'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-orders-05-22-21-52',
  'CreationTime': datetime.datetime(2023, 5, 22, 21, 53, 40, 406000, tzinfo=tzlocal()),
  'FeatureGroupStatus': 'Created'},
 {'FeatureGroupName': 'fscw-orders-05-22-14-42',
  'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:featur