# Module 3 - Update Feature Group (Optional notebook)
### Module 1 is a pre-requisite for this notebook.

**Note:** Set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Setup](#setup)
1. [Explore existing customer feature group and data](#explore-customer-fg)
1. [Update customer feature group](#update-customer-fg)
1. [Ingest data into customer feature group](#ingest-customer-fg)
1. [Prepare training data set to retrain model](#model-training-data)
1. [Retrain XGBoost model](#retrain-xg-boost)
1. [Test model performance against test data](#real-time-inference)

---

In this notebook, we illustrate how to modify a feature group using the boto3 API and then ingest data into the modified feature group. We cover the following steps:

* Look at existing data in the customer feature group
* Modify the customer feature group to add the "has_kids" feature and ingest sample data
* Verify for a customer record that the data has been ingested
* Run an Athena query to extract the dataset for retraining (programmatically, using the SageMaker SDK)
* Retrain an XGBoost model, similar to what we did in the notebook `m3_nb1_model_training.ipynb`
* Test by deploying the model and predicting against a sample test record
* Clean up resources



## Setup
<a id='setup'></a>

In [2]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput
from sagemaker.predictor import Predictor
from datetime import datetime, timezone, date
from random import randint
import pandas as pd
import numpy as np
import subprocess
import sagemaker
import importlib
import logging
import time
import sys
import boto3
import os
sys.path.append('..')
from utilities import Utils

In [3]:
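from packaging import version  # compare versions numerically, not as strings
if version.parse(sagemaker.__version__) < version.parse('2.48.1'):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.48.1'])
    importlib.reload(sagemaker)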

In [4]:
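from packaging import version  # compare versions numerically, not as strings
if version.parse(boto3.__version__) < version.parse('1.24.23'):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'boto3==1.24.23'])
    importlib.reload(boto3)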
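
Version numbers compared as plain strings are ordered character by character, which can misorder releases once a component reaches three digits. The sketch below illustrates the difference with a hypothetical stdlib-only helper, `version_tuple` (not part of the notebook); `packaging.version.parse` handles pre-release tags more thoroughly and is usually the better choice.

```python
import re

def version_tuple(version_string):
    """Turn '1.26.111' into (1, 26, 111); stops at the first non-numeric piece."""
    parts = []
    for piece in version_string.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

# Lexicographic string comparison treats 1.100.0 as older than 1.24.23.
string_says_older = '1.100.0' < '1.24.23'
# Numeric comparison orders the two releases as intended.
tuple_says_older = version_tuple('1.100.0') < version_tuple('1.24.23')
print(string_says_older, tuple_says_older)
```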

In [5]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [6]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')
logger.info(f'Using boto3 version: {boto3.__version__}')

Using SageMaker version: 2.48.1
Using Pandas version: 1.3.5
Using boto3 version: 1.26.111


In [7]:
import pprint
pretty_printer = pprint.PrettyPrinter(indent=4)

In [8]:
!mkdir -p ./data/retrain


## Essentials

In [9]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'

Default S3 bucket = sagemaker-us-east-1-161615149547


In [10]:
region = sagemaker_session.boto_region_name

In [11]:
boto_session = boto3.Session(region_name=region)
sagemaker_runtime = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
s3 = boto_session.resource('s3')


In [12]:
def generate_event_timestamp():
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time
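
The helper above converts a naive local time to UTC in several steps. A shorter variant, sketched here on the assumption that only the current UTC instant is needed, requests a timezone-aware UTC datetime directly. The names `utc_event_timestamp` and `ISO_MILLIS_UTC` are illustrative and do not appear elsewhere in the notebook.

```python
import re
from datetime import datetime, timezone

# Millisecond ISO-8601 form used in this notebook, e.g. 2023-05-22T22:07:50.082Z
ISO_MILLIS_UTC = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$')

def utc_event_timestamp(now=None):
    """Return an ISO-8601 UTC timestamp with millisecond precision and a 'Z' suffix."""
    # An injected datetime makes the function easy to check deterministically.
    current = now if now is not None else datetime.now(timezone.utc)
    current = current.astimezone(timezone.utc)
    return current.isoformat(timespec='milliseconds').replace('+00:00', 'Z')

fixed = datetime(2023, 5, 22, 22, 7, 50, 82000, tzinfo=timezone.utc)
print(utc_event_timestamp(fixed))
```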

### Explore existing feature definition and the data set
<a id='explore-customer-fg'></a>

Retrieve variables stored in previous notebooks for feature group names

In [13]:
# Retrieve FG names
%store -r customers_feature_group_name
%store -r products_feature_group_name
%store -r orders_feature_group_name
logger.info(f'Customers FG: {customers_feature_group_name}')
logger.info(f'Products FG: {products_feature_group_name}')
logger.info(f'Orders FG: {orders_feature_group_name}')

Customers FG: fscw-customers-05-22-21-52
Products FG: fscw-products-05-22-21-52
Orders FG: fscw-orders-05-22-21-52


In [14]:
customers_fg = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)  
products_fg = FeatureGroup(name=products_feature_group_name, sagemaker_session=sagemaker_session)
orders_fg = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)

Verify that a record exists in the customer feature group for a random customer_id

In [15]:
customer_id =  f'C{randint(1, 10000)}'
logger.info(f'customer_id={customer_id}') 

customer_id=C4346


In [16]:
feature_record = featurestore_runtime.get_record(FeatureGroupName=customers_feature_group_name, 
                                                        RecordIdentifierValueAsString=customer_id)
feature_record

{'ResponseMetadata': {'RequestId': '60552487-ef3a-46b0-8a31-8b29ac604f32',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '60552487-ef3a-46b0-8a31-8b29ac604f32',
   'content-type': 'application/json',
   'content-length': '604',
   'date': 'Mon, 22 May 2023 22:06:45 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'customer_id', 'ValueAsString': 'C4346'},
  {'FeatureName': 'sex', 'ValueAsString': '1'},
  {'FeatureName': 'is_married', 'ValueAsString': '1'},
  {'FeatureName': 'event_time', 'ValueAsString': '2023-05-22T14:15:56.615Z'},
  {'FeatureName': 'age_18-29', 'ValueAsString': '0'},
  {'FeatureName': 'age_30-39', 'ValueAsString': '0'},
  {'FeatureName': 'age_40-49', 'ValueAsString': '0'},
  {'FeatureName': 'age_50-59', 'ValueAsString': '0'},
  {'FeatureName': 'age_60-69', 'ValueAsString': '0'},
  {'FeatureName': 'age_70-plus', 'ValueAsString': '1'},
  {'FeatureName': 'n_days_active', 'ValueAsString': '0.821917808219178'}]}

In [17]:
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
pretty_printer.pprint(describe_feature_group_result)

{   'CreatedBy': {   'DomainId': 'd-wguoldosy9tn',
                     'UserProfileArn': 'arn:aws:sagemaker:us-east-1:161615149547:user-profile/d-wguoldosy9tn/default-1679523110579',
                     'UserProfileName': 'default-1679523110579'},
    'CreationTime': datetime.datetime(2023, 5, 22, 21, 52, 50, 492000, tzinfo=tzlocal()),
    'EventTimeFeatureName': 'event_time',
    'FeatureDefinitions': [   {   'FeatureName': 'customer_id',
                                  'FeatureType': 'String'},
                              {'FeatureName': 'sex', 'FeatureType': 'Integral'},
                              {   'FeatureName': 'is_married',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'event_time',
                                  'FeatureType': 'String'},
                              {   'FeatureName': 'age_18-29',
                                  'FeatureType': 'Integral'},
                              {   'Featur

### Update feature group and ingest data
<a id='update-customer-fg' />

The sample products we have are spread across different categories: baby products, candies, cleaning products, etc. Let us assume that whether a customer *“has kids or not”* is definitely an indicator of them buying baby and kids products. Let's go ahead and modify the customer feature group to add this new feature.

In [18]:
# Call UpdateFeatureGroup with feature addition(s)
sagemaker_runtime.update_feature_group(
    FeatureGroupName=customers_feature_group_name,
    FeatureAdditions=[
        {"FeatureName": "has_kids", "FeatureType": "Integral"}
    ]
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:161615149547:feature-group/fscw-customers-05-22-21-52',
 'ResponseMetadata': {'RequestId': 'b326c1d0-46c0-4fb6-858c-52a252c161ad',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b326c1d0-46c0-4fb6-858c-52a252c161ad',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '103',
   'date': 'Mon, 22 May 2023 22:06:48 GMT'},
  'RetryAttempts': 0}}

We sleep for 60 seconds because the update operation can take about a minute to complete.

In [19]:
time.sleep(60)
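
A fixed sleep may be too short or longer than needed. As an alternative sketch, the `LastUpdateStatus` field of the `DescribeFeatureGroup` response can be polled until it is no longer `InProgress`. The function name, the timeout values and the `fake_describe` stand-in below are assumptions made for illustration; in the notebook, `sagemaker_runtime.describe_feature_group` would be passed as `describe_fn`.

```python
import time

def wait_for_feature_group_update(describe_fn, feature_group_name,
                                  timeout_s=300, poll_s=10, sleep_fn=time.sleep):
    """Poll a DescribeFeatureGroup-style callable until the last update finishes."""
    waited = 0
    while True:
        response = describe_fn(FeatureGroupName=feature_group_name)
        # 'LastUpdateStatus' may be missing if no update has been requested yet.
        last_update = response.get('LastUpdateStatus', {})
        status = last_update.get('Status')
        if status == 'Successful':
            return response
        if status == 'Failed':
            reason = last_update.get('FailureReason', 'unknown')
            raise RuntimeError(f'Feature group update failed: {reason}')
        if waited >= timeout_s:
            raise TimeoutError(f'Update still {status!r} after {waited} seconds')
        sleep_fn(poll_s)
        waited += poll_s

# Offline stand-in for sagemaker_runtime.describe_feature_group (made-up responses).
_fake_responses = iter([
    {'LastUpdateStatus': {'Status': 'InProgress'}},
    {'LastUpdateStatus': {'Status': 'Successful'}},
])

def fake_describe(FeatureGroupName):
    return next(_fake_responses)

result = wait_for_feature_group_update(
    fake_describe, 'demo-feature-group', sleep_fn=lambda seconds: None)
print(result['LastUpdateStatus']['Status'])
```

In the notebook this would be called as `wait_for_feature_group_update(sagemaker_runtime.describe_feature_group, customers_feature_group_name)`.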

In [20]:
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
pretty_printer.pprint(describe_feature_group_result)

{   'CreatedBy': {   'DomainId': 'd-wguoldosy9tn',
                     'UserProfileArn': 'arn:aws:sagemaker:us-east-1:161615149547:user-profile/d-wguoldosy9tn/default-1679523110579',
                     'UserProfileName': 'default-1679523110579'},
    'CreationTime': datetime.datetime(2023, 5, 22, 21, 52, 50, 492000, tzinfo=tzlocal()),
    'EventTimeFeatureName': 'event_time',
    'FeatureDefinitions': [   {   'FeatureName': 'customer_id',
                                  'FeatureType': 'String'},
                              {'FeatureName': 'sex', 'FeatureType': 'Integral'},
                              {   'FeatureName': 'is_married',
                                  'FeatureType': 'Integral'},
                              {   'FeatureName': 'event_time',
                                  'FeatureType': 'String'},
                              {   'FeatureName': 'age_18-29',
                                  'FeatureType': 'Integral'},
                              {   'Featur
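
The describe output is long, so spotting the new feature by eye is error-prone. A small helper can check the `FeatureDefinitions` list programmatically. This is a minimal sketch: `has_feature` is a hypothetical name, the sample dictionary is made up, and the helper assumes all feature definitions fit in a single response.

```python
def has_feature(describe_result, feature_name):
    """Return True if feature_name appears in a DescribeFeatureGroup response."""
    definitions = describe_result.get('FeatureDefinitions', [])
    return any(d.get('FeatureName') == feature_name for d in definitions)

# Made-up response fragment with the same shape as the output above.
sample_result = {'FeatureDefinitions': [
    {'FeatureName': 'customer_id', 'FeatureType': 'String'},
    {'FeatureName': 'has_kids', 'FeatureType': 'Integral'},
]}
print(has_feature(sample_result, 'has_kids'))
```

In the notebook, `has_feature(describe_feature_group_result, 'has_kids')` could serve as a guard before ingesting data.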

### Prepare "has_kids" feature data and ingest data again into customer feature group. 
<a id='ingest-customer-fg' />

Verify that the feature has been added to the feature group before proceeding with this step. We read the customer data from the CSV file, randomly generate 0 or 1 for the "has_kids" feature, and ingest the result into the feature group.

In [21]:
customers_df = pd.read_csv('./data/transformed/customers.csv')
customers_df.head(5)

Unnamed: 0,customer_id,sex,is_married,event_time,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active
0,C1,0,0,2023-05-22T14:15:48.008Z,0,0,0,1,0,0,0.026027
1,C2,1,0,2023-05-22T14:15:48.010Z,1,0,0,0,0,0,0.077397
2,C3,0,1,2023-05-22T14:15:48.013Z,0,0,0,0,1,0,0.821233
3,C4,1,1,2023-05-22T14:15:48.015Z,0,0,0,1,0,0,0.887671
4,C5,0,1,2023-05-22T14:15:48.017Z,0,1,0,0,0,0,0.265753


Use the NumPy library to generate random 1s and 0s

In [22]:
customers_df['has_kids']=np.random.randint(0, 2, customers_df.shape[0])

Drop the existing event time column and add current time as event time. These two steps are optional.

In [23]:
customers_df=customers_df.drop(['event_time'],axis=1)

In [24]:
event_timestamps = [generate_event_timestamp() for _ in range(len(customers_df))]
customers_df['event_time'] = event_timestamps
customers_df.head(5)

Unnamed: 0,customer_id,sex,is_married,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,has_kids,event_time
0,C1,0,0,0,0,0,1,0,0,0.026027,0,2023-05-22T22:07:50.082Z
1,C2,1,0,1,0,0,0,0,0,0.077397,0,2023-05-22T22:07:50.082Z
2,C3,0,1,0,0,0,0,1,0,0.821233,1,2023-05-22T22:07:50.082Z
3,C4,1,1,0,0,0,1,0,0,0.887671,1,2023-05-22T22:07:50.082Z
4,C5,0,1,0,1,0,0,0,0,0.265753,1,2023-05-22T22:07:50.082Z


Ingest the updated data into the feature group. If the ingest operation throws errors about the feature not being present in the feature group, give the update operation some more time, as mentioned before, and try the ingest again.

In [64]:
%%time
customers_fg.ingest(data_frame=customers_df, max_processes=16, wait=True)
logger.info(f'{len(customers_df)} customer records ingested into feature group: {customers_feature_group_name}')

10000 customer records ingested into feature group: fscw-customers-05-22-21-52


CPU times: user 166 ms, sys: 194 ms, total: 359 ms
Wall time: 17.8 s


Verify online store for a specific customer_id

In [65]:
get_record_result = featurestore_runtime.get_record(
    FeatureGroupName=customers_feature_group_name,
    RecordIdentifierValueAsString=customer_id
)
pretty_printer.pprint(get_record_result)

{   'Record': [   {'FeatureName': 'customer_id', 'ValueAsString': 'C4346'},
                  {'FeatureName': 'sex', 'ValueAsString': '1'},
                  {'FeatureName': 'is_married', 'ValueAsString': '1'},
                  {   'FeatureName': 'event_time',
                      'ValueAsString': '2023-05-22T22:07:50.117Z'},
                  {'FeatureName': 'age_18-29', 'ValueAsString': '0'},
                  {'FeatureName': 'age_30-39', 'ValueAsString': '0'},
                  {'FeatureName': 'age_40-49', 'ValueAsString': '0'},
                  {'FeatureName': 'age_50-59', 'ValueAsString': '0'},
                  {'FeatureName': 'age_60-69', 'ValueAsString': '0'},
                  {'FeatureName': 'age_70-plus', 'ValueAsString': '1'},
                  {   'FeatureName': 'n_days_active',
                      'ValueAsString': '0.821917808219178'},
                  {'FeatureName': 'has_kids', 'ValueAsString': '1'}],
    'ResponseMetadata': {   'HTTPHeaders': {   'content-length'
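
The `Record` field is a list of name/value pairs, which is awkward to inspect for a single feature. The sketch below flattens it into a dictionary. `record_to_dict` is an illustrative helper, and the sample response is a shortened copy of the output above.

```python
def record_to_dict(get_record_response):
    """Flatten a GetRecord response into {feature_name: value_as_string}."""
    return {
        feature['FeatureName']: feature['ValueAsString']
        for feature in get_record_response.get('Record', [])
    }

# Shortened copy of the response shown above.
sample_response = {'Record': [
    {'FeatureName': 'customer_id', 'ValueAsString': 'C4346'},
    {'FeatureName': 'has_kids', 'ValueAsString': '1'},
]}
features = record_to_dict(sample_response)
print(features.get('has_kids'))
```

If the response has no `Record` key, the helper returns an empty dictionary, so `features.get('has_kids')` is `None` in that case.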

Let us run an Athena query to verify the offline store. Note that ingestion into the offline store can take some time, since data is buffered, batched, and written to Amazon S3 within 15 minutes.

In [66]:
customers_query = customers_fg.athena_query()
customers_table = customers_query.table_name

In [67]:
output_location = f's3://{default_bucket}/{prefix}/query_results/'

In [68]:
query_string = f'SELECT * FROM "{customers_table}" limit 10'

In [69]:
customers_query.run(query_string=query_string,output_location=output_location)
customers_query.wait()
athena_df = customers_query.as_dataframe()
athena_df.head()

Unnamed: 0,write_time,api_invocation_time,is_deleted,customer_id,sex,is_married,event_time,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,has_kids
0,2023-05-22 21:59:37.011 UTC,2023-05-22 21:54:29.000 UTC,False,C13,0,0,2023-05-22T14:15:48.033Z,0,1,0,0,0,0,0.996575,
1,2023-05-22 21:59:37.011 UTC,2023-05-22 21:54:30.000 UTC,False,C37,1,0,2023-05-22T14:15:48.088Z,0,0,0,0,0,1,0.715753,
2,2023-05-22 21:59:37.011 UTC,2023-05-22 21:54:30.000 UTC,False,C4393,0,0,2023-05-22T14:15:56.688Z,0,0,0,0,0,1,0.639041,
3,2023-05-22 21:59:37.011 UTC,2023-05-22 21:54:30.000 UTC,False,C7506,0,0,2023-05-22T14:16:02.756Z,0,0,0,0,0,1,0.865753,
4,2023-05-22 21:59:37.011 UTC,2023-05-22 21:54:30.000 UTC,False,C4396,1,1,2023-05-22T14:15:56.693Z,0,1,0,0,0,0,0.464384,


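As the steps above show, modifying an existing feature group, adding new features, and ingesting data takes only a few API calls. Note that the rows returned by this query were written to the offline store before the update (their `write_time` precedes the `update_feature_group` call), so `has_kids` is empty for them; the newly ingested rows show up once they have been flushed to Amazon S3.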

In [70]:
customers_df.to_csv('./data/transformed/customers_has_kids.csv', index=False)

### Verify offline store in Athena Console

If this is the first time we launch Athena in the AWS console, we need to click the `Get Started` button and, before running the first query, set up a query results location in Amazon S3.

After setting the query results location, in the left panel we select `AwsDataCatalog` as the Data source and `sagemaker_featurestore` as the Database.

We can now run a query against the offline feature store data in Athena. To select entries from the customers feature group, we use the following SQL query. Replace the customers table name with the corresponding value from your environment.

```sql
select * from "<customers-table>"
limit 100
```

![Customers offline data](./images/m3_nb4_athena_query.png "Customers Offline Data")

## Optional steps
From here on in this notebook, we use the data that has the new feature "has_kids" to train the model again, deploy it, and test it against sample data. The intention is not to prove that model performance improves (this is sample data, after all) but to show a real-life use case where modified feature groups are used for training.

### Prepare model training dataset
<a id='model-training-data' />

Prepare train, test and validation data 

In [71]:
products_query = products_fg.athena_query()
products_table = products_query.table_name

orders_query = orders_fg.athena_query()
orders_table = orders_query.table_name

To prepare the training, validation and test data, we run an Athena query against the offline feature store and keep only records for which "has_kids" has been populated. Why? Because the offline feature store retains historical records, and we want only the latest ingested data, which has "has_kids" populated, for retraining our model.

In [72]:
query_string = f'SELECT * FROM "{customers_table}", "{products_table}", "{orders_table}" ' \
               f'WHERE ("{orders_table}"."customer_id" = "{customers_table}"."customer_id") ' \
               f'AND ("{orders_table}"."product_id" = "{products_table}"."product_id") ' \
               f'AND ("{customers_table}"."has_kids" is not null)'
query_string

'SELECT * FROM "fscw_customers_05_22_21_52_1684792370", "fscw_products_05_22_21_52_1684792393", "fscw_orders_05_22_21_52_1684792420" WHERE ("fscw_orders_05_22_21_52_1684792420"."customer_id" = "fscw_customers_05_22_21_52_1684792370"."customer_id") AND ("fscw_orders_05_22_21_52_1684792420"."product_id" = "fscw_products_05_22_21_52_1684792393"."product_id") AND ("fscw_customers_05_22_21_52_1684792370"."has_kids" is not null)'
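
Filtering on `has_kids is not null` removes the older rows, but if the same customer were ingested more than once after the update, the offline store would still hold several rows per customer. One common way to keep only the most recent row per record is a `row_number()` window over the identifier. The builder below is a sketch of that technique, not the query this notebook runs: `latest_records_query` is a hypothetical helper, and it assumes the offline-store columns `event_time`, `api_invocation_time`, `write_time` and `is_deleted` seen in the query output earlier.

```python
def latest_records_query(table_name, id_column='customer_id', required_column='has_kids'):
    """Build an Athena query that keeps the most recent row per record identifier."""
    return (
        'SELECT * FROM ('
        'SELECT *, row_number() OVER ('
        f'PARTITION BY "{id_column}" '
        'ORDER BY "event_time" DESC, "api_invocation_time" DESC, "write_time" DESC'
        f') AS row_num FROM "{table_name}" '
        f'WHERE "{required_column}" IS NOT NULL'
        ') WHERE row_num = 1 AND NOT "is_deleted"'
    )

# 'demo_customers_table' is a placeholder; the notebook would pass customers_table.
latest_query = latest_records_query('demo_customers_table')
print(latest_query)
```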

In [73]:
orders_query.run(query_string=query_string, output_location=output_location)
orders_query.wait()
joined_df = orders_query.as_dataframe()
joined_df.head()

Unnamed: 0,write_time,api_invocation_time,is_deleted,customer_id,sex,is_married,event_time,age_18-29,age_30-39,age_40-49,...,write_time.2,api_invocation_time.2,is_deleted.2,order_id,customer_id.1,product_id.1,purchase_amount,is_reordered,event_time.2,n_days_since_last_purchase
0,2023-05-22 22:12:49.367 UTC,2023-05-22 22:07:51.000 UTC,False,C3,0,1,2023-05-22T22:07:50.082Z,0,0,0,...,2023-05-22 22:01:07.695 UTC,2023-05-22 21:58:18.000 UTC,False,O24315,C3,P8252,0.794257,1,2023-05-22T14:23:26.133Z,0.47093
1,2023-05-22 22:12:49.367 UTC,2023-05-22 22:07:51.000 UTC,False,C3,0,1,2023-05-22T22:07:50.082Z,0,0,0,...,2023-05-22 22:00:57.702 UTC,2023-05-22 21:57:32.000 UTC,False,O41122,C3,P186,0.987921,1,2023-05-22T14:23:27.412Z,0.670543
2,2023-05-22 22:12:49.367 UTC,2023-05-22 22:07:51.000 UTC,False,C3,0,1,2023-05-22T22:07:50.082Z,0,0,0,...,2023-05-22 22:00:57.813 UTC,2023-05-22 21:57:40.000 UTC,False,O22783,C3,P6721,0.510297,1,2023-05-22T14:23:26.040Z,0.377907
3,2023-05-22 22:12:49.367 UTC,2023-05-22 22:07:51.000 UTC,False,C3,0,1,2023-05-22T22:07:50.082Z,0,0,0,...,2023-05-22 22:00:57.921 UTC,2023-05-22 21:56:50.000 UTC,False,O52027,C3,P12115,0.914158,0,2023-05-22T14:23:28.191Z,0.432171
4,2023-05-22 22:12:49.367 UTC,2023-05-22 22:07:51.000 UTC,False,C3,0,1,2023-05-22T22:07:50.082Z,0,0,0,...,2023-05-22 22:01:01.613 UTC,2023-05-22 21:58:19.000 UTC,False,O92924,C3,P2175,0.553069,1,2023-05-22T14:23:31.531Z,0.364341


In [74]:
joined_df.shape

(99986, 50)

In [75]:
model_df = joined_df.drop(['order_id', 
                           'customer_id', 
                           'product_id', 
                           'event_time', 
                           'write_time', 
                           'api_invocation_time', 
                           'is_deleted', 
                           'product_id.1', 
                           'event_time.1', 
                           'write_time.1', 
                           'api_invocation_time.1', 
                           'is_deleted.1', 
                           'customer_id.1', 
                           'purchase_amount',
                           'event_time.2', 
                           'n_days_since_last_purchase',
                           'write_time.2', 
                           'api_invocation_time.2', 
                           'is_deleted.2'], axis=1)

In [76]:
model_df.head(5)

Unnamed: 0,sex,is_married,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,has_kids,...,category_ice_cream_ice,category_juice_nectars,category_packaged_cheese,category_refrigerated,category_soup_broth_bouillon,category_spices_seasonings,category_tea,category_vitamins_supplements,category_yogurt,is_reordered
0,0,1,0,0,0,0,1,0,0.821233,1,...,0,0,0,0,0,1,0,0,0,1
1,0,1,0,0,0,0,1,0,0.821233,1,...,0,0,0,0,0,0,0,0,0,1
2,0,1,0,0,0,0,1,0,0.821233,1,...,0,0,0,0,1,0,0,0,0,1
3,0,1,0,0,0,0,1,0,0.821233,1,...,0,0,0,1,0,0,0,0,0,0
4,0,1,0,0,0,0,1,0,0.821233,1,...,0,0,0,0,0,0,0,0,0,1


In [77]:
first_column = model_df.pop('is_reordered')
model_df.insert(0, 'is_reordered', first_column)
model_df.head()

Unnamed: 0,is_reordered,sex,is_married,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,...,category_hair_care,category_ice_cream_ice,category_juice_nectars,category_packaged_cheese,category_refrigerated,category_soup_broth_bouillon,category_spices_seasonings,category_tea,category_vitamins_supplements,category_yogurt
0,1,0,1,0,0,0,0,1,0,0.821233,...,0,0,0,0,0,0,1,0,0,0
1,1,0,1,0,0,0,0,1,0,0.821233,...,0,0,0,0,0,0,0,0,0,0
2,1,0,1,0,0,0,0,1,0,0.821233,...,0,0,0,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,1,0,0.821233,...,0,0,0,0,1,0,0,0,0,0
4,1,0,1,0,0,0,0,1,0,0.821233,...,0,0,0,0,0,0,0,0,0,0


In [78]:
model_df.to_csv('./data/retrain/transformed_has_kids.csv', index=False)

### Retrain the XGBoost model with the updated feature group
<a id='retrain-xg-boost' />

Now let's train the model again with this new data set

In [79]:
train_df, validation_df, test_df = np.split(model_df.sample(frac=1, random_state=123), [int(.7*len(model_df)), int(.9*len(model_df))])

In [80]:
train_df.shape

(69990, 31)

In [81]:
validation_df.shape

(19997, 31)

In [82]:
test_df.shape

(9999, 31)
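
The three shapes follow from the two cut points passed to `np.split`: rows up to 70% go to training, rows between 70% and 90% to validation, and the remainder to test. The arithmetic can be reproduced without pandas; `split_sizes` is an illustrative helper, not part of the notebook.

```python
def split_sizes(n_rows, train_end_frac=0.7, validation_end_frac=0.9):
    """Return (train, validation, test) row counts for the two cut points."""
    train_end = int(train_end_frac * n_rows)            # first cut index
    validation_end = int(validation_end_frac * n_rows)  # second cut index
    return train_end, validation_end - train_end, n_rows - validation_end

# 99986 is the row count of joined_df reported earlier.
sizes = split_sizes(99986)
print(sizes)  # (69990, 19997, 9999)
```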

Store the train, validation and test data locally and in S3

In [83]:
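# The built-in XGBoost algorithm expects CSV input without a header row,
# so the header is omitted for the training and validation files.
train_df.to_csv('./data/retrain/train.csv', index=False, header=False)
validation_df.to_csv('./data/retrain/validation.csv', index=False, header=False)
# test.csv keeps its header because it is read back with pandas later on.
test_df.to_csv('./data/retrain/test.csv', index=False)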

In [84]:
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'retrain/train.csv')).upload_file('./data/retrain/train.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'retrain/validation.csv')).upload_file('./data/retrain/validation.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'retrain/test.csv')).upload_file('./data/retrain/test.csv')

In [85]:
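# Point each channel at its own file; the shared 'retrain/' prefix would
# feed all three CSV files (including the test set) to every channel.
train_set_location = 's3://{}/{}/retrain/train.csv'.format(default_bucket, prefix)
validation_set_location = 's3://{}/{}/retrain/validation.csv'.format(default_bucket, prefix)
test_set_location = 's3://{}/{}/retrain/test.csv'.format(default_bucket, prefix)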

In [86]:
train_set_pointer = TrainingInput(s3_data=train_set_location, content_type='csv')
validation_set_pointer = TrainingInput(s3_data=validation_set_location, content_type='csv')
test_set_pointer = TrainingInput(s3_data=test_set_location, content_type='csv')

In [87]:
container_uri = sagemaker.image_uris.retrieve(region=boto_session.region_name, 
                                              framework='xgboost', 
                                              version='1.0-1', 
                                              image_scope='training')

In [88]:
xgb = sagemaker.estimator.Estimator(image_uri=container_uri,
                                    role=role, 
                                    instance_count=2, 
                                    instance_type='ml.m5.xlarge',
                                    output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),
                                    sagemaker_session=sagemaker_session,
                                    base_job_name='reorder-classifier')

xgb.set_hyperparameters(objective='binary:logistic',
                        num_round=100)

In [89]:
xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})

2023-05-22 22:31:25 Starting - Starting the training job...ProfilerReport-1684794685: InProgress
...
2023-05-22 22:32:24 Starting - Preparing the instances for training......
2023-05-22 22:33:26 Downloading - Downloading input data...
2023-05-22 22:33:56 Training - Downloading the training image...
2023-05-22 22:34:16 Training - Training image download completed. Training in progress..
[2023-05-22 22:34:27.856 ip-10-0-155-106.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determi

In [90]:
# Save training job information to be used in subsequent modules
training_job_info = xgb.latest_training_job.describe()
if training_job_info is not None:
    training_jobName = training_job_info["TrainingJobName"]
    %store training_jobName

Stored 'training_jobName' (str)


In [91]:
xgb_predictor = xgb.deploy(initial_instance_count=2,
                           instance_type='ml.m5.xlarge')

------!

In [92]:
# Saving endpoint name to be used in subsequent modules
endpoint_name = xgb_predictor.endpoint_name
%store endpoint_name

Stored 'endpoint_name' (str)


### Real time inference using the deployed endpoint
<a id='real-time-inference' />

Let's get a record from the test data and test the inference.

In [93]:
csv_serializer = CSVSerializer()
endpoint_name = xgb_predictor.endpoint_name
predictor = Predictor(endpoint_name=endpoint_name, 
                      serializer=csv_serializer)

In [94]:
test_df = pd.read_csv('./data/retrain/test.csv')
record = test_df.sample(1)
record

Unnamed: 0,is_reordered,sex,is_married,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,...,category_hair_care,category_ice_cream_ice,category_juice_nectars,category_packaged_cheese,category_refrigerated,category_soup_broth_bouillon,category_spices_seasonings,category_tea,category_vitamins_supplements,category_yogurt
5980,1,0,1,0,0,1,0,0,0,0.031507,...,0,0,0,1,0,0,0,0,0,0


In [95]:
test_df = pd.read_csv('./data/retrain/test.csv')
test_df

Unnamed: 0,is_reordered,sex,is_married,age_18-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-plus,n_days_active,...,category_hair_care,category_ice_cream_ice,category_juice_nectars,category_packaged_cheese,category_refrigerated,category_soup_broth_bouillon,category_spices_seasonings,category_tea,category_vitamins_supplements,category_yogurt
0,0,1,0,0,0,0,0,0,1,0.862329,...,0,0,0,0,0,0,0,0,0,1
1,1,0,0,1,0,0,0,0,0,0.288356,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,1,0,0.288356,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,0,0,0,0,0.352055,...,0,0,1,0,0,0,0,0,0,0
4,1,1,0,0,0,1,0,0,0,0.574658,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,1,0,0,0,0,1,0,0,0,0.341781,...,0,0,0,0,0,0,0,0,0,0
9995,1,1,1,1,0,0,0,0,0,0.745205,...,0,0,0,0,0,0,0,0,0,1
9996,0,0,1,0,0,0,0,0,1,0.991781,...,0,0,1,0,0,0,0,0,0,0
9997,0,1,1,0,1,0,0,0,0,0.729452,...,0,0,0,1,0,0,0,0,0,0


In [96]:
X = record.values[0]
payload = X[1:]
payload

array([0.        , 1.        , 0.        , 0.        , 1.        ,
       0.        , 0.        , 0.        , 0.03150685, 1.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 1.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [97]:
%%time

predicted_class_prob = predictor.predict(payload).decode('utf-8')
logger.info(f'Predicted class probability {predicted_class_prob}')
if float(predicted_class_prob) < 0.5:
    logger.info('Prediction (y) = Will not reorder')
else:
    logger.info('Prediction (y) = Will reorder')

Predicted class probability 0.7311009168624878
Prediction (y) = Will reorder


CPU times: user 14.7 ms, sys: 0 ns, total: 14.7 ms
Wall time: 112 ms
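
The cell above hard-codes a 0.5 cut-off on the returned probability. Making the threshold an explicit parameter makes it easier to experiment with other operating points. The helper below is a sketch with an assumed name, `label_from_probability`, and it mirrors the same rule: values below the threshold map to "Will not reorder".

```python
def label_from_probability(probability, threshold=0.5):
    """Map a predicted probability (string or float) to a readable label."""
    if float(probability) < threshold:
        return 'Will not reorder'
    return 'Will reorder'

# The probability string is the value logged in the output above.
label = label_from_probability('0.7311009168624878')
print(label)
```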


### Cleanup

Now that we have seen how features can be added to feature groups, it is time to delete unwanted resources, such as endpoints, to avoid incurring charges.

In [None]:
describe_feature_group_result = sagemaker_runtime.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
pretty_printer.pprint(describe_feature_group_result)

Delete the model, the endpoint, and the endpoint configuration

In [None]:
response = sagemaker_runtime.describe_endpoint_config(EndpointConfigName=endpoint_name)
model_name = response['ProductionVariants'][0]['ModelName']
model_name

In [None]:
sagemaker_runtime.delete_model(ModelName=model_name)  

In [None]:
sagemaker_runtime.delete_endpoint(EndpointName=endpoint_name)

In [None]:
sagemaker_runtime.delete_endpoint_config(EndpointConfigName=endpoint_name)
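
The cleanup cells above assume that the endpoint configuration has the same name as the endpoint. A slightly more defensive sketch looks the configuration name up from the endpoint first and then deletes the endpoint, its configuration, and every model the configuration references. `delete_endpoint_resources` and `RecordingClient` are hypothetical names; the stand-in client exists only so the example runs offline, and in the notebook `sagemaker_runtime` would be passed instead.

```python
def delete_endpoint_resources(client, endpoint_name):
    """Delete an endpoint together with its endpoint config and models."""
    endpoint = client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint['EndpointConfigName']
    config = client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant['ModelName'] for variant in config['ProductionVariants']]
    # Look everything up first, then delete: endpoint, config, models.
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        client.delete_model(ModelName=model_name)
    return config_name, model_names

class RecordingClient:
    """Offline stand-in for the boto3 SageMaker client; records delete calls."""
    def __init__(self):
        self.calls = []
    def describe_endpoint(self, EndpointName):
        return {'EndpointConfigName': f'{EndpointName}-config'}
    def describe_endpoint_config(self, EndpointConfigName):
        return {'ProductionVariants': [{'ModelName': 'demo-model'}]}
    def delete_endpoint(self, EndpointName):
        self.calls.append(('delete_endpoint', EndpointName))
    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(('delete_endpoint_config', EndpointConfigName))
    def delete_model(self, ModelName):
        self.calls.append(('delete_model', ModelName))

fake_client = RecordingClient()
deleted = delete_endpoint_resources(fake_client, 'demo-endpoint')
print(fake_client.calls)
```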