# Part 1: Data Preparation, Process, and Store Features

<a id='all-up-overview'></a>

## [Overview](./0-AutoClaimFraudDetection.ipynb)
* Notebook 1/2: Overview, Architecture and Data Exploration
* **[Notebook 3: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)**
  * **[Architecture](#arch)**
  * **[Getting started](#aud-getting-started)**
  * **[DataSets](#aud-datasets)**
  * **[SageMaker Feature Store](#aud-feature-store)**
  * **[Create train and test datasets](#aud-dataset)**
* Notebook 4: Train, Check Bias, Tune, Record Lineage, and Register a Model
* Notebook 5: Mitigate Bias, Train New Model, Store in Registry
* Notebook 6: Deploy Model, Run Predictions
* Notebook 7: Create and Run an End-to-End Pipeline to Deploy the Model

The purpose of this notebook is to perform the Data Prep phase of the ML life cycle. The main data wrangling, data ingestion, and transformations are done through the SageMaker Studio Data Wrangler GUI.

In this notebook, we take the `.flow` files that define the transformations and apply them, using a SageMaker Processing job, to the raw data deposited in the S3 bucket as `.csv` files.

<a id='arch'> </a>

## Architecture for Data Prep, Process and Store Features
[overview](#all-up-overview)
----
![Data Prep and Store](./images/e2e-1-pipeline-v3b.png)

### Install and/or update required third-party libraries

In [3]:
import warnings
warnings.filterwarnings('ignore')

### Loading stored variables
If you ran this notebook before, you may want to re-use the resources you already created with AWS. Run the cell below to load any previously created variables. You should see a print-out of the existing variables. If nothing is printed, this is probably the first time you are running the notebook.

In [4]:
%store -r
%store

Stored variables and their in-db values:
bucket                             -> 'sagemaker-us-east-1-875692608981'
claims_preprocessed                ->       policy_id  incident_severity  num_vehicles_i
customers_preprocessed             ->       policy_id  customer_age  customer_education 
prefix                             -> 'fraud-detect-demo'


**<font color='red'>Important</font>: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**
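
As a quick sanity check after `%store -r`, you could confirm that the variables this notebook relies on are actually present. The sketch below is only an illustration: the list of names is taken from the `%store` listing above, and `missing_variables` is a hypothetical helper, not part of IPython or SageMaker.

```python
# Names this notebook expects from earlier notebooks (taken from the %store listing above).
REQUIRED_VARIABLES = ["bucket", "prefix", "claims_preprocessed", "customers_preprocessed"]


def missing_variables(namespace, required=REQUIRED_VARIABLES):
    """Return the required names that are absent from the given namespace dict."""
    return [name for name in required if name not in namespace]


# Hypothetical usage inside the notebook, after running `%store -r`:
# missing = missing_variables(globals())
# if missing:
#     raise RuntimeError(f"Run the previous notebooks first; missing: {missing}")
```

Failing early here gives a clearer message than a `NameError` several cells later.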

### Import libraries

In [5]:
import json
import time
import boto3
import string
import sagemaker
import pandas as pd
import awswrangler as wr

from sagemaker.feature_store.feature_group import FeatureGroup

<a id='aud-getting-started'></a>

## Getting started: Creating Resources

[overview](#all-up-overview)
----
To run this notebook successfully, you will need to create some AWS resources. 
First, an S3 bucket will be created to store all the data for this tutorial. 
Once it is created, you will need to create an AWS Glue role using the IAM console, then attach a policy that allows this notebook to access Feature Store. If you've already run this notebook and are picking up where you left off, running the cells below should pick up the resources you already created without creating any additional ones.

#### Add FeatureStore policy to Studio's execution role

![title](images/iam-policies.png)


1. In a separate browser tab, go to the IAM section of the AWS Console
2. Navigate to the Roles section and select the execution role you're using for your SageMaker Studio user
    * If you're not sure what role you're using, run the cell below to print it out
3. Attach the <font color='green'> AmazonSageMakerFeatureStoreAccess </font> policy to this role. Once attached, the changes take effect immediately.
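
If you prefer to script the console steps above, a minimal sketch could look like the following. It assumes the caller has permission to call `iam:AttachRolePolicy` (a Studio execution role often does not), and `attach_feature_store_policy` is a hypothetical helper name. The IAM client is passed in so the function can be exercised without AWS access.

```python
# ARN of the AWS managed policy named in step 3.
FEATURE_STORE_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFeatureStoreAccess"


def attach_feature_store_policy(iam_client, role_arn):
    """Attach the Feature Store managed policy to the role behind role_arn."""
    # attach_role_policy expects the role *name*, i.e. the last path segment of the ARN.
    role_name = role_arn.split("/")[-1]
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=FEATURE_STORE_POLICY_ARN)
    return role_name


# Hypothetical usage with real clients:
# import boto3, sagemaker
# attach_feature_store_policy(boto3.client("iam"), sagemaker.get_execution_role())
```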

### Set region, boto3 and SageMaker SDK variables

In [8]:
# You can change this to a region of your choice
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

Using AWS Region: us-east-1


In [10]:
boto_session = boto3.Session(region_name=region)
s3_client = boto3.client("s3", region_name=region)
sagemaker_client = boto_session.client("sagemaker")
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_client
)
sagemaker_role = sagemaker.get_execution_role()
account_id = boto3.client("sts").get_caller_identity()["Account"]
print("SageMaker Role:", sagemaker_role.split("/")[-1])

SageMaker Role: AmazonSageMaker-ExecutionRoleMLT


<a id='aud-feature-store'></a>

## SageMaker Feature Store

[overview](#all-up-overview)
----
Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s much easier to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent. SageMaker Feature Store keeps track of the metadata of stored features (e.g. feature name or version number) so that you can query the features for the right attributes in batches or in real time using Amazon Athena, an interactive query service. SageMaker Feature Store also keeps features updated, because as new data is generated during inference, the single repository is updated so new features are always available for models to use during training and inference.

A feature store consists of an offline component stored in S3 and an online component stored in a low-latency database. The online database is optional, but very useful if you need supplemental features to be available at inference. In this section, we will create feature groups for our Claims and Customers datasets. After inserting the claims and customer data into their respective feature groups, you need to query the offline store with Athena to build the training dataset.

You can reference the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html) for more information about the SageMaker Feature Store.
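
To make the role of the online store concrete, here is a hedged sketch of reading one record at inference time through the `sagemaker-featurestore-runtime` `GetRecord` API. The helper name `get_online_features` is hypothetical, and the client is passed in as a parameter so the function can be tried with a stub.

```python
def get_online_features(featurestore_runtime_client, feature_group_name, record_id):
    """Fetch the latest record for record_id from the online store as a plain dict."""
    response = featurestore_runtime_client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=str(record_id),
    )
    # Assumption: "Record" is missing from the response when no record matches.
    return {f["FeatureName"]: f["ValueAsString"] for f in response.get("Record", [])}


# Hypothetical usage once the feature groups below exist and hold data:
# get_online_features(featurestore_runtime, claims_fg_name, 2013)
```

All values come back as strings, so numeric features need to be cast before they are passed to a model.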


In [11]:
featurestore_runtime = boto_session.client(
    service_name="sagemaker-featurestore-runtime", region_name=region
)

feature_store_session = sagemaker.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,
)

### Configure the feature groups
The data type for each feature is set by passing a dataframe and inferring the proper data type. Feature data types can also be set via a config variable, but they will have to match the corresponding Python data type in the Pandas dataframe when it is ingested into the Feature Group.

In [12]:
claims_fg_name = f"{prefix}-claims"
customers_fg_name = f"{prefix}-customers"
%store claims_fg_name
%store customers_fg_name

claims_feature_group = FeatureGroup(name=claims_fg_name, sagemaker_session=feature_store_session)

customers_feature_group = FeatureGroup(
    name=customers_fg_name, sagemaker_session=feature_store_session
)

claims_feature_group.load_feature_definitions(data_frame=claims_preprocessed)
customers_feature_group.load_feature_definitions(data_frame=customers_preprocessed);

Stored 'claims_fg_name' (str)
Stored 'customers_fg_name' (str)
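
To give a feel for what the inference step does, the sketch below maps pandas dtype names to the three Feature Store feature types (`Integral`, `Fractional`, `String`). It is an approximation written for illustration, not the SDK's own code, and `feature_type_for_dtype` is a hypothetical helper. It refuses dtypes it does not recognize rather than guessing.

```python
def feature_type_for_dtype(dtype_name):
    """Map a pandas dtype name (e.g. "int64") to a Feature Store feature type."""
    if dtype_name.startswith(("int", "uint")):
        return "Integral"
    if dtype_name.startswith("float"):
        return "Fractional"
    if dtype_name in ("object", "string"):
        return "String"
    # Assumption of this sketch: other dtypes (bool, datetime64, ...) must be converted first.
    raise ValueError(f"No feature type chosen for dtype {dtype_name!r}")


# Hypothetical usage with the dataframes from this notebook:
# {col: feature_type_for_dtype(str(dtype)) for col, dtype in claims_preprocessed.dtypes.items()}
```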


### Create the feature groups
You must tell the Feature Group which columns in the dataframe correspond to the required record identifier and event time features.

In [13]:
print(f"{customers_fg_name} -- {claims_fg_name} are the feature group names in use")

fraud-detect-demo-customers -- fraud-detect-demo-claims are the feature group names in use


In [14]:
record_identifier_feature_name = "policy_id"
event_time_feature_name = "event_time"

try:
    print(f"\n Using s3://{bucket}/{prefix}")
    claims_feature_group.create(
        s3_uri=f"s3://{bucket}/{prefix}",
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=sagemaker_role,
        enable_online_store=True,
    )
    print(f'Create "claims" feature group: SUCCESS')
except Exception as e:
    # Only botocore errors carry a .response; anything else is re-raised unchanged.
    code = getattr(e, "response", {}).get("Error", {}).get("Code")
    if code == "ResourceInUse":
        print(f"Using existing feature group: {claims_fg_name}")
    else:
        raise

try:
    customers_feature_group.create(
        s3_uri=f"s3://{bucket}/{prefix}",
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=sagemaker_role,
        enable_online_store=True,
    )
    print(f'Create "customers" feature group: SUCCESS')
except Exception as e:
    # Only botocore errors carry a .response; anything else is re-raised unchanged.
    code = getattr(e, "response", {}).get("Error", {}).get("Code")
    if code == "ResourceInUse":
        print(f"Using existing feature group: {customers_fg_name}")
    else:
        raise


 Using s3://sagemaker-us-east-1-875692608981/fraud-detect-demo
Create "claims" feature group: SUCCESS
Create "customers" feature group: SUCCESS
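
Because creation depends on the record identifier and event time columns being present, a small pre-flight check can help. The sketch below is a hypothetical helper, not an SDK call. It assumes the event time is stored as Unix epoch seconds, which Feature Store accepts for a `Fractional` event-time feature.

```python
import time


def check_required_features(columns, record_identifier="policy_id", event_time="event_time"):
    """Return the required feature names that are missing from columns."""
    return [name for name in (record_identifier, event_time) if name not in columns]


def current_event_time():
    """Unix epoch seconds as a float, one way to fill an event-time column."""
    return round(time.time(), 3)


# Hypothetical usage before calling create():
# missing = check_required_features(claims_preprocessed.columns)
# if "event_time" in missing:
#     claims_preprocessed["event_time"] = current_event_time()
```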


### Wait until feature group creation has fully completed

In [15]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")


wait_for_feature_group_creation_complete(feature_group=claims_feature_group)
wait_for_feature_group_creation_complete(feature_group=customers_feature_group)

Waiting for Feature Group Creation
FeatureGroup fraud-detect-demo-claims successfully created.
Waiting for Feature Group Creation
FeatureGroup fraud-detect-demo-customers successfully created.
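
The wait loop above polls until the status leaves `Creating`. If you want an upper bound on how long it can run, one possible variant is sketched below. `wait_for_status` and its default timeout are assumptions for illustration. The status lookup is passed in as a callable so the logic does not depend on AWS.

```python
import time


def wait_for_status(get_status, pending="Creating", done="Created",
                    poll_seconds=5, timeout_seconds=600):
    """Poll get_status() until it leaves `pending`; raise on timeout or unexpected status."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    if status != done:
        raise RuntimeError(f"Unexpected status: {status!r}")
    return status


# Hypothetical usage with a feature group from this notebook:
# wait_for_status(lambda: claims_feature_group.describe().get("FeatureGroupStatus"))
```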


### Ingest records into the Feature Groups
After the Feature Groups have been created, we can put data into each store by using the PutRecord API. This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks. The files will be written to the offline store within a few minutes of ingestion.

In [16]:
if "claims_table" in locals():
    print(
        "You may have already ingested the data into your Feature Groups. If you'd like to do this again, you can run the ingest methods outside of the 'if/else' statement."
    )

else:
    claims_feature_group.ingest(data_frame=claims_preprocessed, max_workers=3, wait=True)

    customers_feature_group.ingest(data_frame=customers_preprocessed, max_workers=3, wait=True);
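
`ingest()` issues PutRecord calls on your behalf. For reference, the sketch below shows the shape of a single PutRecord payload built from one row. `to_put_record_payload` is a hypothetical helper, and dropping `None` values is an assumption of this sketch rather than documented SDK behavior.

```python
def to_put_record_payload(row):
    """Convert one row (a dict of feature name -> value) to the PutRecord `Record` list."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # assumption: missing values are simply left out of the record
    ]


# Hypothetical usage with the runtime client created earlier:
# row = claims_preprocessed.iloc[0].to_dict()
# featurestore_runtime.put_record(
#     FeatureGroupName=claims_fg_name, Record=to_put_record_payload(row)
# )
```

Note that pandas represents missing numbers as `NaN` rather than `None`, so a real row may need an extra check before it is sent.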

### Wait for offline store data to become available
This usually takes 5-8 minutes.

In [17]:
if 'claims_table' not in locals():
    claims_table = (
        claims_feature_group.describe()["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
    )
if 'customers_table' not in locals():
    customers_table = (
        customers_feature_group.describe()["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
    )

claims_feature_group_s3_prefix = (
    f"{prefix}/{account_id}/sagemaker/{region}/offline-store/{claims_table}/data"
)
customers_feature_group_s3_prefix = (
    f"{prefix}/{account_id}/sagemaker/{region}/offline-store/{customers_table}/data"
)

offline_store_contents = None
while offline_store_contents is None:
    objects_in_bucket = s3_client.list_objects(
        Bucket=bucket, Prefix=customers_feature_group_s3_prefix
    )
    if "Contents" in objects_in_bucket and len(objects_in_bucket["Contents"]) > 1:
        offline_store_contents = objects_in_bucket["Contents"]
    else:
        print("Waiting for data in offline store...")
        time.sleep(60)

print("\nData available.")

Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...
Waiting for data in offline store...

Data available.


<a id='aud-dataset'></a>

## Create train and test datasets

[overview](#all-up-overview)
----
Once the data is available in the offline store, it will automatically be cataloged and loaded into an Athena table (this is done by default, but can be turned off). To build our training and test datasets, you will submit a SQL query that joins the Claims and Customers tables created in Athena.

In [18]:
claims_query = claims_feature_group.athena_query()
customers_query = customers_feature_group.athena_query()

claims_table = claims_query.table_name
customers_table = customers_query.table_name
database_name = customers_query.database
%store claims_table
%store customers_table
%store database_name

feature_columns = list(set(claims_preprocessed.columns) ^ set(customers_preprocessed.columns))
feature_columns_string = ", ".join(f'"{c}"' for c in feature_columns)
feature_columns_string = f'"{claims_table}".policy_id as policy_id, ' + feature_columns_string

query_string = f"""
SELECT DISTINCT {feature_columns_string}
FROM "{claims_table}" LEFT JOIN "{customers_table}" 
ON "{claims_table}".policy_id = "{customers_table}".policy_id
"""
print(f"Query String: {query_string}")

Stored 'claims_table' (str)
Stored 'customers_table' (str)
Stored 'database_name' (str)
Query String: 
SELECT DISTINCT "fraud-detect-demo-claims-1636727045".policy_id as policy_id, "months_as_customer", "vehicle_claim", "collision_type_na", "total_claim_amount", "policy_state_ca", "policy_liability", "authorities_contacted_police", "collision_type_front", "incident_type_theft", "incident_type_breakin", "policy_state_wa", "incident_day", "fraud", "collision_type_rear", "num_witnesses", "policy_state_nv", "driver_relationship_spouse", "num_insurers_past_5_years", "customer_gender_female", "incident_hour", "incident_severity", "collision_type_side", "policy_state_id", "authorities_contacted_fire", "driver_relationship_child", "injury_claim", "policy_state_or", "incident_dow", "driver_relationship_self", "customer_gender_male", "incident_month", "num_claims_past_year", "num_injuries", "authorities_contacted_none", "auto_year", "num_vehicles_involved", "authorities_contacted_ambulance", "in
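
The column list in the query comes from the `^` operator on two sets, which keeps names that appear in exactly one table and so drops the shared `policy_id` and `event_time` columns. The toy example below uses a made-up subset of column names to show the effect. It sorts the result for a stable order, whereas the cell above uses `list()`, so its order can vary between runs.

```python
# Toy subsets of the two tables' columns (illustrative, not the full schemas).
claims_columns = ["policy_id", "event_time", "fraud", "incident_severity"]
customers_columns = ["policy_id", "event_time", "customer_age", "customer_education"]

# `^` is the set symmetric difference: names that appear in exactly one of the two tables.
feature_columns = sorted(set(claims_columns) ^ set(customers_columns))

# Quote each name the same way the query cell does.
select_list = ", ".join(f'"{c}"' for c in feature_columns)
print(select_list)
```

The join key is then added back explicitly, qualified with the claims table name, because an unqualified `policy_id` would be ambiguous in the join.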

In [19]:
claims_query.run(query_string=query_string, output_location=f"s3://{bucket}/{prefix}/query_results")
claims_query.wait()
dataset = claims_query.as_dataframe()

In [20]:
dataset.to_csv("./data/claims_customer.csv")

In [21]:
col_order = ["fraud"] + list(dataset.drop(["fraud", "policy_id"], axis=1).columns)
%store col_order

train = dataset.sample(frac=0.80, random_state=0)[col_order]
test = dataset.drop(train.index)[col_order]
%store train

Stored 'col_order' (list)
Stored 'train' (DataFrame)
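
The split above takes a random 80% sample for training and uses the remaining rows for testing. The sketch below is a plain-Python analogue of that idea using the standard `random` module instead of pandas, so it will not reproduce the exact rows that `dataset.sample(frac=0.80, random_state=0)` selects. `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, frac=0.80, seed=0):
    """Return (train, test) index lists: a random `frac` sample and the remainder."""
    rng = random.Random(seed)
    train = rng.sample(range(n_rows), k=round(n_rows * frac))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


# Assuming a 5,000-row dataset, this yields 4,000 training and 1,000 test indices.
train_idx, test_idx = split_indices(5000)
print(len(train_idx), len(test_idx))
```

Because the test set is defined as "everything not sampled", the two sets cannot overlap.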


In [22]:
train.shape

(4000, 46)

### Write train, test data to S3

In [23]:
dataset

Unnamed: 0,policy_id,months_as_customer,vehicle_claim,collision_type_na,total_claim_amount,policy_state_ca,policy_liability,authorities_contacted_police,collision_type_front,incident_type_theft,...,authorities_contacted_ambulance,incident_type_collision,policy_deductable,policy_annual_premium,policy_state_az,driver_relationship_na,driver_relationship_other,customer_education,police_report_available,customer_age
0,2013,166,9955.0,0,25455.0,1,0,0,0,0,...,0,1,750,2900,0,0,0,4,0,46
1,149,45,33786.0,0,36786.0,1,0,1,0,0,...,0,1,750,3000,0,0,0,4,1,27
2,3857,367,9655.0,0,20355.0,1,0,1,1,0,...,0,1,750,3000,0,0,0,1,1,48
3,3876,32,9512.0,1,14812.0,0,1,1,0,0,...,0,0,750,3000,0,1,0,4,0,27
4,4226,4,14889.0,0,23889.0,1,0,1,1,0,...,0,1,750,3000,0,0,0,4,1,24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,1179,120,35732.0,0,45732.0,0,2,1,0,0,...,0,1,750,3000,0,0,1,2,1,50
4996,1181,156,8265.0,0,36565.0,0,1,0,0,0,...,0,1,750,2600,0,0,0,4,0,55
4997,1248,118,10515.0,0,31515.0,1,2,1,1,0,...,0,1,750,2900,0,0,0,3,1,43
4998,4888,33,27522.0,0,41222.0,0,0,1,1,0,...,0,1,750,3000,1,0,0,4,1,53


In [24]:
train.to_csv("data/train.csv", index=False)
test.to_csv("data/test.csv", index=False)
dataset.to_csv("data/dataset.csv", index=True)

In [32]:
s3_client.upload_file(Filename="data/train.csv", Bucket=bucket, Key=f"{prefix}/data/train/train.csv")
s3_client.upload_file(Filename="data/test.csv", Bucket=bucket, Key=f"{prefix}/data/test/test.csv")
s3_client.upload_file(Filename="data/dataset.csv", Bucket=bucket, Key=f"{prefix}/data/dataset/dataset.csv")
train_data_uri = f"s3://{bucket}/{prefix}/data/train/train.csv"
test_data_uri = f"s3://{bucket}/{prefix}/data/test/test.csv"
dataset_uri = f"s3://{bucket}/{prefix}/data/dataset/dataset.csv"
%store train_data_uri
%store test_data_uri
%store dataset_uri

Stored 'train_data_uri' (str)
Stored 'test_data_uri' (str)
Stored 'dataset_uri' (str)
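
Since the upload key and the stored URI must point at the same object, one option is to derive both from a single function. The helper below, `s3_key_and_uri`, is a hypothetical sketch that follows the `<prefix>/data/<split>/<filename>` layout used by the upload calls above.

```python
def s3_key_and_uri(bucket, prefix, split, filename):
    """Return (key, uri) for a file stored under <prefix>/data/<split>/<filename>."""
    key = f"{prefix}/data/{split}/{filename}"
    return key, f"s3://{bucket}/{key}"


# Hypothetical usage for the test split:
# key, test_data_uri = s3_key_and_uri(bucket, prefix, "test", "test.csv")
# s3_client.upload_file(Filename="data/test.csv", Bucket=bucket, Key=key)
```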


In [29]:
train

Unnamed: 0,fraud,months_as_customer,vehicle_claim,collision_type_na,total_claim_amount,policy_state_ca,policy_liability,authorities_contacted_police,collision_type_front,incident_type_theft,...,authorities_contacted_ambulance,incident_type_collision,policy_deductable,policy_annual_premium,policy_state_az,driver_relationship_na,driver_relationship_other,customer_education,police_report_available,customer_age
398,0,92,8590.0,1,46690.0,1,2,1,0,0,...,0,0,750,2900,0,1,0,1,0,28
3833,0,71,9056.0,1,20156.0,1,2,0,0,0,...,0,0,750,2700,0,1,0,0,0,32
4836,0,65,9720.0,0,37020.0,0,2,0,1,0,...,0,1,750,3000,0,0,0,3,0,38
4572,0,50,12323.0,0,26623.0,1,2,1,1,0,...,0,1,750,2650,0,0,0,3,0,54
636,0,129,9333.0,0,90133.0,1,0,1,0,0,...,0,1,750,3000,0,0,0,4,1,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2061,0,46,13139.0,1,76139.0,1,0,1,0,0,...,0,0,750,3000,0,1,0,3,0,35
4211,0,52,16206.0,0,35406.0,1,0,1,0,0,...,0,1,750,2750,0,0,0,3,0,36
2018,1,75,20560.0,0,22460.0,0,1,1,0,0,...,0,1,750,3000,0,0,0,3,0,31
3940,0,80,15785.0,0,23285.0,0,1,0,1,0,...,0,1,750,3000,0,0,0,1,0,46


In [30]:
test

Unnamed: 0,fraud,months_as_customer,vehicle_claim,collision_type_na,total_claim_amount,policy_state_ca,policy_liability,authorities_contacted_police,collision_type_front,incident_type_theft,...,authorities_contacted_ambulance,incident_type_collision,policy_deductable,policy_annual_premium,policy_state_az,driver_relationship_na,driver_relationship_other,customer_education,police_report_available,customer_age
0,1,166,9955.0,0,25455.0,1,0,0,0,0,...,0,1,750,2900,0,0,0,4,0,46
7,0,140,6568.0,0,71668.0,1,1,0,0,0,...,0,1,750,3000,0,0,0,4,0,37
21,0,178,12624.0,0,52424.0,1,3,1,0,0,...,0,1,750,2550,0,0,0,3,1,53
24,0,71,14489.0,0,34389.0,1,0,1,0,0,...,0,1,750,2700,0,0,0,1,1,31
25,0,229,1000.0,0,6100.0,1,2,0,0,0,...,0,1,750,2400,0,0,0,1,0,48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4978,0,34,10757.0,0,145557.0,0,1,0,1,0,...,0,1,750,3000,0,0,0,3,0,22
4985,0,39,7547.0,0,18847.0,1,0,1,0,0,...,0,1,750,3000,0,0,0,3,1,33
4986,0,52,15387.0,0,47787.0,1,0,1,1,0,...,0,1,750,3000,0,0,0,1,0,22
4991,0,53,31191.0,0,45491.0,1,0,1,0,0,...,0,1,750,2850,0,0,0,1,1,30


----

### Next Notebook: [Train, Check Bias, Tune, Record Lineage, Register Model](./04-Lineage-Train-Assess-Bias-Tune-Registry.ipynb)