# Part 1: Data Preparation, Process, and Store Features

<a id='all-up-overview'></a>

## [Overview](./0-AutoClaimFraudDetection.ipynb)
* Notebook 1/2: Overview, Architecture and Data Exploration
* **[Notebook 3: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)**
  * **[Architecture](#arch)**
  * **[Getting started](#aud-getting-started)**
  * **[DataSets](#aud-datasets)**
  * **[SageMaker Feature Store](#aud-feature-store)**
  * **[Create train and test datasets](#aud-dataset)**
* Notebook 4: Train, Check Bias, Tune, Record Lineage, and Register a Model
* Notebook 5: Mitigate Bias, Train New Model, Store in Registry
* Notebook 6: Deploy Model, Run Predictions
* Notebook 7: Create and Run an End-to-End Pipeline to Deploy the Model

This notebook performs the data preparation phase of the ML life cycle. The main data wrangling, data ingestion, and transformations are done through the SageMaker Studio Data Wrangler GUI.

In this notebook, we take the `.flow` files that define those transformations and apply them, using a SageMaker Processing job, to the raw `.csv` files stored in the S3 bucket.

<a id='arch'> </a>

## Architecture for Data Prep, Process and Store Features
[overview](#all-up-overview)
----
![Data Prep and Store](./images/e2e-1-pipeline-v3b.png)

### Install required and/or update third-party libraries

In [47]:
import warnings
warnings.filterwarnings('ignore')

### Loading stored variables
If you ran this notebook before, you may want to reuse the resources you already created with AWS. Run the cell below to load any previously created variables. You should see a print-out of the existing variables. If nothing is printed, this is probably the first time you are running the notebook.

In [64]:
%store -r
%store

Stored variables and their in-db values:
bucket             -> 'sagemaker-us-east-1-875692608981'
prefix             -> 'fraud-detect-demo'


**<font color='red'>Important</font>: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**

### Import libraries

In [49]:
import json
import boto3
import sagemaker

<a id='aud-getting-started'></a>

## Getting started: Creating Resources

[overview](#all-up-overview)
----
To run this notebook successfully, you need to create some AWS resources.
First, an S3 bucket is created to store all the data for this tutorial.
Once the bucket exists, you need to create an AWS Glue role using the IAM console, then attach a policy that gives this notebook access to Feature Store (see the steps below). If you have already run this notebook and are picking up where you left off, running the cells below reuses the resources you already created without creating any new ones.

#### Add FeatureStore policy to Studio's execution role

![title](images/iam-policies.png)


1. In a separate browser tab, go to the IAM section of the AWS Console
2. Navigate to the Roles section and select the execution role you're using for your SageMaker Studio user
    * If you're not sure what role you're using, run the cell below to print it out
3. Attach the <font color='green'>AmazonSageMakerFeatureStoreAccess</font> policy to this role. Once attached, the changes take effect immediately.
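
The console steps above can also be scripted. The following is a minimal sketch, assuming the caller has the `iam:AttachRolePolicy` permission (a Studio execution role often does not, in which case the console route is still needed). The helper names `role_name_from_arn` and `attach_feature_store_policy` are hypothetical; `attach_role_policy` is the boto3 IAM client call.

```python
FEATURE_STORE_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFeatureStoreAccess"


def role_name_from_arn(role_arn):
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    # ARN layout: arn:aws:iam::<account-id>:role/<optional/path/><role-name>
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("role/"):
        raise ValueError(f"Not an IAM role ARN: {role_arn}")
    return parts[5].rsplit("/", 1)[-1]


def attach_feature_store_policy(role_arn):
    """Attach the managed Feature Store policy to the given role (calls AWS)."""
    import boto3  # imported lazily so the ARN helper works without AWS access

    iam = boto3.client("iam")
    iam.attach_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyArn=FEATURE_STORE_POLICY_ARN,
    )
```

In this notebook the call would be `attach_feature_store_policy(sagemaker_role)`, once `sagemaker_role` has been set from `sagemaker.get_execution_role()`.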

### Set region, boto3 and SageMaker SDK variables

In [50]:
# You can change this to a region of your choice

region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

Using AWS Region: us-east-1


In [51]:
boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)
s3_client = boto3.client("s3", region_name=region)
sagemaker_client = boto_session.client("sagemaker")
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_client
)

In [52]:
sagemaker_role = sagemaker.get_execution_role()
print(f"Sagemaker Role: {sagemaker_role}")
account_id = boto3.client("sts").get_caller_identity()["Account"]

Sagemaker Role: arn:aws:iam::875692608981:role/AmazonSageMaker-ExecutionRoleMLT


### Create a directory in the SageMaker default bucket for this tutorial

In [63]:
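if 'bucket' not in locals():
    # default_bucket() returns the default SageMaker bucket for this account and
    # region, creating it if it does not exist, so no separate create_bucket call
    # is needed (an unconfigured create_bucket call fails outside us-east-1).
    bucket = sagemaker_session.default_bucket()
    prefix = 'fraud-detect-demo'
    %store bucket
    %store prefix
    print(f'Sagemaker Bucket: {bucket}')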

### Upload raw data to S3
Before you can preprocess the raw data with Data Wrangler, it must exist in S3.

In [44]:
s3_client.upload_file(
    Filename="data/claims.csv", Bucket=bucket, Key=f"{prefix}/data/raw/claims.csv"
)
s3_client.upload_file(
    Filename="data/customers.csv", Bucket=bucket, Key=f"{prefix}/data/raw/customers.csv"
)
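
To confirm that both files landed where Data Wrangler expects them, the keys can be built in one place and checked with `head_object`. This is a minimal sketch: `raw_data_key` and `verify_raw_uploads` are hypothetical helper names, and the check assumes the `s3_client`, `bucket`, and `prefix` variables defined earlier in this notebook.

```python
def raw_data_key(prefix, filename):
    """Build the S3 key this notebook uses for a raw data file."""
    return f"{prefix}/data/raw/{filename}"


def verify_raw_uploads(s3_client, bucket, prefix,
                       filenames=("claims.csv", "customers.csv")):
    """Return {key: size_in_bytes} for each raw file; raises if one is missing."""
    sizes = {}
    for name in filenames:
        key = raw_data_key(prefix, name)
        # head_object raises botocore.exceptions.ClientError if the key is absent
        response = s3_client.head_object(Bucket=bucket, Key=key)
        sizes[key] = response["ContentLength"]
    return sizes
```

Calling `verify_raw_uploads(s3_client, bucket, prefix)` after the upload cell should return one entry per file, each with a non-zero size.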

### Update attributes within the `.flow` file
Data Wrangler generates a `.flow` file that contains a reference to the S3 bucket used during the wrangling. This bucket may differ from your default bucket in this notebook. For example, if the wrangling was done by someone else, you will probably not have access to their bucket, so you need to point the `.flow` file at your own S3 bucket before you can load it into Data Wrangler or access the data.

After running the cell below, you can open the `claims.flow` and `customers.flow` files and export the data to S3, or you can continue the guide using the provided `data/claims_preprocessed.csv` and `data/customers_preprocessed.csv` files.

In [45]:
import string  # needed for string.Template; not imported earlier in this notebook

# Substitute the bucket and prefix variables into the claims flow template
claims_flow_template_file = "flows/claims_flow_template"

with open(claims_flow_template_file, "r") as f:
    variables = {"bucket": bucket, "prefix": prefix}
    template = string.Template(f.read())
    claims_flow = template.substitute(variables)
    claims_flow = json.loads(claims_flow)

# Write the rendered flow so it can be opened in Data Wrangler
with open("flows/claims.flow", "w") as f:
    json.dump(claims_flow, f)

# Repeat the same substitution for the customers flow template
customers_flow_template_file = "flows/customers_flow_template"

with open(customers_flow_template_file, "r") as f:
    variables = {"bucket": bucket, "prefix": prefix}
    template = string.Template(f.read())
    customers_flow = template.substitute(variables)
    customers_flow = json.loads(customers_flow)

with open("flows/customers.flow", "w") as f:
    json.dump(customers_flow, f)
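
To see what this substitution does in isolation, here is a minimal sketch on a made-up template. The JSON layout is a stand-in and not the real Data Wrangler flow schema, and `render_flow` is a hypothetical helper. The only thing carried over from the cell above is that `string.Template` fills in `bucket` and `prefix` placeholders before the text is parsed as JSON.

```python
import json
import string

# Hypothetical stand-in for a flow template; the real file has a much larger schema.
template_text = '{"source": {"s3Uri": "s3://${bucket}/${prefix}/data/raw/claims.csv"}}'


def render_flow(text, bucket, prefix):
    """Fill in the placeholders and parse the result as JSON."""
    # substitute() raises KeyError if the template names a placeholder not supplied
    rendered = string.Template(text).substitute(bucket=bucket, prefix=prefix)
    return json.loads(rendered)


flow = render_flow(template_text, "my-bucket", "fraud-detect-demo")
print(flow["source"]["s3Uri"])
# prints: s3://my-bucket/fraud-detect-demo/data/raw/claims.csv
```

Because `substitute` is strict, a missing variable or a malformed `$` placeholder fails loudly instead of writing a half-filled flow file. `safe_substitute` would leave unknown placeholders in place, which is usually not what you want here.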

----

### Next Notebook: [Train, Check Bias, Tune, Record Lineage, Register Model](./04-Lineage-Train-Assess-Bias-Tune-Registry.ipynb)