# Create a SageMaker Pipeline to Automate All the Steps from Data Prep to Model Deployment

Reference: 
- https://sagemaker-examples.readthedocs.io/en/latest/end_to_end/fraud_detection/pipeline-e2e.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html


Steps:
- Claims Data Wrangler Preprocessing Step
- Customers Data Wrangler Preprocessing Step
- Create Dataset and Train/Test Split
- Train XGBoost Model
- Model Pre-Deployment Step
- Run Bias Metrics with Clarify
- Register Model
- Deploy Model
- Combine and Run the Pipeline Steps



In [20]:
import sagemaker 
import boto3 
import pandas as pd 


In [34]:
REGION = sagemaker.session.Session().boto_region_name
print("REGION: ", REGION) 

boto3_session = boto3.Session(region_name=REGION)

sagemaker_boto3_client = boto3_session.client("sagemaker")
s3_boto3_client = boto3_session.client("s3")
sagemaker_session = sagemaker.session.Session(boto_session=boto3_session, sagemaker_client=sagemaker_boto3_client)

BUCKET = sagemaker_session.default_bucket()
PREFIX = "FraudDetection_AutoInsurance"

ROLE=sagemaker.get_execution_role()
print("ROLE: ", ROLE)
print("BUCKET: ", BUCKET) 
print("PREFIX: ", PREFIX) 

s3_data_dir_uri = f"s3://{BUCKET}/{PREFIX}/data"
print(s3_data_dir_uri)

REGION:  us-east-1
ROLE:  arn:aws:iam::205930620783:role/service-role/AmazonSageMaker-ExecutionRole-20250401T145997
BUCKET:  sagemaker-us-east-1-205930620783
PREFIX:  FraudDetection_AutoInsurance
s3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data


## Pipeline Parameters

Pipeline parameters are conceptually similar to command-line arguments (for example, those parsed with `argparse`) in a Python script: both let external users or systems supply input values at run time instead of hardcoding them.

Unlike command-line arguments, pipeline parameters are also logged and tracked automatically with each execution.
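
To make the analogy concrete, here is a minimal sketch of the `argparse` side of the comparison, using only the standard library. The flag names simply mirror two of the pipeline parameter names defined in this notebook, and `ml.m5.2xlarge` is only an illustrative override value; nothing here talks to SageMaker.

```python
import argparse


def parse_args(argv):
    """Parse command-line style arguments, falling back to defaults."""
    parser = argparse.ArgumentParser()
    # Flag names mirror the pipeline parameter names for illustration only.
    parser.add_argument("--DataProcessingInstance", default="ml.m5.xlarge")
    parser.add_argument("--ModelApprovalStatus", default="PendingManualApproval")
    return parser.parse_args(argv)


# No arguments supplied: every value comes from its default.
defaults = parse_args([])

# One argument supplied: only that value is overridden.
overridden = parse_args(["--DataProcessingInstance", "ml.m5.2xlarge"])

print(defaults.DataProcessingInstance)    # ml.m5.xlarge
print(overridden.DataProcessingInstance)  # ml.m5.2xlarge
print(overridden.ModelApprovalStatus)     # PendingManualApproval
```

A pipeline parameter behaves the same way: the `default_value` is used unless a value is supplied when the pipeline is started.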


In [43]:
from sagemaker.workflow.parameters import (    
    ParameterInteger, ParameterFloat, ParameterString
)

p_processing_instance_type = ParameterString(name="DataProcessingInstance", default_value="ml.m5.xlarge")  # "ml.t2.medium" is not supported for processing jobs
p_s3_data_dir_uri = ParameterString(name="s3DataDirURI", default_value=s3_data_dir_uri)
p_train_instance_type = ParameterString(name="TrainingInstance", default_value="ml.t2.medium")
p_model_appoval_status = ParameterString(name="ModelApprovalStatus", default_value="PendingManualApproval")

## Data Preprocessing Step

For now, create a single processing step that writes the processed train and test CSV files to S3 and exposes their location as the step output.

### sagemaker.processing.Processor vs. sagemaker.sklearn.processing.SKLearnProcessor
-  **Processor** is the base class in the SageMaker SDK for running arbitrary processing jobs. It gives you full control over the Docker image, entry point, and environment variables, but you must configure everything yourself, including the image URI.
-  **SKLearnProcessor** is a pre-configured subclass of Processor. It automatically selects the scikit-learn image and handles dependencies, and it is designed specifically for scikit-learn based workflows.
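
For contrast with the `SKLearnProcessor` used in this notebook, the following is a rough sketch of what configuring the base `Processor` by hand might look like. The helper name `make_generic_processor` and its arguments are hypothetical; the caller has to supply the image URI and the entry point explicitly, which is exactly the work `SKLearnProcessor` does for you. The `sagemaker` import is deferred into the function so the definition can be read and loaded without the SDK installed.

```python
def make_generic_processor(role, image_uri, entrypoint,
                           instance_type="ml.m5.xlarge",
                           sagemaker_session=None):
    """Build a base Processor; every container detail is supplied by the caller.

    Hypothetical helper for illustration. `image_uri` and `entrypoint` are
    not inferred: with the base class you must provide both yourself.
    """
    # Deferred import: only needed when the helper is actually called.
    from sagemaker.processing import Processor

    return Processor(
        role=role,
        image_uri=image_uri,        # e.g. a URI looked up with sagemaker.image_uris.retrieve
        entrypoint=entrypoint,      # command list run inside the container
        instance_count=1,
        instance_type=instance_type,
        sagemaker_session=sagemaker_session,
    )
```

With `SKLearnProcessor`, passing `framework_version` is enough for the SDK to resolve the image, which is why it is the more convenient choice for a scikit-learn script.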
  

In [44]:
from sagemaker.sklearn.processing import SKLearnProcessor 
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# first configure the SKLearnProcessor Class
data_processor = SKLearnProcessor(
    framework_version='0.23-1',
    role=ROLE,
    instance_type=p_processing_instance_type,
    instance_count=1,
    base_job_name='data_processing_job',
    sagemaker_session=sagemaker_session
)

dataset_step_01 = ProcessingStep(
    name="ProcessData",
    processor=data_processor,
    code="data_processing_script.py",
    inputs=[
        sagemaker.processing.ProcessingInput(source=p_s3_data_dir_uri, destination="/opt/ml/processing/input/")
    ],
    outputs=[
        sagemaker.processing.ProcessingOutput(source="/opt/ml/processing/output/", destination=p_s3_data_dir_uri)
    ],
    #job_arguments=[
    #    
    #]
)
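
The contents of `data_processing_script.py` are not shown in this notebook. As a rough idea of what such a script could do, the sketch below splits one CSV file into `train.csv` and `test.csv` using only the standard library. The function name `split_csv`, the output file names, the 80/20 split, and the seed are all assumptions made for illustration, not the actual script.

```python
import csv
import random
from pathlib import Path


def split_csv(input_csv, output_dir, test_fraction=0.2, seed=42):
    """Shuffle the rows of one CSV and write train.csv / test.csv.

    Hypothetical sketch: the header row is copied into both output files.
    Returns the number of data rows written to each file.
    """
    with open(input_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Seeded shuffle so the split is reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    splits = {"test.csv": rows[:n_test], "train.csv": rows[n_test:]}

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for name, part in splits.items():
        with open(output_dir / name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)

    return {name: len(part) for name, part in splits.items()}
```

Inside the processing container, the input file would be read from under `/opt/ml/processing/input/` and the results written to `/opt/ml/processing/output/`, matching the `ProcessingInput` and `ProcessingOutput` paths configured in the step.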


## Combine and Run the Pipeline Steps


In [45]:
# Compose and create the pipeline
import json

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "FraudDetection-AutoInsurance"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[p_processing_instance_type, p_s3_data_dir_uri, p_train_instance_type, p_model_appoval_status],
    steps=[
        dataset_step_01
    ]
)

# Create the pipeline, or update it if one with this name already exists
pipeline.upsert(role_arn=ROLE)

# The definition is returned as a JSON string; parse it before printing
print(json.loads(pipeline.describe()["PipelineDefinition"]))

{'Version': '2020-12-01', 'Metadata': {}, 'Parameters': [{'Name': 'DataProcessingInstance', 'Type': 'String', 'DefaultValue': 'ml.m5.xlarge'}, {'Name': 's3DataDirURI', 'Type': 'String', 'DefaultValue': 's3://sagemaker-us-east-1-205930620783/FraudDetection_AutoInsurance/data'}, {'Name': 'TrainingInstance', 'Type': 'String', 'DefaultValue': 'ml.t2.medium'}, {'Name': 'ModelApprovalStatus', 'Type': 'String', 'DefaultValue': 'PendingManualApproval'}], 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'}, 'TrialName': {'Get': 'Execution.PipelineExecutionId'}}, 'Steps': [{'Name': 'ProcessData', 'Type': 'Processing', 'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.DataProcessingInstance'}, 'InstanceCount': 1, 'VolumeSizeInGB': 30}}, 'AppSpecification': {'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3', 'ContainerEntrypoint': ['python3', '/opt/ml/processing/input/code/data_pro
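
The raw definition is hard to read as a single printed dict. One possible way to inspect it is to pull out just the parameter defaults and the step names, as in the sketch below. The helper name `summarize_definition` is hypothetical, and the `sample` string is a trimmed-down stand-in that reuses only a few fields visible in the output above.

```python
import json


def summarize_definition(definition_json):
    """Return (parameter defaults, [(step name, step type), ...])."""
    definition = json.loads(definition_json)
    params = {
        p["Name"]: p.get("DefaultValue")
        for p in definition.get("Parameters", [])
    }
    steps = [(s["Name"], s["Type"]) for s in definition.get("Steps", [])]
    return params, steps


# Trimmed stand-in for pipeline.describe()["PipelineDefinition"].
sample = json.dumps({
    "Version": "2020-12-01",
    "Parameters": [
        {"Name": "DataProcessingInstance", "Type": "String",
         "DefaultValue": "ml.m5.xlarge"},
        {"Name": "ModelApprovalStatus", "Type": "String",
         "DefaultValue": "PendingManualApproval"},
    ],
    "Steps": [{"Name": "ProcessData", "Type": "Processing"}],
})

params, steps = summarize_definition(sample)
print(json.dumps(params, indent=2))
print(steps)
```

In the notebook itself, the same helper could be called as `summarize_definition(pipeline.describe()["PipelineDefinition"])`.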

In [46]:
# Run the pipeline. Parameters passed at run time override the default values.
# parameters = {"s3DataDirURI": s3_data_dir_uri}
pipeline.start()  # or: pipeline.start(parameters=parameters)

_PipelineExecution(arn='arn:aws:sagemaker:us-east-1:205930620783:pipeline/FraudDetection-AutoInsurance/execution/jgucu33x8y84', sagemaker_session=<sagemaker.session.Session object at 0x7fdd0bff7470>)
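
When overriding parameters at run time, a misspelled parameter name is an easy mistake to make. The sketch below shows one way to check overrides against the declared names before starting the pipeline. The helper `build_overrides` is hypothetical, the S3 URI is a placeholder, and `ml.m5.2xlarge` is only an example value.

```python
def build_overrides(declared_defaults, overrides):
    """Validate run-time overrides against the declared pipeline parameters.

    Raises ValueError for names that were never declared, and drops
    overrides that are identical to the default value.
    """
    unknown = sorted(set(overrides) - set(declared_defaults))
    if unknown:
        raise ValueError(f"Unknown pipeline parameter(s): {', '.join(unknown)}")
    return {
        name: value
        for name, value in overrides.items()
        if declared_defaults[name] != value
    }


# Names and defaults mirror the parameters declared earlier in the notebook.
declared = {
    "DataProcessingInstance": "ml.m5.xlarge",
    "s3DataDirURI": "s3://<bucket>/FraudDetection_AutoInsurance/data",  # placeholder
    "TrainingInstance": "ml.t2.medium",
    "ModelApprovalStatus": "PendingManualApproval",
}

run_parameters = build_overrides(
    declared,
    {"DataProcessingInstance": "ml.m5.2xlarge",
     "ModelApprovalStatus": "PendingManualApproval"},
)
print(run_parameters)  # {'DataProcessingInstance': 'ml.m5.2xlarge'}
```

The resulting dict could then be passed as `pipeline.start(parameters=run_parameters)`.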