# Lab 10: Using SageMaker Pipelines and the SageMaker Model Registry with SageMaker Studio

In this lab, you create and run an Amazon SageMaker pipeline and monitor the pipeline's progress. You also locate and explore some of the artifacts that the machine learning (ML) process uses or generates.

If time permits, you can also review the lineage details for the model that the pipeline generated.

## Task 2.1 Environment setup

Before you create your SageMaker pipeline, you must prepare the environment by installing necessary packages, importing modules, and staging supporting files. This pipeline was designed to use a feature group, so you also create a feature group in Amazon SageMaker Feature Store and run a Data Wrangler flow to prepare your environment. 

Run the cells in this task to do the following:
- Install dependencies.
- Import required modules.
- Copy data and code to Amazon Simple Storage Service (Amazon S3).
- Create a feature group.
- Ingest features into the feature group.

### Task 2.1.1 Install dependencies

In [None]:
#install dependencies
%pip install --upgrade pip 
%pip install pytest-astropy==0.7.0
%pip install rsa==4.7.2
%pip install PyYAML
!apt update && apt install -y git
%pip install git+https://github.com/aws-samples/ml-lineage-helper

### Task 2.1.2 Import modules

In [None]:
#import-modules
import os
import json
import boto3
import sagemaker
import sagemaker_datawrangler
import sagemaker.session
import datetime as dt
import pandas as pd
import time
from time import gmtime, strftime
import uuid
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.model_metrics import (
    MetricsSource,
    ModelMetrics,
)
from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.conditions import ConditionGreaterThan
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import (
    ProcessingStep,
    TrainingStep,
)
from sagemaker.workflow.condition_step import (
    ConditionStep,
    JsonGet,
)
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.model import Model
from sagemaker.workflow.steps import CreateModelStep
from sagemaker.inputs import CreateModelInput
from sagemaker.inputs import TransformInput
from sagemaker.workflow.steps import TransformStep
from sagemaker.transformer import Transformer
from sagemaker.pytorch.estimator import PyTorch
from sagemaker.tuner import HyperparameterTuner
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TuningStep
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from ml_lineage_helper import *
from sagemaker.feature_store.feature_definition import FeatureDefinition
from sagemaker.feature_store.feature_definition import FeatureTypeEnum
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.session import Session
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import FeatureStoreOutput
from sagemaker.processing import Processor
from sagemaker.network import NetworkConfig
from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition, DatasetDefinition, RedshiftDatasetDefinition

In [None]:
#create sessions
boto_session = boto3.Session()
sagemaker_session = sagemaker.Session()


In [None]:
#create clients
s3_client = boto3.client('s3')
featurestore_runtime = boto3.client('sagemaker-featurestore-runtime')
sagemaker_client = boto3.client('sagemaker')

In [None]:
#feature store session
feature_store_session = Session(
    boto_session = boto_session,
    sagemaker_client = sagemaker_client,
    sagemaker_featurestore_runtime_client = featurestore_runtime
)

In [None]:
#set global variables
default_bucket = sagemaker_session.default_bucket()
region = boto_session.region_name
role = sagemaker.get_execution_role()

### Task 2.1.3 Copy lab files to Amazon S3 

In [None]:
# Upload files to default bucket
s3_client.put_object(Bucket = default_bucket, Key = 'data/')
s3_client.put_object(Bucket = default_bucket, Key = 'input/code/')
s3_client.upload_file('pipelines/data/storedata_total.csv', default_bucket, 'data/storedata_total.csv')
s3_client.upload_file('pipelines/input/code/evaluate.py', default_bucket, 'input/code/evaluate.py')
s3_client.upload_file('pipelines/input/code/generate_config.py', default_bucket, 'input/code/generate_config.py')
s3_client.upload_file('pipelines/input/code/processfeaturestore.py', default_bucket, 'input/code/processfeaturestore.py')

# Preview the dataset
print('Dataset preview:')
customer_data = pd.read_csv('pipelines/data/storedata_total.csv')
customer_data.head()
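The upload calls above follow one pattern: each S3 key is the local path with the leading `pipelines/` folder removed. The following is a minimal sketch of that mapping that runs without AWS access. The helper name `s3_key_for` is hypothetical and is not part of the lab code.

```python
# Local files staged by the lab (paths copied from the upload cell above).
LOCAL_FILES = [
    "pipelines/data/storedata_total.csv",
    "pipelines/input/code/evaluate.py",
    "pipelines/input/code/generate_config.py",
    "pipelines/input/code/processfeaturestore.py",
]

def s3_key_for(local_path, local_root="pipelines/"):
    """Hypothetical helper: derive the S3 object key from a local path."""
    if not local_path.startswith(local_root):
        raise ValueError(f"{local_path} is not under {local_root}")
    return local_path[len(local_root):]

# Map every local file to the key it is uploaded under.
upload_plan = {path: s3_key_for(path) for path in LOCAL_FILES}
for path, key in upload_plan.items():
    print(f"{path} -> {key}")
```

With a plan like this, the four `upload_file` calls could be written as one loop over `upload_plan.items()`.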

### Task 2.1.4 Create the feature group

In this task, you create a feature group for the data. First, create a schema of the data. For this lab, each schema entry lists the column **name** followed by the variable **type**.

In [None]:
#set-up-feature-store-variables
record_identifier_feature_name = 'FS_ID'
event_time_feature_name = 'FS_time'

column_schemas = [
    {
        "name": "retained",
        "type": "long"
    },
    {
        "name": "esent",
        "type": "long"
    },
    {
        "name": "eopenrate",
        "type": "float"
    },
    {
        "name": "eclickrate",
        "type": "float"
    },
    {
        "name": "avgorder",
        "type": "float"
    },
    {
        "name": "ordfreq",
        "type": "float"
    },
    {
        "name": "paperless",
        "type": "long"
    },
    {
        "name": "refill",
        "type": "long"
    },
    {
        "name": "doorstep",
        "type": "long"
    },
    {
        "name": "first_last_days_diff",
        "type": "long"
    },
    {
        "name": "created_first_days_diff",
        "type": "long"
    },
    {
        "name": "favday_Friday",
        "type": "long"
    },
    {
        "name": "favday_Monday",
        "type": "long"
    },
    {
        "name": "favday_Saturday",
        "type": "long"
    },
    {
        "name": "favday_Sunday",
        "type": "long"
    },
    {
        "name": "favday_Thursday",
        "type": "long"
    },
    {
        "name": "favday_Tuesday",
        "type": "long"
    },
    {
        "name": "favday_Wednesday",
        "type": "long"
    },
    {
        "name": "city_BLR",
        "type": "long"
    },
    {
        "name": "city_BOM",
        "type": "long"
    },
    {
        "name": "city_DEL",
        "type": "long"
    },
    {
        "name": "city_MAA",
        "type": "long"
    },
    {
        "name": "FS_ID",
        "type": "long"
    },
    {
        "name": "FS_time",
        "type": "float"
    }
]
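Before the feature group is created, each schema type is translated to a Feature Store feature type, and the record identifier and event time columns must both be present in the schema. The following sketch shows that check and translation with plain strings standing in for the SDK's `FeatureTypeEnum` values. The shortened schema and the helper name `map_feature_types` are assumptions for illustration.

```python
# Trimmed, illustrative copy of the lab's column_schemas list.
sample_schemas = [
    {"name": "retained", "type": "long"},
    {"name": "eopenrate", "type": "float"},
    {"name": "FS_ID", "type": "long"},
    {"name": "FS_time", "type": "float"},
]

# Plain-string stand-ins for FeatureTypeEnum members (assumption for this sketch).
FEATURE_TYPE_BY_COLUMN_TYPE = {"float": "Fractional", "long": "Integral"}
DEFAULT_FEATURE_TYPE = "String"

def map_feature_types(schemas, record_id_name, event_time_name):
    """Hypothetical helper: validate required columns, then map types."""
    names = [column["name"] for column in schemas]
    missing = [n for n in (record_id_name, event_time_name) if n not in names]
    if missing:
        raise ValueError(f"Schema is missing required features: {missing}")
    return {
        column["name"]: FEATURE_TYPE_BY_COLUMN_TYPE.get(
            column["type"], DEFAULT_FEATURE_TYPE
        )
        for column in schemas
    }

feature_types = map_feature_types(sample_schemas, "FS_ID", "FS_time")
print(feature_types)
```

Any column type that is not `float` or `long` falls back to the string feature type, which mirrors the default used in the lab's feature group cell.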


Now, create the feature group.

In [None]:
# Flow name and a unique ID for this export (used later as the processing job name for the export)
flow_name = 'featureengineer'
flow_export_id = f"{strftime('%d-%H-%M-%S', gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_name = f"flow-{flow_export_id}"

# Feature group name, with flow_name and a unique id. You can give it a customized name
feature_group_name = f"FG-{flow_name}-{str(uuid.uuid4())[:8]}"

# SageMaker Feature Store writes the data in the offline store of a feature group to an
# Amazon S3 location that you own.
feature_store_offline_s3_uri = 's3://' + default_bucket

# Controls if online store is enabled. Enabling the online store allows quick access to 
# the latest value for a record by using the GetRecord API.
enable_online_store = True
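The export and feature group names end in a short random suffix so that rerunning the notebook does not collide with resources from an earlier run. The following sketch isolates that naming pattern. The helper `short_id` and the `example_*` variable names are hypothetical.

```python
import uuid
from time import gmtime, strftime

def short_id():
    """Return the first 8 hexadecimal characters of a random UUID4."""
    return str(uuid.uuid4())[:8]

flow_name = "featureengineer"

# Day-hour-minute-second timestamp (UTC) plus a random suffix.
example_export_name = f"flow-{strftime('%d-%H-%M-%S', gmtime())}-{short_id()}"

# Feature group name: fixed prefix, flow name, random suffix.
example_group_name = f"FG-{flow_name}-{short_id()}"

print(example_export_name)
print(example_group_name)
```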

In [None]:
#create-feature-group
default_feature_type = FeatureTypeEnum.STRING
column_to_feature_type_mapping = {
    "float": FeatureTypeEnum.FRACTIONAL,
    "long": FeatureTypeEnum.INTEGRAL
}

feature_definitions = [
    FeatureDefinition(
        feature_name = column_schema['name'], 
        feature_type = column_to_feature_type_mapping.get(column_schema['type'], default_feature_type)
    ) for column_schema in column_schemas
]


print(f"Feature Group Name: {feature_group_name}")

# Confirm the Athena settings are configured
try:
    boto3.client('athena').update_work_group(
        WorkGroup = 'primary',
        ConfigurationUpdates = {
            'EnforceWorkGroupConfiguration':False
        }
    )
except Exception:
    pass

feature_group = FeatureGroup(
    name = feature_group_name, sagemaker_session = feature_store_session, feature_definitions = feature_definitions)

feature_group.create(
    s3_uri = feature_store_offline_s3_uri,
    record_identifier_name = record_identifier_feature_name,
    event_time_feature_name = event_time_feature_name,
    role_arn = role,
    enable_online_store = enable_online_store
)

def wait_for_feature_group_creation_complete(feature_group):
    """Helper function to wait for the completions of creating a feature group"""
    response = feature_group.describe()
    status = response.get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for feature group creation")
        time.sleep(5)
        response = feature_group.describe()
        status = response.get("FeatureGroupStatus")

    if status != "Created":
        print(f"Failed to create feature group, response: {response}")
        failureReason = response.get("FailureReason", "")
        raise SystemExit(
            f"Failed to create feature group {feature_group.name}, status: {status}, reason: {failureReason}"
        )
    print(f"Feature Group {feature_group.name} successfully created.")

wait_for_feature_group_creation_complete(feature_group = feature_group)
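The helper above is a polling loop: it describes the resource, sleeps, and repeats until the status changes. The following sketch shows the same pattern against a fake `describe` function so that it runs without AWS access. The `max_polls` limit is an addition that the lab helper does not have, and the function name `wait_for_status` is hypothetical.

```python
import time

def wait_for_status(describe, pending="Creating", success="Created",
                    poll_seconds=5, max_polls=60):
    """Poll describe() until the status leaves `pending` or polls run out."""
    for _ in range(max_polls):
        response = describe()
        status = response.get("FeatureGroupStatus")
        if status == pending:
            time.sleep(poll_seconds)
            continue
        if status != success:
            reason = response.get("FailureReason", "")
            raise RuntimeError(f"Creation ended with status {status}: {reason}")
        return status
    raise TimeoutError(f"Still {pending} after {max_polls} polls")

# Fake describe(): reports "Creating" twice, then "Created".
responses = iter([
    {"FeatureGroupStatus": "Creating"},
    {"FeatureGroupStatus": "Creating"},
    {"FeatureGroupStatus": "Created"},
])
final_status = wait_for_status(lambda: next(responses), poll_seconds=0)
print(final_status)
```

Bounding the number of polls is one way to keep a notebook cell from waiting indefinitely if the status never changes.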


### Task 2.1.5 Ingest features

This process takes approximately 8 minutes to complete.

In [None]:
#populate-feature-store
column_list = ['retained','esent','eopenrate','eclickrate','avgorder','ordfreq','paperless','refill','doorstep','first_last_days_diff','created_first_days_diff','favday_Friday','favday_Monday', 'favday_Saturday','favday_Sunday','favday_Thursday','favday_Tuesday','favday_Wednesday','city_BLR','city_BOM','city_DEL','city_MAA','FS_ID','FS_time']
lab_test_data = pd.read_csv('featureengineer_data/store_data_processed.csv', names = (column_list), header = 1)
feature_group.ingest(data_frame = lab_test_data, wait = True)

## Task 2.2 Create and run a SageMaker pipeline

Now that your environment is set up, you configure, create, and start a SageMaker pipeline. 

A SageMaker pipeline is a workflow that runs a set of dependent steps. Steps can accept inputs and send outputs, so data and other assets can be passed between them. 

Run the following cells to:
- Define variables that are needed to configure the pipeline.
- Configure a SageMaker session.
- Define the pipeline steps.
- Configure the pipeline.
- Create the pipeline.
- Start the pipeline.
- Describe the pipeline.
- Create a wait event so that the notebook does not proceed until the pipeline has finished running.

### Task 2.2.1 Set up the variables that the pipeline uses

In [None]:

#pipeline-variables
feature_group_name = feature_group.name
model_name = "Churn-model"

sklearn_processor_version = "0.23-1"
model_package_group_name = "ChurnModelPackageGroup"
pipeline_name = "ChurnModelSMPipeline"

processing_instance_count = ParameterInteger(
    name = "ProcessingInstanceCount",
    default_value = 1
    )

processing_instance_type = ParameterString(
        name = "ProcessingInstanceType",
        default_value = "ml.m5.xlarge"
    )

training_instance_type = ParameterString(
        name = "TrainingInstanceType",
        default_value = "ml.m5.xlarge"
    )

input_data = ParameterString(
        name = "InputData",
        default_value = "s3://{}/data/storedata_total.csv".format(default_bucket), 
    )

batch_data = ParameterString(
        name = "BatchData",
        default_value = "s3://{}/data/batch/batch.csv".format(default_bucket),
    )
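Pipeline parameters carry default values that can be overridden for a single run. In the SageMaker Python SDK, overrides are passed as a dictionary to `pipeline.start(parameters=...)`. The following sketch models only the merge of defaults and overrides in plain Python. The helper `resolve_parameters` and the override value are hypothetical.

```python
# Defaults copied from the parameter definitions above.
DEFAULT_PARAMETERS = {
    "ProcessingInstanceCount": 1,
    "ProcessingInstanceType": "ml.m5.xlarge",
    "TrainingInstanceType": "ml.m5.xlarge",
}

def resolve_parameters(defaults, overrides=None):
    """Hypothetical helper: apply per-run overrides on top of defaults."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown pipeline parameters: {unknown}")
    return {**defaults, **overrides}

# Example run that changes only the training instance type.
resolved = resolve_parameters(
    DEFAULT_PARAMETERS, {"TrainingInstanceType": "ml.m5.2xlarge"}
)
print(resolved)
```

Note that an override takes effect only where a step receives the parameter object itself. Where the lab code passes `.default_value`, the step keeps the default.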

### Task 2.2.2 Configure the pipeline

You define a pipeline named **ChurnModelSMPipeline** to produce a model that evaluates the likelihood of retaining or losing customers. This pipeline has nine steps. 

Each step in a pipeline runs a specific job type. The required inputs for a job vary based on the job type. Refer to [Step Types](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#build-and-manage-steps-types) for more information about SageMaker pipeline step types.
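Because steps depend on one another, the pipeline forms a directed acyclic graph. The following sketch lists the nine steps with dependencies read off the step definitions in this notebook and computes one valid run order with the standard library. The dependency table is an interpretation made for illustration. SageMaker builds the real graph from the step properties and `depends_on` settings.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it must wait for (assumed from the lab code).
DEPENDS_ON = {
    "ChurnModelProcess": set(),
    "ChurnHyperParameterTuning": {"ChurnModelProcess"},
    "ChurnEvalBestModel": {"ChurnHyperParameterTuning", "ChurnModelProcess"},
    "CheckAUCScoreChurnEvaluation": {"ChurnEvalBestModel"},
    "ChurnCreateModel": {"CheckAUCScoreChurnEvaluation"},
    "RegisterChurnModel": {"CheckAUCScoreChurnEvaluation"},
    "ChurnModelConfigFile": {"ChurnCreateModel"},
    "ChurnTransform": {"ChurnCreateModel"},
    "ClarifyProcessingStep": {"ChurnModelConfigFile"},
}

# static_order() yields every step after all of its dependencies.
run_order = list(TopologicalSorter(DEPENDS_ON).static_order())
for position, step_name in enumerate(run_order, start=1):
    print(position, step_name)
```

Steps with no dependency between them, such as **ChurnTransform** and **ChurnModelConfigFile**, can appear in either order and can run in parallel.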

Review the code in the following cells to understand how each step was defined:

The **ChurnModelProcess** step is defined in the variable named **step_process**. 

Step configuration includes the following:
- **Type:** Processing – Processing jobs are defined using the class ProcessingStep().
- **Processor:** SKLearnProcessor.
- **Destination:** Output will be sent to folders under your default S3 bucket.
- **Job Arguments:** This step will use the Feature Store to process the dataset.
- **Code:** **processfeaturestore.py**, which resides in your default S3 bucket.


In [None]:
#configure-processing-step
# Run a scikit-learn script to do data processing on SageMaker 
# using the SKLearnProcessor class
sklearn_processor = SKLearnProcessor(
        framework_version = sklearn_processor_version,
        instance_type = processing_instance_type.default_value, 
        instance_count = processing_instance_count,
        sagemaker_session = sagemaker_session,
        role = role,
    )

# Inputs, outputs, and code are parameters to the processor
# step_* will become the pipeline steps toward the end of the cell
# in this case, the feature store is the input, so there is no external input
step_process = ProcessingStep(
        name = "ChurnModelProcess",
        processor = sklearn_processor,
        outputs = [
            ProcessingOutput(output_name = "train", source = "/opt/ml/processing/train",\
                             destination = f"s3://{default_bucket}/output/train" ),
            ProcessingOutput(output_name = "validation", source = "/opt/ml/processing/validation",\
                            destination = f"s3://{default_bucket}/output/validation"),
            ProcessingOutput(output_name = "test", source = "/opt/ml/processing/test",\
                            destination = f"s3://{default_bucket}/output/test"),
            ProcessingOutput(output_name = "batch", source = "/opt/ml/processing/batch",\
                            destination = f"s3://{default_bucket}/data/batch"),
            ProcessingOutput(output_name = "baseline", source = "/opt/ml/processing/baseline",\
                            destination = f"s3://{default_bucket}/input/baseline")
        ],
        job_arguments = ["--featuregroupname",feature_group_name,"--default-bucket",default_bucket,"--region",region],
        code = f"s3://{default_bucket}/input/code/processfeaturestore.py",
    )

The **ChurnHyperParameterTuning** step is defined in the variable named **step_tuning**. 

Step configuration includes the following:
- **Type:** Tuning – Tuning jobs are defined using the class TuningStep().
- **Tuner:** This job uses the XGBoost framework.
- **Inputs:** Notice that this job uses the training and validation data that was produced by the ChurnModelProcess step, **step_process**.

In [None]:
#configure-churn-hyperparameter-tuning
# Training/tuning step for generating model artifacts
model_path = f"s3://{default_bucket}/output"
image_uri = sagemaker.image_uris.retrieve(
    framework = "xgboost",
    region = region,
    version = "1.5-1",
    py_version = "py3",
    instance_type = training_instance_type.default_value,
)

fixed_hyperparameters = {
    "eval_metric":"auc",
    "objective":"binary:logistic",
    "num_round":"100",
    "rate_drop":"0.3",
    "tweedie_variance_power":"1.4"
    }

xgb_train = Estimator(
    image_uri = image_uri,
    instance_type = training_instance_type,
    instance_count = 1,
    hyperparameters = fixed_hyperparameters,
    output_path = model_path,
    base_job_name = "churn-train",
    sagemaker_session = sagemaker_session,
    role = role
    )

In [None]:
#Tuning steps
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
    }
objective_metric_name = "validation:auc"

step_tuning = TuningStep(
    name = "ChurnHyperParameterTuning",
    tuner = HyperparameterTuner(xgb_train, objective_metric_name, hyperparameter_ranges, max_jobs = 2, max_parallel_jobs = 2),
    inputs = {
            "train": TrainingInput(
                s3_data = step_process.properties.ProcessingOutputConfig.Outputs[
                    "train"
                ].S3Output.S3Uri,
                content_type = "text/csv",
            ),
            "validation": TrainingInput(
                s3_data = step_process.properties.ProcessingOutputConfig.Outputs[
                    "validation"
                ].S3Output.S3Uri,
                content_type = "text/csv",
            ),
        },
    )
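A tuning job trains several candidate models, each with hyperparameter values drawn from the configured ranges, and keeps the one with the best objective metric. The following toy sketch draws two candidates uniformly at random from the same ranges. It illustrates the search space only and is not the strategy that SageMaker uses, which is Bayesian optimization by default. The `RANGES` table and the helper `sample_candidate` are hypothetical.

```python
import random

# Same ranges as hyperparameter_ranges above, as (kind, low, high) tuples.
RANGES = {
    "eta": ("continuous", 0.0, 1.0),
    "min_child_weight": ("continuous", 1.0, 10.0),
    "alpha": ("continuous", 0.0, 2.0),
    "max_depth": ("integer", 1, 10),
}

def sample_candidate(ranges, rng):
    """Hypothetical helper: draw one value per hyperparameter."""
    candidate = {}
    for name, (kind, low, high) in ranges.items():
        if kind == "integer":
            candidate[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate

rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = [sample_candidate(RANGES, rng) for _ in range(2)]  # max_jobs = 2
for candidate in candidates:
    print(candidate)
```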

The **ChurnEvalBestModel** step is defined in the variable named **step_eval**. 

Step configuration includes the following:
- **Type:** Processing.
- **Processor:** ScriptProcessor.
- **Inputs:** Notice that this job uses the top model from ChurnHyperParameterTuning (**step_tuning**) and the test output from ChurnModelProcess (**step_process**).
- **Outputs:** Output is written to the default S3 bucket.
- **Code:** A script named **evaluate.py**, which resides in Amazon S3, is used for the evaluation.

In [None]:
#configure-churn-best-model
evaluation_report = PropertyFile(
    name = "ChurnEvaluationReport",
    output_name = "evaluation",
    path = "evaluation.json",
)

script_eval = ScriptProcessor(
    image_uri = image_uri,
    command = ["python3"],
    instance_type = processing_instance_type,
    instance_count = 1,
    base_job_name = "script-churn-eval",
    role = role,
    sagemaker_session = sagemaker_session,
)

step_eval = ProcessingStep(
    name = "ChurnEvalBestModel",
    processor = script_eval,
    inputs = [
        ProcessingInput(
            source = step_tuning.get_top_model_s3_uri(top_k = 0, s3_bucket = default_bucket, prefix = "output"),
            destination = "/opt/ml/processing/model"
        ),
        ProcessingInput(
            source = step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination = "/opt/ml/processing/test"
        )
    ],
    outputs = [
        ProcessingOutput(output_name = "evaluation", source = "/opt/ml/processing/evaluation",\
                            destination = f"s3://{default_bucket}/output/evaluation"),
    ],
    code = f"s3://{default_bucket}/input/code/evaluate.py",
    property_files = [evaluation_report],
)

The **ChurnCreateModel** step is defined in the variable named **step_create_model**. 

Step configuration includes the following:
- **Type:** Create Model – Create model steps are defined using the class CreateModelStep().
- **Model:** The model used by the step is defined with the class Model() in the variable named **model**. Notice that the **model** variable uses the top model that was created by ChurnHyperParameterTuning (**step_tuning**).
- **Inputs:** The inputs include an instance type and an accelerator type.

In [None]:
#configure-model-creation
model = Model(
    image_uri = image_uri,        
    model_data = step_tuning.get_top_model_s3_uri(top_k = 0,s3_bucket = default_bucket,prefix = "output"),
    name = model_name,
    sagemaker_session = sagemaker_session,
    role = role,
)

inputs = CreateModelInput(
    instance_type = "ml.m5.large",
    accelerator_type = "ml.inf1.xlarge",
)

step_create_model = CreateModelStep(
    name = "ChurnCreateModel",
    model = model,
    inputs = inputs,
)

The **ChurnModelConfigFile** step is defined in the variable named **step_config_file**. 

Step configuration includes the following:
- **Type:** Processing.
- **Processor:** ScriptProcessor.
- **Code:** **generate_config.py**, which resides in your default S3 bucket.
- **Job Arguments:** Job arguments include the model that was generated by **ChurnCreateModel**, the path to the bias report, the default bucket, the number of samples, and the number of instances used for processing.
- **Depends On:** Notice that this job cannot run until the model creation has completed.

In [None]:

#configure-script-processing
bias_report_output_path = f"s3://{default_bucket}/clarify-output/bias"
clarify_instance_type = 'ml.m5.xlarge'
analysis_config_path = f"s3://{default_bucket}/clarify-output/bias/analysis_config.json"
clarify_image = sagemaker.image_uris.retrieve(framework = 'sklearn', version = sklearn_processor_version, region = region)

#custom_image_uri = None
script_processor = ScriptProcessor(
    command = ['python3'],
    image_uri = clarify_image,
    role = role,
    instance_count = 1,
    instance_type = processing_instance_type,
    sagemaker_session = sagemaker_session,
)

step_config_file = ProcessingStep(
    name = "ChurnModelConfigFile",
    processor = script_processor,
    code = f"s3://{default_bucket}/input/code/generate_config.py",
    job_arguments = ["--modelname", step_create_model.properties.ModelName, "--bias-report-output-path", bias_report_output_path, "--clarify-instance-type", clarify_instance_type,\
                  "--default-bucket", default_bucket, "--num-baseline-samples", "50", "--instance-count", "1"],
    depends_on = [step_create_model.name]
)

The **ChurnTransform** step is defined in the variable named **step_transform**. 

Step configuration includes the following:
- **Type:** Transform – Transform jobs are defined using the class TransformStep().
- **Transformer:** The transformer details are set in the variable named **transformer**. Notice that this variable uses the model that was created in ChurnCreateModel (**step_create_model**).
- **Inputs:** The data that will be transformed, batch.csv, which was defined earlier in the notebook. The input also includes the file type and how it should be split.

In [None]:
#configure-inference
transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type = "ml.m5.xlarge",
    instance_count = 1,
    assemble_with = "Line",
    accept = "text/csv",    
    output_path = f"s3://{default_bucket}/ChurnTransform"
    )

step_transform = TransformStep(
    name = "ChurnTransform",
    transformer = transformer,
    inputs = TransformInput(data = batch_data, content_type = "text/csv", join_source = "Input", split_type = "Line")
    )
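With `split_type = "Line"`, each line of the batch file is sent to the model as one record. With `join_source = "Input"` and CSV data, each prediction is appended to its input record as the last column. The following sketch imitates that join locally. The rows, scores, and the helper name `join_with_input` are made up for illustration.

```python
def join_with_input(input_lines, predictions):
    """Hypothetical helper: append each prediction to its CSV input line."""
    if len(input_lines) != len(predictions):
        raise ValueError("Each input line needs exactly one prediction")
    return [f"{line},{pred}" for line, pred in zip(input_lines, predictions)]

# Made-up batch rows and made-up model scores.
batch_rows = ["12,0.25,0.05", "40,0.10,0.00"]
scores = [0.91, 0.34]

joined = join_with_input(batch_rows, scores)
for line in joined:
    print(line)
```

Keeping the input next to the prediction makes it easier to trace a score back to the customer record that produced it.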

The **ClarifyProcessingStep** step is defined in the variable named **step_clarify**. 

Step configuration includes the following:
- **Type:** Processing.
- **Processor:** This job uses SageMakerClarifyProcessor. You can review the processor configuration in the variable named **clarify_processor**.
- **Inputs:** The inputs are defined in the **data_input** and **config_input** variables.
- **Outputs:** The output is written to a folder under the default bucket. 
- **Depends On:**  Notice that this job cannot run until the configuration file required by Amazon SageMaker Clarify has been created by the **ChurnModelConfigFile**. 

In [None]:
#configure-clarify-processing
data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path = f's3://{default_bucket}/output/train/train.csv',
    s3_output_path = bias_report_output_path,
    label = 0,
    headers = ['target','esent','eopenrate','eclickrate','avgorder','ordfreq','paperless','refill','doorstep','first_last_days_diff','created_first_days_diff','favday_Friday','favday_Monday','favday_Saturday','favday_Sunday','favday_Thursday','favday_Tuesday','favday_Wednesday','city_BLR','city_BOM','city_DEL','city_MAA'],
    dataset_type = "text/csv",
)

clarify_processor = sagemaker.clarify.SageMakerClarifyProcessor(
    role = role,
    instance_count = 1,
    instance_type = clarify_instance_type,
    sagemaker_session = sagemaker_session,
)

config_input = ProcessingInput(
    input_name = "analysis_config",
    source=analysis_config_path,
    destination = "/opt/ml/processing/input/analysis_config",
    s3_data_type = "S3Prefix",
    s3_input_mode = "File",
    s3_compression_type = "None",
    )

data_input = ProcessingInput(
    input_name = "dataset",
    source = data_config.s3_data_input_path,
    destination = "/opt/ml/processing/input/data",
    s3_data_type = "S3Prefix",
    s3_input_mode = "File",
    s3_data_distribution_type = data_config.s3_data_distribution_type,
    s3_compression_type = data_config.s3_compression_type,
)

result_output = ProcessingOutput( 
    source = "/opt/ml/processing/output",
    destination = data_config.s3_output_path,
    output_name = "analysis_result",
    s3_upload_mode = "EndOfJob",
)

step_clarify = ProcessingStep(
    name = "ClarifyProcessingStep",
    processor = clarify_processor,
    inputs = [data_input, config_input],
    outputs = [result_output],
    depends_on = [step_config_file.name]
)

The **RegisterChurnModel** step is defined in the variable named **step_register**. 

Step configuration includes the following:
- **Type:** Register Model – Register jobs are defined using the class RegisterModel().
- **Estimator:** The estimator is defined in the **xgb_train** variable earlier in the notebook.
- **Model Data:** This is the model URI that is returned by **ChurnHyperParameterTuning**.
- **Content Types:** text/csv
- **Response Types:** text/csv
- **Inference Instance:** This is the instance type that will be used for inference processing.
- **Transform Instance:** This is the instance type that will be used to process transformations.
- **Model Package Group Name:** This is the name of the group that will store the group of model versions.
- **Model Metrics:** This defines the location of the model metrics. Files included are the SageMaker Clarify bias report, SageMaker Clarify explainability report, and the model evaluation. 

In [None]:
#configure-model-registry
model_statistics = MetricsSource(
    s3_uri = "s3://{}/output/evaluation/evaluation.json".format(default_bucket),
    content_type = "application/json"
    )
explainability = MetricsSource(
    s3_uri = "s3://{}/clarify-output/bias/analysis.json".format(default_bucket),
    content_type = "application/json"
    )

bias = MetricsSource(
    s3_uri = "s3://{}/clarify-output/bias/analysis.json".format(default_bucket),
    content_type = "application/json"
    ) 

model_metrics = ModelMetrics(
    model_statistics = model_statistics,
    explainability = explainability,
    bias = bias
)

step_register = RegisterModel(
    name = "RegisterChurnModel",
    estimator = xgb_train,
    model_data = step_tuning.get_top_model_s3_uri(top_k = 0, s3_bucket = default_bucket, prefix = "output"),
    content_types = ["text/csv"],
    response_types = ["text/csv"],
    inference_instances = ["ml.t2.medium", "ml.m5.large"],
    transform_instances = ["ml.m5.large"],
    model_package_group_name = model_package_group_name,
    model_metrics = model_metrics,
)

The **CheckAUCScoreChurnEvaluation** step is defined in the variable named **step_cond**. 

Step configuration includes the following:
- **Type:** Condition – Condition jobs are defined using the class ConditionStep().
- **Conditions:** This condition evaluates to True if the AUC value that **ChurnEvalBestModel** writes to its evaluation report is greater than 0.75.
- **If Steps:** This is the list of steps that run if the condition evaluates to True.
- **Else Steps:** This is the list of steps that run if the condition evaluates to False. Notice that this list is empty, which means the pipeline stops processing if the condition is not met.

In [None]:
%%capture
# True when the AUC in the evaluation report is greater than 0.75
cond_gt = ConditionGreaterThan(
    left = JsonGet(
        step = step_eval,
        property_file = evaluation_report,
        json_path = "binary_classification_metrics.auc.value"
    ),
    right = 0.75,
)

step_cond = ConditionStep(
    name = "CheckAUCScoreChurnEvaluation",
    conditions = [cond_gt],
    if_steps = [step_create_model, step_config_file, step_transform, step_clarify, step_register],
    else_steps = [],
)
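The condition reads one value from the evaluation report by following a dotted JSON path and compares it with the threshold. The following sketch reproduces that lookup and comparison locally. The report content, including the AUC of 0.82, is made up, and the helper `get_by_path` is hypothetical. Only the key layout is taken from the `json_path` used in the lab.

```python
import json

# Made-up evaluation report with the key layout that the json_path implies.
report_text = json.dumps(
    {"binary_classification_metrics": {"auc": {"value": 0.82}}}
)

def get_by_path(document, dotted_path):
    """Hypothetical helper: walk nested dictionaries one key at a time."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value

auc = get_by_path(json.loads(report_text), "binary_classification_metrics.auc.value")

# Same comparison as the condition step: strictly greater than 0.75.
branch = "if_steps" if auc > 0.75 else "else_steps"
print(auc, branch)
```

Because the comparison is strict, an AUC of exactly 0.75 would take the empty `else_steps` branch and the pipeline would stop.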

### Task 2.2.3 Define the pipeline

After you define the steps, you configure the pipeline in the variable named **pipeline**. Notice how steps that were previously defined are passed into the pipeline definition.

In [None]:
#define pipeline function
def get_pipeline(
    region,
    role = None,
    default_bucket = None,
    model_package_group_name = "ChurnModelPackageGroup",
    pipeline_name = "ChurnModelPipeline",
    base_prefix = None,
    custom_image_uri = None,
    sklearn_processor_version = None
    ):

    #configure pipeline instance
    pipeline = Pipeline(
        name = pipeline_name,
        parameters = [
            processing_instance_type,
            processing_instance_count,
            training_instance_type,
            input_data,
            batch_data,
        ],
        steps = [step_process, step_tuning, step_eval, step_cond],
        sagemaker_session = sagemaker_session
    )
    return pipeline


### Task 2.2.4 Create the pipeline

In [None]:
#create pipeline using function
pipeline = get_pipeline(
    region = region,
    role = role,
    default_bucket = default_bucket,
    model_package_group_name = model_package_group_name,
    pipeline_name = pipeline_name,
    custom_image_uri = clarify_image,
    sklearn_processor_version = sklearn_processor_version
)

### Task 2.2.5 Update the pipeline to use the correct IAM role

In [None]:
#set-iam-role
pipeline.upsert(role_arn = role)

**Note:** If you see the following warning after running the cell, you can safely ignore it.

"No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config".

### Task 2.2.6 Start the pipeline

In [None]:
#start-pipeline
RunPipeline = pipeline.start()

### Task 2.2.7 Describe the pipeline

In [None]:
#describe-pipeline
RunPipeline.describe()

This pipeline takes about 35 minutes to run.

While the pipeline is running, continue to the next task to explore the pipeline in the Amazon SageMaker Studio console.

## Task 2.3 Monitor and approve the pipeline

In this task, you explore the pipeline using the Amazon SageMaker Studio console.

### Task 2.3.1 Monitor the pipeline in SageMaker Studio

The next task opens a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **lab_10.ipynb** tab to the side or choose (right-click) the **lab_10.ipynb** tab and select **New View for Notebook**. You can now have the directions displayed as you explore the pipeline steps.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the pipeline steps, return to the notebook by selecting the **lab_10.ipynb** tab.

1. In SageMaker Studio, choose the **SageMaker Home** icon.
1. Choose **Pipelines**.

SageMaker Studio opens the **Pipelines** tab.

1. Select the pipeline named **ChurnModelSMPipeline**. 

SageMaker Studio opens the **ChurnModelSMPipeline** tab.

1. In the **ChurnModelSMPipeline** tab, under **Executions**, open (right-click) the pipeline status, and then choose **Open execution details**. 

SageMaker Studio opens the **Directed Acyclic Graph** (DAG) page.

The Directed Acyclic Graph (DAG) shows the pipeline's workflow and progress. Colors are used to indicate the status of a step. The step color indicators are:
- **gray:** waiting to run.
- **blue:** running.
- **green:** completed successfully.
- **red:** error.
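The same progress information can be read as data. A pipeline execution reports its steps as a list of records with `StepName` and `StepStatus` fields. The following sketch maps a made-up list of such records to the colors above. The sample records, the status-to-color table, and the helper `color_for` are assumptions for illustration and are not taken from the Studio implementation.

```python
# Assumed mapping from step status to the DAG colors described above.
STATUS_TO_COLOR = {
    "Executing": "blue",
    "Succeeded": "green",
    "Failed": "red",
}

# Made-up snapshot of a run that is part of the way through.
sample_steps = [
    {"StepName": "ChurnModelProcess", "StepStatus": "Succeeded"},
    {"StepName": "ChurnHyperParameterTuning", "StepStatus": "Executing"},
]

def color_for(step_name, steps):
    """Hypothetical helper: gray unless the step reports a mapped status."""
    for step in steps:
        if step["StepName"] == step_name:
            return STATUS_TO_COLOR.get(step["StepStatus"], "gray")
    return "gray"  # step has not been reported yet, so it is still waiting

for name in ("ChurnModelProcess", "ChurnHyperParameterTuning", "ChurnEvalBestModel"):
    print(name, color_for(name, sample_steps))
```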

Each step that you explore has the same four tabs in the step's details pane. Although the tab titles are the same, the contents of each tab vary depending on the job type and the job's configuration:

- **Input:** This tab contains inputs that were passed to the job. Examples of inputs are parameters that specify instance types, AWS Identity and Access Management (IAM) roles, or arguments needed to run the job. Other input examples include files such as code, data sets, and Docker images.
- **Output:** This tab shows the output that the job created. Some examples of outputs are metrics, charts, files, and evaluation outcomes.
- **Logs:** This tab provides a list of logs associated with the job. If you have sufficient permissions for Amazon CloudWatch Logs, you can choose the logs link and view detailed log messages in CloudWatch. Some job types do not generate logs.
- **Information:** This tab provides basic information about a job such as the job type, job name, and when the job ran.

1. Choose the step named **ChurnModelProcess**. A new pane named **ChurnModelProcess** is displayed.
1. In the **ChurnModelProcess** pane, review the tabs associated with this pipeline step: 
    - Choose the **Input** tab. This tab contains helpful information about the parameters and files that the processing step uses. In the parameters list, there are details including the instance type and image that the job uses, dataset location, code location, and destinations for the different outputs that are generated. Scroll to the bottom of the pane to find the file inputs that were passed to the job.
    - Choose the **Output** tab. This tab shows the different files that the pipeline step generates and where they are placed. This pipeline places all outputs in the SageMaker Studio default bucket.
    - Choose the **Logs** tab. This tab shows the logs that the job generates. Having the logging available inside SageMaker Studio speeds up investigation and troubleshooting when a pipeline step fails to run successfully.
    - Choose the **Information** tab. This tab provides a high-level overview of the pipeline step. It includes information such as the step type, step name, and a link to the job log. It also provides details about when the job ran and how long it took to run.
        - Notice that the **Step Type** is **Processing**.

### Task 2.3.2 Discover pipeline step details

In the following steps, you choose the appropriate node from the directed acyclic graph (DAG) to find information about a given pipeline step. If you need help finding the answers, correct responses or hints are included at the end of this Python notebook.

1. For the step named **ChurnHyperParameterTuning**, locate the following details:
    - What is the **Step Type** for this step?
    - What was the **Overall Best Training Job** generated by this step?
1. For the step named **ChurnEvalBestModel**, locate the following details:
    - What is the **Step Type** for this step?
    - What is the name of the Python script that is used to evaluate the top model that was identified in the previous step?
    - Where is this file located?
    - Where were the results from this step written?
1. For the step named **CheckAUCScoreChurnEvaluation**, locate the following details:
    - What is the **Step Type** for this step?
    - What was the **Evaluation outcome**?
1. For the step named **ChurnCreateModel**, locate the following details:
    - What is the **Step Type** for this step?
    - Did this job generate any logs?
1. For the step named **RegisterChurnModel**, locate the following details:
    - What is the **Step Type** for this step?
    - What is the value for the area under the curve (AUC) metric?
1. For the step named **ChurnTransform**, locate the following details:
    - What is the **Step Type** for this step?
    - Did this job generate logs?
    - Which files were inputs for this step?
1. For the step named **ChurnModelConfigFile**, locate the following details: 
    - Which ProcessingInstanceType was used to run this job?
    - What is the **Step Type** for this step?
1. For the step named **ClarifyProcessingStep**, locate the following details:
    - What was the file output from this step?
    - Where was the output written?

### Task 2.3.3 Approve the model in the pipeline

1. After the pipeline has finished running, view the model that the pipeline created in the **Model registry**:
    - In SageMaker Studio, choose the **SageMaker Home** icon.
    - Expand the **Models** list.
    - Choose **Model registry**.
    - Open the Model group named **ChurnModelPackageGroup**.
    - In the **ChurnModelPackageGroup** tab, open (right-click) the row in the **Versions** table and choose **Open model version**. Notice that the model status is **Pending**. Also, notice that the **Execution** value is the name of the pipeline run that just completed.

    Additional details about the model version are found in each tab:
    - **Activity:** This tab shows activity for the model. It contains event information and how long it has been since the model was modified.
    - **Model quality:** This tab shows model accuracy metrics.
    - **Explainability:** This tab shows the importance of the model's features in terms of SHAP (SHapley Additive exPlanations) values.
    - **Bias report:** This tab shows potential model bias.
    - **Inference recommender:** This tab provides recommendations to improve the price performance of a model. This tab does not contain data because this feature is not supported for this model package.
    - **Load test:** From this tab, you can launch load tests to try different instance types and evaluate them against the throughput and latency metrics that are required for a production deployment.
    - **Settings:** This tab shows information such as when the model was created, which pipeline generated the model, where the model is located, and the trial component associated with the model.

1. Approve the model. This process is designed for manual review before approving the model. However, it is possible to automate the model approval within the pipeline:
    - Choose <span style="background-color:#57c4f8; font-size:90%;  color:black; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#00a0d2; border-radius:2px; margin-right:5px; white-space:nowrap">**Update status**</span>.
    - Open the dropdown list and choose <span style="background-color:#1a1b22; font-size:90%; color:#57c4f8; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#57c4f8; border-width:thin; border-style:solid; border-radius:2px; margin-right:5px; white-space:nowrap">**Approved**</span>.
    - Choose <span style="background-color:#57c4f8; font-size:90%;  color:black; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#00a0d2; border-radius:2px; margin-right:5px; white-space:nowrap">**Update status**</span>.
1. Close the **ChurnModelPackageGroup** tab.
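
The approval that you just performed in the console can also be done programmatically. The following is a minimal sketch, not the lab's own method. It assumes a boto3 SageMaker client such as the `sagemaker_client` used elsewhere in this notebook, and the function name `approve_latest_model_package` is illustrative.

```python
def approve_latest_model_package(sm_client, group_name):
    """Approve the newest model package in a model package group and return its ARN."""
    # Ask for the most recently created model package in the group.
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    if not summaries:
        raise ValueError(f"No model packages found in group {group_name!r}")

    # Change the approval status of that model package to Approved.
    arn = summaries[0]["ModelPackageArn"]
    sm_client.update_model_package(
        ModelPackageArn=arn,
        ModelApprovalStatus="Approved",
    )
    return arn


# In the notebook (assumes sagemaker_client is the boto3 SageMaker client):
# approve_latest_model_package(sagemaker_client, "ChurnModelPackageGroup")
```

Passing the client in as an argument keeps the function easy to try against a stub before you run it against a real account.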

### Task 2.3.4 View the pipeline steps using the AWS SDK

In addition to using the SageMaker Studio UI to view pipeline details, you can also use AWS SDK commands. For example, the following command returns a list of the pipeline steps.

In [None]:
#list-steps
RunPipeline.list_steps()
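
The list that `list_steps()` returns can be long. As a rough sketch, the helper below condenses it to one row per step. It assumes that each entry is a dictionary with `StepName` and `StepStatus` keys and optional `StartTime` and `EndTime` values; the name `summarize_steps` is illustrative.

```python
def summarize_steps(steps):
    """Return a (name, status, duration) tuple for each pipeline step."""
    rows = []
    for step in steps:
        start = step.get("StartTime")
        end = step.get("EndTime")
        # Steps that have not started or finished yet have no duration.
        duration = (end - start) if start is not None and end is not None else None
        rows.append((step["StepName"], step["StepStatus"], duration))
    return rows


# In the notebook (assumes RunPipeline is the pipeline execution):
# for name, status, duration in summarize_steps(RunPipeline.list_steps()):
#     print(f"{name:35} {status:12} {duration}")
```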

## Task 2.4 Review the artifacts

The next task opens a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **lab_10.ipynb** tab to the side or choose (right-click) the **lab_10.ipynb** tab and choose **New View for Notebook**. You can now have the directions displayed as you explore the artifacts.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the artifacts, return to the notebook by choosing the **lab_10.ipynb** tab.

### Task 2.4.1 Review the artifacts in SageMaker Studio

As the pipeline ran, each step generated artifacts such as files, trained parameters, and models. You can use SageMaker Studio to identify the artifacts that the pipeline created.
1. Return to the tab named **ChurnModelSMPipeline**.
1. Select the **Executions** tab.
1. Open (right-click) the listed execution and choose **View trial components generated by execution**. 

SageMaker Studio opens a new tab named **Trial Component List**. 

A list of all the jobs that the pipeline ran is displayed.

Notice that each trial component has a **Trial Component Type**. The information available under the various tabs in the associated trial detail depends on the trial component type. Not all tabs under trial detail are populated with data for all component types.

1. Open (right-click) the top row in the list of jobs and choose **Open in trial details**. 

SageMaker Studio opens a new tab named **Describe Trial Component**. 

Under **Trial Components**, multiple tabs are available. Depending on what a pipeline step was doing, some tabs might be empty. 

1. Choose the **Artifacts** tab. This tab shows the details of both the inputs and the outputs used in the step.
1. Choose the **Explainability** tab. This tab shows the explainability report that SageMaker Clarify generated.
1. Choose the **Bias Report** tab. This tab shows the bias report that SageMaker Clarify generated.

### Task 2.4.2 Locate the artifacts in the default S3 bucket
**Note:** You use the AWS Management Console for this task. After you have explored the S3 bucket, return to the browser tab where SageMaker Studio is open and choose the **lab_10.ipynb** tab.

1. On the browser tab where the console is open, navigate to Amazon S3.
1. Choose the bucket whose name begins with **sagemaker-** followed by the AWS Region, for example **sagemaker-us-west-2-123456789**.
1. Explore the folders and files under this bucket. This bucket contains the dataset, processing inputs and outputs, the SageMaker Clarify results, and other files that contributed to the resulting model.
1. Return to the browser tab where SageMaker Studio is open and choose the **lab_10.ipynb** tab.
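
You can also browse the same bucket from the notebook. The sketch below lists the top-level folders (prefixes) in a bucket with the boto3 `list_objects_v2` paginator. The function name `list_top_level_prefixes` is illustrative, and the commented usage assumes that `sagemaker.Session().default_bucket()` returns the bucket that this lab uses.

```python
def list_top_level_prefixes(s3_client, bucket):
    """Return the top-level "folder" prefixes in an S3 bucket."""
    paginator = s3_client.get_paginator("list_objects_v2")
    prefixes = []
    # Delimiter="/" groups keys by their first path segment instead of listing every object.
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            prefixes.append(entry["Prefix"])
    return prefixes


# In the notebook (assumes boto3 and sagemaker are imported, as in Task 2.1.2):
# s3_client = boto3.client("s3")
# print(list_top_level_prefixes(s3_client, sagemaker.Session().default_bucket()))
```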

## Task 2.5 (Optional) Build and review the lineage for the pipeline

You learned how to use SageMaker Clarify to help explain how a model makes predictions and understand the potential bias of a model. You can also use Amazon SageMaker ML Lineage Tracking to discover the steps that are used to generate the model, which are often needed for model auditing. In this task, you take advantage of the MLLineageHelper module to build the lineage of the current pipeline run. Refer to [MLLineageHelper](https://github.com/aws-samples/ml-lineage-helper) for more information about ML Lineage Helper.

Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of an ML workflow from data preparation to model deployment. With the tracking information, you can reproduce the workflow steps, track model and dataset lineage, and establish model governance and audit standards.

### Task 2.5.1 Set up the query variables

In [None]:
#set-variables
fs_query = feature_group.athena_query()
fs_table = fs_query.table_name
query_string = 'SELECT * FROM "'+fs_table+'"'

### Task 2.5.2 Show values that will be used to build the model's lineage

Configurations include the following:
- **query_string:** This is the SageMaker Feature Store query that will be passed to the MLLineageHelper module.
- **model_ref:** This is the name of the model that is being evaluated.
- **processing_job:** This is the name of the processing job that generated the model.

In [None]:
#print-values
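# Show the Feature Store query that is passed to the MLLineageHelper module.
print('query_string:', query_string)

# Use the most recently created model as the model reference.
model_ref = sagemaker_client.list_models(
    SortBy = 'CreationTime', SortOrder = 'Descending'
)['Models'][0]['ModelName']
print('model_ref:', model_ref)

# Find the most recent processing job whose name contains 'ChurnModelProcess'.
processing_job = sagemaker_client.list_processing_jobs(
    SortBy = 'CreationTime', SortOrder = 'Descending', NameContains = 'ChurnModelProcess'
)['ProcessingJobSummaries'][0]['ProcessingJobName']
print('processing_job:', processing_job)

# Retrieve the full description of that processing job for the lineage call.
processing_job_description = sagemaker_client.describe_processing_job(
    ProcessingJobName = processing_job
)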

### Task 2.5.3 Describe the processing job

In [None]:
#describe-processing-job
processing_job_description

### Task 2.5.4 Show the name of the training job used to create the model

In [None]:
#print-training-job
# Use the most recently created training job.
training_job_name = sagemaker_client.list_training_jobs(SortBy = 'CreationTime', SortOrder = 'Descending')['TrainingJobSummaries'][0]['TrainingJobName']
print(training_job_name)

### Task 2.5.5 Build the lineage for the model

If you receive the following error, run the cell again.
- **ClientError: An error occurred (ThrottlingException) when calling the UpdateArtifact operation (reached max retries: 4): Rate exceeded**

In [None]:
#build-lineage
ml_lineage = MLLineageHelper()
lineage = ml_lineage.create_ml_lineage(training_job_name, model_name = model_ref,
                                       query = query_string, sagemaker_processing_job_description = processing_job_description,
                                       feature_group_names = [feature_group_name])
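
Instead of rerunning the cell by hand when a **ThrottlingException** occurs, you could wrap the call in a retry with exponential backoff. The following is a minimal sketch under stated assumptions: it relies on the exception carrying a `response` dictionary with an error code, as botocore `ClientError` exceptions do, and the helper names `is_throttling_error` and `call_with_backoff` are illustrative.

```python
import time


def is_throttling_error(exc):
    """Return True if the exception carries a ThrottlingException error code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ThrottlingException"


def call_with_backoff(func, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on throttling errors with delays of 2, 4, 8, ... seconds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not throttling, and give up after the last attempt.
            if not is_throttling_error(exc) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# In the notebook, using the same arguments as the build-lineage cell:
# lineage = call_with_backoff(lambda: ml_lineage.create_ml_lineage(
#     training_job_name, model_name=model_ref, query=query_string,
#     sagemaker_processing_job_description=processing_job_description,
#     feature_group_names=[feature_group_name]))
```

Doubling the delay after each failed attempt gives the service progressively more time to recover before the next call.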

### Task 2.5.6 Limit the lineage to include only the current trial and feature group

A pipeline can run multiple times. To ensure that you are retrieving details from the most recent training job run, filter the lineage call using the name of the current trial and the feature group that the trial uses. 

After you run this cell, the steps used to create the model, the order in which the steps ran, and which jobs contributed to other jobs in the pipeline are displayed as a table. This same information is also written to a file named **lineage_FS.csv**. You can download this file to save the output and share it with other team members, such as auditors.

In [None]:
#limit-lineage
trial_name = RunPipeline.describe()['PipelineExperimentConfig']['TrialName']
pat = str(trial_name)+'|'+'fg-FG'
df1 = lineage[lineage.apply(lambda x: any(x.str.contains(pat)),axis = 1)]
pd.set_option('display.max_colwidth', 120)
df1.to_csv('lineage_FS.csv') 
df1
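
The filter in the previous cell treats `pat` as a regular expression: a row is kept when any of its cells contains the trial name or the text `fg-FG`. The plain-Python sketch below illustrates the same idea on ordinary lists and escapes the trial name so that any special characters in it are matched literally. The helper names are illustrative, and the sample rows are made up for demonstration.

```python
import re


def build_lineage_pattern(trial_name, feature_group_marker="fg-FG"):
    """Build an "A or B" regular expression that matches either string literally."""
    return "|".join(re.escape(part) for part in (str(trial_name), feature_group_marker))


def row_matches(row, pattern):
    """Return True if any cell in the row contains a match for the pattern."""
    return any(re.search(pattern, str(cell)) for cell in row)


# Made-up rows for illustration only.
sample_rows = [
    ["example-trial-processing", "Input", "dataset"],
    ["fg-FG-example", "Input", "example-trial-training"],
    ["unrelated-job", "Output", "other-artifact"],
]
pattern = build_lineage_pattern("example-trial")
kept = [row for row in sample_rows if row_matches(row, pattern)]
print(len(kept))
```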

### Task 2.5.7 Generate a visualization of the model's lineage

In [None]:
#visualize-lineage
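# Use the first and third columns of each lineage row as the endpoints of a directed edge.
graph = nx.DiGraph()
graph.add_edges_from([(each[0], each[2]) for each in df1.values])
# Create a single figure at the intended size; a separate plt.figure() call leaves an empty extra figure.
fig, ax = plt.subplots(figsize = (20, 14))
nx.draw_networkx(
    graph,
    pos = nx.spring_layout(graph),
    ax = ax,
    node_size = 300,
    node_color = "orange",
    alpha = 0.65,
    font_size = 8
)
ax.set_facecolor('deepskyblue')
ax.axis('off')
fig.set_facecolor('deepskyblue')
plt.show()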

## Task 2.6 Remove the pipeline

To delete the pipeline, run the following cell:

In [None]:
#delete-pipeline
response = sagemaker_client.delete_pipeline(PipelineName = 'ChurnModelSMPipeline')
print (response)

### Conclusion 

Congratulations! You have used SageMaker Pipelines to automate the creation and registration of a model. You learned how to drill down into each pipeline step to identify associated parameters, files, and logs. You know how to identify the assets that the pipeline used to generate the model, how to find the model in the model registry, and how to find and view the explainability and bias reports that a pipeline can generate.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.

### Hints and answers for Task 2.3.2
General hint: **Step type** is found on the **Information** tab.

1. For the step named **ChurnHyperParameterTuning**, locate the following details:
    - What is the **Step Type** for this step?</br>
    **Answer:** Tuning</br>
    - What was the **Overall Best Training Job** generated by this step? </br>
    **Hint:** This information is found on the **Output** tab.</br>
    **Answer:** The training job name is generated and will be different for each student. The name should be similar to this example: 056vhzs2vkxc-ChurnHy-TCAtUr16oV-001-17d5bd01
1. For the **ChurnEvalBestModel** step, locate the following details:
    - What is the **Step Type** for this step?</br>
    **Answer:** Processing</br>
    - What is the name of the Python script that is used to evaluate the top model that was identified in the previous step?</br>
    **Hint:** This information is found on the **Input** tab.</br>
    **Answer:** evaluate.py</br>
    - Where is this file located?</br>
    **Hint:** This information is found on the **Input** tab.</br>
    **Answer:** The file resides in an S3 Bucket. The path is similar to this example: s3://sagemaker-us-west-2-1234567890/input/code/evaluate.py</br>
    - Where were the results from this step written?</br>
    **Hint:** This information is found on the **Output** tab.</br>
    **Answer:** The results of the evaluation were written to an S3 bucket. The path to the file should be similar to the following example: s3://sagemaker-us-west-2-1234567890/output/evaluation</br>
1. For the **CheckAUCScoreChurnEvaluation** step, locate the following details:
    - What is the **Step Type** for this step?</br>
    **Answer:** Condition</br>
    - What was the **Evaluation outcome**?</br>
    **Hint:** This information is found on the **Output** tab.</br>
    **Answer:** True
1. For the **ChurnCreateModel** step, locate the following details:
    - What is the **Step Type** for this step?</br>
    **Answer:** Model</br>
    - Did this job generate any logs?</br>
    **Answer:** No
1. For the **RegisterChurnModel** step, locate the following details:
    - What is the **Step Type** for this step?</br>
    **Answer:** RegisterModel</br>
    - What is the value for the AUC metric?</br>
    **Hint:** This information is found on the **Output** tab.</br>
    **Answer:** The value will vary, but should be close to 0.98.
1. For the **ChurnTransform** step, locate the following details:
    - What is the **Step Type** for this step?</br>
    **Answer:** Transform</br>
    - Did this job generate logs?</br>
    **Answer:** Yes</br>
    - Which files were inputs for this step?</br>
    **Hint:** This information is found on the **Input** tab. You might need to scroll to the bottom of the pane to find the file names.</br>
    **Answer:** model.tar.gz, sagemaker-xgboost:1.5-1-cpu-py3, batch.csv
1. For the **ChurnModelConfigFile** step, locate the following details: 
    - Which ProcessingInstanceType was used to run this job?</br>
    **Hint:** This information is found on the **Input** tab.</br>
    **Answer:** ml.m5.xlarge
    - What is the **Step Type** for this step?</br>
    **Answer:** Processing
1. For the step named **ClarifyProcessingStep**, locate the following details:
    - What was the file output from this step?</br>
    **Hint:** This information is found on the **Output** tab.</br>
    **Answer:** The output was bias data.
    - Where was the output written?</br>
    **Answer:** The output was written to an S3 Bucket. The path should be similar to this example: s3://sagemaker-us-west-2-1234567890/clarify-output/bias