## Get started with SageMaker

This notebook accompanies the [introductory blog post on SageMaker Core](https://aws.amazon.com/blogs/machine-learning/introducing-sagemaker-core-a-new-object-oriented-python-sdk-for-amazon-sagemaker/).

In this notebook you'll learn how SageMaker can be used to:

1. Preprocess (and optionally explore) a dataset
2. Train an XGBoost classifier for customer churn prediction as a managed SageMaker Training job that runs a managed container image.
3. Create a managed real-time SageMaker endpoint.

All SageMaker resources are created using the SageMaker Core SDK. You can find more information about sagemaker-core [here](https://sagemaker-core.readthedocs.io/en/latest/).

In [None]:
%pip install --upgrade pip -q
%pip install sagemaker-core -q

In [None]:
import time
from datetime import datetime

import boto3
import pandas as pd
from sagemaker_core.helper.session_helper import Session, get_execution_role
from sagemaker_core.resources import ProcessingJob
from sagemaker_core.shapes import (
    AppSpecification,
    ProcessingClusterConfig,
    ProcessingInput,
    ProcessingOutput,
    ProcessingOutputConfig,
    ProcessingResources,
    ProcessingS3Input,
    ProcessingS3Output,
)

In [None]:
# Set up region, role and bucket parameters used throughout the notebook.
sagemaker_session = Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
s3_client = sagemaker_session.boto_session.client("s3")  # S3 client built from the session's boto3 session

print(f"AWS region: {region}")
print(f"Execution role: {role}")
print(f"Default S3 bucket: {bucket}")

## Preprocess dataset
We'll use a synthetic dataset that AWS provides for customer churn prediction.


<div class="alert alert-block alert-info">
<b>NOTE:</b> This sample doesn't perform any exploratory data analysis, since the preprocessing steps for this dataset are already known.

If you're interested in exploratory analysis, the sagemaker-python-sdk examples documentation includes a walkthrough that explores this dataset, [here](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html).
</div>

## Upload the processing code to S3

The pre-processing code is already available in the current directory with the name `preprocess.py`. Have a look at the code to understand the processing logic.
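
The contents of `preprocess.py` are not reproduced in this notebook. As a rough illustration of the kind of logic such a script typically contains, here is a minimal sketch of a shuffled train/validation/test split. The function name `split_rows`, the 70/20/10 ratios and the seed are assumptions made for this example and are not taken from the actual script.

```python
import random


def split_rows(rows, train_frac=0.7, validation_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation and test lists.

    Hypothetical helper: the fractions and seed are illustrative only.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for a given seed

    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train_rows, validation_rows, test_rows = split_rows(range(100))
print(len(train_rows), len(validation_rows), len(test_rows))
```

In a processing job, a script like this would read its input from the container path configured for the input channel and write each split to its own output directory, which SageMaker then uploads to S3.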

In [None]:
pre_processing_code_s3_uri = sagemaker_session.upload_data("preprocess.py", key_prefix="sagemaker-core-intro-blog/processing/code")
pre_processing_code_s3_uri

### Create S3 variables for holding the processed data

Create S3 variables for holding the processed data (train, validation and test).

In [None]:
processed_train_data_uri = f"s3://{bucket}/sagemaker-core-intro-blog/processing/output/train"
processed_validation_data_uri = f"s3://{bucket}/sagemaker-core-intro-blog/processing/output/validation"
processed_test_data_uri = f"s3://{bucket}/sagemaker-core-intro-blog/processing/output/test"
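
Some S3 client calls, such as `download_file`, take a bucket and a key rather than a full `s3://` URI. A minimal sketch of splitting a URI into those two parts is shown below; `split_s3_uri` and the bucket name `example-bucket` are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3:// URI. Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without the leading slash
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket, example_key = split_s3_uri(
    "s3://example-bucket/sagemaker-core-intro-blog/processing/output/test"
)
print(example_bucket, example_key)
```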

## Create processing job

The code below submits a SageMaker processing job. It is configured with:

1. `ProcessingResources`, which provides the processing cluster details.
2. `AppSpecification`, which provides the processing container details. Here, the scikit-learn processing container provided by SageMaker is used.
3. Two `ProcessingInput` objects, which specify the locations and configurations of the code and the input data.
4. `ProcessingOutputConfig`, which specifies where the processed data will be stored.

In [None]:
# Initialize a ProcessingJob resource
current_timestamp = datetime.now()
formatted_timestamp = current_timestamp.strftime("%Y-%m-%d-%H-%M-%S")

processing_job = ProcessingJob.create(
    processing_job_name=f"sagemaker-core-data-prep-{formatted_timestamp}",
    processing_resources=ProcessingResources(
        cluster_config=ProcessingClusterConfig(
            instance_count=1,
            instance_type="ml.m5.xlarge",
            volume_size_in_gb=20
        )
    ),
    app_specification=AppSpecification(
        image_uri=f"683313688378.dkr.ecr.{region}.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3",
        container_entrypoint=["python3", "/opt/ml/processing/code/preprocess.py"]
    ),
    role_arn=role,  # Intelligent default for execution role
    processing_inputs=[
        ProcessingInput(
            input_name="input",
            s3_input=ProcessingS3Input(
                s3_uri=f"s3://sagemaker-example-files-prod-{region}/datasets/tabular/synthetic/churn.txt",
                s3_data_type="S3Prefix",
                local_path="/opt/ml/processing/input",
                s3_input_mode="File"
            ),
        ),
        ProcessingInput(
            input_name="code",
            s3_input=ProcessingS3Input(
                s3_uri=pre_processing_code_s3_uri,
                s3_data_type="S3Prefix",
                local_path="/opt/ml/processing/code",
                s3_input_mode="File"
            ),
        )
    ],
    processing_output_config=ProcessingOutputConfig(
        outputs=[
            ProcessingOutput(
                output_name="train",
                s3_output=ProcessingS3Output(
                    s3_uri=processed_train_data_uri,
                    s3_upload_mode="EndOfJob",
                    local_path="/opt/ml/processing/output/train"
                )
            ),
            ProcessingOutput(
                output_name="validation",
                s3_output=ProcessingS3Output(
                    s3_uri=processed_validation_data_uri,
                    s3_upload_mode="EndOfJob",
                    local_path="/opt/ml/processing/output/validation"
                )
            ),
            ProcessingOutput(
                output_name="test",
                s3_output=ProcessingS3Output(
                    s3_uri=processed_test_data_uri,
                    s3_upload_mode="EndOfJob",
                    local_path="/opt/ml/processing/output/test"
                )
            )
        ]
    )
)

# Wait for the ProcessingJob to complete
processing_job.wait()
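
Conceptually, waiting for a job means polling its status until it reaches a terminal state. The sketch below illustrates that idea with a generic polling loop. It is not how sagemaker-core implements `wait()`; the function `wait_for_terminal_state`, its parameters and the injected `get_status` callable are assumptions made for this example.

```python
import time

# Terminal statuses for a processing job
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_state(get_status, poll_seconds=5, timeout_seconds=None, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status.

    Hypothetical helper: get_status is any zero-argument callable returning a status string.
    """
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence, so the example runs without any AWS calls
simulated_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_state(
    lambda: next(simulated_statuses), poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)
```

A job that ends in `Failed` or `Stopped` also ends the loop, so callers should check the returned status before using the job's outputs.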


## Train a classifier using XGBoost
Use SageMaker Training and the managed XGBoost image to train a classifier. <br />
More details on how to use SageMaker managed training with XGBoost can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).

<div class="alert alert-block alert-info">
  <b>NOTE:</b> For more information on SageMaker managed container images and how to retrieve their ECR paths, see the
  <a href="https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html" target="_blank">documentation</a>.
  The image URI may need to be updated for your selected AWS region.
</div>
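
Because the registry account ID in the image URI can differ between regions, one option is to keep the account IDs in a lookup table and fail loudly for regions that have not been configured. The sketch below assumes that the account ID used in this notebook belongs to `us-east-1`; the table `ACCOUNT_BY_REGION` and the function `build_image_uri` are hypothetical, and other regions would need to be filled in from the ECR paths documentation.

```python
# Assumption: the account ID used in this notebook is the us-east-1 registry account.
# Add entries for other regions from the ECR paths documentation.
ACCOUNT_BY_REGION = {
    "us-east-1": "683313688378",
}


def build_image_uri(region, repository, tag):
    """Build an ECR image URI for a region listed in ACCOUNT_BY_REGION. Hypothetical helper."""
    if region not in ACCOUNT_BY_REGION:
        raise ValueError(
            f"No registry account configured for region {region!r}; "
            "look it up in the ECR paths documentation"
        )
    account = ACCOUNT_BY_REGION[region]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


example_image_uri = build_image_uri("us-east-1", "sagemaker-xgboost", "1.7-1")
print(example_image_uri)
```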


In [None]:
image = f"683313688378.dkr.ecr.{region}.amazonaws.com/sagemaker-xgboost:1.7-1"

In [None]:
from sagemaker_core.resources import TrainingJob
from sagemaker_core.shapes import (
    AlgorithmSpecification,
    Channel,
    DataSource,
    S3DataSource,
    ResourceConfig,
    StoppingCondition,
    OutputDataConfig,
)

job_name = "xgboost-churn-" + time.strftime(
    "%Y-%m-%d-%H-%M-%S", time.gmtime()
)  # Name of training job
instance_type = "ml.m4.xlarge"  # SageMaker instance type to use for training
instance_count = 1  # Number of instances to use for training
volume_size_in_gb = 30  # Amount of storage to allocate to training job
max_runtime_in_seconds = 600  # Maximum runtime. The job is stopped if it doesn't finish before this
s3_output_path = f"s3://{bucket}"  # Bucket and optional prefix where the training job stores output artifacts, such as the model artifact

# Specify hyperparameters
hyper_parameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.8",
    "verbosity": "0",
    "objective": "binary:logistic",
    "num_round": "100",
}

# Create training job.
training_job = TrainingJob.create(
    training_job_name=job_name,
    hyper_parameters=hyper_parameters,
    algorithm_specification=AlgorithmSpecification(
        training_image=image, training_input_mode="File"
    ),
    role_arn=role,
    input_data_config=[
        Channel(
            channel_name="train",
            content_type="csv",
            data_source=DataSource(
                s3_data_source=S3DataSource(
                    s3_data_type="S3Prefix",
                    s3_uri=processed_train_data_uri,
                    s3_data_distribution_type="FullyReplicated",
                )
            ),
        ),
        Channel(
            channel_name="validation",
            content_type="csv",
            data_source=DataSource(
                s3_data_source=S3DataSource(
                    s3_data_type="S3Prefix",
                    s3_uri=processed_validation_data_uri,
                    s3_data_distribution_type="FullyReplicated",
                )
            ),
        ),
    ],
    output_data_config=OutputDataConfig(s3_output_path=s3_output_path),
    resource_config=ResourceConfig(
        instance_type=instance_type,
        instance_count=instance_count,
        volume_size_in_gb=volume_size_in_gb,
    ),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=max_runtime_in_seconds),
)

# Wait for the training job to complete, polling every 60 seconds without streaming logs
training_job.wait(poll=60, timeout=None, logs=False)
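
The `hyper_parameters` dictionary above passes every value as a string, since the training job API takes hyperparameters as a string-to-string map. If you prefer to keep typed values while experimenting, a small conversion step can be applied just before creating the job. This is a minimal sketch; `to_hyper_parameters` is a hypothetical helper name.

```python
def to_hyper_parameters(params):
    """Convert hyperparameter values to strings. Hypothetical helper."""
    return {name: str(value) for name, value in params.items()}


typed_params = {
    "max_depth": 5,
    "eta": 0.2,
    "objective": "binary:logistic",
    "num_round": 100,
}
string_params = to_hyper_parameters(typed_params)
print(string_params)
```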

## Use model artifacts for real-time inference
To use the model for real-time inference, we need to:

1. Create a SageMaker model from the model artifacts produced during training and the same first-party image we used for training. This image can also be used to run inference.
2. Create an `EndpointConfig` using the SageMaker model created in the previous step. The endpoint configuration specifies which SageMaker model to use and which instance type and count to host it on.
3. Create a SageMaker endpoint using `EndpointConfig` and other optional parameters.

More information about SageMaker Endpoints can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).

#### Create SageMaker Model

Create a Model resource based on the model artifacts produced by the training job.

In [None]:
from sagemaker_core.resources import Model
from sagemaker_core.shapes import ContainerDefinition

model_s3_uri = training_job.model_artifacts.s3_model_artifacts  # Get URI of model artifacts from the training job.

# Create SageMaker model: An image along with the model artifact to use.
customer_churn_model = Model.create(
    model_name="customer-churn-xgboost",
    primary_container=ContainerDefinition(image=image, model_data_url=model_s3_uri),
    execution_role_arn=role,
)

## Create endpoint configuration for real-time inference
To create a SageMaker endpoint we first create an `EndpointConfig`. The endpoint configuration specifies what SageMaker model to use.

In [None]:
from sagemaker_core.resources import Endpoint, EndpointConfig
from sagemaker_core.shapes import ProductionVariant

endpoint_config_name = "churn-prediction-endpoint-config"  # Name of endpoint configuration
model_name = customer_churn_model.get_name()  # Get name of SageMaker model created in previous step
endpoint_name = "customer-churn-endpoint"  # Name of SageMaker endpoint

endpoint_config = EndpointConfig.create(
    endpoint_config_name=endpoint_config_name,
    production_variants=[
        ProductionVariant(
            variant_name="AllTraffic",
            model_name=model_name,
            instance_type=instance_type,
            initial_instance_count=1,
        )
    ],
)

In [None]:
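# Create the endpoint from the configuration above, then wait until it is in service
sagemaker_endpoint = Endpoint.create(
    endpoint_name=endpoint_name, endpoint_config_name=endpoint_config_name
)
sagemaker_endpoint.wait_for_status(target_status="InService")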

#### Test live endpoint - with a sample record from test dataset

Let us download the test data from S3.

In [None]:
import boto3

s3 = boto3.client("s3")
# The test split is written under the "test" output prefix of the processing job
s3.download_file(Bucket=bucket,
                 Key="sagemaker-core-intro-blog/processing/output/test/test.csv",
                 Filename="test.csv")


#### Invoke the endpoint

Let us invoke the endpoint now.

In [None]:
# Pick a random record from the CSV and convert it to a string
df = pd.read_csv("test.csv", header=None)
sample = df.sample(1)
sample_payload = sample.to_csv(header=False, index=False).strip()

# Send sample payload to live endpoint and parse response
res = sagemaker_endpoint.invoke(body=sample_payload, content_type="text/csv")
result = res["Body"].read().decode("utf-8")
result = result.split("\n")[:-1]

# Show the model's prediction for the sampled record
df_result = pd.DataFrame(result).astype(float)
print(df_result)
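
With the `binary:logistic` objective, each returned value is a score between 0 and 1 that can be turned into a churn / no-churn label by applying a threshold. The sketch below assumes the response body is plain text with scores separated by newlines or commas; the function `parse_predictions`, the 0.5 threshold and the sample body are illustrative assumptions.

```python
def parse_predictions(body, threshold=0.5):
    """Parse a text response body into (score, predicted_churn) pairs. Hypothetical helper."""
    scores = [
        float(value)
        for line in body.splitlines()
        for value in line.split(",")
        if value.strip()  # skip empty fragments such as a trailing newline
    ]
    return [(score, score >= threshold) for score in scores]


sample_body = "0.12\n0.87\n"  # made-up scores for illustration
predictions = parse_predictions(sample_body)
print(predictions)
```

The threshold is a design choice: lowering it flags more customers as likely to churn at the cost of more false positives.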

## Clean up

In [None]:
sagemaker_endpoint.delete()
endpoint_config.delete()
customer_churn_model.delete()
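
If one of the deletions above fails, the remaining resources are left behind and keep incurring cost. One way to make clean-up more robust is to attempt every deletion and report the failures at the end. This is a minimal sketch that works with any objects exposing a `delete()` method; `delete_all` and the `FakeResource` stand-in are hypothetical and used only so the example runs without AWS calls.

```python
def delete_all(resources):
    """Call delete() on each resource in order and collect any failures. Hypothetical helper."""
    errors = []
    for resource in resources:
        try:
            resource.delete()
        except Exception as exc:  # keep going so later resources are still cleaned up
            errors.append((resource, exc))
    return errors


class FakeResource:
    """Stand-in for a SageMaker resource object."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.deleted = False

    def delete(self):
        if self.fail:
            raise RuntimeError(f"could not delete {self.name}")
        self.deleted = True


fake_resources = [
    FakeResource("endpoint"),
    FakeResource("config", fail=True),
    FakeResource("model"),
]
failures = delete_all(fake_resources)
print([(resource.name, str(exc)) for resource, exc in failures])
```

In the notebook, the list would hold the endpoint, the endpoint configuration and the model, in that order.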