## SageMaker SDK Demo

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you adapt this code for your own use cases!**

### Step 0: Get Admin Setup Results
Bucket names, CodeCommit repo, Docker image, IAM roles, ...

To keep things organized, we will save our `Source Code` (data processing and model training/serving scripts), `datasets`, trained `model(s) binaries`, and their `test-performance metrics` on S3, **versioned by the date and time of each update.**

In [None]:
from time import gmtime, strftime
import sagemaker
import json

# Grab admin resources (S3 Bucket name, IAM Roles and Docker Image for Training)
with open('admin_setup.txt', 'r') as filehandle:
    admin_setup = json.load(filehandle)

SOURCE_DATA = admin_setup["raw_data_path"]
BUCKET = admin_setup["project_bucket"]
REPO_NAME = admin_setup["repo_name"]
WORKFLOW_NAME = admin_setup["workflow_name"]
WORKFLOW_DATE_TIME = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
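
The versioning scheme described above can be sketched as follows. This is a minimal illustration, not part of the notebook: the helper name `versioned_prefix` and the bucket name are hypothetical, while the folder names (`data`, `source-code`, `model-artifacts`) are the ones used in later cells.

``` python
from time import gmtime, strftime


def versioned_prefix(bucket, stamp, artifact):
    """Build the S3 prefix for one artifact type under a timestamped run folder."""
    return "s3://{}/{}/{}".format(bucket, stamp, artifact)


# Hypothetical bucket name; in the notebook it is read from admin_setup.txt.
stamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
for artifact in ("data", "source-code", "model-artifacts"):
    print(versioned_prefix("example-project-bucket", stamp, artifact))
```

Because every artifact of one run shares the same timestamp, a dataset, the code that processed it, and the resulting model can be traced back to each other.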

### Test our data processing script locally

In [None]:
!pygmentize ./sagemaker-processing-src/processing.py

In [None]:
# -p creates parent directories and does not fail if the cell is re-run
!mkdir -p data/input data/train data/validation data/test
!cp boston.csv data/input/

In [None]:
%run -i ./sagemaker-processing-src/processing.py --local_path ./data/


!ls data/train/ 

### SageMaker Hosted Processing

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
role = sagemaker.get_execution_role()

INPUT_DESTINATION = SOURCE_DATA
OUTPUT_DESTINATION = 's3://{}/{}/data'.format(BUCKET, WORKFLOW_DATE_TIME)
PROCESSING_JOB_NAME = "dev-project-data-processing-{}".format(WORKFLOW_DATE_TIME)


sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1
                                    )

inputs = [ProcessingInput(source=INPUT_DESTINATION,
                          destination='/opt/ml/processing/input',
                          s3_data_distribution_type='ShardedByS3Key'
                         )
         ]

outputs = [ProcessingOutput(output_name='train',
                            destination='{}/train'.format(OUTPUT_DESTINATION),
                            source='/opt/ml/processing/train'
                           ),
           ProcessingOutput(output_name='validation',
                            destination='{}/validation'.format(OUTPUT_DESTINATION),
                            source='/opt/ml/processing/validation'
                           ),
           ProcessingOutput(output_name='test',
                            destination='{}/test'.format(OUTPUT_DESTINATION),
                            source='/opt/ml/processing/test'
                           )
          ]

sklearn_processor.run(code='./sagemaker-processing-src/processing.py',
                      job_name = PROCESSING_JOB_NAME,
                      inputs = inputs,
                      outputs = outputs
                     )
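
The same script runs both locally (with `--local_path ./data/`) and inside the hosted job (under `/opt/ml/processing`). One way a script could support both is sketched below. The contents of `processing.py` are not shown in this notebook, so the function `resolve_dirs` and its fallback behavior are assumptions for illustration only.

``` python
import argparse
import os

CONTAINER_ROOT = "/opt/ml/processing"  # root used by the hosted job's inputs/outputs


def resolve_dirs(argv=None):
    """Return input/train/validation/test directories for a local or hosted run."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_path", default=None)
    args, _ = parser.parse_known_args(argv)
    # Fall back to the container layout when no local path is given (assumed behavior).
    root = args.local_path if args.local_path else CONTAINER_ROOT
    names = ("input", "train", "validation", "test")
    return {name: os.path.join(root, name) for name in names}


print(resolve_dirs(["--local_path", "./data/"]))
print(resolve_dirs([]))
```

With this layout, the `source` paths of the `ProcessingOutput` objects above line up with the directories the script writes to.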

In [None]:
!aws s3 ls {"s3://{}/{}/data/".format(BUCKET, WORKFLOW_DATE_TIME)}

### Test training script locally

In [None]:
!pygmentize ./sagemaker-train-serve-src/train.py

In [None]:
!mkdir ./models

%run -i sagemaker-train-serve-src/train.py \
    --model-dir ./models \
    --train ./data/train/ \
    --validation ./data/validation/

### Build the SageMaker SKLearn Estimator

In [None]:
from sagemaker.sklearn.estimator import SKLearn
import sagemaker

# regex to extract our objective metric from training job logs
validation_metric_defs = [{'Name': 'median_ae',
                           'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"
                          }]

# Instantiate estimator
train_estimator = SKLearn(sagemaker_session = sagemaker.Session(),
                          role = role,
                          source_dir = './sagemaker-train-serve-src/',
                          entry_point = 'train.py',
                          instance_type = 'ml.m5.xlarge',
                          instance_count = 1,
                          framework_version = '0.23-1',
                          metric_definitions = validation_metric_defs,
                          output_path = 's3://{}/{}'.format(BUCKET, WORKFLOW_DATE_TIME + '/model-artifacts'),
                          code_location = 's3://{}/{}'.format(BUCKET, WORKFLOW_DATE_TIME + '/source-code')
                          )
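
To check a metric regex before launching a job, it can be applied to a sample log line with Python's `re` module. The log lines below are made up; the real format is whatever `train.py` prints, and the helper `extract_metric` is hypothetical.

``` python
import re

METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"


def extract_metric(log_line):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines for illustration.
print(extract_metric("AE-at-50th-percentile: 2.31"))
print(extract_metric("epoch finished"))
```

The single capture group is what gets recorded as the metric value, so the regex should contain exactly one group around the number.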

### And fit it

In [None]:
# S3 data paths
TRAINING_DATA_PATH = "s3://{}/{}/data/train/train.csv".format(BUCKET, WORKFLOW_DATE_TIME)
VALIDATION_DATA_PATH = "s3://{}/{}/data/validation/validation.csv".format(BUCKET, WORKFLOW_DATE_TIME)
TESTING_DATA_PATH = "s3://{}/{}/data/test/test.csv".format(BUCKET, WORKFLOW_DATE_TIME)


train_estimator.fit(job_name = "{}-{}-sdk".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
                    inputs = {"train" : TRAINING_DATA_PATH,
                              "validation" : VALIDATION_DATA_PATH
                             },
                    wait = True
                   )

### [Built-in](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) XGBoost Algorithms Example: no training code required

#### Note: For CSV training datasets, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. 

``` python
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

region = boto3.Session().region_name
# Built-in XGBoost container image (SageMaker Python SDK v2 API).
# The version is an example; pick one that is available in your region.
xgboost_container = image_uris.retrieve('xgboost', region, version='1.2-1')

# Initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",  # replaces the deprecated "reg:linear"
        "num_round":"50"
}

# Construct a SageMaker estimator that calls the xgboost-container
builtin_xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1, 
    instance_type='ml.m5.2xlarge', 
    volume_size=5,  # GB
    output_path='s3://{}/{}'.format(BUCKET, WORKFLOW_DATE_TIME + '/builtin-xgb-model-artifacts')
)


# Execute the XGBoost training job
train_input = TrainingInput("s3://...", content_type="csv")
validation_input = TrainingInput("s3://...", content_type="csv")
builtin_xgb_estimator.fit({'train': train_input, 'validation': validation_input}, wait=False)
```
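
The CSV layout mentioned in the note (target in the first column, no header row) can be produced with a small helper such as the sketch below. The function name `to_xgboost_csv` and the sample rows are illustrative only; `PRICE` is the target column name used later in this notebook.

``` python
import csv
import io


def to_xgboost_csv(csv_text, target):
    """Move the target column first and drop the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    t = header.index(target)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([row[t]] + row[:t] + row[t + 1:])
    return out.getvalue()


# Made-up sample with a header and the target in the last column.
sample = "CRIM,RM,PRICE\n0.1,6.5,24.0\n0.2,5.9,21.6\n"
print(to_xgboost_csv(sample, "PRICE"))
```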

### Hyperparameter optimization

In [None]:
from sagemaker.tuner import IntegerParameter

# Define exploration boundaries
hyperparameter_ranges = {
    'n-estimators': IntegerParameter(20, 100),
    'min-samples-leaf': IntegerParameter(2, 6)
}

# Create Optimizer
Optimizer = sagemaker.tuner.HyperparameterTuner(
    base_tuning_job_name="{}-tuner-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
    estimator=train_estimator,
    max_jobs=12,
    max_parallel_jobs=4,
    hyperparameter_ranges=hyperparameter_ranges,
    objective_metric_name='median-AE',
    objective_type='Minimize',
    metric_definitions=[
        {'Name': 'median-AE',
         'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"
        }] # extract tracked metric from logs with regexp   
)
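
In script mode, hyperparameters reach the entry point as command-line arguments, so the names in `hyperparameter_ranges` need to match arguments that `train.py` accepts. Since `train.py` is not shown here, the parser below is only a sketch of what that could look like; the default values are hypothetical.

``` python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the tuned hyperparameters the way a script-mode entry point could."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=50)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Ignore any other arguments (such as --model-dir) in this sketch.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--n-estimators", "80", "--min-samples-leaf", "4"])
print(args.n_estimators, args.min_samples_leaf)
```

Note that `argparse` turns the dashes in `--n-estimators` into underscores on the parsed namespace (`args.n_estimators`).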

In [None]:
# Fit Optimizer
Optimizer.fit({'train': TRAINING_DATA_PATH, 'validation': VALIDATION_DATA_PATH})

In [None]:
# Get Optimizer results in a df
results = Optimizer.analytics().dataframe()
results.sort_values("FinalObjectiveValue").head(10)

### SageMaker hosted endpoint

We can easily deploy a SageMaker model to production. A convenient option is a SageMaker hosted endpoint, which serves real-time predictions from the trained model (Batch Transform jobs are also available for asynchronous, offline predictions on large datasets). The endpoint retrieves the model artifact created during training and deploys it within a SageMaker SKLearn container. All of this can be accomplished with one line of code.

In [None]:
train_estimator.deploy(initial_instance_count=1,
                       instance_type="ml.m5.xlarge",
                       endpoint_name=WORKFLOW_NAME,
                       wait=False
                      )

### If hyperparameter optimization was used, we can deploy the best model from the HyperparameterTuner directly
By calling the deploy method of the HyperparameterTuner object we instantiated above, we can directly deploy the best model from the tuning job to a SageMaker hosted endpoint.

```python
Optimizer.deploy(initial_instance_count=1,
                 instance_type="ml.m5.xlarge",
                 endpoint_name = "dev-project-tuned-model"
                )
```

### Note on cost:
The above training job took 100 seconds on an `ml.m5.2xlarge`, which [costs](https://aws.amazon.com/sagemaker/pricing/) `$0.538` per hour. But let's assume that training the model took 30 minutes instead, that the data processing job took 60 minutes, and that both run four times per month.

Furthermore, if we assume that we will host our model on an `ml.c5.large` instance (2 vCPUs, 4 GiB memory), the cost per hour for this instance is `$0.119`. With that, our total monthly cost will be:

In [None]:
monthly_notebook_cost = 173.33*0.0582
print("monthly_notebook_cost = $" + str(monthly_notebook_cost))

processing_cost_per_sec = 0.538/3600
training_cost_per_sec = 0.538/3600
monthly_cycles = 4

processing_secs = 60*60
processing_job_cost = processing_cost_per_sec*processing_secs
processing_jobs_cost = monthly_cycles*processing_job_cost
print("processing_jobs_cost = $" + str(processing_jobs_cost))

training_secs = 60*30
training_job_cost = training_cost_per_sec*training_secs
training_jobs_cost = monthly_cycles*training_job_cost
print("training_jobs_cost = $" + str(training_jobs_cost))

hosting_cost = 0.119*24*30
print("hosting_cost = $" + str(hosting_cost))

total_monthly_cost = monthly_notebook_cost + processing_jobs_cost + training_jobs_cost + hosting_cost
print("total_monthly_cost = $" + str(round(total_monthly_cost,1)))
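
Written out with the figures from the cell above (amounts in USD, intermediate terms rounded to two decimals), the monthly total works out roughly as:

$$
\underbrace{173.33 \times 0.0582}_{\text{notebook}}
+ \underbrace{4 \times 1 \times 0.538}_{\text{processing}}
+ \underbrace{4 \times 0.5 \times 0.538}_{\text{training}}
+ \underbrace{0.119 \times 24 \times 30}_{\text{hosting}}
\approx 10.09 + 2.15 + 1.08 + 85.68 \approx 99.0
$$

Under these assumptions, hosting the always-on endpoint accounts for most of the monthly cost.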

### Let's test our endpoint

In [None]:
import time
import json
import boto3
import pandas as pd
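# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

sagemaker_client = boto3.client('sagemaker')
endpoint_status = "Creating"
while endpoint_status != "InService":
    endpoint_status = sagemaker_client.describe_endpoint(EndpointName=WORKFLOW_NAME)["EndpointStatus"]
    print(endpoint_status)
    # Stop polling if the deployment failed instead of looping forever
    if endpoint_status == "Failed":
        raise RuntimeError("Endpoint creation failed; check the endpoint's FailureReason")
    time.sleep(5)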

In [None]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['PRICE'] = data.target
print(df.shape)

sagemaker_runtime = boto3.client('sagemaker-runtime')
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=WORKFLOW_NAME,
    Body=df[data.feature_names].to_csv(header=False, index=False).encode('utf-8'),
    ContentType='text/csv')

decoded_response = json.loads(response['Body'].read().decode("utf-8"))
print(decoded_response[0:10])

### Native support for data-capture
```python
from sagemaker.model_monitor import DataCaptureConfig

data_capture_config = DataCaptureConfig(
                        enable_capture = True,
                        sampling_percentage=50,
                        destination_s3_uri='s3://.../',
                        kms_key_id=None,
                        capture_options=["REQUEST", "RESPONSE"],
                        csv_content_types=["text/csv"],
                        json_content_types=["application/json"]
)
```

Attach the new configuration to your endpoint...

```python
# SageMaker Python SDK v2: RealTimePredictor was renamed to Predictor
from sagemaker.predictor import Predictor

predictor = Predictor(endpoint_name="dev-project")
predictor.update_data_capture_config(data_capture_config=data_capture_config)
```

### Batch transform job (without a real-time endpoint)

```python
transformer = train_estimator.transformer(instance_count=1, instance_type='ml.m4.xlarge')
transformer.transform("<path to input data>", content_type='text/csv')

print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
batch_output = transformer.output_path
print(batch_output)
```
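
Batch transform writes one output object per input object, named after the input file with an `.out` suffix. The helper below sketches where to look for the predictions of a single input file; the function name, bucket, and prefixes are hypothetical, and it only covers the simple case of one object directly under the input prefix.

``` python
import posixpath


def transform_output_uri(input_uri, output_path):
    """Return the S3 URI where the predictions for one input object are expected."""
    filename = posixpath.basename(input_uri)
    return "{}/{}.out".format(output_path.rstrip("/"), filename)


# Hypothetical bucket and prefixes for illustration.
print(transform_output_uri("s3://example-bucket/run/data/test/test.csv",
                           "s3://example-bucket/run/batch-output"))
```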

#### Or launch transform jobs from anywhere using boto3
```python
sm_client = boto3.client("sagemaker")
response = sm_client.create_transform_job(
    TransformJobName = "{}-batch-transform-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
    ModelName = "{}-{}-sdk".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
    TransformInput = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': TESTING_DATA_PATH
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    },
    TransformOutput = {
        'S3OutputPath': "s3://{}/{}/data/test/test_pred.csv".format(BUCKET, WORKFLOW_DATE_TIME)
    },
    TransformResources = {
        'InstanceType': 'ml.m5.large',
        'InstanceCount': 1
    }
)
```

# The End

### Example of deploying models trained outside SageMaker

```python 
model_artifact_on_s3 = sm_client.describe_training_job(
    TrainingJobName=train_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
print('Model artifact persisted at ' + model_artifact_on_s3)

from sagemaker.sklearn.model import SKLearnModel
model = SKLearnModel(model_data=model_artifact_on_s3,
                     role=role,
                     source_dir = './sagemaker-train-serve-src/',
                     entry_point='train.py'
                    )

model.deploy(instance_type='ml.c5.large',
             initial_instance_count=1,
             endpoint_name='dev-project',
             wait=False
            )
```

### Example of Custom Inference Functions

```python
import argparse
import os
import sys
import numpy as np
import pandas as pd
import json
import csv
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler
from io import BytesIO
from xgboost import XGBRegressor

MODEL_NAME = 'octank_model.joblib'
TARGET_NORMALIZER_NAME = 'octank_target_normalizer.joblib'

def model_fn(model_dir):
    """Part of the SageMaker sklearn docker image, this function loads the serialized model.
    More information: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
    
    Args:
        model_dir: a directory where model is saved.
    Returns: 
        a tuple of two Scikit-learn models.
    """
    model = joblib.load(os.path.join(model_dir, MODEL_NAME))
    target_normalizer = joblib.load(os.path.join(model_dir, TARGET_NORMALIZER_NAME))
    return (model, target_normalizer)


def input_fn(input_data, content_type):
    """This function is called on the byte stream sent by the client and deserializes it
    into a Python object suitable for inference by predict_fn; in this case, a pandas DataFrame.
    More information: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html

    JSON payloads are expected to describe a single record (one value per feature name).
    CSV payloads are expected to carry the column names in their first line.

    Args:
        input_data: the request payload
        content_type: the MIME type of the payload
    Returns:
        a pandas DataFrame built from input_data.
    """    
    if content_type == 'application/json':
        # Read the raw input data as json.
        input_dict = json.loads(input_data)
        
        # Convert to Pandas DF
        input_df = pd.DataFrame(input_dict, index=[0])

        return input_df
    
    elif content_type == 'text/csv':
        data = csv.reader(input_data.splitlines()[1:])
        data = pd.DataFrame(data)
        data.columns = input_data.splitlines()[0].split(',')
        return data.replace("",0)
    else:
        raise ValueError("{} not supported by script!".format(content_type))


        
def predict_fn(input_data, models):
    """Part of the sklearn docker image, this function takes the deserialized request object
    and performs inference against the loaded model(s).
    More information: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html.

    Args:
        input_data: the return value from input_fn()
        models: tuple with loaded models from model_fn(). First element
                is the pipeline_model and the second is the target
                transformer    
    Returns:
        Predictions as a NumPy array.
    """
    # Parse models
    model, target_normalizer = models
    # Grab model feature list
    model_features = model.named_steps['algo'].get_booster().feature_names
    
    # Lower case feature names
    input_data.columns = [x.lower() for x in input_data.columns.values.tolist()]
    cols = input_data.columns.values.tolist()
    
    for f in model_features:
        if f not in cols:
            input_data[f] = 0
        else:
            input_data[f] = input_data[f].astype('float')
        
    # Predict
    input_data = input_data[model_features]
    
    y_pred_link = model.predict(input_data).reshape(-1,1)
    y_pred = target_normalizer.inverse_transform(y_pred_link).ravel()
    
    return y_pred


def _npy_dumps(data):
    """Serializes a numpy array into a stream of npy-formatted bytes."""
    buffer = BytesIO()
    np.save(buffer, data)
    return buffer.getvalue()


def output_fn(prediction_output, accept):
    """Serialize predictions as npy bytes; this is done for both accepted MIME types."""
    if accept == 'application/x-npy' or accept == 'application/json':
        print('output_fn input is', prediction_output, 'in format', accept)
        return _npy_dumps(prediction_output), 'application/x-npy'
    else:
        raise ValueError('Accept header must be application/x-npy or application/json, but it is {}'.format(accept))
```
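
Since `output_fn` above returns npy-formatted bytes, a client has to decode them back into an array. The round trip can be sketched as follows; `npy_loads` is a hypothetical client-side helper, and the prediction values are made up.

```python
from io import BytesIO

import numpy as np


def npy_dumps(data):
    """Serialize an array to npy-formatted bytes (same idea as _npy_dumps above)."""
    buffer = BytesIO()
    np.save(buffer, data)
    return buffer.getvalue()


def npy_loads(payload):
    """Deserialize npy-formatted bytes back into a NumPy array."""
    return np.load(BytesIO(payload))


# Made-up predictions standing in for the endpoint's response body.
predictions = np.array([24.0, 21.6, 34.7])
restored = npy_loads(npy_dumps(predictions))
print(restored)
```

With a real endpoint, the bytes passed to `npy_loads` would come from `response['Body'].read()` rather than from `npy_dumps`.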