# SageMaker BYOD

An example workbook that walks through how to bring a custom model, written in a framework of your choice, to SageMaker as a Docker image.

We use the scikit-learn K-Means algorithm as an example of unsupervised clustering.

**Note:** SageMaker now includes a pre-built scikit-learn container. We recommend using the pre-built container in almost all cases that require a scikit-learn algorithm.


### Load modules

We start by loading the required Python modules. Next, we use `get_execution_role` from the SageMaker Python SDK to get the IAM role, and then create a SageMaker session.

In [None]:
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

import sagemaker as sage
from time import gmtime, strftime

from sagemaker import s3


In [None]:
role = get_execution_role()
sess = sage.Session()

### Setup S3 buckets

Define the input and output locations. The input location contains the raw data. The output location will hold the processed data and the model artifact generated after training.

In [None]:
# Replace {bucketname} and {prefix} with your own values; keep the trailing slash
output_path = 's3://{bucketname}/{prefix}/'
input_path = 's3://{bucketname}/{prefix}/'
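
The notebook later appends file names directly to `output_path`, so each location needs to end with a slash. A minimal sketch of a helper that builds such URIs is shown below; `s3_uri` and the bucket and prefix names are hypothetical and not part of the SageMaker SDK.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI ending in '/'."""
    # Strip stray slashes so 'data/' and '/data' both produce a clean path.
    cleaned = [p.strip('/') for p in parts if p.strip('/')]
    return 's3://' + '/'.join([bucket.strip('/')] + cleaned) + '/'


# Hypothetical names; substitute your own bucket and prefix.
input_example = s3_uri('my-example-bucket', 'kmeans/raw')
output_example = s3_uri('my-example-bucket', 'kmeans/output/')
print(input_example)
print(output_example)
```

Normalizing the slashes in one place avoids URIs such as `s3://bucket/prefixprocessed_data.csv` when a prefix is typed without its trailing slash.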

### Get raw data
Since this is a simple dataset, we can preprocess it and extract features on the local SageMaker instance. Start by copying the raw data from the input location in S3 to the local SageMaker instance.

In [None]:
raw_data = pd.read_csv('./mall.csv')

In [None]:
# Keep the columns at positions 3 and 4 (zero-based) as a NumPy array of features
train_data = raw_data.iloc[:, [3, 4]].values

### Upload to S3
After preprocessing and feature extraction, we save the result as a CSV file and upload it back to S3.

In [None]:
# Convert the NumPy array to CSV
np.savetxt('./processed_data.csv', train_data, delimiter=',',fmt='%d')
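
Before uploading, a quick sanity check can catch a stray header row or a non-integer value in the file. The sketch below uses only the standard library. `check_processed_csv` is a hypothetical helper, and the demo writes a small stand-in file with made-up values to a temporary directory instead of reading the notebook's `processed_data.csv`.

```python
import csv
import os
import tempfile


def check_processed_csv(path, expected_columns=2):
    """Return the row count after verifying every row holds integer fields."""
    rows = 0
    with open(path, newline='') as handle:
        for line_number, row in enumerate(csv.reader(handle), start=1):
            if len(row) != expected_columns:
                raise ValueError('line {}: expected {} fields, got {}'.format(
                    line_number, expected_columns, len(row)))
            for field in row:
                int(field)  # raises ValueError on header text or non-integers
            rows += 1
    return rows


# Demo file with made-up values, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'processed_demo.csv')
with open(demo_path, 'w', newline='') as handle:
    handle.write('15,39\n16,81\n17,6\n')
print(check_processed_csv(demo_path))
```

To check the real file, call `check_processed_csv('./processed_data.csv')` after the `np.savetxt` cell has run.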

In [None]:
# Upload the processed csv to S3
s3.S3Uploader.upload('processed_data.csv',output_path, kms_key=None)

### Train

Create a SageMaker estimator with custom hyperparameters and start the training job.

In [None]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sage-kmeans:latest'.format(account, region)
model = sage.estimator.Estimator(image,
                       role, 
                       train_instance_count=1, 
                       train_instance_type='ml.m5.large',
                       output_path=output_path,
                       hyperparameters={'n_clusters': 5},
                       sagemaker_session=sess)



In [None]:
model.fit(output_path+'processed_data.csv')
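
The training code inside the `sage-kmeans` image is not shown in this notebook. As a rough sketch, a training script in a custom container could pick up the `n_clusters` hyperparameter as shown below. SageMaker writes hyperparameters to `/opt/ml/input/config/hyperparameters.json` with every value serialized as a string. The function name `load_n_clusters` is hypothetical, and the fallback of 8 mirrors the scikit-learn `KMeans` default.

```python
import json
import os
import tempfile

# Location where SageMaker places hyperparameters inside a training container.
HYPERPARAMETERS_PATH = '/opt/ml/input/config/hyperparameters.json'


def load_n_clusters(path=HYPERPARAMETERS_PATH, default=8):
    """Read n_clusters from the hyperparameters file, or return the default."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        params = json.load(handle)
    # Values arrive as strings, for example {"n_clusters": "5"}.
    return int(params.get('n_clusters', default))


# Local demo: write a stand-in file instead of relying on the container path.
demo_path = os.path.join(tempfile.mkdtemp(), 'hyperparameters.json')
with open(demo_path, 'w') as handle:
    json.dump({'n_clusters': '5'}, handle)
print(load_n_clusters(demo_path))
```

Taking the path as an argument keeps the function testable outside the container, where `/opt/ml` does not exist.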

### Deploy

Create a deployment endpoint for real-time inference.

In [None]:
from sagemaker.predictor import csv_serializer
predictor = model.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)
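
With `csv_serializer`, rows passed to `predict` are sent to the endpoint as comma-separated text. The sketch below approximates that request body using only the standard library. It is an illustration, not the SDK's implementation, and `rows_to_csv` is a hypothetical name.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of rows into a text/csv request body."""
    buffer = io.StringIO()
    # Use '\n' rather than the csv module's default '\r\n' line ending.
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue().rstrip('\n')


# Two made-up feature rows.
payload = rows_to_csv([[15, 39], [16, 81]])
print(payload)
```

Knowing the shape of the payload helps when writing the parsing code on the container side, which must split the body on newlines and commas.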

### Inference

Run inference against the endpoint.

In [None]:
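#predictor.predict(train_data.values).decode('utf-8')
# The processed CSV has no header row, so stop pandas from treating the first row as one
data = pd.read_csv(output_path+'processed_data.csv', header=None)
predictor.predict(data.values)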

The following 5 cells have been commented out because they are not used in this example. They are kept as future to-do items.

In [None]:
# runtime = boto3.Session().client('sagemaker-runtime')
# endpoint='sage-kmeans-2020-03-28-01-25-29-970'
# import io
# from io import StringIO
# test_file = io.StringIO()
# train_data.to_csv(test_file)
# response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType='text/csv', Body=test_file.getvalue())
# type(response)

In [None]:
# import json
# result = json.loads(response['Body'].read().decode())
# print (result)

In [None]:
# print(response['Body'].read().decode())

In [None]:
# response_payload = json.loads(response['Body'].read().decode("utf-8"))

# print ("response_payload: {}".format(response_payload))

In [None]:
# !curl https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/sage-kmeans-2020-03-28-01-25-29-970/invocations

### Clean Up

After running inference, it is important to delete any endpoints that are no longer needed to avoid unnecessary charges.

In [None]:
sess.delete_endpoint(predictor.endpoint)