# Inference Pipeline with Scikit-learn and Linear Learner

A short exercise in predicting on tabular data with a preprocessing container and a SageMaker built-in algorithm.
- Build a SageMaker pipeline that covers preprocessing and training.
- Use the Scikit-learn container to featurize the dataset.
- Run a batch transform job to preprocess the dataset, then train on the result.

Details follow the [sample page](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.ipynb).

In [None]:
# S3 prefix
s3_bucket = 'inference-pipeline-with-scikit-learn-and-linearlearner'
prefix = 'Scikit-LinearLearner-pipeline-abalone-example'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

In [2]:
!wget --directory-prefix=./abalone_data https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/abalone/abalone.csv

--2019-06-15 16:44:47--  https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/abalone/abalone.csv
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.224.48
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.224.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191873 (187K) [binary/octet-stream]
Saving to: ‘./abalone_data/abalone.csv.1’


2019-06-15 16:44:48 (773 KB/s) - ‘./abalone_data/abalone.csv.1’ saved [191873/191873]



In [3]:
WORK_DIRECTORY = 'abalone_data'

train_input = sagemaker_session.upload_data(
    path='{}/{}'.format(WORK_DIRECTORY, 'abalone.csv'), 
    bucket=s3_bucket,
    key_prefix='{}/{}'.format(prefix, 'train'))
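
For a single file, `upload_data` is expected to return the full S3 URI of the uploaded object, formed from the bucket, the key prefix, and the file name. The sketch below rebuilds that URI with plain string handling so it can be checked without calling AWS. `expected_upload_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
import posixpath


def expected_upload_uri(bucket, key_prefix, local_path):
    """Hypothetical helper: the S3 URI we expect upload_data to return for one file."""
    filename = posixpath.basename(local_path)
    return 's3://{}/{}/{}'.format(bucket, key_prefix, filename)


s3_bucket = 'inference-pipeline-with-scikit-learn-and-linearlearner'
prefix = 'Scikit-LinearLearner-pipeline-abalone-example'

uri = expected_upload_uri(
    s3_bucket,
    '{}/{}'.format(prefix, 'train'),
    'abalone_data/abalone.csv')
print(uri)
```

Comparing this value with `train_input` is a quick way to confirm the file landed under the intended prefix.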

In [4]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'sklearn_abalone_featurizer.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)
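
The entry-point script `sklearn_abalone_featurizer.py` is not reproduced in this notebook. As a rough illustration only, and assuming the script standardizes the seven numeric columns and one-hot encodes the sex column (categories `F`, `I`, `M`), one record would expand to 10 features, which matches the `feature_dim=10` passed to Linear Learner later. The sketch uses plain Python rather than scikit-learn; the function name and the mean and standard deviation values are made up for the example.

```python
SEX_CATEGORIES = ['F', 'I', 'M']  # assumed category set for the sex column


def featurize(record, means, stds):
    """Hypothetical stand-in for the featurizer: scale numerics, one-hot encode sex."""
    sex, numerics = record[0], record[1:]
    # Standardize each numeric column: (value - mean) / standard deviation.
    scaled = [(x - m) / s for x, m, s in zip(numerics, means, stds)]
    # One-hot encode the categorical column.
    one_hot = [1.0 if sex == category else 0.0 for category in SEX_CATEGORIES]
    return scaled + one_hot


record = ['M', 0.44, 0.365, 0.125, 0.516, 0.2155, 0.114, 0.155]
# Illustrative statistics only; a real featurizer would learn them from the training data.
means = [0.5] * 7
stds = [0.1] * 7

features = featurize(record, means, stds)
print(len(features))  # 10
```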

In [5]:
sklearn_preprocessor.fit({'train': train_input})

2019-06-15 16:44:49 Starting - Starting the training job...
2019-06-15 16:44:50 Starting - Launching requested ML instances......
2019-06-15 16:46:00 Starting - Preparing the instances for training...
2019-06-15 16:46:48 Downloading - Downloading input data
2019-06-15 16:46:48 Training - Downloading the training image...
2019-06-15 16:47:18 Uploading - Uploading generated training model.
2019-06-15 16:47:12,915 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-06-15 16:47:12,917 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-06-15 16:47:12,928 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-06-15 16:47:13,171 sagemaker-containers INFO     Module sklearn_abalone_featurizer does not provide a setup.py.
Generating setup.py
2019-06-15 16:47:13,171 sagemaker-containers INFO     Generating setup.cfg
2019-06-15 16:47:13,171 s

## Batch transform our training data

In [7]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv')

In [8]:
# Preprocess training input
transformer.transform(train_input, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path

Waiting for transform job: sagemaker-scikit-learn-2019-06-15-16-48-10-979
..........................................!
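
Batch transform writes each result object under the transformer's output path, named after the input object with `.out` appended. A small sketch of that naming rule, handy for locating the preprocessed file. `transform_output_uri` is a hypothetical helper and the paths below are placeholders, not values from this run.

```python
import posixpath


def transform_output_uri(output_path, input_uri):
    """Hypothetical helper: where batch transform is expected to write the result for one input object."""
    input_name = posixpath.basename(input_uri)
    return '{}/{}.out'.format(output_path.rstrip('/'), input_name)


# Placeholder values for illustration only.
example_output_path = 's3://example-bucket/example-transform-job'
example_input_uri = 's3://example-bucket/train/abalone.csv'

example_result = transform_output_uri(example_output_path, example_input_uri)
print(example_result)
```

Because the training channel defined later uses `s3_data_type='S3Prefix'`, passing the output path itself is enough; every object under it is read.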


## Fit a LinearLearner Model with the preprocessed data

In [9]:
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
ll_image = get_image_uri(boto3.Session().region_name, 'linear-learner')

In [10]:
s3_ll_output_key_prefix = "ll_training_output"
s3_ll_output_location = 's3://{}/{}/{}/{}'.format(s3_bucket, prefix, s3_ll_output_key_prefix, 'll_model')

ll_estimator = sagemaker.estimator.Estimator(
    ll_image,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.2xlarge',
    train_volume_size=20,
    train_max_run=3600,
    input_mode='File',
    output_path=s3_ll_output_location,
    sagemaker_session=sagemaker_session)

ll_estimator.set_hyperparameters(feature_dim=10, predictor_type='regressor', mini_batch_size=32)

ll_train_data = sagemaker.session.s3_input(
    preprocessed_train, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')

data_channels = {'train': ll_train_data}
ll_estimator.fit(inputs=data_channels, logs=True)

2019-06-15 16:52:35 Starting - Starting the training job...
2019-06-15 16:52:37 Starting - Launching requested ML instances......
2019-06-15 16:53:40 Starting - Preparing the instances for training......
2019-06-15 16:54:58 Downloading - Downloading input data
2019-06-15 16:54:58 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
[06/15/2019 16:55:16 INFO 139842181826368] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_method': u'uniform', u'init_sigma': u'0.01', u'lr_scheduler_minimum_lr': u'auto', u'target_recall

## Serial Inference Pipeline with Scikit preprocessor and Linear Learner

In [11]:
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
import boto3
from time import gmtime, strftime

timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Wrap each trained estimator as a deployable model.
scikit_learn_inference_model = sklearn_preprocessor.create_model()
linear_learner_model = ll_estimator.create_model()

model_name = 'inference-pipeline-' + timestamp_prefix
endpoint_name = 'inference-pipeline-ep-' + timestamp_prefix
# Containers run in list order: featurize first, then predict.
sm_model = PipelineModel(
    name=model_name,
    role=role,
    models=[
        scikit_learn_inference_model,
        linear_learner_model])

sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)

---------------------------------------------------------------------------------------------------!

## Make a request to our pipeline endpoint

In [13]:
from sagemaker.predictor import csv_serializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV, CONTENT_TYPE_JSON

# One raw abalone record without the target: sex followed by seven numeric measurements.
payload = 'M, 0.44, 0.365, 0.125, 0.516, 0.2155, 0.114, 0.155'
# Known ring count for this record, kept for comparison with the prediction.
actual_rings = 10

# Send CSV to the endpoint and ask for a JSON response.
predictor = RealTimePredictor(
    endpoint=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=csv_serializer,
    content_type=CONTENT_TYPE_CSV,
    accept=CONTENT_TYPE_JSON)

print(predictor.predict(payload))

b'{"predictions": [{"score": 9.528051376342773}]}'
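
The endpoint returns raw JSON bytes, so the score has to be parsed before it can be compared with `actual_rings`. A minimal sketch using the response printed above, copied in as a literal so the snippet runs without a live endpoint:

```python
import json

# Response bytes copied from the output above.
response = b'{"predictions": [{"score": 9.528051376342773}]}'
actual_rings = 10

# Decode the bytes, then pull the score of the first (and only) prediction.
predicted_rings = json.loads(response.decode('utf-8'))['predictions'][0]['score']
absolute_error = abs(predicted_rings - actual_rings)

print('predicted: {:.2f}, actual: {}, absolute error: {:.2f}'.format(
    predicted_rings, actual_rings, absolute_error))
# prints "predicted: 9.53, actual: 10, absolute error: 0.47"
```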


## Delete Endpoint

In [15]:
sm_client = sagemaker_session.boto_session.client('sagemaker')
sm_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '853915b2-17b6-429f-85e2-93a647a58927',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '853915b2-17b6-429f-85e2-93a647a58927',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Sat, 15 Jun 2019 17:10:25 GMT'},
  'RetryAttempts': 0}}