# Amazon SageMaker Batch Transform examples
Use [Amazon SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) when you need to do the following:

* Preprocess datasets to remove noise or bias that interferes with training or inference.
* Get inferences from large datasets.
* Run inference when you don't need a persistent endpoint.
* Associate input records with inferences to assist the interpretation of results.

**Caution:** This is reference material and is not to be used in production as-is. Test it thoroughly with your own data; we are not liable for any loss of data.

To filter input data before performing inferences, or to associate input records with inferences about those records, see "Associate Prediction Results with Input Records". For example, you can filter input data to provide context for creating and interpreting reports about the output data.

For applications that require consistently low inference latency, a traditional endpoint is still the best choice.

To split input files into mini-batches when you create a batch transform job, set the SplitType parameter value to Line. If SplitType is set to None or if an input file can't be split into mini-batches, SageMaker uses the entire input file in a single request. Note that Batch Transform doesn't support CSV-formatted input that contains embedded newline characters. You can control the size of the mini-batches by using the BatchStrategy and MaxPayloadInMB parameters. MaxPayloadInMB must not be greater than 100 MB. If you specify the optional MaxConcurrentTransforms parameter, then the value of (MaxConcurrentTransforms * MaxPayloadInMB) must also not exceed 100 MB.
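The two size limits described above can be checked locally before a job is submitted. The sketch below only encodes the 100 MB rules as stated in the previous paragraph; `check_payload_limits` is a hypothetical helper written for this illustration and is not part of the SageMaker SDK.

```python
MAX_TOTAL_PAYLOAD_MB = 100  # limit quoted in the paragraph above


def check_payload_limits(max_payload_mb, max_concurrent_transforms=None):
    """Return True if the settings respect the limits described above.

    Hypothetical helper for illustration; not a SageMaker API.
    """
    if max_payload_mb < 0 or max_payload_mb > MAX_TOTAL_PAYLOAD_MB:
        return False
    if max_concurrent_transforms is not None:
        # The product of the two parameters must also stay within 100 MB.
        return max_concurrent_transforms * max_payload_mb <= MAX_TOTAL_PAYLOAD_MB
    return True


print(check_payload_limits(6, 8))   # 8 * 6 = 48 MB, within the limit
print(check_payload_limits(50, 4))  # 4 * 50 = 200 MB, over the limit
```

The first call uses the same values (`max_payload=6`, `max_concurrent_transforms=8`) that appear in the transformer cells later in this notebook.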

If the batch transform job successfully processes all of the records in an input file, it creates an output file with the same name and the .out file extension. For multiple input files, such as input1.csv and input2.csv, the output files are named input1.csv.out and input2.csv.out. The batch transform job stores the output files in the specified location in Amazon S3, such as s3://awsexamplebucket/output/.
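As a small illustration of the naming rule above, the following sketch derives the expected output location for a single input file. `expected_output_uri` is a hypothetical helper, the `input/` prefix is an assumed example path, and the function only manipulates strings; it does not call AWS.

```python
import posixpath


def expected_output_uri(input_s3_uri, output_s3_prefix):
    """Build the S3 URI where the result for one input file should appear.

    Hypothetical helper: it applies only the naming rule described above
    (same file name plus the .out extension).
    """
    file_name = posixpath.basename(input_s3_uri)
    return output_s3_prefix.rstrip("/") + "/" + file_name + ".out"


print(expected_output_uri("s3://awsexamplebucket/input/input1.csv",
                          "s3://awsexamplebucket/output/"))
```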



![](./cw_charts/BatchTransform.png)


### Contents

1. [Generate synthetic data for housing models](#Generate-synthetic-data-for-housing-models)
1. [Transform the raw housing data using a scikit-learn model](#Preprocess-synthetic-housing-data-using-scikit-learn)
1. [Clean up](#CleanUp)


## Section 1 - Generate synthetic data for housing models <a id='Generate-synthetic-data-for-housing-models'></a>

In this section, you will generate synthetic data that will be used to train the linear learner models. The data consists of six numerical features (the year the house was built, house size in square feet, number of bedrooms, number of bathrooms, lot size, and number of garage spaces) and two categorical features (deck and front_porch).
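The cells that follow read the data from CSV files. As a rough idea of how records with these eight features and a price could be produced, here is a minimal sketch using only the standard library. The column names match the ones used later in this notebook, but every value range and the pricing rule are assumptions made up for this illustration, not the generator that produced the workshop data.

```python
import random

COLUMNS = ['YEAR_BUILT', 'SQUARE_FEET', 'NUM_BEDROOMS', 'NUM_BATHROOMS',
           'LOT_ACRES', 'GARAGE_SPACES', 'FRONT_PORCH', 'DECK', 'PRICE']


def make_house(rng):
    """Create one synthetic record; every range below is an assumption."""
    year_built = rng.randint(1970, 2020)
    square_feet = rng.uniform(1000.0, 4000.0)
    num_bedrooms = rng.randint(2, 6)
    num_bathrooms = rng.choice([1.0, 1.5, 2.0, 2.5, 3.0])
    lot_acres = round(rng.uniform(0.5, 1.5), 2)
    garage_spaces = rng.randint(0, 3)
    front_porch = rng.choice(['y', 'n'])
    deck = rng.choice(['y', 'n'])
    # Made-up linear pricing rule, used only to give PRICE some structure.
    price = int(100 * square_feet + 20000 * num_bedrooms
                + 15000 * num_bathrooms + 10000 * garage_spaces)
    return [year_built, square_feet, num_bedrooms, num_bathrooms,
            lot_acres, garage_spaces, front_porch, deck, price]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
houses = [make_house(rng) for _ in range(5)]
for house in houses:
    print(dict(zip(COLUMNS, house)))
```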

In [3]:
import numpy as np
import pandas as pd
import json
import datetime
import time
import boto3
import sagemaker
import os

from time import gmtime, strftime
from random import choice

from sagemaker import get_execution_role

from sagemaker.multidatamodel import MULTI_MODEL_CONTAINER_MODE
from sagemaker.multidatamodel import MultiDataModel

from sklearn.model_selection import train_test_split

In [4]:
sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')
sagemaker_session = sagemaker.Session()

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

BUCKET  = sagemaker_session.default_bucket()
print("BUCKET : ", BUCKET)

role = get_execution_role()
print("ROLE : ", role)

ACCOUNT_ID = boto3.client('sts').get_caller_identity()['Account']
REGION = boto3.Session().region_name

DATA_PREFIX = 'DEMO_MME_LINEAR_LEARNER'
HOUSING_MODEL_NAME = 'housing'
MULTI_MODEL_ARTIFACTS = 'multi_model_artifacts'

BUCKET :  sagemaker-us-east-1-622343165275
ROLE :  arn:aws:iam::622343165275:role/service-role/AmazonSageMaker-ExecutionRole-20220208T115633


In [68]:
loc='NewYork_NY'

In [69]:
columns_data=['YEAR_BUILT', 'SQUARE_FEET', 'NUM_BEDROOMS', 'NUM_BATHROOMS','LOT_ACRES', 'GARAGE_SPACES', 'FRONT_PORCH', 'DECK', 'PRICE']  


In [70]:
# Training data is saved at data/NewYork_NY/train/train.csv
# Test data is saved at data/NewYork_NY/test/test.csv

train_pd = pd.read_csv(f'./data/{loc}/train/train.csv')   
train_pd.columns=columns_data
train_pd.head(5)

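|   | YEAR_BUILT | SQUARE_FEET | NUM_BEDROOMS | NUM_BATHROOMS | LOT_ACRES | GARAGE_SPACES | FRONT_PORCH | DECK | PRICE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1979 | 3746.347409 | 3 | 2.5 | 1.27 | 1 | n | n | 463502 |
| 1 | 1990 | 3781.009326 | 4 | 2.5 | 1.2 | 3 | n | y | 562651 |
| 2 | 1990 | 3443.984535 | 3 | 2.5 | 0.91 | 3 | n | n | 497747 |
| 3 | 1999 | 3777.606002 | 6 | 1.0 | 1.05 | 2 | y | y | 607390 |
| 4 | 2018 | 2191.11462 | 3 | 3.0 | 0.91 | 3 | y | n | 477317 |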


In [71]:
# Upload the model tarball to S3
s3_model_loc = sagemaker.s3.S3Uploader().upload(
    local_path=f'models/batch/{loc}/model.tar.gz',
    desired_s3_uri=f's3://{BUCKET}/inferencewk/models/batch/{loc}',
    sagemaker_session=sagemaker_session
)
print(s3_model_loc)


s3://sagemaker-us-east-1-622343165275/inferencewk/models/batch/NewYork_NY/model.tar.gz


In [72]:
# Upload the raw data set to inferencewk/data/raw/<loc>
train_inputs = [""]
test_inputs = [""]

s3_train_loc = sagemaker.s3.S3Uploader().upload(
    local_path=f'data/{loc}/train/train.csv',
    desired_s3_uri=f's3://{BUCKET}/inferencewk/data/raw/{loc}/train',
    sagemaker_session=sagemaker_session
)
print(f"train:loc:uploaded:{s3_train_loc}:WITHOUT HEADERS:")
train_inputs[0] = s3_train_loc

s3_test_loc = sagemaker.s3.S3Uploader().upload(
    local_path=f'data/{loc}/test/test.csv',
    desired_s3_uri=f's3://{BUCKET}/inferencewk/data/raw/{loc}/test',
    sagemaker_session=sagemaker_session
)
print(f"test:loc:uploaded:{s3_test_loc}:WITHOUT HEADERS")
test_inputs[0] = s3_test_loc

print(train_inputs)

train:loc:uploaded:s3://sagemaker-us-east-1-622343165275/inferencewk/data/raw/NewYork_NY/train/train.csv:WITHOUT HEADERS:
test:loc:uploaded:s3://sagemaker-us-east-1-622343165275/inferencewk/data/raw/NewYork_NY/test/test.csv:WITHOUT HEADERS
['s3://sagemaker-us-east-1-622343165275/inferencewk/data/raw/NewYork_NY/train/train.csv']


In [73]:
train_inputs[0]

's3://sagemaker-us-east-1-622343165275/inferencewk/data/raw/NewYork_NY/train/train.csv'

# Start Batch Transform

#### Create Transformer directly from Estimator

You can create the Transformer directly from the trained Estimator, but this lab does not demonstrate that approach.


## Section 2 - Bring your own model
Here we work from the tar file created by the training job, create all the needed job definitions from scratch, and run the transform job. We will run the following kinds of job:

* Run without an input filter, so the job only generates predictions.
* Run with input filter values, so the job generates predictions and also joins them to the input records.
* Run with mini-batches and an instance count greater than 1.

#### Section 2b - Bring your own model as a tarball in S3
Here we use the tarball as is and create all the required artifacts from scratch. First, we upload the model tarball to S3 so that the Transformer can use it.

In [74]:
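# The model tarball was uploaded above; keep its S3 URI in a list for the cells that follow
batch_model_inputs = [s3_model_loc]
print(f"Using location of the TAR ball from {batch_model_inputs[0]}")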

Using location of the TAR ball from s3://sagemaker-us-east-1-622343165275/inferencewk/models/batch/NewYork_NY/model.tar.gz


The raw data set and the model tarball were both uploaded to S3 in Section 1.

#### Create the batch jobs

The steps to create the job are straightforward:
* Create a Model object from the S3 location with the container image
* Create a Transformer
* Run the Transform with the S3 input location


In [75]:
# Retrieve the container image
container = sagemaker.image_uris.retrieve(
    region=boto3.Session().region_name,
    framework="sklearn",
    version="0.20.0",  # or "0.23-1"
)

In [76]:
print(f"using the model from s3={batch_model_inputs[0]}:")

using the model from s3=s3://sagemaker-us-east-1-622343165275/inferencewk/models/batch/NewYork_NY/model.tar.gz:


In [77]:
from sagemaker.sklearn.model import SKLearnModel

sklearn_model_name = "DEMO-BATCH-SKLEARN-BYO-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

sklearn_model = SKLearnModel(
    name=sklearn_model_name,
    model_data=batch_model_inputs[0],  # S3 URI of model.tar.gz
    role=role,
    sagemaker_session=sagemaker_session,
    entry_point="scripts/sklearn_preprocessor_batch.py",
    framework_version="0.20.0",  # same version as the container image retrieved above
    image_uri=container,
)
print(sklearn_model.source_dir)
print(sklearn_model.entry_point)

None
scripts/sklearn_preprocessor_batch.py


#### Option 1 -- create transformer from Model without Registering

In [78]:
batch_transformer = sklearn_model.transformer(
        instance_count=1,
        instance_type='ml.m4.xlarge',
        assemble_with='Line',
        accept='text/csv'
)
 

In [79]:
batch_transformer.transform(
    train_inputs[0], 
    content_type='text/csv', 
    wait=True, 
    logs=True
)

................................
Processing /opt/ml/code
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: sklearn-preprocessor-batch
  Building wheel for sklearn-preproce

In [31]:
batch_transformer.output_path

's3://sagemaker-us-east-1-622343165275/DEMO-BATCH-SKLEARN-BYO-2022-09-30-02-02-2022-09-30-02-02-53-957'

#### Now we list the data and review it

In [32]:
!mkdir -p ./data/temp_output


In [33]:
# Download the output file
sagemaker.s3.S3Downloader().download(s3_uri=f'{batch_transformer.output_path}/train.csv.out', local_path='./data/temp_output', sagemaker_session=sagemaker_session)
# Read the file back from the directory it was downloaded to
output_df = pd.read_csv(filepath_or_buffer='./data/temp_output/train.csv.out', header=None)
output_df.head(5)


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 453183.0 | 1.312838 | -0.567933 | 0.696936 | -0.723305 | -2.451263 | 1.333814 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 373648.0 | -1.822817 | 0.264735 | 0.696936 | -0.723305 | 0.297674 | -1.312254 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 415434.0 | 1.214848 | -0.867431 | 1.403289 | 1.367619 | 1.061268 | -1.312254 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 458912.0 | -0.059011 | -0.374606 | 0.696936 | 1.367619 | 0.412213 | 1.333814 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 320228.0 | -0.940914 | -0.966433 | 0.696936 | -0.02633 | -2.489443 | 1.333814 | 0.0 | 1.0 | 1.0 | 0.0 |


#### Option 2 -- register the Model and create batch predictions

In [34]:
model_registry_package = sklearn_model.register(
    content_types=['text/csv'],
    response_types=['application/json'],
    inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
    transform_instances=['ml.m5.xlarge'],
    model_package_group_name="test-model-ackage",
    approval_status="Approved"
)

In [35]:
model_registry_package

<sagemaker.model.ModelPackage at 0x7f6d30a12710>

In [36]:
registry_transformer = model_registry_package.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    assemble_with='Line',
    accept='text/csv',
    max_concurrent_transforms=8,
    strategy="MultiRecord",
    max_payload=6,
)
registry_transformer

<sagemaker.transformer.Transformer at 0x7f6d30bf0438>
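With `strategy="MultiRecord"` and a line split type, SageMaker fits as many whole records as it can into each request, up to the `max_payload` limit. The sketch below imitates that idea locally with a simple greedy grouping. `pack_mini_batches` is a hypothetical helper, the sample records are shortened values, and the logic is a simplified model for intuition only, not the service's actual implementation.

```python
def pack_mini_batches(lines, max_payload_bytes):
    """Greedily group whole lines into batches no larger than the limit.

    Hypothetical, simplified model of the MultiRecord strategy.
    """
    batches, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if size > max_payload_bytes:
            raise ValueError("a single record exceeds the payload limit")
        if current and current_size + size > max_payload_bytes:
            # The next record would not fit, so close the current batch.
            batches.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Shortened sample records, 13 characters each (14 bytes with the newline).
records = ["1979,3746.3,3", "1990,3781.0,4", "1990,3443.9,3", "1999,3777.6,6"]
batches = pack_mini_batches(records, max_payload_bytes=30)
for batch in batches:
    print(batch)
```

With a 30-byte limit, two 14-byte records fit in each request, so the four records are sent as two mini-batches.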

In [38]:
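registry_transformer.transform(
    train_inputs[0],
    content_type='text/csv',
    input_filter=None,         # send every input column to the model
    join_source="Input",       # append each output record to its input record
    output_filter='$[0,-11]',  # keep the first joined column and the 11th from the end
    split_type='Line',
    wait=True,
    logs=True
)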

..............................
Processing /opt/ml/code
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: sklearn-preprocessor-batch
  Building wheel for sklearn-preprocessor-batch (setup.py): started
  Building wheel for sklearn-preprocessor-batch (setup.py): finished with status 'done'
  Created wheel for sklearn-preprocessor-batch: filename=sklearn_preprocessor_batch-1.0.0-py2.py3-none-any.whl size=7596 sha256=b29e98c88ffd448eb95460377fc08bb74a003250116dc948cc113e622c170974
  Stored in directory: /tmp/pip-ephem-wheel-cache-19j02rkk/wheels/3e/0f/51/2f1df833dd0412c1bc2f5ee56baac195b5be563353d1

#### Now we list the data and view it
**Replace the object below with the transformer you actually ran: `batch_transformer` or `registry_transformer`.**

In [39]:
# Replace with the Transformer which was actually run
out_file_name = '{}/train.csv.out'.format(batch_transformer.output_path)
out_file_name 

's3://sagemaker-us-east-1-622343165275/DEMO-BATCH-SKLEARN-BYO-2022-09-30-02-02-2022-09-30-02-02-53-957/train.csv.out'

In [40]:
# - download the file
sagemaker.s3.S3Downloader().download(s3_uri=out_file_name, local_path='./data/temp_output', sagemaker_session=sagemaker_session)  
output_df = pd.read_csv(filepath_or_buffer='./data/temp_output/train.csv.out', header=None)
output_df.head(5)


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 453183.0 | 1.312838 | -0.567933 | 0.696936 | -0.723305 | -2.451263 | 1.333814 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 373648.0 | -1.822817 | 0.264735 | 0.696936 | -0.723305 | 0.297674 | -1.312254 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 415434.0 | 1.214848 | -0.867431 | 1.403289 | 1.367619 | 1.061268 | -1.312254 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 458912.0 | -0.059011 | -0.374606 | 0.696936 | 1.367619 | 0.412213 | 1.333814 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 320228.0 | -0.940914 | -0.966433 | 0.696936 | -0.02633 | -2.489443 | 1.333814 | 0.0 | 1.0 | 1.0 | 0.0 |


#### Optional -- advanced options for Batch Transform

#### Run the Transformer, joining the output to the input data and keeping the column 'YEAR_BUILT'

In [42]:
batch_transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv',
    max_concurrent_transforms=8,
    strategy="MultiRecord",
    max_payload=6,
)

batch_transformer.transform(
    train_inputs[0], 
    content_type='text/csv', 
    input_filter=None,
    join_source="Input",
    output_filter='$[0,-11]',
    split_type='Line',
    wait=True, 
    logs=True
)

Using already existing model: DEMO-BATCH-SKLEARN-BYO-2022-09-30-02-02-40


...................................
Processing /opt/ml/code
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: sklearn-preprocessor-batch
  Building wheel for sklearn-preprocessor-batch (setup.py): started
  Building wheel for sklearn-preprocessor-batch (setup.py): finished with status 'done'
  Created wheel for sklearn-preprocessor-batch: filename=sklearn_preprocessor_batch-1.0.0-py2.py3-none-any.whl size=7596 sha256=7cb12d8b9ff394fd33a6f785d6ff38c29c548ab5bfdd0af4e8b44d37ced70cc4
  Stored in directory: /tmp/pip-ephem-wheel-cache-c324pb9l/wheels/3e/0f/51/2f1df833dd0412c1bc2f5ee56baac195b5be5633

In [44]:
out_file_name = '{}/train.csv.out'.format(batch_transformer.output_path)
print(out_file_name)
# - download the file
sagemaker.s3.S3Downloader().download(s3_uri=out_file_name, local_path='./data/temp_output', sagemaker_session=sagemaker_session)  
output_df2 = pd.read_csv(filepath_or_buffer='./data/temp_output/train.csv.out', header=None)
output_df2.head(5)

s3://sagemaker-us-east-1-622343165275/DEMO-BATCH-SKLEARN-BYO-2022-09-30-02-02-2022-09-30-02-40-09-748/train.csv.out


|   | 0 | 1 |
|---|---|---|
| 0 | 1984 | 380846.0 |
| 1 | 1987 | 210669.0 |
| 2 | 1987 | 252826.0 |
| 3 | 1998 | 435509.0 |
| 4 | 1984 | 313808.0 |
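The two-column result above follows from how `join_source="Input"` and `output_filter='$[0,-11]'` interact. For CSV data, each input record is joined with its transform output, and the filter then keeps the elements at positions 0 and -11 of the joined record. The sketch below mimics this with plain Python lists. The 9-column input and 11-column output widths come from the tables in this notebook, but the row values are illustrative only.

```python
# Illustrative rows only: 9 input columns and 11 output columns.
input_row = [1984, 2500.0, 3, 2.5, 1.0, 2, "n", "y", 400000]
output_row = [380846.0, 1.31, -0.57, 0.70, -0.72, -2.45, 1.33, 1.0, 0.0, 1.0, 0.0]

joined = input_row + output_row      # join_source="Input": output appended to input
filtered = [joined[0], joined[-11]]  # output_filter='$[0,-11]'
print(filtered)                      # [1984, 380846.0]
```

Because the output has 11 columns, index -11 of the joined record is the first output column, so the job keeps `YEAR_BUILT` from the input next to the first value returned by the model.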
