## SageMaker Preprocessing

Suppose you have explored the data in a Jupyter notebook and settled on the transformations it needs, and that all of the code now lives in a Python script.

Instead of running the script locally, you are now ready to run the preprocessing as a SageMaker Processing job on a managed EC2 instance.



### Reference
https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html

Inside the container, input and output files are placed under `/opt/ml/processing/`, for example `/opt/ml/processing/input` and `/opt/ml/processing/output`.
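As a rough illustration of this convention, the sketch below builds the container paths with `pathlib`. The helper name `processing_paths` and its `base` override are hypothetical and not part of the SageMaker SDK; the subdirectory names are the ones the job in this notebook uses.

```python
from pathlib import Path

# Base directory that SageMaker Processing uses inside the container.
DEFAULT_BASE = Path("/opt/ml/processing")


def processing_paths(base=DEFAULT_BASE, names=("input", "train", "test", "model")):
    """Return a dict mapping each name to its directory under `base`.

    Hypothetical helper for illustration only.
    """
    base = Path(base)
    return {name: base / name for name in names}


if __name__ == "__main__":
    # Print the directories the script would read from and write to.
    for name, path in processing_paths().items():
        print(f"{name}: {path}")
```

Passing a different `base` (for example `/tmp`) lets the same script run locally and inside the container without changing any other path logic.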

### Setup

Create the local directory and change its owner and group to `ec2-user`.

Note that this directory is deleted after the notebook is shut down.

In [1]:
# one time setup to create directory
!sudo mkdir -p /opt/ml/processing
!sudo chown -R ec2-user:ec2-user /opt/ml/processing

### Verify python script

Verify that the script runs locally without errors before submitting it as a job; this speeds up development.

In [2]:
%%bash

# DATA=s3://sagemaker-sample-data-us-east-1/processing/census/census-income.csv
# aws s3 cp $DATA /tmp/input/
# mkdir /tmp/{train,test,model}
python ../src/mlmax/preprocessing.py --mode "train" --data-dir /tmp


Received arguments Namespace(data_dir='/tmp', data_input='input/census-income.csv', mode='train', train_test_split_ratio=0.3)
Reading input data from /tmp/input/census-income.csv
Data after cleaning: (68285, 9), 11401 positive examples, 56884 negative examples
Splitting data into train and test sets with ratio 0.3
        age                    education  ... capital losses dividends from stocks
26873    28             5th or 6th grade  ...              0                     0
179865   31   Bachelors degree(BA AB BS)  ...              0                     0

[2 rows x 8 columns]
Creating preprocessing and feature engineering transformations
Saving model to /tmp/model/proc_model.tar.gz
Data shape after preprocessing: (47799, 69)
Data shape after preprocessing: (20486, 69)
Saving data to /tmp/train/train_features.csv
Saving data to /tmp/train/train_labels.csv
Saving data to /tmp/test/test_features.csv
Saving data to /tmp/test/test_labels.csv
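The `Received arguments` line in this output lists the options the script parsed. The script itself is not shown here, so the following is only a minimal sketch of an `argparse` setup that would produce a `Namespace` with those attribute names. The `--mode` and `--data-dir` flags appear in the command above; the other flag spellings and the `/opt/ml/processing` default are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse options matching the attribute names logged by the local run.

    Only `--mode` and `--data-dir` are confirmed by the command above;
    the remaining flag names and all defaults are assumptions.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", type=str, default="train")
    # Assumed default: the base directory used inside the processing container.
    parser.add_argument("--data-dir", type=str, default="/opt/ml/processing")
    parser.add_argument("--data-input", type=str, default="input/census-income.csv")
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    # parse_known_args ignores unrecognized flags instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--mode", "train", "--data-dir", "/tmp"])
print(f"Received arguments {args}")
```

Keeping `--data-dir` as an option is what allows the same script to read from `/tmp` during local verification and from the container directory in the processing job.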




In [3]:
# Quick check on output files
!ls -l /tmp/train/

total 15336
-rw-rw-r-- 1 ec2-user ec2-user 15605086 Sep 21 00:14 train_features.csv
-rw-rw-r-- 1 ec2-user ec2-user    95598 Sep 21 00:14 train_labels.csv
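Beyond listing the files, it can help to confirm that the features and labels files line up. The sketch below is one hedged way to do that with the standard library: it compares row counts only and assumes both files follow the same header convention. The function names `count_rows` and `check_split` are hypothetical.

```python
import csv
from pathlib import Path


def count_rows(csv_path):
    """Count the rows in a CSV file, including a header row if there is one."""
    with open(csv_path, newline="") as handle:
        return sum(1 for _ in csv.reader(handle))


def check_split(directory, prefix):
    """Return the shared row count of <prefix>_features.csv and <prefix>_labels.csv.

    Raises ValueError when the two files have different row counts.
    """
    directory = Path(directory)
    n_features = count_rows(directory / f"{prefix}_features.csv")
    n_labels = count_rows(directory / f"{prefix}_labels.csv")
    if n_features != n_labels:
        raise ValueError(
            f"{prefix}: {n_features} feature rows but {n_labels} label rows"
        )
    return n_features


# Example call against the local output directory used above:
# check_split("/tmp/train", "train")
```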


In [4]:
# Copy to S3
! aws s3 cp ../src/mlmax/preprocessing.py s3://wy-cba/source/

upload: ../src/mlmax/preprocessing.py to s3://wy-cba/source/preprocessing.py


### Run on SageMaker processing

Import the required packages.

In [5]:
import sagemaker
# ScriptProcessor is defined in sagemaker.processing, so import it from there.
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor  # noqa
from sagemaker.sklearn.processing import SKLearnProcessor
from pathlib import Path

Set up the execution role and the S3 bucket.

In [6]:
role = "arn:aws:iam::342474125894:role/service-role/AmazonSageMaker-ExecutionRole-20190405T234154"
s3_bucket = "wy-project-template"

Create the scikit-learn processor.

In [7]:
local_mode = False

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.m5.xlarge"

processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type=instance_type,
    instance_count=1,
    role=role,
)

print(f"Container image: {processor.image_uri}")

Container image: 121021644041.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3


### S3 - directory mapping

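Understand how data is mapped between S3 and the container's local directories: each `ProcessingInput` copies an S3 object or prefix into a local directory before the script starts, and each `ProcessingOutput` uploads a local directory back to S3 at the end of the job.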

In [8]:

processor.run(
    code="../src/mlmax/preprocessing.py",
    # source_dir="src/sklearn/preprocessing.py",
    inputs=[
        ProcessingInput(
            source="s3://sagemaker-sample-data-ap-southeast-1/processing/census/census-income.csv",
            destination="/opt/ml/processing/input",
            input_name="input-1",
        ),
#         ProcessingInput(
#             source="s3://wy-cba/source/preprocessing.py",
#             destination="/opt/ml/processing/input/code",
#             input_name="code",
#         ),
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/train",
            destination=f"s3://{s3_bucket}/sklearn/processed/train_data",
        ),
        ProcessingOutput(
            source="/opt/ml/processing/test",
            destination=f"s3://{s3_bucket}/sklearn/processed/test_data",
        ),
        ProcessingOutput(
            source="/opt/ml/processing/model",
            destination=f"s3://{s3_bucket}/sklearn/processed/model",
        ),
    ],
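    # These flags must match the options that preprocessing.py parses; the local
    # run above logged data_dir, data_input, mode and train_test_split_ratio.
    arguments=[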
        "--mode",
        "train",
        "--preprocessor_name",
        "preprocessor.pkl",
        "--test_size",
        "0.2",
    ],
    wait=True
)

preprocessing_job_description = processor.jobs[-1].describe()
print(preprocessing_job_description)



Job Name:  sagemaker-scikit-learn-2021-09-21-00-14-20-811
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-sample-data-ap-southeast-1/processing/census/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-ap-southeast-1-342474125894/sagemaker-scikit-learn-2021-09-21-00-14-20-811/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://wy-project-template/sklearn/processed/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'output-2', 'AppManaged': False, 'S3O
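The job description pairs every container path with an S3 URI. The sketch below pulls those pairs out of a description dictionary so the mapping is easier to read. It assumes the dictionary follows the `DescribeProcessingJob` response layout (`ProcessingInputs` and `ProcessingOutputConfig.Outputs`); the function name `s3_mapping` is hypothetical, and the sample is trimmed from the values printed above.

```python
def s3_mapping(description):
    """Return (direction, local_path, s3_uri) tuples from a job description.

    Assumes the DescribeProcessingJob response layout; entries without an
    S3Input or S3Output block are skipped.
    """
    rows = []
    for item in description.get("ProcessingInputs", []):
        s3_input = item.get("S3Input")
        if s3_input:
            rows.append(("input", s3_input["LocalPath"], s3_input["S3Uri"]))
    outputs = description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    for item in outputs:
        s3_output = item.get("S3Output")
        if s3_output:
            rows.append(("output", s3_output["LocalPath"], s3_output["S3Uri"]))
    return rows


# Sample trimmed from the description printed by the job above.
sample = {
    "ProcessingInputs": [
        {
            "InputName": "input-1",
            "S3Input": {
                "S3Uri": "s3://sagemaker-sample-data-ap-southeast-1/processing/census/census-income.csv",
                "LocalPath": "/opt/ml/processing/input",
            },
        }
    ],
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "output-1",
                "S3Output": {
                    "S3Uri": "s3://wy-project-template/sklearn/processed/train_data",
                    "LocalPath": "/opt/ml/processing/train",
                },
            }
        ]
    },
}

for direction, local_path, s3_uri in s3_mapping(sample):
    # Inputs flow from S3 into the container; outputs flow back to S3.
    arrow = "<-" if direction == "input" else "->"
    print(f"{local_path} {arrow} {s3_uri}")
```

With a real job, the same function could be called on `processor.jobs[-1].describe()`.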

You can follow the job's progress in the SageMaker console under `Processing`.

- sagemaker-scikit-learn-2021-09-20-09-52-52-053
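The status can also be checked from code rather than the console. The sketch below polls a job by name through a SageMaker client that is passed in, using the `describe_processing_job` call and its `ProcessingJobStatus` field. The function name `wait_for_job` and the polling parameters are hypothetical; `processor.run(wait=True)` already blocks, so this is mainly useful from a separate session.

```python
import time

# Statuses after which a processing job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll a processing job until it reaches a terminal status and return it.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        response = client.describe_processing_job(ProcessingJobName=job_name)
        status = response["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Usage (requires boto3 and AWS credentials):
# import boto3
# status = wait_for_job(
#     boto3.client("sagemaker"),
#     "sagemaker-scikit-learn-2021-09-20-09-52-52-053",
# )
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching AWS.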
