# Feature Transformation with an Amazon SageMaker Processing Job and Scikit-Learn

**Presentation Deep-Dive on BERT: [https://speakerdeck.com/antje/visualize-bert-attention](https://speakerdeck.com/antje/visualize-bert-attention)**

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, data processing libraries such as Scikit-Learn are used to pre-process datasets in order to prepare them for training. In this notebook, we'll use Amazon SageMaker Processing to run our Scikit-Learn processing workload in a managed SageMaker environment.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


## Contents

1. Setup Environment
1. Setup Input Data
1. Run the Processing Job using Amazon SageMaker
1. Monitor the Processing Job
1. Inspect the Processed Output Data
1. Pass Variables to the Next Notebook(s)

# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [1]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Setup Input Data

In [2]:
%store -r s3_public_path_tsv

In [3]:
try:
    s3_public_path_tsv
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the INGEST section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [4]:
print(s3_public_path_tsv)

s3://amazon-reviews-pds/tsv


In [5]:
%store -r s3_private_path_tsv

In [6]:
try:
    s3_private_path_tsv
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the INGEST section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [7]:
print(s3_private_path_tsv)

s3://sagemaker-us-west-2-085964654406/amazon-reviews-pds/tsv


# Let's Copy 1 More Large Data File to Use For Training

In [8]:
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz"

copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz to s3://sagemaker-us-west-2-085964654406/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz


In [9]:
raw_input_data_s3_uri = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(raw_input_data_s3_uri)

s3://sagemaker-us-west-2-085964654406/amazon-reviews-pds/tsv/


In [10]:
!aws s3 ls $raw_input_data_s3_uri

2020-09-26 17:43:25 1294879074 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz
2020-09-26 16:39:04   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-09-26 16:39:08   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz


# Run the Processing Job using Amazon SageMaker

Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom Python script.

# Review the Processing Script

In [11]:
!pygmentize preprocess-scikit-text-to-bert.py

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import functools
import multiprocessing

import pandas as pd
from datetime import datetime
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
import tensorf

    #
    # 1. Lowercase our text (if we're using a BERT lowercase model)
    # 2. Tokenize it (i.e. "sally says hi" -> ["sally", "says", "hi"])
    # 3. Break words into WordPieces (i.e. "calling" -> ["call", "##ing"])
    # 4. Map our words to indexes using a vocab file that BERT provides
    # 5. Add special "CLS" and "SEP" tokens (see the [readme](https://github.com/google-research/bert))
    # 6. Append "index" and "segment" tokens to each input (see the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf))
    #
    # We don't have to worry about these details.  The Transformers tokenizer does this for us.
    #
    train_data = '{}/bert/train'.format(args.output_data)
    validation_data = '{}/bert/validation
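
The numbered steps in the script comments can be illustrated with a toy sketch. This is not the Transformers tokenizer that the script relies on: the vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `encode` are hypothetical names introduced here only to show the idea of lowercasing, WordPiece splitting, adding `[CLS]`/`[SEP]`, and padding or truncating to `max_seq_length`.

```python
# Toy illustration only: this vocabulary is hypothetical and far smaller
# than the real vocab file that ships with BERT.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "sally", "says", "hi", "call", "##ing"]
TOKEN_TO_ID = {token: index for index, token in enumerate(TOY_VOCAB)}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word into WordPieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, max_seq_length):
    """Lowercase, tokenize, add special tokens, then truncate or pad."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, TOKEN_TO_ID))
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [TOKEN_TO_ID[token] for token in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens, 0 marks padding
    padding = max_seq_length - len(input_ids)
    input_ids += [TOKEN_TO_ID["[PAD]"]] * padding
    input_mask += [0] * padding
    segment_ids = [0] * max_seq_length  # single-sentence input uses segment 0
    return tokens, input_ids, input_mask, segment_ids


print(wordpiece("calling", TOKEN_TO_ID))
print(encode("Sally says hi", max_seq_length=8))
```

In the real job, a pretrained tokenizer and its full vocabulary replace everything above; the sketch only shows why every record ends up with exactly `max_seq_length` positions.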

Run this script as a processing job.  You need to specify one `ProcessingInput` whose `source` argument is the Amazon S3 location of the raw data and whose `destination` is the local path from which the script reads this data, here `/opt/ml/processing/input/data/` (inside the Docker container).  All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method one `ProcessingOutput` per output, where the `source` is the local path the script writes output data to.  For outputs, the `destination` defaults to a location in the S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.  You also give each `ProcessingOutput` an `output_name`, which makes it easier to retrieve these output artifacts after the job has run.
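
As a small sketch of the default output location described above, the helper below assembles the URI from its parts. The function name `default_output_s3_uri` is hypothetical; in practice the SDK builds this value for you and you read it back from the job description.

```python
def default_output_s3_uri(region, account_id, processing_job_name, output_name):
    """Assemble the default S3 URI for one named processing output."""
    return "s3://sagemaker-{}-{}/{}/output/{}".format(
        region, account_id, processing_job_name, output_name
    )


# Values taken from the job that is logged in this notebook.
print(
    default_output_s3_uri(
        region="us-west-2",
        account_id="085964654406",
        processing_job_name="sagemaker-scikit-learn-2020-09-26-17-44-12-987",
        output_name="bert-train",
    )
)
```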

The `arguments` parameter of the `run()` method holds the command-line arguments passed to our `preprocess-scikit-text-to-bert.py` script.
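
The script's own argument parser is not shown in full in this notebook, so the following is only a sketch of how such flags could be parsed with `argparse`. The flag names match the ones passed to `run()`; the helper `str_to_bool`, the function `parse_args`, and the default values are assumptions. The real script also defines further arguments (for example `args.output_data`) that are omitted here.

```python
import argparse


def str_to_bool(value):
    """Interpret 'True'/'False' strings. Plain bool('False') would be True."""
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_args(argv=None):
    """Parse the flags that the notebook passes through processor.run()."""
    parser = argparse.ArgumentParser(description="Hypothetical parser sketch")
    parser.add_argument("--train-split-percentage", type=float, default=0.90)
    parser.add_argument("--validation-split-percentage", type=float, default=0.05)
    parser.add_argument("--test-split-percentage", type=float, default=0.05)
    parser.add_argument("--max-seq-length", type=int, default=64)
    parser.add_argument("--balance-dataset", type=str_to_bool, default=True)
    return parser.parse_args(argv)


# Every value arrives as a string, which is why run() wraps them in str().
args = parse_args([
    "--train-split-percentage", "0.9",
    "--validation-split-percentage", "0.05",
    "--test-split-percentage", "0.05",
    "--max-seq-length", "64",
    "--balance-dataset", "True",
])
print(args)
```

The explicit `str_to_bool` converter matters because the notebook sends `str(balance_dataset)`, and any non-empty string is truthy in Python.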

Note that we are sharding the data using `ShardedByS3Key` to spread the transformations across all worker nodes in the cluster.
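
To build intuition for `ShardedByS3Key`, here is a small simulation of how objects under the input prefix might be divided among instances. SageMaker decides the actual assignment; the round-robin over sorted keys used here is an assumption, although it happens to reproduce the `part-algo-1-...` and `part-algo-2-...` file names listed in the output section of this notebook. The function name `shard_by_s3_key` is hypothetical.

```python
def shard_by_s3_key(keys, instance_count):
    """Assign each S3 object to exactly one instance (assumed round-robin)."""
    shards = {"algo-{}".format(i + 1): [] for i in range(instance_count)}
    for position, key in enumerate(sorted(keys)):
        host = "algo-{}".format(position % instance_count + 1)
        shards[host].append(key)
    return shards


input_files = [
    "amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz",
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
]

for host, files in shard_by_s3_key(input_files, instance_count=2).items():
    print(host, files)
```

With `FullyReplicated`, by contrast, every instance would receive all three files and the work would be duplicated rather than divided.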

# Set the Processing Job Hyper-Parameters 

In [12]:
processing_instance_type='ml.c5.2xlarge'
processing_instance_count=2
train_split_percentage=0.90
validation_split_percentage=0.05
test_split_percentage=0.05
balance_dataset=True
max_seq_length=64
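
The three split percentages above must sum to 1. The sketch below shows one way to apply them to a list of rows. It is a pure-Python stand-in, not the script's implementation (the script imports `train_test_split` from Scikit-Learn); the function name `split_rows` and the `seed` parameter are hypothetical.

```python
import random


def split_rows(rows, train_pct, validation_pct, test_pct, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    if abs(train_pct + validation_pct + test_pct - 1.0) > 1e-9:
        raise ValueError("split percentages must sum to 1.0")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    train_end = int(round(len(shuffled) * train_pct))
    validation_end = train_end + int(round(len(shuffled) * validation_pct))
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )


train, validation, test = split_rows(range(100), 0.90, 0.05, 0.05)
print(len(train), len(validation), len(test))
```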

# Choosing a `max_seq_length` for BERT
Since a smaller `max_seq_length` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `70%` of our reviews.

Remember our distribution of review lengths from a previous section?

```
mean         67.930174
std         130.954079
min           1.000000
10%           4.000000
20%          14.000000
30%          21.000000
40%          25.000000
50%          31.000000
60%          42.000000
70%          59.000000
80%          87.000000
90%         149.000000
100%       5347.000000
max        5347.000000
```

![](img/review_word_count_distribution.png)

Review length `59` represents the `70th` percentile for this dataset.  Powers of 2 are a conventional choice for BERT sequence lengths, so let's choose `64`, the smallest power of 2 greater than `59`.  Reviews longer than `64` will be truncated to `64`.
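
The selection rule can be written down as a short sketch. The percentile values are copied from the table above; the names `REVIEW_LENGTH_PERCENTILES`, `next_power_of_two`, and `choose_max_seq_length` are hypothetical helpers introduced for illustration.

```python
# Review-length percentiles copied from the distribution shown above.
REVIEW_LENGTH_PERCENTILES = {
    10: 4, 20: 14, 30: 21, 40: 25, 50: 31,
    60: 42, 70: 59, 80: 87, 90: 149, 100: 5347,
}


def next_power_of_two(n):
    """Return the smallest power of 2 that is greater than or equal to n."""
    power = 1
    while power < n:
        power *= 2
    return power


def choose_max_seq_length(percentiles, target_percentile):
    """Round the review length at the target percentile up to a power of 2."""
    return next_power_of_two(percentiles[target_percentile])


print(choose_max_seq_length(REVIEW_LENGTH_PERCENTILES, 70))
```

Raising the target percentile keeps more of each review at the cost of longer sequences: under the same rule, the 80th percentile would give `128` and the 90th would give `256`.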

In [13]:
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             instance_type=processing_instance_type,
                             instance_count=processing_instance_count,
                             max_runtime_in_seconds=7200)

In [14]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor.run(code='preprocess-scikit-text-to-bert.py',
              inputs=[
                    ProcessingInput(source=raw_input_data_s3_uri,
                                    destination='/opt/ml/processing/input/data/',
                                    s3_data_distribution_type='ShardedByS3Key')
              ],
              outputs=[
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-train',
                                     source='/opt/ml/processing/output/bert/train'),
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-validation',
                                     source='/opt/ml/processing/output/bert/validation'),
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-test',
                                     source='/opt/ml/processing/output/bert/test'),
              ],
              arguments=['--train-split-percentage', str(train_split_percentage),
                         '--validation-split-percentage', str(validation_split_percentage),
                         '--test-split-percentage', str(test_split_percentage),
                         '--max-seq-length', str(max_seq_length),
                         '--balance-dataset', str(balance_dataset)
              ],
              logs=True,
              wait=False)


Job Name:  sagemaker-scikit-learn-2020-09-26-17-44-12-987
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-085964654406/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/input/code/preprocess-scikit-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3U

In [15]:
scikit_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']
print(scikit_processing_job_name)

sagemaker-scikit-learn-2020-09-26-17-44-12-987


In [16]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(region, scikit_processing_job_name)))


In [17]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, scikit_processing_job_name)))


In [18]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, scikit_processing_job_name, region)))


# Monitor the Processing Job

In [19]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=scikit_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

print(processing_job_description)

{'ProcessingInputs': [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-085964654406/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/input/code/preprocess-scikit-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3Uri': 's3://sagemak

In [20]:
running_processor.wait(logs=False)

..............................................................................!
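
`wait()` blocks until the job finishes. If you would rather poll the status yourself, a loop along the following lines is one option. It is only a sketch: `wait_for_processing_job` and its parameters are hypothetical, and it assumes that the dictionary returned by `describe()` carries a `ProcessingJobStatus` key, as the SageMaker `DescribeProcessingJob` response does.

```python
import time

# Statuses after which the job will not change state any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe, poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Call describe() until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("processing job did not finish within the polling budget")


# Offline demonstration with a fake describe() that finishes on the third call.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_processing_job(
    describe=lambda: {"ProcessingJobStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real sleeping in the demonstration
)
print(final_status)
```

In the notebook, `running_processor.describe` could be passed as the `describe` argument. Checking for `Failed` or `Stopped` before reading the outputs avoids inspecting an S3 prefix that was never written.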

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._

# Inspect the Processed Output Data

List the transformed output files for each split to make sure the processing was successful.

In [21]:
processing_job_description = running_processor.describe()

output_config = processing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'bert-train':
        processed_train_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-validation':
        processed_validation_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-test':
        processed_test_data_s3_uri = output['S3Output']['S3Uri']
        
print(processed_train_data_s3_uri)
print(processed_validation_data_s3_uri)
print(processed_test_data_s3_uri)

s3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/output/bert-train
s3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/output/bert-validation
s3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/output/bert-test
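
The chain of `if` statements above can also be expressed as a lookup table keyed by output name. The sketch below uses a trimmed, hand-written copy of the job description so that it runs without AWS access; the function name `output_uris_by_name` is hypothetical.

```python
def output_uris_by_name(processing_job_description):
    """Map each output name to the S3 URI recorded in the job description."""
    outputs = processing_job_description["ProcessingOutputConfig"]["Outputs"]
    return {output["OutputName"]: output["S3Output"]["S3Uri"] for output in outputs}


# Trimmed sample shaped like the describe() response shown in this notebook.
job_prefix = "s3://sagemaker-us-west-2-085964654406/sagemaker-scikit-learn-2020-09-26-17-44-12-987/output"
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": name, "S3Output": {"S3Uri": "{}/{}".format(job_prefix, name)}}
            for name in ("bert-train", "bert-validation", "bert-test")
        ]
    }
}

uris = output_uris_by_name(sample_description)
print(uris["bert-train"])
```

A missing output name then surfaces as a `KeyError` at lookup time instead of an undefined variable later in the notebook.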


In [22]:
!aws s3 ls $processed_train_data_s3_uri/

2020-09-26 17:50:36     352881 part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-09-26 17:50:36      11912 part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-09-26 17:49:11      10766 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord


In [23]:
!aws s3 ls $processed_validation_data_s3_uri/

2020-09-26 17:50:36      19944 part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-09-26 17:50:36        699 part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-09-26 17:49:11        716 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord


In [24]:
!aws s3 ls $processed_test_data_s3_uri/

2020-09-26 17:50:36      19970 part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-09-26 17:50:36        710 part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-09-26 17:49:11        650 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord


# Pass Variables to the Next Notebook(s)

In [25]:
%store raw_input_data_s3_uri

Stored 'raw_input_data_s3_uri' (str)


In [26]:
%store max_seq_length

Stored 'max_seq_length' (int)


In [27]:
%store train_split_percentage

Stored 'train_split_percentage' (float)


In [28]:
%store validation_split_percentage

Stored 'validation_split_percentage' (float)


In [29]:
%store test_split_percentage

Stored 'test_split_percentage' (float)


In [30]:
%store balance_dataset

Stored 'balance_dataset' (bool)


In [31]:
%store processed_train_data_s3_uri

Stored 'processed_train_data_s3_uri' (str)


In [32]:
%store processed_validation_data_s3_uri

Stored 'processed_validation_data_s3_uri' (str)


In [33]:
%store processed_test_data_s3_uri

Stored 'processed_test_data_s3_uri' (str)


In [34]:
%store

Stored variables and their in-db values:
auto_ml_job_name                                      -> 'automl-dm-26-16-00-25'
autopilot_endpoint_name                               -> 'automl-dm-ep-26-16-21-49'
autopilot_train_s3_uri                                -> 's3://sagemaker-us-west-2-085964654406/data/amazon
balance_dataset                                       -> True
ingest_create_athena_db_passed                        -> True
ingest_create_athena_table_parquet_passed             -> True
ingest_create_athena_table_tsv_passed                 -> True
max_seq_length                                        -> 64
processed_test_data_s3_uri                            -> 's3://sagemaker-us-west-2-085964654406/sagemaker-s
processed_train_data_s3_uri                           -> 's3://sagemaker-us-west-2-085964654406/sagemaker-s
processed_validation_data_s3_uri                      -> 's3://sagemaker-us-west-2-085964654406/sagemaker-s
raw_input_data_s3_uri                                 

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();