# I/O Benchmark - SageMaker Processing Jobs with Pipe mode
In this example, you'll measure how long SageMaker takes to transfer large files from S3 into instance memory using Pipe mode.
For background on Processing jobs, see https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html

In Pipe mode, SageMaker exposes the S3 input as a named pipe inside the Docker container that runs the job. When the application reads from the pipe, the files are streamed from S3 rather than being copied to the instance's storage volume first. This matters especially for very large inputs like the ones in this example: 10 .npy files of about 29.5 GB (27.5 GiB) each.
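
To make the streaming behaviour concrete, here is a minimal local sketch that uses an anonymous OS pipe (`os.pipe`) as a stand-in for the pipe SageMaker creates. The helper name `read_in_chunks`, the chunk size, and the dummy payload are illustrative assumptions, not part of the SageMaker API. It shows the key property of a pipe: bytes arrive sequentially and must be consumed in order, with no seeking.

```python
import os
import threading

def read_in_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive chunks from a sequential, non-seekable stream until EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk

def _writer(fd, payload):
    # Simulates the producer side (S3 in the real job); closing the file signals EOF.
    with os.fdopen(fd, "wb") as w:
        w.write(payload)

# Local stand-in for the SageMaker pipe: an anonymous OS pipe.
read_fd, write_fd = os.pipe()
payload = b"x" * (3 * 1024 * 1024 + 17)  # about 3 MiB of dummy data
producer = threading.Thread(target=_writer, args=(write_fd, payload))
producer.start()

total = 0
with os.fdopen(read_fd, "rb") as r:
    for chunk in read_in_chunks(r):
        total += len(chunk)  # a real job would process each chunk here
producer.join()

print(f"Read {total} bytes from the pipe")
```

Reading in bounded chunks keeps memory use flat regardless of the input size, which is one possible alternative to reading a whole file's payload in a single call.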


In [1]:
import numpy as np
import sagemaker
import subprocess
import os
sagemaker_session = sagemaker.Session()
default_s3_bucket = sagemaker_session.default_bucket()
print(f'SageMaker default bucket: {default_s3_bucket}')

SageMaker default bucket: sagemaker-us-east-1-715445047862


### Generating big files
You can skip this step if you have already generated and uploaded the files. To run this cell, you need an instance with more than 35 GB of RAM.

In [None]:
# Generate one big file (~29.5 GB), then upload it to S3 under 10 different keys
if not os.path.isfile('big_file.npy'):
    print('Generating a new big file...')
    d = np.ones((10000, 3, 1280, 768), dtype=np.uint8)
    np.save('big_file.npy', d)
    print('Done!')

for i in range(10):
    print(f'Uploading file big_file_0{i}.npy to S3...')
    s3_uri = f's3://{default_s3_bucket}/bigfiles/big_file_0{i}.npy'
    # check=True raises CalledProcessError if the upload fails
    subprocess.run(['aws', 's3', 'cp', 'big_file.npy', s3_uri], check=True)
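
As a quick sanity check on the memory requirement, the size of the generated array can be computed from its shape and element size without allocating anything. This is a small sketch; the variable names are illustrative.

```python
import math

shape = (10000, 3, 1280, 768)  # same shape as the array generated in the cell
itemsize = 1                   # np.uint8 uses one byte per element

payload_bytes = math.prod(shape) * itemsize
print(f"Payload: {payload_bytes:,} bytes")
print(f"  = {payload_bytes / 1e9:.2f} GB (decimal)")
print(f"  = {payload_bytes / 2**30:.2f} GiB (binary)")
```

The array data alone comes to 29,491,200,000 bytes, roughly 29.5 GB or 27.5 GiB. The .npy file on disk is slightly larger because of its small header.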

### Processing script
This script will be executed by SageMaker. It reads from the pipe until it has consumed all the available bytes. In Pipe mode, the stream carries no marker showing where one file ends and the next begins. However, we know that the files are NumPy .npy files, so we can read each header and compute the size of the body from the array's shape and dtype. Using this strategy, we get the list of files from the manifest (a virtual file created by SageMaker that lists all the input files) and then iterate over them, reading each header followed by its payload.
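
The header-then-payload strategy can be tried locally before running the job. The sketch below is an illustration under assumptions: an in-memory `io.BytesIO` stands in for the pipe, and two small arrays stand in for the large files. It also shows one way to turn a raw payload back into an array with `np.frombuffer`.

```python
import io
import numpy as np

def read_array_header(fobj):
    """Read the .npy magic string and header; return (shape, fortran_order, dtype)."""
    version = np.lib.format.read_magic(fobj)
    func_name = "read_array_header_" + "_".join(str(v) for v in version)
    return getattr(np.lib.format, func_name)(fobj)

# Two small arrays stand in for the large files.
originals = [
    np.arange(12, dtype=np.uint8).reshape(3, 4),
    np.ones((2, 5), dtype=np.float32),
]

# Concatenate the serialized arrays back to back, as a single stream would deliver them.
stream = io.BytesIO()
for arr in originals:
    np.save(stream, arr)
stream.seek(0)

recovered = []
for _ in originals:
    shape, fortran_order, dtype = read_array_header(stream)
    n_bytes = int(np.prod(shape)) * dtype.itemsize  # body size implied by the header
    payload = stream.read(n_bytes)
    order = "F" if fortran_order else "C"
    recovered.append(np.frombuffer(payload, dtype=dtype).reshape(shape, order=order))

for a, b in zip(originals, recovered):
    print(a.shape, a.dtype, np.array_equal(a, b))
```

Because each header states exactly how many bytes follow, the reader never needs an end-of-file marker between arrays.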

In [12]:
%%writefile inference.py
import numpy as np
import argparse
import time
import sys
import os

input_dir = "/opt/ml/processing/input"
output_dir = "/opt/ml/processing/output"
failure_file = output_dir + "/failure"

# Helper that reads the .npy magic string and header from the stream
# and returns (shape, fortran_order, dtype)
def read_array_header(fobj):
    version = np.lib.format.read_magic(fobj)
    func_name = 'read_array_header_' + '_'.join(str(v) for v in version)
    func = getattr(np.lib.format, func_name)
    return func(fobj)

if __name__ == "__main__":

    # parse the input parameters passed via SageMaker Python SDK
    parser = argparse.ArgumentParser()
    parser.add_argument("--you-can-pass-parameters-like-this", type=float, default=0.3)
    args, _ = parser.parse_known_args()

    print("Received arguments {}".format(args))
    try:
        # list the pipes in the input dir
        print(os.listdir(input_dir))
        # parse the manifest file and extract all file names
        with open(f'{input_dir}/input-1-manifest', 'r') as manifest:
            file_names = [line.strip() for line in manifest.readlines()[1:]]  # skip first line
        
        total_start_time = time.time()
        total_bytes = 0
        # read the pipe
        with open(f'{input_dir}/input-1_0', 'rb') as data:
            # iterate over all file names and read only the bytes that belong to each file
            for fname in file_names:
                print(f'Reading header of {fname} - {time.time()}')
                shape, _, dtype = read_array_header(data)
                # cast to a plain Python int so the byte count is not a NumPy scalar
                file_size = int(np.prod(shape)) * dtype.itemsize
                total_bytes += file_size
                print(f'Reading body of {fname} - {time.time()}')
                start_time = time.time()
                payload = data.read(file_size)
                ### --> Here you convert the payload into tensor, invoke the model and save 
                ### --> the predictions to output_dir                
                print(f'Elapsed time: {time.time()-start_time}, Bytes: {file_size}')
        print(f'Total bytes read: {total_bytes}. Total Elapsed time: {time.time()-total_start_time}')
        
    except Exception as e:
        print("Failed to process: %s" % (sys.exc_info()[0]))
        with open(failure_file, 'w') as f:
            f.write(str(e))
        raise  # re-raise with the original traceback so the job is marked as failed

Overwriting inference.py
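
The script prints a byte count and an elapsed time for each file. One way to turn those two numbers into a transfer rate is sketched below. The helper name `throughput_mib_per_s` is hypothetical, and the example values are made up for illustration; they are not measured results from this benchmark.

```python
def throughput_mib_per_s(num_bytes, elapsed_seconds):
    """Convert a byte count and a duration in seconds into MiB per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_bytes / (1024 ** 2) / elapsed_seconds

# Hypothetical numbers for illustration only, not measured results.
example_bytes = 1024 ** 3  # 1 GiB
example_elapsed = 4.0      # seconds
rate = throughput_mib_per_s(example_bytes, example_elapsed)
print(f"{rate:.1f} MiB/s")
```

Plugging in the `Bytes` and `Elapsed time` values from the job logs gives a per-file rate that can be compared across instance types or input modes.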


### Now, let's run the processing job

In [13]:
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()

# We'll use a pre-defined container with SKLearn to avoid having to build our own container
sklearn_processor = SKLearnProcessor(
    framework_version='0.23-1',
    role=role,
    instance_type='ml.c5.9xlarge',
    instance_count=1
)

### Execute the processing job
Now that we have the processing script, run it to measure the I/O times.

In [14]:
%%time
sklearn_processor.run(
    code='inference.py',
    inputs=[ProcessingInput(
        source=f's3://{default_s3_bucket}/bigfiles',
        destination='/opt/ml/processing/input',
        s3_input_mode='Pipe',
        s3_data_distribution_type='ShardedByS3Key'
    )],
    outputs=[ProcessingOutput(
        source='/opt/ml/processing/output/predictions'),        
    ],
    arguments=["--you-can-pass-parameters-like-this", "0.1"]
)


Job Name:  sagemaker-scikit-learn-2022-07-14-14-44-36-548
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-715445047862/bigfiles', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'Pipe', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-715445047862/sagemaker-scikit-learn-2022-07-14-14-44-36-548/input/code/inference.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-715445047862/sagemaker-scikit-learn-2022-07-14-14-44-36-548/output/output-1', 'LocalPath': '/opt/ml/processing/output/predictions', 'S3UploadMode': 'EndOfJob'}}]
..............................[3