# UDACITY SageMaker Essentials: Batch Transform

In the last exercise, we asked you to reflect on the disadvantages of preprocessing on a local machine. Beyond those disadvantages, such as user error and hardware limitations, you may have run into another frustration: submitting a large amount of data to an endpoint. Your network may be a bottleneck, there may be security or privacy concerns, and the work would clearly benefit from parallelism that is difficult to implement yourself.

Batch transform addresses all of these issues. Its primary use case is running inference on an entire dataset rather than making many individual calls to an endpoint. As with the other tools we have encountered, SageMaker does the heavy lifting of reading the data and splitting the work among instances. All that's required of us is to point batch transform at the data we want to submit.

Unfortunately, this dataset is not quite in the right format to be digested by batch transform. Although batch transform can consume lists of JSON objects, it cannot perform the preprocessing we need (filtering reviews and splitting them into sentences). So, yet again, we must preprocess the data.

## Exercise: Preprocess (again, again) and upload to S3

The cell below provides two functions. `split_sentences` preprocesses the reviews; you should be very familiar with this function by now. Remember that BlazingText expects input in JSON format: `cycle_data` formats each sentence as `{"source": "THIS IS A SAMPLE SENTENCE"}` and writes it to a file, one object per line.

Use the cell to complete the following tasks:
* preprocess reviews_Musical_Instruments_5.json
* upload the file containing the data to S3
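
As a quick illustration of the target format, the sketch below writes two sentences as JSON Lines into an in-memory buffer. The helper name `write_json_lines` and the sample sentences are made up for this example; the cell below provides the actual `cycle_data` function.

```python
import io
import json


def write_json_lines(fp, sentences):
    # Hypothetical helper: write one JSON object per line, keyed by "source".
    for s in sentences:
        fp.write(json.dumps({"source": s}) + "\n")


# Write to an in-memory buffer instead of a real file for the demo.
buffer = io.StringIO()
write_json_lines(buffer, ["THIS IS A SAMPLE SENTENCE", "Another sentence"])
json_lines_text = buffer.getvalue()
print(json_lines_text, end="")
```

Note that `json.dumps` emits double-quoted keys and values, which is what makes each line valid JSON.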

In [2]:
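import boto3
import json
import logging  # Used by upload_file_to_s3 to report upload errors.
import os
import zipfile

from botocore.exceptions import ClientError  # Caught in upload_file_to_s3.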

# Todo: Input the s3 bucket
s3_bucket = "udacity-ml-workflow"

# Todo: Input the s3 prefix
s3_prefix = "batch-transform"

# Todo: Input the file to write the data to
file_name = "music-instruments-reviews.txt"

# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')


def split_sentences(input_data):
    sentences = []
    with open(input_data, 'r') as f:  # Context manager closes the file when done.
        for line in f:
            review = json.loads(line)
            helpful_votes = float(review['helpful'][0])
            total_votes = review['helpful'][1]
            # Filter out the same data as prior jobs: skip reviews with no votes
            # or with a helpful ratio of exactly 0.5.
            if total_votes != 0 and helpful_votes / total_votes != .5:
                for s in review['reviewText'].split("."):
                    if s:  # Skip empty strings, which are common with "..."
                        sentences.append(s)
    return sentences

# Format the data as {'source': 'THIS IS A SAMPLE SENTENCE'}
# And write the data into a file
def cycle_data(fp, data):
    for d in data:
        fp.write(json.dumps({'source':d}) + '\n')

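# Todo: write a function to upload the data to s3
def upload_file_to_s3(file_name, s3_prefix):
    # S3 keys always use forward slashes, so build the key explicitly
    # rather than with os.path.join, which is platform dependent.
    object_name = s3_prefix + "/" + file_name
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, s3_bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True  # Report success so callers can check the result.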


# Unzips file.
unzip_data('reviews_Musical_Instruments_5.json.zip')

# Todo: preprocess reviews_Musical_Instruments_5.json 
sentences = split_sentences('reviews_Musical_Instruments_5.json')

# Write data to a file and upload it to s3.
with open(file_name, 'w') as f:
    cycle_data(f, sentences)

upload_file_to_s3(file_name, s3_prefix)

# Get the s3 path for the data
batch_transform_input_path = "s3://" + "/".join([s3_bucket, s3_prefix, file_name])

print(batch_transform_input_path)

s3://udacity-ml-workflow/batch-transform/music-instruments-reviews.txt


## Exercise: Use Batch Transform to perform an inference on the dataset

We use batch transform through a transformer object. Just as we initialized a predictor object in the last exercise, complete the code below to initialize a transformer object and launch a transform job.

You will need the following:

* As in the last exercise, you will need to get a BlazingText image URI from AWS. The method for doing so is identical to the last exercise.
* You will need to instantiate a "model" object.
* You will need to call the "transformer" method on the model object to create a transformer. We suggest using 1 instance of ml.m4.xlarge. If this isn't available in your region, feel free to use another instance type, such as ml.m5.large.
* You will need to run this transformer on the data we uploaded to S3. You can do so by passing a data_type of "S3Prefix" and a content_type of "application/jsonlines", with split_type set to "Line".

Consult the following documentation: https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html

End-to-end, this process should take about 5 minutes on the whole dataset. While developing, consider uploading a subset of the data to S3 and evaluating on that instead.
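
One way to build such a subset is to copy only the first few lines of the JSON Lines file into a smaller file before uploading. This is a minimal sketch: the function name `write_subset`, the demo file names, and the line limit are assumptions for illustration, not part of the exercise.

```python
import itertools
import json
import os
import tempfile


def write_subset(input_path, output_path, max_lines=100):
    """Copy at most max_lines lines from input_path to output_path.

    Returns the number of lines actually written.
    """
    count = 0
    with open(input_path, "r") as src, open(output_path, "w") as dst:
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            count += 1
    return count


# Self-contained demo on a throwaway file with made-up sentences.
demo_dir = tempfile.mkdtemp()
full_path = os.path.join(demo_dir, "full.txt")
subset_path = os.path.join(demo_dir, "subset.txt")
with open(full_path, "w") as f:
    for i in range(10):
        f.write(json.dumps({"source": f"sentence {i}"}) + "\n")

lines_written = write_subset(full_path, subset_path, max_lines=3)
print(lines_written)
```

In the notebook, you could then pass the smaller file to `upload_file_to_s3` and point `transformer.transform` at its S3 path while you iterate.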


In [3]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris

# Get the execution role

role = get_execution_role()

# Get the image uri for the "blazingtext" algorithm. The region is hard-coded below; change it to match your region.

image_uri = image_uris.retrieve(framework='blazingtext',region='us-east-1')

# Get the model artifact from S3

model_data = "s3://udacity-ml-workflow/hello-blaze-lesson2-ex1/output-hello-blaze/hello-blaze-2022/output/model.tar.gz"

# Get the s3 path for the batch transform data

batch_transform_output_path = "s3://udacity-ml-workflow/batch-transform/batch-transform-output/"

# Define a model object

model = Model(image_uri=image_uri, model_data=model_data, role=role)

# Define a transformer object, using a single instance ml.m4.xlarge. Specify an output path to your s3 bucket. 

transformer = model.transformer(
    instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    output_path = batch_transform_output_path
)

# Call the transform method. Set content_type='application/jsonlines', split_type='Line'

transformer.transform(
    data=batch_transform_input_path, 
    data_type='S3Prefix',
    content_type='application/jsonlines', 
    split_type='Line'
)

transformer.wait()


..................................Arguments: serve
[01/11/2022 18:50:28 INFO 140560943244672] Finding and loading model
[01/11/2022 18:50:28 INFO 140560943244672] Trying to load model from /opt/ml/model/model.bin
[01/11/2022 18:50:28 INFO 140560943244672] Number of server workers: 4
[2022-01-11 18:50:28 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2022-01-11 18:50:28 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2022-01-11 18:50:28 +0000] [1] [INFO] Using worker: sync
[2022-01-11 18:50:28 +0000] [35] [INFO] Booting worker with pid: 35
[2022-01-11 18:50:28 +0000] [36] [INFO] Booting worker with pid: 36
[2022-01-11 18:50:29 +0000] [37] [INFO] Booting worker with pid: 37
[2022-01-11 18:50:29 +0000] [38] [INFO] Booting worker with pid: 38

## Exercise: Sanity Check - Are Results the Same? 

The results of the inference should be written to the S3 path specified in `batch_transform_output_path`. We have evaluated the same data with the same model, so if everything was done correctly in both exercises, the inferences should be the same. Compare the first five or so inferences from the last exercise with those from this exercise to confirm this.
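
A comparison along these lines could look like the sketch below. It assumes the batch output has already been downloaded to your machine and that each line of it is a JSON object. The helper names are hypothetical, and the sample values are placeholders shaped like typical classifier output, not real results from either exercise.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def first_mismatch(batch_results, endpoint_results, n=5):
    """Return the index of the first differing result among the first n, or None."""
    for i, (b, e) in enumerate(zip(batch_results[:n], endpoint_results[:n])):
        if b != e:
            return i
    return None


# Placeholder data for the demo; replace with your own downloaded output
# and the predictions you saved from the last exercise.
sample_batch_text = (
    '{"label": ["__label__1"], "prob": [0.9]}\n'
    '{"label": ["__label__0"], "prob": [0.8]}\n'
)
sample_endpoint_results = [
    {"label": ["__label__1"], "prob": [0.9]},
    {"label": ["__label__0"], "prob": [0.8]},
]

batch_results = parse_json_lines(sample_batch_text)
mismatch_index = first_mismatch(batch_results, sample_endpoint_results)
print("all match" if mismatch_index is None else f"first mismatch at {mismatch_index}")
```

Batch transform typically names each output object after its input file with `.out` appended, so look for a key like that under your output path and fetch it with, for example, the boto3 S3 client's `download_file` method. If the two runs report slightly different probabilities, comparing only the predicted labels may be a more forgiving check.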