# MLOps - NLP Lab with Amazon SageMaker
**Step 1** - *Prepare data with SageMaker Processing*

## Initialization
---
### Setup environment

In [None]:
import os
import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role() # we are using the notebook instance role for the processing job in this example
bucket = sagemaker_session.default_bucket() # you can specify a bucket name here
prefix = 'data/input'
image_name = 'data-processing-containers'

### Download example data

In [None]:
!wget https://aws-mlops-workshop.s3-eu-west-1.amazonaws.com/reviews/workshop_data/reviews.csv

Let's now upload this dataset to the default S3 bucket of our SageMaker session:

In [None]:
s3_input = sagemaker_session.upload_data('reviews.csv', bucket, prefix)
with open('processing_input_location.txt', 'w') as f:
    f.write(s3_input)  # s3_input is a single string, so write() is the right call

print(s3_input)
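
The value printed above is an S3 URI of the form `s3://<bucket>/<prefix>/reviews.csv`. If a later step needs the bucket and the object key separately, a small helper along these lines can split it. This is a minimal sketch: `split_s3_uri` is a hypothetical helper name and the bucket in the example is made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not an S3 URI: {uri!r}')
    return parsed.netloc, parsed.path.lstrip('/')


# Example with a made-up bucket name (not the real default bucket):
example_bucket, example_key = split_s3_uri('s3://my-example-bucket/data/input/reviews.csv')
print(example_bucket, example_key)
```

In the notebook, you would call it as `split_s3_uri(s3_input)`.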

## Build and push Processing container
---

We already wrote a shell script (located in `docker/build_and_push.sh`) that builds the Docker container Amazon SageMaker Processing will execute and pushes it to Amazon ECR. Processing is a capability of Amazon SageMaker that lets you run preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure. If you're curious about the actual processing script built into this Docker image, you will find it in `docker/code/prepare_data.py`. Feel free to update this file before building the image:

In [None]:
!pygmentize ./docker/code/prepare_data.py

In [None]:
!sh ./docker/build_and_push.sh $image_name

The previous script wrote the full ECR Docker image ID to the `ecr_image_name.txt` file. Let's read this ID from the file:

In [None]:
if os.path.exists('ecr_image_name.txt'):
    with open('ecr_image_name.txt', 'r') as f:
        container = f.readline().strip()  # strip() is safe whether or not the line ends with a newline

    print(f'ECR Image ID: {container}')
else:
    print('ECR Image ID not found.')
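
Before handing the image ID to a Processor, it can be worth checking that it looks like an ECR image URI. The sketch below assumes the build script wrote a standard private ECR URI (`<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`); we have not shown that script here, so treat this as an assumption. `parse_ecr_uri` is a hypothetical helper and the account number in the example is a placeholder.

```python
import re

# Assumed shape of a private ECR image URI; the tag is optional.
ECR_URI_PATTERN = re.compile(
    r'^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(\.cn)?'
    r'/(?P<repository>[^:@\s]+)(:(?P<tag>\S+))?$'
)


def parse_ecr_uri(uri):
    """Return account, region, repository and tag parsed from an ECR image URI."""
    match = ECR_URI_PATTERN.match(uri.strip())
    if match is None:
        raise ValueError(f'Does not look like an ECR image URI: {uri!r}')
    return match.groupdict()


# Placeholder account ID, with the image name used in this lab:
parts = parse_ecr_uri('123456789012.dkr.ecr.eu-west-1.amazonaws.com/data-processing-containers:latest')
print(parts['repository'], parts['tag'])
```

In the notebook, `parse_ecr_uri(container)` would raise early if the file content is not what we expect.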

## Launch the data preprocessing job
---

First, we configure a `Processor` object that references the container we just pushed to Amazon ECR:

In [None]:
data_processor = Processor(role=role, 
                           image_uri=container, 
                           instance_count=1, 
                           instance_type='ml.m5.xlarge',
                           volume_size_in_gb=30, 
                           max_runtime_in_seconds=1200,
                           base_job_name='data-processing')

Now we can run this processing job:

In [None]:
input_folder = '/opt/ml/processing/input'
output_folder = '/opt/ml/processing/output'

results = data_processor.run(
    arguments= [
        f'--input={input_folder}',
        f'--output={output_folder}'
    ],
    inputs = [
        ProcessingInput(
            input_name='input',
            source=s3_input,
            destination=input_folder
        )
    ],
    outputs= [
        ProcessingOutput(
            output_name='preprocessed',
            source=output_folder,
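            destination=f's3://{bucket}/data/output'  # must be an S3 URI, not a bare bucket name; the 'data/output' prefix is an arbitrary choice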
        )
    ]
)
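
The job above passes `--input` and `--output` to the container's entry point, pointing at the local folders where SageMaker mounts the input data and collects the output data. To make that contract concrete, here is a minimal sketch of how a processing script could consume these two arguments. It is not the content of `prepare_data.py`: the function names and the "drop blank rows" logic are placeholders chosen for illustration.

```python
import argparse
import csv
from pathlib import Path


def parse_args(argv=None):
    """Parse the two arguments passed by the processing job."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True, help='local folder holding the input CSV files')
    parser.add_argument('--output', required=True, help='local folder where results are written')
    return parser.parse_args(argv)


def process(input_dir, output_dir):
    """Copy each CSV from input_dir to output_dir, dropping blank rows (placeholder logic)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob('*.csv')):
        dst = out / src.name
        with src.open(newline='') as fin, dst.open('w', newline='') as fout:
            writer = csv.writer(fout)
            for row in csv.reader(fin):
                if any(cell.strip() for cell in row):  # skip rows with no content
                    writer.writerow(row)
        written.append(dst)
    return written


# Inside the container, the entry point would do something like:
#   args = parse_args()
#   process(args.input, args.output)
```

Everything written under the output folder is uploaded to the S3 destination of the matching `ProcessingOutput` when the job completes.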

As you can see in the `prepare_data.py` script above, this Processing job wrote its files to the output location. We will keep this S3 location handy for the next step, where we will train a model using these new CSV files:

In [None]:
results = data_processor.jobs[-1].describe()

with open('training_input_location.txt', 'w') as f:
    f.write(results['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri'])
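
Indexing `Outputs[0]` works here because the job declares a single output. If more outputs are added later, looking the location up by output name is less fragile. The following is a minimal sketch: `output_s3_uri` is a hypothetical helper, and the sample dictionary only mimics the part of the job description used above, with a made-up bucket name.

```python
def output_s3_uri(description, output_name):
    """Return the S3 URI of the named output from a processing job description."""
    for output in description['ProcessingOutputConfig']['Outputs']:
        if output['OutputName'] == output_name:
            return output['S3Output']['S3Uri']
    raise KeyError(f'No processing output named {output_name!r}')


# Made-up description fragment, for illustration only:
sample_description = {
    'ProcessingOutputConfig': {
        'Outputs': [
            {
                'OutputName': 'preprocessed',
                'S3Output': {
                    'S3Uri': 's3://my-example-bucket/data/output',
                    'LocalPath': '/opt/ml/processing/output',
                },
            }
        ]
    }
}
print(output_s3_uri(sample_description, 'preprocessed'))
```

In the notebook, the equivalent call would be `output_s3_uri(results, 'preprocessed')`.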