# Task 4: Perform data processing with your own container

In the previous notebook, you used Amazon SageMaker Processing and the scikit-learn built-in container for data processing.

In this notebook, you set up the environment needed to run a scikit-learn script with your own processing container.

You create your own Docker image, build your processing container, and use a **ScriptProcessor** class from the SageMaker Python SDK to run a scikit-learn preprocessing script within the container.

Finally, you validate the data processing results saved in Amazon Simple Storage Service (Amazon S3).

## Task 4.1: Environment setup

In this task, you import the required libraries and dependencies.

You set up an Amazon S3 bucket to store the outputs from the processing job and also get the execution role to run the SageMaker Processing job.

In [None]:
#install-dependencies
import logging
import boto3
import sagemaker
import pandas as pd
import numpy as np

sagemaker_logger = logging.getLogger("sagemaker")
sagemaker_logger.setLevel(logging.INFO)
sagemaker_logger.addHandler(logging.StreamHandler())

sagemaker_session = sagemaker.Session()

#Execution role to run the SageMaker Processing job
role = sagemaker.get_execution_role()
print("SageMaker Execution Role: ", role)

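#S3 bucket that holds the scikit-learn processing script and receives the processing job outputs
s3 = boto3.resource('s3')
bucket = None
for candidate in s3.buckets.all():
    if 'labdatabucket' in candidate.name:
        bucket = candidate.name
#Fail early with a clear message instead of a NameError when no bucket matches
if bucket is None:
    raise RuntimeError("No S3 bucket with 'labdatabucket' in its name was found.")
print("Bucket: ", bucket)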

## Task 4.2: Create a processing container

In this task, you define and create a scikit-learn container from a Dockerfile.

### Task 4.2.1: Create a Dockerfile

In this task, you create a directory named docker and add the Dockerfile used to build the processing container. Because you are creating a scikit-learn container, the Dockerfile installs pandas and scikit-learn.

In [None]:
%mkdir -p docker

In [None]:
%%writefile docker/Dockerfile
FROM public.ecr.aws/docker/library/python:3.7-slim-buster

RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

### Task 4.2.2: Build the container image

In this task, you create a custom container image using the Amazon SageMaker Studio image build command-line interface (CLI).

With the SageMaker Studio image build CLI, you can build SageMaker-compatible Docker images directly from your SageMaker Studio environment.

Install the SageMaker Studio image build package:

In [None]:
%pip install sagemaker-studio-image-build

Navigate to the directory that contains your Dockerfile and run the sm-docker build command. This command automatically logs build output and returns the **Image URI** of your Docker image. This takes approximately 2 minutes to complete.

In [None]:
%%sh

cd docker

sm-docker build .

Next, copy the **Image URI** and paste it into a text editor of your choice. 
You use this **Image URI** to create a **ScriptProcessor** class.
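
Before you paste the **Image URI** into the notebook, you might want to confirm that the copied value looks like an Amazon ECR image URI. The following is a minimal sketch, assuming the common `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` shape. The function name `parse_ecr_image_uri`, the pattern, and the example account ID, Region, and repository name are illustrative assumptions and are not part of this lab.

```python
import re

# Assumed general shape of an Amazon ECR image URI:
#   <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
# URIs in other partitions (for example, amazonaws.com.cn) are not covered.
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[a-z0-9._/-]+)(?::(?P<tag>[\w.-]+))?$"
)


def parse_ecr_image_uri(image_uri):
    """Return the URI components as a dict, or raise ValueError."""
    match = ECR_URI_PATTERN.match(image_uri.strip())
    if match is None:
        raise ValueError(f"Not a recognizable ECR image URI: {image_uri!r}")
    return match.groupdict()


# Example with a made-up account ID, Region, and repository name.
parts = parse_ecr_image_uri(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:latest"
)
print(parts["repository"], parts["tag"])
```

A check like this catches the case where the **REPLACE_IMAGE_URI** placeholder was left unchanged, because the placeholder does not match the pattern and raises a `ValueError`.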

## Task 4.3: Run the SageMaker processing job

In this task, you use the same preprocessed dataset from the previous notebook.

In [None]:
#import-data
shape = pd.read_csv("data/adult_data.csv", header=None)
shape.sample(5)



You then use the SageMaker ScriptProcessor class to define and run a processing script as a processing job. Refer to [SageMaker ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor) for more information about this class.

To create a ScriptProcessor instance, you configure the following parameters:
- **base_job_name**: Prefix for the processing job name
- **command**: Command to run, in addition to any command-line flags
- **image_uri**: URI of the Docker image to use for the processing jobs
- **role**: SageMaker execution role
- **instance_count**: Number of instances to run the processing job
- **instance_type**: Type of Amazon Elastic Compute Cloud (Amazon EC2) instance used for the processing job
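
To see how the **command** parameter relates to the script you run later in this task, the following sketch approximates the command line that ends up running inside the container. It assumes that the SageMaker Python SDK places the script under `/opt/ml/processing/input/code` and appends its path to **command**. The helper `container_invocation` is hypothetical and is not an SDK function. The script name and argument shown are the ones used later in this task.

```python
import posixpath

# Assumption: the SDK mounts the uploaded script under this container directory.
CODE_DIR = "/opt/ml/processing/input/code"


def container_invocation(command, script_name, arguments=None):
    """Approximate the argument vector that the container runs."""
    script_path = posixpath.join(CODE_DIR, script_name)
    return list(command) + [script_path] + list(arguments or [])


argv = container_invocation(
    ["python3"],
    "sklearn_preprocessing.py",
    ["--train-test-split-ratio", "0.2"],
)
print(" ".join(argv))
# python3 /opt/ml/processing/input/code/sklearn_preprocessing.py --train-test-split-ratio 0.2
```

Under this assumption, the interpreter named in **command** must exist in the image, which is why the Dockerfile is built on a Python base image.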

In the following code, replace **REPLACE_IMAGE_URI** with the URI from your text editor.

In [None]:
#sagemaker-script-processor
from sagemaker.processing import ScriptProcessor

# create a ScriptProcessor
script_processor = ScriptProcessor(
    base_job_name="own-processing-container",
    command=["python3"],
    image_uri="REPLACE_IMAGE_URI",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)


Next, you use the ScriptProcessor.run() method to run the **sklearn_preprocessing.py** script as a processing job. This is the same script that you used in Task 3, but you are now running it on a custom container built from a base image. Refer to [ScriptProcessor.run()](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) for more information about this method.

To run the processing job, you configure the following parameters:
- **code**: Path of the preprocessing script
- **inputs**: Path of input data for the preprocessing script (Amazon S3 input location)
- **outputs**: Path of output for the preprocessing script (Amazon S3 output location)
- **arguments**: Command-line arguments to the preprocessing script (such as the train-test split ratio)

The processing job takes approximately 4–5 minutes to complete.
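
The contents of **sklearn_preprocessing.py** are not shown in this notebook. As a rough sketch of how such a script could consume these parameters, the following code parses the `--train-test-split-ratio` argument and resolves the container paths used by this task's processing job. The function names and the overall structure are assumptions for illustration only. The file names come from the input and output locations used elsewhere in this notebook.

```python
import argparse
import posixpath

# Container paths that match the ProcessingInput and ProcessingOutput
# settings used by the processing job in this task.
INPUT_DIR = "/opt/ml/processing/input"
TRAIN_DIR = "/opt/ml/processing/train"
TEST_DIR = "/opt/ml/processing/test"


def parse_args(argv=None):
    """Parse the command-line arguments passed through `arguments`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, required=True)
    return parser.parse_args(argv)


def resolve_paths():
    """Return the files that a script like this would read and write."""
    return {
        "input": posixpath.join(INPUT_DIR, "adult_data.csv"),
        "train": posixpath.join(TRAIN_DIR, "train_features.csv"),
        "test": posixpath.join(TEST_DIR, "test_features.csv"),
    }


# Simulate the arguments that the processing job passes to the script.
args = parse_args(["--train-test-split-ratio", "0.2"])
paths = resolve_paths()
print(args.train_test_split_ratio, paths["input"])
```

Anything the script writes under the train and test directories is uploaded to the matching Amazon S3 output location when the job finishes.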

In [None]:
#processing-job
import os
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Amazon S3 path prefix
input_raw_data_prefix = "data/input"
output_preprocessed_data_prefix = "data/output"
scripts_prefix = "scripts/smstudiofiles"
logs_prefix = "logs"

# Run the processing job
script_processor.run(
    code="s3://" + os.path.join(bucket, scripts_prefix, "sklearn_preprocessing.py"),
    inputs=[ProcessingInput(source="s3://" + os.path.join(bucket, input_raw_data_prefix, "adult_data.csv"),
                            destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", 
                         source="/opt/ml/processing/train",
                         destination="s3://" + os.path.join(bucket, output_preprocessed_data_prefix, "train")),
        ProcessingOutput(output_name="test_data", 
                         source="/opt/ml/processing/test",
                         destination="s3://" + os.path.join(bucket, output_preprocessed_data_prefix, "test")),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)
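
The printed job description is a large dictionary. The following sketch pulls out the job status and the Amazon S3 location of each named output. It assumes that the dictionary follows the DescribeProcessingJob response layout (`ProcessingJobStatus` and `ProcessingOutputConfig`). The helper name `summarize_processing_job` and the sample dictionary, including its bucket name, are made up for illustration.

```python
def summarize_processing_job(description):
    """Return (status, {output_name: s3_uri}) from a job description dict."""
    status = description.get("ProcessingJobStatus")
    outputs = description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    uris = {out["OutputName"]: out["S3Output"]["S3Uri"] for out in outputs}
    return status, uris


# Hand-written sample with a placeholder bucket name, not real job output.
sample_description = {
    "ProcessingJobStatus": "Completed",
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {"S3Uri": "s3://example-bucket/data/output/train"},
            },
            {
                "OutputName": "test_data",
                "S3Output": {"S3Uri": "s3://example-bucket/data/output/test"},
            },
        ]
    },
}

status, uris = summarize_processing_job(sample_description)
print(status, uris["train_data"])
```

In the notebook, you could pass `script_processor_job_description` to the same helper instead of the sample dictionary.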

## Task 4.4: Validate the data processing results

Validate the output of the processing job that you ran by looking at the first five rows of the train and test output datasets.

In [None]:
#view-train-dataset
print("Top 5 rows from s3://{}/{}/train/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/train/train_features.csv - | head -n5

In [None]:
#view-validation-dataset
print("Top 5 rows from s3://{}/{}/test/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/test/test_features.csv - | head -n5
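
Beyond viewing the first rows, you could also check that the sizes of the two datasets are consistent with the split ratio passed to the job. The following is a minimal sketch, assuming that the preprocessing script treats `0.2` as the fraction of rows assigned to the test set. The helper name `split_matches_ratio`, the tolerance, and the row counts are hypothetical.

```python
def split_matches_ratio(n_train, n_test, test_ratio, tolerance=0.01):
    """Return True if the test share of all rows is close to test_ratio."""
    total = n_train + n_test
    if total == 0:
        raise ValueError("Both datasets are empty")
    return abs(n_test / total - test_ratio) <= tolerance


# Made-up row counts for illustration only.
print(split_matches_ratio(8000, 2000, 0.2))  # True
print(split_matches_ratio(5000, 5000, 0.2))  # False
```

To apply this to the real outputs, you would count the rows in **train_features.csv** and **test_features.csv** and pass those counts in.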

### Conclusion

Congratulations! You have successfully built your own processing container and used SageMaker Processing to run the processing job.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.