# Task 4: Bring your own container to SageMaker Studio

In this notebook, you create your own Docker image and build a processing container. You use the **ScriptProcessor** class from the Amazon SageMaker Python SDK to run a scikit-learn preprocessing script within the container. Then, you validate the data processing results that are saved in Amazon Simple Storage Service (Amazon S3).

## Task 4.1: Environment setup

Install the required libraries and dependencies.

You also locate the Amazon S3 bucket that stores the outputs from the processing job and get the execution role that runs the SageMaker processing job.

In [9]:
%%capture

# Install the SageMaker Studio Image Build CLI and upgrade pandas
%pip install sagemaker-studio-image-build
%pip install --upgrade pandas

In [10]:
#install-dependencies
import logging
import boto3
import sagemaker
import pandas as pd
import numpy as np
from sagemaker.s3 import S3Downloader

sagemaker_logger = logging.getLogger("sagemaker")
sagemaker_logger.setLevel(logging.INFO)
sagemaker_logger.addHandler(logging.StreamHandler())

sagemaker_session = sagemaker.Session()

#Execution role to run the SageMaker Processing job
role = sagemaker.get_execution_role()
print("SageMaker Execution Role: ", role)

#S3 bucket that holds the scikit-learn processing script and receives the processing job outputs
s3 = boto3.resource('s3')
bucket = None
for candidate in s3.buckets.all():
    if 'databucket' in candidate.name:
        bucket = candidate.name
if bucket is None:
    raise RuntimeError("No S3 bucket with 'databucket' in its name was found")
print("Bucket: ", bucket)

prefix = 'scripts/data'
S3Downloader.download(s3_uri=f"s3://{bucket}/{prefix}/abalone_data.csv", local_path='data/')

SageMaker Execution Role:  arn:aws:iam::995478082385:role/LabVPC-notebook-role
Bucket:  databucket-us-west-2-7611696109606748


['data/abalone_data.csv']

## Task 4.2: Create a processing container

Define and create a scikit-learn container by using a Dockerfile.

### Task 4.2.1: Create a Dockerfile

Create a Docker directory and add the Dockerfile that creates the processing container. Because you are creating a scikit-learn container, you install pandas and scikit-learn.

In [3]:
%mkdir -p docker

In [11]:
%%writefile docker/Dockerfile
FROM public.ecr.aws/docker/library/python:3.10-slim-bullseye

RUN pip3 install pandas scikit-learn
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Overwriting docker/Dockerfile


### Task 4.2.2: Build the container image

Create a custom container image by using the Amazon SageMaker Studio Image Build command line interface (CLI).

By using the Amazon SageMaker Studio Image Build CLI, you can build Amazon SageMaker compatible Docker images directly from your SageMaker Studio environments. Using the Image Build CLI helps save time and increase security because it abstracts the creation of the build environment and requires fewer permissions.

Navigate to the directory that contains your Dockerfile and run the sm-docker build command. This command automatically logs build output and returns the **Image URI** of your Docker image. This step takes 2–5 minutes to complete.

In [12]:
%%sh

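# Replace the system libstdc++ with the copy that ships with the conda environment
sudo rm /usr/lib/x86_64-linux-gnu/libstdc++.so.6
sudo cp /opt/conda/lib/libstdc++.so.6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6

# Build the image from the Dockerfile in the docker directory
cd docker
sm-docker build .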



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
...................[Container] 2025/04/17 12:05:31.794757 Running on CodeBuild On-demand

[Container] 2025/04/17 12:05:31.794798 Waiting for agent ping
[Container] 2025/04/17 12:05:32.098044 Waiting for DOWNLOAD_SOURCE
[Container] 2025/04/17 12:05:32.279343 Phase is DOWNLOAD_SOURCE
[Container] 2025/04/17 12:05:32.314768 CODEBUILD_SRC_DIR=/codebuild/output/src59280434/src
[Container] 2025/04/17 12:05:32.315387 YAML location is /codebuild/output/src59280434/src/buildspec.yml
[Container] 2025/04/17 12:05:32.317716 Setting HTTP client timeout to higher timeout for S3 source
[Container] 2025/04/17 12:05:32.318029 Processing environment variables
[Container] 2025/04/17 12:05:32.395302 No runtime version selected in buildspec.
[Container] 2025/04/17 12:05:32.415061 Moving to directory

When the cell completes, an Image URI is returned that looks like *012345678910.dkr.ecr.us-east-1.amazonaws.com/sagemaker-studio-d-vcbyjgmmjzzy:data-scientist-test-user*.

1. Copy the **Image URI** and paste it into a text editor of your choice. 

You use this **Image URI** when you create the **ScriptProcessor** object.
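Before you paste the URI into later cells, it can help to check that the copied text has the expected shape. The following is a minimal sketch, assuming the URI follows the form `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` like the example above. The `parse_image_uri` helper and the regular expression are illustrative assumptions, not part of the SageMaker SDK, and the pattern is deliberately simplified.

```python
import re

# Simplified pattern for a private Amazon ECR image URI of the form
# <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
# (an assumption for illustration; it does not cover every valid URI).
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[a-z0-9._/-]+):(?P<tag>[A-Za-z0-9._-]+)$"
)


def parse_image_uri(image_uri):
    """Split an ECR image URI into its parts, or raise ValueError if it does not match."""
    match = ECR_URI_PATTERN.match(image_uri.strip())
    if match is None:
        raise ValueError(f"Not a recognizable ECR image URI: {image_uri!r}")
    return match.groupdict()


# The example URI shown in this notebook
example = (
    "012345678910.dkr.ecr.us-east-1.amazonaws.com/"
    "sagemaker-studio-d-vcbyjgmmjzzy:data-scientist-test-user"
)
print(parse_image_uri(example))
```

A check like this also rejects an unreplaced placeholder such as `REPLACE_IMAGE_URI`, which would otherwise fail only after the processing job is submitted.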

## Task 4.3: Run the SageMaker processing job

AnyCompany Consulting is working on a project with a wildlife group that is studying abalone age. An abalone is a type of mollusk or marine snail. The group wants to predict the age of live specimens instead of cutting open their shells to determine it.

The abalone dataset represents a population of more than 4,000 abalones. The dataset includes columns for sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings.

Run a processing job on the abalone dataset.

In [13]:
#import-data
abalone_df = pd.read_csv("data/abalone_data.csv", header=0)
abalone_df.sample(5)

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
1631,I,0.57,0.445,0.145,0.741,0.306,0.172,0.183,12
1740,F,0.675,0.51,0.195,1.382,0.605,0.318,0.397,10
2038,I,0.28,0.215,0.08,0.132,0.072,0.022,0.033,5
1768,I,0.435,0.3,0.12,0.597,0.259,0.139,0.165,8
227,I,0.365,0.27,0.085,0.205,0.078,0.049,0.07,7


Then, use the SageMaker ScriptProcessor class to define and run a processing script as a processing job. Refer to [SageMaker ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor) for more information about this class.

To create a ScriptProcessor object, you configure the following parameters:
- **base_job_name**: Prefix for the processing job name
- **command**: Command to run, along with any command-line flags
- **image_uri**: URI of the Docker image to use for the processing jobs
- **role**: SageMaker execution role
- **instance_count**: Number of instances to run the processing job
- **instance_type**: Type of Amazon Elastic Compute Cloud (Amazon EC2) instance that is used for the processing job

1. In the following code, replace **REPLACE_IMAGE_URI** with the URI from your text editor.

In [15]:
#sagemaker-script-processor
from sagemaker.processing import ScriptProcessor

# create a ScriptProcessor
script_processor = ScriptProcessor(
    base_job_name="own-processing-container",
    command=["python3"],
    image_uri="REPLACE_IMAGE_URI",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

Next, use the ScriptProcessor.run() method to run the **sklearn_preprocessing.py** script as a processing job. Refer to [ScriptProcessor.run()](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) for more information about this method.

To run the processing job, you configure the following parameters:
- **code**: Path of the preprocessing script
- **inputs**: Path of input data for the preprocessing script (Amazon S3 input location)
- **outputs**: Path of output for the preprocessing script (Amazon S3 output location)
- **arguments**: Command-line arguments to the preprocessing script (such as the train-test split ratio)

The processing job takes approximately 4–5 minutes to complete.
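To see how **command**, **code**, and **arguments** fit together, the following sketch assembles the command line that the container would be expected to run. It assumes that the script is placed under `/opt/ml/processing/input/code`, which matches the `LocalPath` of the `code` input in the job description printed later in this notebook. The `container_command` helper and the `example-bucket` name are hypothetical and used only for illustration.

```python
import posixpath
import shlex

# Container directory where the script passed as `code` is expected to land
# (assumption based on the 'code' input LocalPath in the job description).
CODE_DIR = "/opt/ml/processing/input/code"


def container_command(command, code_uri, arguments):
    """Return the argv list the container would run for a given script and arguments."""
    script_name = posixpath.basename(code_uri)
    return list(command) + [posixpath.join(CODE_DIR, script_name)] + list(arguments)


argv = container_command(
    command=["python3"],
    # Hypothetical bucket name; the notebook uses the lab's databucket
    code_uri="s3://example-bucket/scripts/smstudiofiles/sklearn_preprocessing.py",
    arguments=["--train-test-split-ratio", "0.2"],
)
print(shlex.join(argv))
```

Read this way, the values in **arguments** reach the script as ordinary command-line flags, so the script can parse them with a standard argument parser.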

In [16]:
#processing-job
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Amazon S3 path prefixes
input_raw_data_prefix = "scripts/data"
output_preprocessed_data_prefix = "scripts/data/output"
scripts_prefix = "scripts/smstudiofiles"
logs_prefix = "logs"

# Run the processing job
script_processor.run(
    code=f"s3://{bucket}/{scripts_prefix}/sklearn_preprocessing.py",
    inputs=[
        ProcessingInput(
            source=f"s3://{bucket}/{input_raw_data_prefix}/abalone_data.csv",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data",
            source="/opt/ml/processing/train",
            destination=f"s3://{bucket}/{output_preprocessed_data_prefix}/train",
        ),
        ProcessingOutput(
            output_name="test_data",
            source="/opt/ml/processing/test",
            destination=f"s3://{bucket}/{output_preprocessed_data_prefix}/test",
        ),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

# Print the description of the most recent processing job
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)

Creating processing-job with name own-processing-container-2025-04-17-12-10-57-739


...............
..{'ProcessingInputs': [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://databucket-us-west-2-7611696109606748/scripts/data/abalone_data.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://databucket-us-west-2-7611696109606748/scripts/smstudiofiles/sklearn_preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://databucket-us-west-2-7611696109606748/scripts/data/output/train', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}, 'AppManaged': False}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://databucket-u
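The **sklearn_preprocessing.py** script itself is not shown in this notebook. The following is a rough sketch of the input and output contract that such a script would need to follow, based on the paths and the argument configured above. It is not the lab's script: it uses only the Python standard library, with a simple shuffled split standing in for scikit-learn's splitting utilities, and the `main`, `split_rows`, and `write_csv` names and the `base_dir` parameter are assumptions made for illustration.

```python
import argparse
import csv
import random
from pathlib import Path


def parse_args(argv=None):
    """Read the split ratio that ScriptProcessor.run() forwards through `arguments`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    return parser.parse_args(argv)


def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, rows):
    """Write rows without a header, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def main(argv=None, base_dir="/opt/ml/processing"):
    """Read the input CSV, split it, and write the train and test files."""
    args = parse_args(argv)
    base = Path(base_dir)
    # The ProcessingInput destination maps the S3 object into <base>/input
    with (base / "input" / "abalone_data.csv").open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        rows = list(reader)
    train_rows, test_rows = split_rows(rows, args.train_test_split_ratio)
    # Files written under the ProcessingOutput sources are uploaded to S3 at the end of the job
    write_csv(base / "train" / "train_features.csv", train_rows)
    write_csv(base / "test" / "test_features.csv", test_rows)
    return len(train_rows), len(test_rows)
```

In a real script, `main()` would be called when the file runs as a program, and the feature selection and transformation steps that this sketch omits would sit between reading and writing.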

## Task 4.4: Validate the data processing results

Validate the output of the processing job that you ran by looking at the first five rows of the train and test output datasets.

In [17]:
#view-train-dataset
print("Top 5 rows from s3://{}/{}/train/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/train/train_features.csv - | head -n5

Top 5 rows from s3://databucket-us-west-2-7611696109606748/scripts/data/output/train/
I,0.18,0.135,0.08,0.033,0.015,0.007,0.01
I,0.215,0.15,0.055,0.041,0.015,0.009,0.013
M,0.66,0.53,0.17,1.391,0.591,0.212,0.453
M,0.715,0.525,0.2,1.89,0.95,0.436,0.431
M,0.595,0.455,0.155,1.041,0.416,0.211,0.365


In [19]:
#view-test-dataset
print("Top 5 rows from s3://{}/{}/test/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/test/test_features.csv - | head -n5

Top 5 rows from s3://databucket-us-west-2-7611696109606748/scripts/data/output/test/
M,0.55,0.425,0.155,0.918,0.278,0.243,0.335
I,0.5,0.4,0.12,0.616,0.261,0.143,0.194
M,0.62,0.48,0.155,1.256,0.527,0.374,0.318
I,0.22,0.165,0.055,0.055,0.022,0.012,0.02
M,0.645,0.5,0.175,1.511,0.674,0.376,0.378


When the cells complete, responses are returned that look like the following output:

```plain
Top 5 rows from s3://databucket-us-east-1-xxxxxxx/scripts/data/output/test/
M,0.55,0.425,0.155,0.918,0.278,0.243,0.335
I,0.5,0.4,0.12,0.616,0.261,0.143,0.194
M,0.62,0.48,0.155,1.256,0.527,0.374,0.318
I,0.22,0.165,0.055,0.055,0.022,0.012,0.02
M,0.645,0.5,0.175,1.511,0.674,0.376,0.378
```

The column headers for the abalone dataset are sex (Infant, Male, Female), length, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight, and rings. Each row in the feature files shown here has eight values, one for each column from sex through shell_weight, so rings does not appear in them. The output from the **train/** and **test/** folders shows the processed data that is stored in your S3 bucket.
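Beyond reading the first five rows by eye, a small structural check can catch malformed output. The following is a minimal sketch, assuming that each row in the feature files holds a sex code followed by seven numeric measurements, as the rows shown above suggest. The `check_rows` helper and the column order are assumptions for illustration, and the sample rows are copied from the train output in this notebook.

```python
import csv
import io

# Assumed feature columns for each output row (the abalone columns without `rings`)
FEATURE_COLUMNS = [
    "sex", "length", "diameter", "height",
    "whole_weight", "shucked_weight", "viscera_weight", "shell_weight",
]
VALID_SEX = {"M", "F", "I"}


def check_rows(csv_text):
    """Return a list of problems found in headerless feature rows (empty if none)."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != len(FEATURE_COLUMNS):
            problems.append(
                f"line {line_no}: expected {len(FEATURE_COLUMNS)} fields, got {len(row)}"
            )
            continue
        if row[0] not in VALID_SEX:
            problems.append(f"line {line_no}: unexpected sex value {row[0]!r}")
        try:
            [float(value) for value in row[1:]]
        except ValueError:
            problems.append(f"line {line_no}: non-numeric measurement")
    return problems


# Rows copied from the train output shown in this notebook
sample = """I,0.18,0.135,0.08,0.033,0.015,0.007,0.01
I,0.215,0.15,0.055,0.041,0.015,0.009,0.013
M,0.66,0.53,0.17,1.391,0.591,0.212,0.453
"""
print(check_rows(sample))
```

An empty list means that every row passed the checks. The same function could be applied to the full text of a downloaded feature file.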

You have built your own processing container and used SageMaker Processing to run the processing job on that custom container.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.