# Preprocessing data

## Docker image

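We need to run a processing job in SageMaker using a Docker container. Because the job only prepares our dataset of images, a slim Python image with the job's dependencies is enough. The Dockerfile below uses two stages: the first installs the Python packages along with the compilers they may need, and the second copies only the installed packages and the contents of the build directory into a clean image.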

In [3]:
%%writefile containers/basic/Dockerfile

FROM python:3.8.5-slim AS build

RUN apt-get clean && \
    apt-get update -y && \
    apt-get install -y --no-install-recommends \
    python3-dev build-essential ca-certificates

WORKDIR /build

COPY requirements.txt .
ENV PATH=/root/.local/bin:$PATH

RUN pip install --user --upgrade pip
RUN pip install --user cython
RUN pip install --user pyyaml
RUN pip install --user -r requirements.txt

COPY config.yml .

FROM python:3.8.5-slim

RUN apt-get clean && \
    apt-get update -y && \
    apt-get install -y ca-certificates

ENV PATH="/opt/ml/code:/root/.local/bin:${PATH}"
RUN mkdir -p /opt/ml/code
WORKDIR /opt/ml/code

COPY --from=build /root/.local /root/.local
COPY --from=build /build/ .

ENTRYPOINT ["python3"]

Overwriting containers/basic/Dockerfile


In [4]:
%%writefile containers/basic/requirements.txt

pandas
numpy

Overwriting containers/basic/requirements.txt


### Setting up the ECR repository

SageMaker pulls the image for the job from Amazon Elastic Container Registry (ECR), so we need to upload our Docker image there. For that, we first need to create the repository.

In [5]:
REPOSITORY_NAME = "preprocess"
REPOSITORY_TAG = "latest"

Let's create the ECR repository unless it already exists. In either case, we grab the URI of the repository so we can tag and push our Docker image later.

In [17]:
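# Look up the repository URI; if the repository does not exist yet, create it.
# Both commands are queried so that they return only the URI as plain text.
repository = !aws ecr describe-repositories \
    --repository-names $REPOSITORY_NAME \
    --query "repositories[0].repositoryUri" \
    --output text 2>/dev/null \
    || aws ecr create-repository --repository-name $REPOSITORY_NAME \
    --query "repository.repositoryUri" --output text

# Full URI: <account>.dkr.ecr.<region>.amazonaws.com/<name>
repository_uri = repository[0]
# Keep only the registry host, which is what `docker login` expects.
repository = repository_uri[0:repository_uri.index("/")]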

print(f"ECR Repository: {repository}/{REPOSITORY_NAME}")

ECR Repository: 048982217509.dkr.ecr.us-west-2.amazonaws.com/preprocess
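
The same "describe, otherwise create" logic can also be sketched in Python instead of the AWS CLI. The version below is a minimal sketch, not the notebook's method: it assumes a boto3 ECR client is passed in, and the helper names `get_or_create_repository_uri` and `split_repository_uri` are hypothetical. Only the URI-splitting part runs when the block is executed; the AWS call is left as a comment because it needs credentials.

```python
from typing import Tuple


def get_or_create_repository_uri(client, name: str) -> str:
    """Return the URI of the ECR repository `name`, creating it if needed.

    `client` is assumed to be a boto3 ECR client, e.g. boto3.client("ecr").
    """
    try:
        response = client.describe_repositories(repositoryNames=[name])
        return response["repositories"][0]["repositoryUri"]
    except client.exceptions.RepositoryNotFoundException:
        # The repository does not exist yet, so create it.
        response = client.create_repository(repositoryName=name)
        return response["repository"]["repositoryUri"]


def split_repository_uri(repository_uri: str) -> Tuple[str, str]:
    """Split '<registry>/<name>' into (registry, name)."""
    registry, _, name = repository_uri.partition("/")
    if not registry or not name:
        raise ValueError(f"Unexpected repository URI: {repository_uri!r}")
    return registry, name


# With AWS credentials configured, the URI would come from:
#   import boto3
#   uri = get_or_create_repository_uri(boto3.client("ecr"), "preprocess")
uri = "048982217509.dkr.ecr.us-west-2.amazonaws.com/preprocess"
registry, name = split_repository_uri(uri)
print(f"Registry: {registry}")
print(f"Repository name: {name}")
```

Passing the client in as an argument keeps the function easy to exercise with a stub object, which is useful when no AWS account is available.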


### Building and pushing the image to the ECR repository

Let's build, tag, and push our Docker image to the ECR repository that we created.

Before we can push the image, we need to authenticate Docker against the ECR registry.

In [None]:
!aws ecr get-login-password | docker login --username AWS --password-stdin $repository
!docker build -t $repository/$REPOSITORY_NAME:$REPOSITORY_TAG containers/basic/.
!docker push $repository/$REPOSITORY_NAME:$REPOSITORY_TAG
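
The build and push commands above both rely on the same image reference, `<registry>/<name>:<tag>`. As a rough illustration of how that reference is assembled, the sketch below builds the two `docker` command lines in Python and prints them without running anything. The helper names `image_reference` and `build_and_push_commands` and the registry host in the example are hypothetical, and the login step is left out because it pipes a password between two processes.

```python
import shlex
from typing import List


def image_reference(registry: str, name: str, tag: str = "latest") -> str:
    """Return the fully qualified image reference '<registry>/<name>:<tag>'."""
    return f"{registry}/{name}:{tag}"


def build_and_push_commands(
    registry: str, name: str, tag: str, context: str
) -> List[List[str]]:
    """Return the argv lists for `docker build` and `docker push`."""
    image = image_reference(registry, name, tag)
    return [
        ["docker", "build", "-t", image, context],
        ["docker", "push", image],
    ]


# Hypothetical registry host, used only to show the resulting commands.
commands = build_and_push_commands(
    "123456789012.dkr.ecr.us-west-2.amazonaws.com",
    "preprocess",
    "latest",
    "containers/basic",
)
for argv in commands:
    # Print each command as a shell-safe string; nothing is executed here.
    print(shlex.join(argv))
```

Keeping each command as a list of arguments means it could later be handed to `subprocess.run` without going through a shell.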