# Preprocessing data

## Docker image

We need to run a processing job in SageMaker using a Docker container. Since this is a simple job that prepares our dataset of images, let's create a minimal Python image with everything the job needs.

In [9]:
%cd /home/sagemaker-user

[Errno 2] No such file or directory: '/home/sagemaker-user'
/root


In [6]:
%%writefile containers/basic/Dockerfile

FROM python:3.8.5-slim AS build

RUN apt-get clean && \
    apt-get update -y && \
    apt-get install -y --no-install-recommends \
    python3-dev build-essential ca-certificates

WORKDIR /build

COPY requirements.txt .
ENV PATH=/root/.local/bin:$PATH

RUN pip install --user --upgrade pip
RUN pip install --user cython
RUN pip install --user pyyaml
RUN pip install --user -r requirements.txt

COPY config.yml .

FROM python:3.8.5-slim 

RUN apt-get clean && \
    apt-get update -y && \
    apt-get install -y ca-certificates

ENV PATH="/opt/ml/code:/root/.local/bin:${PATH}"
RUN mkdir -p /opt/ml/code
WORKDIR /opt/ml/code

COPY --from=build /root/.local /root/.local
COPY --from=build /build/ .

ENTRYPOINT ["python3"]

Writing containers/basic/Dockerfile


FileNotFoundError: [Errno 2] No such file or directory: 'containers/basic/Dockerfile'
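
The `%%writefile` magic does not create missing parent directories, which is one likely cause of a `FileNotFoundError` like the one above. A minimal sketch, assuming `containers/basic` should live under the notebook's current working directory, creates the folder before any file is written to it. The helper name `ensure_directory` is hypothetical.

```python
from pathlib import Path


def ensure_directory(path):
    """Create `path` (and any missing parents) and return it as a Path."""
    directory = Path(path)
    # exist_ok=True makes the call safe to repeat when the cell is re-run.
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Assumed usage: run once before the %%writefile cells.
container_dir = ensure_directory("containers/basic")
print(f"Container directory ready: {container_dir}")
```

Because the call is idempotent, it can stay at the top of the notebook and be re-run after a kernel restart.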

In [2]:
%%writefile containers/basic/requirements.txt

pandas
numpy

Overwriting containers/basic/requirements.txt


### Setting up the ECR repository

We need to make the Docker image available by uploading it to ECR. For that, we first need to create the repository.

In [3]:
REPOSITORY_NAME = "preprocess"
REPOSITORY_TAG = "latest"

Let's create the ECR repository unless it already exists. In either case, we are going to grab the URI of the repository and use it to tag and push our Docker image later.

In [4]:
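# Look up the repository URI; create the repository if it does not exist yet.
# Both branches print the URI as a quoted JSON string.
repository = !aws ecr describe-repositories \
    --repository-names $REPOSITORY_NAME \
    --query "repositories[0].repositoryUri" 2>/dev/null \
    || aws ecr create-repository --repository-name $REPOSITORY_NAME \
    --query "repository.repositoryUri"

# Strip the surrounding double quotes from the JSON string.
repository = repository[0][1:-1]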
print(f"ECR Repository: {repository}")

ECR Repository: 048982217509.dkr.ecr.us-west-2.amazonaws.com/preprocess
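
The URI printed above follows the pattern `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>`. As a sketch, the URI can be split into its parts, for example to obtain the registry host or the region. The helper name `split_repository_uri` and the dictionary keys are hypothetical and not part of any AWS library.

```python
def split_repository_uri(uri):
    """Split an ECR repository URI into registry, account, region, and name."""
    registry, _, name = uri.partition("/")
    parts = registry.split(".")
    # Expected layout: <account>.dkr.ecr.<region>.amazonaws.com
    if len(parts) < 6 or parts[1:3] != ["dkr", "ecr"] or not name:
        raise ValueError(f"Not an ECR repository URI: {uri!r}")
    return {
        "registry": registry,
        "account_id": parts[0],
        "region": parts[3],
        "name": name,
    }


info = split_repository_uri(
    "048982217509.dkr.ecr.us-west-2.amazonaws.com/preprocess"
)
print(info["registry"], info["region"], info["name"])
```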


### Building and pushing the image to the ECR repository

In [5]:
# Let's get our AWS account identifier. We are going to need this to tag and
# push our Docker image to ECR.

account_id = !aws sts get-caller-identity --query Account
print(f"Account Identifier: {account_id[0][1:-1]}")

Account Identifier: 048982217509
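
The `[1:-1]` slices used above strip the double quotes that the AWS CLI prints around a string when its output format is JSON. A sketch of a slightly more robust alternative, assuming JSON output, decodes the captured lines with `json.loads` instead. The helper name `parse_cli_json` is hypothetical, and the captured value is hard-coded here to mirror the output shown above.

```python
import json


def parse_cli_json(lines):
    """Decode the lines captured from a CLI call that printed JSON."""
    return json.loads("\n".join(lines))


# Stand-in for the list that `account_id = !aws sts ...` would capture.
captured = ['"048982217509"']
account_id = parse_cli_json(captured)
print(f"Account Identifier: {account_id}")
```

Unlike slicing, this also handles multi-line JSON documents and fails loudly if the command printed something other than JSON, such as an error message.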


Let's build and tag our Docker image.

In [5]:
# Tag the image as <repository-uri>:<tag>; IPython expands the names in braces.
!docker build -t {repository}:{REPOSITORY_TAG} containers/basic/
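
The build step needs an image reference of the form `<repository-uri>:<tag>`, and pushing to ECR usually requires a `docker login` against the registry first. The sketch below only assembles the command strings and does not run them. It assumes an AWS CLI version that provides `aws ecr get-login-password`; the helper name `ecr_push_commands` and its `region` and `context` parameters are hypothetical.

```python
def ecr_push_commands(repository, tag, region, context="containers/basic"):
    """Return the shell commands to log in, build, and push an image."""
    # The registry host is everything before the first slash of the URI.
    registry = repository.split("/")[0]
    image = f"{repository}:{tag}"
    return [
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {image} {context}",
        f"docker push {image}",
    ]


commands = ecr_push_commands(
    "048982217509.dkr.ecr.us-west-2.amazonaws.com/preprocess",
    "latest",
    "us-west-2",
)
for command in commands:
    print(command)
```

Printing the commands before running them makes an empty `repository` variable, for example after a kernel restart, easy to spot.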


In [4]:
!docker


Usage:	docker [OPTIONS] COMMAND

A self-sufficient runtime for containers

Options:
      --config string      Location of client config files (default
                           "/root/.docker")
  -c, --context string     Name of the context to use to connect to the
                           daemon (overrides DOCKER_HOST env var and
                           default context set with "docker context use")
  -D, --debug              Enable debug mode
  -H, --host list          Daemon socket(s) to connect to
  -l, --log-level string   Set the logging level
                           ("debug"|"info"|"warn"|"error"|"fatal")
                           (default "info")
      --tls                Use TLS; implied by --tlsverify
      --tlscacert string   Trust certs signed only by this CA (default
                           "/root/.docker/ca.pem")
      --tlscert string     Path to TLS certificate file (default
                           "/root/.docker/cert.pem")
      --tlskey string     

In [3]:
import tensorflow as tf