# Build and test the Training container
This notebook gives step-by-step instructions for building a Docker image for the training module of the tile-based classification application and for testing that image.

> Note: Before proceeding, make sure to select the correct kernel. In the top-right corner of the notebook, choose the Jupyter kernel named `Bash`.

## Set up the environment

In [5]:
export WORKSPACE=/workspace/machine-learning-process
export RUNTIME=${WORKSPACE}/runs
mkdir -p ${RUNTIME}
cd ${RUNTIME}
printenv | grep RUNTIME
pwd

XDG_RUNTIME_DIR=/workspace/.local
RUNTIME=/workspace/machine-learning-process/runs
/workspace/machine-learning-process/runs
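
The training step later in this notebook bind-mounts this directory into the container and runs as UID 1001, so it can help to confirm that the directory exists and is writable before going further. The following is a minimal sketch; `check_runtime_dir` is a hypothetical helper introduced here for illustration and is not part of the repository.

```shell
# Hypothetical helper (not part of the notebook or repository):
# succeed only if the given path is an existing, writable directory.
check_runtime_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing or not writable: $dir" >&2
        return 1
    fi
}

# Fall back to the path used in this notebook when RUNTIME is unset.
check_runtime_dir "${RUNTIME:-/workspace/machine-learning-process/runs}" || true
```

A non-zero return code here usually points to a permissions problem on the host side, which would otherwise only surface once the container tries to write its outputs.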


## Build the container

Inspect the container file:

In [6]:
cat ${WORKSPACE}/training/make-ml-model/Dockerfile

# Stage 1: Build stage
FROM rockylinux:9.3-minimal AS build

# Install necessary build tools
RUN microdnf install -y curl tar

# Download the hatch tar.gz file from GitHub
RUN curl -L https://github.com/pypa/hatch/releases/download/hatch-v1.14.0/hatch-x86_64-unknown-linux-gnu.tar.gz -o /tmp/hatch-x86_64-unknown-linux-gnu.tar.gz

# Extract the hatch binary
RUN tar -xzf /tmp/hatch-x86_64-unknown-linux-gnu.tar.gz -C /tmp/

# Stage 2: Final stage
FROM rockylinux:9.3-minimal

# Set up a default user and home directory
ENV HOME=/home/neo

# Install essential libraries including expat and python3 without `config` commands
RUN microdnf install -y \
    expat \
    libpq \
    curl \
    git \
    wget \
    tar \
    && microdnf install -y python3 \
    && microdnf clean all

# Create a user with UID 1001, group root, and a home directory
RUN useradd -u 1001 -g 100 -m -d ${HOME} -s /sbin/nologin \
         -c "Default Neo User" neo && \
    mkdir -p /code /prod ${HOME}/.cache /home/neo/.local/

Build the container using `docker`:

In [7]:
docker build --format docker -t localhost/training:latest ${WORKSPACE}/training/make-ml-model


WARN[0000] "/" is not a shared mount, this could cause issues or missing mounts with rootless containers
[1/2] STEP 1/4: FROM rockylinux:9.3-minimal AS build
Resolved "rockylinux" as an alias (/etc/containers/registries.conf.d/shortnames.conf)
Trying to pull docker.io/library/rockylinux:9.3-minimal...
Getting image source signatures
Copying blob 8ec988941d66 done
Copying config dfaa211c6b done
Writing manifest to image destination
Storing signatures
[1/2] STEP 2/4: RUN microdnf install -y curl tar
Downloading metadata
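
Once the build finishes, one way to confirm that the image was tagged as expected is to inspect it by name. This is a sketch under the assumption that the `docker` command on the machine supports `docker image inspect`; `image_exists` is a hypothetical helper name.

```shell
# Hypothetical helper: return 0 if the named image is present locally.
# Any failure (image absent, or docker not installed) yields a non-zero status.
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}

if image_exists localhost/training:latest; then
    echo "image found"
else
    echo "image not found"
fi
```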

Show the `tile-based-training` help: 

In [20]:
docker run --rm -it localhost/training:latest hatch run prod:tile-based-training --help

2025-05-08 15:46:00.277106: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-08 15:46:00.384174: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-08 15:46:00.452870: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746719160.525223       2 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746719160.542846       2 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746719160.613757       2 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linkin
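
The help text is preceded by TensorFlow start-up messages about missing CUDA drivers, which are expected on a CPU-only machine. To read the help more easily, the output can be piped through a filter that drops those lines. This is a minimal sketch: `strip_tf_noise` is a hypothetical helper, and its patterns are taken only from the source file names visible in the log lines above, so other messages may still get through.

```shell
# Hypothetical helper: drop the TensorFlow/XLA start-up messages seen above.
# Reads from stdin and writes the remaining lines to stdout.
strip_tf_noise() {
    # grep exits with 1 when every line is filtered out; treat that as success.
    grep -v -E 'cudart_stub\.cc|cuda_(fft|dnn|blas)\.cc|computation_placer\.cc' || true
}

# Usage (assumption: -it is dropped because the output goes to a pipe,
# and stderr is merged because the log messages are written there):
# docker run --rm localhost/training:latest \
#     hatch run prod:tile-based-training --help 2>&1 | strip_tf_noise
```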

## Test the tile-based-training step in the container

This section trains a deep learning model on the EuroSAT dataset for a tile-based classification task and uses [MLflow](https://mlflow.org/) to monitor the model development cycle. MLflow tracks each run's logs and records key information such as the code version, the datasets used, and the model hyperparameters.
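
Since the training run reports to an MLflow tracking server, it may be worth checking that the server is reachable before starting. The sketch below assumes the server exposes the `/health` endpoint provided by the MLflow tracking server and that `curl` is installed. `mlflow_reachable` is a hypothetical helper, and `http://my-mlflow:5000` is the tracking URI used in the command that follows.

```shell
# Hypothetical helper: return 0 if the MLflow server at the given base URL
# answers its health endpoint within 5 seconds.
mlflow_reachable() {
    curl -fsS --max-time 5 "$1/health" >/dev/null 2>&1
}

if mlflow_reachable "http://my-mlflow:5000"; then
    echo "MLflow tracking server is reachable"
else
    echo "MLflow tracking server is not reachable from this shell"
fi
```

This checks reachability from the notebook's shell only. The container may resolve the hostname differently depending on its network configuration.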

In [21]:
docker run \
    -it \
    --userns=keep-id \
    --mount=type=bind,source=/workspace/machine-learning-process/runs,target=/runs \
    --workdir=/runs \
    --user=1001:100 \
    -e MLFLOW_TRACKING_URI=http://my-mlflow:5000 \
    --rm \
    localhost/training:latest \
    hatch run tile-based-training \
    --stac_reference https://raw.githubusercontent.com/eoap/machine-learning-process/main/training/app-package/EUROSAT-Training-Dataset/catalog.json \
    --BATCH_SIZE 2 \
    --CLASSES 10 \
    --DECAY 0.1 \
    --EPOCHS 5 \
    --EPSILON 0.000001 \
    --LEARNING_RATE 0.0001 \
    --LOSS categorical_crossentropy \
    --MEMENTUM 0.95 \
    --OPTIMIZER Adam \
    --REGULARIZER None \
    --SAMPLES_PER_CLASS 10


[2K[32m.  [0m [1;35mCreating environment: default[0m0m
[1A[2K[?25l[32m.  [0m [1;35mChecking dependencies[0m
[1A[2K2025-05-08 15:46:19.450821: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-08 15:46:19.453804: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-08 15:46:19.462626: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746719179.477456      13 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746719179.481742      13 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1

List the outputs:

In [22]:
tree ${RUNTIME}

/workspace/machine-learning-process/runs
├── config
│   └── config.yaml
├── envs
├── mlruns
├── output
│   └── logs
│       └── running_logs.log
├── params.yaml
└── src
    └── tile_based_training
        └── output
            ├── data_ingestion
            │   └── splitted_data.json
            ├── prepare_base_model
            │   └── base_model.keras
            └── training
                └── trained_model.keras

11 directories, 6 files
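
The listing above can also be checked programmatically, which is handy when the run is scripted. The following sketch tests that the three artifacts shown in the tree exist and are non-empty. `check_outputs` is a hypothetical helper, and the relative paths are copied from the listing above.

```shell
# Hypothetical helper: verify that the artifacts listed by `tree` exist
# under the given root directory and are not empty.
check_outputs() {
    root="$1"
    status=0
    for rel in \
        src/tile_based_training/output/data_ingestion/splitted_data.json \
        src/tile_based_training/output/prepare_base_model/base_model.keras \
        src/tile_based_training/output/training/trained_model.keras
    do
        if [ -s "$root/$rel" ]; then
            echo "found: $rel"
        else
            echo "missing or empty: $rel" >&2
            status=1
        fi
    done
    return "$status"
}

# Fall back to the path used in this notebook when RUNTIME is unset.
check_outputs "${RUNTIME:-/workspace/machine-learning-process/runs}" \
    || echo "some outputs are missing"
```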


## Clean-up 

In [33]:
rm -fr ${RUNTIME}
# Caution: this removes every local image, not only localhost/training:latest
docker rmi -f $(docker images -aq)
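
On a machine that holds other images, a narrower clean-up may be preferable. The sketch below removes only the run directory and the image built in this notebook, and refuses to act on an empty path or on `/`. `safe_cleanup` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: remove one run directory and one image by name.
safe_cleanup() {
    dir="$1"
    image="$2"
    # Guard against an unset variable expanding to an empty path or to "/".
    case "$dir" in
        ""|"/")
            echo "refusing to remove '$dir'" >&2
            return 1
            ;;
    esac
    rm -rf -- "$dir"
    # Only the named image is removed; other local images are left alone.
    docker rmi "$image" >/dev/null 2>&1 \
        || echo "image not removed (absent, in use, or docker unavailable): $image" >&2
}

# Usage:
# safe_cleanup "${RUNTIME}" localhost/training:latest
```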