## Start the tracking server

### Understand the MLFlow tracking server system
The “remote tracking server” system includes:

-   a database in which to store structured data for each “run”, such as the start and end time, hyperparameter values, and the values of metrics that we log to the server. In our deployment, this will be realized by a PostgreSQL server.
-   an object store in which MLFlow will log artifacts: model weights, images (e.g. PNGs), and so on. In our deployment, this will be realized by MinIO, an open source object storage system that is compatible with AWS S3 APIs (so it may be used as a drop-in self-managed replacement for AWS S3).
-   the MLFlow tracking server


### Start MLFlow tracking server system

Now we are ready to get it started! Bring up our MLFlow system with:

``` bash
# run on node-mltrain
docker compose -f Data_eye/Docker/docker-compose-mlflow.yaml up -d
```


When it is finished, the output of

``` bash
# run on node-mltrain
docker ps
```

should show that the `minio`, `postgres`, and `mlflow` containers are running.
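Rather than reading the `docker ps` table by eye, the check can be scripted. The following is a minimal sketch, not part of the original setup: the helper name `missing_containers` is hypothetical, and it assumes the running container names contain `minio`, `postgres`, and `mlflow` (Docker Compose may add a project prefix, so names are matched by substring).

``` bash
# Hypothetical helper: reads container names on stdin (one per line) and
# prints each expected name that does not appear in any of them.
missing_containers() {
    local names expected
    names=$(cat)
    for expected in "$@"; do
        # Substring match, so a prefixed name such as "docker-minio-1" still counts.
        if ! printf '%s\n' "$names" | grep -q -- "$expected"; then
            printf '%s\n' "$expected"
        fi
    done
}

# Demonstration with sample input; on node-mltrain, replace the printf with
#   docker ps --format '{{.Names}}'
printf 'minio\nmlflow\n' | missing_containers minio postgres mlflow
# prints: postgres
```

Empty output means all three expected containers were found.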

### Access dashboards for the MLFlow tracking server system

The MinIO dashboard runs on port 9001. In a browser, open

    http://129.114.108.94:9001

Log in with the credentials we specified in the Docker Compose YAML as `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD`:

-   Username: `Project24-MINIO-Id`
-   Password: `Project-24-MINIO-Id-secret-key`


Next, let’s look at the MLFlow UI. This runs on port 8000. In a browser, open

    http://129.114.108.94:8000

For reference, the corresponding S3 credentials in the Docker Compose YAML are:

-   `AWS_ACCESS_KEY_ID`: `Project24-MLFLOW-Id`
-   `AWS_SECRET_ACCESS_KEY`: `Project-24-MINIO-Id-secret-key`

### Start a Jupyter server

Finally, we’ll start the Jupyter server container, inside which we will run experiments that are tracked in MLFlow. Make sure your container image build, from the previous section, is now finished - you should see a “jupyter-mlflow” image in the output of:

``` bash
# run on node-mltrain
docker image list
```

The command to run depends on the type of GPU node you are using.

If you are using an AMD GPU (node type `gpu_mi100`), run

``` bash
# run on node-mltrain IF it is a gpu_mi100
HOST_IP=$(curl --silent http://169.254.169.254/latest/meta-data/public-ipv4 )
docker run  -d --rm  -p 8888:8888 \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --group-add $(getent group | grep render | cut -d':' -f 3) \
    --shm-size 16G \
    -v ~/workspace/workspace_mlflow:/home/jovyan/work/ \
    -v EYE:/mnt/ \
    -e MLFLOW_TRACKING_URI=http://${HOST_IP}:8000/ \
    -e EYE_DATA_DIR=/mnt/eye_dataset \
    --name jupyter \
    jupyter-mlflow
```

Note that we first store `HOST_IP`, the floating IP assigned to your instance, in a variable; then we use it to specify the `MLFLOW_TRACKING_URI` inside the container. Training jobs inside the container will access the MLFlow tracking server using its public IP address.

Here,

-   `-d` says to start the container and detach, leaving it running in the background
-   `--rm` says that after we stop the container, it should be removed immediately, instead of leaving it around for potential debugging
-   `-p 8888:8888` says to publish the container’s port `8888` (the second `8888` in the argument) to the host port `8888` (the first `8888` in the argument)
-   `--device=/dev/kfd --device=/dev/dri` pass the AMD GPUs to the container
-   `--group-add video --group-add $(getent group | grep render | cut -d':' -f 3)` makes sure that the user inside the container is a member of a group that has permission to use the GPU(s) - the `video` group and the `render` group. (The `video` group always has the same group ID, by convention, but [the `render` group does not](https://github.com/ROCm/ROCm-docker/issues/90), so we need to find out its group ID on the host and pass that to the container.)
-   `--shm-size 16G` increases the memory available for interprocess communication
-   the host directory `~/workspace/workspace_mlflow` is mounted inside the container as `/home/jovyan/work/`
-   the volume `EYE` is mounted inside the container as `/mnt/`
-   and we pass `MLFLOW_TRACKING_URI` and `EYE_DATA_DIR` as environment variables.
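To see what the `render` group lookup in the command produces, the same pipeline can be run on sample group entries. This is an illustrative sketch only: the helper name `render_gid` is hypothetical, and the entries and group IDs below are made up (real IDs differ from host to host).

``` bash
# Hypothetical helper: same pipeline as in the docker run command, but reading
# group entries (name:password:GID:members) from stdin instead of from getent.
render_gid() {
    grep render | cut -d':' -f 3
}

# Made-up sample entries for demonstration.
printf 'video:x:44:\nrender:x:109:\n' | render_gid
# prints: 109

# On node-mltrain, the real value comes from:
#   getent group | render_gid
```

Because `grep render` matches any entry that contains the string `render`, `getent group render | cut -d':' -f 3` is a stricter alternative that looks the group up by its exact name.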

If you are using an NVIDIA GPU (node type `compute_liqid`), run

``` bash
# run on node-mltrain IF it is a compute_liqid
HOST_IP=$(curl --silent http://169.254.169.254/latest/meta-data/public-ipv4 )
docker run  -d --rm  -p 8888:8888 \
    --gpus all \
    --shm-size 16G \
    -v ~/workspace/workspace_mlflow:/home/jovyan/work/ \
    -v EYE:/mnt/ \
    -e MLFLOW_TRACKING_URI=http://${HOST_IP}:8000/ \
    -e EYE_DATA_DIR=/mnt/eye_dataset \
    --name jupyter \
    jupyter-mlflow
```

Note that we first store `HOST_IP`, the floating IP assigned to your instance, in a variable; then we use it to specify the `MLFLOW_TRACKING_URI` inside the container. Training jobs inside the container will access the MLFlow tracking server using its public IP address.

-   `-d` says to start the container and detach, leaving it running in the background
-   `--rm` says that after we stop the container, it should be removed immediately, instead of leaving it around for potential debugging
-   `-p 8888:8888` says to publish the container’s port `8888` (the second `8888` in the argument) to the host port `8888` (the first `8888` in the argument)
-   `--gpus all` passes the NVIDIA GPUs to the container
-   `--shm-size 16G` increases the memory available for interprocess communication
-   the host directory `~/workspace/workspace_mlflow` is mounted inside the container as `/home/jovyan/work/`
-   the volume `EYE` is mounted inside the container as `/mnt/`
-   and we pass `MLFLOW_TRACKING_URI` and `EYE_DATA_DIR` as environment variables.
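Both variants of the command depend on `HOST_IP` being set: if the metadata request fails, the variable is empty and the container receives `MLFLOW_TRACKING_URI=http://:8000/`. Below is a minimal sketch of a guard that could be run before `docker run`. The helper name `is_ipv4` and the variable `SAMPLE_IP` are hypothetical, and the check only looks at the general dotted-quad shape rather than validating the range of each octet.

``` bash
# Hypothetical helper: succeeds if its argument looks like a dotted-quad IPv4 address.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Sample value for demonstration; on node-mltrain, pass "$HOST_IP" instead.
SAMPLE_IP="129.114.108.94"
if is_ipv4 "$SAMPLE_IP"; then
    echo "MLFLOW_TRACKING_URI=http://${SAMPLE_IP}:8000/"
else
    echo "address is empty or malformed; check the metadata request" >&2
fi
# prints: MLFLOW_TRACKING_URI=http://129.114.108.94:8000/
```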

Then, run

    docker logs jupyter

and look for a line like

    http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Paste this into a browser tab, but in place of `127.0.0.1`, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface.
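The substitution can also be done on the command line. The sketch below is one possible way to do it, not part of the original instructions: the helper name `jupyter_url` is hypothetical, the log line and token are made up, and it assumes the URL in the log has the `http://127.0.0.1:8888/lab?token=...` form shown above.

``` bash
# Hypothetical helper: reads Jupyter log text on stdin and prints the first
# tokenized lab URL, with 127.0.0.1 replaced by the address given as $1.
jupyter_url() {
    grep -o 'http://127\.0\.0\.1:8888/lab?token=[[:alnum:]]*' \
        | head -n 1 \
        | sed "s/127\.0\.0\.1/$1/"
}

# Demonstration with a made-up log line and token.
printf '[I ServerApp]     http://127.0.0.1:8888/lab?token=abc123\n' \
    | jupyter_url 129.114.108.94
# prints: http://129.114.108.94:8888/lab?token=abc123

# On node-mltrain (Jupyter typically logs to stderr, hence 2>&1):
#   docker logs jupyter 2>&1 | jupyter_url "$HOST_IP"
```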

In the file browser on the left side, open the `work` directory.

Open a terminal (“File \> New \> Terminal”) inside the Jupyter server environment, and in this terminal, run

``` bash
# runs on jupyter container inside node-mltrain
env
```

to see environment variables. Confirm that the `MLFLOW_TRACKING_URI` is set, with the correct floating IP address.
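Instead of scanning the full `env` output, the variable can be compared against the expected value directly. This is a minimal sketch under the assumption that the URI has the `http://<floating IP>:8000/` form used in the `docker run` command; the helper name `tracking_uri_ok` is hypothetical.

``` bash
# Hypothetical helper: succeeds if MLFLOW_TRACKING_URI is set to
# http://<ip>:8000/ for the floating IP passed as $1.
tracking_uri_ok() {
    [ "${MLFLOW_TRACKING_URI:-}" = "http://$1:8000/" ]
}

# Replace the sample address with the floating IP of your own instance.
if tracking_uri_ok 129.114.108.94; then
    echo "MLFLOW_TRACKING_URI is set as expected"
else
    echo "unexpected MLFLOW_TRACKING_URI: '${MLFLOW_TRACKING_URI:-}'"
fi
```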