## Launch and set up NVIDIA A100 40GB server - with python-chi

When the lease begins, we will bring up our GPU server. We will use `python-chi`, the Python API for Chameleon, to provision it.

> **Note**: if you don’t have access to the Chameleon Jupyter environment, or if you prefer to set up your NVIDIA A100 server “by hand”, skip to the next section, which provides alternative instructions.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected:

In [12]:
from chi import server, context, lease
import os

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@TACC")


Change the string in the following cell to reflect the name of *your* lease (**with your own net ID**), then run it to get your lease:

In [13]:
l = lease.get_lease("gpu-liquid-1-project33-07")
l.show()

Lease Details:
Name: gpu-liquid-1-project33-07
ID: 87cbff6b-2463-4674-a31e-56e16185e1dc
Status: ACTIVE
Start Date: 2025-05-11 18:00:00
End Date: 2025-05-11 23:55:00
User ID: 468b6156e3f5c46b440eb718ffcf4e04f88c39166f44727c350d35921f8923bd
Project ID: d3c6e101843a4ba79e665ebf59b521a2

Node Reservations:
ID: 391c2a69-537f-4d3e-a73c-c808ad24df62, Status: active, Min: 1, Max: 1

Floating IP Reservations:

Network Reservations:

Events:


The status should show as “ACTIVE” now that we are past the lease start time.
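
To make the timing concrete, here is a minimal sketch that computes how much of the lease window is left from the start and end dates printed above. The helper name `lease_time_remaining` is hypothetical (it is not part of `python-chi`), and the sketch assumes the printed dates and the `now` value are in the same time zone.

```python
from datetime import datetime, timedelta

# Format of the "Start Date" / "End Date" strings shown in the lease details
LEASE_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def lease_time_remaining(start: str, end: str, now: datetime) -> timedelta:
    """Return the time left in the lease, or zero if `now` is outside the window."""
    start_dt = datetime.strptime(start, LEASE_TIME_FORMAT)
    end_dt = datetime.strptime(end, LEASE_TIME_FORMAT)
    if now < start_dt or now >= end_dt:
        return timedelta(0)
    return end_dt - now


# Dates copied from the lease details above; `now` is an illustrative value
remaining = lease_time_remaining(
    "2025-05-11 18:00:00",
    "2025-05-11 23:55:00",
    now=datetime(2025, 5, 11, 18, 30),
)
print(remaining)  # 5:25:00
```

A result of zero means the lease has not started yet or has already ended, in which case provisioning the server will not succeed.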

The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting “Run” \> “Run Selected Cell and All Below” from the Jupyter menu.

As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!

We will use the lease to bring up a server with the `CC-Ubuntu24.04-CUDA` disk image.

> **Note**: the following cell brings up a server only if you don’t already have one with the same name, regardless of the state that existing server is in. If you already have a server in ERROR state, delete it in the Horizon GUI before you run this cell.

In [14]:
username = os.getenv('USER') # all experiment resources will include this username in their name
s = server.Server(
    f"node-mltrain-{username}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)

Waiting for server node-mltrain-nk3696_nyu_edu's status to become ACTIVE. This typically takes 10 minutes, but can take up to 20 minutes.


Server has moved to status ACTIVE


| Attribute | node-mltrain-nk3696_nyu_edu |
| --- | --- |
| Id | e1784bd8-4df3-4a2e-91a7-8307c5a4d1a4 |
| Status | ACTIVE |
| Image Name | CC-Ubuntu24.04-CUDA |
| Flavor Name | baremetal |
| Addresses | sharednet1:  IP: 10.52.2.115 (v4)  Type: fixed  MAC: 34:80:0d:ed:4d:dc |
| Network Name | sharednet1 |
| Created At | 2025-05-11T18:00:58Z |
| Keypair | trovi-ae48ff0 |
| Reservation Id | |
| Host Id | 9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3 |


Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

In [27]:
s.associate_floating_ip()

In [28]:
s.refresh()
s.check_connectivity()

Checking connectivity to 129.114.108.116 port 22.


Connection successful


In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).

In [17]:
s.refresh()
s.show(type="widget")

| Attribute | node-mltrain-nk3696_nyu_edu |
| --- | --- |
| Id | e1784bd8-4df3-4a2e-91a7-8307c5a4d1a4 |
| Status | ACTIVE |
| Image Name | CC-Ubuntu24.04-CUDA |
| Flavor Name | baremetal |
| Addresses | sharednet1:  IP: 10.52.2.115 (v4)  Type: fixed  MAC: 34:80:0d:ed:4d:dc  IP: 129.114.108.92 (v4)  Type: floating  MAC: 34:80:0d:ed:4d:dc |
| Network Name | sharednet1 |
| Created At | 2025-05-11T18:00:58Z |
| Keypair | trovi-ae48ff0 |
| Reservation Id | 391c2a69-537f-4d3e-a73c-c808ad24df62 |
| Host Id | 9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3 |
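
If you would rather not pick the floating IP out of the “Addresses” row by eye, the following minimal sketch extracts it from a string in the format shown above. The helper `find_floating_ips` and the regular expression are hypothetical (they are not part of `python-chi`), and they assume the row keeps the `IP: ... (v4)  Type: ...` layout seen in this output.

```python
import re

# Matches one "IP: <address> (v4|v6)  Type: <kind>" entry in the Addresses row
ADDRESS_PATTERN = re.compile(r"IP:\s*(\S+)\s*\(v[46]\)\s*Type:\s*(\w+)")


def find_floating_ips(addresses: str) -> list[str]:
    """Return every address whose type is 'floating', in order of appearance."""
    return [ip for ip, kind in ADDRESS_PATTERN.findall(addresses) if kind == "floating"]


# Addresses row copied from the output above
addresses = (
    "sharednet1:  IP: 10.52.2.115 (v4)  Type: fixed  MAC: 34:80:0d:ed:4d:dc  "
    "IP: 129.114.108.92 (v4)  Type: floating  MAC: 34:80:0d:ed:4d:dc"
)
print(find_floating_ips(addresses))  # ['129.114.108.92']
```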


## Retrieve code and notebooks on the instance

Now we can use `python-chi` to execute commands on the instance to set it up. We’ll start by cloning the code and other materials onto the instance.

In [19]:
s.execute("git clone https://github.com/shettynitis/LLM_LegalDocSummarization.git")

fatal: destination path 'LLM_LegalDocSummarization' already exists and is not an empty directory.


UnexpectedExit: Encountered a bad command exit code!

Command: 'git clone https://github.com/shettynitis/LLM_LegalDocSummarization.git'

Exit code: 128

Stdout: already printed

Stderr: already printed
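
The exit code 128 above appears because the repository had already been cloned in an earlier run, so the code is in place. The raised `UnexpectedExit` would normally stop a “Run Selected Cell and All Below” run at this cell, though. One way to make the step safe to repeat is sketched below, under the assumption that an existing `.git` directory means the clone is already complete. The helper `clone_if_missing` is hypothetical; it only builds the shell command string.

```python
import shlex


def clone_if_missing(repo_url: str, dest: str) -> str:
    """Build a shell command that clones repo_url into dest only if dest/.git does not exist."""
    q_url, q_dest = shlex.quote(repo_url), shlex.quote(dest)
    # `test -d` succeeds when the directory exists, so `git clone` is skipped in that case
    return f"test -d {q_dest}/.git || git clone {q_url} {q_dest}"


cmd = clone_if_missing(
    "https://github.com/shettynitis/LLM_LegalDocSummarization.git",
    "LLM_LegalDocSummarization",
)
print(cmd)
```

The resulting string can be passed to `s.execute(cmd)` in place of the plain `git clone` command.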



## Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLflow and Ray, we can run containers that already include the libraries these frameworks require. Here, we will set up Docker as the container runtime.

In [20]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

# Executing docker install script, commit: 53a22f61c0628e58e1d6680b49e82993d304b449



If you already have Docker installed, this script can cause trouble, which is
why we're displaying this warning and provide the opportunity to cancel the
installation.

If you installed the current Docker package using this script and are using it
again to update Docker, you can ignore this message, but be aware that the
script resets any custom changes in the deb and rpm repo configuration
files to match the parameters passed to the script.

You may press Ctrl+C now to abort this script.
+ sleep 20
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" -o /etc/apt/keyrings/docker.asc
+ sh -c chmod a+r /etc/apt/keyrings/docker.asc
+ sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu noble stable" > /etc/apt/sources.list.d/docker.list
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get 

Client: Docker Engine - Community
 Version:           28.1.1
 API version:       1.49
 Go version:        go1.23.8
 Git commit:        4eba377
 Built:             Fri Apr 18 09:52:14 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.1.1
  API version:      1.49 (minimum version 1.24)
  Go version:       go1.23.8
  Git commit:       01f442b
  Built:            Fri Apr 18 09:52:14 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.


<Result cmd='sudo groupadd -f docker; sudo usermod -aG docker $USER' exited=0>
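
As the log above warns, re-running the install script on a machine that already has Docker can cause trouble, and it also adds a 20-second pause. A minimal sketch of a guard is shown below: it builds a shell command that runs the installer only when the `docker` binary is not found. The helper `run_unless_present` is hypothetical, and checking with `command -v` is an assumption about what counts as “already installed”.

```python
import shlex


def run_unless_present(binary: str, install_cmd: str) -> str:
    """Build a shell command that runs install_cmd only if `binary` is not on the PATH."""
    # Braces group the installer pipeline so the whole pipeline is skipped together
    return f"command -v {shlex.quote(binary)} >/dev/null 2>&1 || {{ {install_cmd}; }}"


docker_cmd = run_unless_present("docker", "curl -sSL https://get.docker.com/ | sudo sh")
print(docker_cmd)
```

The string can be passed to `s.execute(docker_cmd)`; the `groupadd` and `usermod` command is already safe to repeat and can stay as it is.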

## Set up the NVIDIA container toolkit

We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.

In [21]:
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
# for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
s.execute("sudo jq 'if has(\"exec-opts\") then . else . + {\"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]} end' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp > /dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json")
s.execute("sudo systemctl restart docker")

gpg: cannot open '/dev/tty': No such device or address


UnexpectedExit: Encountered a bad command exit code!

Command: "curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list"

Exit code: 2

Stdout: already printed

Stderr: already printed
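
The `gpg: cannot open '/dev/tty'` error above most likely means that the keyring file already existed from an earlier run: `gpg` then tries to ask for confirmation before overwriting it, and no terminal is attached to the remote session. In that case the toolkit repository is usually already configured from the earlier run. A minimal sketch of a non-interactive variant of the key import step follows. It uses the `--batch` and `--yes` options of `gpg` to overwrite without prompting; the helper `import_key_cmd` is hypothetical and only builds the command string.

```python
# URL and keyring path copied from the cell above
KEY_URL = "https://nvidia.github.io/libnvidia-container/gpgkey"
KEYRING = "/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg"


def import_key_cmd(key_url: str = KEY_URL, keyring: str = KEYRING) -> str:
    """Build a key import command that overwrites an existing keyring without prompting."""
    # --batch disables interactive prompts; --yes answers "yes" to the overwrite question
    return f"curl -fsSL {key_url} | sudo gpg --batch --yes --dearmor -o {keyring}"


print(import_key_cmd())
```

Substituting this for the first half of the first `s.execute(...)` command (before the `&&`) should let the cell be re-run without the prompt.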



We can also install `nvtop` to monitor GPU usage:

In [22]:
s.execute("sudo apt update")
s.execute("sudo apt -y install nvtop")





Hit:1 https://download.docker.com/linux/ubuntu noble InRelease
Get:2 https://nvidia.github.io/libnvidia-container/stable/deb/amd64  InRelease [1477 B]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Hit:5 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Fetched 510 kB in 1s (903 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
4 packages can be upgraded. Run 'apt list --upgradable' to see them.






Reading package lists...
Building dependency tree...
Reading state information...
nvtop is already the newest version (3.0.2-1).
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.


<Result cmd='sudo apt -y install nvtop' exited=0>

### Build a container image - for MLflow section

Finally, we will build a container image to work in during the MLflow section. It includes:

-   a Jupyter notebook server
-   PyTorch and PyTorch Lightning
-   CUDA, which allows deep learning frameworks like PyTorch to use the NVIDIA GPU accelerator
-   MLflow

You can see our Dockerfile for this image at: [Dockerfile.jupyter-torch-mlflow-cuda](https://github.com/teaching-on-testbeds/mltrain-chi/tree/main/docker/Dockerfile.jupyter-torch-mlflow-cuda)

Building this container may take a bit of time, but that’s OK: we can get it started and then continue to the next section while it builds in the background, since we don’t need this container immediately.

In [23]:
s.execute("docker build -t jupyter-mlflow -f LLM_LegalDocSummarization/docker/Dockerfile.jupyter-torch-mlflow-cuda .")


#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile.jupyter-torch-mlflow-cuda
#1 transferring dockerfile: 854B done
#1 DONE 0.0s

#2 [internal] load metadata for quay.io/jupyter/pytorch-notebook:cuda12-latest
#2 DONE 0.2s

#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s

#4 [1/4] FROM quay.io/jupyter/pytorch-notebook:cuda12-latest@sha256:5cc07bab3f9391418e9be3bb822d18a83b840fd3433aae0675e431981d32529e
#4 DONE 0.0s

#5 [2/4] RUN pip install --pre --no-cache-dir lightning &&     fix-permissions "/opt/conda" &&     fix-permissions "/home/jovyan"
#5 CACHED

#6 [3/4] RUN pip install --pre --no-cache-dir pynvml &&     pip install --pre --no-cache-dir mlflow &&     fix-permissions "/opt/conda" &&     fix-permissions "/home/jovyan"
#6 CACHED

#7 [4/4] RUN pip install --no-cache-dir         transformers==4.40.0         datasets==2.19.0         trl==0.8.6         peft==0.10.0         bitsandbytes==0.45.5     

<Result cmd='docker build -t jupyter-mlflow -f LLM_LegalDocSummarization/docker/Dockerfile.jupyter-torch-mlflow-cuda .' exited=0>

Leave that cell running, and in the meantime, open an SSH session on your server. From your local terminal, run:

    ssh -i ~/.ssh/id_project_group33 cc@A.B.C.D

where

-   in place of `~/.ssh/id_project_group33`, substitute the path to your own key that you uploaded to CHI@TACC
-   in place of `A.B.C.D`, use the floating IP address you just associated to your instance.