## Launch and set up AMD MI100 server - with python-chi

When our lease begins, we will bring up our GPU server. We will use `python-chi`, the Python API for Chameleon, to provision it.

> **Note**: if you don’t have access to the Chameleon Jupyter environment, or if you prefer to set up your AMD MI100 server by hand, skip to the next section, which provides alternative instructions.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected:

In [21]:
from chi import server, context, lease
import os, time

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@TACC")


Change the string in the following cell to reflect the name of *your* lease (**with your own net ID**), then run it to get your lease:

In [22]:
l = lease.get_lease("mltrain_jz7074_3")  # replace with the name of your own lease
l.show()


Lease Details:
Name: mltrain_jz7074_3
ID: 9cb086fd-7aee-4600-b0b7-1bcb201eee71
Status: ACTIVE
Start Date: 2025-05-10 14:05:00
End Date: 2025-05-10 20:00:00
User ID: f9e2164ccc536a5a989dd58b774e54a7d1dbee6d889761429ced412938961faf
Project ID: d3c6e101843a4ba79e665ebf59b521a2

Node Reservations:
ID: 280f351c-d7f7-4006-b3e5-67cc897b50b0, Status: active, Min: 1, Max: 1

Floating IP Reservations:

Network Reservations:

Events:


The status should show as “ACTIVE” now that we are past the lease start time.
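If the status is not “ACTIVE”, it can help to compare the current time against the lease window. The following is a minimal sketch of that comparison, using the date format shown in the lease details above. The function name `within_lease_window` is hypothetical (it is not part of `python-chi`), and the sketch assumes that all three timestamps are expressed in the same time zone.

```python
from datetime import datetime

# Matches the "Start Date" / "End Date" lines in the lease details above.
LEASE_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def within_lease_window(start: str, end: str, now: datetime) -> bool:
    """Return True if `now` falls inside the half-open interval [start, end).

    Assumption: `start`, `end`, and `now` are all in the same time zone.
    """
    start_dt = datetime.strptime(start, LEASE_TIME_FORMAT)
    end_dt = datetime.strptime(end, LEASE_TIME_FORMAT)
    return start_dt <= now < end_dt


# Example using the dates from the lease shown above and an illustrative "now".
example_now = datetime(2025, 5, 10, 14, 30, 0)
print(within_lease_window("2025-05-10 14:05:00", "2025-05-10 20:00:00", example_now))
```

The lease status reported by Chameleon remains the authoritative signal; this check only explains why a lease might still be pending or already over.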

The rest of this notebook can be executed without any interaction from you. At this point, you can save time by clicking on this cell, then selecting “Run” \> “Run Selected Cell and All Below” from the Jupyter menu.

As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!

We will use the lease to bring up a server with the `CC-Ubuntu24.04-hwe` disk image. (The default Ubuntu 24.04 kernel is not compatible with the AMD GPU on these nodes.)

> **Note**: the following cell brings up a server only if you don’t already have one with the same name, regardless of the state that server is in. If you already have a server in ERROR state, delete it in the Horizon GUI before you run this cell.

In [24]:
username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
    f"node-mltrain-{username}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-hwe"
)
s.submit(idempotent=True)

Waiting for server node-mltrain-jz7074_nyu_edu's status to become ACTIVE. This typically takes 10 minutes, but can take up to 20 minutes.



Server has moved to status ACTIVE


| Attribute | node-mltrain-jz7074_nyu_edu |
|---|---|
| Id | eb66cace-a6c1-486a-8cd4-d1e666d3a1ec |
| Status | ACTIVE |
| Image Name | CC-Ubuntu24.04-hwe |
| Flavor Name | baremetal |
| Addresses | sharednet1:  IP: 10.52.3.78 (v4)  Type: fixed  MAC: 34:80:0d:de:52:98 |
| Network Name | sharednet1 |
| Created At | 2025-05-10T18:14:08Z |
| Keypair | trovi-84b35f9 |
| Reservation Id | 280f351c-d7f7-4006-b3e5-67cc897b50b0 |
| Host Id | 9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3 |


Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

In [25]:
s.associate_floating_ip()

In [26]:
s.refresh()
s.check_connectivity()

Checking connectivity to 129.114.108.40 port 22.



Connection successful
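The message above indicates that the check targets port 22, the SSH port. As a rough illustration of what such a probe can look like (a sketch under our own assumptions, not the `python-chi` implementation), the hypothetical helper `port_is_open` below attempts a TCP connection and reports whether it succeeded. The demo uses a throwaway listener on localhost so that it runs without a real server.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Demo against a temporary local listener, so no real instance is needed.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", demo_port))
listener.close()
```

Against a real instance, the call would be `port_is_open("A.B.C.D")`, with your own floating IP in place of `A.B.C.D`.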


In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).

In [27]:
s.refresh()
s.show(type="widget")

| Attribute | node-mltrain-jz7074_nyu_edu |
|---|---|
| Id | eb66cace-a6c1-486a-8cd4-d1e666d3a1ec |
| Status | ACTIVE |
| Image Name | CC-Ubuntu24.04-hwe |
| Flavor Name | baremetal |
| Addresses | sharednet1:  IP: 10.52.3.78 (v4)  Type: fixed  MAC: 34:80:0d:de:52:98  IP: 129.114.108.40 (v4)  Type: floating  MAC: 34:80:0d:de:52:98 |
| Network Name | sharednet1 |
| Created At | 2025-05-10T18:14:08Z |
| Keypair | trovi-84b35f9 |
| Reservation Id | 280f351c-d7f7-4006-b3e5-67cc897b50b0 |
| Host Id | 9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3 |
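The “Addresses” row lists both the fixed and the floating address, so it is easy to copy the wrong one. As a small sketch, the hypothetical helper `floating_ips` below pulls out only the addresses marked `Type: floating`. It assumes the row keeps the `IP: ... (v4)  Type: ...` layout shown in the output above; that layout is an observation from this output, not a documented format.

```python
import re

# One match per "IP: <address> (v4)  Type: <kind>" group in the Addresses row.
ADDRESS_PATTERN = re.compile(r"IP:\s*(?P<ip>\S+)\s+\(v\d\)\s+Type:\s*(?P<kind>\w+)")


def floating_ips(addresses: str) -> list[str]:
    """Return the addresses whose type is 'floating', in order of appearance."""
    return [
        match.group("ip")
        for match in ADDRESS_PATTERN.finditer(addresses)
        if match.group("kind") == "floating"
    ]


# The Addresses row from the output above.
addresses_row = (
    "sharednet1:  IP: 10.52.3.78 (v4)  Type: fixed  MAC: 34:80:0d:de:52:98  "
    "IP: 129.114.108.40 (v4)  Type: floating  MAC: 34:80:0d:de:52:98"
)
print(floating_ips(addresses_row))  # ['129.114.108.40']
```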


## Retrieve code and notebooks on the instance

Now we can use `python-chi` to execute commands on the instance and set it up. We’ll start by cloning the repository that contains the code and other materials onto the instance.

In [28]:
s.execute("git clone --recurse-submodules https://github.com/teaching-on-testbeds/mltrain-chi")

Cloning into 'mltrain-chi'...


<Result cmd='git clone --recurse-submodules https://github.com/teaching-on-testbeds/mltrain-chi' exited=0>

## Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLFlow and Ray, we can run containers that have all the prerequisite libraries necessary for these frameworks. Here, we will install Docker as our container engine.

In [29]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

# Executing docker install script, commit: 53a22f61c0628e58e1d6680b49e82993d304b449


+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" -o /etc/apt/keyrings/docker.asc
+ sh -c chmod a+r /etc/apt/keyrings/docker.asc
+ sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu noble stable" > /etc/apt/sources.list.d/docker.list
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-ce-rootless-extras docker-buildx-plugin >/dev/null

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
+ sh -c doc

Client: Docker Engine - Community
 Version:           28.1.1
 API version:       1.49
 Go version:        go1.23.8
 Git commit:        4eba377
 Built:             Fri Apr 18 09:52:14 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.1.1
  API version:      1.49 (minimum version 1.24)
  Go version:       go1.23.8
  Git commit:       01f442b
  Built:            Fri Apr 18 09:52:14 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.



<Result cmd='sudo groupadd -f docker; sudo usermod -aG docker $USER' exited=0>

## Set up the AMD GPU

Before we can use the AMD GPUs, we need to set up the driver using the `amdgpu-install` utility.

Let’s follow [AMD’s instructions for setting up `amdgpu-install`](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-methods/amdgpu-installer/amdgpu-installer-ubuntu.html#installation):

In [30]:
s.execute("sudo apt update; wget https://repo.radeon.com/amdgpu-install/6.3.3/ubuntu/noble/amdgpu-install_6.3.60303-1_all.deb")
s.execute("sudo apt -y install ./amdgpu-install_6.3.60303-1_all.deb; sudo apt update")





Hit:1 https://download.docker.com/linux/ubuntu noble InRelease
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Hit:3 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [1066 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/main amd64 Components [161 kB]
Get:8 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [1062 kB]
Get:9 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/universe amd64 Components [376 kB]
Get:10 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Components [212 B]
Get:11 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Components [940 B]
Get:12 http://nova.clouds.archiv

--2025-05-10 18:30:01--  https://repo.radeon.com/amdgpu-install/6.3.3/ubuntu/noble/amdgpu-install_6.3.60303-1_all.deb
Resolving repo.radeon.com (repo.radeon.com)... 23.221.22.214, 23.221.22.215, 2600:1404:6400:25::17de:f154, ...
Connecting to repo.radeon.com (repo.radeon.com)|23.221.22.214|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16984 (17K) [application/octet-stream]
Saving to: ‘amdgpu-install_6.3.60303-1_all.deb’

     0K .......... ......                                     100% 40.3M=0s

2025-05-10 18:30:01 (40.3 MB/s) - ‘amdgpu-install_6.3.60303-1_all.deb’ saved [16984/16984]





Reading package lists...
Building dependency tree...
Reading state information...
Recommended packages:
  dialog
The following NEW packages will be installed:
  amdgpu-install
0 upgraded, 1 newly installed, 0 to remove and 145 not upgraded.
Need to get 0 B/17.0 kB of archives.
After this operation, 74.8 kB of additional disk space will be used.
Get:1 /home/cc/amdgpu-install_6.3.60303-1_all.deb amdgpu-install all 6.3.60303-2119913.24.04 [17.0 kB]


debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Selecting previously unselected package amdgpu-install.
(Reading database ... 93276 files and directories currently installed.)
Preparing to unpack .../amdgpu-install_6.3.60303-1_all.deb ...
Unpacking amdgpu-install (6.3.60303-2119913.24.04) ...
Setting up amdgpu-install (6.3.60303-2119913.24.04) ...


debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.




Hit:1 https://download.docker.com/linux/ubuntu noble InRelease
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Get:3 https://repo.radeon.com/amdgpu/6.3.3/ubuntu noble InRelease [5435 B]
Get:4 https://repo.radeon.com/rocm/apt/6.3.3 noble InRelease [2605 B]
Get:5 https://repo.radeon.com/amdgpu/6.3.3/ubuntu noble/main amd64 Packages [14.1 kB]
Hit:6 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:7 https://repo.radeon.com/amdgpu/6.3.3/ubuntu noble/main i386 Packages [12.2 kB]
Get:8 https://repo.radeon.com/rocm/apt/6.3.3 noble/main amd64 Packages [60.0 kB]
Hit:9 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:10 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease
Fetched 350 kB in 1s (634 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
145 packages can be upgraded. Run 'apt list --upgradable' to see them.


<Result cmd='sudo apt -y install ./amdgpu-install_6.3.60303-1_all.deb; sudo apt update' exited=0>

ROCm (Radeon Open Compute Platform) is an open-source software stack from AMD that allows users to program AMD GPUs, similar to NVIDIA’s CUDA. To [run containers using ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html), we need to install the `amdgpu-dkms` driver on the host:

In [31]:
s.execute("amdgpu-install -y --usecase=dkms")

Hit:1 https://download.docker.com/linux/ubuntu noble InRelease
Hit:2 https://repo.radeon.com/amdgpu/6.3.3/ubuntu noble InRelease
Hit:3 https://repo.radeon.com/rocm/apt/6.3.3 noble InRelease
Hit:4 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Fetched 508 kB in 1s (453 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
linux-headers-6.11.0-17-generic is already the newest version (6.11.0-17.17~24.04.2).
linux-headers-6.11.0-17-generic set to manually installed.
The following additional packages will be installed:
  amdgpu-dkms-firmware autoconf automake autotools-dev m4
Suggested packages:
  autoconf-archive gnu-standards autoconf-doc libtool gettext m4-doc
The following N

debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Fetched 27.9 MB in 1s (24.6 MB/s)
Selecting previously unselected package m4.
(Reading database ... 93294 files and directories currently installed.)
Preparing to unpack .../0-m4_1.4.19-4build1_amd64.deb ...
Unpacking m4 (1.4.19-4build1) ...
Selecting previously unselected package autoconf.
Preparing to unpack .../1-autoconf_2.71-3_all.deb ...
Unpacking autoconf (2.71-3) ...
Selecting previously unselected package autotools-dev.
Preparing to unpack .../2-autotools-dev_20220109.1_all.deb ...
Unpacking autotools-dev (20220109.1) ...
Selecting previously unselected package automake.
Preparing to unpack .../3-automake_1%3a1.16.5-1.3ubuntu1_all.deb ...
Unpacking automake (1:1.16.5-1.3ubuntu1) ...
Selecting previously unselected package amdgpu-dkms-firmware.
Preparing to unpack .../4-amdgpu-dkms-firmware_1%3a6.10.5.60303-2119913.24.04_all.deb ...
Unpacking amdgpu-dkms-firmware (1:6.10.5.60303-2119913.24.04) ...
Selecting previously unselected package amdgpu-dkms.
Preparing to unpack .../5-am

debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.


<Result cmd='amdgpu-install -y --usecase=dkms' exited=0>

We’ll also install the `rocm-smi` utility, so that we can monitor the GPU from the host:

In [32]:
s.execute("sudo apt -y install rocm-smi")





Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  librocm-smi64-1
The following NEW packages will be installed:
  librocm-smi64-1 rocm-smi
0 upgraded, 2 newly installed, 0 to remove and 145 not upgraded.
Need to get 362 kB of archives.
After this operation, 1744 kB of additional disk space will be used.
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu noble/universe amd64 librocm-smi64-1 amd64 5.7.0-1 [309 kB]
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu noble/universe amd64 rocm-smi amd64 5.7.0-1 [52.9 kB]


debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Fetched 362 kB in 0s (770 kB/s)
Selecting previously unselected package librocm-smi64-1.
(Reading database ... 97702 files and directories currently installed.)
Preparing to unpack .../librocm-smi64-1_5.7.0-1_amd64.deb ...
Unpacking librocm-smi64-1 (5.7.0-1) ...
Selecting previously unselected package rocm-smi.
Preparing to unpack .../rocm-smi_5.7.0-1_amd64.deb ...
Unpacking rocm-smi (5.7.0-1) ...
Setting up librocm-smi64-1 (5.7.0-1) ...
Setting up rocm-smi (5.7.0-1) ...
Processing triggers for man-db (2.12.0-4build2) ...
Processing triggers for libc-bin (2.39-0ubuntu8.4) ...


debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.


<Result cmd='sudo apt -y install rocm-smi' exited=0>

Finally, we will add the `cc` user to the `video` and `render` groups, which are needed for access to the GPU:

In [33]:
s.execute("sudo usermod -aG video,render $USER")

<Result cmd='sudo usermod -aG video,render $USER' exited=0>
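Group changes generally take effect only in new login sessions, so it can be useful to confirm the membership afterwards. The sketch below takes the text printed by `id -nG` (which lists the current user’s group names) and reports which of the required groups are missing. The helper name `missing_groups` and the sample string are hypothetical, chosen only for illustration.

```python
# Groups that the text above says are needed for GPU access.
REQUIRED_GPU_GROUPS = frozenset({"video", "render"})


def missing_groups(id_output: str, required=REQUIRED_GPU_GROUPS) -> set:
    """Given the output of `id -nG`, return the required groups that are absent."""
    current = set(id_output.split())
    return set(required) - current


# Hypothetical `id -nG` output, for illustration only.
sample_output = "cc adm sudo docker video render"
print(missing_groups(sample_output))
```

An empty set means that both groups are present. On the instance, the input could come from running `id -nG` over SSH after the reboot.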

That’s all we will need on the host - the rest of ROCm will be installed inside the containers.

To load the newly installed kernel driver, we need to reboot and then wait for the server to come back up.

In [34]:
s.execute("sudo reboot")
time.sleep(30)

In [35]:
s.refresh()
s.check_connectivity()

Checking connectivity to 129.114.108.40 port 22.



Connection successful
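The reboot step follows a common pattern: pause briefly, then poll until the server answers again. A generic version of that polling loop might look like the sketch below. The helper `wait_until` and its parameters are hypothetical and are not part of `python-chi`; the `sleep` and `clock` arguments exist only so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with a stand-in check that succeeds on the third attempt.
attempts = {"count": 0}


def fake_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(fake_check, timeout=60, interval=0, sleep=lambda _: None))
print(attempts["count"])
```

In practice, `check` would be a function that probes the instance, for example by attempting a connection to port 22.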


Run

In [36]:
s.execute("rocm-smi")



GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    29.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
1    27.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    


<Result cmd='rocm-smi' exited=0>

and verify that you can see the GPU(s).
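To make this check less manual, we could count the GPU rows in the `rocm-smi` table. The sketch below treats every line whose first column is an integer as one GPU. This is an assumption based on the output shown above; the table layout may differ between `rocm-smi` versions, and the helper name `gpu_rows` is hypothetical.

```python
def gpu_rows(rocm_smi_output: str) -> list[list[str]]:
    """Return the whitespace-split fields of each line that starts with a GPU index."""
    rows = []
    for line in rocm_smi_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            rows.append(fields)
    return rows


# The table from the output above.
SAMPLE_OUTPUT = """
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    29.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
1    27.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
"""

rows = gpu_rows(SAMPLE_OUTPUT)
print(f"{len(rows)} GPU(s) found")
for fields in rows:
    # In this layout, fields[0] is the GPU index and fields[1] is the temperature.
    print(f"GPU {fields[0]}: {fields[1]}")
```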

We can also install `nvtop` to monitor GPU usage. We’ll build it from source, because the older version in the Ubuntu package repositories does not support AMD GPUs:

In [37]:
s.execute("sudo apt -y install cmake libncurses-dev libsystemd-dev libudev-dev libdrm-dev libgtest-dev")
s.execute("git clone https://github.com/Syllo/nvtop")
s.execute("mkdir -p nvtop/build && cd nvtop/build && cmake .. -DAMDGPU_SUPPORT=ON && sudo make install")





Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  cmake-data googletest libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2
  libdrm-radeon1 libjsoncpp25 libnss-systemd libpam-systemd libpciaccess-dev
  libpciaccess0 librhash0 libsystemd-shared libsystemd0 libudev1 systemd
  systemd-dev systemd-resolved systemd-sysv udev
Suggested packages:
  cmake-doc cmake-format elpa-cmake-mode ninja-build ncurses-doc
  systemd-container systemd-homed systemd-userdbd systemd-boot libqrencode4
  libtss2-rc0
The following NEW packages will be installed:
  cmake cmake-data googletest libdrm-amdgpu1 libdrm-dev libdrm-intel1
  libdrm-nouveau2 libdrm-radeon1 libgtest-dev libjsoncpp25 libncurses-dev
  libpciaccess-dev libpciaccess0 librhash0 libsystemd-dev libudev-dev
The following packages will be upgraded:
  libnss-systemd libpam-systemd libsystemd-shared libsystemd0 libudev1 systemd
  systemd-dev systemd-resolved systemd-sys

debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Fetched 25.3 MB in 1s (22.5 MB/s)
(Reading database ... 97717 files and directories currently installed.)
Preparing to unpack .../systemd-dev_255.4-1ubuntu8.6_all.deb ...
Unpacking systemd-dev (255.4-1ubuntu8.6) over (255.4-1ubuntu8.5) ...
Preparing to unpack .../systemd-resolved_255.4-1ubuntu8.6_amd64.deb ...
Unpacking systemd-resolved (255.4-1ubuntu8.6) over (255.4-1ubuntu8.5) ...
Preparing to unpack .../libsystemd-shared_255.4-1ubuntu8.6_amd64.deb ...
Unpacking libsystemd-shared:amd64 (255.4-1ubuntu8.6) over (255.4-1ubuntu8.5) ...
Preparing to unpack .../libsystemd0_255.4-1ubuntu8.6_amd64.deb ...
Unpacking libsystemd0:amd64 (255.4-1ubuntu8.6) over (255.4-1ubuntu8.5) ...
Setting up libsystemd0:amd64 (255.4-1ubuntu8.6) ...
(Reading database ... 97717 files and directories currently installed.)
Preparing to unpack .../0-systemd-sysv_255.4-1ubuntu8.6_amd64.deb ...
Unpacking systemd-sysv (255.4-1ubuntu8.6) over (255.4-1ubuntu8.5) ...
Preparing to unpack .../1-libnss-systemd_255.4-1ubuntu

debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

Restarting services...
 systemctl restart firewalld.service multipathd.service polkit.service rpcbind.service rsyslog.service ssh.service systemd-hostnamed.service systemd-timedated.service udisks2.service

Service restarts being deferred:
 systemctl restart ModemManager.service
 /etc/needrestart/restart.d/dbus.service
 systemctl restart docker.service
 systemctl restart networkd-dispatcher.service
 systemctl restart systemd-logind.service
 systemctl restart unattended-upgrades.service

No containers need to be restarted.

User sessions running out

-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting build type to 'Release' as none was specified.
-- Looking for cbreak in /usr/lib/x86_64-linux-gnu/libncursesw.so
-- Looking for cbreak in /usr/lib/x86_64-linux-gnu/libncursesw.so - found
-- Found Curses: /usr/lib/x86_64-linux-gnu/libncursesw.so  
-- Performing Test HAS_REALLOCARRAY
-- Performing Test HAS_REALLOCARRAY - Success
-- Found UDev: /usr/lib/x86_64-linux-gnu/libudev.so (found version "") 
-- Libudev stable: FALSE
-- Found Systemd: /usr/lib/x86_64-linux-gnu/libsystemd.so 

<Result cmd='mkdir -p nvtop/build && cd nvtop/build && cmake .. -DAMDGPU_SUPPORT=ON && sudo make install' exited=0>

### Build a container image - for MLFlow section

Finally, we will build the container image that we will work in during the MLFlow section. It includes:

-   a Jupyter notebook server
-   PyTorch and PyTorch Lightning
-   ROCm, which allows deep learning frameworks like PyTorch to use the AMD GPU accelerator
-   MLFlow

You can see our Dockerfile for this image at: [Dockerfile.jupyter-torch-mlflow-rocm](https://github.com/teaching-on-testbeds/mltrain-chi/tree/main/docker/Dockerfile.jupyter-torch-mlflow-rocm)

Building this container will take a **very long** time (ROCm is huge). But that’s OK: we can get it started and then continue to the next section while it builds in the background, since we don’t need this container immediately. We just need it to finish by the “Start a Jupyter server” subsection of the “Start the tracking server” section.

In [38]:
s.execute("docker build -t jupyter-mlflow -f mltrain-chi/docker/Dockerfile.jupyter-torch-mlflow-rocm .")

#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile.jupyter-torch-mlflow-rocm
#1 transferring dockerfile: 1.16kB done
#1 DONE 0.0s

#2 [internal] load metadata for quay.io/jupyter/scipy-notebook:latest
#2 DONE 0.9s

#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s

#4 [1/5] FROM quay.io/jupyter/scipy-notebook:latest@sha256:38b74c0b58d1e004bb979f5a221f5730578f9aca7f6878c1689f4a193c4793cc
#4 resolve quay.io/jupyter/scipy-notebook:latest@sha256:38b74c0b58d1e004bb979f5a221f5730578f9aca7f6878c1689f4a193c4793cc done
#4 sha256:96c932f29ab238a89357a1ed3185a558d6195bed23a42e3f7c8eec419dfec130 0B / 11.43MB 0.1s
#4 sha256:38b74c0b58d1e004bb979f5a221f5730578f9aca7f6878c1689f4a193c4793cc 743B / 743B done
#4 sha256:33fe43ff4ffc27673f783d85a90aa649ec18870589bbce3ded03d5ca65e62351 18.52kB / 18.52kB done
#4 sha256:495e9b1f57cf2c6d6d8e97a2b579177065820753648fd7c66c9800bebd1d617f 0B / 687B 0.1s
#4 sha256:ac0c285abb482d

<Result cmd='docker build -t jupyter-mlflow -f mltrain-chi/docker/Dockerfile.jupyter-torch-mlflow-rocm .' exited=0>

Leave that cell running, and in the meantime, open an SSH session on your server. From your local terminal, run

    ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where

-   in place of `~/.ssh/id_rsa_chameleon`, substitute the path to your own key that you uploaded to CHI@TACC
-   in place of `A.B.C.D`, use the floating IP address you just associated with your instance.
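If you would rather generate the command from the notebook than edit it by hand, a small helper can assemble it from the two substitutions described above. This is only a sketch: the function name `ssh_command` is hypothetical, and the defaults simply mirror the example command (user `cc`, key at `~/.ssh/id_rsa_chameleon`).

```python
import shlex
from pathlib import Path


def ssh_command(
    floating_ip: str,
    key_path: str = "~/.ssh/id_rsa_chameleon",
    user: str = "cc",
) -> str:
    """Build the SSH command line for the instance, quoting arguments where needed."""
    key = str(Path(key_path).expanduser())  # expand "~" to the home directory
    return shlex.join(["ssh", "-i", key, f"{user}@{floating_ip}"])


# Example with the floating IP from the output earlier in this notebook.
print(ssh_command("129.114.108.40"))
```

Copy the printed command into your local terminal. The key path in the output is expanded to an absolute path for the machine where the helper runs.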