# Connect Colab to a Chameleon server - with AMD GPU

This notebook describes how to connect Colab to a server running on Chameleon. This allows you to run experiments that require bare metal access, or storage, memory, GPU, and compute beyond what Colab's hosted runtime offers, while keeping Colab's familiar interface (and notebooks stored in your Google Drive). It also allows you to easily go back and forth between the convenience of Colab's hosted runtime and Chameleon's greater capabilities, depending on the needs of your experiment.

## Provision the resource


### Check resource availability

This notebook will try to reserve a bare metal Ubuntu server with an MI100 GPU on CHI@TACC, pending availability. Before you begin, you should check the host calendar at [https://chi.tacc.chameleoncloud.org/project/leases/calendar/host/](https://chi.tacc.chameleoncloud.org/project/leases/calendar/host/). In the "Node Type" dropdown, filter on `gpu_mi100` and make sure some hosts are available.

### Chameleon configuration

You can change your Chameleon project name (if not using the one that is automatically configured in the JupyterHub environment) and the site on which to reserve resources (depending on availability) in the following cell.

In [None]:
import chi, os, time, datetime
from chi import lease
from chi import server
from chi import context
from chi import hardware

context.version = "1.0" # required during transition
context.choose_site(default="CHI@TACC")
context.choose_project()
username = os.getenv('USER') # all exp resources will have this prefix

If you need to change the details of the Chameleon server, e.g. use a different GPU type depending on availability, you can do that in the following cell.

In [None]:
node_type = "gpu_mi100"

### Reservation

The following cells will create a reservation that begins now, and ends in 8 hours, *if* your requested node type is available.

If the node type you have requested is *not* available right now, it will start your reservation as soon as one is available.

You can refer to [CHI@TACC host calendar](https://chi.tacc.chameleoncloud.org/project/leases/calendar/host/) to see availability.

In [None]:
gpu_start_times = [n.next_free_timeslot()[0] for n in hardware.get_nodes(node_type=node_type)]
current_time = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(minutes=1)
if min(gpu_start_times) > current_time:
    lease_start = min(gpu_start_times)
    print(f"There is no {node_type} available now, you can request one starting at {str(lease_start)} (UTC).")
else:
    lease_start = current_time
    print(f"A {node_type} IS available now, your lease will start right away.")
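
The selection logic in the cell above can be factored into a small function. This makes it easier to reuse, and lets you handle the case where no nodes are returned, in which `min()` on an empty list would raise a `ValueError`. The sketch below is one way to do that. `choose_lease_start` is a hypothetical helper name, not part of python-chi, and it assumes the free-slot times are timezone-aware UTC datetimes, as the comparison above implies.

```python
import datetime


def choose_lease_start(free_times, now=None, margin=datetime.timedelta(minutes=1)):
    """Return (start_time, available_now) given each node's next free time.

    Hypothetical helper (not part of python-chi).
    Assumes every entry in free_times is a timezone-aware datetime.
    """
    if not free_times:
        # No nodes of the requested type were returned at all.
        raise ValueError("no nodes of the requested type were found")
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    # Same one-minute margin as the cell above, so the start is not in the past.
    earliest_start = now + margin
    soonest_free = min(free_times)
    if soonest_free > earliest_start:
        # Every node is busy: start when the first one becomes free.
        return soonest_free, False
    # At least one node is free now: start (almost) immediately.
    return earliest_start, True
```

With the variables from the cell above, `lease_start, available_now = choose_lease_start(gpu_start_times)` should give the same start time as the original `if`/`else`.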


In [None]:
l = lease.Lease(f"colab-{username}-{node_type}", duration=datetime.timedelta(hours=8),
                start_date = lease_start  )
l.add_node_reservation(node_type = node_type, amount = 1) 
l.add_fip_reservation(1) 
l.submit(idempotent=True)

### Provisioning resources

This section provisions resources. It will take approximately 10 minutes. You can check on its status in the Chameleon web-based UI: [https://chi.tacc.chameleoncloud.org/project/instances/](https://chi.tacc.chameleoncloud.org/project/instances/), then come back here when it is in the READY state.

In [None]:
# continue here, whether using a lease created just now or one created earlier
l = lease.get_lease(f"colab-{username}-{node_type}")

In [None]:
s = server.Server(
    f"colab-{username}-{node_type}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-hwe"  # warning! default Ubuntu 24.04 kernel is not compatible with MI100
)
s.submit(idempotent=True)

Associate an IP address with this server:

In [None]:
s.associate_floating_ip()

and wait for it to come up:

In [None]:
s.refresh()
reserved_fip = s.get_floating_ip()
s.check_connectivity()
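
If you want a similar "is it up yet?" check from somewhere that does not have python-chi installed (for example, your own laptop), one option is to poll the SSH port directly. The sketch below uses only the Python standard library. `wait_for_port` is a hypothetical helper, and it assumes that "up" means TCP port 22 accepts a connection, which may not be exactly what `check_connectivity()` tests.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=600.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Hypothetical helper; returns True if the port became reachable, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_port(reserved_fip)` would return `True` once the server accepts SSH connections on its floating IP.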

## Log in to resource

To log in to the resource, use File > New > Terminal in the Chameleon JupyterHub environment, or your local terminal, and paste in the *output* of the following cell:


In [None]:
print("ssh cc@" + reserved_fip)

## Set up AMD driver

Before we can use the AMD GPUs, we'll set up the driver using the `amdgpu-install` utility.

In [None]:
s.execute("sudo apt update; wget https://repo.radeon.com/amdgpu-install/6.3.3/ubuntu/noble/amdgpu-install_6.3.60303-1_all.deb")
s.execute("sudo apt -y install ./amdgpu-install_6.3.60303-1_all.deb; sudo apt update")

In [None]:
s.execute("amdgpu-install -y --usecase=dkms")
s.execute("sudo apt -y install rocm-smi ")

In [None]:
s.execute("sudo usermod -a -G video,render cc")

At this point you need to reboot:


In [None]:
s.execute("sudo reboot")

and wait for the host to come back online:

In [None]:
s.check_connectivity()

When it does, you should be able to see the GPU(s):

In [None]:
s.execute("rocm-smi")

## Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, we can run *containers* that have all the prerequisite libraries necessary for these frameworks. Here, we will set up the container framework.

In [None]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

In [None]:
s.execute("docker run hello-world")

## Build a Docker container with ROCm and PyTorch

ROCm (Radeon Open Compute Platform) is an open-source software stack from AMD that allows users to program AMD GPUs (similar to NVIDIA's CUDA). To use our AMD GPU for machine learning, we'll install ROCm.

Now, we will build a container image with ROCm and PyTorch, so that we can run deep learning jobs on this server with PyTorch. You can follow a similar approach to build a container image with TensorFlow or similar frameworks.

In [None]:
s.execute("wget https://raw.githubusercontent.com/teaching-on-testbeds/colab/refs/heads/main/docker/Dockerfile.pytorch-notebook-rocm")
s.execute("docker build -t pytorch-notebook-rocm -f Dockerfile.pytorch-notebook-rocm .")

## Start a Jupyter server

Now, you can start a Jupyter server with PyTorch!

In [None]:
# note: the extra group is needed because https://github.com/ROCm/ROCm-docker/issues/90
s.execute("docker run  -d --rm  -p 8888:8888 --device=/dev/kfd --device=/dev/dri --group-add video --group-add $(getent group | grep render | cut -d':' -f 3) pytorch-notebook-rocm")

Then, run

In [None]:
s.execute("docker ps -q | xargs -L 1 docker logs")

In the output of this command, look for a line like

```
        http://127.0.0.1:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

You will copy and paste this URL into your own browser, but replace the **127.0.0.1** with the floating IP assigned to your server. This will open a Jupyter instance running *on your GPU server*.
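
The find-and-replace step described above can also be done programmatically. The sketch below is a minimal example that assumes the log line has the `http://127.0.0.1:<port>/...?token=...` shape shown above. `jupyter_url_for` is a hypothetical helper name, and the log text is passed in as a plain string.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def jupyter_url_for(logs, public_ip):
    """Find the first 127.0.0.1 token URL in `logs` and point it at `public_ip`.

    Hypothetical helper; returns None if no matching URL is found.
    """
    match = re.search(r"http://127\.0\.0\.1:\d+/\S*\?token=\w+", logs)
    if match is None:
        return None
    parts = urlsplit(match.group(0))
    # Swap only the host; keep the port, path, and token unchanged.
    return urlunsplit(parts._replace(netloc=f"{public_ip}:{parts.port}"))
```

For example, `jupyter_url_for(log_text, reserved_fip)`, where `log_text` is the container log output pasted in as a string, would return the URL to open in your browser.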

If you start a notebook inside this Jupyter server, you should be able to run

```
import torch
print(torch.cuda.get_device_name(0))
```

and see the name of your GPU. 

You should also be able to run 

```
!rocminfo
```

and see details of the GPU(s).

To stop the running container(s), use:

In [None]:
s.execute("docker ps -q | xargs -L 1 docker container stop")

## Connect a Colab frontend to your Jupyter server

In [None]:
print('ssh -L 127.0.0.1:8888:127.0.0.1:8888 cc@' + reserved_fip) 

In a **local terminal on your own laptop**, run the SSH command that is printed by the previous cell. This will set up a tunnel to the Jupyter server that you are running on a Chameleon instance.

If your Chameleon key is not in the default location, you should also specify the path to your key as an argument, using `-i`. For example:

In [None]:
print('ssh -i ~/.ssh/id_rsa_chameleon -L 127.0.0.1:8888:127.0.0.1:8888 cc@' + reserved_fip) 

Leave this SSH session open.
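
The two variants of the tunnel command above differ only in the optional `-i` argument. In both, `-L 127.0.0.1:8888:127.0.0.1:8888` forwards port 8888 on your laptop to port 8888 on the server's own loopback interface, which is where Colab expects to find a "local" runtime. As a sketch, a small function can assemble either variant. `tunnel_command` is a hypothetical helper, and it does no shell quoting, so it assumes the key path contains no spaces.

```python
def tunnel_command(host, user="cc", port=8888, key_path=None):
    """Build the SSH port-forwarding command as a single string.

    Hypothetical helper; no shell quoting is applied to the arguments.
    """
    parts = ["ssh"]
    if key_path is not None:
        # Only needed when the key is not in the default location.
        parts += ["-i", key_path]
    # Forward the same port number on both ends of the tunnel.
    parts += ["-L", f"127.0.0.1:{port}:127.0.0.1:{port}", f"{user}@{host}"]
    return " ".join(parts)
```

For example, `print(tunnel_command(reserved_fip, key_path="~/.ssh/id_rsa_chameleon"))` prints the same command as the cell above.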

Then, start the container:

In [None]:
s.execute("docker run  -d --rm  -p 8888:8888 --device=/dev/kfd --device=/dev/dri --group-add video --group-add $(getent group | grep render | cut -d':' -f 3) pytorch-notebook-rocm")

and check the logs:

In [None]:
s.execute("docker ps -q | xargs -L 1 docker logs")

Look for a URL in this format:
    
```
http://127.0.0.1:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

Copy this URL - you will need it in the next step.

Now, you can open Colab in a browser. Click on the drop-down menu for "Connect" in the top right and select "Connect to a local runtime". Paste the URL you copied earlier into the space and click "Connect". Your notebook should now be running on your Chameleon server (you can put `!hostname` in a cell and run it to verify!)


## Release resources

If you finish your experimentation before your lease expires, release your resources and tear down your environment by running the following (commented out to prevent accidental deletions).

This section is designed to work as a "standalone" portion: you can come back to this notebook, ignore the top part, and just run this section to delete your resources.

In [None]:
# setup environment - if you made any changes in the top part, make the same changes here
import chi, os
from chi import lease, magic, context

context.version = "1.0" # required during transition

context.choose_site(default="CHI@TACC")  # must match the site where the lease was created
context.choose_project()
username = os.getenv('USER') # all exp resources will have this prefix

node_type = "gpu_mi100"
l = lease.get_lease(f"colab-{username}-{node_type}")

In [None]:
# un-comment to free resources
# chi.magic.cleanup_resources(l.id)
