# Experimenting Securely and Efficiently Using an SSH Bastion Host

This is the most basic setup for deploying a secure cluster of compute nodes. The notebook creates a list of remote connections to the worker nodes so that jobs can be securely batched to them while using only a single public IP address.

This notebook is broken up into three parts:

### Spawning the required nodes
1. Create a reservation
2. Spawn servers
3. Assign floating IP to bastion host
### Testing the connection
1. Ensure that it is possible to reach all experiment nodes via SSH
### Clean up
1. Free all of our resources

## Spawning nodes

Since our bastion host is only responsible for relaying connections to our worker nodes, it should run on the least in-demand hardware available. Alternatively, if it doesn't affect our experiments, we can repurpose a worker node to double as the bastion host and save a node entirely.

Let's declare some variables which define what types of resources we're going to use.

In [8]:
SITE_NAME = "CHI@TACC"
PROJ_NAME = "CHI-241398"

# Set True if we want to repurpose a worker node to also function as a bastion host
use_worker_as_bastion_host = True

MAKE_RESERVATION = False
RESERVATION_NAME = "ddb-scale-test"
LEASE_DAY = 4
NODE_COUNT = 35


NETWORKS = [ 
    "cluster-net" 
]

In [9]:
import chi

chi.use_site(SITE_NAME)
chi.set("project_name", PROJ_NAME)

Now using CHI@TACC:
URL: https://chi.tacc.chameleoncloud.org
Location: Austin, Texas, USA
Support contact: help@chameleoncloud.org


In [10]:
worker_node_type = "compute_cascadelake_r"
# This is for experiments with multiple worker nodes.
# If you only require one node for your experiments, 
# you can access the node directly without need for a bastion host.
worker_node_count = NODE_COUNT
worker_image = "CC-Ubuntu24.04"
bastion_host_node_type = "compute_skylake"
bastion_host_image = "CC-Ubuntu24.04"
assert worker_node_count >= 2

if use_worker_as_bastion_host:
    bastion_host_node_type = worker_node_type
     
msg = f"Using {worker_node_count} {worker_node_type} nodes as worker nodes"
if use_worker_as_bastion_host:
    msg += ", one of which will function as a bastion host."
else:
    msg += f".\nUsing one {bastion_host_node_type} node as a bastion host."
    
print(msg)

Using 35 compute_cascadelake_r nodes as worker nodes, one of which will function as a bastion host.


In [None]:
import os
import chi.lease

lease = None
lease_name = RESERVATION_NAME
# Defined outside the branch below because later cells use it to name servers
user = os.getenv("USER")

if MAKE_RESERVATION:
    # Prepare the required reservations
    reservation = []
    lease_name = f"{user}-{RESERVATION_NAME}"
    # Leases can be between 1 and 7 days
    lease_length = LEASE_DAY

    # Reserve workers
    chi.lease.add_node_reservation(
        reservation, 
        node_type=worker_node_type, 
        count=worker_node_count
    )

    # Reserve bastion host
    if not use_worker_as_bastion_host:
        chi.lease.add_node_reservation(
            reservation,
            node_type=bastion_host_node_type,
            count=1,
        )
    
    # Reserve a floating IP address for the bastion host
    chi.lease.add_fip_reservation(reservation, count=1)

    start_date, end_date = chi.lease.lease_duration(days=lease_length)

    # Create the lease on Chameleon
    print("Submitting lease...")
    lease = chi.lease.create_lease(
        lease_name, 
        reservation, 
        start_date=start_date, 
        end_date=end_date
    )
    print("Waiting for lease to become active...")
    lease = chi.lease.wait_for_active(lease["id"])
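else:
    # Reuse an existing lease, looked up by name. This returns the same
    # dict-style lease that the rest of the notebook indexes with lease["id"].
    lease = chi.lease.get_lease(lease_name)
    print("Waiting for lease to become active...")
    lease = chi.lease.wait_for_active(lease["id"])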

print("Lease is active!")
lease

### Spawn servers

With our nodes reserved, we can spawn the servers required for our experiment. One thing that is critically important is that all your nodes are on the same network! Otherwise, you will not be able to route your SSH connections through the bastion host.

In [None]:
import chi.network

net_ids = []
for net in NETWORKS:
    network_name = net
    network_id = chi.network.get_network_id(network_name)
    print(f"Using network {network_id}")
    net_ids.append(network_id)

If we are using a separate bastion host, we can spawn that first.

In [None]:
import chi.server

if not use_worker_as_bastion_host:
    bastion_server_name = f"{user}-bastion-server"
    bastion_reservation = chi.lease.get_node_reservation(
        lease["id"],
        node_type=bastion_host_node_type,
        count=1,
    )
    print("Spawning bastion server...")
    bastion_server = chi.server.create_server(
        bastion_server_name, 
        reservation_id=bastion_reservation,
        image_name=bastion_host_image,
        network_id=network_id,
        count=1,
    )
    print("Waiting for bastion server to become active...")
    chi.server.wait_for_active(bastion_server.id)
    print(f"Bastion server {bastion_server.id} is active!")

We can now spawn our workers as well.

In [None]:
worker_name = f"{user}-worker"

worker_reservation = chi.lease.get_node_reservation(
    lease["id"],
    node_type=worker_node_type,
    count=worker_node_count,
)

print(f"Spawning {worker_node_count} workers...")
workers = chi.server.create_server(
    worker_name, 
    reservation_id=worker_reservation,
    image_name=worker_image,
    network_id=network_id,
    count=worker_node_count,
)

In [None]:
print("Waiting for workers to become active...")
for worker in workers:
    chi.server.wait_for_active(worker.id)
    print(f"{worker.name} is active!")

print("All workers active!")
print("Workers:")
[w.id for w in workers]

We'll keep refreshing the workers until their private IP addresses are available.

In [None]:
import time

while not all(w.addresses.get(network_name) for w in workers):
    workers = [chi.server.get_server(w.id) for w in workers]
    time.sleep(5)
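
The loop above polls forever if a worker never receives an address. As a minimal sketch, a bounded version of the same polling pattern could look like the following. The helper name `wait_until` and its parameters are hypothetical and not part of `python-chi`; the `sleep` and `clock` arguments exist only so the helper can be exercised without real delays.

```python
import time


def wait_until(predicate, timeout=600.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call predicate() every `interval` seconds until it returns a truthy value.

    Returns True as soon as the predicate succeeds, or False once
    `timeout` seconds have elapsed without success.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In this notebook, the predicate would refresh `workers` and then check `w.addresses.get(network_name)` for each one, so a `False` return signals that at least one worker still has no private IP after the timeout.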

If we're using a worker as a bastion host, we'll just grab the first one.

In [None]:
if use_worker_as_bastion_host:
    bastion_server = workers[0]

### Associate a floating IP to the bastion host

Next, we'll assign the floating IP address we reserved to our bastion host. We will use this IP address as the entrypoint to all of our workers.

In [None]:
floating_ip = chi.lease.get_reserved_floating_ips(lease["id"])[0]
chi.server.associate_floating_ip(bastion_server.id, floating_ip)

It may take some time after the server becomes active for it to accept network connections.

In [None]:
print("Waiting for bastion server to come online...")
chi.server.wait_for_tcp(floating_ip, port=22)
print("Able to connect to bastion server!")

## Testing the connection

Now that our servers are up, we'll make sure that we can reach the workers via SSH routed through the bastion host.

Since our workers have only private IP addresses and are not reachable from the public internet, we can't connect to them directly from Jupyter. Instead, we'll reach their private IPs by routing our SSH connections through the bastion host.

**NOTE**: The servers may have "Active" status but not actually be reachable over the network for some time afterward. You can check on the status of your nodes by viewing their console in the Chameleon dashboard.
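
For reference, the same routing can be expressed with the OpenSSH client's `-J` (ProxyJump) option, which is handy when debugging from a terminal. The sketch below only builds the command line and does not open a connection. The helper name `proxyjump_command` is hypothetical, the login user `cc` is an assumption about the images in use, and the addresses are documentation placeholders rather than values from this deployment.

```python
import shlex


def proxyjump_command(bastion_ip, worker_ip, user="cc", remote_command=None):
    """Build an OpenSSH argv that reaches worker_ip by jumping through bastion_ip."""
    # -J tells ssh to connect to the bastion first and tunnel on to the target
    argv = ["ssh", "-J", f"{user}@{bastion_ip}", f"{user}@{worker_ip}"]
    if remote_command is not None:
        argv.append(remote_command)
    return argv


# Placeholder addresses: a public bastion IP and a private worker IP
argv = proxyjump_command("203.0.113.10", "10.0.0.5", remote_command="hostname")
print(shlex.join(argv))
# prints: ssh -J cc@203.0.113.10 cc@10.0.0.5 hostname
```

Substituting `floating_ip` and a worker's private address would give a command that can be pasted into a shell with the right SSH key loaded.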

In [None]:
import chi.ssh

worker_remotes = []

for worker in workers:
    print(f"Attempting connection to {worker.name}...")
    worker_private_ip = worker.addresses[network_name][0]["addr"]
    # The gateway here represents a jump host via the bastion host.
    # For more on SSH jump hosts, read here: https://www.redhat.com/sysadmin/ssh-proxy-bastion-proxyjump
    worker_remote = chi.ssh.Remote(worker_private_ip, gateway=chi.ssh.Remote(floating_ip))
    connected = False
    # Attempt a connection every 10 seconds until it succeeds
    while not connected:
        try:
            test_result = worker_remote.run("echo Hello from $(hostname)!")
            connected = True
        except Exception:
            time.sleep(10)
    print(test_result)
    worker_remotes.append(worker_remote)

Amazing! If the above cell executed, that means you have a _secure_ connection to all your workers via only a single IP address! From here, you can utilize the `worker_remotes` list to batch commands out to your worker nodes, or you can simply clean up resources in the next step.
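
One way to batch a command out to every worker is to run it concurrently from a thread pool. This is a minimal sketch, not a `python-chi` feature: the helper name `run_on_all` is hypothetical, and it only assumes each remote exposes a `run(command)` method, as the objects in `worker_remotes` do. It also assumes each remote holds its own connection, so they can be used from separate threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all(remotes, command, max_workers=8):
    """Run `command` on every remote concurrently.

    Returns a list of (remote, result, error) tuples in the same order
    as `remotes`. `error` is None when the command succeeded; otherwise
    `result` is None and `error` holds the exception that was raised.
    """
    def run_one(remote):
        try:
            return remote, remote.run(command), None
        except Exception as exc:  # keep going if a single node fails
            return remote, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with remotes
        return list(pool.map(run_one, remotes))
```

Calling `run_on_all(worker_remotes, "uptime")` would then return one entry per worker, and failures can be inspected afterward instead of aborting the whole batch.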

## Cleanup

When we're done with our experiment, it's important to clean up all of our resources so that they're available for other researchers who may need them.

### Cleaning up resources

Below are some granular flags which, when set to `True`, will delete applicable resources when the next cell is executed. 

**NOTE**: If the lease is deleted, then _all_ resources (servers, IPs, etc.) reserved by it will also be deleted.

In [None]:
clear_floating_ip = False
clear_workers = False
clear_bastion_server = False

delete_lease_and_free_all_resources = False

In [None]:
if clear_floating_ip:
    chi.server.detach_floating_ip(bastion_server.id, floating_ip)

In [None]:
if clear_workers:
    for worker in workers:
        # Compare by ID so a worker doubling as the bastion host is skipped
        if worker.id != bastion_server.id:
            worker.delete()

In [None]:
if clear_bastion_server:
    bastion_server.delete()

In [None]:
if delete_lease_and_free_all_resources:
    chi.lease.delete_lease(lease["id"])