# Using the Savio HPC for Training Neural Nets

Saaht Mogan, AY128 UC Berkeley (SP 2025)

## 1. Why Use a Computing Cluster?
**Handling Large Datasets:**

Astrophysics involves massive datasets (e.g., telescope observations, 
cosmological simulations). HPCs accelerate processing tasks that would take 
days on a personal computer. We'll see an example of this later.

**Parallel Computing:**

HPC clusters like Savio allow parallel execution of tasks across multiple 
CPUs/GPUs, critical for simulations or parameter sweeps.

**Examples:**
* Analyzing galaxy survey data (e.g., DESI, LSST).
* Running N-body simulations for dark matter studies.
* **Training machine learning models on astronomical images.**

## 2. Savio Cluster Basics
**Cluster Architecture**
* Login Nodes: Lightweight tasks (editing code, job submission).
* Compute Nodes: Heavy computations (request via SLURM).
* Partitions: Groups of nodes with specific resources (e.g., `savio2_1080ti` 
for GPU jobs).

**Storage Options**
* Home Directory: Small, for critical files.
* Scratch: High-speed storage for temporary job data.
* Condos: Purchased storage for large projects.

**Job Scheduling (SLURM)**
* SLURM manages resource allocation.
* Submit jobs via `sbatch job.sh`.
* Monitor jobs with `squeue -u $USER`.
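
The submit-and-monitor cycle above can be sketched as a few shell commands. This is a minimal sketch, assuming a job script named `job.sh` sits in the current directory and that you are on a login node where the SLURM commands are installed. The `JOB_SCRIPT` and `JOB_ID` variable names are only illustrative.

```shell
#!/bin/bash
# Assumed name of the job script to submit (hypothetical placeholder).
JOB_SCRIPT="job.sh"

# Only try to submit when SLURM is installed (e.g., on a cluster login node).
if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print just the job ID instead of a full sentence.
    if JOB_ID=$(sbatch --parsable "$JOB_SCRIPT"); then
        echo "Submitted job $JOB_ID"
        # List your own pending and running jobs.
        squeue -u "$USER"
        # To cancel the job, uncomment the next line:
        # scancel "$JOB_ID"
    else
        echo "Submission of $JOB_SCRIPT failed"
    fi
else
    echo "sbatch not found; run this on a cluster login node"
fi
```

Capturing the job ID is optional, but it is handy if you later want to cancel the job with `scancel`.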

## 3. Using Savio: Step-by-Step
**Important Links**
* https://mybrc.brc.berkeley.edu/
    * Portal for managing your access to different computing projects
* https://docs-research-it.berkeley.edu/services/high-performance-computing/
    * Documentation for how to use the cluster
* https://ood.brc.berkeley.edu/
    * The On Demand portal: the simplest way to interface with the cluster

**Accessing Savio and File Transfer**
* Log in, transfer files, edit files, and submit jobs, all from the On Demand portal

**Making a Job Script**

The items required in every job script are bolded. We will go over the main 
`SBATCH` options we use for our jobs.

* **Account** (`--account=fc_dweisz`)
    * We are using Prof. Weisz's faculty allocation, so don't use too many service units (SUs)
* **Node Partition** (`--partition=savio2_1080ti`)
    * This is the "cheapest" node with GPUs, more than enough compute for us
* GPU (`--gres=gpu:1`)
* CPUs (`--cpus-per-task=2`)
    * `savio2_1080ti` nodes need 2 CPU cores for each GPU requested
* **Time limit**: 10-minute runtime (`--time=00:10:00`)
    * SLURM stops the job once this time limit is reached, whether or not 
    the job has finished
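
Put together, the options above might look like the following job script. This is a minimal sketch rather than the exact script used in the lab. The job name, the log file name, and `train.py` are hypothetical placeholders, and `--job-name` and `--output` are standard `sbatch` options added here only for convenience.

```shell
#!/bin/bash
# Sketch of a Savio GPU job script. The job name, log file name, and
# train.py are assumed placeholders; the other options are the ones listed above.
#SBATCH --job-name=cifar10_resnet
#SBATCH --output=slurm_%j.out
#SBATCH --account=fc_dweisz
#SBATCH --partition=savio2_1080ti
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:10:00

# Hypothetical training script; replace with your own file.
TRAIN_SCRIPT="train.py"

# SLURM sets SLURM_JOB_ID inside a job; fall back to a label when run by hand.
echo "Job ${SLURM_JOB_ID:-not-under-slurm} starting on ${HOSTNAME:-unknown-host}"

# Run the training script if it exists, otherwise report that and stop.
if [ -f "$TRAIN_SCRIPT" ]; then
    python "$TRAIN_SCRIPT"
else
    echo "Training script $TRAIN_SCRIPT not found; nothing to run"
fi
```

The `#SBATCH` lines are ordinary comments to bash, so they only take effect when the file is submitted with `sbatch`. The `%j` in the log file name is replaced by the job ID. You will probably also need to set up your Python environment before the `python` line, and how you do that depends on your own setup.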

## 4. Example Time!
For our example, we are going to train a ResNet model 
(see lab document for details) to classify images from the CIFAR-10 dataset. 
The task is to sort 32x32 images into 1 of 10 different classes: 
plane, car, bird, cat, deer, dog, frog, horse, ship, and truck. This is similar to our
problem, except that we are predicting continuous labels, not discrete classes. That 
changes our code slightly, but the gist of the example remains the same. As 
outlined in the lab, our broad steps are the same:

1. Load our data in a memory-friendly way.
2. Initialize our model with an appropriate loss function, optimizer, and 
    (optionally) a learning rate scheduler.
3. Finally, run a training-validation loop for however many epochs we want to 
    train.


![Examples from the CIFAR-10 dataset](cifar10_images.png)

Let's train this model for 5 epochs 3 different ways:
1. Personal Computer on the CPU
    * Wayyyyy too slow for repeated iterating
2. Personal Computer on the GPU
    * Possible, but inconvenient
3. On Savio
    * Set it, and (hopefully) forget it