# Using GPUs at Northwestern

In this workshop, we'll leverage the power of Quest GPU nodes to run our open-source LLMs. To do so, please use the temporary Quest allocation: <font color='purple'>__e32337__</font>.

Afterwards, you can request your own Quest allocation [here](https://www.it.northwestern.edu/departments/it-services-support/research/computing/quest/general-access-allocation-types.html).


:::{note}
There are other options for GPUs:

- [Google Colab](https://colab.research.google.com/?utm_source=scs-index) allows you to use GPUs for free with browser-based notebooks
- Cloud platforms like Amazon Web Services, Google Cloud Platform, and Microsoft Azure all offer cloud-based GPUs for a price
- Many other cloud providers have sprung up, such as [Paperspace](https://www.paperspace.com/)
- You can buy your own if you have the budget and expertise
:::

## Parallel Computing for LLMs

`````{important}
The main reason to run our LLMs on GPU nodes is speed. To understand why, it helps to know three terms you'll often hear us use: <font color='purple'>__CPUs__</font>, <font color='purple'>__GPUs__</font>, and <font color='purple'>__CUDA__</font>. This section breaks down these terms.
`````

:::{admonition} <font color='purple'>__CPUs__</font>
Much like your own computer, some of our KLC and Quest nodes are equipped with both processors and graphics cards. A processor or <font color='purple'>__central processing unit (CPU)__</font> handles the general-purpose mathematical and logical calculations on a node. In a nutshell, it runs code. While CPUs are extremely powerful and complete most individual tasks in a tiny fraction of a second, a CPU core can only handle one task at a time and runs things __sequentially__.

 ```{figure} ./images/cpu_sequential.png
---
width: 500px
name: cpu_sequential
---
```
 :::

:::{admonition} <font color='purple'>__Multiple CPU Cores__</font>
One way to speed up processing is through <font color='purple'>_parallel computing_</font> across multiple CPU cores. Parallel computing is a method of solving a single problem by breaking it down into smaller chunks that run __simultaneously__. A task can be broken up and distributed over multiple CPU cores, with each core working on its own chunk.

```{figure} ./images/cpu_parallel.png
---
width: 350px
name: cpu_parallel
---
```
:::
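As a minimal sketch of this idea, the Python code below splits a list into chunks and hands each chunk to a separate worker process using the standard library's `concurrent.futures`. The function names (`sum_of_squares`, `split_into_chunks`, `parallel_sum_of_squares`) and the choice of four workers are illustrative assumptions, not part of the workshop materials.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    """Work done by one core: sum the squares of one chunk."""
    return sum(x * x for x in chunk)


def split_into_chunks(data, n_chunks):
    """Break one large problem into roughly equal smaller pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_workers=4):
    """Process the chunks simultaneously, then combine the partial results."""
    chunks = split_into_chunks(data, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

When calling `parallel_sum_of_squares(...)` from a script, place the call under `if __name__ == "__main__":` so that worker processes can start cleanly on every platform.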

:::{note}
The latest generation of [KLC nodes](https://www.kellogg.northwestern.edu/academics-research/research-support/computing/kellogg-linux-cluster.aspx) has 64 CPU cores and 2TB of shared RAM 🚀. This means you could in theory run 64 parallel (simultaneous) processes on a single KLC node.
:::
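To see how many cores your own process can actually use on a node, a short check like the one below may help. This is a sketch: the helper name `available_cores` is hypothetical, and the number reported depends on the node and on any limits a scheduler places on your job.

```python
import os


def available_cores():
    """Return the number of CPU cores this process is allowed to use."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: reflects the CPU affinity mask, which a scheduler may restrict
        return len(os.sched_getaffinity(0))
    # Fallback for other platforms: total cores on the machine
    return os.cpu_count() or 1


print("Cores available to this process:", available_cores())
```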

:::{admonition} <font color='purple'>__GPUs__</font>
A graphics card or <font color='purple'>__graphics processing unit (GPU)__</font> is a specialized hardware component that can efficiently handle parallel mathematical operations. In comparison to the 24 cores you can use on KLC, an A100 GPU contains 6,912 CUDA cores (an H100 SXM GPU has an astounding 16,896 CUDA cores). While a single GPU core is less powerful than a single CPU core, the sheer number of GPU cores makes them ideal for running large numbers of similar computations in parallel, especially the vector and matrix operations for which GPUs were designed. We will see an example later of the speedup that GPUs provide for this kind of task.

```{figure} ./images/gpu.png
---
width: 350px
name: gpu
---
```
:::

:::{note}
If GPUs are so much better at parallelization than CPUs, why aren't all tasks given to GPUs?  

- Some tasks simply can't be parallelized because the input to one step depends on the output of another. In this case, the steps must be run in serial for logical reasons.

- Even when parallelization is possible, some tasks take longer if parallelized. The overhead of splitting up the work and coordinating processes across cores can exceed the time a single CPU core would need to complete the task alone.
:::
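The first point can be illustrated with a small sketch. The logistic map used here is only an illustrative choice (it does not come from the workshop materials): each value is computed from the previous one, so the loop has to run in order, whereas squaring a list of numbers has no such dependency.

```python
def logistic_map_series(x0, r, steps):
    """Sequential by nature: each value needs the one before it."""
    values = [x0]
    for _ in range(steps):
        previous = values[-1]
        values.append(r * previous * (1 - previous))
    return values


def square_all(values):
    """Independent work: each output depends only on its own input,
    so the items could be divided among many cores."""
    return [v * v for v in values]


print(logistic_map_series(0.2, 3.5, 5))
print(square_all([1, 2, 3, 4]))
```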

:::{admonition} <font color='purple'>__CUDA__</font>
Because parallelization is not always a win, a program needs a way to send only the right work to the GPU. For Nvidia-based GPUs, this is where <font color='purple'>__CUDA__</font> comes in. <font color='purple'>__CUDA (Compute Unified Device Architecture)__</font> is Nvidia's software platform and programming model for running general-purpose computations on GPUs. On the GPU nodes, libraries such as PyTorch use CUDA to run performance-intensive operations on the GPU, while the rest of the program continues to run on the CPU.

The image below illustrates how work is divided between CPUs and GPUs with CUDA.

```{figure} ./images/giffy_gif.gif
---
width: auto
name: giffy
---
```
:::
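In PyTorch, where a computation runs is decided by where its tensors live. The sketch below, which assumes PyTorch is installed, creates a tensor on the CPU, copies it to the GPU if one is available (falling back to the CPU otherwise), multiplies it there, and copies the result back.

```python
import torch

# Pick the GPU when CUDA is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.ones(3, 3)   # created on the CPU by default
y = x.to(device)       # copied to the selected device
z = (y @ y).cpu()      # matrix multiply runs on y's device; result copied back

print("Input device: ", x.device)
print("Compute device:", y.device)
print(z)
```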

:::{important} 
For vector and matrix operations, GPUs can be orders of magnitude faster than CPUs!

```{figure} ./images/gpu-v-cpu.png
---
width: auto
name: gpu-v-cpu
---
```
:::
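You can get a rough sense of this difference yourself with a timing sketch like the one below. It assumes PyTorch is installed; the helper `time_matmul`, the matrix size, and the number of repeats are arbitrary illustrative choices, and the measured times will vary with the hardware and the load on the node.

```python
import time

import torch


def time_matmul(device, size=1000, repeats=3):
    """Average seconds per (size x size) matrix multiplication on `device`."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    torch.matmul(a, b)  # warm-up run, not timed
    if device.type == "cuda":
        torch.cuda.synchronize()  # GPU calls are asynchronous; wait for them
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats


cpu_seconds = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_seconds:.5f} s per multiplication")

if torch.cuda.is_available():
    gpu_seconds = time_matmul(torch.device("cuda"))
    print(f"GPU: {gpu_seconds:.5f} s per multiplication")
else:
    print("No GPU available; skipping the GPU timing.")
```

The `torch.cuda.synchronize()` calls matter: without them the timer can stop before the GPU has finished its work, which makes the GPU look faster than it is.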

## Sample GPU Python Code

To get started with the GPU nodes, here is a sample Python script. The code below tests whether GPUs are available on a node and then runs a simple tensor operation. This file is located in the course [GitHub repository](https://github.com/rs-kellogg/krs-openllm-cookbook/blob/main/scripts/slurm_basics).

:::{admonition} pytorch_gpu_test.py
```python
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available!")
    print("Number of GPUs available:", torch.cuda.device_count())
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available.")

# Check if CUDA is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print whether a GPU or CPU is being used
if device.type == 'cuda':
    print("Using GPU")
else:
    print("Using CPU")

# Create two random tensors
tensor1 = torch.randn(1000, 1000, device=device)
tensor2 = torch.randn(1000, 1000, device=device)

# Add the two tensors, the operation will be performed on the GPU if available
result = tensor1 + tensor2

print(result)
```
:::

:::{note}
Running this code in a Jupyter notebook is demonstrated in [this video](https://kellogg-shared.s3.us-east-2.amazonaws.com/videos/quest-on-demand-gpu-notebook.mp4).
:::

:::{admonition} <font color='purple'>_Northwestern GPU Resources_</font>
<!-- :class: tip -->
[Quest](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1112) has dozens of Nvidia-based GPU nodes available for use. We will show you how to access them via a Jupyter notebook using [Quest on Demand](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=2234) and using the [Slurm scheduler](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964). Both of these methods require that you are part of a Quest allocation.
:::

:::{note}
We are in the process of setting up GPU nodes for exclusive use by the Kellogg research community as part of the [Kellogg Linux Cluster](https://www.kellogg.northwestern.edu/academics-research/research-support/computing/kellogg-linux-cluster.aspx). Accessing these nodes will be identical to accessing GPU nodes on Quest, but will require only a KLC account and not a separate Quest allocation.
:::

## SLURM Script to Access GPU Nodes

For this workshop, we'll submit jobs to the Quest GPU nodes through a <font color='purple'>SLURM</font> (scheduler) script. You can launch the sample Python code using this script.

:::{admonition} <font color='purple'>_pytorch_gpu_test.sh_</font>
```bash
#!/bin/bash

#SBATCH --account=e32337
#SBATCH --partition gengpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=pcie
#SBATCH --time 0:30:00
#SBATCH --mem=40G
#SBATCH --output=/projects/e32337/slurm-output/slurm-%j.out


module purge all
module use --append /kellogg/software/Modules/modulefiles
module load micromamba/latest
source /kellogg/software/Modules/modulefiles/micromamba/load_hook.sh
micromamba activate /kellogg/software/envs/llm-test-env
python pytorch_gpu_test.py
```
:::

:::{admonition} <font color='purple'>_Breaking down this script_</font> 

- `--account` is the [Quest allocation](https://www.it.northwestern.edu/departments/it-services-support/research/computing/quest/general-access-allocation-types.html) you are given.
- `--partition gengpu` directs your job to the general access [GPU nodes](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1112) on Quest.
- `--nodes=1` specifies that the job will be run on 1 node of the cluster.
- `--ntasks-per-node=1` specifies how many tasks, and by default how many CPU cores, the job will use on the node. Setting `--ntasks-per-node=2` will run your script on two cores of the node. Only adjust this parameter if your code is parallelizable; otherwise it will slow your job down rather than speed it up.
- `--gres=gpu:a100:1` specifies that the job requires 1 GPU of type "a100". You can request more than one.
- `--constraint` specifies the type of A100 preferred. The [choices](https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1112) are "sxm" (80GB of GPU memory) or "pcie" (40GB of GPU memory).
- `--time 0:30:00` indicates that this job will be allowed to run for up to 30 minutes.
- `--mem=40G` specifies how much memory (RAM on the node, not GPU memory) you are requesting.
- `--output` specifies the path and file where the stdout and stderr output streams will be saved. The `%j` in the file name is replaced by the job ID.

After accessing the GPU node, the script loads micromamba and activates the <font color='purple'>__llm-test-env__</font> environment. Finally, it launches the Python code.
:::
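Once a job is running, Slurm exposes the resources it granted through environment variables. As a minimal sketch, a snippet like the following could be added to a Python script to print a few of them to the job's output file. The helper name `describe_allocation` and the selection of variables are illustrative choices; any variable that is not set (for example, when the script runs outside of Slurm) is reported as `<not set>`.

```python
import os

# A few environment variables that Slurm typically sets inside a job
SLURM_VARS = [
    "SLURM_JOB_ID",          # the job ID, same value as %j in --output
    "SLURM_JOB_PARTITION",   # the partition the job is running in
    "SLURM_JOB_NODELIST",    # the node(s) assigned to the job
    "SLURM_CPUS_ON_NODE",    # CPU cores available to the job on this node
    "CUDA_VISIBLE_DEVICES",  # GPU index(es) the job is allowed to use
]


def describe_allocation(env=None):
    """Return the selected variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {name: env.get(name, "<not set>") for name in SLURM_VARS}


for name, value in describe_allocation().items():
    print(f"{name}={value}")
```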

:::{note}
A demonstration of running a Slurm script through the Quest On Demand graphical interface is shown [here](https://kellogg-shared.s3.us-east-2.amazonaws.com/videos/quest-on-demand-gpu-slurm.mp4), and from a command line terminal [here](https://kellogg-shared.s3.us-east-2.amazonaws.com/videos/console-gpu-slurm.mp4).
:::

### <font color='purple'>_Reference Sources_</font>

- [Cuda Simply Explained](https://youtube.com/watch?v=r9IqwpMR9TE)
- [Understanding Parallel Computing](https://blog.paperspace.com/demystifying-parallel-computing-gpu-vs-cpu-explained-simply-with-cuda/)
