## <font color='purple'>SLURM Jobs and GPU Usage</font>

Open-source LLMs can have billions of parameters. To speed up processing, we will run these models on Quest GPU nodes. To access these nodes, you will need a Quest allocation, which you can request here: https://www.it.northwestern.edu/departments/it-services-support/research/computing/quest/general-access-allocation-types.html.

Today, we'll submit jobs to the Quest GPU nodes through a <font color='purple'>SLURM</font> batch script (SLURM is the job scheduler used on Quest).

### <font color='purple'>1.) Sample Python Code</font>

To get started with the GPU nodes, here is a sample Python script. It checks whether a GPU is available, then creates two random tensors on the selected device and adds them together. This file is located at <font color='purple'>__/kellogg/admin/gpu_test_files/pytorch_gpu_test.py__</font>.

```python
# pytorch_gpu_test.py

import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available!")
    print("Number of GPUs available:", torch.cuda.device_count())
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available.")

# Use the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print whether a GPU or CPU is being used
if device.type == 'cuda':
    print("Using GPU")
else:
    print("Using CPU")

# Create two random 1000 x 1000 tensors directly on the chosen device
tensor1 = torch.randn(1000, 1000, device=device)
tensor2 = torch.randn(1000, 1000, device=device)

# Add the two tensors; the operation runs on the GPU if one is available
result = tensor1 + tensor2

print(result)
```
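When the test runs under Slurm, it can also help to print what the scheduler actually allocated. The sketch below is an optional addition, not part of the original test file. It reads a few environment variables that Slurm typically sets inside a job (`SLURM_JOB_ID`, `SLURM_JOB_PARTITION`, `SLURM_NTASKS`), plus `CUDA_VISIBLE_DEVICES`, which is usually set when GPUs are requested through `--gres`. Whether each variable appears depends on the cluster's configuration, so missing ones are reported as "not set". The helper name `slurm_job_summary` is hypothetical.

```python
import os

# Environment variables commonly set by Slurm inside a running job.
# CUDA_VISIBLE_DEVICES is usually set when GPUs are allocated via --gres.
SLURM_VARS = [
    "SLURM_JOB_ID",
    "SLURM_JOB_PARTITION",
    "SLURM_NTASKS",
    "CUDA_VISIBLE_DEVICES",
]


def slurm_job_summary(environ=None):
    """Return a dict of selected job variables, using 'not set' for missing ones."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, "not set") for name in SLURM_VARS}


if __name__ == "__main__":
    # Outside a Slurm job, every value will normally print as "not set".
    for name, value in slurm_job_summary().items():
        print(f"{name}: {value}")
```

These lines could be appended to `pytorch_gpu_test.py` so that the job's output file records the allocation alongside the CUDA check.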

### <font color='purple'>2.) SLURM Script</font>

As long as you have a Quest allocation, you can launch the sample Python code with the SLURM script below. Replace `your_quest_allocation_account` with the name of your own allocation.

```bash
#!/bin/bash

#SBATCH -A your_quest_allocation_account
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 0:30:00
#SBATCH --mem=40G

module purge
module load python
source activate /kellogg/software/envs/gpu-pytorch
python pytorch_gpu_test.py
```
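If you plan to run many variations of this job, one option is to generate the batch script from Python instead of editing it by hand. The helper `make_sbatch_script` below is hypothetical (it is not part of Quest or Slurm). It writes out the same directives as the script above using their long-form option names, and it places every `#SBATCH` line before the first command, because `sbatch` stops reading directives once it reaches the first executable line. The account name is a placeholder you must replace.

```python
def make_sbatch_script(
    account,
    command,
    partition="gengpu",
    gres="gpu:a100:1",
    nodes=1,
    ntasks=1,
    time="00:30:00",
    mem="40G",
    env_path="/kellogg/software/envs/gpu-pytorch",
):
    """Build the text of a Slurm batch script equivalent to the example above."""
    directives = [
        f"--account={account}",
        f"--partition={partition}",
        f"--gres={gres}",
        f"--nodes={nodes}",
        f"--ntasks={ntasks}",
        f"--time={time}",
        f"--mem={mem}",
    ]
    # Shebang first, then all #SBATCH directives, then the commands to run.
    lines = ["#!/bin/bash", ""]
    lines += [f"#SBATCH {d}" for d in directives]
    lines += [
        "",
        "module purge",
        "module load python",
        f"source activate {env_path}",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_sbatch_script("your_quest_allocation_account",
                             "python pytorch_gpu_test.py"))
```

The printed text can be saved to a `.sh` file and submitted with `sbatch` in the usual way.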


Breaking down this script: 

- Lines beginning with `#SBATCH` are directives that Slurm reads when the script is submitted with `sbatch`. The script uses short option names; the equivalent long forms are given below.
- `-A` (`--account=your_quest_allocation_account`) must reference the Quest allocation you were given.
- `-p gengpu` (`--partition=gengpu`) sends the job to the `gengpu` partition, which contains Quest's GPU nodes.
- `--gres=gpu:a100:1`: GRES stands for "Generic Resources". This line requests 1 GPU of type A100. If you leave this line out, your job will be restricted to CPU resources.
- `-N 1` (`--nodes=1`) runs the job on a single node, which is all a one-GPU job like this one needs.
- `-n 1` (`--ntasks=1`) requests one task, and by default Slurm allocates one CPU core per task. Increase this only if your code is written to run in parallel; otherwise the extra cores will go unused.
- `-t 0:30:00` (`--time=00:30:00`) allows the job to run for up to 30 minutes; Slurm stops the job if it runs longer.
- `--mem=40G` requests 40 GB of memory on the node. This is system (CPU) RAM, separate from the GPU's own memory.
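Slurm accepts several formats for the time limit: `minutes`, `minutes:seconds`, `hours:minutes:seconds`, `days-hours`, `days-hours:minutes`, and `days-hours:minutes:seconds`. To illustrate how those formats map to a duration, here is a small parser. It is a sketch written for this tutorial, not Slurm's own implementation, and it does not handle special values such as `UNLIMITED`.

```python
def slurm_time_to_seconds(spec):
    """Convert a Slurm --time string to seconds.

    Sketch covering the documented numeric formats; special values
    such as 'UNLIMITED' are not handled.
    """
    days = 0
    if "-" in spec:
        # days-hours, days-hours:minutes, or days-hours:minutes:seconds
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unrecognized time format: {spec!r}")
        # Missing minutes/seconds default to zero.
        hours, minutes, seconds = (fields + [0, 0])[:3]
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:    # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:  # minutes:seconds
            hours = 0
            minutes, seconds = fields
        elif len(fields) == 3:  # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unrecognized time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for spec in ["0:30:00", "30", "1-12"]:
        print(spec, "->", slurm_time_to_seconds(spec), "seconds")
```

Under this parsing, `0:30:00` and `30` both come out to 1800 seconds (30 minutes), and `1-12` (one day and twelve hours) comes out to 129600 seconds.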

Once the job starts on a GPU node, the script clears any loaded modules (`module purge`), loads Python, and activates the <font color='purple'>__gpu-pytorch__</font> environment. Finally, it runs the Python test script.
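If the job reports that `torch` cannot be imported, a likely cause is that the environment was not activated. When Python is launched from a conda environment, `sys.prefix` typically points to that environment's directory, so a quick check could compare it with the path used in the SLURM script. The helper name `using_expected_env` is hypothetical, and the expected path is simply copied from the script above.

```python
import os
import sys

# Environment path taken from the SLURM script above.
EXPECTED_ENV = "/kellogg/software/envs/gpu-pytorch"


def using_expected_env(expected=EXPECTED_ENV, prefix=None):
    """Return True if the interpreter's prefix matches the expected environment path."""
    prefix = sys.prefix if prefix is None else prefix
    # realpath normalizes trailing slashes and resolves any symlinks that exist.
    return os.path.realpath(prefix) == os.path.realpath(expected)


if __name__ == "__main__":
    print("Python environment:", sys.prefix)
    if not using_expected_env():
        print(f"Warning: expected the environment at {EXPECTED_ENV}")
```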

Submit the script by running `sbatch` followed by the script's filename. Slurm then replies with a job ID, like:

<font color='purple'>__Submitted batch job 9428806__</font> 

Once the job finishes, its output (everything the script prints) will be in a file named __slurm-[job_ID].out__ (for example, __slurm-9428806.out__) in the directory from which you submitted the job.
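If you submit jobs from Python, the job ID and output file name can be recovered from the `sbatch` message. The sketch below assumes the default output pattern `slurm-%j.out` for a non-array job and no custom `--output` option; the function names are hypothetical.

```python
import re
from pathlib import Path


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def default_output_path(job_id, submit_dir="."):
    """Path of the default output file (slurm-<job_id>.out) for a non-array job."""
    return Path(submit_dir) / f"slurm-{job_id}.out"


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 9428806")
    out_file = default_output_path(job_id)
    print(job_id, out_file)
    # Only read the file if the job has already produced it.
    if out_file.exists():
        print(out_file.read_text())
```

In practice, the message could come from `subprocess.run(["sbatch", "your_script.sh"], capture_output=True, text=True, check=True).stdout`, where `your_script.sh` stands for whatever you named the SLURM script. Alternatively, `sbatch --parsable` prints only the job ID (plus the cluster name, if one is set), which avoids the regular expression.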