# LEARN Workshop - session 1
_17 March 2023_

## Objectives of today's session
- Get on board the DGX A100 machine and explore its features.
- Learn how to build a Docker image with the environment needed for deep learning (DL).
- Brainstorm requirements for the LEARN DL/ML platform (with the help of Jack O'Halloran).

## References

**Hardware**
- [DGX A100 white paper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/dgx-a100/dgxa100-system-architecture-white-paper.pdf)
- [Nvidia A100 Tensor Core GPU paper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

**Multi-Instance GPU (MIG)**
- [MIG user guide from Nvidia](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)

**Docker**
- [Docker basics - how to use Dockerfiles](https://thenewstack.io/docker-basics-how-to-use-dockerfiles/)

**Deep learning training**
- https://sebastianraschka.com/blog/2023/pytorch-faster.html#3-automatic-mixed-precision-training
- https://sebastianraschka.com/blog/2023/pytorch-faster.html#5-training-on-4-gpus-with-distributed-data-parallel
- https://sebastianraschka.com/blog/2023/pytorch-faster.html#6-deepspeed

___

## To get started

To start off, we will log into the DGX A100, launch a container and open this notebook.

1. Log into the DGX A100 through SSH (server name: `cistib-dgx01.leeds.ac.uk`). Use your university credentials.
2. Check that Docker works on the DGX for your user: run `docker ps` and `docker run hello-world`.
3. Clone this repository: 
  ```bash
  git clone https://github.com/rbonazzola/LEARN_workshop.git
  ```
4. Pull the test Docker image we will use, with the following command: 
  ```bash
  docker pull rbonazzola/learn_workshop:session_2
  ```  
  Alternatively, you can build the image yourself (this will take longer). The source `Dockerfile` is `docker/Dockerfile_pt113_cu117_ptl19`. If run from the `LEARN_workshop/docker` directory, the command is: `docker build -f Dockerfile_pt113_cu117_ptl19 -t rbonazzola/learn_workshop:session_2 .` (don't forget the dot at the end).
  
  
5. Launch a container by running these commands:
```bash
DEVICE="1:0" # change to another device if needed. The format is GPU_ID:MIG_ID.
SOURCEPATH="${HOME}/LEARN_workshop" # if you didn't clone the repo in the home dir, change accordingly
TARGETPATH="/root/LEARN_workshop"
PORT=13467 # change to another random port above 10000
docker run -it -p ${PORT}:8888 --shm-size=32gb --gpus '"device='$DEVICE'"' --user root --mount type=bind,source=${SOURCEPATH},target=${TARGETPATH} rbonazzola/learn_workshop:session_2
```
6. If the above worked, you should be inside the container. Launch Jupyter Lab or Jupyter Notebook from inside:
```bash
jupyter lab --ip "0.0.0.0" --allow-root --no-browser
```

7. Create an SSH tunnel from your own machine to the DGX (using MobaXterm on Windows, or the `ssh` option `-L ${PORT}:localhost:${PORT}` on Linux).
8. Open your local web browser, enter `localhost:PORT` in the address bar (with `PORT` replaced by the number you chose), and paste the access token that Jupyter printed in the terminal.
9. Open this notebook, `LEARN_workshop/LEARN_workshop_session1.ipynb`. **Tip**: create a copy of this notebook ("save as...") if you plan to make changes to it. This avoids merge conflicts when you later pull updates.

For the SSH tunnel, the figure below shows the required MobaXterm configuration:

![](figures/local-port-forwarding.png)
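
Steps 5 and 7 both rely on a port that nobody else on the DGX is already using. The following is a minimal sketch, using only the Python standard library, of how one might pick a free port above 10000 and print the matching tunnel command. The helper names `find_free_port` and `tunnel_command` and the `USERNAME` placeholder are illustrative and are not part of the workshop repository.

```python
import random
import socket


def find_free_port(low=10001, high=60000, attempts=100):
    """Return a TCP port in [low, high] that is currently free on this host."""
    for _ in range(attempts):
        port = random.randint(low, high)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", port))  # binding fails if the port is taken
            except OSError:
                continue
            return port
    raise RuntimeError("no free port found")


def tunnel_command(port, user="USERNAME", host="cistib-dgx01.leeds.ac.uk"):
    """Build the ssh command that forwards local `port` to the same port on the DGX."""
    return f"ssh -L {port}:localhost:{port} {user}@{host}"


if __name__ == "__main__":
    port = find_free_port()      # run this on the DGX, before `docker run`
    print(f"PORT={port}")
    print(tunnel_command(port))  # run the printed command on your own machine
```

The check only says the port was free at the moment of the test, so another user could still grab it before the container starts. If `docker run` complains that the port is already allocated, simply pick another one.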

___

## DGX hardware

- **GPU**: 40GB per GPU/320 GB per DGX A100 Node
- **CPU**: 2-socket, 128 core AMD Rome 7742, 2.25 GHz (base), 3.4GHz (Max boost)
- **System Memory**: 1 TB 3200 MHz DDR4.
- **Storage:** 
    - **Default**: 15TB (4x3.84TB gen4 NVME).
    - **Purchased with this machine**: 105 TB drive [PNY 3S-1050](https://www.scan.co.uk/3xs/configurator/3s-1050) AI-optimised storage.

![](figures/DGX_schema.png)

## Multi-Instance GPU (MIG)

(_from [Nvidia MIG user guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)_) 

The new Multi-Instance GPU (MIG) feature allows GPUs (starting with NVIDIA Ampere architecture) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU's compute capacity and therefore users may want to run different workloads in parallel to maximize utilization. 

![](figures/DGX_MIG_partitioning_schemes.png)

Currently, all the GPUs are partitioned into seven 5GB chunks (MIG devices), except GPU 6 which is partitioned into two 20GB MIG devices. Partitioning requires `sudo` so the admins (Ale and Kattia) have to be contacted in order to change this configuration.

Examine the output of `nvidia-smi`:

In [None]:
!nvidia-smi
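
The `GPU_ID:MIG_ID` strings passed to `docker run` can also be read off the output of `nvidia-smi -L`, which lists each GPU followed by its MIG devices. Below is a small sketch of a parser for that listing. The sample text is illustrative only: the UUIDs are made up, the layout is assumed to match what the driver on the DGX prints (it may vary between driver versions), and `list_mig_devices` is a hypothetical helper name.

```python
import re

# Illustrative sample only: the layout mimics `nvidia-smi -L` on a MIG-enabled
# machine, and the UUIDs are made up.
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-aaaa)
  MIG 1g.5gb      Device  0: (UUID: MIG-00000000-bbbb)
  MIG 1g.5gb      Device  1: (UUID: MIG-00000000-cccc)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-dddd)
  MIG 3g.20gb     Device  0: (UUID: MIG-00000000-eeee)
  MIG 3g.20gb     Device  1: (UUID: MIG-00000000-ffff)
"""

GPU_LINE = re.compile(r"^GPU\s+(\d+):")
MIG_LINE = re.compile(r"^\s+MIG\s+(\S+)\s+Device\s+(\d+):")


def list_mig_devices(text):
    """Return (device_string, profile) pairs such as ("6:1", "3g.20gb")."""
    devices = []
    gpu_id = None
    for line in text.splitlines():
        gpu_match = GPU_LINE.match(line)
        if gpu_match:
            gpu_id = gpu_match.group(1)  # remember which GPU the next MIG lines belong to
            continue
        mig_match = MIG_LINE.match(line)
        if mig_match and gpu_id is not None:
            profile, mig_id = mig_match.groups()
            devices.append((f"{gpu_id}:{mig_id}", profile))
    return devices


for device, profile in list_mig_devices(SAMPLE):
    print(device, profile)
```

On the DGX itself, the sample text could be replaced by the real listing, for example `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout`.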

## Using PyTorch

Let's examine the environment. Run this from the command line:

```bash
python -m torch.utils.collect_env
```

#### Set up GPU device in PyTorch

In [None]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print("Torch CUDA available?", torch.cuda.is_available())
device = "cuda:0" if torch.cuda.is_available() else "cpu"

___

### Half-precision floating point

- Uses 16-bit representations for floating point numbers. The default is usually 32 bits.
- Saves memory.
- Speeds up computation.
- Feature also available in Nvidia V100 GPUs (ARC4, Bede, JADE2)
- _Not_ available on P100 or K80 GPUs (ARC3).

![](figures/DGX_16_bit_precision.png)
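
To get a feel for what is traded away with 16 bits, here is a small CPU-only sketch using Python's standard `struct` module, which supports IEEE half precision through the `e` format. The `to_bfloat16` helper is an illustrative emulation that simply truncates a float32 bit pattern; real hardware and PyTorch round to nearest instead, so the last digit can differ.

```python
import struct


def round_trip(value, fmt):
    """Store `value` in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_bfloat16(value):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 pattern (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


print(struct.calcsize("<e"), struct.calcsize("<f"))  # bytes per float16 / float32 value
print(round_trip(0.1, "<e"))   # float16: 10 fraction bits
print(round_trip(0.1, "<f"))   # float32: 23 fraction bits
print(to_bfloat16(0.1))        # bfloat16: 7 fraction bits, float32's exponent range
print(to_bfloat16(70000.0))    # still representable (coarsely) in bfloat16

try:
    struct.pack("<e", 70000.0)  # above the largest float16 value, 65504
except OverflowError as err:
    print("float16 overflow:", err)
```

float16 keeps more fraction bits but has a narrow range, while bfloat16 keeps float32's range at the cost of precision. This is one reason bfloat16 is often preferred for training.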


#### Benchmarking

In [None]:
N = 1000
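# torch.rand fills the matrices with random values; torch.Tensor(N, N) would leave them uninitialised
A32 = torch.rand(N, N, device="cuda")
B32 = torch.rand(N, N, device="cuda")
A16 = torch.rand(N, N, device="cuda", dtype=torch.float16)
B16 = torch.rand(N, N, device="cuda", dtype=torch.float16)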

Let's perform element-wise matrix multiplication ($\mathcal{O}(n^2)$) to compare the execution times:

In [None]:
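# GPU kernels run asynchronously, so synchronise inside the timed statement
%timeit A16 * B16; torch.cuda.synchronize()
%timeit A32 * B32; torch.cuda.synchronize()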

In [None]:
a = A16 * B32

In [None]:
# Mixed inputs are promoted to float32 before the multiplication
%timeit A16 * B32; torch.cuda.synchronize()

In [None]:
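# From here on, A16 and B16 hold bfloat16 values instead of float16
A16 = torch.rand(N, N, device="cuda", dtype=torch.bfloat16)
B16 = torch.rand(N, N, device="cuda", dtype=torch.bfloat16)

%timeit A16 * B16; torch.cuda.synchronize()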

Now, some standard matrix products ($\mathcal{O}(n^3)$):

In [None]:
%timeit torch.mm(A16, B16); torch.cuda.synchronize()
# %timeit torch.mm(A32, B32); torch.cuda.synchronize()
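
The two benchmarks above do very different amounts of work. The sketch below counts the operations for `N = 1000`; the helper names are illustrative, and the matrix-product count assumes the textbook algorithm (n multiplications and n - 1 additions per output element).

```python
def elementwise_mults(n):
    """Multiplications in an element-wise product of two n x n matrices."""
    return n * n


def matmul_flops(n):
    """Floating point operations in a textbook dense n x n matrix product:
    each of the n*n outputs needs n multiplications and n - 1 additions."""
    return n * n * (2 * n - 1)


N = 1000
print(f"element-wise: {elementwise_mults(N):.2e} operations")
print(f"matmul:       {matmul_flops(N):.2e} operations")
print(f"ratio:        {matmul_flops(N) / elementwise_mults(N):.0f}x")
```

Because the element-wise product does so little arithmetic per value loaded, its run time tends to be dominated by memory traffic, whereas the matrix product is where reduced-precision arithmetic units are most likely to pay off.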

### MNIST digit recognition (plain PyTorch)

Let's train a simple CNN MNIST classifier using plain PyTorch.

In [None]:
device = torch.device(device)

In [None]:
from torch import nn
import torch.nn.functional as F
from torchvision import datasets
from torchvision import transforms

mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.Compose([transforms.ToTensor()]))
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=transforms.Compose([transforms.ToTensor()]))

mnist_trainset, mnist_valset = torch.utils.data.random_split(mnist_trainset, [int(0.8 * len(mnist_trainset)), int(0.2 * len(mnist_trainset))])

train_dataloader = torch.utils.data.DataLoader(mnist_trainset, batch_size=256, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(mnist_valset, batch_size=16, shuffle=False)
test_dataloader = torch.utils.data.DataLoader(mnist_testset, batch_size=16, shuffle=False)
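
As a quick sanity check of the split and batch sizes used above, the arithmetic can be worked out without loading any data. This is only a sketch; it relies on the MNIST training set having 60,000 images.

```python
import math

n_total = 60_000                   # size of the MNIST training set
n_train = int(0.8 * n_total)       # images kept for training
n_val = int(0.2 * n_total)         # images held out for validation
assert n_train + n_val == n_total  # random_split requires the lengths to add up

train_batches = math.ceil(n_train / 256)  # batch_size=256 for training
val_batches = math.ceil(n_val / 16)       # batch_size=16 for validation
print(n_train, n_val, train_batches, val_batches)
```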

In [None]:
class MNISTClassifier(nn.Module):

    def __init__(self):
        super(MNISTClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return x
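
The input size of `fc1` (9216) follows from the layer shapes. A short sketch of that arithmetic, assuming 28 x 28 MNIST images and the default padding of 0; `conv_out` is an illustrative helper, not a PyTorch function.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                  # MNIST images are 28 x 28
side = conv_out(side, kernel=3)            # conv1: 3x3, stride 1 -> 26
side = conv_out(side, kernel=3)            # conv2: 3x3, stride 1 -> 24
side = conv_out(side, kernel=2, stride=2)  # max_pool2d(x, 2)     -> 12
features = 64 * side * side                # 64 channels come out of conv2
print(side, features)                      # 12 9216
```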

In [None]:
model = MNISTClassifier().cuda()
ce_loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
import time

In [None]:
from IPython import embed

In [None]:
# N_EPOCHS = 10
# train_loss = list()
# val_loss = list()
# best_val_loss = float("inf")  # so that the first epoch always saves a checkpoint
# 
# for epoch in range(N_EPOCHS):
#     print(epoch)
#     total_train_loss = 0
#     total_val_loss = 0
# 
#     model.train()
#     # training
#     for itr, (image, label) in enumerate(train_dataloader):
#         image = image.to(device)
#         label = label.to(device)
#         optimizer.zero_grad()
#         # embed()
# 
#         pred = model(image)
# 
#         loss = ce_loss(pred, label)
#         total_train_loss += loss.item()
# 
#         loss.backward()
#         optimizer.step()
# 
#     total_train_loss = total_train_loss / (itr + 1)
#     train_loss.append(total_train_loss)
#     
#     # validation (no gradients needed)
#     model.eval()
#     total = 0
#     with torch.no_grad():
#         for itr, (image, label) in enumerate(val_dataloader):
#             image = image.to(device)
#             label = label.to(device)
#             pred = model(image)
#
#             loss = ce_loss(pred, label)
#             total_val_loss += loss.item()
#
#             # count correct predictions (argmax over the class scores)
#             total += (pred.argmax(dim=1) == label).sum().item()
#  
#     accuracy = total / len(mnist_valset)
#  
#     total_val_loss = total_val_loss / (itr + 1)
#     val_loss.append(total_val_loss)
#  
#     timestamp = time.strftime("%H:%M:%S") 
#     print('\n{} - Epoch: {}/{}, Train Loss: {:.8f}, Val Loss: {:.8f}, Val Accuracy: {:.8f}'.format(timestamp, epoch + 1, N_EPOCHS, total_train_loss, total_val_loss, accuracy))
#  
#     if total_val_loss < best_val_loss:
#         best_val_loss = total_val_loss
#         print("Saving the model state dictionary for Epoch: {} with Validation loss: {:.8f}".format(epoch + 1, total_val_loss))
#         torch.save(model.state_dict(), "checkpoints/model.dth")

___

### PyTorch Lightning (PTL)

- Library built on top of PyTorch.
- Gets rid of boilerplate code.
- **Makes hardware capabilities easier to access.**

In [None]:
import pytorch_lightning as ptl

It's built around three key abstractions:
- `ptl.LightningModule`: the model itself plus what to do at each stage (training/validation/testing/inference).
- `ptl.LightningDataModule`: the data plus how to partition and load it.
- `ptl.Trainer`: takes the two objects above and runs the training. Hardware details (devices, precision) are specified through this object.

In [None]:
from my_ptl_callbacks import *

callbacks = [
    early_stopping,
    model_checkpoint,
    rich_model_summary,
    progress_bar
]

Let's import the `LightningModule` and `LightningDataModule` subclasses from the file `MNIST_lightning.py`:

In [None]:
from MNIST_lightning import CNN_Module, MNIST_DataModule 

In [None]:
!nvidia-smi

In [None]:
torch.set_float32_matmul_precision('medium')

In [None]:
BATCH_SIZE = 256
PRECISION = "32" # try 32, 64, "bf16"

datamodule = MNIST_DataModule(batch_size=BATCH_SIZE, split_lengths=[48000, 12000])

ptl_module = CNN_Module(
    model=MNISTClassifier()
)

trainer = ptl.Trainer(
  devices='auto',
  precision=PRECISION,
  callbacks=callbacks
)

In [None]:
datamodule.setup(stage="fit")

In [None]:
trainer.fit(ptl_module, datamodule)