# LEARN Workshop - session 1
_17 March 2023_

## Objectives
- Get on board the DGX A100 machine and explore its features.
- Learn how to build a Docker image with the environment needed for deep learning (DL).
- Brainstorm requirements for the LEARN DL/ML platform (with the help of Jack O'Halloran).

## Contents

- How to access the DGX machine and launch a Docker container.
- Overview of the hardware features.
- How to build a Docker image.

## References

**Hardware**
- [DGX A100 white paper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/dgx-a100/dgxa100-system-architecture-white-paper.pdf)
- [Nvidia A100 Tensor Core GPU paper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

**Multi-Instance GPU (MIG)**
- [MIG user guide from Nvidia](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)

**Docker**
- [Docker basics - how to use Dockerfiles](https://thenewstack.io/docker-basics-how-to-use-dockerfiles/)

**Deep learning training**
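- [Sebastian Raschka: automatic mixed-precision training](https://sebastianraschka.com/blog/2023/pytorch-faster.html#3-automatic-mixed-precision-training)
- [Sebastian Raschka: training on 4 GPUs with distributed data parallel](https://sebastianraschka.com/blog/2023/pytorch-faster.html#5-training-on-4-gpus-with-distributed-data-parallel)
- [Sebastian Raschka: DeepSpeed](https://sebastianraschka.com/blog/2023/pytorch-faster.html#6-deepspeed)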

___

## DGX hardware

- **GPU**: 8x Nvidia A100, 40 GB each (320 GB per DGX A100 node).
- **CPU**: 2 sockets, 128 cores in total (AMD EPYC 7742 "Rome"), 2.25 GHz base, 3.4 GHz max boost.
- **System memory**: 1 TB of 3200 MHz DDR4.
- **Storage**: data cache drives, 15 TB (4 x 3.84 TB Gen4 NVMe).

## Multi-Instance GPU (MIG)

Examine the output of `nvidia-smi`, which lists the GPUs and, when MIG is enabled, the MIG devices:

In [None]:
!nvidia-smi

```bash
python -m torch.utils.collect_env
```

#### Set up GPU device in PyTorch

In [None]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print("Torch CUDA available?", torch.cuda.is_available())
device = "cuda:0" if torch.cuda.is_available() else "cpu"

**How to launch a container that "sees" specific MIG devices?**

```bash
SOURCEPATH="${HOME}/code"
TARGETPATH="/app/code"
DOCKER_IMAGE="rbonazzola/coma:latest"
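# MIG devices are given as <GPU index>:<MIG device index>
DEVICES="0:3,0:4,0:5,0:6,1:0,1:1"

# Docker needs literal double quotes around a comma-separated device list
docker run -it --rm \
  --shm-size=32gb \
  --gpus '"device='"${DEVICES}"'"' \
  --mount type=bind,source="${SOURCEPATH}",target="${TARGETPATH}" \
  "${DOCKER_IMAGE}"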
```
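
The nested quoting in `--gpus` is easy to get wrong in the shell. As a minimal sketch, the same command can be assembled in Python as an argument list, where no shell quoting is involved and only the literal double quotes that Docker expects around a comma-separated device list remain. The helper name `docker_run_args` is hypothetical; the paths, image and device list are simply the values used in the command above.

```python
import os
import shlex


def docker_run_args(image, devices, source, target, shm_size="32gb"):
    """Return the argument list for `docker run` with selected MIG devices.

    `devices` is a list of "<GPU index>:<MIG device index>" strings.
    """
    # Docker needs literal double quotes around a comma-separated device list.
    gpus = '"device={}"'.format(",".join(devices))
    return [
        "docker", "run", "-it", "--rm",
        f"--shm-size={shm_size}",
        "--gpus", gpus,
        "--mount", f"type=bind,source={source},target={target}",
        image,
    ]


args = docker_run_args(
    image="rbonazzola/coma:latest",
    devices=["0:3", "0:4", "0:5", "0:6", "1:0", "1:1"],
    source=os.path.join(os.path.expanduser("~"), "code"),
    target="/app/code",
)
# shlex.join prints the equivalent, correctly quoted shell command.
print(shlex.join(args))
```

On a machine with Docker installed, the list could be passed directly to `subprocess.run(args)`.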

___

### Half-precision floating point

- Uses 16-bit representations for floating-point numbers; the default is usually 32 bits.
- Saves memory.
- Speeds up computation.
- Also available on Nvidia V100 GPUs (ARC4, Bede, JADE2).
- _Not_ available on P100 or K80 GPUs (ARC3).
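
To make the trade-off concrete, here is a small sketch that uses only Python's standard library: the `struct` format `e` is IEEE 754 half precision and `f` is single precision. It shows the halved storage and the coarser rounding. The helper `roundtrip` is a hypothetical name introduced for illustration.

```python
import struct


def roundtrip(value, fmt):
    """Store `value` in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Bytes per number: half precision needs half the memory of single precision.
print(struct.calcsize("e"), struct.calcsize("f"))

# Fewer mantissa bits (10 instead of 23) mean coarser rounding.
print(roundtrip(0.1, "e"))  # 0.0999755859375
print(roundtrip(0.1, "f"))

# Above 2048, float16 can no longer represent every integer.
print(roundtrip(2049.0, "e"))  # 2048.0
```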

#### Benchmarking

In [None]:
N = 1000
# torch.rand returns initialized values; torch.Tensor(N, N) returns uninitialized memory
A32 = torch.rand(N, N, device="cuda")
B32 = torch.rand(N, N, device="cuda")
A16 = A32.to(torch.float16)
B16 = B32.to(torch.float16)

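Let's perform element-wise multiplication (`*`, not the matrix product `@`) and compare the execution times. CUDA kernels run asynchronously, so each timed statement ends with `torch.cuda.synchronize()`:

In [None]:
%timeit A16 * B16; torch.cuda.synchronize()
%timeit A32 * B32; torch.cuda.synchronize()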

In [None]:
# mixed float16/float32 inputs are promoted to float32
%timeit A16 * B32; torch.cuda.synchronize()

In [None]:
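# same comparison with bfloat16 (16 bits, with the exponent range of float32)
A16 = torch.rand(N, N, device="cuda", dtype=torch.bfloat16)
B16 = torch.rand(N, N, device="cuda", dtype=torch.bfloat16)

%timeit A16 * B16; torch.cuda.synchronize()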
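
Element-wise products are mostly limited by memory bandwidth, whereas the Tensor Cores of the A100 accelerate matrix products. A hedged sketch of a matrix-multiplication benchmark follows. `time_matmul` is a hypothetical helper, the matrix size and repeat count are arbitrary choices, and the code falls back to the CPU when CUDA is unavailable (in which case the timings say nothing about the GPU).

```python
import time

import torch


def time_matmul(dtype, n=1024, repeats=10):
    """Return the mean time in seconds of an n x n matrix product."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(n, n, device=device).to(dtype)
    b = torch.rand(n, n, device=device).to(dtype)

    a @ b  # warm-up run, excluded from the timing
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device == "cuda":
        # wait for the queued kernels to finish before stopping the clock
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats


dtypes = [torch.float32, torch.bfloat16]
if torch.cuda.is_available():
    dtypes.append(torch.float16)

for dtype in dtypes:
    print(dtype, f"{time_matmul(dtype) * 1e3:.3f} ms")
```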

### Mixed-precision training

### MNIST digit recognition (plain PyTorch)

In [None]:
torch.__version__

In [None]:
device = torch.device('cuda:0')

In [None]:
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Subset
from torchvision import datasets
from torchvision import transforms

mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())

# Use 90% of the official test set for validation and keep 10% for testing
n_val = int(0.9 * len(mnist_testset))
mnist_valset, mnist_testset = torch.utils.data.random_split(mnist_testset, [n_val, len(mnist_testset) - n_val])

# train_dataloader = torch.utils.data.DataLoader(Subset(mnist_trainset, range(5000)), batch_size=128, shuffle=True)
# val_dataloader = torch.utils.data.DataLoader(Subset(mnist_valset, range(1000)), batch_size=32, shuffle=False)
# test_dataloader = torch.utils.data.DataLoader(Subset(mnist_testset, range(500)), batch_size=32, shuffle=False)

train_dataloader = torch.utils.data.DataLoader(mnist_trainset, batch_size=128, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(mnist_valset, batch_size=32, shuffle=False)
test_dataloader = torch.utils.data.DataLoader(mnist_testset, batch_size=32, shuffle=False)

In [None]:
# train_dataloader.dataset.dataset.data = train_dataloader.dataset.dataset.data.to(device)
# val_dataloader.dataset.dataset.dataset.data = val_dataloader.dataset.dataset.dataset.data.to(device)
# test_dataloader.dataset.dataset.dataset.data = test_dataloader.dataset.dataset.dataset.data.to(device)

In [None]:
class MNISTClassifier(nn.Module):

    def __init__(self):
        super(MNISTClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return x

In [None]:
model = MNISTClassifier().cuda()
ce_loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
import time

In [None]:
no_epochs = 1000
train_loss = list()
val_loss = list()
best_val_loss = float("inf")

for epoch in range(no_epochs):
    total_train_loss = 0
    total_val_loss = 0

    model.train()

    # training
    for itr, (image, label) in enumerate(train_dataloader):
        image = image.to(device)
        label = label.to(device)
        optimizer.zero_grad()

        pred = model(image)

        loss = ce_loss(pred, label)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    total_train_loss = total_train_loss / (itr + 1)
    train_loss.append(total_train_loss)
    
    # validation (no gradients needed)
    model.eval()
    total = 0
    with torch.no_grad():
        for itr, (image, label) in enumerate(val_dataloader):
            image = image.to(device)
            label = label.to(device)
            pred = model(image)

            loss = ce_loss(pred, label)
            total_val_loss += loss.item()

            # the predicted class is the index of the largest output
            total += (pred.argmax(dim=1) == label).sum().item()

    accuracy = total / len(mnist_valset)

    total_val_loss = total_val_loss / (itr + 1)
    val_loss.append(total_val_loss)

    hora = time.strftime("%H:%M:%S") 
    print('\n{} - Epoch: {}/{}, Train Loss: {:.8f}, Val Loss: {:.8f}, Val Accuracy: {:.8f}'.format(hora, epoch + 1, no_epochs, total_train_loss, total_val_loss, accuracy))

    if total_val_loss < best_val_loss:
        best_val_loss = total_val_loss
        print("Saving the model state dictionary for Epoch: {} with Validation loss: {:.8f}".format(epoch + 1, total_val_loss))
        # assumes the checkpoints/ directory already exists
        torch.save(model.state_dict(), "checkpoints/model.dth")

___

### PyTorch Lightning (PTL)

- Library built on top of PyTorch.
- Gets rid of boilerplate code.
- **Makes hardware capabilities easier to access.**

In [None]:
import pytorch_lightning as ptl

It's built around three key abstractions:
- `ptl.LightningModule`: the model itself plus a specification of what to do at each stage (training/validation/testing/inference).
- `ptl.LightningDataModule`: the data plus how to partition and load it.
- `ptl.Trainer`: the object that takes the previous two and performs the training. Hardware details must be specified through this object.

Let's use PTL on the MNIST example:

In [None]:
# import MNIST_module
# import MNIST_datamodule

In [None]:
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks import RichProgressBar
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks.progress.rich_progress import RichProgressBarTheme
from pytorch_lightning.callbacks import RichModelSummary

early_stopping = EarlyStopping(monitor="val_loss", mode="min", patience=20)

model_checkpoint = ModelCheckpoint(monitor='val_loss', save_top_k=1)

# rich_model_summary = RichModelSummary(max_depth=-1)

progress_bar = RichProgressBar(
  theme=RichProgressBarTheme(
    description="green_yellow",
    progress_bar="green1",
    progress_bar_finished="green1",
    progress_bar_pulse="#6206E0",
    batch_progress="green_yellow",
    time="grey82",
    processing_speed="grey82",
    metrics="grey82",
  )
)

callbacks = [
    early_stopping,
    model_checkpoint,
    # rich_model_summary
]

In [None]:
trainer = ptl.Trainer(
  accelerator="gpu",
  devices=1,
  precision="bf16",
  callbacks=callbacks
)

In [None]:
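# ptl_module is a LightningModule instance and datamodule a LightningDataModule instance
trainer.fit(ptl_module, datamodule=datamodule)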

___

# Docker

### Building your own Docker image

**_Note:_** ARC3, ARC4, Bede and JADE have Singularity installed; however, Singularity can run Docker images. Therefore, if you choose the library versions in your Docker image carefully (so that they are compatible with the installed Nvidia drivers), you could in principle use the same image on those platforms.