## Homework report (Roma Nigmatullin)
##### Changes
- Training is launched from a script with a `--custom` flag. To reproduce, copy the commands from the Jupyter cells below.
- Added memory allocation / reservation metrics and training time metrics (total and per epoch).
- Added a validation accuracy metric, aggregated over all processes by applying `all_reduce` to both the numerator and the denominator of the ratio.
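
The reason for reducing the numerator and the denominator separately can be shown with a small sketch. It is plain Python rather than the actual pipeline code: `all_reduce_sum` is a hypothetical stand-in for a sum `all_reduce` across ranks, and the per-rank counts are made-up illustration values.

```python
from typing import List, Tuple


def all_reduce_sum(values: List[float]) -> float:
    """Stand-in for a SUM all_reduce: every rank would end up with this total."""
    return sum(values)


def global_accuracy(per_rank: List[Tuple[int, int]]) -> float:
    """per_rank holds (correct, total) counts, one pair per process."""
    correct = all_reduce_sum([c for c, _ in per_rank])
    total = all_reduce_sum([n for _, n in per_rank])
    return correct / total


# Illustration values only: two ranks with unequal validation shards.
per_rank = [(30, 100), (45, 50)]

acc = global_accuracy(per_rank)
# Averaging per-rank ratios weights both shards equally, which is wrong here.
naive = sum(c / n for c, n in per_rank) / len(per_rank)

print(f"reduced numerator/denominator: {acc:.2f}")
print(f"mean of per-rank ratios:       {naive:.2f}")
```

The two numbers agree only when every rank sees the same number of validation samples, so reducing both counts is the safer default.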

##### Source code
SyncBatchNorm is in `syncbn.py`, tests are in `test_syncbn.py`.
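
As background, the statistic synchronization behind a synchronized batch norm can be sketched in plain Python. This is an assumption about the general technique, not a copy of `syncbn.py`: each rank contributes its local sum, sum of squares and element count, and the global mean and biased variance follow from the reduced totals. The names `local_stats` and `synced_mean_var` are hypothetical.

```python
from typing import List, Tuple


def local_stats(x: List[float]) -> Tuple[float, float, int]:
    """Per-rank sufficient statistics: sum, sum of squares, element count."""
    return sum(x), sum(v * v for v in x), len(x)


def synced_mean_var(shards: List[List[float]]) -> Tuple[float, float]:
    """Combine per-rank statistics as a SUM all_reduce would."""
    stats = [local_stats(s) for s in shards]
    total_sum = sum(s[0] for s in stats)
    total_sq = sum(s[1] for s in stats)
    count = sum(s[2] for s in stats)
    mean = total_sum / count
    var = total_sq / count - mean * mean  # biased variance, E[x^2] - E[x]^2
    return mean, var


# Two ranks with different local batch sizes (illustration values).
shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
mean, var = synced_mean_var(shards)

# Reference: statistics over the concatenated batch.
flat = [v for s in shards for v in s]
ref_mean = sum(flat) / len(flat)
ref_var = sum((v - ref_mean) ** 2 for v in flat) / len(flat)
```

The result matches the statistics of the concatenated batch, which is the property a test of a synchronized batch norm would typically check.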

The pipelines, with all the changes, are in `ddp_cifar100.py` and `ddp_cifar100_benchmark.py`:
- `ddp_cifar100.py` contains dataset and dataloader creation, the model class, and the metric-reducing functions.
- `ddp_cifar100_benchmark.py` contains two pipelines with a training loop and validation, and the runner script with the `--custom` flag.

Benchmarking of implementations of SyncBatchNorm is in `syncbn_benchmark.py`.


In [6]:
# run from week04_data_parallel/homework directory, uv project initialized
# requirements.txt contains all the dependencies
!cat requirements.txt
!echo ''
!python -V

pytest==8.3.4
torch==2.4.0
torchvision==0.19.0


Python 3.12.9


In [7]:
# used GPUs - NVIDIA Quadro RTX 4000 (x2) and 1 process per GPU
!nvidia-smi

Sun Feb 23 13:42:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Quadro RTX 4000                On  |   00000000:02:00.0 Off |                  N/A |
| 30%   29C    P8             16W /  125W |       4MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A4000               On  |   00

### Testing

In [8]:
!uv run pytest

platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/nigmatullinro/efficient-dl-systems/week04_data_parallel/homework
configfile: pyproject.toml
plugins: anyio-4.8.0
collected 16 items

test_syncbn.py ................                                          [100%]



### Running and benchmarking training pipeline

In [15]:
!CUDA_VISIBLE_DEVICES=4,5 uv run torchrun --nproc_per_node 2 ddp_cifar100_benchmark.py --custom

W0223 13:20:48.095000 140558715837312 torch/distributed/run.py:779] 
W0223 13:20:48.095000 140558715837312 torch/distributed/run.py:779] *****************************************
W0223 13:20:48.095000 140558715837312 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0223 13:20:48.095000 140558715837312 torch/distributed/run.py:779] *****************************************
Epoch 0 | train_loss: 4.25703, val_loss: 3.89881, accuracy: 0.15670
Epoch 1 | train_loss: 3.84836, val_loss: 3.63068, accuracy: 0.20520
Epoch 2 | train_loss: 3.62297, val_loss: 3.45377, accuracy: 0.23890
Epoch 3 | train_loss: 3.44466, val_loss: 3.29486, accuracy: 0.26210
Epoch 4 | train_loss: 3.28940, val_loss: 3.15547, accuracy: 0.28300
Epoch 5 | train_loss: 3.15712, val_loss: 3.03495, accuracy: 0.29630
Epoch 6 | train_loss: 3.0

##### Prettified (run with `--custom`)
| Metric | Process 0 | Process 1 |
| -------- | ------- | ------- |
| Max memory allocated (CUDA) | 184 MB | 152 MB |
| Max memory reserved (CUDA) | 195 MB | 220 MB |
| Training time (total) | 332 s | 332 s |
| Training time (per epoch) | 11 s | 11 s |
| Validation accuracy | 0.3942 | 0.3942 |
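
Total and per-epoch times like those in the table can be collected with a small helper. The sketch below is an assumption about one possible approach, not the code from `ddp_cifar100_benchmark.py`. `EpochTimer` is a hypothetical name, and the clock is injectable so the example is deterministic. On a GPU one would usually call `torch.cuda.synchronize()` before reading the clock, and memory figures would come from `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()`.

```python
import time
from typing import Callable, List, Optional


class EpochTimer:
    """Context manager that records the wall-clock duration of each epoch."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._start: Optional[float] = None
        self.epoch_seconds: List[float] = []

    def __enter__(self) -> "EpochTimer":
        self._start = self._clock()
        return self

    def __exit__(self, *exc) -> bool:
        self.epoch_seconds.append(self._clock() - self._start)
        return False  # never swallow exceptions from the training loop

    @property
    def total(self) -> float:
        return sum(self.epoch_seconds)

    @property
    def per_epoch(self) -> float:
        n = len(self.epoch_seconds)
        return self.total / n if n else 0.0


# Deterministic demo with a fake clock (start/end ticks for two epochs).
fake_ticks = iter([0.0, 11.0, 11.0, 22.0])
timer = EpochTimer(clock=lambda: next(fake_ticks))
for _ in range(2):
    with timer:
        pass  # the training and validation steps would go here

print(f"total: {timer.total:.0f} s, per epoch: {timer.per_epoch:.0f} s")
```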


In [16]:
!CUDA_VISIBLE_DEVICES=4,5 uv run torchrun --nproc_per_node 2 ddp_cifar100_benchmark.py

W0223 13:26:37.667000 140640212700032 torch/distributed/run.py:779] 
W0223 13:26:37.667000 140640212700032 torch/distributed/run.py:779] *****************************************
W0223 13:26:37.667000 140640212700032 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0223 13:26:37.667000 140640212700032 torch/distributed/run.py:779] *****************************************
Epoch 0 | train_loss: 0.00000, val_loss: 3.90105, accuracy: 0.15850
Epoch 1 | train_loss: 0.00000, val_loss: 3.63479, accuracy: 0.20360
Epoch 2 | train_loss: 0.00000, val_loss: 3.46039, accuracy: 0.23830
Epoch 3 | train_loss: 0.00000, val_loss: 3.29393, accuracy: 0.26520
Epoch 4 | train_loss: 0.00000, val_loss: 3.16485, accuracy: 0.28240
Epoch 5 | train_loss: 0.00000, val_loss: 3.05015, accuracy: 0.29530
Epoch 6 | train_loss: 0.0

##### Prettified (run without `--custom`)
| Metric | Process 0 | Process 1 |
| -------- | ------- | ------- |
| Max memory allocated (CUDA) | 184 MB | 156 MB |
| Max memory reserved (CUDA) | 195 MB | 197 MB |
| Training time (total) | 272 s | 272 s |
| Training time (per epoch) | 9 s | 9 s |
| Validation accuracy | 0.3953 | 0.3953 |


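The `--custom` run is slower (332 s vs. 272 s in total, 11 s vs. 9 s per epoch), but memory usage and validation accuracy (0.3942 vs. 0.3953) are almost the same, which is not bad.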
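
To put a number on the gap, the totals from the two tables above can be compared directly. The variable names are illustrative, and the only inputs are the values reported in the tables.

```python
# Values copied from the two "Prettified" tables (process 0 columns).
custom_run = {"total_s": 332, "accuracy": 0.3942}   # run with --custom
default_run = {"total_s": 272, "accuracy": 0.3953}  # run without --custom

# Relative wall-clock cost of the --custom run.
slowdown = custom_run["total_s"] / default_run["total_s"]
# Absolute difference in validation accuracy.
accuracy_gap = default_run["accuracy"] - custom_run["accuracy"]

print(f"relative time: {slowdown:.2f}x")
print(f"accuracy gap:  {accuracy_gap:.4f}")
```

That is roughly 22% more wall-clock time for an accuracy difference of about 0.001.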

### Benchmarking implementations of SyncBatchNorm

In [24]:
!CUDA_VISIBLE_DEVICES=4,5 uv run torchrun --nproc_per_node 2 syncbn_benchmark.py

W0223 14:28:36.115000 139821546109824 torch/distributed/run.py:779] 
W0223 14:28:36.115000 139821546109824 torch/distributed/run.py:779] *****************************************
W0223 14:28:36.115000 139821546109824 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0223 14:28:36.115000 139821546109824 torch/distributed/run.py:779] *****************************************
Rank 0
hidden_dim: 128, batch_size: 32
Custom SyncBN: 1.2753 (ms / ep), 2.0972 MB reserved, 0.0568 MB allocated
Torch SyncBN: 1.0504 (ms / ep), 2.0972 MB reserved, 0.0435 MB allocated

hidden_dim: 128, batch_size: 64
Custom SyncBN: 1.2660 (ms / ep), 2.0972 MB reserved, 0.1060 MB allocated
Torch SyncBN: 1.0681 (ms / ep), 2.0972 MB reserved, 0.0763 MB allocated

hidden_dim: 256, batch_size: 32
Custom SyncBN: 1.2758 (ms / ep), 2.097