[Examples] Torch DDP bench #2987

Merged: 11 commits into skypilot-org:master on Jan 17, 2024
Conversation

asaiacai (Contributor) commented:

This adds a Torch DDP benchmarking example, adapted from here.
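For readers skimming the numbers below, here is a minimal sketch of the kind of timing loop such a DDP benchmark runs. This is an illustrative approximation, not the script added in this PR: the real benchmark's model list, percentile reporting, and bucket-size sweep are more elaborate, and the filename ddp_sketch.py in the launch comment is made up for illustration.

```python
# Illustrative sketch only; not the benchmark script added in this PR.
# Times one DDP training iteration of a torchvision model, which is roughly
# what the sec/iter and ex/sec columns in the output below report.
# Example launch (per node): torchrun --nproc_per_node=8 ddp_sketch.py
import os
import time

import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP


def main(model_name="resnet50", batch_size=32, iters=100):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(models.__dict__[model_name]().to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data: the benchmark isolates compute + communication from I/O.
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (batch_size,), device=device)

    times = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = time.perf_counter()
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()  # gradients all-reduced here
        optimizer.step()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    if dist.get_rank() == 0:
        times.sort()
        p50 = times[len(times) // 2]
        print(f"{model_name}: p50 {p50:.3f} sec/iter, "
              f"{batch_size / p50:.0f} ex/sec per GPU")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```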

I've been using this to test inter-node performance. Example output from running this task on GCP is pasted below:

(worker1, rank=1, pid=16369, ip=10.164.0.58) [2024-01-15 23:37:46,764] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
(head, rank=0, pid=17266) [2024-01-15 23:37:46,769] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
(head, rank=0, pid=17266) -----------------------------------
(head, rank=0, pid=17266) PyTorch distributed benchmark suite
(head, rank=0, pid=17266) -----------------------------------
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) * PyTorch version: 2.1.2+cu121
(head, rank=0, pid=17266) * CUDA version: 12.1
(head, rank=0, pid=17266) * Distributed backend: nccl
(head, rank=0, pid=17266) * Maximum bucket size: 25MB
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) --- nvidia-smi topo -m ---
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)       GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity GPU NUMA ID
(head, rank=0, pid=17266) GPU0   X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU1  NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU2  NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU3  NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU4  NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    24-47,72-95     1             N/A
(head, rank=0, pid=17266) GPU5  NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    24-47,72-95     1             N/A
(head, rank=0, pid=17266) GPU6  NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    24-47,72-95     1             N/A
(head, rank=0, pid=17266) GPU7  NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      24-47,72-95     1             N/A
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Legend:
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)   X    = Self
(head, rank=0, pid=17266)   SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
(head, rank=0, pid=17266)   NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
(head, rank=0, pid=17266)   PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
(head, rank=0, pid=17266)   PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
(head, rank=0, pid=17266)   PIX  = Connection traversing at most a single PCIe bridge
(head, rank=0, pid=17266)   NV#  = Connection traversing a bonded set of # NVLinks
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) --------------------------
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnet50 with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.040s     790/s  p75:  0.041s     789/s  p90:  0.041s     789/s  p95:  0.041s     789/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.040s     790/s  p75:  0.040s     790/s  p90:  0.041s     786/s  p95:  0.041s     772/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.041s     781/s  p75:  0.041s     777/s  p90:  0.042s     760/s  p95:  0.046s     690/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.043s     748/s  p75:  0.043s     747/s  p90:  0.043s     746/s  p95:  0.043s     744/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.043s     747/s  p75:  0.043s     745/s  p90:  0.043s     736/s  p95:  0.048s     673/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.049s     650/s  p75:  0.049s     647/s  p90:  0.050s     640/s  p95:  0.051s     631/s
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnet101 with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.064s     501/s  p75:  0.064s     499/s  p90:  0.064s     497/s  p95:  0.065s     491/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.064s     502/s  p75:  0.064s     502/s  p90:  0.064s     502/s  p95:  0.064s     501/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.066s     486/s  p75:  0.066s     486/s  p90:  0.066s     484/s  p95:  0.066s     482/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.068s     468/s  p75:  0.069s     464/s  p90:  0.070s     457/s  p95:  0.077s     417/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.069s     465/s  p75:  0.069s     464/s  p90:  0.069s     463/s  p95:  0.069s     463/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.089s     359/s  p75:  0.090s     356/s  p90:  0.091s     350/s  p95:  0.094s     340/s
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnext50_32x4d with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.051s     625/s  p75:  0.051s     625/s  p90:  0.051s     624/s  p95:  0.051s     624/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.051s     625/s  p75:  0.051s     625/s  p90:  0.051s     625/s  p95:  0.051s     624/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.052s     618/s  p75:  0.052s     618/s  p90:  0.052s     617/s  p95:  0.052s     617/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.053s     598/s  p75:  0.054s     597/s  p90:  0.054s     596/s  p95:  0.054s     594/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.054s     596/s  p75:  0.054s     595/s  p90:  0.054s     592/s  p95:  0.054s     592/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.060s     537/s  p75:  0.060s     536/s  p90:  0.060s     535/s  p95:  0.060s     535/s
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnext101_32x8d with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.129s     248/s  p75:  0.129s     248/s  p90:  0.129s     248/s  p95:  0.129s     248/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.129s     248/s  p75:  0.129s     248/s  p90:  0.129s     248/s  p95:  0.129s     248/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.132s     243/s  p75:  0.132s     242/s  p90:  0.132s     242/s  p95:  0.132s     242/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.132s     242/s  p75:  0.132s     241/s  p90:  0.132s     241/s  p95:  0.132s     241/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.133s     240/s  p75:  0.133s     240/s  p90:  0.133s     240/s  p95:  0.133s     239/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.194s     165/s  p75:  0.194s     164/s  p90:  0.196s     163/s  p95:  0.199s     160/s

@romilbhardwaj (Collaborator) left a comment:

Thanks @asaiacai - I was able to get this to work on L4:8 GPUs. Left some comments.

Review threads: examples/torch_ddp_benchmark.yaml (3 comments), examples/torch_ddp_benchmark.py (2 comments); all outdated and resolved.
@romilbhardwaj changed the title from "Ddp bench" to "[Examples] Torch DDP bench" on Jan 17, 2024.
@romilbhardwaj (Collaborator) left a comment:

Thanks @asaiacai! LGTM!

Review thread: examples/torch_ddp_benchmark/torch_ddp_benchmark.yaml (outdated, resolved).
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@romilbhardwaj merged commit b59ab22 into skypilot-org:master on Jan 17, 2024 (19 checks passed).
@asaiacai deleted the ddp-bench branch on January 17, 2024 at 19:15.