[Examples] Torch DDP bench #2987

Merged: 11 commits into skypilot-org:master on Jan 17, 2024
Conversation

asaiacai (Contributor) commented:

This adds a Torch DDP benchmarking example, adapted from here.
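For readers skimming the numbers below, here is a minimal sketch of the kind of timing loop such a DDP benchmark runs. This is an illustrative approximation, not the script added in this PR: the real benchmark's model list, percentile reporting, and bucket-size sweep are more elaborate, and the filename ddp_sketch.py in the launch comment is made up for illustration.

```python
# Illustrative sketch only; not the benchmark script added in this PR.
# Times one DDP training iteration of a torchvision model, which is roughly
# what the sec/iter and ex/sec columns in the output below report.
# Example launch (per node): torchrun --nproc_per_node=8 ddp_sketch.py
import os
import time

import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP


def main(model_name="resnet50", batch_size=32, iters=100):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(models.__dict__[model_name]().to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data: the benchmark isolates compute + communication from I/O.
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (batch_size,), device=device)

    times = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = time.perf_counter()
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()  # gradients all-reduced here
        optimizer.step()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    if dist.get_rank() == 0:
        times.sort()
        p50 = times[len(times) // 2]
        print(f"{model_name}: p50 {p50:.3f} sec/iter, "
              f"{batch_size / p50:.0f} ex/sec per GPU")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```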

I've been using this to test inter-node performance. Example output from running this task on GCP is pasted below:

(worker1, rank=1, pid=16369, ip=10.164.0.58) [2024-01-15 23:37:46,764] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
(head, rank=0, pid=17266) [2024-01-15 23:37:46,769] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
(head, rank=0, pid=17266) -----------------------------------
(head, rank=0, pid=17266) PyTorch distributed benchmark suite
(head, rank=0, pid=17266) -----------------------------------
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) * PyTorch version: 2.1.2+cu121
(head, rank=0, pid=17266) * CUDA version: 12.1
(head, rank=0, pid=17266) * Distributed backend: nccl
(head, rank=0, pid=17266) * Maximum bucket size: 25MB
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) --- nvidia-smi topo -m ---
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)       GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity GPU NUMA ID
(head, rank=0, pid=17266) GPU0   X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU1  NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU2  NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU3  NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-23,48-71      0             N/A
(head, rank=0, pid=17266) GPU4  NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    24-47,72-95     1             N/A
(head, rank=0, pid=17266) GPU5  NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    24-47,72-95     1             N/A
(head, rank=0, pid=17266) GPU6  NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    24-47,72-95     1             N/A
(head, rank=0, pid=17266) GPU7  NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      24-47,72-95     1             N/A
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Legend:
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)   X    = Self
(head, rank=0, pid=17266)   SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
(head, rank=0, pid=17266)   NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
(head, rank=0, pid=17266)   PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
(head, rank=0, pid=17266)   PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
(head, rank=0, pid=17266)   PIX  = Connection traversing at most a single PCIe bridge
(head, rank=0, pid=17266)   NV#  = Connection traversing a bonded set of # NVLinks
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) --------------------------
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnet50 with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.040s     790/s  p75:  0.041s     789/s  p90:  0.041s     789/s  p95:  0.041s     789/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.040s     790/s  p75:  0.040s     790/s  p90:  0.041s     786/s  p95:  0.041s     772/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.041s     781/s  p75:  0.041s     777/s  p90:  0.042s     760/s  p95:  0.046s     690/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.043s     748/s  p75:  0.043s     747/s  p90:  0.043s     746/s  p95:  0.043s     744/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.043s     747/s  p75:  0.043s     745/s  p90:  0.043s     736/s  p95:  0.048s     673/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.049s     650/s  p75:  0.049s     647/s  p90:  0.050s     640/s  p95:  0.051s     631/s
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnet101 with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.064s     501/s  p75:  0.064s     499/s  p90:  0.064s     497/s  p95:  0.065s     491/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.064s     502/s  p75:  0.064s     502/s  p90:  0.064s     502/s  p95:  0.064s     501/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.066s     486/s  p75:  0.066s     486/s  p90:  0.066s     484/s  p95:  0.066s     482/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.068s     468/s  p75:  0.069s     464/s  p90:  0.070s     457/s  p95:  0.077s     417/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.069s     465/s  p75:  0.069s     464/s  p90:  0.069s     463/s  p95:  0.069s     463/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.089s     359/s  p75:  0.090s     356/s  p90:  0.091s     350/s  p95:  0.094s     340/s
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnext50_32x4d with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.051s     625/s  p75:  0.051s     625/s  p90:  0.051s     624/s  p95:  0.051s     624/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.051s     625/s  p75:  0.051s     625/s  p90:  0.051s     625/s  p95:  0.051s     624/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.052s     618/s  p75:  0.052s     618/s  p90:  0.052s     617/s  p95:  0.052s     617/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.053s     598/s  p75:  0.054s     597/s  p90:  0.054s     596/s  p95:  0.054s     594/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.054s     596/s  p75:  0.054s     595/s  p90:  0.054s     592/s  p95:  0.054s     592/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.060s     537/s  p75:  0.060s     536/s  p90:  0.060s     535/s  p95:  0.060s     535/s
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266) Benchmark: resnext101_32x8d with batch size 32
(head, rank=0, pid=17266) 
(head, rank=0, pid=17266)                             sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
(head, rank=0, pid=17266)    1 GPUs --   no ddp:  p50:  0.129s     248/s  p75:  0.129s     248/s  p90:  0.129s     248/s  p95:  0.129s     248/s
(head, rank=0, pid=17266)    1 GPUs --    1M/1G:  p50:  0.129s     248/s  p75:  0.129s     248/s  p90:  0.129s     248/s  p95:  0.129s     248/s
(head, rank=0, pid=17266)    2 GPUs --    1M/2G:  p50:  0.132s     243/s  p75:  0.132s     242/s  p90:  0.132s     242/s  p95:  0.132s     242/s
(head, rank=0, pid=17266)    4 GPUs --    1M/4G:  p50:  0.132s     242/s  p75:  0.132s     241/s  p90:  0.132s     241/s  p95:  0.132s     241/s
(head, rank=0, pid=17266)    8 GPUs --    1M/8G:  p50:  0.133s     240/s  p75:  0.133s     240/s  p90:  0.133s     240/s  p95:  0.133s     239/s
(head, rank=0, pid=17266)   16 GPUs --    2M/8G:  p50:  0.194s     165/s  p75:  0.194s     164/s  p90:  0.196s     163/s  p95:  0.199s     160/s

@romilbhardwaj (Collaborator) left a comment:

Thanks @asaiacai - I was able to get this to work on L4:8 GPUs. Left some comments.

Review threads: examples/torch_ddp_benchmark.yaml (3 comments), examples/torch_ddp_benchmark.py (2 comments); all outdated and resolved.
@romilbhardwaj changed the title from "Ddp bench" to "[Examples] Torch DDP bench" on Jan 17, 2024.
@romilbhardwaj (Collaborator) left a comment:

Thanks @asaiacai! LGTM!

Review thread: examples/torch_ddp_benchmark/torch_ddp_benchmark.yaml (outdated, resolved).
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@romilbhardwaj merged commit b59ab22 into skypilot-org:master on Jan 17, 2024 (19 checks passed).
@asaiacai deleted the ddp-bench branch on January 17, 2024 at 19:15.