[Core][Distributed] improve p2p access check #4992

youkaichao · 2024-05-22T21:52:28Z

Done

youkaichao · 2024-05-22T22:58:39Z

Previously, we used the following check to test actual p2p access, in case the CUDA driver is broken:

# code partly borrowed from
# https://github.com/turboderp/exllamav2/blob/1c67f97f3d2a968605a9c31ab791a05c85bb7879/exllamav2/compat.py#L10
# License: MIT
def _can_actually_p2p(idx_a, idx_b):
    dev_i = f"cuda:{idx_a}"
    dev_j = f"cuda:{idx_b}"
    a = torch.randn(5, device=dev_i) + 123.0
    b = a.to(dev_j)
    c = b.to(dev_i)
    return torch.all(a == c).cpu().item()

However, PyTorch has somehow fixed the bug, so this check now always returns True, regardless of whether p2p is actually available:

import torch
torch.cuda.can_device_access_peer(0, 1) # False
_can_actually_p2p(0, 1) # True

This was reported in #4770 (comment).
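To make the mismatch easier to spot across all device pairs, a small helper along these lines could tabulate where the driver-reported answer and the round-trip check disagree. This is only a sketch and not code from this PR: the name `find_p2p_disagreements` and the two injected callables are hypothetical, and the stand-in lambdas simply reproduce the behaviour shown for devices 0 and 1.

```python
from itertools import permutations
from typing import Callable, List, Tuple

# A check takes (source device index, destination device index) and
# returns whether p2p access is believed to work for that pair.
PairCheck = Callable[[int, int], bool]


def find_p2p_disagreements(
    num_devices: int,
    reported: PairCheck,
    observed: PairCheck,
) -> List[Tuple[int, int]]:
    """Return the ordered device pairs where the two checks disagree.

    Hypothetical helper for illustration; not part of vLLM.
    """
    mismatches = []
    for src, dst in permutations(range(num_devices), 2):
        if bool(reported(src, dst)) != bool(observed(src, dst)):
            mismatches.append((src, dst))
    return mismatches


if __name__ == "__main__":
    # Stand-ins mimicking the situation described in this PR: the driver
    # reports no peer access, but the round-trip copy check still passes.
    driver_says = lambda i, j: False
    round_trip_says = lambda i, j: True
    print(find_p2p_disagreements(2, driver_says, round_trip_says))
```

On a machine with several GPUs, one would pass `torch.cuda.can_device_access_peer` as `reported` and `_can_actually_p2p` as `observed`. Any pair in the result is one where the round-trip check cannot be trusted as evidence of p2p access.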

youkaichao · 2024-05-22T23:00:04Z

cc @hanzhi713

youkaichao · 2024-05-23T06:57:29Z

@WoosukKwon ready for review

WoosukKwon

LGTM! Thanks for the PR! Left some minor comments.

vllm/distributed/device_communicators/custom_all_reduce.py

vllm/distributed/device_communicators/custom_all_reduce_utils.py

youkaichao · 2024-05-29T06:23:03Z

@WoosukKwon thanks for the very detailed review!

youkaichao · 2024-05-29T07:21:23Z

Since we still don't have CI machines with p2p capability, I tested this PR locally.

cc @simon-mo for nvlink machines.

youkaichao added 6 commits May 22, 2024 14:51

move files

2a54ffb

add files

31d215c

fix format

dc06672

fix import

b7ed666

add verbose comments

51e1f59

enforce CUDA_VISIBLE_DEVICES

fad32de

youkaichao changed the title ~~[WIP][Core][Distributed] improve p2p access check~~ [Core][Distributed] improve p2p access check May 22, 2024

update comments

20324bb

youkaichao requested a review from WoosukKwon May 23, 2024 06:57

youkaichao added 2 commits May 23, 2024 09:51

add nv forum link

c56b38e

Merge branch 'main' into p2p_check

9c7b8b8

WoosukKwon self-assigned this May 28, 2024

WoosukKwon approved these changes May 29, 2024


youkaichao added 6 commits May 28, 2024 23:01

Merge remote-tracking branch 'origin' into p2p_check

f31939d

add type annotation

1c21915

only allow export one function

770e508

fix format

b730c80

use strict assert

89f6bd3

change init

34900d3

fix fork machines

4047484

youkaichao enabled auto-merge (squash) May 29, 2024 07:21

Merge remote-tracking branch 'origin' into p2p_check

dab94e8

youkaichao merged commit 594392d into vllm-project:main May 29, 2024

chengzhi-lu pushed a commit to chengzhi-lu/vllm that referenced this pull request May 29, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

c859d82

youkaichao deleted the p2p_check branch May 29, 2024 15:16

dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 31, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

6bdfb4f

robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 8, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

5bde5ba

joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

b484450

robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jul 14, 2024

[Core][Distributed] improve p2p access check (vllm-project#4992)

3986c3e
