[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) #4159

Merged
merged 2 commits into vllm-project:main on Apr 18, 2024

Conversation

@agt (Contributor) commented Apr 18, 2024

CustomAllreduce requires NVIDIA GPUs to be fully connected via NVLink, and attempts to disable itself when run on incompatible hardware (e.g. >2 PCIe GPUs where only specific pairs of cards are linked).

The current code calls nvmlDeviceGetNvLinkState(), but that method does not actually assess peer connectivity; instead it queries the status of individual NVLink lanes on the current-rank GPU (equivalent to "nvidia-smi nvlink -s -i [unit]"). As a result, CustomAllreduce is intermittently and incorrectly enabled for >2 PCIe GPU configurations, leading to hangs at model loading as discussed in #3974.

The new code calls nvmlDeviceGetP2PStatus() to determine the topology.

FIX #3974
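
For context, here is a minimal sketch of the kind of all-pairs check this change implies (illustrative only, not the actual vLLM diff; the helper name and the device_ids parameter are made up):

import pynvml

def gpus_fully_connected_via_nvlink(device_ids):
    # Hypothetical helper: True only if every pair of the given GPUs reports
    # an OK NVLink P2P status, which is the condition CustomAllreduce needs.
    pynvml.nvmlInit()
    try:
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in device_ids]
        for a, handle_a in enumerate(handles):
            for b, handle_b in enumerate(handles):
                if a == b:
                    continue
                status = pynvml.nvmlDeviceGetP2PStatus(
                    handle_a, handle_b, pynvml.NVML_P2P_CAPS_INDEX_NVLINK)
                if status != pynvml.NVML_P2P_STATUS_OK:
                    return False
        return True
    finally:
        pynvml.nvmlShutdown()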

@youkaichao (Member)

Can you give some pointers to explain the behavior of nvmlDeviceGetNvLinkState and nvmlDeviceGetP2PStatus? Since nvmlDeviceGetNvLinkState has two arguments, it looks surprising to me that it would only check NVLink status on one device.

@agt (Contributor, Author) commented Apr 18, 2024

Can you give some pointers to explain the behavior of nvmlDeviceGetNvLinkState and nvmlDeviceGetP2PStatus? Since nvmlDeviceGetNvLinkState has two arguments, it looks surprising to me that it would only check NVLink status on one device.

Sure - NVIDIA's documentation for nvmlDeviceGetNvLinkState reads:

nvmlReturn_t nvmlDeviceGetNvLinkState ( nvmlDevice_t device, unsigned int  link, nvmlEnableState_t* isActive )
device = The identifier of the target device
link = Specifies the NvLink link to be queried
isActive = return buffer

Description: Retrieves the state of the device's NvLink for the link specified

This function queries the individual NVLink links/channels associated with the specified device. These are the 12 (PCIe) or 18 (SXM) physical lanes that connect that GPU to either its peer (PCIe) or an NVSwitch (SXM), equivalent to "nvidia-smi nvlink -s -i [unit]". (Output of that command is attached as nvidia-smi-nvlink-s.txt, as it's quite long; note that each card reports 18 possible connections, of which only 12 are active, with some differences in link activation among cards that I suspect reflect die-yield variation.)

A "true" nvmlDeviceGetNvLinkState() response is therefore equivalent to "link present" for Ethernet: it indicates that something is connected, but says nothing about the other device's identity.
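
For illustration, a small sketch (assuming pynvml is installed; the helper name is made up) that enumerates lane states for one GPU, which is all this API can tell you:

import pynvml

def active_nvlink_lanes(device_index: int):
    # Illustrative only: list which NVLink lanes are active on one GPU.
    # An active lane means "something is connected", not which peer it reaches.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        max_links = getattr(pynvml, "NVML_NVLINK_MAX_LINKS", 18)
        active = []
        for lane in range(max_links):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, lane)
            except pynvml.NVMLError:
                continue  # lane not present on this GPU, or NVLink unsupported
            if state == pynvml.NVML_FEATURE_ENABLED:
                active.append(lane)
        return active
    finally:
        pynvml.nvmlShutdown()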

Docs for nvmlDeviceGetP2PStatus:

nvmlReturn_t nvmlDeviceGetP2PStatus ( nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuP2PCapsIndex_t p2pIndex, nvmlGpuP2PStatus_t* p2pStatus )
device1 = The first device
device2 = The second device
p2pIndex = p2p Capability Index being looked for between device1 and device2
p2pStatus = return buffer
Description: Retrieve the status for a given p2p capability index between a given pair of GPUs

This method returns concrete information regarding connectivity between the two specified GPUs, which I believe matches the intent of _is_full_nvlink().
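
As a rough illustration of that intent (not vLLM code; the function name is made up), the pairwise status can be dumped as a small matrix, similar in spirit to the NVLink entries of nvidia-smi topo -m:

import pynvml

def print_nvlink_p2p_matrix():
    # Sketch: print pairwise NVLink P2P status for all visible GPUs.
    # "OK" marks pairs where nvmlDeviceGetP2PStatus reports NVML_P2P_STATUS_OK.
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
        for i in range(count):
            row = []
            for j in range(count):
                if i == j:
                    row.append("  X")
                    continue
                status = pynvml.nvmlDeviceGetP2PStatus(
                    handles[i], handles[j], pynvml.NVML_P2P_CAPS_INDEX_NVLINK)
                row.append(" OK" if status == pynvml.NVML_P2P_STATUS_OK else "  .")
            print(f"GPU{i}:" + "".join(row))
    finally:
        pynvml.nvmlShutdown()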

@youkaichao (Member)

This should be a release blocker. cc @simon-mo

@youkaichao (Member)

Also cc @hanzhi713 FYI

@simon-mo (Collaborator)

ok what's the merge plan?

@simon-mo mentioned this pull request on Apr 18, 2024
@youkaichao (Member) left a comment

Nice catch!

@youkaichao (Member)

I can confirm this PR's fix is correct. I tested it on a V100 DGX-1 machine, with 6 NVLinks per device. The topology is:

[topology image attached]

test code:

import pynvml
from contextlib import contextmanager

@contextmanager
def _nvml():
    # Initialize NVML for the duration of the call, then shut it down.
    try:
        pynvml.nvmlInit()
        yield
    finally:
        pynvml.nvmlShutdown()

@_nvml()
def check_link_state(device: int, link_lane: int):
    # Queries a single NVLink lane on one GPU (what the old code relied on).
    device = pynvml.nvmlDeviceGetHandleByIndex(device)
    return pynvml.nvmlDeviceGetNvLinkState(device, link_lane)

@_nvml()
def check_p2p_nvlink(device: int, target: int):
    # Queries NVLink P2P status between a pair of GPUs (what this PR uses).
    device = pynvml.nvmlDeviceGetHandleByIndex(device)
    target = pynvml.nvmlDeviceGetHandleByIndex(target)
    return pynvml.nvmlDeviceGetP2PStatus(
        device, target, pynvml.NVML_P2P_CAPS_INDEX_NVLINK
    ) == pynvml.NVML_P2P_STATUS_OK

check_p2p_nvlink(6, 7)  # returns True
check_link_state(6, 6)  # errors, because device 6 only has 6 NVLink lanes (indices 0-5)
check_link_state(6, 5)  # returns True

@youkaichao (Member)

ok what's the merge plan?

I will merge it after the test is good.

@youkaichao merged commit 8f9c28f into vllm-project:main on Apr 18, 2024
46 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 19, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 21, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Successfully merging this pull request may close these issues:

[Bug]: LLM is not getting loaded on multiple GPUs but works fine on a single GPU