[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) #4159

Merged
merged 2 commits into vllm-project:main on Apr 18, 2024

Conversation

@agt (Contributor) commented Apr 18, 2024

CustomAllreduce requires NVIDIA GPUs to be fully connected via NVLink, and attempts to disable itself when run on incompatible hardware (e.g. >2 PCIe GPUs where only specific pairs of cards are linked).

The current code calls nvmlDeviceGetNvLinkState(), but that method does not actually assess peer connectivity; instead it queries the status of individual NVLink lanes on the current-rank GPU (equivalent to "nvidia-smi nvlink -s -i [unit]"). As a result, CustomAllreduce is intermittently and incorrectly enabled for >2 PCIe GPU configurations, leading to hangs at model loading as discussed in #3974.

The new code calls nvmlDeviceGetP2PStatus() to determine the topology.

FIX #3974
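
For context, here is a minimal sketch of the kind of all-pairs check this change implies (illustrative only, not the actual vLLM diff; the helper name and the device_ids parameter are made up):

import pynvml

def gpus_fully_connected_via_nvlink(device_ids):
    # Hypothetical helper: True only if every pair of the given GPUs reports
    # an OK NVLink P2P status, which is the condition CustomAllreduce needs.
    pynvml.nvmlInit()
    try:
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in device_ids]
        for a, handle_a in enumerate(handles):
            for b, handle_b in enumerate(handles):
                if a == b:
                    continue
                status = pynvml.nvmlDeviceGetP2PStatus(
                    handle_a, handle_b, pynvml.NVML_P2P_CAPS_INDEX_NVLINK)
                if status != pynvml.NVML_P2P_STATUS_OK:
                    return False
        return True
    finally:
        pynvml.nvmlShutdown()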

@youkaichao (Member)

Can you give some pointers to explain the behavior of nvmlDeviceGetNvLinkState and nvmlDeviceGetP2PStatus? Since nvmlDeviceGetNvLinkState has two arguments, it looks surprising to me that it would only check NVLink status on one device.

@agt (Contributor, Author) commented Apr 18, 2024

Can you give some pointers to explain the behavior of nvmlDeviceGetNvLinkState and nvmlDeviceGetP2PStatus? Since nvmlDeviceGetNvLinkState has two arguments, it looks surprising to me that it would only check NVLink status on one device.

Sure - NVIDIA's documentation for nvmlDeviceGetNvLinkState reads:

nvmlReturn_t nvmlDeviceGetNvLinkState ( nvmlDevice_t device, unsigned int  link, nvmlEnableState_t* isActive )
device = The identifier of the target device
link = Specifies the NvLink link to be queried
isActive = return buffer

Description: Retrieves the state of the device's NvLink for the link specified

This function queries the individual NVLink links/channels associated with the specified device. These are the 12 (PCIe) or 18 (SXM) physical lanes that connect that GPU to either its peer (PCIe) or an NVSwitch (SXM), equivalent to "nvidia-smi nvlink -s -i [unit]". (Output of that command is attached as nvidia-smi-nvlink-s.txt, as it's quite long; note that each card reports 18 possible connections, of which only 12 are active, with some differences in link activation among cards that I suspect reflect die-yield variation.)

A "true" nvmlDeviceGetNvLinkState() response is therefore equivalent to "link present" for Ethernet: it indicates that something is connected, but says nothing about the other device's identity.
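
For illustration, a small sketch (assuming pynvml is installed; the helper name is made up) that enumerates lane states for one GPU, which is all this API can tell you:

import pynvml

def active_nvlink_lanes(device_index: int):
    # Illustrative only: list which NVLink lanes are active on one GPU.
    # An active lane means "something is connected", not which peer it reaches.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        max_links = getattr(pynvml, "NVML_NVLINK_MAX_LINKS", 18)
        active = []
        for lane in range(max_links):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, lane)
            except pynvml.NVMLError:
                continue  # lane not present on this GPU, or NVLink unsupported
            if state == pynvml.NVML_FEATURE_ENABLED:
                active.append(lane)
        return active
    finally:
        pynvml.nvmlShutdown()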

Docs for nvmlDeviceGetP2PStatus:

nvmlReturn_t nvmlDeviceGetP2PStatus ( nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuP2PCapsIndex_t p2pIndex, nvmlGpuP2PStatus_t* p2pStatus )
device1 = The first device
device2 = The second device
p2pIndex = p2p Capability Index being looked for between device1 and device2
p2pStatus = return buffer
Description: Retrieve the status for a given p2p capability index between a given pair of GPUs

This method returns concrete information regarding connectivity between the two specified GPUs, which I believe matches the intent of _is_full_nvlink().
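
As a rough illustration of that intent (not vLLM code; the function name is made up), the pairwise status can be dumped as a small matrix, similar in spirit to the NVLink entries of nvidia-smi topo -m:

import pynvml

def print_nvlink_p2p_matrix():
    # Sketch: print pairwise NVLink P2P status for all visible GPUs.
    # "OK" marks pairs where nvmlDeviceGetP2PStatus reports NVML_P2P_STATUS_OK.
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
        for i in range(count):
            row = []
            for j in range(count):
                if i == j:
                    row.append("  X")
                    continue
                status = pynvml.nvmlDeviceGetP2PStatus(
                    handles[i], handles[j], pynvml.NVML_P2P_CAPS_INDEX_NVLINK)
                row.append(" OK" if status == pynvml.NVML_P2P_STATUS_OK else "  .")
            print(f"GPU{i}:" + "".join(row))
    finally:
        pynvml.nvmlShutdown()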

@youkaichao (Member)

This should be a release blocker. cc @simon-mo

@youkaichao (Member)

Also cc @hanzhi713 FYI

@simon-mo (Collaborator)

ok what's the merge plan?

@simon-mo mentioned this pull request on Apr 18, 2024
@youkaichao (Member) left a comment

Nice catch!

@youkaichao (Member)

I can confirm this PR's fix is correct. I tested it on a V100 DGX-1 machine, with 6 NVLinks per device. The topology is:

[topology image attached]

test code:

import pynvml
from contextlib import contextmanager

@contextmanager
def _nvml():
    # Initialize NVML for the duration of the call, then shut it down.
    try:
        pynvml.nvmlInit()
        yield
    finally:
        pynvml.nvmlShutdown()

@_nvml()
def check_link_state(device: int, link_lane: int):
    # Queries a single NVLink lane on one GPU (what the old code relied on).
    device = pynvml.nvmlDeviceGetHandleByIndex(device)
    return pynvml.nvmlDeviceGetNvLinkState(device, link_lane)

@_nvml()
def check_p2p_nvlink(device: int, target: int):
    # Queries NVLink P2P status between a pair of GPUs (what this PR uses).
    device = pynvml.nvmlDeviceGetHandleByIndex(device)
    target = pynvml.nvmlDeviceGetHandleByIndex(target)
    return pynvml.nvmlDeviceGetP2PStatus(
        device, target, pynvml.NVML_P2P_CAPS_INDEX_NVLINK
    ) == pynvml.NVML_P2P_STATUS_OK

check_p2p_nvlink(6, 7)  # returns True
check_link_state(6, 6)  # errors, because device 6 only has 6 NVLink lanes (indices 0-5)
check_link_state(6, 5)  # returns True

@youkaichao (Member)

ok what's the merge plan?

I will merge it after the test is good.

@youkaichao merged commit 8f9c28f into vllm-project:main on Apr 18, 2024
46 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 19, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 21, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Successfully merging this pull request may close these issues:

[Bug]: LLM is not getting loaded on multiple GPUs but works fine on a single GPU