
Conversation

@youkaichao (Collaborator) commented Jun 16, 2025

Improving upon #130890 and inspired by #130890 (comment), we can automatically use the fabric handle for IPC when possible.

cc @ptrblck @msaroufim @eqy @jerryzh168
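
For context, the fabric-handle export/import flow looks roughly like the following (a sketch only; the wrapper names mirror the DriverAPI pattern used in the diff, and the actual PR code differs in details):

// Export side: unlike a posix fd, a fabric handle is a plain struct of
// bytes, so it can be sent over any byte channel instead of requiring
// SCM_RIGHTS / pidfd_getfd fd passing between processes.
CUmemFabricHandle fabric_handle;
C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemExportToShareableHandle_(
    &fabric_handle, handle.handle, CU_MEM_HANDLE_TYPE_FABRIC, 0));

// Import side, in the receiving process.
// cuMemImportFromShareableHandle_ is assumed to follow the same wrapper pattern.
CUmemGenericAllocationHandle imported;
C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemImportFromShareableHandle_(
    &imported, &fabric_handle, CU_MEM_HANDLE_TYPE_FABRIC));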

@pytorch-bot bot commented Jun 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156074

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2e6ad5c with merge base 655b3b1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@youkaichao (Collaborator, Author) commented:

I tested the functionality locally, but I don't know how to test it in CI. We need to enable the imex channel to test the functionality.

@youkaichao added labels: module: cuda, topic: not user facing (Jun 16, 2025)
C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemExportToShareableHandle_(
    &fd, handle.handle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0));
handle.fd = fd;
TORCH_WARN_ONCE("use posix fd to share expandable segments.");
@youkaichao (Collaborator, Author):

If possible, I'd like to add some debug-level logging, but I don't know if it is possible in PyTorch C++.

Collaborator:

There's C10_LOG_API_USAGE_ONCE.

Collaborator:

Regular LOG(INFO) also works, with TORCH_CPP_LOG_LEVEL=INFO.

@youkaichao (Collaborator, Author):

There's no debug-level logging (turned off by default). Is it acceptable to put it at ERROR level? It's not an error, though.

Or do you think making it INFO level is acceptable?

@youkaichao (Collaborator, Author):

Right now users see:

/data/youkaichao/pytorch/torch/storage.py:1445: UserWarning: use posix fd to share expandable segments. (Triggered internally at /data/youkaichao/pytorch/c10/cuda/CUDACachingAllocator.cpp:511.)
  return self._untyped_storage._share_cuda_(*args, **kwargs)

When I use C10_LOG_API_USAGE_ONCE, I don't see any output.

Okay, it looks like LOG(INFO) does not print by default; it only shows when setting TORCH_CPP_LOG_LEVEL=INFO, so this is fine.
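
For reference, switching the one-time warning to info-level logging would look roughly like this (a sketch; not necessarily the exact change that later landed in 61cadf7):

// Silent by default; printed only when TORCH_CPP_LOG_LEVEL=INFO is set.
LOG(INFO) << "using posix fd to share expandable segments";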

@youkaichao (Collaborator, Author):

Fixed in 61cadf7.

@youkaichao changed the title from "Improve IPC for Expandable Segments to use fabric handle" to "Improve IPC for Expandable Segments to use fabric handle when possible" (Jun 16, 2025)
@youkaichao (Collaborator, Author):

cc @nataliakliushkina

@youkaichao (Collaborator, Author) left a comment:

Technically it is possible that one process imports a fabric handle but allocates with a posix fd handle, but I guess that's not useful. To simplify the code, I assume all processes use the same handle type, either fabric handle or posix fd handle.

@ngimel (Collaborator) left a comment:

Looks good, small comments.

// is not supported, return a null range to indicate it. and clear the
// error by calling cuGetErrorString_.
const char* error_string = nullptr;
DriverAPI::get()->cuGetErrorString_(status, &error_string);
Collaborator:

Does cuGetErrorString clear the error? Docs are silent on this.

@malfet (Contributor) commented Jun 16, 2025:

No, it should not; it's just a table lookup (unless the error code you are passing to it is undefined :) )

@youkaichao (Collaborator, Author):

Good catch, fixed in 2e6ad5c.

size_t max_handles_;
struct Handle {
  CUmemGenericAllocationHandle handle;
  std::optional<int> fd;
Collaborator:

Consider using std::variant here, as only one of fd/fabric_handle can have a value.
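
A minimal sketch of that layout (assuming the fabric alternative is CUmemFabricHandle and reusing the shareable_handle name that appears in a later snippet):

#include <optional>
#include <variant>

struct Handle {
  CUmemGenericAllocationHandle handle;
  // At most one shareable form exists once exported: either a posix fd
  // or a fabric handle, never both.
  std::optional<std::variant<int, CUmemFabricHandle>> shareable_handle;
};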

@youkaichao (Collaborator, Author):

Fixed in c2d111e.

@youkaichao (Collaborator, Author):

@malfet @ngimel all comments should be addressed now, please take a look again, thanks!

@youkaichao requested review from malfet and ngimel, June 17, 2025 01:22
@ngimel (Collaborator) left a comment:

@malfet can you help with CI config so this can be tested?

// CUDA_ERROR_NOT_PERMITTED to be safe, any non out-of-memory error is
// considered as the handle type is not supported. if the handle type
// is not supported, return a null range to indicate it.
return SegmentRange(nullptr, 0);
Collaborator:

This won't clear the error though, so would the next error check return false?
Also, what happens if allocation succeeds (it requires just the current driver version, IIUC), but a subsequent handle exchange fails due to missing permissions?
@malfet is there a way to check if imex permissions are set correctly?

@youkaichao (Collaborator, Author):

> what happens if allocation succeeds (it requires just the current driver version, IIUC)

It requires the imex channel to be set up, so allocation succeeds <==> handle exchange succeeds.

@youkaichao (Collaborator, Author):

I checked the error handling doc https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__ERROR.html; it does not say anything about clearing the error. I think a driver API error is just returned, without affecting the following call.

@youkaichao (Collaborator, Author):

I think this is demonstrated by the passing tests: they don't have the imex channels set up, so this driver API call will fail, but later calls are still successful.
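
For reference, the fallback pattern under discussion is roughly the following (a sketch, not the actual PR code; device and segment_size are placeholder names, and the wrapper follows the DriverAPI naming used elsewhere in the file):

// device and segment_size are placeholders for the segment being created.
CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = device;
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
CUmemGenericAllocationHandle alloc_handle{};
CUresult status =
    DriverAPI::get()->cuMemCreate_(&alloc_handle, segment_size, &prop, 0);
if (status != CUDA_SUCCESS && status != CUDA_ERROR_OUT_OF_MEMORY) {
  // Any non-OOM failure (e.g. no imex channel) means fabric handles are
  // unsupported. The error is only a return value, so subsequent driver
  // calls are unaffected and the caller falls back to posix fds.
  return SegmentRange(nullptr, 0);
}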

@youkaichao (Collaborator, Author):

> @malfet can you help with CI config so this can be tested?

@ngimel is it testable in CI? We need to create imex channels and remove them to cover multiple cases, which requires sudo access and machines with NVSwitch.

@colesbury added the triaged label Jun 17, 2025
@ngimel (Collaborator) commented Jun 17, 2025:

cc @malfet for CI testing, we have H100 runners that have nvswitch.

@ngimel (Collaborator) left a comment:

Approving the PR; I'd wait for @malfet for help with CI testing.

@youkaichao (Collaborator, Author):

> cc @malfet for CI testing, we have H100 runners that have nvswitch.

We also need to make sure imex channels are set up correctly.

if (CUDAAllocatorConfig::expandable_segments_handle_type() !=
    Expandable_Segments_Handle_Type::FABRIC_HANDLE) {
  if (!handle.shareable_handle) {
    int fd = 0;
Contributor:

Please initialize it to -1; otherwise, if the API call for whatever reason is a no-op, one will be trying to use stdin.

Suggested change:
-    int fd = 0;
+    int fd = -1;

Comment on lines +561 to +562
for (auto i : c10::irange(header.num_handles)) {
  (void)i;
Contributor:

Suggested change:
-for (auto i : c10::irange(header.num_handles)) {
-  (void)i;
+for ([[maybe_unused]] auto i : c10::irange(header.num_handles)) {

TORCH_CHECK(pidfd != -1, "pidfd_open:", c10::utils::str_error(errno));
for (auto i : c10::irange(header.num_handles)) {
  (void)i;
  int fd = 0;
Contributor:

Same as above.

Suggested change:
-  int fd = 0;
+  int fd = -1;

for (auto i : c10::irange(header.num_handles)) {
  (void)i;
  int fd = 0;
  buf.read((char*)&fd, sizeof(int));
Contributor:

Don't you want to check that the read call is successful?
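
A minimal sketch of such a check, assuming buf is a std::istream-like buffer as in the surrounding code:

int fd = -1;
buf.read(reinterpret_cast<char*>(&fd), sizeof(int));
// gcount() reports how many bytes the last unformatted read extracted.
TORCH_CHECK(
    buf.gcount() == static_cast<std::streamsize>(sizeof(int)),
    "failed to read fd from ipc buffer");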

Comment on lines +593 to +594
for (auto i : c10::irange(header.num_handles)) {
  (void)i;
Contributor:

Suggested change:
-for (auto i : c10::irange(header.num_handles)) {
-  (void)i;
+for ([[maybe_unused]] auto i : c10::irange(header.num_handles)) {

@malfet (Contributor) left a comment:

Sorry, ignore my comments; those are pre-existing glitches (will fix after you land this change).

@ngimel (Collaborator) commented Jun 17, 2025:

@youkaichao let's land this and figure out testing later

@youkaichao (Collaborator, Author):

@malfet @ngimel thanks for the review!

@youkaichao (Collaborator, Author):

@pytorchmergebot merge

pytorch-bot bot added the ciflow/trunk label Jun 18, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@youkaichao deleted the fabric branch June 18, 2025 16:00
pytorchmergebot pushed a commit that referenced this pull request Jun 19, 2025:

#156074 adds support for IPC with the fabric handle, but the code cannot compile for CUDA < 12.3 (in particular, e.g., CUDA 11.8).

This PR improves the support by adding some compile-time checks against the CUDA version.
Pull Request resolved: #156394
Approved by: https://github.com/ngimel
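
The guard that follow-up implies is roughly of this shape (a sketch, not the actual diff in #156394):

#if defined(CUDA_VERSION) && CUDA_VERSION >= 12030
  // CU_MEM_HANDLE_TYPE_FABRIC and CUmemFabricHandle only exist in
  // CUDA 12.3+, so the fabric-handle IPC path is compiled in here.
#else
  // Older toolkits (e.g. CUDA 11.8) fall back to the posix fd path only.
#endif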