
Conversation

@youkaichao (Collaborator) commented Jun 16, 2025

Improving upon #130890 and inspired by #130890 (comment), we can automatically use the fabric handle for IPC when possible.

cc @ptrblck @msaroufim @eqy @jerryzh168
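
For context, the fabric-handle export/import flow looks roughly like the following (a sketch only; the wrapper names mirror the DriverAPI pattern used in the diff, and the actual PR code differs in details):

// Export side: unlike a posix fd, a fabric handle is a plain struct of
// bytes, so it can be sent over any byte channel instead of requiring
// SCM_RIGHTS / pidfd_getfd fd passing between processes.
CUmemFabricHandle fabric_handle;
C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemExportToShareableHandle_(
    &fabric_handle, handle.handle, CU_MEM_HANDLE_TYPE_FABRIC, 0));

// Import side, in the receiving process.
// cuMemImportFromShareableHandle_ is assumed to follow the same wrapper pattern.
CUmemGenericAllocationHandle imported;
C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemImportFromShareableHandle_(
    &imported, &fabric_handle, CU_MEM_HANDLE_TYPE_FABRIC));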

@pytorch-bot bot commented Jun 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156074

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2e6ad5c with merge base 655b3b1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@youkaichao (Collaborator, Author) commented:

I tested the functionality locally, but I don't know how to test it in CI. We need to enable the imex channel to test the functionality.

@youkaichao added labels: module: cuda, topic: not user facing (Jun 16, 2025)
C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemExportToShareableHandle_(
    &fd, handle.handle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0));
handle.fd = fd;
TORCH_WARN_ONCE("use posix fd to share expandable segments.");
@youkaichao (Collaborator, Author):

If possible, I'd like to add some debug-level logging, but I don't know if it is possible in PyTorch C++.

Collaborator:

There's C10_LOG_API_USAGE_ONCE.

Collaborator:

Regular LOG(INFO) also works, with TORCH_CPP_LOG_LEVEL=INFO.

@youkaichao (Collaborator, Author):

There's no debug-level logging (turned off by default). Is it acceptable to put it at ERROR level? It's not an error, though.

Or do you think making it INFO level is acceptable?

@youkaichao (Collaborator, Author):

Right now users see:

/data/youkaichao/pytorch/torch/storage.py:1445: UserWarning: use posix fd to share expandable segments. (Triggered internally at /data/youkaichao/pytorch/c10/cuda/CUDACachingAllocator.cpp:511.)
  return self._untyped_storage._share_cuda_(*args, **kwargs)

When I use C10_LOG_API_USAGE_ONCE, I don't see any output.

Okay, it looks like LOG(INFO) does not print by default; it only shows when setting TORCH_CPP_LOG_LEVEL=INFO, so this is fine.
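
For reference, switching the one-time warning to info-level logging would look roughly like this (a sketch; not necessarily the exact change that later landed in 61cadf7):

// Silent by default; printed only when TORCH_CPP_LOG_LEVEL=INFO is set.
LOG(INFO) << "using posix fd to share expandable segments";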

@youkaichao (Collaborator, Author):

Fixed in 61cadf7.

@youkaichao changed the title from "Improve IPC for Expandable Segments to use fabric handle" to "Improve IPC for Expandable Segments to use fabric handle when possible" (Jun 16, 2025)
@youkaichao (Collaborator, Author):

cc @nataliakliushkina

@youkaichao (Collaborator, Author) left a comment:

Technically it is possible that one process imports a fabric handle but allocates with a posix fd handle, but I guess that's not useful. To simplify the code, I assume all processes use the same handle type, either fabric handle or posix fd handle.

@ngimel (Collaborator) left a comment:

Looks good, small comments.

// is not supported, return a null range to indicate it. and clear the
// error by calling cuGetErrorString_.
const char* error_string = nullptr;
DriverAPI::get()->cuGetErrorString_(status, &error_string);
Collaborator:

Does cuGetErrorString clear the error? Docs are silent on this.

@malfet (Contributor) commented Jun 16, 2025:

No, it should not; it's just a table lookup (unless the error code you are passing to it is undefined :) )

@youkaichao (Collaborator, Author):

Good catch, fixed in 2e6ad5c.

size_t max_handles_;
struct Handle {
  CUmemGenericAllocationHandle handle;
  std::optional<int> fd;
Collaborator:

Consider using std::variant here, as only one of fd/fabric_handle can have a value.
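
A minimal sketch of that layout (assuming the fabric alternative is CUmemFabricHandle and reusing the shareable_handle name that appears in a later snippet):

#include <optional>
#include <variant>

struct Handle {
  CUmemGenericAllocationHandle handle;
  // At most one shareable form exists once exported: either a posix fd
  // or a fabric handle, never both.
  std::optional<std::variant<int, CUmemFabricHandle>> shareable_handle;
};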

@youkaichao (Collaborator, Author):

Fixed in c2d111e.

@youkaichao (Collaborator, Author):

@malfet @ngimel all comments should be addressed now, please take a look again, thanks!

@youkaichao requested review from malfet and ngimel, June 17, 2025 01:22
@ngimel (Collaborator) left a comment:

@malfet can you help with CI config so this can be tested?

// CUDA_ERROR_NOT_PERMITTED to be safe, any non out-of-memory error is
// considered as the handle type is not supported. if the handle type
// is not supported, return a null range to indicate it.
return SegmentRange(nullptr, 0);
Collaborator:

This won't clear the error though, so would the next error check return false?
Also, what happens if allocation succeeds (it requires just the current driver version, IIUC), but a subsequent handle exchange fails due to missing permissions?
@malfet is there a way to check if imex permissions are set correctly?

@youkaichao (Collaborator, Author):

> what happens if allocation succeeds (it requires just the current driver version, IIUC)

It requires the imex channel to be set up, so allocation succeeds <==> handle exchange succeeds.

@youkaichao (Collaborator, Author):

I checked the error handling doc https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__ERROR.html; it does not say anything about clearing the error. I think a driver API error is just returned, without affecting the following call.

@youkaichao (Collaborator, Author):

I think this is demonstrated by the passing tests: they don't have the imex channels set up, so this driver API call will fail, but later calls are still successful.
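
For reference, the fallback pattern under discussion is roughly the following (a sketch, not the actual PR code; device and segment_size are placeholder names, and the wrapper follows the DriverAPI naming used elsewhere in the file):

// device and segment_size are placeholders for the segment being created.
CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = device;
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
CUmemGenericAllocationHandle alloc_handle{};
CUresult status =
    DriverAPI::get()->cuMemCreate_(&alloc_handle, segment_size, &prop, 0);
if (status != CUDA_SUCCESS && status != CUDA_ERROR_OUT_OF_MEMORY) {
  // Any non-OOM failure (e.g. no imex channel) means fabric handles are
  // unsupported. The error is only a return value, so subsequent driver
  // calls are unaffected and the caller falls back to posix fds.
  return SegmentRange(nullptr, 0);
}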

@youkaichao (Collaborator, Author):

> @malfet can you help with CI config so this can be tested?

@ngimel is it testable in CI? We need to create imex channels and remove them to cover multiple cases, which requires sudo access and machines with NVSwitch.

@colesbury added the triaged label Jun 17, 2025
@ngimel (Collaborator) commented Jun 17, 2025:

cc @malfet for CI testing, we have H100 runners that have nvswitch.

@ngimel (Collaborator) left a comment:

Approving the PR; I'd wait for @malfet for help with CI testing.

@youkaichao (Collaborator, Author):

> cc @malfet for CI testing, we have H100 runners that have nvswitch.

We also need to make sure imex channels are set up correctly.

if (CUDAAllocatorConfig::expandable_segments_handle_type() !=
    Expandable_Segments_Handle_Type::FABRIC_HANDLE) {
  if (!handle.shareable_handle) {
    int fd = 0;
Contributor:

Please initialize it to -1; otherwise, if the API call for whatever reason is a no-op, one will be trying to use stdin.

Suggested change:
-    int fd = 0;
+    int fd = -1;

Comment on lines +561 to +562
for (auto i : c10::irange(header.num_handles)) {
  (void)i;
Contributor:

Suggested change:
-for (auto i : c10::irange(header.num_handles)) {
-  (void)i;
+for ([[maybe_unused]] auto i : c10::irange(header.num_handles)) {

TORCH_CHECK(pidfd != -1, "pidfd_open:", c10::utils::str_error(errno));
for (auto i : c10::irange(header.num_handles)) {
  (void)i;
  int fd = 0;
Contributor:

Same as above.

Suggested change:
-  int fd = 0;
+  int fd = -1;

for (auto i : c10::irange(header.num_handles)) {
  (void)i;
  int fd = 0;
  buf.read((char*)&fd, sizeof(int));
Contributor:

Don't you want to check that the read call is successful?
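
A minimal sketch of such a check, assuming buf is a std::istream-like buffer as in the surrounding code:

int fd = -1;
buf.read(reinterpret_cast<char*>(&fd), sizeof(int));
// gcount() reports how many bytes the last unformatted read extracted.
TORCH_CHECK(
    buf.gcount() == static_cast<std::streamsize>(sizeof(int)),
    "failed to read fd from ipc buffer");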

Comment on lines +593 to +594
for (auto i : c10::irange(header.num_handles)) {
  (void)i;
Contributor:

Suggested change:
-for (auto i : c10::irange(header.num_handles)) {
-  (void)i;
+for ([[maybe_unused]] auto i : c10::irange(header.num_handles)) {

@malfet (Contributor) left a comment:

Sorry, ignore my comments; those are pre-existing glitches (will fix after you land this change).

@ngimel (Collaborator) commented Jun 17, 2025:

@youkaichao let's land this and figure out testing later

@youkaichao (Collaborator, Author):

@malfet @ngimel thanks for the review!

@youkaichao (Collaborator, Author):

@pytorchmergebot merge

pytorch-bot bot added the ciflow/trunk label Jun 18, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@youkaichao deleted the fabric branch June 18, 2025 16:00
pytorchmergebot pushed a commit that referenced this pull request Jun 19, 2025:

#156074 adds support for IPC with the fabric handle, but the code cannot compile for CUDA < 12.3 (in particular, e.g., CUDA 11.8).

This PR improves the support by adding some compile-time checks against the CUDA version.
Pull Request resolved: #156394
Approved by: https://github.com/ngimel
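
The guard that follow-up implies is roughly of this shape (a sketch, not the actual diff in #156394):

#if defined(CUDA_VERSION) && CUDA_VERSION >= 12030
  // CU_MEM_HANDLE_TYPE_FABRIC and CUmemFabricHandle only exist in
  // CUDA 12.3+, so the fabric-handle IPC path is compiled in here.
#else
  // Older toolkits (e.g. CUDA 11.8) fall back to the posix fd path only.
#endif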