Improve IPC for Expandable Segments to use fabric handle when possible #156074
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156074. Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 2e6ad5c with merge base 655b3b1. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
I tested the functionality locally, but I don't know how to test it in CI. We need to enable the IMEX channel to test the functionality.
c10/cuda/CUDACachingAllocator.cpp (outdated)
    C10_CUDA_DRIVER_CHECK(DriverAPI::get()->cuMemExportToShareableHandle_(
        &fd, handle.handle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0));
    handle.fd = fd;
    TORCH_WARN_ONCE("use posix fd to share expandable segments.");
If possible, I'd like to add some debug-level logging, but I don't know if that is possible in PyTorch C++.
There's C10_LOG_API_USAGE_ONCE
Regular LOG(INFO) also works, with TORCH_CPP_LOG_LEVEL=INFO
There's no debug-level logging (turned off by default); is it acceptable to put it at ERROR level? It's not an error, though.
Or do you think making it INFO level is acceptable?
Right now users see:
/data/youkaichao/pytorch/torch/storage.py:1445: UserWarning: use posix fd to share expandable segments. (Triggered internally at /data/youkaichao/pytorch/c10/cuda/CUDACachingAllocator.cpp:511.)
return self._untyped_storage._share_cuda_(*args, **kwargs)
When I use C10_LOG_API_USAGE_ONCE, I don't see any output.
Okay, it looks like LOG(INFO) does not print by default and only shows when setting TORCH_CPP_LOG_LEVEL=INFO, so this is fine.
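For reference, a minimal sketch of what the INFO-level logging discussed above could look like; the function name is illustrative and this is not the literal PR code, but it assumes the glog-style LOG macro that <c10/util/Logging.h> provides:

```cpp
#include <c10/util/Logging.h>

// Hypothetical helper: logs which handle type is used for IPC.
// The message is only visible when the user runs with
// TORCH_CPP_LOG_LEVEL=INFO; by default nothing is printed.
void report_handle_type(bool use_fabric) {
  if (use_fabric) {
    LOG(INFO) << "using fabric handle to share expandable segments.";
  } else {
    LOG(INFO) << "using posix fd to share expandable segments.";
  }
}
```

Running a script with `TORCH_CPP_LOG_LEVEL=INFO` would surface these messages, while the default run stays silent.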
fixed in 61cadf7
Technically it is possible that one process imports a fabric handle but allocates with a POSIX fd handle, but I guess that's not useful. To simplify the code, I assume all processes use the same handle type, either fabric handle or POSIX fd handle.
Looks good, small comments
c10/cuda/CUDACachingAllocator.cpp (outdated)
    // is not supported, return a null range to indicate it. and clear the
    // error by calling cuGetErrorString_.
    const char* error_string = nullptr;
    DriverAPI::get()->cuGetErrorString_(status, &error_string);
Does cuGetErrorString clear the error? Docs are silent on this
No, it should not; it's just a table lookup (unless the error code you are passing to it is undefined :) ).
good catch, fixed in 2e6ad5c
c10/cuda/CUDACachingAllocator.cpp (outdated)
    size_t max_handles_;
    struct Handle {
      CUmemGenericAllocationHandle handle;
      std::optional<int> fd;
Consider using std::variant here, as only one of fd/fabric_handle can have a value.
fixed in c2d111e
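A rough sketch of what the variant-based Handle could look like after that change; the field name shareable_handle follows a snippet quoted later in this thread, and the use of CUmemFabricHandle (the CUDA 12.3+ fabric handle type) is an assumption, not the literal commit:

```cpp
#include <optional>
#include <variant>
#include <cuda.h>  // CUmemGenericAllocationHandle, CUmemFabricHandle (CUDA >= 12.3)

struct Handle {
  CUmemGenericAllocationHandle handle;
  // Exactly one shareable form is exported per handle: either a POSIX fd
  // or a fabric handle. std::nullopt means the handle was not exported yet.
  std::optional<std::variant<int, CUmemFabricHandle>> shareable_handle;
};
```

The variant makes the "fd or fabric handle, never both" invariant explicit in the type instead of relying on two optional fields.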
@malfet can you help with the CI config so this can be tested?
    // CUDA_ERROR_NOT_PERMITTED to be safe, any non out-of-memory error is
    // considered as the handle type is not supported. if the handle type
    // is not supported, return a null range to indicate it.
    return SegmentRange(nullptr, 0);
This won't clear the error though, so the next error check would return false?
Also, what happens if the allocation succeeds (it requires just a current enough driver version, IIUC) but the subsequent handle exchange fails due to missing permissions?
@malfet is there a way to check if IMEX permissions are set correctly?
Re "what happens if allocation succeeds (it requires just the current driver version iiuc)":
it requires the IMEX channel to be set up, so allocation succeeds <==> handle exchange succeeds.
I checked the error-handling doc (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__ERROR.html) and it does not say anything about clearing the error. I think a driver API error is just returned, without affecting the following call?
I think this is demonstrated by the passing tests: they don't have the IMEX channels set up, so this driver API call will fail, but later calls still succeed.
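A hedged sketch of the fallback behavior discussed above; the names DriverAPI::get()->cuMemCreate_, segment_size, and prop are taken from or modeled on the snippets quoted in this thread and are not the literal PR code:

```cpp
CUmemGenericAllocationHandle handle{};
CUresult status =
    DriverAPI::get()->cuMemCreate_(&handle, segment_size, &prop, 0);
if (status != CUDA_SUCCESS && status != CUDA_ERROR_OUT_OF_MEMORY) {
  // Any non-OOM failure (e.g. the fabric handle type is requested but no
  // IMEX channel is available) is treated as "handle type not supported"
  // and reported with an empty range. The CUresult is just a return value,
  // so no explicit "clear error" step is needed before later driver calls.
  return SegmentRange(nullptr, 0);
}
```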
cc @malfet for CI testing; we have H100 runners that have NVSwitch.
Approving the PR; I'd wait for @malfet for help with CI testing.
We also need to make sure the IMEX channels are set up correctly.
    if (CUDAAllocatorConfig::expandable_segments_handle_type() !=
        Expandable_Segments_Handle_Type::FABRIC_HANDLE) {
      if (!handle.shareable_handle) {
        int fd = 0;
Please initialize it to -1; otherwise, if the API call is a no-op for whatever reason, one will end up trying to use fd 0 (stdin).
Suggested change: replace `int fd = 0;` with `int fd = -1;`
    for (auto i : c10::irange(header.num_handles)) {
      (void)i;
Suggested change: replace
    for (auto i : c10::irange(header.num_handles)) {
      (void)i;
with
    for ([[maybe_unused]] auto i : c10::irange(header.num_handles)) {
    TORCH_CHECK(pidfd != -1, "pidfd_open:", c10::utils::str_error(errno));
    for (auto i : c10::irange(header.num_handles)) {
      (void)i;
      int fd = 0;
Same as above.
Suggested change: replace `int fd = 0;` with `int fd = -1;`
    for (auto i : c10::irange(header.num_handles)) {
      (void)i;
      int fd = 0;
      buf.read((char*)&fd, sizeof(int));
Don't you want to check that the read call is successful?
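One possible way to add that check, assuming buf behaves like a std::istream; this is a sketch, not the PR's eventual fix:

```cpp
int fd = -1;
buf.read(reinterpret_cast<char*>(&fd), sizeof(fd));
// Verify both that the stream read succeeded and that the value looks like
// a valid file descriptor before using it.
TORCH_CHECK(
    buf.good() && fd >= 0,
    "failed to read a valid fd from the IPC handle buffer");
```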
    for (auto i : c10::irange(header.num_handles)) {
      (void)i;
Suggested change: replace
    for (auto i : c10::irange(header.num_handles)) {
      (void)i;
with
    for ([[maybe_unused]] auto i : c10::irange(header.num_handles)) {
Sorry, ignore my comments, those are pre-existing glitches (will fix after you land this change)
@youkaichao let's land this and figure out testing later.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
#156074 adds support for IPC with fabric handles, but the code cannot compile for CUDA < 12.3 (in particular, e.g. CUDA 11.8). This PR improves the support by adding compile-time checks against the CUDA version. Pull Request resolved: #156394. Approved by: https://github.com/ngimel
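A sketch of the kind of compile-time guard described above; the actual guard in #156394 may differ, but it illustrates gating the fabric-handle path on CUDA 12.3+, where CU_MEM_HANDLE_TYPE_FABRIC is available (CUDA_VERSION 12030 corresponds to toolkit 12.3):

```cpp
// prop is a CUmemAllocationProp describing the segment allocation.
#if defined(CUDA_VERSION) && CUDA_VERSION >= 12030
  // Fabric handles exist in this toolkit; request them for IPC.
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
#else
  // Older toolkits (e.g. CUDA 11.8) only support POSIX fd sharing.
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
#endif
```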
Improving upon #130890, and inspired by #130890 (comment), we can automatically use the fabric handle for IPC when possible.
cc @ptrblck @msaroufim @eqy @jerryzh168