Fix SIGSEGV in CudaIPCTypes.cpp. #53080
As described in #51619, ProcessGroupShareTensorTest was failing due to segfaults in CudaIPCTypes.cpp. There were two issues that had to be fixed for this:

1. The ref_counter_files_ map was looked up and the result was used without checking whether or not the appropriate key existed in the map. This would result in default construction in the map if the key didn't exist, resulting in a nullptr being stored in the map.
2. ~CudaIPCSentData uses the global cuda_ipc_global_entities variable. But as part of destroying cuda_ipc_global_entities, ~CudaIPCSentData is called, which accesses an already destroyed cuda_ipc_global_entities. This is now avoided by clearing all shared blocks in ~CudaIPCGlobalEntities to ensure they are all cleaned up before the destructor exits.

#Closes: #51619

Differential Revision: [D26742332](https://our.internmc.facebook.com/intern/diff/D26742332/)
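The first issue can be illustrated with a minimal sketch. It is not the PyTorch code: `RefCounterFile`, `FileMap`, `lookup_unchecked`, and `lookup_checked` are hypothetical names, and the map value type is assumed to be a `std::shared_ptr`. It only shows how `operator[]` on a standard map inserts a null entry for a missing key, while `find()` leaves the map untouched.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a ref-counter file; not the PyTorch class.
struct RefCounterFile {
  int live_refs = 0;
};

// Assumed shape of the map: handle -> shared ownership of the file.
using FileMap = std::map<std::string, std::shared_ptr<RefCounterFile>>;

// Unchecked lookup: operator[] default-constructs a value for a missing key,
// so a null shared_ptr ends up stored in the map and returned to the caller.
std::shared_ptr<RefCounterFile> lookup_unchecked(FileMap& files,
                                                 const std::string& handle) {
  return files[handle];
}

// Checked lookup: find() never inserts, and the caller gets nullptr
// for a missing key without the map being modified.
std::shared_ptr<RefCounterFile> lookup_checked(const FileMap& files,
                                               const std::string& handle) {
  auto it = files.find(handle);
  return it == files.end() ? nullptr : it->second;
}
```

With the checked variant, a caller can test the returned pointer before dereferencing it, and a missing key no longer leaves a stray null entry behind.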
Codecov Report
@@ Coverage Diff @@
## gh/pritamdamania87/205/base #53080 +/- ##
===============================================================
- Coverage 78.01% 78.01% -0.01%
===============================================================
Files 1848 1848
Lines 179696 179696
===============================================================
- Hits 140192 140182 -10
- Misses 39504 39514 +10
@malfet Could I get a review for this PR? Thanks!
By any chance, can you include a test for this case?
The test is actually ProcessGroupShareTensorTest, which was failing in #51619 but passes with this PR.
Feel free to land this PR.
This pull request has been merged in 8533a48.
…es instance (#56141)

Summary: There is an instance of the static destruction order fiasco where cuda_ipc_global_entities may be accessed after it is destroyed. See #51961

This change uses a flag and avoids accesses to the destroyed class when it is set to false.

Fixes #51961

This removes the function to clear shared_blocks introduced by #53080, which had multiple issues: unprotected access to a shared structure, and modification of the vector which is being cleared by the destructors of the objects contained.

I.e. what happened was:
- `CudaIPCSentDataLimbo_.clear_shared_blocks();` is called from the destructor of CudaIPCGlobalEntities as of your PR
- This deletes instances of `CudaIPCSentData` which hold `at::DataPtr` created by `GetNewRefCountedSentData`
- This means `CudaIPCSentDataDelete` is called with still active pointers
- Hence `CudaIPCSentDataLimbo_.add` is called, adding a new value to `shared_blocks_`

Pull Request resolved: #56141
Reviewed By: ejguan
Differential Revision: D30397279
Pulled By: VitalyFedyunin
fbshipit-source-id: ce4b8b90fa1c90d275e5eca93ba84321cbc6140a
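The flag-based approach described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from #56141: `GlobalEntities`, `g_entities`, `g_entities_alive`, and `defer_release` are hypothetical names, and a plain `bool` is used for brevity where real multi-threaded code would likely need an atomic or a lock.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the global holder of shared blocks.
struct GlobalEntities;

// Hypothetical flag and pointer tracking whether the holder is still usable.
// A plain bool is an assumption made for brevity; it is not thread-safe.
bool g_entities_alive = false;
GlobalEntities* g_entities = nullptr;

struct GlobalEntities {
  std::vector<int> pending;  // block ids waiting to be released

  GlobalEntities() {
    g_entities = this;
    g_entities_alive = true;
  }

  ~GlobalEntities() {
    // Flip the flag first so that any deleter running during or after
    // destruction skips the holder instead of touching freed members.
    g_entities_alive = false;
    g_entities = nullptr;
  }
};

// Deleter-like function that may run at any point, including during static
// destruction. Returns false when the holder is gone and nothing was recorded.
bool defer_release(int block_id) {
  if (!g_entities_alive) {
    return false;
  }
  g_entities->pending.push_back(block_id);
  return true;
}
```

In this sketch a late call to `defer_release` becomes a no-op once the holder has been destroyed, which is the behavior the flag is meant to provide; the holder's destructor does not need to clear the objects it contains.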