Skip to content

Conversation

pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Mar 2, 2021

Stack from ghstack:

As described in #51619,
ProcessGroupShareTensorTest was failing due to segfaults in CudaIPCTypes.cpp.
There were two issues that had to be fixed for this:

  1. The ref_counter_files_ map was looked up and the result was used without
    checking whether or not the appropriate key existed in the map. This would
    result in default construction in the map if the key didn't exist resulting in
    a nullptr being stored in the map.
  2. ~CudaIPCSentData uses the global cuda_ipc_global_entities variable. But as
    part of destroying cuda_ipc_global_entities, ~CudaIPCSentData is called which
    accesses an already destroyed cuda_ipc_global_entities. This is now avoided by
    clearing all shared blocks in ~CudaIPCGlobalEntities to ensure they are all
    cleaned up before the destructor exits.

#Closes: #51619

Differential Revision: D26742332

As described in #51619,
ProcessGroupShareTensorTest was failing due to segfaults in CudaIPCTypes.cpp.
There were two issues that had to be fixed for this:

1. The ref_counter_files_ map was looked up and the result was used without
checking whether or not the appropriate key existed in the map. This would
result in default construction in the map if the key didn't exist resulting in
a nullptr being stored in the map.
2. ~CudaIPCSentData uses the global cuda_ipc_global_entities variable. But as
part of destroying cuda_ipc_global_entities, ~CudaIPCSentData is called which
accesses an already destroyed cuda_ipc_global_entities. This is now avoided by
clearing all shared blocks in ~CudaIPCGlobalEntities to ensure they are all
cleaned up before the destructor exits.

#Closes: #51619

Differential Revision: [D26742332](https://our.internmc.facebook.com/intern/diff/D26742332/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 2, 2021

💊 CI failures summary and remediations

As of commit 95421ac (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

pritamdamania87 pushed a commit that referenced this pull request Mar 2, 2021
As described in #51619,
ProcessGroupShareTensorTest was failing due to segfaults in CudaIPCTypes.cpp.
There were two issues that had to be fixed for this:

1. The ref_counter_files_ map was looked up and the result was used without
checking whether or not the appropriate key existed in the map. This would
result in default construction in the map if the key didn't exist resulting in
a nullptr being stored in the map.
2. ~CudaIPCSentData uses the global cuda_ipc_global_entities variable. But as
part of destroying cuda_ipc_global_entities, ~CudaIPCSentData is called which
accesses an already destroyed cuda_ipc_global_entities. This is now avoided by
clearing all shared blocks in ~CudaIPCGlobalEntities to ensure they are all
cleaned up before the destructor exits.

#Closes: #51619

Differential Revision: [D26742332](https://our.internmc.facebook.com/intern/diff/D26742332/)

ghstack-source-id: 122812319
Pull Request resolved: #53080
@codecov
Copy link

codecov bot commented Mar 2, 2021

Codecov Report

Merging #53080 (95421ac) into gh/pritamdamania87/205/base (0569f63) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/205/base   #53080      +/-   ##
===============================================================
- Coverage                        78.01%   78.01%   -0.01%     
===============================================================
  Files                             1848     1848              
  Lines                           179696   179696              
===============================================================
- Hits                            140192   140182      -10     
- Misses                           39504    39514      +10     

@pritamdamania87
Copy link
Contributor Author

@malfet Could I get a review for this PR? Thanks!

@VitalyFedyunin
Copy link
Contributor

By any chance you can include test for this case?

@pritamdamania87
Copy link
Contributor Author

By any chance you can include test for this case?

The test is actually ProcessGroupShareTensorTest which was failing in #51619 but passes with this PR.

@VitalyFedyunin
Copy link
Contributor

Feel free to land this PR.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 8533a48.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/205/head branch March 20, 2021 14:16
facebook-github-bot pushed a commit that referenced this pull request Aug 18, 2021
…es instance (#56141)

Summary:
There is an instance of the static destruction order fiasco where cuda_ipc_global_entities may be accessed after it is destroyed. See #51961

This change uses a flag and avoids accesses to the destroyed class when it is set to false.

Fixes #51961

This removes the function to clear shared_blocks introduced by #53080 which had multiple issues: Unprotected access to a shared structure and modification of the vector which is being cleared by the destructors of the objects contained.
I.e. what happened was:

- `CudaIPCSentDataLimbo_.clear_shared_blocks();` is called from the destructor of CudaIPCGlobalEntities as of your PR
- This deletes instances of `CudaIPCSentData` which hold `at::DataPtr` created by `GetNewRefCountedSentData`
- This means `CudaIPCSentDataDelete` is called with still active pointers
- Hence `CudaIPCSentDataLimbo_.add` is called adding a new value to `shared_blocks_`

Pull Request resolved: #56141

Reviewed By: ejguan

Differential Revision: D30397279

Pulled By: VitalyFedyunin

fbshipit-source-id: ce4b8b90fa1c90d275e5eca93ba84321cbc6140a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants