Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Segfault when a device_buffer is released as its memory resource has already been released #943

Closed
viclafargue opened this issue Jan 17, 2022 · 6 comments
Assignees
Labels
? - Needs Triage Need team to review and classify bug Something isn't working Python Related to RMM Python API

Comments

@viclafargue
Copy link
Contributor

Describe the bug
A segfault is triggered when a device_buffer is released as its memory resource has already been released.

Steps/Code to reproduce bug

  1. Enable logging (or any UpstreamResourceAdaptor)
  2. Allocate some memory through native code.
  3. Disable logging (or remove the UpstreamResourceAdaptor)

Explanation :
A pointer to the logging_resource_adaptor will be saved in the device_buffer instance. However, whereas RMM's Python classes retain a reference to the memory resource preventing its deletion, the native code doesn't. Because of this, the logging_resource_adaptor can be released before the device_buffer is, making its proper release impossible.

Expected behavior
No segfault should be observed.

Note:
Collecting the garbage collector before disabling the RMM logging seems to fix the issue. However, it doesn't look like a very good long term solution.

@viclafargue viclafargue added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jan 17, 2022
@jrhemstad jrhemstad added the Python Related to RMM Python API label Jan 17, 2022
@shwina shwina self-assigned this Jan 18, 2022
@shwina
Copy link
Contributor

shwina commented Jan 18, 2022

Sync'd offline, and it looks like this issue arises for a custom extension type that is composed of a C++ type, which in turn holds a reference to a rmm::device_uvector. That device_uvector depends on the memory resource being alive at the time of deallocation. The simplest solution here is for the custom extension type to hold a reference to the current memory resource, in the same way that rmm.DeviceBuffer does.

@harrism
Copy link
Member

harrism commented Feb 1, 2022

So is this a bug that needs to be fixed in RMM, or in @viclafargue 's code?

@jakirkham
Copy link
Member

I thought this was fixed with PR ( #931 )?

@harrism
Copy link
Member

harrism commented Feb 1, 2022

@shwina can you confirm?

@shwina
Copy link
Contributor

shwina commented Feb 1, 2022

@viclafargue does the solution I posted above work for your use case? #931 fixed a somewhat different issue than this one.

@viclafargue
Copy link
Contributor Author

Yes, #931 is a different issue. As discussed offline, the lifetime of the memory resource should be the responsibility of the user. Holding a reference to the current memory resource in the same way the rmm.DeviceBuffer does should indeed solve the issue. Thank you for the answers. Closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage Need team to review and classify bug Something isn't working Python Related to RMM Python API
Projects
None yet
Development

No branches or pull requests

5 participants