
Prevent DeviceBuffer DeviceMemoryResource premature release #931

Merged: 8 commits merged into rapidsai:branch-22.02 on Jan 5, 2022

Conversation

viclafargue
Contributor

DeviceBuffer deallocation sometimes fails with a segfault. I haven't been able to create a minimal reproducer so far. However, a possible source of the problem is the way Cython releases the memory: the DeviceBuffer's DeviceMemoryResource appears to sometimes be released before the device_buffer. This causes a segfault in the device_buffer destructor, as the device memory resource pointer is set to null.

The code in this PR seems to fix the issue. It makes use of a holder class that appears to ensure proper ordering of the destruction of the class members.
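
A rough sketch of the holder idea (illustrative only, not the exact diff in this PR; cimports of device_buffer and DeviceMemoryResource from RMM's pxd declarations are omitted):

from libcpp.memory cimport unique_ptr

cdef class NativeDeviceBufferHolder:
    # Bundle the native buffer with a reference to its memory resource so
    # that the buffer can be released while the resource is still alive.
    cdef unique_ptr[device_buffer] c_obj
    cdef DeviceMemoryResource mr

    def __dealloc__(self):
        # Reset the native buffer explicitly while `mr` is still valid.
        self.c_obj.reset()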

@viclafargue viclafargue requested a review from a team as a code owner December 6, 2021 16:24
@github-actions github-actions bot added the Python Related to RMM Python API label Dec 6, 2021
@shwina
Contributor

shwina commented Dec 6, 2021

Thanks for the PR!

I think we need a better understanding of the problem, exactly when it arises, and why the changes in this PR fix it. If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault?

Do you think maybe simply reversing the order of declaration of c_obj and mr here would resolve the issue you are seeing? cdef classes in Cython translate into C++ structs, and the order of deletion of members is in the reverse order of declaration.
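
For illustration only, that reordering might look roughly like this (attribute names taken from the generated struct quoted in the next comment; this is an assumption about device_buffer.pyx, not the actual code, and RMM-specific cimports are omitted):

from libcpp.memory cimport unique_ptr

cdef class DeviceBuffer:
    # Declaring `mr` before `c_obj` would mean it is deleted after `c_obj`,
    # if members are indeed torn down in reverse declaration order.
    cdef DeviceMemoryResource mr
    cdef unique_ptr[device_buffer] c_obj
    cdef Stream stream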

@viclafargue
Contributor Author

Thanks for taking a look!

Do you think maybe simply reversing the order of declaration of c_obj and mr here would resolve the issue you are seeing?

I remember trying that; unfortunately, it didn't seem to work.

What I noticed is that the DeviceMemoryResource's shared_ptr refcount always seems to be 1, so in my understanding it could just as well be a unique_ptr. DeviceMemoryResource copies in the code appear to be passed by reference at the Python level, so what actually matters is the Python refcount.

In the C++ code, we can see the following:

struct __pyx_obj_3rmm_4_lib_13device_buffer_DeviceBuffer {
  PyObject_HEAD
  struct __pyx_vtabstruct_3rmm_4_lib_13device_buffer_DeviceBuffer *__pyx_vtab;
  std::unique_ptr<rmm::device_buffer>  c_obj;
  struct __pyx_obj_3rmm_4_lib_15memory_resource_DeviceMemoryResource *mr;
  struct __pyx_obj_3rmm_5_cuda_6stream_Stream *stream;
};

While c_obj, which holds the native device_buffer, is a C++ object, mr, the DeviceMemoryResource extension type, is a pointer to a struct whose lifetime is managed by the Python layer. And, as discussed in the Cython documentation:

By the time your __dealloc__() method is called, the object may already have been partially destroyed and may not be in a valid state as far as Python is concerned

In my understanding, mr, the DeviceMemoryResource class member, might already have been destroyed by the time c_obj is destructed.

If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault?

Sure, I can help you reproduce it through NVIDIA's internal communication means.

@shwina
Contributor

shwina commented Dec 10, 2021

As an update: using no_gc_clear with DeviceBuffer also fixes the issue, although a simple test program where a DeviceBuffer lives in a reference cycle does not reproduce it. Continuing to investigate.

@shwina
Contributor

shwina commented Dec 14, 2021

I believe this should be a more minimal reproducer of the problem:

import rmm
import gc

class Foo():
    def __init__(self, x):
        self.x = x

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())

dbuf1 = rmm.DeviceBuffer(size=10)

# Make dbuf1 part of a reference cycle:
l1 = Foo(dbuf1)
l2 = Foo(dbuf1)
l1.l2 = l2
l2.l1 = l1

# due to the reference cycle, the device buffer doesn't actually get
# cleaned up until later, after we invoke `gc.collect()`:
del dbuf1, l1, l2

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())

# by now, the only remaining reference to the *original* memory
# resource should be in `dbuf1`. However, the cyclic garbage collector
# will eliminate that reference when it clears the object via its
# `tp_clear` method.  Later, when `tp_dealloc` attempts to actually
# deallocate `dbuf1` (which needs the MR alive), a segfault occurs.

gc.collect()

As suspected, we are seeing exactly the situation described in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#disabling-cycle-breaking-tp-clear. Setting no_gc_clear seems to be the recommended fix and eliminates the segfault, both in the original issue and the smaller repro.
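
For reference, applying that fix would look roughly like this (a minimal sketch, assuming the DeviceBuffer declaration in device_buffer.pyx; the attribute list is abbreviated and RMM-specific cimports are omitted):

cimport cython
from libcpp.memory cimport unique_ptr

@cython.no_gc_clear
cdef class DeviceBuffer:
    # Opting out of tp_clear means the cyclic garbage collector will not
    # drop the reference to `mr` before __dealloc__ runs, so the memory
    # resource is still alive when the underlying device_buffer is freed.
    cdef unique_ptr[device_buffer] c_obj
    cdef DeviceMemoryResource mr
    cdef Stream stream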

The one caveat with no_gc_clear is:

If you use no_gc_clear, it is important that any given reference cycle contains at least one object without no_gc_clear. Otherwise, the cycle cannot be broken, which is a memory leak.

I cannot think of a way to construct such a reference cycle.

@shwina
Contributor

shwina commented Dec 15, 2021

@viclafargue I think we should avoid the indirection via NativeDeviceBufferHolder if possible. Adding @no_gc_clear to DeviceBuffer tackles the problem more directly, although we should add a comment describing the need for it.

@viclafargue
Contributor Author

Wrote a fix as discussed with @shwina through DM. It involves storing a shared_ptr to a device_memory_resource in the DeviceBuffer as its reference to the memory resource. Previously the reference was stored as a DeviceMemoryResource object, which was unfortunately subject to being released before the device buffer itself could be deallocated.
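
A minimal sketch of that approach (assumed declarations, not the verbatim diff; as noted below, it was later reverted in favour of no_gc_clear):

from libcpp.memory cimport shared_ptr, unique_ptr

cdef class DeviceBuffer:
    cdef unique_ptr[device_buffer] c_obj
    # A raw C++ shared_ptr is not a Python object, so the cyclic garbage
    # collector's tp_clear cannot drop it before __dealloc__ runs.
    cdef shared_ptr[device_memory_resource] mr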

@viclafargue
Contributor Author

The solution actually resulted in segfaults when using an UpstreamResourceAdaptor: the upstream device memory resource was released before the device buffer. Will use the no_gc_clear decorator instead.

@viclafargue
Contributor Author

rerun tests

@caryr35 caryr35 added this to PR-WIP in v22.02 Release via automation Jan 5, 2022
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v22.02 Release Jan 5, 2022
@shwina shwina added the non-breaking Non-breaking change label Jan 5, 2022
@shwina shwina added the bug Something isn't working label Jan 5, 2022
Contributor

@shwina shwina left a comment

This looks good to me. Going to merge!

v22.02 Release automation moved this from PR-Needs review to PR-Reviewer approved Jan 5, 2022
@shwina
Contributor

shwina commented Jan 5, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5a239d2 into rapidsai:branch-22.02 Jan 5, 2022
v22.02 Release automation moved this from PR-Reviewer approved to Done Jan 5, 2022
@harrism
Member

harrism commented Jan 10, 2022

Please remove WIP from PRs before merging.

@harrism harrism changed the title [WIP] Fix for DeviceBuffer Fix for DeviceBuffer Jan 10, 2022
@harrism harrism changed the title Fix for DeviceBuffer Prevent DeviceBuffer DeviceMemoryResource premature release Jan 10, 2022
@harrism
Member

harrism commented Jan 10, 2022

Gave this PR a new title, posthumously. Would hate to end up with just "Fix for DeviceBuffer" in our release notes! @viclafargue Please be thoughtful and descriptive in PR titles.
