Remove DeviceBuffer synchronization on default stream #650
Conversation
Making async asynchronous again. Thank you!
We now have some race conditions with host memory that we need to handle.
I think the layer above these functions is the one that should synchronize. Generally, the rule we should follow is: if working with pointers, we follow CUDA semantics; if working with Python host objects, we need to prevent the footgun.
That's my point and the reason I compared it to C++ classes. I think CUDA semantics would be to synchronize here before returning the value to host, since host code doesn't execute in stream order. Otherwise the value could be returned and accessed from the host before the copy is complete.
So if I'm following correctly, instead of removing the sync entirely, we want to synchronize the default stream only when we're copying to host, is that right? I can easily check for that condition in
That's why I thought copy_ptr_to_host would be the place to sync. And not just the default stream: I think we would always sync the specified stream before returning when copying to host.
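A minimal sketch of that pattern, using numba.cuda purely for illustration (RMM's actual copy_ptr_to_host lives in its Cython layer and may differ): the asynchronous device-to-host copy is enqueued on the caller's stream, and that same stream is synchronized before the host data is handed back.

```python
# Illustrative sketch only, not RMM code: copy device data back to the host
# on a caller-supplied stream, then synchronize that stream before returning.
import numpy as np
from numba import cuda

def copy_back_to_host(dev_ary, stream):
    host_ary = np.empty(dev_ary.shape, dtype=dev_ary.dtype)
    dev_ary.copy_to_host(host_ary, stream=stream)  # enqueued asynchronously
    stream.synchronize()  # without this, the caller could read incomplete data
    return host_ary

stream = cuda.stream()
dev = cuda.to_device(np.arange(8, dtype=np.int32), stream=stream)
print(copy_back_to_host(dev, stream))
```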
That makes sense. I thought we also would need to sync on
Actually I didn't do this. While I generally agree with this for Python safety, shouldn't we at least add a new argument that allows the user to prevent synchronization if that's desired?
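A purely hypothetical sketch of that opt-out; neither this helper nor a `synchronize` keyword exists in RMM. It only shows the shape of the suggestion: safe, synchronous behavior by default, with an explicit escape hatch for expert users.

```python
# Hypothetical API shape only; `synchronize` is not an existing RMM parameter.
def copy_to_host_sketch(device_buffer, stream, synchronize=True):
    host_data = bytearray(device_buffer.size)
    # ... enqueue the asynchronous device-to-host copy on `stream` here ...
    if synchronize:
        stream.synchronize()  # default: block until the copy has finished
    return host_data
```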
@@ -78,9 +71,6 @@ cdef class DeviceBuffer:
        else:
            self.c_obj.reset(new device_buffer(c_ptr, size, stream.view()))
Would we possibly need to synchronize here in the case that c_ptr is a host pointer? Or are we saying that when working with pointers we do everything asynchronously?
That's a good question; I don't know what the correct answer is from the RMM perspective. From the Python perspective I think we should indeed synchronize, but for that we would need cudaPointerGetAttributes, which isn't exposed to Python as far as I can see. Should I expose and use cudaPointerGetAttributes, or is there a simpler way I'm overlooking?
I think that's unfortunately the best bet for now
Or we can pessimistically always synchronize if constructing from a pointer in Python for now. I'm not sure if this API actually gets used in practice.
It gets used in Distributed: https://github.com/dask/distributed/blob/3d53801f1ba7b2d9b59804548162efc3bc57f857/distributed/protocol/rmm.py#L46
Anyway, it's probably best to try to play it safe while staying efficient, so I'm going to attempt the cudaPointerGetAttributes path; if we find a more appropriate way we can always update that later.
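For reference, a rough sketch of the kind of check being discussed. It uses NVIDIA's cuda-python runtime bindings purely for illustration (an assumption: RMM itself would call cudaPointerGetAttributes directly from Cython), and the exact enum and field names should be treated as approximate.

```python
# Illustration only: ask the CUDA runtime whether a raw pointer refers to
# device-accessible memory, via the cuda-python bindings (an assumption;
# RMM would call cudaPointerGetAttributes from its Cython layer instead).
from cuda import cudart

def is_device_pointer(ptr: int) -> bool:
    err, attrs = cudart.cudaPointerGetAttributes(ptr)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaPointerGetAttributes failed: {err}")
    # Unregistered or registered host memory -> treat as a host pointer.
    return attrs.type in (cudart.cudaMemoryType.cudaMemoryTypeDevice,
                          cudart.cudaMemoryType.cudaMemoryTypeManaged)
```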
Why do you need to synchronize when constructing from a host pointer? Host memory is not stream ordered, so you don't have to wait on it. And all methods on the DeviceBuffer are stream ordered, so any accesses from the device, and all copies, will be stream ordered. Synchronizing here means you lose the asynchronous benefits of stream ordering. The only place you really need to synchronize a stream-ordered entity is when returning a scalar value to the host thread after copying it DeviceToHost.
If we don't synchronize here the host pointer could be mutated or freed before the copy finishes, no?
The user was smart enough to pass a stream, meaning they want this to be stream ordered. They should be smart enough not to modify or delete the thing they are copying without synchronizing first.
The whole point of stream ordering is so that users who know what they are doing can achieve stall-free pipelines on the GPU. If we synchronize their streams everywhere, we kill that capability.
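To make that contract concrete (using numba.cuda as an assumed stand-in for whatever stream abstraction the caller has): the caller who opted into a stream keeps the source host buffer alive and unmodified until they synchronize.

```python
# Caller-side discipline under stream ordering (illustrative, using numba.cuda).
import numpy as np
from numba import cuda

stream = cuda.stream()
host = np.arange(1_000_000, dtype=np.float32)

dev = cuda.to_device(host, stream=stream)  # H2D copy enqueued, returns immediately
# host[:] = 0.0  # NOT safe yet: the copy may still be reading `host`
stream.synchronize()                       # the copy has finished
host[:] = 0.0                              # now safe to reuse or free `host`
```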
This is the construction path whenever a pointer is passed, regardless of whether the user passed a stream or not, hence why I think we need to synchronize in the case of the default stream.
As per the conversation @kkraus14 and I had offline, I've added the synchronization back, but only when a pointer is passed, and I adjusted the note accordingly. For now, we assume any pointer needs synchronization, regardless of whether it is a host or device pointer. As Keith noted, this path is most often used for host copies, as we would generally rely on __cuda_array_interface__ for D2D.
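A hedged usage sketch of the two paths mentioned above (only DeviceBuffer and __cuda_array_interface__ come from this thread; the rest is illustrative): constructing from a raw, typically host, pointer now synchronizes pessimistically, while device data would normally arrive through an object exposing __cuda_array_interface__ and stay stream ordered.

```python
# Illustrative only: the raw-pointer construction path discussed in this PR.
import numpy as np
import rmm

host = np.arange(64, dtype=np.uint8)

# Pointer path: host address + size; per this PR, the copy is synchronized.
buf = rmm.DeviceBuffer(ptr=host.ctypes.data, size=host.nbytes)

# Device-to-device data would normally be passed as an object exposing
# __cuda_array_interface__ (DeviceBuffer itself exposes one) rather than a
# bare pointer, keeping that path stream ordered.
print(buf.__cuda_array_interface__["data"])
```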
Thanks Peter! FWIW, I share the same concerns Keith already articulated above. So I think it is just a matter of figuring out how we solve those race conditions.
Please update the changelog in order to start CI tests. View the gpuCI docs here.
LGTM now, thanks @pentschev!
Thanks everyone for reviews!
Thanks Peter! 😄
As discussed offline a couple of weeks ago, RMM's Python binding synchronizes on the default stream; that was introduced mostly because of UCX-Py's usage with Dask. However, Distributed already synchronizes the default stream when using UCX-Py in https://github.com/dask/distributed/blob/9f5426e030b5c4f7cd9db08d5c73f9aca3db4830/distributed/comm/ucx.py#L224-L230 and https://github.com/dask/distributed/blob/9f5426e030b5c4f7cd9db08d5c73f9aca3db4830/distributed/comm/ucx.py#L282-L285 .
I spent a good part of yesterday and today testing this PR as much as I could, including with TPCx-BB workflows, and I've experienced no regressions from simply getting rid of those two cases. Therefore, I'm reasonably confident we're safe in removing those synchronizations.
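For context, the synchronization Distributed performs in the linked code is conceptually equivalent to the following sketch (using numba.cuda; the exact helper in ucx.py differs, see the links above):

```python
# Sketch of synchronizing the default stream before handing buffers to the
# transport, as Distributed does around UCX sends/receives.
from numba import cuda

def sync_default_stream():
    # Block the host until all work queued on the default stream has completed,
    # so the buffers handed to the transport are fully materialized.
    cuda.default_stream().synchronize()
```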