Remove DeviceBuffer synchronization on default stream #650
Conversation
Making async asynchronous again. Thank you!
We now have some race conditions with host memory that we need to handle.
I think the layer above these functions is the one that should synchronize. Generally, the rule we should follow is: if working with pointers, we follow CUDA semantics; if working with Python host objects, we need to prevent the footgun.
That's my point and the reason I compared it to C++ classes. I think CUDA semantics would be to synchronize here before returning the value to host, since host code doesn't execute in stream order. Otherwise the value could be returned and accessed from the host before the copy is complete.
So if I'm following correctly, instead of removing the sync entirely, we want to synchronize the default stream only when we're copying to host, is that right? I can easily check for that condition in
That's why I thought copy_ptr_to_host would be the place to sync. And not just the default stream: I think we would always sync the specified stream before returning when copying to host.
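A minimal sketch of that pattern, using numba.cuda purely for illustration (RMM's actual copy_ptr_to_host lives in its Cython layer and may differ): the asynchronous device-to-host copy is enqueued on the caller's stream, and that same stream is synchronized before the host data is handed back.

```python
# Illustrative sketch only, not RMM code: copy device data back to the host
# on a caller-supplied stream, then synchronize that stream before returning.
import numpy as np
from numba import cuda

def copy_back_to_host(dev_ary, stream):
    host_ary = np.empty(dev_ary.shape, dtype=dev_ary.dtype)
    dev_ary.copy_to_host(host_ary, stream=stream)  # enqueued asynchronously
    stream.synchronize()  # without this, the caller could read incomplete data
    return host_ary

stream = cuda.stream()
dev = cuda.to_device(np.arange(8, dtype=np.int32), stream=stream)
print(copy_back_to_host(dev, stream))
```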
That makes sense. I thought we also would need to sync on
Actually I didn't do this. While I generally agree with this for Python safety, shouldn't we at least add a new argument that allows the user to prevent synchronization if that's desired?
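A purely hypothetical sketch of that opt-out; neither this helper nor a `synchronize` keyword exists in RMM. It only shows the shape of the suggestion: safe, synchronous behavior by default, with an explicit escape hatch for expert users.

```python
# Hypothetical API shape only; `synchronize` is not an existing RMM parameter.
def copy_to_host_sketch(device_buffer, stream, synchronize=True):
    host_data = bytearray(device_buffer.size)
    # ... enqueue the asynchronous device-to-host copy on `stream` here ...
    if synchronize:
        stream.synchronize()  # default: block until the copy has finished
    return host_data
```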
@@ -78,9 +71,6 @@ cdef class DeviceBuffer:
        else:
            self.c_obj.reset(new device_buffer(c_ptr, size, stream.view()))
Would we possibly need to synchronize here in the case that c_ptr is a host pointer? Or are we saying that when working with pointers we do everything asynchronously?
That's a good question; I don't know what the correct answer is from the RMM perspective. From the Python perspective I think we should indeed synchronize, but for that we would need cudaPointerGetAttributes, which isn't exposed to Python as far as I can see. Should I expose and use cudaPointerGetAttributes, or is there a simpler way I'm overlooking?
I think that's unfortunately the best bet for now
Or we can pessimistically always synchronize if constructing from a pointer in Python for now. I'm not sure if this API actually gets used in practice.
It gets used in Distributed: https://github.com/dask/distributed/blob/3d53801f1ba7b2d9b59804548162efc3bc57f857/distributed/protocol/rmm.py#L46
Anyway, it's probably best to try to play it safe while staying efficient, so I'm going to attempt the cudaPointerGetAttributes path; if we find a more appropriate way we can always update that later.
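For reference, a rough sketch of the kind of check being discussed. It uses NVIDIA's cuda-python runtime bindings purely for illustration (an assumption: RMM itself would call cudaPointerGetAttributes directly from Cython), and the exact enum and field names should be treated as approximate.

```python
# Illustration only: ask the CUDA runtime whether a raw pointer refers to
# device-accessible memory, via the cuda-python bindings (an assumption;
# RMM would call cudaPointerGetAttributes from its Cython layer instead).
from cuda import cudart

def is_device_pointer(ptr: int) -> bool:
    err, attrs = cudart.cudaPointerGetAttributes(ptr)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaPointerGetAttributes failed: {err}")
    # Unregistered or registered host memory -> treat as a host pointer.
    return attrs.type in (cudart.cudaMemoryType.cudaMemoryTypeDevice,
                          cudart.cudaMemoryType.cudaMemoryTypeManaged)
```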
Why do you need to synchronize when constructing from a host pointer? Host memory is not stream ordered, so you don't have to wait on it. And all methods on the DeviceBuffer are stream ordered, so any accesses from the device, and all copies, will be stream ordered. Synchronizing here means you lose the asynchronous benefits of stream ordering. The only place you really need to synchronize a stream-ordered entity is when returning a scalar value to the host thread after copying it DeviceToHost.
If we don't synchronize here the host pointer could be mutated or freed before the copy finishes, no?
The user was smart enough to pass a stream, meaning they want this to be stream ordered. They should be smart enough not to modify or delete the thing they are copying without synchronizing first.
The whole point of stream ordering is so that users who know what they are doing can achieve stall-free pipelines on the GPU. If we synchronize their streams everywhere, we kill that capability.
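To make that contract concrete (using numba.cuda as an assumed stand-in for whatever stream abstraction the caller has): the caller who opted into a stream keeps the source host buffer alive and unmodified until they synchronize.

```python
# Caller-side discipline under stream ordering (illustrative, using numba.cuda).
import numpy as np
from numba import cuda

stream = cuda.stream()
host = np.arange(1_000_000, dtype=np.float32)

dev = cuda.to_device(host, stream=stream)  # H2D copy enqueued, returns immediately
# host[:] = 0.0  # NOT safe yet: the copy may still be reading `host`
stream.synchronize()                       # the copy has finished
host[:] = 0.0                              # now safe to reuse or free `host`
```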
This is the construction path whenever a pointer is passed, regardless of whether the user passed a stream or not, hence why I think we need to synchronize in the case of the default stream.
As per the conversation @kkraus14 and I had offline, I've added the synchronization back, but only when a pointer is passed, and I adjusted the note accordingly. For now, we assume any pointer needs synchronization, regardless of whether it is a host or device pointer. As Keith noted, this path is most often used for host copies, as we would generally rely on __cuda_array_interface__ for D2D.
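A hedged usage sketch of the two paths mentioned above (only DeviceBuffer and __cuda_array_interface__ come from this thread; the rest is illustrative): constructing from a raw, typically host, pointer now synchronizes pessimistically, while device data would normally arrive through an object exposing __cuda_array_interface__ and stay stream ordered.

```python
# Illustrative only: the raw-pointer construction path discussed in this PR.
import numpy as np
import rmm

host = np.arange(64, dtype=np.uint8)

# Pointer path: host address + size; per this PR, the copy is synchronized.
buf = rmm.DeviceBuffer(ptr=host.ctypes.data, size=host.nbytes)

# Device-to-device data would normally be passed as an object exposing
# __cuda_array_interface__ (DeviceBuffer itself exposes one) rather than a
# bare pointer, keeping that path stream ordered.
print(buf.__cuda_array_interface__["data"])
```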
Thanks Peter! FWIW, I share the same concerns Keith already articulated above. So I think it is just a matter of figuring out how we solve those race conditions.
Please update the changelog in order to start CI tests. View the gpuCI docs here.
LGTM now, thanks @pentschev!
Thanks everyone for reviews!
Thanks Peter! 😄
As discussed offline a couple of weeks ago, RMM's Python binding synchronizes on the default stream; that was introduced mostly because of UCX-Py's usage with Dask. However, Distributed already synchronizes the default stream when using UCX-Py in https://github.com/dask/distributed/blob/9f5426e030b5c4f7cd9db08d5c73f9aca3db4830/distributed/comm/ucx.py#L224-L230 and https://github.com/dask/distributed/blob/9f5426e030b5c4f7cd9db08d5c73f9aca3db4830/distributed/comm/ucx.py#L282-L285 .
I spent a good part of yesterday and today testing this PR as much as I could, including with TPCx-BB workflows, and I've experienced no regressions from simply getting rid of those two cases. Therefore, I'm reasonably confident we're safe in removing those synchronizations.
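For context, the synchronization Distributed performs in the linked code is conceptually equivalent to the following sketch (using numba.cuda; the exact helper in ucx.py differs, see the links above):

```python
# Sketch of synchronizing the default stream before handing buffers to the
# transport, as Distributed does around UCX sends/receives.
from numba import cuda

def sync_default_stream():
    # Block the host until all work queued on the default stream has completed,
    # so the buffers handed to the transport are fully materialized.
    cuda.default_stream().synchronize()
```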