
Remove DeviceBuffer synchronization on default stream #650

Merged

Conversation

pentschev (Member)

As discussed offline a couple of weeks ago, RMM's Python binding synchronizes on the default stream, and that synchronization was introduced mostly because of UCX-Py's usage with Dask. However, Distributed already synchronizes the default stream when using UCX-Py, in https://github.com/dask/distributed/blob/9f5426e030b5c4f7cd9db08d5c73f9aca3db4830/distributed/comm/ucx.py#L224-L230 and https://github.com/dask/distributed/blob/9f5426e030b5c4f7cd9db08d5c73f9aca3db4830/distributed/comm/ucx.py#L282-L285.

I spent a good part of yesterday and today testing this PR as much as I could, including with TPCx-BB workflows, and I've experienced no regressions from simply getting rid of those two cases. Therefore, I'm reasonably confident we're safe in removing those synchronizations.
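For context, a minimal sketch (not the Distributed code linked above, and not part of this PR) of the caller-side pattern this change relies on: work is enqueued on the default stream, and the layer that hands buffers to the host or to a transport such as UCX synchronizes explicitly. Using numba.cuda for the stream handle is only an assumption for illustration.

```python
import rmm
from numba import cuda

# Allocate a device buffer; allocation and any copies into it are ordered
# on the default stream.
frame = rmm.DeviceBuffer(size=1 << 20)

# ... kernels or asynchronous copies that write into `frame` go here ...

# The communication layer (e.g. Distributed's UCX comm) decides when the
# device work must be complete, so the DeviceBuffer binding itself no
# longer needs to block:
cuda.default_stream().synchronize()
```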

pentschev requested a review from a team as a code owner December 8, 2020 22:57
harrism (Member) left a comment

Making async asynchronous again. Thank you!

kkraus14 (Contributor) left a comment

We now have some race conditions with host memory we need to handle

kkraus14 added the breaking, improvement, and Python labels Dec 9, 2020
harrism (Member) commented Dec 9, 2020

> We now have some race conditions with host memory we need to handle

Hmmm, if DeviceBuffer has a to_host method, shouldn't the synchronization happen there? I think rmm::device_scalar::value() and rmm::device_uvector::element() are good examples of the correct implementation of this synchronization when copying to host.

(rmm::device_buffer is not a good example, since it only provides access to a raw device pointer.)

harrism (Member) commented Dec 9, 2020

I think copy_ptr_to_host should synchronize.

kkraus14 (Contributor) commented Dec 9, 2020

> I think copy_ptr_to_host should synchronize.

I think the layer above these functions is the one that should synchronize. Generally, the rule we should follow is: if we're working with pointers, we follow CUDA semantics; if we're working with Python host objects, we need to prevent the footgun.

harrism (Member) commented Dec 9, 2020

> > I think copy_ptr_to_host should synchronize.
>
> I think the layer above these functions is the one that should synchronize. Generally, the rule we should follow is: if we're working with pointers, we follow CUDA semantics; if we're working with Python host objects, we need to prevent the footgun.

That's my point and the reason I compared it to C++ classes. I think CUDA semantics would be to synchronize here before returning the value to host, since host code doesn't execute in stream order. Otherwise the value could be returned and accessed from the host before the copy is complete.
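To make the race concrete, a minimal Numba-based sketch (illustration only, not RMM code) of a host read racing with a stream-ordered device-to-host copy:

```python
import numpy as np
from numba import cuda

stream = cuda.stream()
d_val = cuda.to_device(np.array([42], dtype=np.int32), stream=stream)

out = cuda.pinned_array(1, dtype=np.int32)  # pinned, so the D2H copy is truly asynchronous
d_val.copy_to_host(out, stream=stream)      # enqueued on `stream`, returns immediately

print(out[0])         # BUG: host code is not stream ordered; the copy may not have finished

stream.synchronize()  # correct pattern: synchronize the stream, then read on the host
print(out[0])
```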

pentschev (Member Author)

So if I'm following correctly, instead of removing the sync entirely, we want to synchronize the default stream only when we're copying to host, is that right? I can easily check for that condition in _copy_async by checking kind, but in DeviceBuffer we don't have that information, and I see that internally it will just copy with cudaMemcpyDefault, so it seems intended that RMM doesn't need to know where the copy is coming from or going to. Do we have some attribute somewhere that I'm missing which could be used to identify the copy direction?

harrism (Member) commented Dec 9, 2020

That's why I thought copy_ptr_to_host would be the place to sync. And not just the default stream. I think we would always sync the specified stream before returning when copying to host.

pentschev (Member Author)

> That's why I thought copy_ptr_to_host would be the place to sync. And not just the default stream. I think we would always sync the specified stream before returning when copying to host.

That makes sense. I thought we would also need to sync in the DeviceBuffer constructor, but I think I misunderstood it originally. I believe all the requested changes are in now.

pentschev (Member Author)

> I think we would always sync the specified stream before returning when copying to host.

Actually, I didn't do this. While I generally agree with it for Python safety, shouldn't we at least add a new argument that allows the user to skip the synchronization if desired? E.g., force_synchronization=True by default, which the user could override by passing force_synchronization=False?
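A rough sketch of what such an opt-out could look like, using Numba device arrays instead of RMM's actual copy_ptr_to_host Cython helper; the function name, the keyword, and its default here are just the proposal above, not something this PR ships:

```python
import numpy as np
from numba import cuda

def copy_to_host(dev_array, stream, force_synchronization=True):
    """Enqueue a D2H copy on `stream`; synchronize by default before returning."""
    out = cuda.pinned_array(dev_array.shape, dtype=dev_array.dtype)
    dev_array.copy_to_host(out, stream=stream)  # asynchronous copy on `stream`
    if force_synchronization:
        stream.synchronize()                    # safe default for Python callers
    return out

stream = cuda.stream()
d = cuda.to_device(np.arange(10, dtype=np.int32), stream=stream)

safe = copy_to_host(d, stream)                               # result is ready to use
racy = copy_to_host(d, stream, force_synchronization=False)  # caller opts out...
stream.synchronize()                                         # ...and must sync before reading `racy`
```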

@@ -78,9 +71,6 @@ cdef class DeviceBuffer:
else:
self.c_obj.reset(new device_buffer(c_ptr, size, stream.view()))

Contributor (review comment on the diff above)

Would we possibly need to synchronize here in the case that c_ptr is a host pointer? Or are we saying that when working with pointers we do everything asynchronously?

pentschev (Member Author)

That's a good question; I don't know what the correct answer is from the RMM perspective. I think from the Python perspective we should indeed synchronize, but for that I think we would need cudaPointerGetAttributes, which isn't exposed to Python as far as I can see. Should I expose and use cudaPointerGetAttributes, or is there a simpler way I'm overlooking?

Contributor

I think that's unfortunately the best bet for now.

Contributor

Or we can pessimistically always synchronize when constructing from a pointer in Python for now. I'm not sure whether this API actually gets used in practice.

pentschev (Member Author) commented Dec 9, 2020

It gets used in Distributed: https://github.com/dask/distributed/blob/3d53801f1ba7b2d9b59804548162efc3bc57f857/distributed/protocol/rmm.py#L46

Anyway, it's probably best to play it safe and efficient, so I'm going to attempt the cudaPointerGetAttributes path; if we find a more appropriate way, we can always update it later.
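For reference, this is roughly what the runtime query looks like from Python. The sketch uses NVIDIA's cuda-python runtime bindings, which is an assumption for illustration (the PR itself exposes the call through RMM's own Cython layer):

```python
from cuda import cudart  # assumption: cuda-python runtime bindings are available

def is_probably_host_pointer(ptr: int) -> bool:
    err, attrs = cudart.cudaPointerGetAttributes(ptr)
    if err != cudart.cudaError_t.cudaSuccess:
        # Note: some older CUDA versions return an error for plain,
        # unregistered host pointers instead of reporting a memory type.
        raise RuntimeError(f"cudaPointerGetAttributes failed: {err}")
    # Unregistered host allocations report cudaMemoryTypeUnregistered;
    # pinned host memory reports cudaMemoryTypeHost.
    return attrs.type in (
        cudart.cudaMemoryType.cudaMemoryTypeUnregistered,
        cudart.cudaMemoryType.cudaMemoryTypeHost,
    )
```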

Member

Why do you need to synchronize when constructing from a host pointer? Host memory is not stream ordered, so you don't have to wait on it. And all methods on the DeviceBuffer are stream ordered, so any accesses from the device or all copies will be stream ordered. Synchronizing here means you lose the asynchronous benefits of stream ordering. The only place you really need to synchronize a stream-ordered entity is when returning a scalar value to the host thread after copying it DeviceToHost.

Contributor

If we don't synchronize here the host pointer could be mutated or freed before the copy finishes, no?

Member

The user was smart enough to pass a stream, meaning they want this to be stream ordered. They should be smart enough not to modify or delete the thing they are copying without synchronizing first.

The whole point of stream ordering is so that users who know what they are doing can achieve stall-free pipelines on the GPU. If we synchronize their streams everywhere, we kill that capability.
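A small sketch of the stream-ordered usage being defended here, written with Numba rather than the DeviceBuffer constructor under review: the caller keeps the source host buffer alive and synchronizes once at the end of the pipeline, instead of the library synchronizing inside every call.

```python
import numpy as np
from numba import cuda

stream = cuda.stream()
host_src = cuda.pinned_array(1 << 20, dtype=np.uint8)  # caller keeps this alive
host_src[:] = 1

dev = cuda.to_device(host_src, stream=stream)  # H2D copy enqueued; host thread not blocked
# ... more kernels/copies that consume `dev` are enqueued on `stream` ...

stream.synchronize()  # only now is it safe to mutate or free `host_src`
```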

Contributor

This is the construction path taken whenever a pointer is passed, regardless of whether the user passed a stream or not, hence why I think we need to synchronize in the case of the default stream.

pentschev (Member Author)

As per the conversation @kkraus14 and I had offline, I've added the synchronization back, but only when a pointer is passed, and I adjusted the note accordingly. For now, we assume any pointer needs synchronization, regardless of whether it's a host or device pointer. As Keith noted too, this path is most often used for host copies, as we generally rely on __cuda_array_interface__ for D2D.

python/rmm/_lib/device_buffer.pyx
jakirkham (Member)

Thanks Peter! FWIW, I share the same concerns Keith already articulated above, so I think it's just a matter of figuring out how we solve those race conditions.

GPUtester (Contributor)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

kkraus14 (Contributor)

LGTM now, thanks @pentschev!

kkraus14 added the 5 - Ready to Merge and 6 - Okay to Auto-Merge labels Dec 11, 2020
rapids-bot merged commit c306e60 into rapidsai:branch-0.18 Dec 11, 2020
pentschev (Member Author)

Thanks everyone for the reviews!

pentschev deleted the remove-python-devicebuffer-sync branch December 11, 2020 22:57
jakirkham (Member)

Thanks Peter! 😄
