[DOC] Discuss and document expected stream synchronization behavior of libcudf functions #4511
This conversation relates to #925 and to if/when we ever want to make streams a part of the public interface.
In Legate, we have two problems with the current synchronization model in cuDF. First, all of our streams are non-blocking, non-default streams. This allows us to convert task parallelism discovered by the Legion runtime into kernel parallelism visible to the CUDA driver. It also makes using cuDF unsafe for us unless we understand the synchronization model for "effects" (kernel launches and asynchronous memcpy operations) performed during a call into cuDF. There are at least two possible solutions to this problem (maybe more):
I think we would have a slight preference for the second option, as it keeps the library implementation functional and stateless, which means it can be shared across multiple GPUs and called into simultaneously from different threads with different CUDA contexts. The second problem we have is that it is not obvious which cuDF API calls contain internal synchronization that violates the expected asynchronous model. The first problem has significantly higher priority for us than the second, but I include it for completeness.
Option 3 would be for each libcudf API to take a stream parameter, and to order the API's actions on that stream. This avoids the statefulness of option 1 (inconsistent with current libcudf design) and the return of outputs via pointer parameters of option 2 (also inconsistent with current libcudf design).
That would work for us, as long as all "effects" performed by the cuDF API call are encapsulated on that stream. Some cursory investigation suggests there are a few API calls in cuDF that create or use their own streams for internal work. I haven't checked whether all the kernels/copy operations placed on those internal streams are ordered back into the default stream or a potential external stream passed in by the user.
I agree, and I don't think it would be usable if libcudf forked work onto internal streams without making the externally specified stream wait on events recorded in those internal streams.
Correction, that's legacy too. There are no such calls outside of the legacy code.
What makes me uncomfortable about Option 3, just adding a stream parameter to every API, is its potential for being a footgun. E.g., Option A:
The user has to be very aware that they cannot touch the result until the work in the stream has completed. I think a futures-based API would be much safer and more explicit about the synchronization semantics. Option B:
Note that in this example an explicit stream wasn't used. The idea would be that libcudf would abstract away the use of a non-default stream by using something like a stream pool. Granted, that's a lot more work. Alternatively, we could mix futures and explicit streams: Option C:
C is a bit of a middle ground between A and B. It alleviates some of the potential footgun of A by making the caller explicitly aware that the result is returned asynchronously, without the added libcudf complexity of managing a stream pool in B.
How would options B and C support avoiding synchronization when it is unnecessary? I can see how these approaches make it harder to introduce races, but it also seems like they make it harder to avoid unnecessary synchronization. Chaining a series of libcudf calls would be painful if the caller needs to wait on each future before issuing the next call. Would libcudf methods start taking futures as inputs as well?
Yeah, this has always been the roadblock for me in figuring out how we could have a future-based API. It seems like we'd need to re-architect all the APIs to not only return futures, but also accept them as inputs. Alternatively, one idea is to have the returned object itself carry what's needed for synchronization.
I'm not sure about your proposed implementation, though.
It's a good idea, but unfortunately impossible. All libcudf APIs are set up like this:
The output is an owning column. We could hide the synchronization behind that owning type, but then the asynchrony is invisible to the caller.
Thrust solves this problem by having execution policies that can depend on futures:
I may not have the syntax exactly right, because I can't find an actual example of this, only a mention in the release notes: https://github.com/thrust/thrust/releases/tag/1.9.4. It's probably not a great example anyway, since my inclusive_scan wouldn't actually depend on the reduces before it. But you get the idea.
I went looking for documentation on that feature and couldn't find much either. I'm wondering, though, if it might make sense for streams to be added as members of both the `column` and `column_view` types:

```cpp
// Do some async operation on data_a in stream_a
some_kernel<<<grid, block, 0, stream_a>>>(data_a);
// Do another async operation on data_b in stream_b
another_kernel<<<grid, block, 0, stream_b>>>(data_b);

// Create column_views with the streams they're ordered on
auto a = cudf::column_view{data_type{INT32}, size_a, data_a, stream_a};
auto b = cudf::column_view{data_type{INT32}, size_b, data_b, stream_b};

// Concatenate should sync stream_a and stream_b as appropriate (cudaStreamWaitEvent),
// with an additional output stream parameter to use for the result
unique_ptr<column> c = cudf::concatenate({a, b}, stream_c);
assert(c->view().stream() == stream_c);
```

With an interface like this, I don't think explicit stream parameters would even be required:

```cpp
// Columns a and b are each allocated in their own streams from the rmm pool
unique_ptr<column> a = cudf::make_numeric_column(data_type{INT32}, 100);
unique_ptr<column> b = cudf::make_numeric_column(data_type{INT32}, 100);

// Column c is computed in its own stream, which is appropriately synchronized
// with the streams owned by a and b
unique_ptr<column> c = cudf::concatenate({a->view(), b->view()});

// The streams return to the pool when columns a, b, and c go out of scope
```
I think unnecessarily creating new streams is dangerous, even with a stream pool that makes it cheap. The reason is that the more streams are used to allocate memory asynchronously from a pool allocator, the greater the temporal fragmentation of the pool and the greater the frequency of stream synchronization.
These are all just current heuristics in the allocator; we could change or optimize them as needed. But I want us to be aware of the impact of automatically creating more streams.
That's a good point; I wasn't considering how pooled allocations would be affected. I'm not necessarily advocating for using a large number of streams, just trying to consider how the API could be adjusted so that it's possible to do so safely. But to your point, maybe it would be better to keep stream parameters explicit instead of defaulting to a pool or the default stream:

```cpp
class managed_stream {
 public:
  managed_stream(cudaStream_t stream) : _stream{stream} {}  // for wrapping non-managed streams
  managed_stream(unique_ptr<stream_resource> owner)
    : _stream{owner->stream()}, _owner{std::move(owner)} {}

 private:
  cudaStream_t _stream;
  unique_ptr<stream_resource> _owner;  // some RAII owning type
};

unique_ptr<column> a = cudf::make_numeric_column(data_type{INT32}, 100, managed_stream{stream_a});
unique_ptr<column> b = cudf::make_numeric_column(data_type{INT32}, 100, pool->allocate());
// column b owns the RAII owning type; column a owns nothing
```

How does rmm currently handle the latter case? I'm thinking it could record an event at the time the block is freed, so that subsequent work in the old stream isn't counted against it, and then use a wait event in the new stream rather than a stream synchronize. Coalescing is a tricky point, though. I suppose after coalescing, it could hold onto a vector of events to wait on when allocating from that block. Maybe it could keep them sorted and remember the address range they belonged to, so it could wait only on the events overlapping with a fresh allocation from the coalesced block. But I'm not sure if keeping a potentially high volume of events around (one per block returned to rmm) is a sane thing to do. Having a fat list of events after coalescing a bunch of small allocations could be worse than just dealing with stream synchronization. I'm just thinking out loud with only a vague idea of how rmm works internally 😅
@jrhemstad @harrism Have you had any more ideas about streams, either in public APIs or the stream types we discussed for internal APIs?
Last time we talked about it, we decided we would get per-thread default stream working and see how far that gets us. We still need a strongly typed, non-owning stream wrapper.
I plan to start working on the CUDA event handling needed in rmm.
Update, making progress: rapidsai/rmm#425
This is still (more than ever) relevant. |
This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start, since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and revisit a better place once we've tackled #11481. Partly addresses #5505, #1781. Resolves #4511.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Bradley Dice (https://github.com/bdice)
- David Wendt (https://github.com/davidwendt)

URL: #11853
Report needed documentation
Today, the (a)synchronous behavior of cuDF/libcudf functions is not formally documented (nor, to my knowledge, has it really been discussed).
The current state of affairs is that no libcudf function explicitly synchronizes before returning. This means that the result buffers from libcudf functions may not immediately be safe to use.
Thus far, this has been okay for a few reasons:
This means that for any function: when `cudf::function` returns, the buffer in `result` may not be safe to use immediately. However, operations like copying `result` to host or using it in another kernel are safe, because they are stream-ordered. Meaning, any stream-ordered operations after `cudf::function` will be enqueued in the stream such that, by the time those operations occur, previous operations in the stream will have completed (i.e., `result` is safe to use).

Where this breaks down:
None of this information is really documented anywhere, so we should probably remedy that.
Describe the documentation you'd like
The synchronous behavior of the library is very important to get right. We should have a conversation about what we have now and if it is what we think is correct for the long term future of the library.
Compare to Thrust, which takes a more pessimistic approach to synchronization and synchronizes the default stream before returning from functions like reductions or when constructing a `device_vector`.