
add multi stream allocations benchmark. #841

Merged

Conversation

@cwharris (Contributor) commented Aug 10, 2021

This PR introduces a new benchmark to measure the effect of the pool allocator in situations that require concurrent kernel execution. The benchmark reveals that the pool allocator prevents concurrent kernel execution across multiple non-default streams unless it already has enough memory reserved for each stream before any allocation attempts.
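
In this context, "prewarming" means exercising each stream's allocation path once before the timed run, so the pool holds per-stream memory up front. A minimal sketch of the idea (illustrative only; `prewarm` is a hypothetical helper name, not the benchmark's exact code):

```cpp
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstddef>

// Hypothetical helper: allocate and immediately free on every stream in the
// pool, so the pool MR reserves memory per stream before the timed run and
// the timed allocations never fall back to the synchronizing steal path.
void prewarm(rmm::mr::device_memory_resource* mr,
             rmm::cuda_stream_pool& streams,
             std::size_t bytes_per_stream)
{
  for (std::size_t i = 0; i < streams.get_pool_size(); ++i) {
    auto stream = streams.get_stream(i);
    void* ptr = mr->allocate(bytes_per_stream, stream);
    mr->deallocate(ptr, bytes_per_stream, stream);
  }
}
```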

legend: BM_MultiStreamAllocations/<memory resource>/<streams in pool>/<kernels to launch>/<prewarm>

-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/cuda/4/4/0             2568 us         2564 us          270 items_per_second=1.55987k/s
BM_MultiStreamAllocations/cuda/4/4/1             2568 us         2562 us          272 items_per_second=1.5612k/s
BM_MultiStreamAllocations/cuda_async/4/4/0        536 us          536 us         1301 items_per_second=7.45966k/s
BM_MultiStreamAllocations/cuda_async/4/4/1        537 us          537 us         1303 items_per_second=7.45012k/s
BM_MultiStreamAllocations/pool_mr/4/4/0          2094 us         2094 us          334 items_per_second=1.90985k/s
BM_MultiStreamAllocations/pool_mr/4/4/1           533 us          533 us         1310 items_per_second=7.49885k/s
BM_MultiStreamAllocations/arena/4/4/0            2105 us         2105 us          332 items_per_second=1.90017k/s
BM_MultiStreamAllocations/arena/4/4/1            2113 us         2113 us          332 items_per_second=1.89299k/s
BM_MultiStreamAllocations/binning/4/4/0          2104 us         2104 us          333 items_per_second=1.9009k/s
BM_MultiStreamAllocations/binning/4/4/1           534 us          534 us         1307 items_per_second=7.49373k/s

The benchmark works by running a compute-bound kernel in a very small launch configuration (1 block, 1 thread) multiple times against a stream pool. As the size of the stream pool increases, one should expect overall performance to improve. This is only the case when the memory pool has been prewarmed for each stream; otherwise, sync-and-steal behavior comes into play, forcing any already-launched work to complete before the next buffer can be allocated, thereby preventing any kernel overlap.
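
The measured launch pattern looks roughly like the following sketch (assumed structure; `compute_bound_kernel` and `run_kernels` are illustrative names, not the PR's exact code):

```cpp
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstdint>

// Compute-bound kernel in a deliberately tiny launch configuration, so that
// many launches can overlap across streams if nothing serializes them.
__global__ void compute_bound_kernel(int64_t* out, int64_t n)
{
  int64_t acc = 0;
  for (int64_t i = 0; i < n; ++i) {
    acc += i;
  }
  *out = acc;
}

// Launch `kernel_count` kernels round-robin over the stream pool, allocating
// each kernel's output buffer from `mr` on its stream.
void run_kernels(rmm::mr::device_memory_resource* mr,
                 rmm::cuda_stream_pool& streams,
                 int kernel_count)
{
  for (int i = 0; i < kernel_count; ++i) {
    auto stream = streams.get_stream(i % streams.get_pool_size());
    // The allocation is the interesting part: a cold pool_mr may steal
    // blocks associated with other streams, synchronizing those streams
    // and preventing the kernels below from overlapping.
    rmm::device_uvector<int64_t> buffer(1, stream, mr);
    compute_bound_kernel<<<1, 1, 0, stream.value()>>>(buffer.data(), 1 << 20);
    // `buffer` is deallocated on `stream` at the end of the iteration.
  }
}
```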

undesirable synchronization (cold start): [profiler screenshot]

desirable kernel overlap (warm start): [profiler screenshot]

@cwharris added the labels "3 - Ready for review", "benchmarks", "non-breaking", "feature request", and "improvement", and removed the label "feature request" on Aug 10, 2021
@cwharris marked this pull request as ready for review August 10, 2021 06:38
@cwharris requested review from a team as code owners August 10, 2021 06:38
@cwharris requested a review from rongou August 10, 2021 06:38
@harrism (Member) left a comment

Thanks for creating this. Looks good. Just some suggestions on how to make it easier to use for profiling and easier to extend to other MRs in the future.

@cwharris requested a review from harrism August 11, 2021 03:11
@harrism (Member) left a comment

Love it.

std::cout << "Error: invalid memory_resource name: " << name << std::endl;
}

void run_profile(std::string resource_name, int kernel_count, int stream_count, bool prewarm)
Contributor: Is this needed? Can't you do the same thing via GBench command line args?

Member: Not sure what you mean by "this". gbench runs multiple times no matter what. We want a way to run only once for profiling.

Contributor:

> We want a way to run only once for profiling.

Right, that's what I meant. I thought there was a `num_iterations` gbench option.

Related NVIDIA/nvbench#10

Contributor Author (@cwharris): I haven't found an option in gbench to limit the number of iterations. I looked briefly prior to implementing it this way, but gbench documentation is... not great?

Member: gbenchmark does not have a way to control the number of iterations. Here's one of the authors' explanations of why: https://stackoverflow.com/a/61888885/749748
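
Since gbench cannot cap iterations, a manual entry point is one way to get a single profiled run. A hedged sketch of how the `run_profile` function from the diff above might be wired up (the `--profile` flag and the argument handling here are assumptions, not the PR's exact code):

```cpp
#include <benchmark/benchmark.h>

#include <cstring>
#include <string>

// Declaration of the function shown in the diff above, repeated so this
// sketch is self-contained.
void run_profile(std::string resource_name, int kernel_count, int stream_count, bool prewarm);

int main(int argc, char** argv)
{
  // Hypothetical flag: run the workload exactly once so a profiler (e.g.
  // Nsight Systems) captures a single clean iteration instead of gbench's
  // auto-tuned repetition count.
  if (argc > 1 && std::strcmp(argv[1], "--profile") == 0) {
    run_profile("pool_mr", /*kernel_count=*/4, /*stream_count=*/4, /*prewarm=*/true);
    return 0;
  }
  ::benchmark::Initialize(&argc, argv);
  ::benchmark::RunSpecifiedBenchmarks();
  return 0;
}
```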

@harrism (Member) commented Aug 16, 2021

@cwharris failing cmake style

@harrism (Member) commented Aug 18, 2021

@gpucibot merge

@rapids-bot (bot) merged commit b458233 into rapidsai:branch-21.10 Aug 18, 2021