
add multi stream allocations benchmark. #841

Merged

Conversation

@cwharris (Contributor) commented Aug 10, 2021

This PR introduces a new benchmark to measure the effect of the pool allocator in situations that require concurrent kernel execution. The benchmark reveals that the pool allocator prevents concurrent kernel execution across multiple non-default streams unless it already has enough memory reserved for each stream before any allocation attempts.
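
In this context, "prewarming" means exercising each stream's allocation path once before the timed run, so the pool holds per-stream memory up front. A minimal sketch of the idea (illustrative only; `prewarm` is a hypothetical helper name, not the benchmark's exact code):

```cpp
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstddef>

// Hypothetical helper: allocate and immediately free on every stream in the
// pool, so the pool MR reserves memory per stream before the timed run and
// the timed allocations never fall back to the synchronizing steal path.
void prewarm(rmm::mr::device_memory_resource* mr,
             rmm::cuda_stream_pool& streams,
             std::size_t bytes_per_stream)
{
  for (std::size_t i = 0; i < streams.get_pool_size(); ++i) {
    auto stream = streams.get_stream(i);
    void* ptr = mr->allocate(bytes_per_stream, stream);
    mr->deallocate(ptr, bytes_per_stream, stream);
  }
}
```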

legend: BM_MultiStreamAllocations/<memory resource>/<streams in pool>/<kernels to launch>/<prewarm>

-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/cuda/4/4/0             2568 us         2564 us          270 items_per_second=1.55987k/s
BM_MultiStreamAllocations/cuda/4/4/1             2568 us         2562 us          272 items_per_second=1.5612k/s
BM_MultiStreamAllocations/cuda_async/4/4/0        536 us          536 us         1301 items_per_second=7.45966k/s
BM_MultiStreamAllocations/cuda_async/4/4/1        537 us          537 us         1303 items_per_second=7.45012k/s
BM_MultiStreamAllocations/pool_mr/4/4/0          2094 us         2094 us          334 items_per_second=1.90985k/s
BM_MultiStreamAllocations/pool_mr/4/4/1           533 us          533 us         1310 items_per_second=7.49885k/s
BM_MultiStreamAllocations/arena/4/4/0            2105 us         2105 us          332 items_per_second=1.90017k/s
BM_MultiStreamAllocations/arena/4/4/1            2113 us         2113 us          332 items_per_second=1.89299k/s
BM_MultiStreamAllocations/binning/4/4/0          2104 us         2104 us          333 items_per_second=1.9009k/s
BM_MultiStreamAllocations/binning/4/4/1           534 us          534 us         1307 items_per_second=7.49373k/s

The benchmark works by running a compute-bound kernel in a very small launch configuration (1 block, 1 thread) multiple times against a stream pool. As the size of the stream pool increases, one should expect overall performance to improve. This is only the case when the memory pool has been prewarmed for each stream; otherwise, sync-and-steal behavior comes into play, forcing any already-launched work to complete before the next buffer can be allocated, thereby preventing any kernel overlap.
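
The measured launch pattern looks roughly like the following sketch (assumed structure; `compute_bound_kernel` and `run_kernels` are illustrative names, not the PR's exact code):

```cpp
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstdint>

// Compute-bound kernel in a deliberately tiny launch configuration, so that
// many launches can overlap across streams if nothing serializes them.
__global__ void compute_bound_kernel(int64_t* out, int64_t n)
{
  int64_t acc = 0;
  for (int64_t i = 0; i < n; ++i) {
    acc += i;
  }
  *out = acc;
}

// Launch `kernel_count` kernels round-robin over the stream pool, allocating
// each kernel's output buffer from `mr` on its stream.
void run_kernels(rmm::mr::device_memory_resource* mr,
                 rmm::cuda_stream_pool& streams,
                 int kernel_count)
{
  for (int i = 0; i < kernel_count; ++i) {
    auto stream = streams.get_stream(i % streams.get_pool_size());
    // The allocation is the interesting part: a cold pool_mr may steal
    // blocks associated with other streams, synchronizing those streams
    // and preventing the kernels below from overlapping.
    rmm::device_uvector<int64_t> buffer(1, stream, mr);
    compute_bound_kernel<<<1, 1, 0, stream.value()>>>(buffer.data(), 1 << 20);
    // `buffer` is deallocated on `stream` at the end of the iteration.
  }
}
```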

undesirable synchronization (cold start): [profiler screenshot]

desirable kernel overlap (warm start): [profiler screenshot]

@cwharris added the labels "3 - Ready for review", "benchmarks", "non-breaking", "feature request", and "improvement", and removed the label "feature request" on Aug 10, 2021
@cwharris marked this pull request as ready for review August 10, 2021 06:38
@cwharris requested review from a team as code owners August 10, 2021 06:38
@cwharris requested a review from rongou August 10, 2021 06:38
@harrism (Member) left a comment

Thanks for creating this. Looks good. Just some suggestions on how to make it easier to use for profiling and easier to extend to other MRs in the future.

@cwharris requested a review from harrism August 11, 2021 03:11
@harrism (Member) left a comment

Love it.

std::cout << "Error: invalid memory_resource name: " << name << std::endl;
}

void run_profile(std::string resource_name, int kernel_count, int stream_count, bool prewarm)
Contributor: Is this needed? Can't you do the same thing via GBench command line args?

Member: Not sure what you mean by "this". gbench runs multiple times no matter what. We want a way to run only once for profiling.

Contributor:

> We want a way to run only once for profiling.

Right, that's what I meant. I thought there was a `num_iterations` gbench option.

Related NVIDIA/nvbench#10

Contributor Author (@cwharris): I haven't found an option in gbench to limit the number of iterations. I looked briefly prior to implementing it this way, but gbench documentation is... not great?

Member: gbenchmark does not have a way to control the number of iterations. Here's one of the authors' explanations of why: https://stackoverflow.com/a/61888885/749748
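
Since gbench cannot cap iterations, a manual entry point is one way to get a single profiled run. A hedged sketch of how the `run_profile` function from the diff above might be wired up (the `--profile` flag and the argument handling here are assumptions, not the PR's exact code):

```cpp
#include <benchmark/benchmark.h>

#include <cstring>
#include <string>

// Declaration of the function shown in the diff above, repeated so this
// sketch is self-contained.
void run_profile(std::string resource_name, int kernel_count, int stream_count, bool prewarm);

int main(int argc, char** argv)
{
  // Hypothetical flag: run the workload exactly once so a profiler (e.g.
  // Nsight Systems) captures a single clean iteration instead of gbench's
  // auto-tuned repetition count.
  if (argc > 1 && std::strcmp(argv[1], "--profile") == 0) {
    run_profile("pool_mr", /*kernel_count=*/4, /*stream_count=*/4, /*prewarm=*/true);
    return 0;
  }
  ::benchmark::Initialize(&argc, argv);
  ::benchmark::RunSpecifiedBenchmarks();
  return 0;
}
```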

@harrism (Member) commented Aug 16, 2021

@cwharris failing cmake style

@harrism (Member) commented Aug 18, 2021

@gpucibot merge

@rapids-bot (bot) merged commit b458233 into rapidsai:branch-21.10 Aug 18, 2021