
New benchmark compares concurrent throughput of device_vector and device_uvector #981

Merged

Conversation

@harrism harrism (Member) commented Feb 15, 2022

Adds a new benchmark to device_uvector_benchmark.cpp that runs a workflow of concurrent kernels on multiple streams, interleaved with vector creation. The benchmark is parameterized on the vector type:

  1. thrust::device_vector -- uses cudaMalloc allocation
  2. rmm::device_vector -- uses RMM allocation
  3. rmm::device_uvector -- uses RMM allocation and leaves the storage uninitialized

The benchmark uses the cuda_async_memory_resource so that cudaMallocAsync is used for allocation of the rmm:: vector types.

The performance on V100 demonstrates that option 1 is slowest due to allocation bottlenecks. Option 2 alleviates these by using cudaMallocFromPoolAsync, but there is no concurrency among the kernels because thrust::device_vector synchronizes the default stream. Option 3 is fastest and achieves full concurrency (verified in Nsight Systems).

Benchmark                                                                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------
BM_VectorWorkflow<thrust::device_vector<int32_t>>/100000/manual_time           242 us          267 us         2962 bytes_per_second=13.8375G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/1000000/manual_time         1441 us         1465 us          472 bytes_per_second=23.273G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/10000000/manual_time       10483 us        10498 us           68 bytes_per_second=31.9829G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/100000000/manual_time      63583 us        63567 us           12 bytes_per_second=52.7303G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/100000/manual_time             82.0 us          105 us         8181 bytes_per_second=40.8661G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/1000000/manual_time             502 us          527 us         1357 bytes_per_second=66.8029G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/10000000/manual_time           4714 us         4746 us          148 bytes_per_second=71.1222G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/100000000/manual_time         46451 us        46478 us           13 bytes_per_second=72.1784G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/100000/manual_time            39.0 us         59.9 us        17970 bytes_per_second=85.8733G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/1000000/manual_time            135 us          159 us         5253 bytes_per_second=248.987G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/10000000/manual_time          1319 us         1351 us          516 bytes_per_second=254.169G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/100000000/manual_time        12841 us        12865 us           54 bytes_per_second=261.099G/s
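
For readers who have not opened device_uvector_benchmark.cpp, the sketch below shows the general shape of such a workflow benchmark. It is not the PR's code: the kernel (compute_kernel), the helper make_vector, the stream count, and the registration details are illustrative assumptions, and it reports wall-clock time for brevity, whereas the actual benchmark reports manual_time measured with CUDA events (as in the results above). The main() shows how a cuda_async_memory_resource can be installed as the current device resource so that the rmm:: vector types allocate via cudaMallocAsync.

#include <benchmark/benchmark.h>

#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/device_vector.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <thrust/device_vector.h>

#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the per-stream device work (grid-stride fill).
__global__ void compute_kernel(std::int32_t* data, std::size_t n)
{
  for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += static_cast<std::size_t>(blockDim.x) * gridDim.x) {
    data[i] = static_cast<std::int32_t>(i);
  }
}

// device_uvector takes a stream at construction; the other vector types do not.
template <typename Vector>
Vector make_vector(std::size_t n, rmm::cuda_stream_view stream)
{
  if constexpr (std::is_same_v<Vector, rmm::device_uvector<std::int32_t>>) {
    return Vector(n, stream);  // uninitialized, stream-ordered allocation
  } else {
    return Vector(n);          // allocates and value-initializes on the default stream
  }
}

template <typename Vector>
void BM_VectorWorkflow(benchmark::State& state)
{
  std::size_t const num_elements = state.range(0);
  int constexpr block_size  = 256;
  int constexpr num_blocks  = 16;
  int constexpr num_streams = 4;  // assumed stream count

  std::vector<rmm::cuda_stream> streams(num_streams);

  for (auto _ : state) {
    std::vector<Vector> vectors;
    vectors.reserve(num_streams);
    // Interleave vector creation with kernel launches on independent streams.
    for (auto& stream : streams) {
      vectors.push_back(make_vector<Vector>(num_elements, stream.view()));
      compute_kernel<<<num_blocks, block_size, 0, stream.value()>>>(
        thrust::raw_pointer_cast(vectors.back().data()), num_elements);
    }
    // Wait for all streams before the vectors are destroyed.
    for (auto& stream : streams) { stream.synchronize(); }
  }
  state.SetBytesProcessed(state.iterations() * state.range(0) * num_streams *
                          sizeof(std::int32_t));
}

BENCHMARK_TEMPLATE(BM_VectorWorkflow, thrust::device_vector<std::int32_t>)
  ->RangeMultiplier(10)->Range(100'000, 100'000'000);
// ...repeat the registration for rmm::device_vector<std::int32_t> and
// rmm::device_uvector<std::int32_t>.

// Route rmm:: allocations through cudaMallocAsync before running the benchmarks.
int main(int argc, char** argv)
{
  rmm::mr::cuda_async_memory_resource mr{};
  rmm::mr::set_current_device_resource(&mr);
  ::benchmark::Initialize(&argc, argv);
  ::benchmark::RunSpecifiedBenchmarks();
  return 0;
}

The key point the sketch tries to capture is that for device_uvector the allocation, the kernel, and the deallocation are all ordered on the same non-default stream, which is what allows the full concurrency observed for option 3.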

@harrism harrism added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Feb 15, 2022
@harrism harrism self-assigned this Feb 15, 2022
@harrism harrism requested a review from a team as a code owner February 15, 2022 02:15
@harrism harrism added this to PR-WIP in v22.04 Release via automation Feb 15, 2022
v22.04 Release automation moved this from PR-WIP to PR-Reviewer approved Feb 16, 2022
@codereport codereport (Contributor) left a comment

LGTM 👍

Only a couple of small, trivial changes. Also, we seem to swap between using int and int32_t. Not sure if it is worth switching the ints to int32_ts.

Comment on lines 144 to 146
auto num_elements = state.range(0);
int block_size = 256;
int num_blocks = 16;
Suggested change
- auto num_elements = state.range(0);
- int block_size = 256;
- int num_blocks = 16;
+ auto const num_elements = state.range(0);
+ int constexpr block_size = 256;
+ int constexpr num_blocks = 16;

@harrism (Member Author)
Thanks. Fixed. Most ints can be auto.

@harrism harrism (Member Author) commented Feb 17, 2022

Hmmm gpuCI doesn't seem to be rerunning automatically. rerun tests.

@vyasr vyasr (Contributor) commented Feb 17, 2022

> Hmmm gpuCI doesn't seem to be rerunning automatically. rerun tests.

Perhaps just a browser caching issue? I occasionally require a hard refresh to see that tests have started running if I already had the page open in a tab for a while before a push.

@harrism harrism (Member Author) commented Feb 17, 2022

I could see the comments from today in the browser, but the linked CI results were from yesterday.

@vyasr vyasr (Contributor) commented Feb 17, 2022

That happens to me sometimes. It will update the comment list, but the checks list won't update until I hard refresh the web page. It's a possible explanation at least, but no way to know unless it happens again.

@harrism harrism (Member Author) commented Feb 17, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cf33a5a into rapidsai:branch-22.04 Feb 17, 2022
v22.04 Release automation moved this from PR-Reviewer approved to Done Feb 17, 2022