
Load balance optimization for contiguous_split #9755

Merged
11 commits merged into rapidsai:branch-22.02 on Jan 13, 2022

Conversation

@nvdbaranec (Contributor) commented Nov 23, 2021

The existing contiguous_split implementation was vulnerable to situations where the number of columns (N) times the number of splits (M) was less than the number of SMs on the GPU. This PR implements a postprocessing step that attempts to distribute the bytes to be copied as evenly as possible across all available SMs.

PR has been updated to repartition using a constant chunk size of 1 MB. This yields better results than the initial approach.

Before/after benchmarks for some particularly degenerate cases (T4):

                                         Time       CPU      Iterations
Before (4 partitions)
  4GB, 4 columns, no splits              43.3 ms    43.3 ms      8       bytes_per_second=46.1738G/s
After
  4GB, 4 columns, no splits              10.1 ms    10.1 ms      8       bytes_per_second=198.642G/s

Before (2 partitions)
  1GB, 1 column + validity, no splits     114 ms     114 ms      8       bytes_per_second=17.5212G/s
After
  1GB, 1 column + validity, no splits    10.5 ms    10.6 ms      8       bytes_per_second=189.784G/s

Commit message (truncated): "…opied over all the SMs as best it can. Previously, it was vulnerable to low numbers of splits/columns or large discrepancies between column sizes (e.g., large string columns combined with columns of shorts)."
@nvdbaranec nvdbaranec requested a review from a team as a code owner November 23, 2021 00:02
@nvdbaranec nvdbaranec added the libcudf (Affects libcudf (C++/CUDA) code), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Nov 23, 2021
@nvdbaranec nvdbaranec added this to PR-WIP in v22.02 Release via automation Nov 23, 2021
@nvdbaranec nvdbaranec added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label Nov 23, 2021
@hyperbolic2346 (Contributor) left a comment

Have you compared performance for a non-degenerate case?

Review thread on cpp/src/copying/contiguous_split.cu, lines 820 to 823:
int device;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);  // queried for prop.multiProcessorCount (the SM count)
Contributor

I do not recommend basing this on the exact SM count. A more canonical way to do this kind of load balancing is to just pick a partition size and divide up your input data according to that partition size and then distribute those partitions among CTAs.
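For illustration only, a minimal sketch of that idea (hypothetical helper, not libcudf code): the number of copy blocks is derived from a fixed partition size rather than from the SM count.

  #include <cstddef>

  // hypothetical helper, not libcudf code: number of thread blocks needed when the
  // input is carved into fixed-size partitions, independent of the GPU's SM count
  std::size_t num_copy_blocks(std::size_t total_bytes,
                              std::size_t partition_size = std::size_t{1} << 20)
  {
    return (total_bytes + partition_size - 1) / partition_size;  // ceil division
  }

Each CTA then copies the partition whose index matches its blockIdx.x, and the hardware scheduler takes care of keeping the SMs busy.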

Contributor

I suggest looking at the load balancing algorithm @elstehle developed for NVIDIA/cub#297.

Contributor Author

Are you suggesting changing contiguous split to use cooperative groups? That's kind of a massive change.

Contributor

No. My main thing is to avoid the cudaGetDeviceProperties call and to avoid basing the load balancing on the exact SM count. That isn't necessary. As is, this code calculates a chunk size based on the number of SMs and chunks the input data accordingly. I'm saying to pick a chunk size independent of the # of SMs (based on empirical results from benchmarking).

That could still lead to load imbalance if the incoming chunks are already very small, so you'd really want to do something where you can aggregate incoming chunks up to your chunk size as well. At that point you'd be in the territory of the load balancing algorithm @elstehle developed in NVIDIA/cub#297, which is why I suggested taking a look.
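A rough sketch of that aggregation/mapping idea, with assumed names and example data (not the cub implementation): take an exclusive prefix sum of the buffer sizes, then let each fixed-size work item find its source buffer with a binary search over those offsets.

  #include <thrust/binary_search.h>
  #include <thrust/device_vector.h>
  #include <thrust/scan.h>
  #include <cstddef>
  #include <vector>

  // hypothetical sketch: which buffer does the work item starting at byte offset
  // `work_item_start` belong to, given the per-buffer sizes in bytes?
  std::size_t owning_buffer(std::vector<std::size_t> const& h_sizes, std::size_t work_item_start)
  {
    thrust::device_vector<std::size_t> sizes(h_sizes.begin(), h_sizes.end());
    thrust::device_vector<std::size_t> offsets(sizes.size());
    // exclusive prefix sum of sizes -> starting byte offset of each buffer
    thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin());
    // the owning buffer is the last one whose start offset is <= work_item_start
    auto it = thrust::upper_bound(offsets.begin(), offsets.end(), work_item_start);
    return static_cast<std::size_t>(it - offsets.begin()) - 1;
  }

  // e.g. owning_buffer({10, 3, 500, 7}, 256) == 2, since buffer 2 covers bytes [13, 513)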

@elstehle (Contributor) commented Dec 8, 2021

Edit: Jake beat me to it. I don't think @jrhemstad necessarily referred to using cooperative groups. I think Jake rather referred to the scenario where copying chunk 0 (on SM 0) takes (considerably) longer than copying chunk 1 (on SM 1). Assume we had just 2 SMs and we were copying 2 chunks of 1 GB each, and for some reason copying chunk 0 is 25% slower. On a vertical time scale we'd be looking at something like this (byte ranges in MB):

t        SM0                       SM1
0.0 ms   Buff#0.bytes:[0,200)      Buff#1.bytes:[0,250)
0.1 ms   Buff#0.bytes:[200,400)    Buff#1.bytes:[250,500)
0.2 ms   Buff#0.bytes:[400,600)    Buff#1.bytes:[500,750)
0.3 ms   Buff#0.bytes:[600,800)    Buff#1.bytes:[750,1000)
0.4 ms   Buff#0.bytes:[800,1000)   (idle)

So from 0.4 ms until 0.5 ms, SM1 sits idle, when in the best case it could be helping to copy Buff#0 once it has finished its own chunk.

So instead of splitting the total work into roughly num_sms pieces, we could subdivide it into smaller "work items" of, say, 100 MB, and have the thread block with blockIdx.x work on the blockIdx.x-th such work item. That allows the GPU's scheduler to schedule a new thread block at the point where SM1 would otherwise be sitting idle. The smaller the work items, the smaller the load imbalance. The 100 MB here is what Jake is referring to as the "partition size". I'm not yet sure what a good choice is, probably something on the order of 1 MB? If we assume copy throughput to be around 250 GB/s, that would make the worst-case load imbalance around 4 us (which seems acceptable). The trade-off being: the smaller the partition size, the more involved the computation of the assignment from partitions to the byte range within a buffer (involved = run time & memory capacity for materializing the assignments).
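A minimal CUDA sketch of that scheme, assuming a single flat source/destination buffer and hypothetical names (the real kernel additionally has to map each work item back to one of many buffers):

  #include <cstddef>

  // hypothetical sketch, not the PR's kernel: one thread block per fixed-size work item
  // 1 MB at ~250 GB/s is roughly 4 us of worst-case idle tail, per the estimate above
  constexpr std::size_t work_item_bytes = std::size_t{1} << 20;

  __global__ void copy_work_items(char const* src, char* dst, std::size_t total_bytes)
  {
    std::size_t const begin = static_cast<std::size_t>(blockIdx.x) * work_item_bytes;
    std::size_t const end =
      begin + work_item_bytes < total_bytes ? begin + work_item_bytes : total_bytes;
    // every thread of the block cooperates on this one work item
    for (std::size_t i = begin + threadIdx.x; i < end; i += blockDim.x) {
      dst[i] = src[i];
    }
  }

  // launch: one block per work item, so the number of schedulable blocks no longer
  // depends on (columns * splits), and an SM that finishes early picks up the next item
  // auto const num_items = (total_bytes + work_item_bytes - 1) / work_item_bytes;
  // copy_work_items<<<num_items, 256, 0, stream>>>(src, dst, total_bytes);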

Contributor

> At that point you'd be in the territory of the load balancing algorithm @elstehle developed in NVIDIA/cub#297, which is why I suggested taking a look.

@jrhemstad, do you think we should consider the scenario where a buffer is too small to be copied by a whole thread block (i.e., we need sub-thread block granularity)? If we do, I will try to better understand how we can refactor BatchMemcpy to meet this use case, because I think otherwise this will become too complex to have the logic here.

Contributor Author

I see. The way this works is that it does a pretty good job of making sure the # of bytes being copied by each block is the same. It's not always exactly the same, but it's quite close. So the "outlier" case is typically "many equally-sized buffers, plus a number of much, much smaller ones", not "many equally-sized buffers, plus a number of much, much larger ones". The idea being: the much smaller ones end up churning through 1 SM while the remainder take up an approximately equal amount of time on the rest.

If I'm reading this right, you're thinking maybe just say: "divide this up into 1 MB pieces across the board". As you mention, this will increase the time spent running the setup code, but maybe not too badly.

I'll try this as well as the idea from a few days ago: using some multiple of the # of SMs to squeeze the difference between buffers that get repartitioned and the ones that don't (typically just validity) and end up being small outliers.

@nvdbaranec (Contributor, Author) commented Jan 5, 2022

So I tried out a couple of things:

  • Uniform subdivision of buffers at various levels of granularity ranging from 64 KB to 2 MB.

  • Uniform subdivision of buffers based on a larger multiple of SM count (8), with a minimum size of 256 KB.

These all worked out roughly the same across the various benchmarks of interest; however, they did edge out the implementation in the existing PR by a tiny bit:

4GB, 4 columns, no splits              198.642 G/s  ->  208-212 G/s
1GB, 1 column + validity, no splits    189.784 G/s  ->  211-213 G/s

The new code (both subdivisions):

  // A: straight constant
  auto const desired_chunk_size = size_t{1 * 1024 * 1024};

  // B: larger multiple of SM count with uniform subdivision across all columns
  auto const num_sms = prop.multiProcessorCount * 8;
  auto const desired_chunk_size = max(size_t{256 * 1024}, util::round_up_safe(total_bytes / (num_sms), split_align));

  thrust::transform(
    rmm::exec_policy(stream),
    _d_dst_buf_info,
    _d_dst_buf_info + num_bufs,
    chunks.begin(),
    [desired_chunk_size] __device__(dst_buf_info const& buf) {
      // how many chunks do we want to subdivide this buffer into
      size_t const bytes = buf.num_elements * buf.element_size;

      // can happen for things like lists and strings (the root columns store no data)
      if (bytes == 0) { return thrust::pair<size_t, size_t>{1, 0}; }
      size_t const num_chunks = max(size_t{1}, util::round_up_safe(bytes, desired_chunk_size) / desired_chunk_size);

      // NOTE: leaving chunk size as a separate parameter for future tuning possibilities,
      // even though in the current implementation it will be a constant.
      return thrust::pair<size_t, size_t>{num_chunks, desired_chunk_size};
    });
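(As a purely hypothetical continuation, not the actual PR code, the resulting (num_chunks, chunk_size) pairs could then be expanded into a starting chunk index per buffer with an exclusive scan over the chunk counts:)

  // hypothetical continuation, not the actual PR code
  rmm::device_uvector<std::size_t> chunk_counts(num_bufs, stream);
  thrust::transform(rmm::exec_policy(stream),
                    chunks.begin(),
                    chunks.begin() + num_bufs,
                    chunk_counts.begin(),
                    [] __device__(thrust::pair<size_t, size_t> const& c) { return c.first; });

  rmm::device_uvector<std::size_t> chunk_offsets(num_bufs, stream);
  thrust::exclusive_scan(rmm::exec_policy(stream),
                         chunk_counts.begin(),
                         chunk_counts.end(),
                         chunk_offsets.begin());

  // a copy block with global chunk index k can then find its source buffer with an
  // upper_bound over chunk_offsets (minus one) -- the same prefix-sum + binary-search
  // pattern sketched earlier in the thread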

We'll want to run this against the Spark benchmarks again, but it seems like a reasonable change. Does this seem more in line with what you guys are thinking?

Contributor

Looks good 👍 Thanks for making the changes and evaluation. Performance numbers look robust (at least amongst the two evaluated cases). Hope we can confirm this with additional Spark benchmarks.

Contributor Author

Rob did a run on Friday with the new code. It does carve out some additional gains for us. Code updated.

@elstehle elstehle self-requested a review November 29, 2021 16:46
@codecov (bot) commented Nov 29, 2021

Codecov Report

Merging #9755 (717ab92) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02    #9755      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20480     +175     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18346     +171     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@vyasr (Contributor) left a comment

Some basic comments pending changes to the load balancing algorithm.

v22.02 Release automation moved this from PR-WIP to PR-Needs review Nov 30, 2021
@hyperbolic2346 (Contributor) left a comment

Looking like a great speedup here. Note that copyrights need to be updated as well.

@vyasr (Contributor) left a comment

LGTM

@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label Jan 13, 2022
v22.02 Release automation moved this from PR-Needs review to PR-Reviewer approved Jan 13, 2022
@nvdbaranec (Contributor, Author):
@gpucibot merge

@rapids-bot rapids-bot bot merged commit c07fdab into rapidsai:branch-22.02 Jan 13, 2022
v22.02 Release automation moved this from PR-Reviewer approved to Done Jan 13, 2022