[REVIEW] Removing local memory operations from computeSplitKernel and other optimizations #4083

Merged

Conversation

vinaydes
Contributor

@vinaydes vinaydes commented Jul 22, 2021

Currently `computeSplitKernel` uses local memory for split-related operations, which is evident from the LDL/STL instructions in the SASS for the kernel. This PR restructures the split operations so that local memory is no longer used; after this change I observed improved kernel performance.
Along with this, the PR also replaces the Thrust binary search with direct binary search code, which I observed gives a small performance improvement. An unnecessary call to `__syncthreads()` is also removed.
GBM-bench performance results to be posted soon.
Update 1: @venkywonka reduced the shared memory requirement by removing the duplicate copies of bins.
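For readers unfamiliar with how local memory shows up, here is a minimal, hypothetical sketch (not the actual `computeSplitKernel` code) of the kind of pattern that produces LDL/STL instructions, next to a register-friendly rewrite:

```cuda
// Hypothetical illustration only. A thread-local array indexed with a
// runtime value cannot be kept in registers, so the compiler places it
// in local memory (visible as LDL/STL instructions in the SASS).
__device__ float pickSpilled(const float* in, int idx)
{
  float buf[8];
  for (int i = 0; i < 8; ++i) buf[i] = in[i];
  return buf[idx];  // runtime index forces a local-memory load
}

// If every access uses a compile-time-constant index (guaranteed here
// by full unrolling), the same array can live entirely in registers.
__device__ float pickInRegisters(const float* in, int idx)
{
  float buf[8];
#pragma unroll
  for (int i = 0; i < 8; ++i) buf[i] = in[i];
  float out = buf[0];
#pragma unroll
  for (int i = 1; i < 8; ++i)
    out = (i == idx) ? buf[i] : out;  // predicated select, no dynamic indexing
  return out;
}
```

Dumping the SASS for both (e.g. with `cuobjdump -sass`) typically shows LDL/STL instructions only in the first version.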

@vinaydes vinaydes requested a review from a team as a code owner July 22, 2021 16:40
@GPUtester
Contributor

Can one of the admins verify this patch?

@dantegd
Member

dantegd commented Jul 22, 2021

add to allowlist

@caryr35 caryr35 added this to PR-WIP in v21.08 Release via automation Jul 26, 2021
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v21.08 Release Jul 26, 2021
@vinaydes
Contributor Author

Here are GBM bench results for this PR
[Benchmark charts: volatile-result-32, volatile-result-24, volatile-result-18]

Training time improves by 11.2%, 9.5%, and 6.8% for max_depth 32, 24, and 18 respectively. The performance gain is larger at higher depths, and regression datasets seem to gain more than classification datasets. Accuracy is unchanged by this change.
The GBM-bench parameter file used for benchmarking can be found here.

@vinaydes vinaydes changed the title [WIP] Removing local memory operations from computeSplitKernel and other optimizations [REVIEW] Removing local memory operations from computeSplitKernel and other optimizations Jul 27, 2021
@dantegd dantegd added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jul 27, 2021
@dantegd
Member

dantegd commented Jul 27, 2021

rerun tests

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@cb32219).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.08    #4083   +/-   ##
===============================================
  Coverage                ?   85.81%           
===============================================
  Files                   ?      231           
  Lines                   ?    18269           
  Branches                ?        0           
===============================================
  Hits                    ?    15677           
  Misses                  ?     2592           
  Partials                ?        0           
Flag Coverage Δ
dask 48.17% <0.00%> (?)
non-dask 78.28% <0.00%> (?)

Flags with carried forward coverage won't be shown.



Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb32219...b7a7265.

Contributor

@RAMitchell RAMitchell left a comment


LGTM, tests for binary search would be good. Are we putting this on 21.08 or should it be on 21.10?

@@ -373,17 +368,27 @@ __global__ void computeSplitKernel(BinT* hist,
auto row = input.rowids[i];
auto d = input.data[row + coloffset];
auto label = input.labels[row];
IdxT bin = thrust::lower_bound(thrust::seq, sbins, sbins + nbins, d) - sbins;
BinT::IncrementHistogram(pdf_shist, nbins, bin, label);
IdxT start = 0;
Contributor


I'm not too bothered about whether we use thrust or a custom function here. If using a custom version I think it should be a function and it needs to be tested. The advantage of thrust is that it's one line of code and we can assume it's correct.
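For reference, a direct device-side search is small enough to live in its own tested function; a minimal sketch of what it might look like (the name and signature here are illustrative, not the PR's actual code), matching `thrust::lower_bound` semantics:

```cuda
// Returns the index of the first element of the sorted range
// [arr, arr + len) that is not less than val, i.e. the same result as
// thrust::lower_bound(thrust::seq, arr, arr + len, val) - arr.
template <typename DataT, typename IdxT>
__host__ __device__ IdxT lowerBound(const DataT* arr, IdxT len, DataT val)
{
  IdxT start = 0;
  IdxT end   = len;
  while (start < end) {
    IdxT mid = start + (end - start) / 2;
    if (arr[mid] < val) {
      start = mid + 1;  // the answer lies strictly to the right of mid
    } else {
      end = mid;        // arr[mid] >= val, so mid is still a candidate
    }
  }
  return start;
}
```

Marking it `__host__ __device__` also keeps it callable from plain host-side unit tests.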

Contributor Author


@venkywonka will help here with writing a test case.
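One possible shape for that test is to compare the custom search against `std::lower_bound` on sorted random data, including the exact bin edges where the equality behavior matters; a minimal Google Test sketch, assuming a host-callable `lowerBound` like the one sketched above:

```cuda
#include <gtest/gtest.h>

#include <algorithm>
#include <random>
#include <vector>

TEST(LowerBoundTest, MatchesStdLowerBound)
{
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);

  // Sorted bin edges, as the kernel would see them in shared memory.
  std::vector<float> bins(128);
  for (auto& b : bins) b = dist(rng);
  std::sort(bins.begin(), bins.end());

  for (int i = 0; i < 1000; ++i) {
    // The first 128 queries hit the bin edges exactly; the rest are random.
    float q = (i < 128) ? bins[i] : dist(rng);
    int expected = static_cast<int>(
      std::lower_bound(bins.begin(), bins.end(), q) - bins.begin());
    EXPECT_EQ(expected, lowerBound(bins.data(), static_cast<int>(bins.size()), q));
  }
}
```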

@@ -77,17 +77,20 @@ struct Split {
/**
* @brief updates the current split if the input gain is better
*/
- DI void update(const SplitT& other) volatile
+ DI bool update(const SplitT& other)
Contributor


I'm fine with this for now, but in general the split update function is relatively hard to understand and contains a bunch of custom code, so it's high maintenance.

I guess the alternative is to write out all of the split proposals to global memory, then use a cub segmented reduce, or a scan, to get the best split for each node. Or maybe do the reduction in the node split kernel?

Contributor Author


Writing the splits to global memory would incur overhead. We could do the first-level update reduction at block level using cub calls, and use global memory only for the second level. The only strictly required comparison is the one on best_metric_val; the other two comparisons are tie-breakers for when the best_metric_val of two splits is equal, and without them the non-determinism in the code increases. Even if we do use a cub-based reduction, we would still need these comparisons in the form of a functor, right? Or are you referring to the whole evalBestSplit() here?
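To make the functor point concrete, a rough sketch of a cub-based block-level reduction carrying those comparisons might look like this (`SplitLite`, its fields, and the kernel shape are illustrative assumptions, not the actual cuML code):

```cuda
#include <cub/block/block_reduce.cuh>

// Illustrative stand-in for the real Split type.
struct SplitLite {
  float best_metric_val;  // gain of the proposed split
  int colid;              // tie-break 1: prefer the lower column id
  float quesval;          // tie-break 2: prefer the lower threshold
};

// The update() comparisons expressed as a reduction functor.
struct BestSplitOp {
  __device__ SplitLite operator()(const SplitLite& a, const SplitLite& b) const
  {
    if (a.best_metric_val != b.best_metric_val)
      return (a.best_metric_val > b.best_metric_val) ? a : b;
    if (a.colid != b.colid) return (a.colid < b.colid) ? a : b;  // tie-break
    return (a.quesval <= b.quesval) ? a : b;                     // tie-break
  }
};

// First-level reduction: one deterministic best split per block; the
// per-block results would then be combined through global memory.
template <int TPB>
__global__ void reduceBestSplit(const SplitLite* proposals, SplitLite* best_out)
{
  using BlockReduce = cub::BlockReduce<SplitLite, TPB>;
  __shared__ typename BlockReduce::TempStorage temp;

  SplitLite mine = proposals[blockIdx.x * TPB + threadIdx.x];
  SplitLite best = BlockReduce(temp).Reduce(mine, BestSplitOp{});
  if (threadIdx.x == 0) best_out[blockIdx.x] = best;
}
```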

@vinaydes
Contributor Author

Thanks @venkywonka for the changes.

v21.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 28, 2021
@dantegd
Member

dantegd commented Jul 28, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9406d53 into rapidsai:branch-21.08 Jul 28, 2021
v21.08 Release automation moved this from PR-Reviewer approved to Done Jul 28, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Removing local memory operations from computeSplitKernel and other optimizations (rapidsai#4083)

Authors:
  - Vinay Deshpande (https://github.com/vinaydes)
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4083
Labels
CUDA/C++, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)