[REVIEW] Removing local memory operations from computeSplitKernel and other optimizations #4083

Merged

Conversation

vinaydes
Contributor

@vinaydes vinaydes commented Jul 22, 2021

Currently `computeSplitKernel` uses local memory for split-related operations, which is evident from the LDL/STL instructions in the SASS for the kernel. This PR restructures the split operations so that local memory is no longer used; after this change I observed improved kernel performance.
Along with this, the PR also replaces the Thrust binary search with direct binary search code, which I observed gives a small performance improvement. An unnecessary call to `__syncthreads()` is also removed.
GBM-bench performance results to be posted soon.
Update 1: @venkywonka reduced the shared memory requirement by removing the duplicate copies of bins.
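For readers unfamiliar with how local memory shows up, here is a minimal, hypothetical sketch (not the actual `computeSplitKernel` code) of the kind of pattern that produces LDL/STL instructions, next to a register-friendly rewrite:

```cuda
// Hypothetical illustration only. A thread-local array indexed with a
// runtime value cannot be kept in registers, so the compiler places it
// in local memory (visible as LDL/STL instructions in the SASS).
__device__ float pickSpilled(const float* in, int idx)
{
  float buf[8];
  for (int i = 0; i < 8; ++i) buf[i] = in[i];
  return buf[idx];  // runtime index forces a local-memory load
}

// If every access uses a compile-time-constant index (guaranteed here
// by full unrolling), the same array can live entirely in registers.
__device__ float pickInRegisters(const float* in, int idx)
{
  float buf[8];
#pragma unroll
  for (int i = 0; i < 8; ++i) buf[i] = in[i];
  float out = buf[0];
#pragma unroll
  for (int i = 1; i < 8; ++i)
    out = (i == idx) ? buf[i] : out;  // predicated select, no dynamic indexing
  return out;
}
```

Dumping the SASS for both (e.g. with `cuobjdump -sass`) typically shows LDL/STL instructions only in the first version.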

@vinaydes vinaydes requested a review from a team as a code owner July 22, 2021 16:40
@GPUtester
Contributor

Can one of the admins verify this patch?

@dantegd
Member

dantegd commented Jul 22, 2021

add to allowlist

@caryr35 caryr35 added this to PR-WIP in v21.08 Release via automation Jul 26, 2021
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v21.08 Release Jul 26, 2021
@vinaydes
Contributor Author

Here are GBM bench results for this PR
[Benchmark charts: volatile-result-32, volatile-result-24, volatile-result-18]

Training time improves by 11.2%, 9.5%, and 6.8% for max_depth 32, 24, and 18 respectively. The performance gain is larger at higher depths, and regression datasets seem to gain more than classification datasets. Accuracy is unchanged by this change.
The GBM-bench parameter file used for benchmarking can be found here.

@vinaydes vinaydes changed the title [WIP] Removing local memory operations from computeSplitKernel and other optimizations [REVIEW] Removing local memory operations from computeSplitKernel and other optimizations Jul 27, 2021
@dantegd dantegd added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jul 27, 2021
@dantegd
Member

dantegd commented Jul 27, 2021

rerun tests

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@cb32219).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.08    #4083   +/-   ##
===============================================
  Coverage                ?   85.81%           
===============================================
  Files                   ?      231           
  Lines                   ?    18269           
  Branches                ?        0           
===============================================
  Hits                    ?    15677           
  Misses                  ?     2592           
  Partials                ?        0           
Flag Coverage Δ
dask 48.17% <0.00%> (?)
non-dask 78.28% <0.00%> (?)

Flags with carried forward coverage won't be shown.



Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb32219...b7a7265.

Contributor

@RAMitchell RAMitchell left a comment


LGTM, tests for binary search would be good. Are we putting this on 21.08 or should it be on 21.10?

@@ -373,17 +368,27 @@ __global__ void computeSplitKernel(BinT* hist,
auto row = input.rowids[i];
auto d = input.data[row + coloffset];
auto label = input.labels[row];
IdxT bin = thrust::lower_bound(thrust::seq, sbins, sbins + nbins, d) - sbins;
BinT::IncrementHistogram(pdf_shist, nbins, bin, label);
IdxT start = 0;
Contributor


I'm not too bothered about whether we use thrust or a custom function here. If using a custom version I think it should be a function and it needs to be tested. The advantage of thrust is that it's one line of code and we can assume it's correct.
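For reference, a direct device-side search is small enough to live in its own tested function; a minimal sketch of what it might look like (the name and signature here are illustrative, not the PR's actual code), matching `thrust::lower_bound` semantics:

```cuda
// Returns the index of the first element of the sorted range
// [arr, arr + len) that is not less than val, i.e. the same result as
// thrust::lower_bound(thrust::seq, arr, arr + len, val) - arr.
template <typename DataT, typename IdxT>
__host__ __device__ IdxT lowerBound(const DataT* arr, IdxT len, DataT val)
{
  IdxT start = 0;
  IdxT end   = len;
  while (start < end) {
    IdxT mid = start + (end - start) / 2;
    if (arr[mid] < val) {
      start = mid + 1;  // the answer lies strictly to the right of mid
    } else {
      end = mid;        // arr[mid] >= val, so mid is still a candidate
    }
  }
  return start;
}
```

Marking it `__host__ __device__` also keeps it callable from plain host-side unit tests.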

Contributor Author


@venkywonka will help here with writing a test case.
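One possible shape for that test is to compare the custom search against `std::lower_bound` on sorted random data, including the exact bin edges where the equality behavior matters; a minimal Google Test sketch, assuming a host-callable `lowerBound` like the one sketched above:

```cuda
#include <gtest/gtest.h>

#include <algorithm>
#include <random>
#include <vector>

TEST(LowerBoundTest, MatchesStdLowerBound)
{
  std::mt19937 rng(42);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);

  // Sorted bin edges, as the kernel would see them in shared memory.
  std::vector<float> bins(128);
  for (auto& b : bins) b = dist(rng);
  std::sort(bins.begin(), bins.end());

  for (int i = 0; i < 1000; ++i) {
    // The first 128 queries hit the bin edges exactly; the rest are random.
    float q = (i < 128) ? bins[i] : dist(rng);
    int expected = static_cast<int>(
      std::lower_bound(bins.begin(), bins.end(), q) - bins.begin());
    EXPECT_EQ(expected, lowerBound(bins.data(), static_cast<int>(bins.size()), q));
  }
}
```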

@@ -77,17 +77,20 @@ struct Split {
/**
* @brief updates the current split if the input gain is better
*/
- DI void update(const SplitT& other) volatile
+ DI bool update(const SplitT& other)
Contributor


I'm fine with this for now, but in general the split update function is relatively hard to understand and contains a bunch of custom code, so it's high maintenance.

I guess the alternative is to write out all of the split proposals to global memory, then use a cub segmented reduce, or a scan, to get the best split for each node. Or maybe do the reduction in the node split kernel?

Contributor Author


Writing the splits to global memory would incur overhead. We could do the first-level update reduction at block level using cub calls, and use global memory only for the second level. The only strictly required comparison is the one on best_metric_val; the other two comparisons are tie-breakers for when the best_metric_val of two splits is equal, and without them the non-determinism in the code increases. Even if we do use a cub-based reduction, we would still need these comparisons in the form of a functor, right? Or are you referring to the whole evalBestSplit() here?
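To make the functor point concrete, a rough sketch of a cub-based block-level reduction carrying those comparisons might look like this (`SplitLite`, its fields, and the kernel shape are illustrative assumptions, not the actual cuML code):

```cuda
#include <cub/block/block_reduce.cuh>

// Illustrative stand-in for the real Split type.
struct SplitLite {
  float best_metric_val;  // gain of the proposed split
  int colid;              // tie-break 1: prefer the lower column id
  float quesval;          // tie-break 2: prefer the lower threshold
};

// The update() comparisons expressed as a reduction functor.
struct BestSplitOp {
  __device__ SplitLite operator()(const SplitLite& a, const SplitLite& b) const
  {
    if (a.best_metric_val != b.best_metric_val)
      return (a.best_metric_val > b.best_metric_val) ? a : b;
    if (a.colid != b.colid) return (a.colid < b.colid) ? a : b;  // tie-break
    return (a.quesval <= b.quesval) ? a : b;                     // tie-break
  }
};

// First-level reduction: one deterministic best split per block; the
// per-block results would then be combined through global memory.
template <int TPB>
__global__ void reduceBestSplit(const SplitLite* proposals, SplitLite* best_out)
{
  using BlockReduce = cub::BlockReduce<SplitLite, TPB>;
  __shared__ typename BlockReduce::TempStorage temp;

  SplitLite mine = proposals[blockIdx.x * TPB + threadIdx.x];
  SplitLite best = BlockReduce(temp).Reduce(mine, BestSplitOp{});
  if (threadIdx.x == 0) best_out[blockIdx.x] = best;
}
```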

@vinaydes
Contributor Author

Thanks @venkywonka for the changes.

v21.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 28, 2021
@dantegd
Member

dantegd commented Jul 28, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9406d53 into rapidsai:branch-21.08 Jul 28, 2021
v21.08 Release automation moved this from PR-Reviewer approved to Done Jul 28, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Removing local memory operations from computeSplitKernel and other optimizations (rapidsai#4083)

Authors:
  - Vinay Deshpande (https://github.com/vinaydes)
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4083
Labels
CUDA/C++, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)