[REVIEW] Removing local memory operations from computeSplitKernel and other optimizations #4083
Conversation
Can one of the admins verify this patch?
add to allowlist
Here are GBM-bench results for this PR: training time improves by 11.2%, 9.5%, and 6.8% for …
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #4083 +/- ##
===============================================
Coverage ? 85.81%
===============================================
Files ? 231
Lines ? 18269
Branches ? 0
===============================================
Hits ? 15677
Misses ? 2592
Partials ? 0
Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
LGTM, tests for binary search would be good. Are we putting this on 21.08 or should it be on 21.10?
@@ -373,17 +368,27 @@ __global__ void computeSplitKernel(BinT* hist,
auto row = input.rowids[i];
auto d = input.data[row + coloffset];
auto label = input.labels[row];
IdxT bin = thrust::lower_bound(thrust::seq, sbins, sbins + nbins, d) - sbins;
BinT::IncrementHistogram(pdf_shist, nbins, bin, label);
IdxT start = 0;
I'm not too bothered about whether we use thrust or a custom function here. If using a custom version I think it should be a function and it needs to be tested. The advantage of thrust is that it's one line of code and we can assume it's correct.
@venkywonka will help here with writing a test case.
@@ -77,17 +77,20 @@ struct Split {
/**
 * @brief updates the current split if the input gain is better
 */
DI void update(const SplitT& other) volatile
DI bool update(const SplitT& other)
I'm fine with this for now, but in general the split update function is relatively hard to understand and contains a bunch of custom code, so it's high maintenance.
I guess the alternative is to write out all of the split proposals to global memory, then use a cub segmented reduce or a scan to get the best split for each node. Or maybe do the reduction in the node split kernel?
Writing the splits to global memory would incur overhead. We can do the first-level update reduction at block level using cub calls, and for the second level we can use global memory. The only strictly required comparison is the one comparing best_metric_val. The other two comparisons are tie-breaks for when best_metric_val is equal for two splits; without them, the non-determinism in the code increases. Even if we do use a cub-based reduction, we would still need these comparisons in the form of a functor, right? Or are you referring to the whole evalBestSplit() here?
Thanks @venkywonka for the changes.
@gpucibot merge
…timizations (rapidsai#4083)

Currently computeSplitKernel uses local memory for split-related operations, as is evident from the LDL/STL instructions in the SASS for the kernel. This PR updates the split operations so that local memory usage is removed; after this change I observed improved kernel performance. The PR also replaces the Thrust binary search with direct binary search code, which gives a small performance improvement, and removes an unnecessary call to __syncthreads(). GBM-bench performance results to be posted soon.

Update 1: @venkywonka reduced the shared memory requirement by removing the duplicate copies of bins.

Authors:
- Vinay Deshpande (https://github.com/vinaydes)
- Venkat (https://github.com/venkywonka)

Approvers:
- Rory Mitchell (https://github.com/RAMitchell)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4083