
Performance optimization of RF split kernels by removing empty cycles #3818

Merged

Conversation

vinaydes
Contributor

@vinaydes vinaydes commented May 3, 2021

The compute split kernels for classification and regression end up doing a lot of work that is not required. This PR removes many of these empty work cycles by making the following changes:

  1. For computing the split for a node, launch a number of thread blocks proportional to the number of samples in that node. Before this PR the number of thread blocks was fixed for all nodes (see the launch sketch after this list).
  2. Check whether a node is a leaf before launching the kernel, and if it is, do not launch any thread blocks for it.
  3. Don't call update on the split if no valid split is found for a feature.
  4. Skip the round trip to global memory before evaluating the best split if only one thread block is operating on a node (see the kernel sketch after the timings below).
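
As a rough illustration of items 1 and 2, here is a minimal host-side sketch. It is not the actual cuML code: `Node`, `SAMPLES_PER_BLOCK`, `TPB`, and `computeSplitKernel` are hypothetical placeholders, and the real kernels take many more arguments.

```
#include <cuda_runtime.h>
#include <vector>

struct Node {
  int start;    // first sample index owned by this node
  int count;    // number of samples in this node
  bool isLeaf;  // decided on the host before launching anything
};

// Placeholder kernel: per-block histogram building and split evaluation would go here.
__global__ void computeSplitKernel(const Node* nodes, int nodeId) {}

void launchSplits(const std::vector<Node>& hNodes, const Node* dNodes, cudaStream_t stream)
{
  constexpr int TPB = 256;                 // threads per block (assumed)
  constexpr int SAMPLES_PER_BLOCK = 4096;  // tuning knob (assumed)
  for (int n = 0; n < static_cast<int>(hNodes.size()); ++n) {
    if (hNodes[n].isLeaf) continue;  // item 2: launch no thread blocks for leaf nodes
    // item 1: grid size proportional to this node's sample count instead of a fixed value
    int nBlocks = (hNodes[n].count + SAMPLES_PER_BLOCK - 1) / SAMPLES_PER_BLOCK;
    computeSplitKernel<<<nBlocks, TPB, 0, stream>>>(dNodes, n);
  }
}
```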

Performance improvement observed

Classification problem on a synthetic dataset: computeSplitClassificationKernel timings

branch-0.20: 22.91 seconds
This branch:  5.27 seconds
Gain: 4.35x

Regression problem on a synthetic dataset: computeSplitRegressionKernel timings

branch-0.20: 36.46 seconds
This branch: 34.03 seconds
Gain: 1.07x

Empty cycles are not the major performance issue in the regression code, therefore we do not see a large improvement there currently.
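
For item 4, here is a minimal, hypothetical CUDA sketch (not the actual cuML kernel; the histogram and "best bin" logic are stand-ins for the real gain computation). When a single thread block handles a node, its shared-memory histogram is already the node's full histogram, so the block can evaluate the split locally and skip the global-memory round trip used in the multi-block path.

```
#include <cuda_runtime.h>

// Launch with dynamic shared memory, e.g.:
//   splitFromHistogramSketch<<<nBlocks, 256, nBins * sizeof(int), stream>>>(...);
__global__ void splitFromHistogramSketch(const int* sampleBins, int nSamples,
                                         int nBins, int* globalHist, int* bestBinOut)
{
  extern __shared__ int shist[];  // per-block histogram in shared memory
  for (int b = threadIdx.x; b < nBins; b += blockDim.x) shist[b] = 0;
  __syncthreads();
  // grid-stride loop over the node's samples
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < nSamples;
       i += blockDim.x * gridDim.x) {
    atomicAdd(&shist[sampleBins[i]], 1);
  }
  __syncthreads();
  if (gridDim.x == 1) {
    // Single-block fast path: shared memory already holds the full histogram,
    // so evaluate the split directly and skip global memory.
    if (threadIdx.x == 0) {
      int best = 0;
      for (int b = 1; b < nBins; ++b)
        if (shist[b] > shist[best]) best = b;  // stand-in for a real gain metric
      *bestBinOut = best;
    }
  } else {
    // Multi-block path: accumulate partial histograms into global memory;
    // a later step (not shown) reads them back to pick the best split.
    for (int b = threadIdx.x; b < nBins; b += blockDim.x)
      atomicAdd(&globalHist[b], shist[b]);
  }
}
```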

@hcho3 hcho3 self-requested a review May 18, 2021 23:50

// variables
auto end = range_start + range_len;
auto len = nbins * 2;
// auto len = nbins * 2;
Contributor

Can we simply remove this line instead of commenting it?

Contributor Author

Yes. I need to redo the regression part anyway after merging with #3845. I'll remove it then.

auto cdf_spred_len = 2 * nbins;
IdxT stride = blockDim.x * gridDim.x;
IdxT tid = threadIdx.x + blockIdx.x * blockDim.x;
// auto cdf_spred_len = 2 * nbins;
Contributor

Can we simply remove this line instead of commenting it?

Contributor Author

Same as above.

@@ -655,7 +708,7 @@ __global__ void computeSplitRegressionKernel(
__syncthreads();

/* Make a second pass over the data to compute gain */

auto coloffset = col * input.M;
Contributor

Is coloffset used anywhere in the kernel?

Contributor

seems to be used in L716 and L729

@vinaydes vinaydes requested a review from a team as a code owner May 25, 2021 17:00
@github-actions github-actions bot added the Cython / Python Cython or Python issue label May 25, 2021
@github-actions github-actions bot removed the Cython / Python Cython or Python issue label May 25, 2021
@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@29a8390). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.06    #3818   +/-   ##
===============================================
  Coverage                ?   85.43%           
===============================================
  Files                   ?      226           
  Lines                   ?    17281           
  Branches                ?        0           
===============================================
  Hits                    ?    14764           
  Misses                  ?     2517           
  Partials                ?        0           
Flag Coverage Δ
dask 48.96% <0.00%> (?)
non-dask 77.41% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29a8390...64c9cb5. Read the comment docs.

@vinaydes
Contributor Author

gbm-bench results for this PR
Datasets benchmarked
For classification: airline, Fraud, Higgs, Covtype, Epsilon
For regression: airline_regression, year
Code used: https://github.com/NVIDIA/gbm-bench
Number of estimators: 100 estimators trained for Fraud, Higgs, and Epsilon; 50 estimators for the rest of the datasets.
max_samples is set to 0.5 for all the experiments.

Accuracy remains unchanged for both classification and regression

[charts: accuracy comparison for classification and regression]

Fit time improves for both classification and regression

[chart: fit time comparison for classification and regression]

Removing sklearn to zoom in on the impact of this PR

[chart: fit time comparison excluding sklearn]

Improvement in percentage terms

[chart: fit time improvement in percentage terms]

Covtype and Fraud are relatively tiny datasets, so their fit time is not dominated by the computeSplit kernels; instead, the nodeSplit kernel becomes the dominant one (>80% of GPU time) at that size. Therefore this PR has little to no impact on them.

@teju85
Member

teju85 commented May 26, 2021

@JohnZed or @dantegd can we get python-side approval so that this PR can be merged?

@dantegd
Member

dantegd commented May 26, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f5a3483 into rapidsai:branch-21.06 May 26, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Performance optimization of RF split kernels by removing empty cycles (rapidsai#3818)

The compute split kernels for classification and regression end up doing a lot of work that is not required. This PR removes many of these empty work cycles by making the following changes:
1. For computing the split for a node, launch a number of thread blocks proportional to the number of samples in that node. Before this PR the number of thread blocks was fixed for all nodes.
2. Check whether a node is a leaf before launching the kernel, and if it is, do not launch any thread blocks for it.
3. Don't call update on the split if no valid split is found for a feature.
4. Skip the round trip to global memory before evaluating the best split if only one thread block is operating on a node.

**Performance improvement observed**

Classification problem on a synthetic dataset: `computeSplitClassificationKernel` timings
```
branch-0.20: 22.91 seconds
This branch:  5.27 seconds
Gain: 4.35x
```

Regression problem on a synthetic dataset: `computeSplitRegressionKernel` timings
```
branch-0.20: 36.46 seconds
This branch: 34.03 seconds
Gain: 1.07x
```
Empty cycles are not the major performance issue in the regression code, therefore we do not see a large improvement there currently.

Authors:
  - Vinay Deshpande (https://github.com/vinaydes)
  - Rory Mitchell (https://github.com/RAMitchell)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Thejaswi. N. S (https://github.com/teju85)
  - Philip Hyunsu Cho (https://github.com/hcho3)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#3818
Labels
5 - Merge After Dependencies (Depends on another PR: do not merge out of order)
CUDA/C++
improvement (Improvement / enhancement to an existing function)
non-breaking (Non-breaking change)
Perf (Related to runtime performance of the underlying code)