
[REVIEW] RF: code re-organization to enhance build parallelism #4299

Merged: 8 commits merged into rapidsai:branch-22.02 on Dec 6, 2021

Conversation

@venkywonka (Contributor) commented Oct 21, 2021

This PR splits the decision tree kernels into separate translation units (TUs) and explicitly instantiates their templates.
This helps in two ways:

  1. Refactoring the top-level RF/DT code no longer requires recompiling the kernels.
  2. Because the kernels live in separate TUs that are compiled independently and then linked, the build can exploit parallelism (a 4x improvement in rebuild times after touching kernel definitions); see the sketch after this list.
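
As a rough illustration of the pattern (a minimal sketch with hypothetical file and function names, not the actual cuML sources): the template definition lives in an implementation header that only the per-instantiation .cu files include, while every other TU sees just the declaration.

```cpp
// compute_split.cuh -- declaration only. Callers include this header,
// so touching the kernel body never forces them to recompile.
template <typename DataT, typename LabelT>
void launchComputeSplit(const DataT* input, const LabelT* labels, int n_rows);

// compute_split_impl.cuh -- the expensive template definition.
template <typename DataT, typename LabelT>
__global__ void computeSplitKernel(const DataT* input, const LabelT* labels, int n_rows)
{
  /* ... split computation ... */
}

template <typename DataT, typename LabelT>
void launchComputeSplit(const DataT* input, const LabelT* labels, int n_rows)
{
  computeSplitKernel<DataT, LabelT><<<256, 128>>>(input, labels, n_rows);
}

// compute_split_float_int.cu -- one TU per instantiation; nvcc compiles
// each of these in parallel under `make -j` / PARALLEL_LEVEL.
#include "compute_split_impl.cuh"
template void launchComputeSplit<float, int>(const float*, const int*, int);

// compute_split_double_int.cu
#include "compute_split_impl.cuh"
template void launchComputeSplit<double, int>(const double*, const int*, int);
```

Since each explicit instantiation is an ordinary linker symbol, callers link against whichever instantiations exist without ever seeing the kernel body.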

Comparison of rebuild times, measured with `time ./build.sh libcuml -v -n PARALLEL_LEVEL=20` after touching the RF kernels:
(Note: `--ccache` makes no difference here, since after touching the RF kernels the affected objects are new to ccache and not in its hashed index.)

This PR:

real    0m20.054s
user    2m28.436s
sys     0m14.241s

branch-21.12:

real    1m21.197s
user    2m5.751s
sys     0m6.050s

Some other changes include renaming and reorganizing files, pruning headers, and general code cleanup.

Things to do:

- [x] split DT Kernels
- [x] benchmark for regressions

@venkywonka requested review from a team as code owners October 21, 2021 09:37
@venkywonka changed the title from "RF: file re-organization to enhance build parallelism" to "[WIP] RF: file re-organization to enhance build parallelism" Oct 21, 2021
@venkywonka added the improvement (Improvement / enhancement to an existing function) label Oct 21, 2021
@teju85 added the Build or Dep (Issues related to building the code or dependencies), CUDA / C++ (CUDA issue), and non-breaking (Non-breaking change) labels Oct 21, 2021
@caryr35 added this to PR-WIP in v21.12 Release via automation Oct 21, 2021
@caryr35 moved this from PR-WIP to PR-Needs review in v21.12 Release Oct 21, 2021
@venkywonka changed the title from "[WIP] RF: file re-organization to enhance build parallelism" to "[WIP] RF: code re-organization to enhance build parallelism" Oct 22, 2021
@RAMitchell (Contributor) left a comment

I don't like adding a bunch of files, but weighed against long compile times it seems like the lesser of two evils.

I think this should go ahead and we can look at other options later.

@dantegd removed this from PR-Needs review in v21.12 Release Nov 18, 2021
@dantegd added this to PR-WIP in v22.02 Release via automation Nov 18, 2021
@venkywonka (Contributor, Author) commented Nov 24, 2021

No regressions found on GBM-Bench. Algorithmic correctness is unchanged: the splits generated at every iteration and the resulting output trees are identical.
One thing to note: due to the code reorganization, the generated code for computeSplitKernel uses fewer registers than before, which increases occupancy on the GPU, but this does not translate into a significant performance boost (or regression). One way to inspect per-kernel register usage is sketched below.
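
For reference, register and shared-memory usage per kernel can be checked by passing `-Xptxas -v` to nvcc (a generic nvcc flag; the file name and architecture below are illustrative, not from this PR):

```sh
# Ask ptxas to print resource usage for each kernel it compiles.
# Output looks roughly like:
#   ptxas info : Used 40 registers, 352 bytes cmem[0]
nvcc -arch=sm_70 -Xptxas -v -c compute_split_float_int.cu
```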

The numbers below are end-to-end times averaged over 100 runs:
[benchmark comparison chart]

@venkywonka changed the base branch from branch-21.12 to branch-22.02 November 24, 2021 14:38
@venkywonka changed the title from "[WIP] RF: code re-organization to enhance build parallelism" to "[REVIEW] RF: code re-organization to enhance build parallelism" Nov 26, 2021
@venkywonka (Contributor, Author) commented:

rerun tests

@codecov-commenter commented:

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.02@ed0e58c).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-22.02    #4299   +/-   ##
===============================================
  Coverage                ?   85.73%           
===============================================
  Files                   ?      236           
  Lines                   ?    19314           
  Branches                ?        0           
===============================================
  Hits                    ?    16558           
  Misses                  ?     2756           
  Partials                ?        0           
| Flag | Coverage Δ |
|------|------------|
| dask | 46.52% <0.00%> (?) |
| non-dask | 78.62% <0.00%> (?) |

Flags with carried-forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed0e58c...f20989f.

v22.02 Release automation moved this from PR-WIP to PR-Reviewer approved Dec 6, 2021
@dantegd (Member) commented Dec 6, 2021

@gpucibot merge

@rapids-bot (bot) merged commit 4215577 into rapidsai:branch-22.02 Dec 6, 2021
v22.02 Release automation moved this from PR-Reviewer approved to Done Dec 6, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023