
Random forest refactoring #4166

Merged: 41 commits into rapidsai:branch-21.10 on Sep 3, 2021

Conversation


@RAMitchell RAMitchell commented Aug 17, 2021

Summary of the changes:

  • Remove some unused print functions
  • Move validity checks into parameter construction, so parameters are checked by default
  • Remove Node_ID_info struct, we can just use a std::pair
  • Move builder_base.cuh into builder.cuh
  • Remove node.cuh. Use InstanceRange to store this information.
  • Builder.train() directly returns a DT::TreeMetaDataNode<DataT, LabelT> object
  • computeQuantiles is made into a pure function. Some weird usages of smart pointers removed.
  • Unused DataInfo struct removed
  • DecisionTree class member variables removed, member functions made into pure functions (static)
  • Some unnecessary RandomForest member variables removed, destructor removed
  • Some instances of new/delete changed to use std containers
  • Tests for instance counts moved from python to gtest
  • Change indexing type from 32-bit integers to std::size_t
  • Test fil predictions against rf predictions; fixes a case where ties in multi-class prediction were broken inconsistently in RF's CPU predictor

@RAMitchell RAMitchell requested review from a team as code owners August 17, 2021 01:42

std::shared_ptr<DT::TreeMetaDataNode<DataT, LabelT>> train()
{
ML::PUSH_RANGE("Builder::train @builder_base.cuh [batched-levelalgo]");



@venkywonka venkywonka left a comment


LGreatTM 👍🏾


@vinaydes vinaydes left a comment


I think changing each IdxT to size_t is not required and is not good for performance, especially in device code. For example, n_bins would never be more than a few thousand (currently at most 1024).
It is probably safe for now to assume that n_rows and n_cols are both less than 2^32 and that their product (the size of the dataset) fits in size_t (i.e. is less than 2^64 on a 64-bit platform). So variables derived from n_rows and n_cols, such as n_sampled_cols, can stay 32-bit integers; anything derived from the size of the dataset should be size_t.
If we keep the IdxT abstraction throughout the code for integer types (n_rows, n_cols), we can change it to a wider type in the future if needed.

Resolved review threads:
  • cpp/src/randomforest/randomforest.cu (outdated)
  • cpp/src/randomforest/randomforest.cu (outdated)
  • cpp/src/randomforest/randomforest.cuh (outdated)
  • cpp/src/randomforest/randomforest.cuh (outdated)
  • cpp/src/decisiontree/decisiontree.cu
  • cpp/include/cuml/tree/flatnode.h (outdated)
  • cpp/include/cuml/tree/flatnode.h
  • cpp/src/hdbscan/detail/utils.h (outdated)
  • cpp/src/fil/infer.cu (outdated)
  • cpp/src/hdbscan/hdbscan.cu (outdated)
@RAMitchell

I reverted the 32->64 bit changes for now, as I was not able to resolve a minor performance difference and it's not that relevant to this PR.


@vinaydes vinaydes left a comment


Apart from the minor hash-related comment, everything looks good to go. Approving.

@RAMitchell

rerun tests

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@7706130).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.10    #4166   +/-   ##
===============================================
  Coverage                ?   85.97%           
===============================================
  Files                   ?      231           
  Lines                   ?    18502           
  Branches                ?        0           
===============================================
  Hits                    ?    15907           
  Misses                  ?     2595           
  Partials                ?        0           
Flag       Coverage Δ
dask       47.33% <0.00%> (?)
non-dask   78.57% <0.00%> (?)

Powered by Codecov. Last update 7706130...06f41dc.

@dantegd added the labels breaking (Breaking change) and improvement (Improvement / enhancement to an existing function) on Sep 3, 2021
v21.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 3, 2021
@dantegd

dantegd commented Sep 3, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0e770fa into rapidsai:branch-21.10 Sep 3, 2021
v21.10 Release automation moved this from PR-Reviewer approved to Done Sep 3, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023

Authors:
  - Rory Mitchell (https://github.com/RAMitchell)

Approvers:
  - Venkat (https://github.com/venkywonka)
  - Vinay Deshpande (https://github.com/vinaydes)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4166
Labels: breaking (Breaking change), CUDA/C++, Cython / Python (Cython or Python issue), improvement (Improvement / enhancement to an existing function)

5 participants