Random forest refactoring #4166
Conversation
std::shared_ptr<DT::TreeMetaDataNode<DataT, LabelT>> train()
{
  // Builder::train() now returns the tree metadata directly (see the PR summary below).
  ML::PUSH_RANGE("Builder::train @builder_base.cuh [batched-levelalgo]");
LGreatTM 👍🏾
Force-pushed from 1e6d2a6 to 1f59a24.
I think changing each `IdxT` to `size_t` is not required and is not good for performance, especially in the device code. For example, `n_bins` would never be more than a few thousand (currently max 1024).

It is probably safe for now to assume that `n_rows` and `n_cols` are both less than 2^32 and that their product (the size of the dataset) fits in `size_t` (i.e. is less than 2^64 on a 64-bit platform). So any variables derived from `n_rows` and `n_cols`, such as `n_sampled_cols`, can be treated as 32-bit integers. Anything derived from the size of the dataset should be `size_t`.

If we keep the `IdxT` abstraction throughout the code for integer types (`n_rows`, `n_cols`), then we can change it to bigger sizes in the future if needed.
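To make that convention concrete, here is a minimal sketch, assuming hypothetical names (`DatasetView`, `flat_index`) rather than the actual cuML types: per-dimension counts stay 32-bit, while anything sized by the whole dataset is promoted to `size_t`.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only -- not cuML code. IdxT is the 32-bit index abstraction
// discussed above; dataset-sized quantities use size_t.
template <typename IdxT = std::uint32_t>
struct DatasetView {
  IdxT n_rows;  // assumed < 2^32
  IdxT n_cols;  // assumed < 2^32

  // The total element count is derived from the dataset size, so it is
  // size_t: the product can exceed 2^32 even when both factors fit in 32 bits.
  std::size_t size() const
  {
    return static_cast<std::size_t>(n_rows) * static_cast<std::size_t>(n_cols);
  }

  // Row/column indices stay 32-bit; the flat offset is promoted before the
  // multiply to avoid 32-bit overflow.
  std::size_t flat_index(IdxT row, IdxT col) const
  {
    return static_cast<std::size_t>(row) * n_cols + col;
  }
};
```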
This reverts commit e9525c3.
I reverted the 32->64 bit changes for now, as I was not able to resolve a minor performance difference, and it's not that relevant to this PR.
Apart from the minor hash-related comment, everything looks good to go. Approving.
Co-authored-by: Vinay Deshpande <vinayd@nvidia.com>
Force-pushed from 543b74d to b21fae9.
rerun tests
Codecov Report
```
@@            Coverage Diff             @@
##           branch-21.10   #4166   +/-  ##
============================================
  Coverage             ?   85.97%
============================================
  Files                ?      231
  Lines                ?    18502
  Branches             ?        0
============================================
  Hits                 ?    15907
  Misses               ?     2595
  Partials             ?        0
```

Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
@gpucibot merge
Summary of the changes:

- Remove some unused print functions
- Move validity checks into parameter construction, so parameters are checked by default (see the sketch after this summary)
- Remove the Node_ID_info struct; we can just use a std::pair
- Move builder_base.cuh into builder.cuh
- Remove node.cuh; use InstanceRange to store this information
- Builder.train() directly returns a DT::TreeMetaDataNode<DataT, LabelT> object
- computeQuantiles is made into a pure function; some weird usages of smart pointers removed
- Unused DataInfo struct removed
- DecisionTree class member variables removed; member functions made into pure functions (static)
- Some unnecessary RandomForest member variables removed; destructor removed
- Some instances of new/delete changed to use std containers
- Tests for instance counts moved from Python to gtest
- Change indexing type from 32-bit integers to std::size_t
- Test FIL predictions against RF predictions; fixes a case where ties in multi-class prediction are broken inconsistently in RF's CPU predictor

Authors:
- Rory Mitchell (https://github.com/RAMitchell)

Approvers:
- Venkat (https://github.com/venkywonka)
- Vinay Deshpande (https://github.com/vinaydes)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4166
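As an illustration of the "validity checks in parameter construction" item above, here is a minimal sketch of the pattern; the struct name, fields, and bounds are hypothetical stand-ins, not the exact cuML definitions.

```cpp
#include <stdexcept>

// Hypothetical parameter struct -- fields and checks are illustrative only.
struct DecisionTreeParams {
  int max_depth;
  int n_bins;

  DecisionTreeParams(int max_depth_, int n_bins_)
    : max_depth(max_depth_), n_bins(n_bins_)
  {
    // Checks run during construction, so every instance that exists has
    // already been validated -- callers cannot forget to check.
    if (max_depth <= 0) { throw std::invalid_argument("max_depth must be positive"); }
    if (n_bins <= 0) { throw std::invalid_argument("n_bins must be positive"); }
  }
};
```

Validating at construction replaces scattered check calls at each use site, which is why the PR says parameters are "checked by default".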