
Random forest refactoring #4166

Merged: 41 commits into rapidsai:branch-21.10 on Sep 3, 2021

Conversation


@RAMitchell RAMitchell commented Aug 17, 2021

Summary of the changes:

  • Remove some unused print functions
  • Move validity checks into parameter construction, so parameters are checked by default
  • Remove Node_ID_info struct, we can just use a std::pair
  • Move builder_base.cuh into builder.cuh
  • Remove node.cuh. Use InstanceRange to store this information.
  • Builder.train() directly returns a DT::TreeMetaDataNode<DataT, LabelT> object
  • computeQuantiles is made into a pure function. Some weird usages of smart pointers removed.
  • Unused DataInfo struct removed
  • DecisionTree class member variables removed, member functions made into pure functions (static)
  • Some unnecessary RandomForest member variables removed, destructor removed
  • Some instances of new/delete changed to use std containers
  • Tests for instance counts moved from python to gtest
  • Change indexing type from 32-bit integers to std::size_t
  • Test fil predictions against rf predictions; fixes a case where ties in multi-class prediction were broken inconsistently in RF's CPU predictor

@RAMitchell RAMitchell requested review from a team as code owners August 17, 2021 01:42

std::shared_ptr<DT::TreeMetaDataNode<DataT, LabelT>> train()
{
ML::PUSH_RANGE("Builder::train @builder_base.cuh [batched-levelalgo]");



@venkywonka venkywonka left a comment


LGreatTM 👍🏾


@vinaydes vinaydes left a comment


I think changing each IdxT to size_t is not required and is not good for performance, especially in device code. For example, n_bins would never be more than a few thousand (currently at most 1024).
It is probably safe for now to assume that n_rows and n_cols are both less than 2^32 and that their product (the size of the dataset) fits in size_t (i.e. is less than 2^64 on a 64-bit platform). So variables derived from n_rows and n_cols, such as n_sampled_cols, can stay 32-bit integers; anything derived from the size of the dataset should be size_t.
If we keep the IdxT abstraction throughout the code for integer types (n_rows, n_cols), we can change it to a wider type in the future if needed.

Resolved review threads:
  • cpp/src/randomforest/randomforest.cu (outdated)
  • cpp/src/randomforest/randomforest.cu (outdated)
  • cpp/src/randomforest/randomforest.cuh (outdated)
  • cpp/src/randomforest/randomforest.cuh (outdated)
  • cpp/src/decisiontree/decisiontree.cu
  • cpp/include/cuml/tree/flatnode.h (outdated)
  • cpp/include/cuml/tree/flatnode.h
  • cpp/src/hdbscan/detail/utils.h (outdated)
  • cpp/src/fil/infer.cu (outdated)
  • cpp/src/hdbscan/hdbscan.cu (outdated)
@RAMitchell

I reverted the 32->64 bit changes for now, as I was not able to resolve a minor performance difference and it's not that relevant to this PR.


@vinaydes vinaydes left a comment


Apart from the minor hash-related comment, everything looks good to go. Approving.

@RAMitchell

rerun tests

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@7706130).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.10    #4166   +/-   ##
===============================================
  Coverage                ?   85.97%           
===============================================
  Files                   ?      231           
  Lines                   ?    18502           
  Branches                ?        0           
===============================================
  Hits                    ?    15907           
  Misses                  ?     2595           
  Partials                ?        0           
Flag       Coverage Δ
dask       47.33% <0.00%> (?)
non-dask   78.57% <0.00%> (?)

Powered by Codecov. Last update 7706130...06f41dc.

@dantegd added the labels breaking (Breaking change) and improvement (Improvement / enhancement to an existing function) on Sep 3, 2021
v21.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 3, 2021
@dantegd

dantegd commented Sep 3, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0e770fa into rapidsai:branch-21.10 Sep 3, 2021
v21.10 Release automation moved this from PR-Reviewer approved to Done Sep 3, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023

Authors:
  - Rory Mitchell (https://github.com/RAMitchell)

Approvers:
  - Venkat (https://github.com/venkywonka)
  - Vinay Deshpande (https://github.com/vinaydes)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4166
Labels: breaking (Breaking change), CUDA/C++, Cython / Python (Cython or Python issue), improvement (Improvement / enhancement to an existing function)

5 participants