[REVIEW] Improvements in feature sampling #4278

vinaydes · 2021-10-12T07:56:23Z

With this PR, the feature sampling overhead is greatly reduced, especially for wide datasets (thousands of features). The PR requires some structural changes in RAFT and is therefore marked as WIP.

teju85 · 2021-10-12T08:08:21Z

Any idea whether we could also add support for weighted feature subsampling in the same PR? Or is it better to keep that for a separate PR?

vinaydes · 2021-10-12T08:22:40Z

> Any idea whether we could also add support for weighted feature subsampling in the same PR? Or is it better to keep that for a separate PR?

Weighted sampling should be possible. I'll see if I can manage to add it to the same PR.
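
For context, one well-known way to do weighted sampling without replacement is the Efraimidis-Spirakis key trick: draw a uniform `u` per column, compute the key `u^(1/w)`, and keep the `k` columns with the largest keys. The sketch below is only a plain-Python illustration of that idea. It is not code from this PR, and the function name and arguments are made up for the example.

```python
import heapq
import math
import random


def weighted_sample_without_replacement(weights, k, rng):
    """Pick up to k distinct column ids, favouring columns with larger weights.

    Hypothetical helper for illustration; columns with weight <= 0 are never picked.
    """
    keyed = []
    for col, w in enumerate(weights):
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        # log(u) / w orders columns exactly like u ** (1 / w), but avoids underflow.
        keyed.append((math.log(u) / w, col))
    return [col for _, col in heapq.nlargest(k, keyed)]


if __name__ == "__main__":
    rng = random.Random(42)
    print(weighted_sample_without_replacement([0.0, 1.0, 1.0, 0.0, 5.0], 2, rng))
```

Because this reduces to "assign a key, keep the top k", it would in principle fit a sort-based sampling scheme, though whether it maps well onto the GPU kernels here is untested.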

venkywonka · 2021-10-13T05:36:34Z

No regressions in gbm-bench, in either accuracy or perf 🙌🏻. Since most of the datasets in gbm-bench have a small number of columns, speedups are not expected (except for epsilon, whose 2000 columns yield a slight improvement in perf).

accuracy comparison with branch-21.12

perf comparison with branch-21.12

github-actions · 2021-11-23T20:03:08Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

venkywonka · 2022-06-13T12:38:37Z

rerun tests

venkywonka · 2022-06-13T13:05:07Z

The previously pushed changes include two feature-sampling strategies. The choice between them depends on whether the sampling problem size fits within the available static shared memory and register budget. The default strategy is a sorting-based sampling (this kernel). Another strategy, a batchwise adaptation of Algorithm L for reservoir sampling (this kernel), is used as a fallback.

The former strategy is faster than the latter (by 1.5x on the target datasets with very many columns, ~100000).
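
To make the two ideas concrete, here is a minimal sequential sketch in plain Python. It is not the CUDA kernels from this PR: there is no batching, no shared memory, and the function names are invented for illustration. It only shows what "sort-based sampling" and "Algorithm L reservoir sampling" mean when picking `k` distinct column ids out of `n_cols`.

```python
import math
import random


def _uniform_open(rng):
    """Uniform draw from the open interval (0, 1), so log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def sample_by_sorting(n_cols, k, rng):
    """Tag every column with a random key, sort, keep the k smallest keys."""
    keyed = sorted((rng.random(), col) for col in range(n_cols))
    return [col for _, col in keyed[:k]]


def sample_by_algorithm_l(n_cols, k, rng):
    """Sequential Algorithm L reservoir sampling over columns 0..n_cols-1."""
    k = min(k, n_cols)
    if k <= 0:
        return []
    reservoir = list(range(k))  # the first k columns fill the reservoir
    w = math.exp(math.log(_uniform_open(rng)) / k)
    i = k - 1  # index of the last column consumed so far
    while True:
        if w >= 1.0:
            skip = 0  # w rounded to exactly 1.0: nothing is skipped
        else:
            # Number of columns to jump over before the next replacement.
            skip = math.floor(math.log(_uniform_open(rng)) / math.log1p(-w))
        i += skip + 1
        if i >= n_cols:
            return reservoir
        reservoir[rng.randrange(k)] = i  # overwrite a uniformly chosen slot
        w *= math.exp(math.log(_uniform_open(rng)) / k)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_by_sorting(20, 4, rng))
    print(sample_by_algorithm_l(20, 4, rng))
```

The sort-based version touches every column once and needs room for all `n_cols` keys, while Algorithm L keeps only `k` slots and skips over most columns. That trade-off is presumably why the former needs the problem to fit in fast on-chip storage and the latter serves as the fallback.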

benchmarks on GBM datasets

No regressions in accuracy

There is only a slight improvement in performance on the GBM datasets: all of them have few columns, so the improvement in the feature-sampling portion does not significantly affect the end-to-end times.

max_features: "sqrt", n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 0.5, bootstrap: true
max_features: 0.7, n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 0.5, bootstrap: true

benchmark on a representative synthetic dataset

This feature-sampling improvement is more pronounced when feature sampling becomes the bottleneck, which happens when datasets are very wide (~100000 cols).

Below is a benchmark on such a synthetic regression dataset with 1000 rows and 100000 cols.

max_features: "sqrt", n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 1.0, bootstrap: true

The above benchmarks have been rerun after the latest commit and modified in-place
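
For a sense of scale, the snippet below estimates how many columns get sampled under the `max_features` settings used in these benchmarks. It is a back-of-the-envelope sketch: the helper name is hypothetical, and the rounding rule (round to nearest, clamp to `[1, n_cols]`) is an assumption rather than cuML's exact behaviour.

```python
import math


def approx_sampled_columns(n_cols, max_features):
    """Rough count of sampled columns (hypothetical helper, assumed rounding)."""
    if max_features == "sqrt":
        k = math.sqrt(n_cols)
    elif isinstance(max_features, float):
        k = max_features * n_cols
    else:
        raise ValueError(f"unsupported max_features: {max_features!r}")
    return max(1, min(n_cols, round(k)))


if __name__ == "__main__":
    print(approx_sampled_columns(100000, "sqrt"))  # synthetic dataset above
    print(approx_sampled_columns(2000, 0.7))       # a 2000-column dataset
```

With `"sqrt"` on 100000 columns only a few hundred columns are kept per draw, so a sampler that avoids work proportional to the full column count has the most to gain there.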

venkywonka · 2022-06-24T10:16:27Z

rerun tests

venkywonka · 2022-06-29T11:42:57Z

@teju85 could you give a final review? ✌🏻

teju85 · 2022-06-29T12:52:19Z

Not sure if I'll get time to review this PR soon. Maybe @vinaydes or @tfeher ?

vinaydes · 2022-06-30T04:47:19Z

I have already had a look at the code and I am okay with the changes. I am not approving because part of the code was written by me and I have been part of the PR from the start. @tfeher Let us know if you need us to explain the changes for your review.

tfeher

Thanks @venkywonka and @vinaydes for the PR! In general it looks good; there are only a few smaller issues.

cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh

vinaydes · 2022-07-29T14:16:31Z

@dantegd I have addressed the changes @tfeher had asked for. This PR can now be merged.

vinaydes · 2022-08-01T04:21:48Z

@tfeher I think you need to re-review or accept the changes, because merging is blocked on that.

tfeher

Thanks @vinaydes for addressing the issues!

I think my request for improving the docstring was not clear, so I have added a suggestion to illustrate what I meant.

On the one hand this is a nitpick and should not hold up this PR, therefore I am approving the PR.
On the other hand, if someone picks up rapidsai/raft#767, then such information could be very useful.

cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh

vinaydes · 2022-08-01T11:44:42Z

I'll take a look at CI failures.

codecov-commenter · 2022-08-02T19:50:54Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@33c0170).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08    #4278   +/-   ##
===============================================
  Coverage                ?   78.02%           
===============================================
  Files                   ?      180           
  Lines                   ?    11385           
  Branches                ?        0           
===============================================
  Hits                    ?     8883           
  Misses                  ?     2502           
  Partials                ?        0

Flag	Coverage Δ
dask	`46.21% <0.00%> (?)`
non-dask	`67.27% <0.00%> (?)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 33c0170...b5751f1.

dantegd · 2022-08-03T15:16:15Z

@gpucibot merge

venkywonka · 2022-08-03T15:25:33Z

Yaaaay ❤️🥳

With this PR, the feature sampling overhead is greatly reduced, especially for wide (thousands of features) datasets. The PR requires some structural changes in RAFT therefore is marked as WIP.

Authors:
- Vinay Deshpande (https://github.com/vinaydes)
- Ray Douglass (https://github.com/raydouglass)
- Andy Adinets (https://github.com/canonizer)
- Jordan Jacobelli (https://github.com/Ethyling)
- Jiwei Liu (https://github.com/daxiongshu)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Christopher Akiki (https://github.com/cakiki)
- Venkat (https://github.com/venkywonka)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4278

vinaydes added 16 commits September 13, 2021 18:06

Saving the changes made so far

25dbf45

Added Thrust shuffle based colid generation

a7a6707

Removing host copy of colids

b5658fb

Refacotring kernel arguments

b05689b

Removing unused select function

5dad925

Added print to distinguish code version

558dd46

Timing measurement calls added

9f3adbe

Added count sort based sampling again for better comparison

7cc1676

Minor changes

2a6f4a2

Working adaptive sampling kernel

669bb9c

Removing thrust and other unused code

9d2ffd0

Making kiss99 static for now

8b3704f

Removing some more unused code

1c50ab7

Formatting changes

b1c63bb

Fixing select kernel call format

684c668

Merge branch 'branch-21.12' into enh-rf-better-feature-sampling

3490114

vinaydes requested a review from a team as a code owner October 12, 2021 07:56

github-actions bot added the CUDA/C++ label Oct 12, 2021

Undo local build fix

3138301

caryr35 added this to PR-WIP in v21.12 Release via automation Oct 14, 2021

caryr35 moved this from PR-WIP to PR-Needs review in v21.12 Release Oct 14, 2021

dantegd removed this from PR-Needs review in v21.12 Release Nov 18, 2021

dantegd added this to PR-WIP in v22.02 Release via automation Nov 18, 2021

github-actions bot added the inactive-30d label Nov 23, 2021

vinaydes added 2 commits January 20, 2022 11:18

Merge branch 'branch-22.02' into enh-rf-better-feature-sampling

fbec82f

Changing the RAFT repo link

7f82825

vinaydes changed the title ~~[WIP] Improvements in feature sampling~~ [REVIEW] Improvements in feature sampling Jun 16, 2022

change seed for a corner case

ff92c6e

github-actions bot added the Cython / Python Cython or Python issue label Jun 29, 2022

change seed for a corner case

8d76a78

tfeher requested changes Jul 11, 2022

v22.08 Release automation moved this from PR-WIP to PR-Needs review Jul 11, 2022

Merge branch 'branch-22.08' into enh-rf-better-feature-sampling

8e5990d

vinaydes mentioned this pull request Jul 29, 2022

[FEA] Add batched sample wihtout replacement methods rapidsai/raft#767

Open

Addressing review comments about docstring

c094f65

tfeher approved these changes Aug 1, 2022

cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels.cuh Outdated

vinaydes and others added 2 commits August 1, 2022 12:25

Update cpp/src/decisiontree/batched-levelalgo/kernels/builder_kernels…

aae286c

….cuh Co-authored-by: Tamas Bela Feher <tfeher@nvidia.com>

formatting changes

ea0428a

Merge branch 'branch-22.08' into enh-rf-better-feature-sampling

b5751f1

v22.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Aug 3, 2022

dantegd approved these changes Aug 3, 2022

rapids-bot bot merged commit 3b3b891 into rapidsai:branch-22.08 Aug 3, 2022

v22.08 Release automation moved this from PR-Reviewer approved to Done Aug 3, 2022
