[REVIEW] Improvements in feature sampling #4278

Merged

Conversation

vinaydes
Contributor

With this PR, the feature sampling overhead is greatly reduced, especially for wide datasets (thousands of features). The PR requires some structural changes in RAFT and is therefore marked as WIP.

@vinaydes vinaydes requested a review from a team as a code owner October 12, 2021 07:56
@teju85
Member

teju85 commented Oct 12, 2021

Any ideas we could also add support for feature subsampling with weights in the same PR? Or better to keep it separate in another PR?

@vinaydes
Contributor Author

Any ideas we could also add support for feature subsampling with weights in the same PR? Or better to keep it separate in another PR?

Weighted sampling should be possible. I'll see if I can manage to add it to the same PR.
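For readers unfamiliar with the idea, weighted feature subsampling without replacement can be sketched as follows. This is only an illustration in plain Python using the well-known Efraimidis-Spirakis key trick (draw a key `u ** (1 / w)` per feature and keep the `k` largest); it is not the approach taken in this PR, and the function name `weighted_sample_without_replacement` is hypothetical.

```python
import heapq
import math
import random


def weighted_sample_without_replacement(weights, k, rng=None):
    """Pick k distinct feature indices, favouring features with larger weights.

    Hypothetical helper for illustration only (not a cuML or RAFT API).
    """
    rng = rng or random.Random()
    n_positive = sum(1 for w in weights if w > 0)
    if k > n_positive:
        raise ValueError("k exceeds the number of features with positive weight")

    keyed = []
    for idx, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight features are never selected
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        # log(u) / w is the logarithm of the key u ** (1 / w); comparing
        # logs gives the same ordering and avoids underflow for tiny weights.
        keyed.append((math.log(u) / w, idx))

    # The k largest keys form a weighted sample without replacement.
    return [idx for _, idx in heapq.nlargest(k, keyed)]


if __name__ == "__main__":
    picked = weighted_sample_without_replacement([1.0, 0.0, 2.0, 3.0], 2, random.Random(0))
    print(picked)
```

One reason this variant would fit next to a sorting-based sampler is that both reduce to "assign a random key per feature, then select by key".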

@venkywonka
Contributor

No regressions in gbm-bench, both in accuracy and perf 🙌🏻. Since most of the datasets in gbm-bench have a small number of columns, speedups are not expected (except epsilon, whose 2000 columns show a slight improvement in perf).

accuracy comparison with branch-21.12

[Figure: sampling-comparison-with-main accuracy]

perf comparison with branch-21.12

[Figure: sampling-comparison-with-main time]

@caryr35 caryr35 added this to PR-WIP in v21.12 Release via automation Oct 14, 2021
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v21.12 Release Oct 14, 2021
@dantegd dantegd removed this from PR-Needs review in v21.12 Release Nov 18, 2021
@dantegd dantegd added this to PR-WIP in v22.02 Release via automation Nov 18, 2021
@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@venkywonka
Contributor

rerun tests

@venkywonka
Contributor

The previously pushed changes include two feature-sampling strategies. Which one is used depends on whether the sampling problem size fits within the available static shared memory and acceptable register pressure. The default strategy is a sorting-based sampling kernel. The fallback strategy is a kernel that uses a batchwise adaptation of Algorithm L for reservoir sampling.

The former strategy is faster than the latter, by 1.5x on the target datasets with a very large number of columns (~100000).
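For context, the two ideas can be sketched on the CPU as below. This is a minimal sequential illustration in plain Python, not the CUDA kernels from this PR: the first function samples by sorting random keys, the second is the textbook (non-batchwise) Algorithm L. The function names are hypothetical.

```python
import math
import random


def _uniform_open(rng):
    """Return a uniform value strictly inside (0, 1) so log() is safe."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def sample_by_sorting(n_features, k, rng):
    """Attach a random key to every feature, sort, keep the first k."""
    if not 0 < k <= n_features:
        raise ValueError("need 0 < k <= n_features")
    keys = sorted((rng.random(), idx) for idx in range(n_features))
    return [idx for _, idx in keys[:k]]


def sample_algorithm_l(n_features, k, rng):
    """Reservoir sampling with Algorithm L (sequential, textbook form)."""
    if not 0 < k <= n_features:
        raise ValueError("need 0 < k <= n_features")
    reservoir = list(range(k))  # start with the first k features
    log_w = math.log(_uniform_open(rng)) / k  # log of the running weight W
    i = k - 1  # index of the last feature considered so far
    while True:
        denom = math.log(-math.expm1(log_w))  # log(1 - W), always <= 0
        if denom == 0.0:
            break  # W is negligible: no later feature would be accepted
        # Jump over a geometrically distributed number of features.
        i += math.floor(math.log(_uniform_open(rng)) / denom) + 1
        if i >= n_features:
            break
        reservoir[rng.randrange(k)] = i  # evict a random reservoir slot
        log_w += math.log(_uniform_open(rng)) / k
    return reservoir


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_by_sorting(100000, 5, rng))
    print(sample_algorithm_l(100000, 5, rng))
```

The sorting variant touches every feature once and needs working space for all keys, which matches the comment's point that it only applies while the problem fits in shared memory. Algorithm L keeps just `k` slots and skips most features, which makes it a natural fallback.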

benchmarks on GBM datasets

  • No regressions in accuracy
  • There is only a slight performance improvement on the GBM datasets: they all have relatively few columns, so speeding up the feature-sampling portion does not significantly affect end-to-end times.

max_features: "sqrt", n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 0.5, bootstrap: true
max_features: 0.7, n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 0.5, bootstrap: true

benchmark on a representative synthetic dataset

  • This feature-sampling improvement is more pronounced when sampling becomes the bottleneck, that is, on very wide datasets (~100000 cols).
  • Below is a benchmark on a representative synthetic regression dataset with 1000 rows and 100000 cols.

max_features: "sqrt", n_trees: 1000, n_bins: 128, n_streams: 4, max_samples: 1.0, bootstrap: true
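To put the `max_features` settings above in perspective, here is a rough sketch of how many candidate columns they correspond to for a 2000-column dataset (like epsilon) versus the 100000-column synthetic one. It assumes the scikit-learn-style convention (`"sqrt"` means the integer part of the square root, a float means that fraction of the columns); the helper name `resolve_max_features` is hypothetical and cuML's exact rounding may differ.

```python
import math


def resolve_max_features(max_features, n_cols):
    """Translate a max_features setting into a column count.

    Assumption: scikit-learn-style rules, used here for illustration only.
    """
    if max_features == "sqrt":
        count = int(math.sqrt(n_cols))
    elif isinstance(max_features, float) and 0.0 < max_features <= 1.0:
        count = int(max_features * n_cols)
    else:
        raise ValueError(f"unsupported max_features: {max_features!r}")
    return max(1, count)  # always consider at least one column


if __name__ == "__main__":
    for n_cols in (2000, 100000):
        for setting in ("sqrt", 0.7):
            n_sampled = resolve_max_features(setting, n_cols)
            print(f"n_cols={n_cols}, max_features={setting!r}: {n_sampled} columns")
```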

The above benchmarks have been rerun after the latest commit and modified in-place

@vinaydes vinaydes changed the title [WIP] Improvements in feature sampling [REVIEW] Improvements in feature sampling Jun 16, 2022
@venkywonka
Contributor

rerun tests

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Jun 29, 2022
@venkywonka
Contributor

@teju85 could you give a final review? ✌🏻

@teju85
Member

teju85 commented Jun 29, 2022

Not sure if I'll get time to review this PR soon. Maybe @vinaydes or @tfeher ?

@vinaydes
Contributor Author

I have already looked at the code and I am okay with the changes. I am not approving because part of the code was written by me and I have been part of the PR from the start. @tfeher, let us know if we need to explain the changes for you to review.

Contributor

@tfeher tfeher left a comment

Thanks @venkywonka and @vinaydes for the PR! In general it looks good, there are only a few smaller issues.

v22.08 Release automation moved this from PR-WIP to PR-Needs review Jul 11, 2022
@vinaydes
Contributor Author

@dantegd I have addressed the changes @tfeher had asked for. This PR can now be merged.

@vinaydes
Contributor Author

vinaydes commented Aug 1, 2022

@tfeher I think you need to re-review or accept the changes, because merging is blocked on that.

Contributor

@tfeher tfeher left a comment

Thanks @vinaydes for addressing the issues!

I think my request for improving the docstring was not clear, so I have added a suggestion to illustrate what I meant.

On one hand this is a nitpick and should not hold up this PR, therefore I am approving the PR.
On the other hand, if someone picks up rapidsai/raft#767, then such information could be very useful.

vinaydes and others added 2 commits August 1, 2022 12:25
@vinaydes
Contributor Author

vinaydes commented Aug 1, 2022

I'll take a look at CI failures.

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@33c0170). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08    #4278   +/-   ##
===============================================
  Coverage                ?   78.02%           
===============================================
  Files                   ?      180           
  Lines                   ?    11385           
  Branches                ?        0           
===============================================
  Hits                    ?     8883           
  Misses                  ?     2502           
  Partials                ?        0           
Flag Coverage Δ
dask 46.21% <0.00%> (?)
non-dask 67.27% <0.00%> (?)


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33c0170...b5751f1. Read the comment docs.

v22.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Aug 3, 2022
@dantegd
Member

dantegd commented Aug 3, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3b3b891 into rapidsai:branch-22.08 Aug 3, 2022
v22.08 Release automation moved this from PR-Reviewer approved to Done Aug 3, 2022
@venkywonka
Contributor

Yaaaay ❤️🥳

jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
With this PR, the feature sampling overhead is greatly reduced, especially for wide datasets (thousands of features). The PR requires some structural changes in RAFT and is therefore marked as WIP.

Authors:
  - Vinay Deshpande (https://github.com/vinaydes)
  - Ray Douglass (https://github.com/raydouglass)
  - Andy Adinets (https://github.com/canonizer)
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Jiwei Liu (https://github.com/daxiongshu)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Christopher Akiki (https://github.com/cakiki)
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4278
Labels
  • 3 - Ready for Review: Ready for review by team
  • CUDA/C++
  • Cython / Python: Cython or Python issue
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change