
Transforms RandomForest estimators non-consecutive labels to consecutive labels where appropriate #4780

Merged
merged 13 commits into rapidsai:branch-22.10 on Sep 29, 2022

Conversation

Contributor

@VamsiTallam95 VamsiTallam95 commented Jun 17, 2022

This PR closes #4478 by transforming non-consecutive labels outside of [0, n) to consecutive labels inside [0, n), similar to what Scikit-learn does under the hood.

Closes #691
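
For reference, here is a minimal sketch (not the code in this PR) of the kind of remapping described above, assuming CuPy is available; the array values are only illustrative:

```python
import cupy as cp

# Map arbitrary class labels (e.g. [2, 5, 5, 9]) onto consecutive
# integers in [0, n_classes), as the PR description outlines.
y = cp.asarray([2, 5, 5, 9])
classes, y_monotonic = cp.unique(y, return_inverse=True)
# classes     -> [2, 5, 9]
# y_monotonic -> [0, 1, 1, 2]
# Predictions in [0, n_classes) can be mapped back with classes[preds].
```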

@VamsiTallam95 VamsiTallam95 requested a review from a team as a code owner June 17, 2022 02:32
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Jun 17, 2022
@VamsiTallam95 VamsiTallam95 marked this pull request as draft June 17, 2022 02:33
raise ValueError("The labels need "
"to be consecutive values from "
"0 to the number of unique label values")
self.classes_unorder = cp.unique(y_m).tolist()
Member

We should be reusing existing primitives where at all possible and using the make_monotonic primitive to do this. That allows us to optimize this specific operation once and have it benefit all uses.
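
For illustration, the existing prim could be called directly instead of re-deriving the mapping here. This is a hedged sketch that assumes the prim is importable from cuml.prims.label and returns the remapped labels plus the original classes, matching the call shape used later in this PR:

```python
from cuml.prims.label import make_monotonic

# Remap labels to consecutive values in [0, n_classes) using the
# existing primitive rather than reimplementing the transformation.
y_m, classes = make_monotonic(y_m)
```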

@beckernick beckernick added bug Something isn't working non-breaking Non-breaking change labels Jun 21, 2022
@cjnolet cjnolet added this to PR-WIP in v22.08 Release via automation Jun 22, 2022
@VamsiTallam95 VamsiTallam95 marked this pull request as ready for review June 27, 2022 15:07
Member

@dantegd dantegd left a comment

Looks good, just had a couple of comments

Comment on lines 313 to 316
@pytest.mark.parametrize("datatype", [np.float32, np.float64])
@pytest.mark.parametrize("max_features", [1.0, "auto", "log2", "sqrt"])
@pytest.mark.parametrize("b", [0, 5, -5, 10])
@pytest.mark.parametrize("a", [1, 2, 3])
Member

I don't think we need this full matrix of tests for the monotonic case; one combination for each datatype would be enough?

Contributor Author

Sure, I will fix one value for a, b and max_features.
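
One possible shape for that reduction (the test name and the pinned values here are placeholders, not the PR's actual test):

```python
import numpy as np
import pytest


@pytest.mark.parametrize("datatype", [np.float32, np.float64])
def test_rf_nonconsecutive_labels(datatype):
    # Keep the datatype sweep; pin the remaining parameters to a single
    # combination instead of the full cartesian product.
    a, b, max_features = 2, 5, 1.0
    ...
```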

Comment on lines 289 to 291
y_m, _ = make_monotonic(y_m)
break

Member

I wonder if the logic for the loop might be better placed in make_monotonic, perhaps with a parameter like make_monotonic(array, check_already_monotonic=True), so that every use of the prim is cleaner. What do you think?
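
Under that suggestion the call site could collapse to something like the sketch below; note that check_already_monotonic is the hypothetical parameter proposed in this comment, not an existing argument of make_monotonic:

```python
# Hypothetical API: the prim skips the remap when the labels are
# already consecutive integers starting at 0.
y_m, _ = make_monotonic(y_m, check_already_monotonic=True)
```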

Member

I'd vote we make that change in RAFT, plumb raft::label::make_monotonic to Python (rapidsai/raft#640), and then make a follow up PR to use that in cuML and remove this prim.

Context: #4478 (comment)

What do you think? With that said, we could just do both :)

Contributor Author

I did not modify the loop because I wanted to make the fewest possible changes to the code. However, we can use the check_labels primitive to see whether the labels are already monotonic. I can make that change for a cleaner implementation.
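
A rough sketch of that check (the check_labels call signature is assumed here, not verified against the prim's actual API):

```python
import cupy as cp

from cuml.prims.label import check_labels, make_monotonic

# Only remap when the labels are not already 0..n_classes-1.
n_classes = int(cp.unique(y_m).shape[0])
if not check_labels(y_m, cp.arange(n_classes, dtype=y_m.dtype)):
    y_m, _ = make_monotonic(y_m)
```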

Member

@beckernick I am on board with that idea.

v22.08 Release automation moved this from PR-WIP to PR-Needs review Jun 29, 2022
@ayushdg
Member

ayushdg commented Jun 29, 2022

rerun tests

3 similar comments
@VamsiTallam95
Contributor Author

rerun tests

@VamsiTallam95
Contributor Author

rerun tests

@VamsiTallam95
Contributor Author

rerun tests

@codecov-commenter

Codecov Report

Merging #4780 (8cd8d3a) into branch-22.08 (b26fe7e) will increase coverage by 0.00%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##           branch-22.08    #4780   +/-   ##
=============================================
  Coverage         77.62%   77.62%           
=============================================
  Files               180      180           
  Lines             11382    11384    +2     
=============================================
+ Hits               8835     8837    +2     
  Misses             2547     2547           
Flag Coverage Δ
dask 45.52% <ø> (+<0.01%) ⬆️
non-dask 67.26% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
python/cuml/metrics/__init__.py 100.00% <0.00%> (ø)
python/cuml/metrics/cluster/__init__.py 100.00% <0.00%> (ø)
python/cuml/thirdparty_adapters/adapters.py 91.48% <0.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@beckernick
Member

Is this ready for another round of reviews?

@VamsiTallam95
Contributor Author

It's ready!

@caryr35 caryr35 added this to PR-WIP in v22.10 Release via automation Aug 9, 2022
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v22.10 Release Aug 9, 2022
@caryr35 caryr35 removed this from PR-Needs review in v22.08 Release Aug 9, 2022
Contributor

@lowener lowener left a comment

LGTM

@dantegd dantegd changed the base branch from branch-22.08 to branch-22.10 August 31, 2022 17:49
@lowener
Contributor

lowener commented Aug 31, 2022

rerun tests

1 similar comment
@lowener
Contributor

lowener commented Sep 1, 2022

rerun tests

@beckernick beckernick changed the title Transforms cuML estimators non-consecutive labels to consecutive labels where appropriate Transforms RandomForest estimators non-consecutive labels to consecutive labels where appropriate Sep 8, 2022
@beckernick
Member

Changing the title before merging, as this PR only applies this change to random forest models.

@beckernick
Member

This will also close #691

v22.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 29, 2022
@dantegd
Member

dantegd commented Sep 29, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 96da84c into rapidsai:branch-22.10 Sep 29, 2022
v22.10 Release automation moved this from PR-Reviewer approved to Done Sep 29, 2022
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
…ive labels where appropriate (rapidsai#4780)

This PR closes rapidsai#4478 by transforming non-consecutive labels outside of [0,n) to consecutive labels inside [0,n) similar to what Scikit-learn does under the hood.

Closes rapidsai#691

Authors:
  - https://github.com/VamsiTallam95

Approvers:
  - Micka (https://github.com/lowener)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4780
Labels: bug (Something isn't working), Cython / Python (Cython or Python issue), non-breaking (Non-breaking change)
Projects: no open projects
7 participants