
TST Fix typo, lint test_target_encoder.py #26958

Merged
merged 16 commits into from
Sep 7, 2023

Conversation

lucyleeow
Member

@lucyleeow lucyleeow commented Aug 1, 2023

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes some typos, cleans up a redundant if/else, and removes a magic number.

Any other comments?

@lucyleeow lucyleeow added the Quick Review For PRs that are quick to review label Aug 1, 2023
@lucyleeow lucyleeow changed the title DOC Fix typos in test_target_encoder.py Clean typo, lint test_target_encoder.py Aug 1, 2023
@github-actions

github-actions bot commented Aug 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9bda613.

@lucyleeow lucyleeow changed the title Clean typo, lint test_target_encoder.py Fix typo, lint test_target_encoder.py Aug 1, 2023
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR! LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Aug 1, 2023
@thomasjpfan thomasjpfan changed the title Fix typo, lint test_target_encoder.py TST Fix typo, lint test_target_encoder.py Aug 1, 2023
X_test = np.concatenate((X_test, [[unknown_value]]))

rng = np.random.RandomState(global_random_seed)

n_splits = 3
random_state = 0
Member

I think we should be using the rng which takes from global_random_seed instead?

Member Author

@lucyleeow lucyleeow Aug 3, 2023

Glad you raised this because it uncovers interesting behaviour. The tests fail if we use rng in creating a *KFold and also pass it to TargetEncoder. I think this is because we are calling shuffle on this RandomState object more than once, which gives different results each time (I think this is intended behaviour), see:

In [1]: import numpy as np

In [2]: rng = np.random.RandomState(42)

In [3]: a = np.arange(10)

In [4]: a
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [5]: rng.shuffle(a)

In [6]: a
Out[6]: array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])

In [7]: a = np.arange(10)

In [8]: a
Out[8]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]: rng.shuffle(a)

In [10]: a
Out[10]: array([0, 1, 8, 5, 3, 4, 7, 9, 6, 2])

Generating a new RandomState object (with the same seed) and then calling shuffle on a will give the same result as Out[6]. Thus when we use an int, a new RandomState object is created each time and the tests pass.

Not sure if we need to (or can) do anything about this? Potentially just document it?
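The difference above can be sketched with a small helper. `shuffled` below is a hypothetical function mimicking how scikit-learn's `check_random_state` treats its argument: an int seed builds a fresh RandomState on every call, while a RandomState instance is reused and therefore mutated.

```python
import numpy as np

def shuffled(seed_or_rng):
    # check_random_state-style handling: an int builds a fresh
    # RandomState each call; an instance is reused (and mutated).
    rng = (np.random.RandomState(seed_or_rng)
           if isinstance(seed_or_rng, int) else seed_or_rng)
    a = np.arange(10)
    rng.shuffle(a)
    return a

# An int seed is reproducible: a fresh generator is built per call.
assert np.array_equal(shuffled(42), shuffled(42))

# A shared instance advances its internal state, so calls differ.
shared = np.random.RandomState(42)
first, second = shuffled(shared), shuffled(shared)
assert not np.array_equal(first, second)
```

This is why the tests pass with `random_state = 0` (each consumer gets its own identically seeded generator) but fail with a shared `rng`.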

Member

I'm not sure if I understand. If we need to get a shuffle and re-use it, we should be doing that and passing it along, and if that's not the case, I don't see why it would make the tests fail.

Member Author

I am not sure I follow you either. We shuffle separately in `for train_idx, test_idx in cv.split(X_train_array, y_train):` and inside of `TargetEncoder.fit_transform`. We can't get a shuffle and re-use it (if re-use means reusing the shuffled indices).

But for clarity, the symptoms are: the tests fail when I use rng and pass when I use random_state. The shuffled indices (e.g., taking the continuous case, KFold) from here:

check_random_state(self.random_state).shuffle(indices)

are different when done in the test at the line `for train_idx, test_idx in cv.split(X_train_array, y_train):` and when done inside of `TargetEncoder.fit_transform`.

I checked (with id()) that the same rng object is used both at the line `for train_idx, test_idx in cv.split(X_train_array, y_train):` and in `TargetEncoder.fit_transform`.

Member Author

Ah, I'm so silly. I will just create a new rng and pass it to TargetEncoder.

Member

I think the current tests refactor is better than main and good enough to be merged as is. I think there are ways to get global_random_seed to work, but that can be done as a follow-up.

Member Author

@lucyleeow lucyleeow Aug 4, 2023

Just realised that a reasonably neat way is to call set_state each time before the rng is passed to an estimator, done in: f6ef1d2
However, happy to revert if you prefer @thomasjpfan @adrinjalali
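A minimal sketch of the get_state/set_state trick being discussed: snapshot the generator state up front, then rewind before handing the rng to the next consumer, so every consumer sees the same shuffle.

```python
import numpy as np

rng = np.random.RandomState(0)
state = rng.get_state()  # snapshot the generator state

a = np.arange(10)
rng.shuffle(a)           # first consumer advances the state
first = a.copy()

rng.set_state(state)     # rewind before the next consumer
b = np.arange(10)
rng.shuffle(b)

assert np.array_equal(first, b)  # both consumers see identical shuffles
```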

Member

I am okay with get_state + set_state as long as there is a comment above the get_state call explaining why it is necessary.

Member Author

Done

Member

Coming late to this discussion. Apparently the get_state/set_state calls are not needed. Here is a PR to this PR to further reorganize this test and make it more explicit and even more seed-independent:

lucyleeow#1

@lucyleeow
Member Author

lucyleeow commented Aug 4, 2023

@thomasjpfan not sure if another review is needed after changes (sorry for noise if not!)

Edit: never mind, I've gone back to using an int.


Member

@adrinjalali adrinjalali left a comment

lol, this turned out to be an interesting one. I'm okay with the solution. But I think it'd be nice to expand on the comment before the rng.get_state to explain for a future poor maintainer what's happening 🙈

@lucyleeow
Member Author

But I think it'd be nice to expand on the comment before the rng.get_state to explain for a future poor maintainer what's happening 🙈

Done, but maybe it is too long now? Anyway, happy to make any changes.

Member

@ogrisel ogrisel left a comment

Please consider suggested changes in #26958 (comment).

ogrisel and others added 3 commits September 6, 2023 15:57
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Clarify the use of RNGs in test_target_encoder.test_encoding
@lucyleeow
Member Author

@ogrisel thanks for the changes, merged. Maybe this is ready to go now?

@ogrisel
Member

ogrisel commented Sep 7, 2023

LGTM once the ruff linting problem reported in Circle CI is fixed.

@ogrisel ogrisel enabled auto-merge (squash) September 7, 2023 05:40
@lucyleeow
Member Author

Hmm, odd: ruff gave an error in a file not touched by this PR:

sklearn/utils/tests/test_utils.py:528:12: E721 Do not compare types, use `isinstance()`
    |
527 |     assert a_s == ["c", "b", "a"]
528 |     assert type(a_s) == list
    |            ^^^^^^^^^^^^^^^^^ E721
529 | 
530 |     assert_array_equal(b_s, ["c", "b", "a"])
    |

sklearn/utils/tests/test_utils.py:534:12: E721 Do not compare types, use `isinstance()`
    |
533 |     assert c_s == [3, 2, 1]
534 |     assert type(c_s) == list
    |            ^^^^^^^^^^^^^^^^^ E721
535 | 
536 |     assert_array_equal(d_s, np.array([["c", 2], ["b", 1], ["a", 0]], dtype=object))
    |
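For reference, the fix ruff's E721 is asking for is mechanical; a minimal sketch (not the actual patch to test_utils.py):

```python
a_s = ["c", "b", "a"]

# flagged by ruff (E721):  assert type(a_s) == list
assert isinstance(a_s, list)  # idiomatic; also accepts subclasses

# when an exact type match is really intended, identity comparison
# avoids the E721 warning:
assert type(a_s) is list
```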

@lucyleeow
Member Author

I've merged main to see if that will fix it.

@ogrisel ogrisel merged commit b72252e into scikit-learn:main Sep 7, 2023
27 checks passed
@lucyleeow lucyleeow deleted the test_encoder branch September 7, 2023 10:27
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Labels
Documentation, module:preprocessing, No Changelog Needed, Quick Review (for PRs that are quick to review), Waiting for Second Reviewer (first reviewer is done, need a second one!)