TST Fix typo, lint test_target_encoder.py
#26958
Conversation
test_target_encoder.py
Thank you for the PR! LGTM
```python
X_test = np.concatenate((X_test, [[unknown_value]]))

rng = np.random.RandomState(global_random_seed)

n_splits = 3
random_state = 0
```
I think we should be using the `rng`, which takes from `global_random_seed`, instead?
Glad you raised this because it uncovers interesting behaviour. The tests fail if we use `rng` in creating a `*KFold` and also pass it to `TargetEncoder`. I think this is because we are calling `shuffle` on this `RandomState` object more than once, which gives us different results (I think this is intended behaviour), see:
```python
In [1]: import numpy as np

In [2]: rng = np.random.RandomState(42)

In [3]: a = np.arange(10)

In [4]: a
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [5]: rng.shuffle(a)

In [6]: a
Out[6]: array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])

In [7]: a = np.arange(10)

In [8]: a
Out[8]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]: rng.shuffle(a)

In [10]: a
Out[10]: array([0, 1, 8, 5, 3, 4, 7, 9, 6, 2])
```
Generating a new `RandomState` object (with the same seed) and then calling `shuffle` on `a` will give the same results as "Out[6]". Thus when we use an int, a new `RandomState` object is created each time and the tests pass.

Not sure if we need/can do anything about this? Potentially just document?
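The point above can be sketched with plain NumPy (a minimal illustration, not the actual test code; the seed 42 matches the session above):

```python
import numpy as np

# First shuffle from a fresh RandomState(42).
rng1 = np.random.RandomState(42)
a = np.arange(10)
rng1.shuffle(a)  # -> [8, 1, 5, 0, 7, 2, 9, 4, 3, 6], as in Out[6]

# A brand-new RandomState with the same seed starts from the same
# internal state, so its first shuffle is identical.
rng2 = np.random.RandomState(42)
b = np.arange(10)
rng2.shuffle(b)
```

This is why an int `random_state` keeps two call sites in sync: each call site builds its own fresh `RandomState` from the int, so each sees the stream from the beginning.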
I'm not sure if I understand. If we need to get a shuffle and re-use it, we should be doing that and passing it along, and if that's not the case, I don't see why it would make the tests fail.
I am not sure I follow you either. We shuffle separately in `for train_idx, test_idx in cv.split(X_train_array, y_train):` and inside of `TargetEncoder.fit_transform`. We can't get a shuffle and re-use it (if re-use means reusing the shuffled indices).
But for clarity, the symptoms are at least: the tests fail when I use `rng` and pass when I use `random_state`. The shuffled indices (e.g., taking the continuous case, `KFold`) produced here:

`check_random_state(self.random_state).shuffle(indices)`

are different when done in the test at the line `for train_idx, test_idx in cv.split(X_train_array, y_train):` and when done inside of `TargetEncoder.fit_transform`. I checked (with `id()`) that the `rng` object used in the line `for train_idx, test_idx in cv.split(X_train_array, y_train):` and in `TargetEncoder.fit_transform` is the same object.
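The int-versus-object asymmetry can be demonstrated with a simplified stand-in for `check_random_state` (the helper below is a hypothetical, pure-NumPy sketch, not scikit-learn's actual implementation):

```python
import numpy as np

def check_random_state_like(seed):
    # Simplified stand-in for sklearn.utils.check_random_state:
    # an int yields a fresh RandomState; an existing RandomState
    # object is returned as-is.
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("unsupported seed type")

# Int seed: a fresh RandomState per call site -> identical shuffles.
a = np.arange(10); check_random_state_like(0).shuffle(a)
b = np.arange(10); check_random_state_like(0).shuffle(b)

# Shared RandomState object: the same stream is consumed twice
# -> the second shuffle differs from the first.
rng = np.random.RandomState(0)
c = np.arange(10); check_random_state_like(rng).shuffle(c)
d = np.arange(10); check_random_state_like(rng).shuffle(d)
```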
Ah, I'm so silly. I will just create a new `rng` and pass it to `TargetEncoder`.
I think the current tests refactor is better than `main` and good enough to be merged as is. I think there are ways to get `global_random_seed` to work, but that can be done as a follow-up.
Just realised that a reasonably neat way is to `set_state` each time before it is passed to an estimator, done in: f6ef1d2. However, happy to revert if you prefer @thomasjpfan @adrinjalali
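A minimal sketch of the `get_state`/`set_state` idea (illustration only, not the code in f6ef1d2): snapshot the RNG state once, then rewind to it before each consumer so every consumer sees the identical stream.

```python
import numpy as np

rng = np.random.RandomState(0)
state = rng.get_state()  # snapshot before the first consumer

a = np.arange(10)
rng.shuffle(a)           # first consumer advances the stream

rng.set_state(state)     # rewind to the snapshot
b = np.arange(10)
rng.shuffle(b)           # second consumer sees the same stream
```

With this pattern the same `RandomState` object can be passed to both the splitter and the estimator without the two shuffles drifting apart.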
I am okay with `get_state` + `set_state` as long as there is a comment above `get_state` explaining why it is necessary.
Done
Coming late to this discussion. Apparently the get_state/set_state are not needed. Here is a PR to this PR to further reorganize this test and make it more explicit and even more seed independent:
@thomasjpfan not sure if another review is needed after the changes (sorry for the noise if not!) Edit: never mind, I've gone back to using an int.
lol, this turned out to be an interesting one. I'm okay with the solution. But I think it'd be nice to expand on the comment before the `rng.get_state` to explain for a future poor maintainer what's happening 🙈

Done, but maybe it is too long now? Anyway, happy to make any changes.
Please consider suggested changes in #26958 (comment).
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Clarify the use of RNGs in test_target_encoder.test_encoding
@ogrisel thanks for the changes, merged, maybe this is ready to go now?
LGTM once the ruff linting problem reported in CircleCI is fixed.
Hmm, odd, ruff gave an error in a file not touched by this PR:

I've merged main to see if that will fix it.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Fixes some typos, cleans a redundant if/else and removes a magic number.
Any other comments?