
TST Fix typo, lint test_target_encoder.py #26958

Merged
merged 16 commits into from
Sep 7, 2023

Conversation

lucyleeow
Member

@lucyleeow lucyleeow commented Aug 1, 2023

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes some typos, cleans up a redundant if/else, and removes a magic number.

Any other comments?

@lucyleeow lucyleeow added the Quick Review For PRs that are quick to review label Aug 1, 2023
@lucyleeow lucyleeow changed the title DOC Fix typos in test_target_encoder.py Clean typo, lint test_target_encoder.py Aug 1, 2023
@github-actions

github-actions bot commented Aug 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9bda613.

@lucyleeow lucyleeow changed the title Clean typo, lint test_target_encoder.py Fix typo, lint test_target_encoder.py Aug 1, 2023
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR! LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Aug 1, 2023
@thomasjpfan thomasjpfan changed the title Fix typo, lint test_target_encoder.py TST Fix typo, lint test_target_encoder.py Aug 1, 2023
X_test = np.concatenate((X_test, [[unknown_value]]))

rng = np.random.RandomState(global_random_seed)

n_splits = 3
random_state = 0
Member

I think we should be using the rng which takes from global_random_seed instead?

Member Author

@lucyleeow lucyleeow Aug 3, 2023

Glad you raised this because it uncovers interesting behaviour. The tests fail if we use rng in creating a *KFold and also pass it to TargetEncoder. I think this is because we are calling shuffle on this RandomState object more than once, which gives different results each time (I think this is intended behaviour), see:

In [1]: import numpy as np

In [2]: rng = np.random.RandomState(42)

In [3]: a = np.arange(10)

In [4]: a
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [5]: rng.shuffle(a)

In [6]: a
Out[6]: array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])

In [7]: a = np.arange(10)

In [8]: a
Out[8]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]: rng.shuffle(a)

In [10]: a
Out[10]: array([0, 1, 8, 5, 3, 4, 7, 9, 6, 2])

Generating a new RandomState object (with the same seed) and then calling shuffle on a will give the same result as Out[6]. Thus when we use an int, a new RandomState object is created each time and the tests pass.

Not sure if we need to (or can) do anything about this? Potentially just document it?
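The difference above can be sketched with a small helper. `shuffled` below is a hypothetical function mimicking how scikit-learn's `check_random_state` treats its argument: an int seed builds a fresh RandomState on every call, while a RandomState instance is reused and therefore mutated.

```python
import numpy as np

def shuffled(seed_or_rng):
    # check_random_state-style handling: an int builds a fresh
    # RandomState each call; an instance is reused (and mutated).
    rng = (np.random.RandomState(seed_or_rng)
           if isinstance(seed_or_rng, int) else seed_or_rng)
    a = np.arange(10)
    rng.shuffle(a)
    return a

# An int seed is reproducible: a fresh generator is built per call.
assert np.array_equal(shuffled(42), shuffled(42))

# A shared instance advances its internal state, so calls differ.
shared = np.random.RandomState(42)
first, second = shuffled(shared), shuffled(shared)
assert not np.array_equal(first, second)
```

This is why the tests pass with `random_state = 0` (each consumer gets its own identically seeded generator) but fail with a shared `rng`.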

Member

I'm not sure if I understand. If we need to get a shuffle and re-use it, we should be doing that and passing it along, and if that's not the case, I don't see why it would make the tests fail.

Member Author

I am not sure I follow you either. We shuffle separately in `for train_idx, test_idx in cv.split(X_train_array, y_train):` and inside of `TargetEncoder.fit_transform`. We can't get a shuffle and re-use it (if re-use means reusing the shuffled indices).

But for clarity, the symptoms are: the tests fail when I use rng and pass when I use random_state. The shuffled indices (e.g., taking the continuous case, KFold) from here:

check_random_state(self.random_state).shuffle(indices)

are different when done in the test at the line `for train_idx, test_idx in cv.split(X_train_array, y_train):` and when done inside of `TargetEncoder.fit_transform`.

I checked (with id()) that the same rng object is used both at the line `for train_idx, test_idx in cv.split(X_train_array, y_train):` and in `TargetEncoder.fit_transform`.

Member Author

Ah, I'm so silly. I will just create a new rng and pass it to TargetEncoder.

Member

I think the current tests refactor is better than main and good enough to be merged as is. I think there are ways to get global_random_seed to work, but that can be done as a follow-up.

Member Author

@lucyleeow lucyleeow Aug 4, 2023

Just realised that a reasonably neat way is to call set_state each time before the rng is passed to an estimator, done in: f6ef1d2
However, happy to revert if you prefer @thomasjpfan @adrinjalali
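A minimal sketch of the get_state/set_state trick being discussed: snapshot the generator state up front, then rewind before handing the rng to the next consumer, so every consumer sees the same shuffle.

```python
import numpy as np

rng = np.random.RandomState(0)
state = rng.get_state()  # snapshot the generator state

a = np.arange(10)
rng.shuffle(a)           # first consumer advances the state
first = a.copy()

rng.set_state(state)     # rewind before the next consumer
b = np.arange(10)
rng.shuffle(b)

assert np.array_equal(first, b)  # both consumers see identical shuffles
```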

Member

I am okay with get_state + set_state as long as there is a comment above the get_state call explaining why it is necessary.

Member Author

Done

Member

Coming late to this discussion. Apparently the get_state/set_state calls are not needed. Here is a PR to this PR to further reorganize this test and make it more explicit and even more seed-independent:

lucyleeow#1

@lucyleeow
Member Author

lucyleeow commented Aug 4, 2023

@thomasjpfan not sure if another review is needed after changes (sorry for noise if not!)

Edit: never mind, I've gone back to using an int.


Member

@adrinjalali adrinjalali left a comment

lol, this turned out to be an interesting one. I'm okay with the solution. But I think it'd be nice to expand on the comment before the rng.get_state to explain for a future poor maintainer what's happening 🙈

@lucyleeow
Member Author

But I think it'd be nice to expand on the comment before the rng.get_state to explain for a future poor maintainer what's happening 🙈

Done, but maybe it is too long now? Anyway, happy to make any changes.

Member

@ogrisel ogrisel left a comment

Please consider suggested changes in #26958 (comment).

ogrisel and others added 3 commits September 6, 2023 15:57
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Clarify the use of RNGs in test_target_encoder.test_encoding
@lucyleeow
Member Author

@ogrisel thanks for the changes, merged. Maybe this is ready to go now?

@ogrisel
Member

ogrisel commented Sep 7, 2023

LGTM once the ruff linting problem reported in Circle CI is fixed.

@ogrisel ogrisel enabled auto-merge (squash) September 7, 2023 05:40
@lucyleeow
Member Author

Hmm, odd: ruff gave an error in a file not touched by this PR:

sklearn/utils/tests/test_utils.py:528:12: E721 Do not compare types, use `isinstance()`
    |
527 |     assert a_s == ["c", "b", "a"]
528 |     assert type(a_s) == list
    |            ^^^^^^^^^^^^^^^^^ E721
529 | 
530 |     assert_array_equal(b_s, ["c", "b", "a"])
    |

sklearn/utils/tests/test_utils.py:534:12: E721 Do not compare types, use `isinstance()`
    |
533 |     assert c_s == [3, 2, 1]
534 |     assert type(c_s) == list
    |            ^^^^^^^^^^^^^^^^^ E721
535 | 
536 |     assert_array_equal(d_s, np.array([["c", 2], ["b", 1], ["a", 0]], dtype=object))
    |
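For reference, the fix ruff's E721 is asking for is mechanical; a minimal sketch (not the actual patch to test_utils.py):

```python
a_s = ["c", "b", "a"]

# flagged by ruff (E721):  assert type(a_s) == list
assert isinstance(a_s, list)  # idiomatic; also accepts subclasses

# when an exact type match is really intended, identity comparison
# avoids the E721 warning:
assert type(a_s) is list
```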

@lucyleeow
Member Author

I've merged main to see if that will fix it.

@ogrisel ogrisel merged commit b72252e into scikit-learn:main Sep 7, 2023
27 checks passed
@lucyleeow lucyleeow deleted the test_encoder branch September 7, 2023 10:27
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Labels
Documentation, module:preprocessing, No Changelog Needed, Quick Review (for PRs that are quick to review), Waiting for Second Reviewer (first reviewer is done, need a second one!)