Use `check_array` to validate `y` #25089

betatim · 2022-12-01T15:06:34Z

Reference Issues/PRs

closes #25073 (more precisely this PR combined with #25080 closes it)

What does this implement/fix? Explain your changes.

Uses `check_array` in `_check_y` so that `y` gets the same conversion of pandas extension dtypes that can represent missing values (such as `Int64`) as `X` does. The conversion happens in the part of `check_array` around `pandas_requires_conversion`.

Is this what you had in mind?

cc @glemaitre

adrinjalali · 2022-12-02T09:08:47Z

Would this hack still work after this change?

https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function/49598597#49598597

betatim · 2022-12-02T09:27:27Z

I've not tried to run the code from the SO answer, so technically "I don't know". From a quick read of the code it seems that it relies on the y being passed around as a data frame so that it can extract indices from it and then use those to index the sample weights.

From a very high-level view, this change replaces `y = np.asarray(y)` with `y = check_array(y)`. Since `np.asarray()` also removes the "data frame-ness", I assume the SO answer should still work.

jjerphan · 2022-12-05T16:05:25Z

sklearn/utils/validation.py

+    y = check_array(
+        y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False
+    )


The CI jobs fail because some tests have y be one-element arrays.

One solution is to adapt check_array regarding

scikit-learn/sklearn/utils/validation.py

Lines 926 to 933 in cbfb6ab

    if ensure_min_samples > 0:
        n_samples = _num_samples(array)
        if n_samples < ensure_min_samples:
            raise ValueError(
                "Found array with %d sample(s) (shape=%s) while a"
                " minimum of %d is required%s."
                % (n_samples, array.shape, ensure_min_samples, context)
            )

probably by either:

- changing the default value of `ensure_min_samples` to 0, or
- changing the check to `ensure_min_samples > 1`.
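
To make the two options above concrete, here is a minimal pure-Python sketch of the quoted guard. It is not scikit-learn code: `check_min_samples`, `accepts`, and the `threshold` parameter are hypothetical names introduced only for illustration, and the error message is shortened.

```python
def check_min_samples(n_samples, ensure_min_samples=1, threshold=0):
    """Hypothetical stand-in for the guard quoted above.

    `threshold` is the constant in `if ensure_min_samples > threshold`;
    the quoted code uses 0.
    """
    if ensure_min_samples > threshold:
        if n_samples < ensure_min_samples:
            raise ValueError(
                "Found array with %d sample(s) while a minimum of %d is required."
                % (n_samples, ensure_min_samples)
            )
    return n_samples


def accepts(n_samples, **kwargs):
    """Return True if the sketched guard lets `n_samples` through."""
    try:
        check_min_samples(n_samples, **kwargs)
    except ValueError:
        return False
    return True


# Guard as quoted, default ensure_min_samples=1: an empty input is rejected.
current = accepts(0)
# Option 1: default of ensure_min_samples changed to 0, so the guard is skipped.
option_1 = accepts(0, ensure_min_samples=0)
# Option 2: check changed to `ensure_min_samples > 1`, so a minimum of 1 is no longer enforced.
option_2 = accepts(0, threshold=1)
```

In this sketch both options let an empty input through; they differ in whether callers who explicitly pass `ensure_min_samples=1` still get the check.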

Alternatively, we can change this call to:

Suggested change

-    y = check_array(
-        y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False
-    )
+    y = check_array(
+        y,
+        ensure_2d=False,
+        ensure_min_samples=0,
+        dtype=dtype,
+        input_name="y",
+        force_all_finite=False,
+    )

The first proposal seems more sensible to me, but it would probably require more changes and cascading adaptations. The second proposal is relatively simple, but probably not always semantically correct.

betatim · 2022-12-13

For example, the test `test_classification_report_output_dict_empty_input` checks that you can call `classification_report` with `[]` for both `y_true` and `y_pred`. So I think we need to allow `ensure_min_samples=0`.
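
A quick way to see the effect on an empty `y`, assuming scikit-learn is installed. This is only a sketch of the behaviour discussed here: it calls `check_array` directly rather than `_check_y`, and it leaves `dtype`, `input_name`, and `force_all_finite` at their defaults to keep the example short.

```python
from sklearn.utils.validation import check_array

empty_y = []

# With the default ensure_min_samples=1, an empty input is rejected.
try:
    check_array(empty_y, ensure_2d=False)
    rejected_by_default = False
except ValueError:
    rejected_by_default = True

# With ensure_min_samples=0, the same input is returned as an empty 1-D array.
validated = check_array(empty_y, ensure_2d=False, ensure_min_samples=0)
```

This is the case the empty-input test relies on: validation has to return an empty array instead of raising.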

jjerphan

LGTM given that tests pass.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes #25073

Use check_array to validate y

ec5d42f

github-actions bot added the module:utils label Dec 1, 2022

adrinjalali approved these changes Dec 2, 2022

jjerphan reviewed Dec 5, 2022

betatim added 2 commits December 13, 2022 13:47

Merge branch 'main' into y-input-validation

210f54e

Allow zero samples

b19979b

betatim force-pushed the y-input-validation branch from bfa11e7 to b19979b Compare December 13, 2022 16:08

Merge branch 'main' into y-input-validation

17d3579

jjerphan approved these changes Dec 15, 2022

jjerphan enabled auto-merge (squash) December 15, 2022 13:54

jjerphan merged commit a0829ac into scikit-learn:main Dec 15, 2022

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 3, 2023

Use check_array to validate y (scikit-learn#25089)

3242578

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073

ogrisel added this to the 1.2.1 milestone Jan 19, 2023

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023

Use check_array to validate y (scikit-learn#25089)

b559fd3

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023

Use check_array to validate y (scikit-learn#25089)

c84b0ca

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023

DOC Move changelog entry for scikit-learn#25089

996eefa

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023

Use check_array to validate y (scikit-learn#25089)

2d3047b

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073

adrinjalali pushed a commit that referenced this pull request Jan 24, 2023

Use check_array to validate y (#25089)

3995b33

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes #25073
