Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use check_array to validate y #25089

Merged
merged 4 commits into from Dec 15, 2022
Merged

Conversation

betatim
Copy link
Member

@betatim betatim commented Dec 1, 2022

Reference Issues/PRs

closes #25073 (more precisely this PR combined with #25080 closes it)

What does this implement/fix? Explain your changes.

Uses check_array in _check_y so that we get the same behaviour for converting pandas special data types (that can represent missing values) like Int64 as for X. This is done in the part of the code around pandas_requires_conversion.

Is this what you had in mind?

cc @glemaitre

@adrinjalali
Copy link
Member

@betatim
Copy link
Member Author

betatim commented Dec 2, 2022

I've not tried to run the code from the SO answer, so technically "I don't know". From a quick read of the code it seems that it relies on the y being passed around as a data frame so that it can extract indices from it and then use those to index the sample weights.

From a super high level view, this change replaces a y = np.asarray(y) with y = check_array(y). So I assume that the SO answer should still work. Because np.asarray() would also remove the "data frame-ness".

Comment on lines 1170 to 1172
y = check_array(
y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI jobs fail because some tests have y be one-element arrays.

One solution is to adapt check_array regarding

if ensure_min_samples > 0:
n_samples = _num_samples(array)
if n_samples < ensure_min_samples:
raise ValueError(
"Found array with %d sample(s) (shape=%s) while a"
" minimum of %d is required%s."
% (n_samples, array.shape, ensure_min_samples, context)
)

probably by:

  • changing the default value of ensure_min_samples to 0
  • changing the check for ensure_min_samples > 1

Alternatively, we can change this call to:

Suggested change
y = check_array(
y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False
)
y = check_array(
y,
ensure_2d=False,
ensure_min_samples=0,
dtype=dtype,
input_name="y",
force_all_finite=False,
)

The first proposal is more sensible to me while probably necessitating more changes and adaptation in cascade whilst the second proposal is relatively simple but probably semantically not always correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example one test test_classification_report_output_dict_empty_input checks that you can call classification_report with [] for both y_true and y_pred. So I think we need to allow ensure_min_samples=0.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM given that tests pass.

@jjerphan jjerphan enabled auto-merge (squash) December 15, 2022 13:54
@jjerphan jjerphan merged commit a0829ac into scikit-learn:main Dec 15, 2022
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 3, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
closes scikit-learn#25073
@ogrisel ogrisel added this to the 1.2.1 milestone Jan 19, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
closes scikit-learn#25073
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
closes scikit-learn#25073
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
closes scikit-learn#25073
adrinjalali pushed a commit that referenced this pull request Jan 24, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
closes #25073
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError: "Unknown label type: 'unknown'" when class column has Pandas type like Int64
4 participants