ENH Add "ensure_non_negative" option to check_array#29540
Conversation
|
Thanks for the PR @TamaraAtanasoska. I'd like to keep the It's possible though that |
Agreed, changed it everywhere to |
There was a problem hiding this comment.
We need to add an entry in the changelog located in doc/whats_new/v1.6.rst since it touch the public API of check_array. We don't need to document anything about _check_sample_weight there.
It is a bit weird that we introduce the option but that we don't use it.
@jeremiedbb do you think that we can use this feature directly in NMF because if I recall well this is one requirement of the transformer. I assume that we are currently calling the check_non_negative function directly in the code.
I don't know if there is something similar with transformer that expect counts as input in the feature extraction for text (like TfidfTransformer).
| ensure_non_negative : bool, default=False, | ||
| Whether or not the weights are expected to be non-negative. | ||
|
|
||
| .. versionadded:: 1.0 |
There was a problem hiding this comment.
We should probably have a versionchanged to mention that we rename the parameter. @jeremiedbb I'm scared to break code but this should be quite limited and easy to handle for downstream package.
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Thanks for the review @glemaitre! I am looking into adding to the changelog now. Is this an |
I would say an enhancement. |
|
@glemaitre @jeremiedbb a few notes on a6cfa3a:
|
glemaitre
left a comment
There was a problem hiding this comment.
I added two example usages in the existing code. What comes to light when looking at the changes required for them is that there is a loss of information that previous was there, packed in whom which is now a generic check_array as source of input. In what cases is this information essential?
I think that we can alleviate this issue by using both input_name and estimator of the check_array: estimator will help at getting the name of the estimator while input_name will help at indicating which data container contains negative values. We could adapt a bit the error message of check_non_negative to have a meaningful message, or craft a good whom message from these two parameters.
| array = array.copy(**copy_params) | ||
|
|
||
| if ensure_non_negative: | ||
| check_non_negative(array, "check_array") |
There was a problem hiding this comment.
As stated in another comment, if input_name and estimator are not the default, I think that we can make a crafted message better than "check_array".
glemaitre
left a comment
There was a problem hiding this comment.
I think that we can also modify NMF. Right now the check_non_negative is called in _fit_transform method. However, we could remove it and use the ensure_non_negative of check_array call in fit_transform and transform
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
|
@glemaitre I tired to address all the comments above in 890b480. Not sure about adding a generic |
I would probably not have a generic name and not mentioned it (so probably the empty yes). It would translate to something like: if ensure_non_negative:
whom = input_name
if estimator_name:
whom += f" in {estimator_name}"
check_non_negative(array, whom=whom) |
I love this suggestion, just pushed a commit with it. It preserves the |
| Make sure the array has only non-negative values. An array that contains | ||
| non-negative values will raise a ValueError. |
There was a problem hiding this comment.
| Make sure the array has only non-negative values. An array that contains | |
| non-negative values will raise a ValueError. | |
| Make sure the array has only non-negative values. If True, an array that contains | |
| negative values will raise a ValueError. |
| if ensure_non_negative: | ||
| whom = input_name | ||
| if estimator_name: | ||
| whom += f" in {estimator_name}" | ||
| check_non_negative(array, whom) | ||
|
|
There was a problem hiding this comment.
I would put that just before the previous if force_writeable block to be able to fail fast, before a pontential copy.
| graph = check_array( | ||
| graph, accept_sparse="csr", ensure_non_negative=True, input_name="graph" | ||
| ) |
There was a problem hiding this comment.
why didn't you keep "precomputed distance matrix" ?
There was a problem hiding this comment.
Agreed, graph is an internal variable and people might get confused by the error message then.
There was a problem hiding this comment.
thanks @jeremiedbb , makes sense! I didn't notice there is more relevant naming there.
|
@glemaitre, should we add it to |
Yep we can do that but in a subsequent PR. |
glemaitre
left a comment
There was a problem hiding this comment.
OK after failing at properly using git merge, I think this is all good ;)
|
Enabling auto-merge. Thanks @TamaraAtanasoska |
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Reference Issues/PRs
Fixes #29508.
What does this implement/fix? Explain your changes.
Adding an option to check if an array has
only_non_negativevalues to thesklearn.utils.validation.check_arrayfunction, that contains thesklearn.utils.validation.check_non_negativefunctionality.While in the initial issue I proposed
ensure_positiveas a name for the option, I foundonly_non_negativewith exactly the same function insklearn.utils.validation._check_sample_weight(here) and I used the same for consistency.This addition will prevent the need to use
sklearn.utils.validation.check_non_negativeaftersklearn.utils.validation.check_arrayin the use case of needing only non-negative values in an array. I am keeping the proposed changes minimal for easier review, but I am happy to also change all occurrences in the scikit-learn code where this pattern exists as an added commit.Any other comments?