FIX TransformedTargetRegressor warns when set_output expects dataframe #29401

thomasjpfan merged 5 commits into scikit-learn:main

Conversation
lesteve
left a comment
Thanks for the PR, a few comments.
```diff
  - `"pandas"`: DataFrame output
  - `"polars"`: Polars output
- - `None`: Transform configuration is unchanged
+ - `None`: Transform configuration is unchanged, global configuration is used
```
Actually it is not necessarily using the global configuration; you could imagine something like this:

```python
estimator.set_output(transform="pandas")
estimator.set_output(transform=None)
```

The global configuration is `transform="default"` but the estimator is still using `transform="pandas"`.

More of a meta comment: I get the fact that it may not be very clear from the docstring what `transform=None` does, but I don't think there is an easy way to improve the situation in the docstring. Full disclosure: I am not sure about the use case for `transform=None` actually 🤔.
Oh, I was thinking that we cannot actively set `transform=None`, because this is what the docstring says: `transform : {"default", "pandas", "polars"}, default=None`.

Also, I have just discovered that there is

```python
supported_outputs = ADAPTERS_MANAGER.supported_outputs
if dense_config not in supported_outputs:
    raise ValueError(
        f"output config must be in {sorted(supported_outputs)}, got {dense_config}"
    )
```

to prevent us from setting `transform=None`.

It can only stay `None` if it is untouched, and then the global configuration is used within the `_get_output_config()` function.

I'm happy to revert this if it's wrong, but I will keep my suggestions for now to facilitate discussion.
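To make this fallback concrete, here is a simplified stand-alone sketch of the behaviour described above (the function body and `GLOBAL_CONFIG` are illustrative stand-ins, not the real sklearn internals; only the validation snippet quoted above is taken from the actual code):

```python
# Simplified sketch: an unset (untouched) estimator-level output config falls
# back to the global configuration; an explicit invalid value such as None
# fails the validation quoted above.

GLOBAL_CONFIG = {"transform_output": "default"}  # stand-in for sklearn.get_config()


def get_output_config(method, estimator):
    est_config = getattr(estimator, "_sklearn_output_config", {})
    if method in est_config:
        dense_config = est_config[method]
    else:
        # untouched estimator-level setting: use the global configuration
        dense_config = GLOBAL_CONFIG[f"{method}_output"]
    supported_outputs = {"default", "pandas", "polars"}
    if dense_config not in supported_outputs:
        raise ValueError(
            f"output config must be in {sorted(supported_outputs)}, got {dense_config}"
        )
    return {"dense": dense_config}


class Dummy:
    pass


est = Dummy()
print(get_output_config("transform", est))  # {'dense': 'default'} via the global fallback
```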
> transform : {"default", "pandas", "polars"}, default=None

I guess the docstring is slightly misleading then. It seems like you can use `transform=None`:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.set_output(transform=None)
scaler.fit(X)
# this outputs a pandas DataFrame, so transform=None did not change transform="pandas"
X_scaled = scaler.transform(X)
```
`None` is used as a sentinel. In the future, if `set_output` supported `predict`, then the default `transform=None` would leave the transform configuration alone:

```python
scaler.set_output(predict="pandas")
```
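The sentinel idea can be sketched in plain Python (the class here is hypothetical, for illustration only; it is not the real `_SetOutputMixin`):

```python
# Hypothetical sketch of the None-as-sentinel pattern: None means
# "leave this setting untouched", while an explicit string replaces
# the stored configuration.

class Estimator:
    def __init__(self):
        # empty by default; a global default would apply at transform time
        self._output_config = {}

    def set_output(self, *, transform=None, predict=None):
        # None is a sentinel: only non-None values update the config
        if transform is not None:
            self._output_config["transform"] = transform
        if predict is not None:
            self._output_config["predict"] = predict
        return self


est = Estimator()
est.set_output(transform="pandas")
est.set_output(predict="pandas")  # leaves the "transform" entry alone
print(est._output_config)  # {'transform': 'pandas', 'predict': 'pandas'}
```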
Your example got me really confused @lesteve, because I had tried a similar experiment with `FunctionTransformer` that had led me to think that we cannot pass `transform=None`:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer

# set_config(transform_output="pandas")  # same as using set_output before transform=None
X, y = load_iris(as_frame=True, return_X_y=True)
functrans = FunctionTransformer()
functrans.set_output(transform="pandas")
functrans.set_output(transform=None)
functrans.fit(X)
# this raises a ValueError:
functrans.transform(X)
```

```
ValueError: output config must be in ['default', 'pandas', 'polars'], got None
```

It seems that `FunctionTransformer().set_output()` and `_SetOutputMixin.set_output()` (from `StandardScaler`) are handling this differently. The latter has an early `if transform is None: return self`, while `FunctionTransformer` can pass `self._sklearn_output_config = {"transform": None}` to `_get_output_config()`, which would then raise the above error message.

I am not sure what to make of this, though...
Is this wanted/expected, or should we align this?

I think I should revert the documentation change, since `transform=None` is sometimes possible to pass, and then the previous setting is used, not necessarily the global configuration.
OK well, I think we should fix the bug first and leave the docstring modification for another PR, when we understand the `.set_output` behaviour a bit more.

Yes, I agree. I will revert the documentation changes then.
sklearn/compose/tests/test_target.py
Outdated

```python
assert regr.regressor_.predict_called


@pytest.mark.filterwarnings("error")
```
Did you find this pattern used in other places in our tests? Personally I would maybe find it a bit less clear than the recommendation from pytest for checking that no warning is emitted:

```python
with warnings.catch_warnings():
    warnings.simplefilter("error")
    ...
```

If I had to give a small justification for it, I would say that the code checking that there is no warning is closer to the code potentially creating the warning, which maybe helps make the intent a bit clearer ...
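As a minimal stdlib-only sketch of this recommendation (the two functions and the warning text are made up for illustration):

```python
import warnings


def quiet_function():
    # code under test that is expected not to warn
    return 42


def noisy_function():
    # code under test that emits a spurious warning
    warnings.warn("spurious warning", UserWarning)
    return 42


# Turning warnings into errors inside the context manager makes the
# surrounding test fail loudly if any warning is emitted.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    result = quiet_function()  # passes silently

try:
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        noisy_function()  # the warning is raised as an exception here
except UserWarning as exc:
    caught = str(exc)

print(result, caught)  # 42 spurious warning
```

The filter change is scoped to the `with` block, so the rest of the test module keeps its normal warning behaviour.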
Yes, I see what you mean. `@pytest.mark.filterwarnings("error")` is actually not used anywhere else in scikit-learn. I found it by googling how to make the test fail when a warning is raised.
Let's better use `warnings.catch_warnings` instead.
StefanieSenger
left a comment
Thank you @lesteve for reviewing. I have adopted the suggested changes and replied regarding the documentation change. Would you have another look?
I have just reverted the documentation change. Here's a link to it, in case we want to get back to it later.
thomasjpfan
left a comment
I feel like this should be backported to 1.5.2. I see that this PR is marked "no changelog required", but can we add one for 1.5.2?
Thanks @thomasjpfan,

@thomasjpfan do you have inside knowledge that there will be a 1.5.2 🤔? I was unsure about the changelog for this one, to be honest. It's a nice way to recognise the contributor's work, but at the same time nobody is going to think "weird, I don't get this spurious warning anymore, let's look at the changelog to see whether they changed something" ...

@StefanieSenger Thanks for this fix!
Reference Issues/PRs

closes #29361

What does this implement/fix? Explain your changes.

Prevents warnings on output type from being raised on `TransformedTargetRegressor().fit()` when the global output is set to "pandas" or "polars".

Any other comments?

Did this together with @lesteve: we figured that `TransformedTargetRegressor` is in fact not a transformer and therefore there should not be a warning; neither would we expect that X is returned as a DataFrame, nor do we need to raise an exception. Setting `set_config(transform_output="pandas")` should have no effect within `TransformedTargetRegressor().fit()`.

I have also checked other places where `FunctionTransformer` is used, but these occurrences all refer to transformers. So there should be no other places where we would need to apply the same fix.

In the second commit, I also suggest a documentation change to better explain the difference between `set_output(transform=None)` and `set_output(transform="default")` (see this comment).