
[MRG+2] ENH Passthrough DataFrame in FunctionTransformer #11043

Merged (17 commits) on Jul 10, 2018

Conversation

@glemaitre (Member) commented Apr 29, 2018

Reference Issues/PRs

closes #10655

What does this implement/fix? Explain your changes.

This PR adds the following changes:

  • Raise a FutureWarning that validate=False will be the default in the future.
  • Convert lists to arrays when validate=False.
  • Add a what's new entry for the change of behaviour.

Any other comments?

@glemaitre (Member Author)

@jnothman @amueller @rth I would like to have some feedback on this.
I added the force_all_finite parameter since I thought it was actually missing.

One of my main concerns now is how to modify the documentation. Shall we modify some examples to show the use with pandas (dependency issue)? The new parameter is also not yet the default.

WDYT?

if self._validate:
    if hasattr(X, 'loc') and self._validate == 'array-or-frame':
        if self.force_all_finite:
            _assert_all_finite(X.values, allow_nan=False
                               if self.force_all_finite is True
                               else True)
Member

Wasn't there some case where X.values is not the same as np.array(X)? Also, this is possibly expensive, as we're converting to a NumPy array and then discarding it.

Member Author

Also, this is possibly expensive, as we're converting to a NumPy array and then discarding it.

It should be expensive memory-wise, but apparently it is faster:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
0    10000 non-null int64
1    10000 non-null float64
0    10000 non-null int64
1    10000 non-null float64
0    10000 non-null int64
1    10000 non-null float64
0    10000 non-null int64
1    10000 non-null float64
0    10000 non-null int64
1    10000 non-null float64
dtypes: float64(5), int64(5)
memory usage: 781.3 KB

>>> %timeit df.isna().any().any()
1.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit _assert_all_finite(df.values, allow_nan=False)
163 µs ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@jorisvandenbossche do you have any advice regarding pandas internals for checking inf and NaN in a DataFrame most efficiently?

Member

It is mainly the .any().any() that makes it slow. Once you have done .isna() you have a full frame of booleans, so getting the values then is cheap:

In [11]: %timeit df.isna().any().any()
1.87 ms ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [12]: %timeit df.isna()
411 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [13]: %timeit df.isna().values.any()
431 µs ± 7.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Still not as fast as the _assert_all_finite though, but much closer.

Wasn't there some case where X.values is not the same as np.array(X)?

Normally, for a DataFrame, .values will always be an array. For a Series this is indeed not always true, but np.asanyarray is still called inside the _assert_all_finite function, so this shouldn't matter.
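
For illustration, a minimal sketch of the kind of array-level check being discussed; the helper name is hypothetical and this is not the PR's code:

import numpy as np
import pandas as pd

def frame_all_finite(df, allow_nan=False):
    # Hypothetical helper: operate on the underlying ndarray, as in the
    # timings above, instead of going through df.isna().any().any().
    values = df.values
    if allow_nan:
        # only reject infinity; NaN is allowed
        return not np.isinf(values).any()
    # reject both NaN and infinity
    return bool(np.isfinite(values).all())

df = pd.DataFrame(np.random.randn(10000, 10))
assert frame_all_finite(df)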

@jnothman (Member) commented Apr 30, 2018 via email

- If 'allow-nan', accept only np.nan values in X. Values cannot be
infinite.

Applied only when ``validate=True``.
Member

As I understand, in the future the default for validate will be 'array-or-frame'. That means that this force_all_finite will not be used anymore once that change is made?
That seems a bit strange as a) that will be a change in behaviour and b) then this default in the documentation will be misleading.

Member Author

That means that this force_all_finite will not be used anymore once that change is made

The aim is to keep a DataFrame as a DataFrame without converting it (i.e. without calling check_array). So with 'array-or-frame' this behavior is enforced, but finiteness is still checked for both arrays and DataFrames.

But the documentation is wrong :)
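
For reference, a condensed sketch of the dispatch being described, pieced together from the diff excerpts quoted in this thread (simplified; not the exact PR code):

from sklearn.utils import check_array
from sklearn.utils.validation import _assert_all_finite  # private helper

def _check_input(self, X):
    if self._validate:
        if hasattr(X, 'loc') and self._validate == 'array-or-frame':
            # pandas DataFrame: pass it through unchanged, but still
            # check finiteness on the underlying array
            if self.force_all_finite:
                _assert_all_finite(
                    X.values,
                    allow_nan=not (self.force_all_finite is True))
            return X
        return check_array(X, accept_sparse=self.accept_sparse,
                           force_all_finite=self.force_all_finite)
    return X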


- If True, then X will be converted to a 2-dimensional NumPy array or
sparse matrix. If the conversion is not possible an exception is
raised.
Member

You left out the "or X contains NaN or infinity" part, I suppose because it is now controlled by force_all_finite?
I think it might still be useful to at least reference the other keyword, since it is also about "validation".

- :class:`preprocessing.FunctionTransformer` is accepting pandas DataFrame in
``func`` without converting to a NumPy array when
``validate='array-or-frame'``. :issue:`10655` by :user:`Guillaume Lemaitre
<glemaitre>`.
Member

also add note about the new force_all_finite keyword?

sparse matrix. If the conversion is not possible an exception is
raised.
- If False, then there is no input validation
- If 'array-or-frame', X will be pass-through if this is a pandas
Member

pass-through -> passed through ?

if self.validate is None:
    self._validate = True
    warnings.warn("The default validate=True will be replaced by "
                  "validate='array-or-frame' in 0.22.", FutureWarning)
Member

Should we let this depend on the type of X? E.g., only do this if the input is actually a DataFrame? (Related to my other question about whether the behaviour for arrays will change with regard to NaNs.)

Member Author

We can only do it for DataFrame. The behavior for arrays will not change: we will enforce finiteness by default (no NaN and no infinite values), which is the equivalent of validate=True.
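
A sketch of warning only for DataFrame input, as discussed (hypothetical; the snippet quoted above warns unconditionally whenever validate is left unset):

import warnings

def _resolve_validate(validate, X):
    # Hypothetical: raise the FutureWarning only when the upcoming
    # default ('array-or-frame') would change behaviour for this
    # particular input, i.e. when X is a DataFrame.
    if validate is None:
        if hasattr(X, 'loc'):
            warnings.warn("The default validate=True will be replaced "
                          "by validate='array-or-frame' in 0.22.",
                          FutureWarning)
        return True
    return validate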

@glemaitre (Member Author)

@jorisvandenbossche could you have another look?

- If 'allow-nan', accept only np.nan values in X. Values cannot be
infinite.

This parameter is discarded when ``validate=False``.
Member

"If validate is False, this has no effect" for more consistency with the accept_sparse docsting above.

This parameter is discarded when ``validate=False``.

.. versionadded:: 0.20
``force_all_finite`` was added to let pass NaN.
Member

The behavior is more complex than just letting NaN pass.

Maybe just keep .. versionadded:: 0.20 alone: the description of what it does is above.

    .format(self.force_all_finite))

if self._validate:
    if hasattr(X, 'loc') and self._validate == 'array-or-frame':
Member

xarray.DataArray will pass here while it shouldn't.

Instead of indirectly detecting DataFrames by attribute names (here and elsewhere in scikit-learn) maybe it would make sense to make an actual isinstance check when possible? e.g.

def is_dataframe(obj):
    try:
        import pandas as pd
        return isinstance(obj, pd.DataFrame)
    except ImportError:
        return False

Member Author

Actually, that might allow people to pass an xarray DataArray, and it should even work. The validation for NaN and inf currently works for those arrays. Then, we have the following solutions:

  • Keep the duck typing and do not document that xarray is supported;
  • Keep the duck typing and document that xarray can be passed;
  • Make a strict isinstance check.

@rth WDYT?

Member

Interesting.

Keep the duck typing and do not document that xarray is supported;

Let's keep things as they are now, then. The second solution would require some more unit tests (and optionally a CI build with xarray), and since pandas support is only in progress, I don't think it would be reasonable to officially support yet another data format at this time.

``validate=True`` as default will be replaced by
``validate='array-or-frame'`` in 0.22.

.. versionadded:: 0.20
Member

versionchanged

X = np.random.randn(100, 10)
transformer = FunctionTransformer()
transformer.set_params(**params)
with pytest.raises(ValueError, match=msg_err):
@rth (Member), May 2, 2018

This is very nice and so much better than the approach proposed in the second example of the pytest docs, but this only works for pytest >= 3.1.0 (released less than a year ago); before that, the match argument is ignored (tested with 3.0.7), producing false positives.

As long as CI uses an up-to-date pytest this should not be an issue, but if we use it, it would be worth specifying minimum pytest requirements in the dev documentation.
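
For reference, a minimal self-contained example of the pattern (hypothetical test; requires a pytest version that supports the match argument):

import pytest

def test_raises_with_match():
    # match is a regular expression searched against the error message
    with pytest.raises(ValueError, match="not finite"):
        raise ValueError("input is not finite")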

@jnothman (Member) commented May 6, 2018

Can we step back a moment and ask what practical benefit validate=True provides users? It means the underlying function will work regardless of whether the input was a list or an array, but that is true for most numpy or pandas functions anyway. It helps the user with some sanity checking (finiteness, 2d), but maybe that's better handled with a warning, perhaps by validating the output. Should we just be changing validate's default to False?

@glemaitre (Member Author)

I must admit that I have difficulty following your thoughts. If I understand well, validate=True should only raise a warning, and validate=False should be the default, letting users manage their input themselves. I would be OK with False as a default, but I would still raise an error when validate=True.

Personally, I was thinking that it could be great to have:

  • validate=True calling the default check_array;
  • validate=False skipping check_array (and possibly being the default in FunctionTransformer);
  • validate={'accept_sparse': ...} being a dict of parameters passed to check_array for more flexibility (see the sketch below). In this case we could remove accept_sparse and force_all_finite, which would simplify this function for most users.

In addition, when working on this PR, I was thinking that check_array should provide a keyword to convert or not convert DataFrame, DataArray, etc. to a NumPy array.

@rth @jnothman WDYT?
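
A rough sketch of what the dict-based option above could look like (hypothetical API; not what this PR implements):

from sklearn.utils import check_array

def _validated(X, validate):
    # Hypothetical dispatch over the proposed ``validate`` values
    if validate is False:
        return X
    if validate is True:
        return check_array(X)
    # otherwise assume a dict of keyword arguments forwarded to
    # check_array, e.g. {'accept_sparse': True, 'force_all_finite': False}
    return check_array(X, **validate)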

@rth (Member) commented May 7, 2018

Should we just be changing validate's default to False?

Each time I used FunctionTransformer in the past I had to set validate=False, but that was mainly because DataFrames were not supported then...

I still think validate='array-or-frame' by default is preferable, because if the user gets an error (i.e. the data doesn't pass validation), they can take a moment to think about their data and whether it would make sense to validate it or not. If validate=False by default, there is no error, so there is no incentive to change anything or to validate even when it would make sense.
As pointed out by @glemaitre, another issue is that with ``validate in {True, False, 'array-or-frame'}`` one would need yet another option to not validate but raise a warning.

As to the rest of @glemaitre's #11043 (comment), I don't really have a strong opinion about it.


@pytest.mark.parametrize(
    "is_dataframe",
    [True, False]
)
Member

nitpick: I know that's the code style in pytest examples, but do you think it's really worth spreading something that can be written in 1 line over 4 lines? I'm not convinced it helps readability.

@@ -260,6 +260,12 @@ Miscellaneous
:issue:`9101` by :user:`alex-33 <alex-33>`
and :user:`Maskani Filali Mohamed <maskani-moh>`.

- :class:`preprocessing.FunctionTransformer` is accepting pandas DataFrame in
Member

"now accepts" ?

@@ -572,6 +578,10 @@ Misc
acts as an upper bound on iterations.
:issue:`#10982` by :user:`Juliet Lawton <julietcl>`

- In :class:`preprocessing.FunctionTransformer`, the default of ``validate``
will changed from ``True`` to ``'array-or-frame'`` in 0.22. :issue:`10655` by
Member

"will be"

- If True, then X will be converted to a 2-dimensional NumPy array or
sparse matrix. If the conversion is not possible an exception is
raised.
- If False, then there is no input validation.
Member

remove the "then" here and in the bullet point above

- If False, then there is no input validation.

When X is validated, the parameters ``accept_sparse`` and
``force_all_finite`` will control the validation for the sparsity and
Member

"will control" -> "control"

self._validate = self.validate

if ((not isinstance(self._validate, bool)) and
        self._validate != 'array-or-frame'):
Member

just _validate not in (True, False, 'array-or-frame')?

"'array-or-frame'. Got {!r} instead."
.format(self._validate))
if ((not isinstance(self.force_all_finite, bool)) and
self.force_all_finite != 'allow-nan'):
Member

Same here: self.force_all_finite not in (True, False, 'allow-nan')?

if self.force_all_finite:
    _assert_all_finite(X.values, allow_nan=False
                       if self.force_all_finite is True
                       else True)
@rth (Member), May 7, 2018

We could replace,

allow_nan = (False if self.force_all_finite is True else True)

with

allow_nan = not (self.force_all_finite is True)
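
Both forms agree for the three accepted values of force_all_finite, which a quick check confirms:

for force_all_finite in (True, False, 'allow-nan'):
    a = False if force_all_finite is True else True
    b = not (force_all_finite is True)
    assert a == b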

@glemaitre (Member Author)

I addressed @rth's comments. @jnothman could you give some insights regarding #11043 (comment)?

@jnothman (Member)

Yes, I like the proposed API, as long as the simple cases are straightforward for users to understand.

@amueller (Member)

I feel that's exposing a lot of internals of check_array that I don't really want to maintain / keep backward compatible.

In addition, when working on this PR, I was thinking that check_array should provide a keyword to convert or not DataFrame, DataArray, etc. to numpy array.

Where's DataArray from? Xray? Can we easily detect things implementing numpy protocols? DataFrame doesn't, right?

@jnothman (Member) commented May 21, 2018 via email

@amueller (Member)

OK, the question is what the criterion should be. Exposing __array__? But we probably need to know how to slice it, right?

@jnothman (Member) commented May 22, 2018 via email

@amueller (Member)

Probably not, which is why I was a bit confused by @glemaitre's suggestion.

@glemaitre (Member Author)

Would it not be possible to do something like:

if hasattr(X, 'loc'):
    if convert_to_array:
        return np.asarray(X)
    else:
        return X

@jnothman (Member) commented May 24, 2018 via email

@glemaitre (Member Author) commented May 24, 2018 via email


accept_sparse : boolean, optional
Indicate that func accepts a sparse matrix as input. If validate is
False, this has no effect. Otherwise, if accept_sparse is false,
sparse matrix inputs will cause an exception to be raised.

force_all_finite : boolean or 'allow-nan', optional, default=True
Member

I would go for 'allow-nan' as the default? As we are going in that direction for other transformers as well?

@jorisvandenbossche (Member)

I feel that's exposing a lot of internals of check_array that I don't really want to maintain / keep backward compatible.

+1, I also wouldn't directly expose that.

To echo what @jnothman said above: is there actually a good reason we validate at all (by default)? I think it would make sense to have validate=False as default.

@glemaitre (Member Author)

I'm starting to be convinced by validate=False, since the same rule applies as for other transformers --- i.e. the classifier or regressor will catch it if there is a NaN.

@glemaitre (Member Author)

@jnothman I made the change. Is the what's new entry explicit enough regarding the conversion of X to an array? This is a behaviour change.

sparse matrix. If the conversion is not possible an exception is
raised.
- If False, then there is no input validation
- If 'array-or-frame', X will be pass-through if this is a pandas
Member

"pass-through" -> "passed through unchanged"

Member

How about "passed through unchanged if it is a 2d array, sparse matrix or DataFrame"

- If False, then there is no input validation
- If 'array-or-frame', X will be pass-through if this is a pandas
DataFrame or converted to a 2-dimensional array or sparse matrix. In
this latest case, an exception will be raised if the conversion
Member

I don't know what this means

if self._validate:
    return check_array(X, accept_sparse=self.accept_sparse)
else:
    # convert X to NumPy array when this is a list
Member

No, I don't think this is an acceptable change in terms of backwards compatibility.
You may get memory issues etc. if it's a list of long strings, for instance, since numpy will copy it into contiguous memory.

Also, you've suggested above that it will be converted into a 2d array, but there's no 2d here.
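
A small illustration of the concern, using NumPy's fixed-width string dtype:

import numpy as np

strings = ['x' * 1000, 'y']
arr = np.asarray(strings)
# NumPy allocates a fixed-width unicode dtype sized to the longest
# element, so even the one-character string occupies 1000 characters
print(arr.dtype)   # <U1000
print(arr.nbytes)  # 8000 (2 elements * 1000 chars * 4 bytes each)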

@jnothman (Member) commented Jun 9, 2018 via email

@glemaitre (Member Author) commented Jun 12, 2018

What I was suggesting is that if a new option becomes default, it should probably have this property about lists, not that it should happen now.

So you mean to only raise a FutureWarning for the moment?

@glemaitre (Member Author)

OK so this is good to be reviewed once again.

@jnothman jnothman changed the title [MRG] EHN Passthrough DataFrame in FunctionTransformer [MRG] ENH Passthrough DataFrame in FunctionTransformer Jun 14, 2018
@jnothman (Member) left a comment

Yes, I think this is good, for now at least.

with pytest.warns(expected_warning) as results:
    transformer.fit_transform(X)
if expected_warning is None:
    assert len(results) == 0
Member

is this assert not results?

@qinhanmin2014 (Member) left a comment

I vote +1 for using validate=False as default since I don't think it's the duty of FunctionTransformer to check the input. What's more, I agree that DataFrame in & 2d-array out is not friendly.

@@ -663,7 +663,7 @@ error with a ``filterwarnings``::
>>> import warnings
>>> warnings.filterwarnings("error", message=".*check_inverse*.",
... category=UserWarning, append=False)

Member

unrelated change

@@ -301,6 +301,7 @@ Miscellaneous
:issue:`9101` by :user:`alex-33 <alex-33>`
and :user:`Maskani Filali Mohamed <maskani-moh>`.


Member

unrelated change

@@ -643,6 +644,10 @@ Misc
- Invalid input for :class:`model_selection.ParameterGrid` now raises TypeError.
:issue:`10928` by :user:`Solutus Immensus <solutusimmensus>`

- In :class:`preprocessing.FunctionTransformer`, the default of ``validate``
Member

Why in the Misc section? Maybe move to Preprocessing section?

self.func = func
self.inverse_func = inverse_func
self.validate = validate
self.accept_sparse = accept_sparse
self.force_all_finite = force_all_finite
Member

What's the purpose of it? Seems that force_all_finite is not used?

Member Author

Good catch, it is a leftover from another option.

@qinhanmin2014 qinhanmin2014 changed the title [MRG] ENH Passthrough DataFrame in FunctionTransformer [MRG+2] ENH Passthrough DataFrame in FunctionTransformer Jul 8, 2018
@glemaitre (Member Author)

ping @qinhanmin2014 ready to be merged :)

accept_sparse=False, pass_y='deprecated', check_inverse=True,
kw_args=None, inv_kw_args=None):
def __init__(self, func=None, inverse_func=None, validate=None,
accept_sparse=False, force_all_finite=True,
Member

I think we also need to remove force_all_finite here

@qinhanmin2014 (Member) commented Jul 10, 2018

ping @jnothman @glemaitre and other people around
The PR itself LGTM, but before merging, I would like to confirm whether there's consensus on how to change the default value of a parameter (#11219). We've merged several PRs for 0.20 (e.g., #10331 and #9379) and will have some more (e.g., #11128 and #11139 are marked as blockers). Many contributors are also confused about how we do these things.
(1) I think there's consensus that we should give the parameter a new default value, and we're going to recommend 'deprecated' instead of None.
(2) Do we need to take care of the tests to ensure that no warnings are raised? Currently, two merged PRs along with this PR only take care of part of the tests, which seems strange from my side:
#9379: warning message occurs 167 times in the Appveyor log
#10331: warning message occurs 25 times in the Appveyor log
this PR: warning message occurs 7 times in the Appveyor log
(3) Do we need to take care of the examples? It seems that #9379, #10331 and this PR fail to take care of the examples (in the examples/ folder):
#9379: warning message occurs 103 times in the Circle CI log
#10331: warning message occurs 41 times in the Circle CI log
Changing the default value of a parameter is different from deprecating. For deprecating, we go no warning -> warning -> error. For changing a default, we go no warning -> warning -> no warning. So the question is whether we should take care of tests/examples when changing the default value of a parameter.
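
For illustration, a generic sketch of the sentinel pattern referred to in (1) (hypothetical code, not from this PR):

import warnings

class Example:
    # Use a 'deprecated' sentinel as the default instead of None, so
    # that None remains available as a meaningful user-supplied value.
    def __init__(self, validate='deprecated'):
        self.validate = validate

    def fit(self, X):
        if self.validate == 'deprecated':
            # The user did not set the parameter explicitly: warn about
            # the upcoming change of default, then keep the old default.
            warnings.warn("The default of validate will change from "
                          "True to False in 0.22.", FutureWarning)
            self._validate = True
        else:
            self._validate = self.validate
        return self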

@jnothman (Member) commented Jul 10, 2018 via email

@qinhanmin2014 (Member)

I'll merge this one. This PR only modifies tests when they fail, which seems reasonable from my side. @jnothman, I think someone needs to make a decision so that we can check merged PRs and tell contributors what to do in open PRs. Several PRs which change the default value of a parameter are marked as blockers.

Closes #10655: FunctionTransformer should not convert DataFrames to arrays by default