Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Change default value of as_frame in fetch_openml to 'auto' #17610

Conversation

fujiaxiang
Copy link
Contributor

@fujiaxiang fujiaxiang commented Jun 16, 2020

This is a follow up to #17396.

As discussed in #17396, I'm changing the default value of as_frame in function fetch_openml to 'auto'.

@fujiaxiang
Copy link
Contributor Author

I didn't add any warnings for this change of behavior because fetch_openml is an experimental feature: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html

Let me know if you guys think we should add a warning.

@fujiaxiang fujiaxiang changed the title [WIP] Change default value of as_frame in fetch_openml to 'auto' [MRG] Change default value of as_frame in fetch_openml to 'auto' Jun 16, 2020
@fujiaxiang fujiaxiang changed the title [MRG] Change default value of as_frame in fetch_openml to 'auto' [WIP] Change default value of as_frame in fetch_openml to 'auto' Jun 16, 2020
@fujiaxiang
Copy link
Contributor Author

@glemaitre thanks for review, will follow the guidelines in the doc.

Sorry I am not familiar with sklearn development policies. Will read the docs more.

@glemaitre
Copy link
Contributor

Don't worry. That's why I give pointer :)

@fujiaxiang
Copy link
Contributor Author

hi @glemaitre I added the future warning and corresponding test, could you review?

@fujiaxiang fujiaxiang changed the title [WIP] Change default value of as_frame in fetch_openml to 'auto' [MRG] Change default value of as_frame in fetch_openml to 'auto' Jun 17, 2020
Copy link
Contributor

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicks. Otherwise looks good

sklearn/datasets/_openml.py Outdated Show resolved Hide resolved
sklearn/datasets/tests/test_openml.py Outdated Show resolved Hide resolved
@lesteve
Copy link
Member

lesteve commented Jun 23, 2020

@fujiaxiang there was a conflict in the whats_new I took the liberty of fixing it, hopefully you don't mind too much.

fujiaxiang and others added 2 commits June 24, 2020 10:21
updated docstring

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@fujiaxiang
Copy link
Contributor Author

@glemaitre thanks for review, have updated the test and docstring as requested.

@lesteve Not at all, thanks! I also took the liberty to reorder the whatsnew entries a bit by their labels (i.e Enhancement comes before API). Could you review?

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments

sklearn/datasets/_openml.py Outdated Show resolved Hide resolved
sklearn/datasets/_openml.py Outdated Show resolved Hide resolved
doc/whats_new/v0.24.rst Outdated Show resolved Hide resolved
@fujiaxiang fujiaxiang requested a review from lesteve June 25, 2020 10:40
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think forcing the user to always set as_frame is excessively conservative, and creates unnecessary as_frame= where the dataset would fail to load with as_frame=False (e.g. string columns).

I think we should either:

  1. return a dataframe for any dataset that fails with as_frame=False (e.g. non-numeric datasets); or
  2. given the experimental label, simply adopt as_frame='auto' but raise a ChangedBehaviourWarning in the case that as_frame=False would have returned an array, but now will return a dataframe (e.g. all-numeric data)

…_asframe_default_to_auto

# Conflicts:
#	doc/whats_new/v0.24.rst
#	sklearn/datasets/_openml.py
@fujiaxiang
Copy link
Contributor Author

I think we should either:

  1. return a dataframe for any dataset that fails with as_frame=False (e.g. non-numeric datasets); or
  2. given the experimental label, simply adopt as_frame='auto' but raise a ChangedBehaviourWarning in the case that as_frame=False would have returned an array, but now will return a dataframe (e.g. all-numeric data)

Hi @jnothman, thanks for the two suggestions.

I am not particularly fond of Option 1. As per discussions in #14888 and #17396, it causes confusion to users if the return type depends on the data content itself, and potentially causes unwanted exceptions. It also defeats the purpose of having 'auto' being one of the option for as_frame.

I think Option 2 is ok. Although, as @glemaitre and @lesteve pointed out, we have already compensated for the experimental label by shortening the deprecation cycle by 1 version. Do you think this is still too conservative? Another issue we face now is that ChangedBehaviourWarning itself is deprecated in 0.24. If we are to go with Option 2, is there an alternative to it (perhaps UserWarning)? I didn't find such info in the deprecation message.

@fujiaxiang fujiaxiang requested a review from jnothman July 6, 2020 05:28
@glemaitre
Copy link
Contributor

given the experimental label, simply adopt as_frame='auto' but raise a ChangedBehaviourWarning in the case that as_frame=False would have returned an array, but now will return a dataframe (e.g. all-numeric data)

I am + 1 with option 2. I would be in favor to keep ChangedBehaviour for these specific case.

@fujiaxiang
Copy link
Contributor Author

@glemaitre @jnothman I have updated the code to simply adopt new default value as_frame='auto' in 0.24 and raises ChangedBehaviorWarning in cases where it used to return array but now returns DataFrame.

For now I have to include a filter to ignore the deprecation warning of ChangedBehaviorWarning in the test.

@adrinjalali also mentioned in #17804 that perhaps we shouldn't even raise a warning for experimental features.
This seems to be inline with other similar projects pandas and tensorflow, which follows or loosely follows Semantic Verisoning 2.0.
I couldn't find any scikit-learn documentation that explicitly discussed version policy for experimental features, so I took a look at similar projects tensorflow and pandas.

In Tensorflow version policy (https://www.tensorflow.org/guide/versions)

Some parts of TensorFlow can change in backward incompatible ways at any point. These include:

  • Experimental APIs:

In Tensorflow documentation (https://www.tensorflow.org/probability/api_docs/python/tfp/experimental)

tfp.experimental has no API stability guarantee. The public footprint of tfp.experimental code may change without notice or warning.

Also in Tensorflow (https://www.tensorflow.org/api_docs/python/tf/autograph/experimental/Feature)

These conversion options are experimental. They are subject to change without notice and offer no guarantees.

In Pandas version policy (https://pandas.pydata.org/docs/development/policies.html)

Pandas may change the behavior of experimental features at any time.

Also in pandas (https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html)

Experimental: the behaviour of pd.NA can still change without warning.

StringDtype is currently considered experimental. The implementation and parts of the API may change without warning.

Both allows experimental features to change any time, and may do so without warning.

So I think it is ok to make this change without any warning. It is a matter of how you guys (sklearn core development team) want to set the policy. I myself kind of prefer the flexibility to change experimental features without warning, as this allows developers to quickly test new features among receive community feedback. Another small downside of giving ChangedBahaviorWarning is when a behavior is changed a second time while the feature is still experimental. It may confuse users what's actually going on.

Perhaps you guys can have a discussion and decide what kind of policy/guideline you want to set?

@adrinjalali
Copy link
Member

adrinjalali commented Jul 7, 2020

Yes, I would be very much happier if we treat the experimental features as experimental and change them when needed without a warning. The warnings unnecessarily clutter user's space and prevent us from raising warnings in places where we actually need to raise a warning (bad default param and intercept scaling just two examples).

@fujiaxiang I know you really mean no harm, but it would be much nicer if you could avoid using "you guys" (https://youguys.club/). Thanks :)

@fujiaxiang
Copy link
Contributor Author

@adrinjalali @glemaitre @jnothman @lesteve I removed the warning message. Could you review?

@fujiaxiang
Copy link
Contributor Author

I know you really mean no harm, but it would be much nicer if you could avoid using "you guys"

@adrinjalali Sure, will avoid using that phrase.

@fujiaxiang
Copy link
Contributor Author

ping

@fujiaxiang
Copy link
Contributor Author

@adrinjalali @glemaitre @jnothman @lesteve Anyone can help review?

sklearn/datasets/_openml.py Outdated Show resolved Hide resolved
@@ -91,6 +91,11 @@ Changelog
`data` and `frame` members are pandas DataFrames, and the `target` member is
a pandas Series. :pr:`17491` by :user:`Alex Liang <tianchuliang>`.

- |API| The default value of `as_frame` in :func:`datasets.fetch_openml` will
change from False to 'auto' in 0.25. It now issues a `FutureWarning` when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not issue FutureWarning anymore.

fujiaxiang and others added 4 commits August 15, 2020 09:05
@jnothman jnothman changed the title [MRG] Change default value of as_frame in fetch_openml to 'auto' API Change default value of as_frame in fetch_openml to 'auto' Aug 15, 2020
@jnothman jnothman merged commit bdf2ff5 into scikit-learn:master Aug 15, 2020
5 checks passed
@jnothman
Copy link
Member

Thanks @fujiaxiang

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants