Add as_frame='auto' option in datasets.fetch_openml #17396

Merged
8 changes: 8 additions & 0 deletions doc/whats_new/v0.24.rst
@@ -44,6 +44,14 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.

:mod:`sklearn.datasets`
.......................

- |Enhancement| :func:`datasets.fetch_openml` now allows argument `as_frame`
  to be 'auto', which tries to convert returned data to pandas DataFrame
  unless data is sparse.
  :pr:`17396` by :user:`Jiaxiang <fujiaxiang>`.

:mod:`sklearn.decomposition`
............................

8 changes: 7 additions & 1 deletion sklearn/datasets/_openml.py
@@ -667,13 +667,16 @@ def fetch_openml(name=None, *, version='active', data_id=None, data_home=None,
        If True, returns ``(data, target)`` instead of a Bunch object. See
        below for more information about the `data` and `target` objects.

    as_frame : boolean, default=False
    as_frame : boolean or 'auto', default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric, string or categorical). The target is
        a pandas DataFrame or Series depending on the number of target_columns.
        The Bunch will contain a ``frame`` attribute with the target and the
        data. If ``return_X_y`` is True, then ``(data, target)`` will be pandas
        DataFrames or Series as describe above.
        If as_frame is 'auto', the data and target will be converted to
        DataFrame or Series as if as_frame is set to True, unless the dataset
        is stored in sparse format.

    Returns
    -------

Review comment (Member), on the new `as_frame` docstring text:

It's always a DataFrame I think?

Reply (Contributor Author):

data.target may be a Series. For example:

from sklearn.datasets import fetch_openml

data_id = 61  # iris dataset version 1
data = fetch_openml(data_id=data_id, as_frame=True)

>>> type(data.target)
<class 'pandas.core.series.Series'>

@@ -768,6 +771,9 @@ def fetch_openml(name=None, *, version='active', data_id=None, data_home=None,
    if data_description['format'].lower() == 'sparse_arff':
        return_sparse = True

    if as_frame == 'auto':
        as_frame = not return_sparse

    if as_frame and return_sparse:
        raise ValueError('Cannot return dataframe with sparse data')
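
The three checks in this hunk can be read in isolation. Below is a minimal sketch of that decision logic, assuming it is pulled out into a standalone function. The name `resolve_as_frame` is hypothetical and not part of scikit-learn; in the PR the checks stay inline in `fetch_openml`.

```python
def resolve_as_frame(as_frame, return_sparse):
    """Hypothetical helper mirroring the checks added to fetch_openml.

    as_frame: True, False or 'auto' (the value passed by the caller).
    return_sparse: True when the dataset is stored as sparse ARFF.
    """
    # 'auto' means: use a DataFrame unless the data is sparse.
    if as_frame == 'auto':
        as_frame = not return_sparse

    # An explicit as_frame=True on sparse data is still rejected.
    if as_frame and return_sparse:
        raise ValueError('Cannot return dataframe with sparse data')

    return as_frame


print(resolve_as_frame('auto', return_sparse=False))  # dense data: DataFrame
print(resolve_as_frame('auto', return_sparse=True))   # sparse data: no DataFrame
```

Because 'auto' is resolved before the sparse check, it can never trigger the `ValueError`; only an explicit `as_frame=True` on a sparse dataset does.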

14 changes: 14 additions & 0 deletions sklearn/datasets/tests/test_openml.py
@@ -489,6 +489,20 @@ def test_fetch_openml_australian_pandas_error_sparse(monkeypatch):
        fetch_openml(data_id=data_id, as_frame=True, cache=False)


def test_fetch_openml_as_frame_auto(monkeypatch):
    pd = pytest.importorskip('pandas')

    data_id = 61  # iris dataset version 1
    _monkey_patch_webbased_functions(monkeypatch, data_id, True)
    data = fetch_openml(data_id=data_id, as_frame='auto')
    assert isinstance(data.data, pd.DataFrame)

    data_id = 292  # Australian dataset version 1
    _monkey_patch_webbased_functions(monkeypatch, data_id, True)
    data = fetch_openml(data_id=data_id, as_frame='auto')
    assert isinstance(data.data, scipy.sparse.csr_matrix)
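
Outside the test suite, the same behaviour could be checked against the live OpenML service. The following is a rough sketch, assuming a scikit-learn version that includes this change, pandas installed, and network access. The helper name `summarize_openml_dataset` is hypothetical and only used for illustration.

```python
def summarize_openml_dataset(data_id):
    """Hypothetical helper: fetch a dataset with as_frame='auto' and
    report the container types of its data and target."""
    # Imported lazily so that defining the helper does not itself
    # require scikit-learn to be installed.
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(data_id=data_id, as_frame='auto')
    return type(bunch.data).__name__, type(bunch.target).__name__
```

Calling `summarize_openml_dataset(61)` and `summarize_openml_dataset(292)` should, going by the test above, report a pandas `DataFrame` for iris and a SciPy `csr_matrix` for the Australian dataset; the calls are not run here because they download data.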


def test_convert_arff_data_dataframe_warning_low_memory_pandas(monkeypatch):
    pytest.importorskip('pandas')