
[MRG] Adds fetch_openml pandas dataframe support #13902

Merged
merged 43 commits into from Jul 12, 2019

Conversation

@thomasjpfan
Member

thomasjpfan commented May 18, 2019

Reference Issues/PRs

Fixes #11819
Fixes #11818
Supersedes #11875
Supersedes #13177

What does this implement/fix? Explain your changes.

This PR adds return_frame to fetch_openml. Some design decisions were made to get this to work (when return_frame is True):

  1. The data attribute will contain the whole data frame. Usually a pandas dataframe is used to explore the data, using groupbys, etc. Having the target in the dataframe helps with this aspect.
  2. The target attribute will be set to None. (The target is in the pandas dataframe.)
  3. A new target_names attribute was added. Together with feature_names, this allows the pandas dataframe to be split into features and targets later.
  4. String datatypes will always be objects.
  5. Nominal data types will always be categories.
  6. Real, numeric, or integer features with missing values will always be floats.
  7. Integer features without missing values will be ints.

There will be an unrelated error from test_scale_and_stability.
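The dtype rules above can be sketched with plain pandas (this is an illustration with a hypothetical four-column dataset, not the PR's actual conversion code):

```python
import numpy as np
import pandas as pd

# Sketch of the dtype rules listed above, on a made-up dataset:
df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),  # nominal -> category
    "name": ["a", "b", "c"],                          # string -> object
    "height": [1.0, np.nan, 2.0],                     # numeric with missing -> float
    "count": [1, 2, 3],                               # integer, no missing -> int
})
print(df.dtypes)
```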

thomasjpfan added 3 commits May 17, 2019
@jnothman

Member

jnothman commented May 21, 2019

The data attribute will contain the whole data frame. Usually a pandas dataframe is used to explore the data, using groupbys, etc. Having the target in the dataframe helps with this aspect.

What about if return_X_y=True?

I agree that including the target in the frame corresponds to common pandas usage, but it doesn't correspond to Scikit-learn convention. I'd rather leave the target out unless you have a more compelling reason otherwise.

check_pandas_support('fetch_openml with return_frame=True')
import pandas as pd

df = pd.DataFrame(arrf_data['data'], columns=list(features_dict.keys()),

@jnothman

jnothman May 22, 2019

Member

arff_data['data'] is a generator, right? Does that work with all supported versions of Pandas? Can you make sure that it does not just do list(data) and explode the memory usage?

@glemaitre

glemaitre May 22, 2019

Contributor

Otherwise we can pass it to pd.DataFrame.from_records(), which supports generators since 0.12

Actually @jorisvandenbossche mentioned that it should not be implemented in from_records either.

@thomasjpfan

thomasjpfan May 22, 2019

Author Member

pd.DataFrame does call list(data), which we need to avoid.

Otherwise we can pass it to pd.DataFrame.from_records()

Just tested from_records out and it seems to work. :)
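For reference, a minimal check that `from_records` accepts a generator of row tuples (note it still materializes the rows internally, as discussed below):

```python
import pandas as pd

def rows():
    # Hypothetical ARFF-like row tuples.
    yield (1, "a")
    yield (2, "b")

df = pd.DataFrame.from_records(rows(), columns=["x", "y"])
print(df.shape)  # (2, 2)
```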

sklearn/datasets/openml.py
@thomasjpfan

Member Author

thomasjpfan commented May 22, 2019

I agree that splitting up the dataframe into target and data matches the Scikit-learn convention more. If we break up data and target, we would need to guide users to recombine the dataset to do any exploratory data analysis that needs the target in the dataframe. This runs a little counter to what is expected when loading a dataframe from a csv, where the target is included. Users are used to breaking up the dataframe themselves.

In the end, I am okay with going either way on this matter.

@jnothman

Member

jnothman commented May 22, 2019

-------
df : pd.DataFrame
"""
check_pandas_support('fetch_openml with return_frame=True')

@glemaitre

glemaitre May 22, 2019

Contributor

Do you think that it would be nice to have directly something like:

pd = check_pandas_support(...)

@thomasjpfan

thomasjpfan May 22, 2019

Author Member

Yup, that would be nice to have. I'll include it and see what others think.
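The helper glemaitre suggests (having the check return the pandas module) might look roughly like this sketch, reusing the error message quoted later in the thread:

```python
def check_pandas_support(caller_name):
    """Import pandas and return the module, or raise an informative error."""
    try:
        import pandas
        return pandas
    except ImportError as e:
        raise ImportError(
            "{} requires pandas. You can install pandas with "
            "`pip install pandas`".format(caller_name)
        ) from e

pd = check_pandas_support("fetch_openml with return_frame=True")
```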

thomasjpfan added 2 commits May 22, 2019
Member

jorisvandenbossche left a comment

@thomasjpfan thanks for taking this up! Added a few comments

except ImportError as e:
raise ImportError(
"{} requires pandas. You can install pandas with "
"`pip install pandas`".format(caller_name)

@jorisvandenbossche

jorisvandenbossche May 22, 2019

Member

I personally wouldn't suggest pip install pandas (e.g. for all conda users, this is the wrong suggestion)

df : pd.DataFrame
"""
pd = check_pandas_support('fetch_openml with return_frame=True')
df = pd.DataFrame.from_records(arrf_data['data'],

@jorisvandenbossche

jorisvandenbossche May 22, 2019

Member

For some reason GitHub prevents me from replying to the previous comments about this.
But I would personally just use pd.DataFrame(...). from_records is not more efficient (it expands the generator first as well), and is less well maintained than the main constructor.

@jnothman

jnothman May 22, 2019

Member

What might work is to load the data in chunks to limit memory consumption
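The chunking idea could look roughly like this (a sketch; the helper the PR eventually adds is named differently, e.g. `_chunk_generator`):

```python
from itertools import islice

import pandas as pd

def chunk_generator(gen, chunksize):
    # Yield successive lists of up to `chunksize` rows from a generator,
    # so only one chunk of rows is materialized at a time.
    while True:
        chunk = list(islice(gen, chunksize))
        if not chunk:
            return
        yield chunk

rows = ((i, 2 * i) for i in range(10))
dfs = [pd.DataFrame(chunk, columns=["a", "b"])
       for chunk in chunk_generator(rows, 4)]
df = pd.concat(dfs, ignore_index=True)
print(df.shape)  # (10, 2)
```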

dtype = pd.CategoricalDtype(attributes[column])
dtypes[column] = dtype

return df.astype(dtypes)

@jorisvandenbossche

jorisvandenbossche May 22, 2019

Member

In my original PR, I had a special case for all-numeric columns (that prevents yet another copy in that specific case).
Just mentioning it, but I think it might be worth it (I didn't re-check with timings though)

thomasjpfan added 5 commits May 24, 2019
…port
@thomasjpfan

Member Author

thomasjpfan commented May 24, 2019

This PR has been updated with:

  1. An nrows parameter in fetch_openml to control how many rows to consume at a time when constructing the dataframe.
  2. There is less copying in _convert_arff_data_dataframe.
  3. Features that openml marks as is_ignore or is_row_identifier are removed from the dataframe. If a dataset contains these columns, removing them copies the dataframe, which forces fetch_openml to return a non-view dataframe.
  4. Raises an error when return_X_y=True and return_frame=True. To correctly follow scikit-learn convention, X could still be a dataframe, but y would need to be reshaped into a 1-D array (if there is one target). Letting y be a numpy array appears to contradict return_frame=True.
arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
for data in arrf_data_gen]
df = pd.concat(dfs, copy=False)

@jorisvandenbossche

jorisvandenbossche May 24, 2019

Member

You can leave out the copy=False here :) (that's never possible along this axis)

If True, returns a Bunch where the data attribute is a pandas
DataFrame.
nrows : int, default=5000

@jorisvandenbossche

jorisvandenbossche May 24, 2019

Member

Maybe rather chunksize as how you called it above?
For an nrows argument, I would expect that it specifies to only return me that many (first) rows.

arrf_columns = list(attributes)

arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)

@jorisvandenbossche

jorisvandenbossche May 24, 2019

Member

Is this object dtype needed?

thomasjpfan added 2 commits May 24, 2019
Member

jnothman left a comment

I'm still not convinced by the sense of returning the target in the frame. I see it as incompatible with the rest of sklearn.datasets (apart from making it easy to leak y to an estimator).


_monkey_patch_webbased_functions(monkeypatch, data_id, True)

bunch = fetch_openml(data_id=data_id, return_frame=True, cache=False)

@jnothman

jnothman May 25, 2019

Member

I think it would be best to compare the bunches returned with return_frame=False vs True. This is especially needed in case we were to add something to the bunch later...

chunksize : int, default=5000
Number of rows to read at a time when constructing a dataframe.
Only used when ``return_frame`` is True.

@jnothman

jnothman May 25, 2019

Member

we could be using this more generally instead of the messy np.fromiter code.

@thomasjpfan

thomasjpfan May 28, 2019

Author Member

What do you find messy about the np.fromiter code?

DataFrame.
chunksize : int, default=5000
Number of rows to read at a time when constructing a dataframe.

@jnothman

jnothman May 25, 2019

Member

Is it possible to estimate against working_memory instead?

@thomasjpfan

thomasjpfan May 29, 2019

Author Member

It is possible. We would need to estimate the amount of memory used per row.
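Estimating rows-per-chunk from a memory budget, as suggested, might be sketched like this (the 64 MiB budget is hypothetical; scikit-learn's actual working_memory machinery is more involved):

```python
import pandas as pd

# Build a one-row frame from the first record and measure its footprint,
# then derive a chunk size from a memory budget.
first_row = (1, "abc", 3.5)
first_df = pd.DataFrame([first_row], columns=["x", "y", "z"])
row_bytes = first_df.memory_usage(deep=True).sum()

working_memory = 64 * 2 ** 20  # hypothetical 64 MiB budget
chunksize = max(int(working_memory // row_bytes), 1)
```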

@thomasjpfan

Member Author

thomasjpfan commented May 25, 2019

This PR was updated with an additional bunch entry, dataframe, which still includes the target. When return_frame=True, both data and target are None. A user would need to use target_columns and feature_columns to split features from the target. Adding a dataframe entry helps distinguish this API from the rest of the sklearn.datasets api.

I think the common use case for using a dataframe is to do EDA (using the target), and then use the target_columns and feature_columns to split the data. The user would also need to be careful with the shape of y, which is already the case when working with dataframes.

Let's say we split the dataframe up into data and target: the target would need to be reshaped, and would lose its pandas categorical dtype information and its column name. If we want a user to reconstruct the original dataframe, they would need to manually reconstruct the categorical dtype and put the original column name back. All of this work would be required to do EDA.

Currently the sklearn.datasets API is designed to return a data and target that can be directly fed into a scikit-learn estimator. I am envisioning a new dataframe entry for users who want to explore the data first, before building a model.
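Splitting the proposed dataframe entry by the column lists could look like this (the frame and column names below are hypothetical stand-ins for the bunch entries):

```python
import pandas as pd

frame = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": pd.Categorical(["setosa", "setosa", "virginica"]),
})
feature_columns = ["sepal_length"]  # stand-in for the bunch attribute
target_columns = ["species"]

X = frame[feature_columns]
y = frame[target_columns].squeeze()  # Series for a single target;
                                     # the categorical dtype is kept
```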

thomasjpfan added 3 commits May 25, 2019

@thomasjpfan

Member Author

thomasjpfan commented May 25, 2019

(although even there the user should probably not look at the target - and
arguably the features - on their test set)

Some plots benefit from having the target in the dataframe. For example, plotting the iris dataset with seaborn uses the target, species, to control the hue.

We are forcing users to learn a new way of
interacting with datasets (whether changing to fetch_openml from load_iris,
or vice versa) despite them being functionally required to use return_frame
in many cases. Does that make sense?

This makes sense. Keeping the datasets API consistent is important. Okay, I will update this PR to split the features and the target.

I should also point out that return_frame='auto' could be a nice feature,
for datasets with any non-numeric features. The current approach precludes
that possibility.

This sounds reasonable. It does not feel like too much magic.
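The `return_frame='auto'` idea might behave roughly like this sketch (the function and feature-type names are hypothetical, not from the PR):

```python
def resolve_as_frame(as_frame, feature_types):
    # Hypothetical sketch of 'auto': use a DataFrame when any feature is
    # non-numeric, otherwise keep plain ndarrays.
    if as_frame == "auto":
        return any(ftype not in ("numeric", "real", "integer")
                   for ftype in feature_types)
    return bool(as_frame)
```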

Member

jnothman left a comment

I've only checked the diff since last review, but I think this is almost there.

first_row = next(arrf['data'])
first_df = pd.DataFrame([first_row], columns=arrf_columns)

row_bytes = first_df.memory_usage(deep=True).sum()

@jnothman

jnothman Jun 18, 2019

Member

Maybe just add a comment to that effect. In practice the temporary list would have a further intp overhead per feature per sample, if I'm not mistaken.

@@ -582,8 +569,11 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
below for more information about the `data` and `target` objects.
as_frame : boolean, default=False
If True, returns a Bunch where the data attribute is a pandas
DataFrame.
If True, where the data is a pandas DataFrame including columns with

@jnothman

jnothman Jun 18, 2019

Member

"where the" -> "the returned"

@amueller

amueller Jul 2, 2019

Member

Not addressed?

If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas

@jnothman

jnothman Jun 18, 2019

Member

I am not sure this is necessary if the previous statement does not mention a Bunch. You could just say "this applies regardless of return_X_y"

appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
DataFrames or Series as describe above.

@jnothman

jnothman Jun 18, 2019

Member

*described

Member

jnothman left a comment

I still wouldn't mind a test that essentially checked the diff between the as_frame=True and the as_frame=False bunch. But otherwise LGTM. Please add a what's new entry. And perhaps open a follow-up issue towards as_frame='auto'.

@thomasjpfan

Member Author

thomasjpfan commented Jun 25, 2019

PR updated with:

  1. Test to compare as_frame=True and as_frame=False on a purely numerical dataset.
@glemaitre glemaitre self-requested a review Jul 2, 2019
Contributor

glemaitre left a comment

I think that this is almost done. Can you add a what's new entry?

raise ValueError('Unsupported feature: {}'.format(feature))


def _chunk_generator(gen, chunksize):

@glemaitre

glemaitre Jul 2, 2019

Contributor

I would move this function into utils, next to the other generator helpers.
I don't think we can factorize it with the current one.

thomasjpfan added 3 commits Jul 2, 2019
…port
Contributor

glemaitre left a comment

LGTM. I would like to hear from @ogrisel and @jorisvandenbossche regarding if they agree with the API.
I think this is in line with what we discussed earlier, but just to confirm before clicking the green button ;)

doc/whats_new/v0.22.rst
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -71,9 +71,6 @@
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])

X = data.drop('survived', axis=1)

@amueller

amueller Jul 2, 2019

Member

I would like to document this idiom somewhere. It's sooo common and I'm not sure if it's anywhere else in the docs.
Maybe give it as an alternative here?

@thomasjpfan

thomasjpfan Jul 2, 2019

Author Member

Added a comment using the frame attribute.

@amueller

amueller Jul 2, 2019

Member

thanks!

@jnothman

jnothman Jul 2, 2019

Member

I would like to document this idiom somewhere.

Is y = X.pop('survived') a better idiom? Or not because it modifies X in-place.

@jorisvandenbossche

jorisvandenbossche Jul 13, 2019

Member

I personally prefer frame.drop(columns='survived') instead of specifying the axis=1
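The idioms discussed in this thread, side by side on a toy frame (column names are illustrative, echoing the titanic example above):

```python
import pandas as pd

data = pd.DataFrame({"age": [22, 38], "survived": [0, 1]})

# Non-mutating idiom (data is left intact):
X = data.drop(columns="survived")
y = data["survived"]

# Mutating alternative with pop (removes the column from X2 in place):
X2 = data.copy()
y2 = X2.pop("survived")
```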

@@ -489,26 +557,38 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` objects.
as_frame : boolean, default=False
If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is

@amueller

amueller Jul 2, 2019

Member

It doesn't mention the frame attribute?

details : dict
More metadata from OpenML
frame : pandas DataFrame

@amueller

amueller Jul 2, 2019

Member

"Only present when as_frame=True"?

thomasjpfan added 2 commits Jul 2, 2019
…port
@glemaitre glemaitre merged commit cf3e303 into scikit-learn:master Jul 12, 2019
17 checks passed
LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 98.61% of diff hit (target 96.75%)
codecov/project 96.82% (+0.07%) compared to 03c39af
scikit-learn.scikit-learn Build #20190702.42 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) succeeded
scikit-learn.scikit-learn (Linux py35_np_atlas) succeeded
scikit-learn.scikit-learn (Linux pylatest_conda) succeeded
scikit-learn.scikit-learn (Linux32 py35_np_atlas_32bit) succeeded
scikit-learn.scikit-learn (Windows py35_32) succeeded
scikit-learn.scikit-learn (Windows py37_64) succeeded
scikit-learn.scikit-learn (macOS pylatest_conda) succeeded
@glemaitre

Contributor

glemaitre commented Jul 12, 2019

@thomasjpfan Thanks. Maybe we can have some examples that could benefit from it now.

@rth

Member

rth commented Jul 12, 2019

I'm really excited that this was added! Thanks to everyone who worked on it!

"""
pd = check_pandas_support('fetch_openml with as_frame=True')

attributes = OrderedDict(arrf['attributes'])

@amueller

amueller Jul 12, 2019

Member

typos, should be arff

@jorisvandenbossche

Member

jorisvandenbossche commented Jul 13, 2019

@thomasjpfan thanks for taking this to the finish line !
