
[MRG] Openml data loader #11419

Merged
merged 153 commits into scikit-learn:master on Aug 15, 2018

Conversation

janvanrijn
Contributor

@janvanrijn janvanrijn commented Jul 3, 2018

Reference Issues/PRs

Fixes #9543 Fixes #9908
This is the work I performed on issue #9543,
building upon the work that @amueller wants to merge in #9908

What does this implement/fix? Explain your changes.

Adds an OpenML data loader

Any other comments?

Builds upon the work of @amueller. The latest version of the liac-arff package is bundled in externals, and the unit tests are more extensive.

TO DO AFTER MERGE:

  • require version when fetching by name, or at least raise a warning along the lines of "Multiple dataset versions match the name BLAH. Versions may be very different. Getting version x."
  • Multilabel datasets which encode their targets as nominals do not work with scikit-learn estimators, since Y consists of strings. We should maybe identify multilabel data, at least when its values are all {FALSE, TRUE} and encode it to {0, 1}.
  • complete [MRG+1] Deprecate fetch_mldata #11466 by using fetch_openml in examples in place of fetch_mldata
  • new issue about caching decoded ARFF
  • new issue about checking dataset integrity with respect to stated number of instances, etc. or just use an MD5 check
  • resolve TODOs/FIXMEs
  • support return_X_y
  • add an option to one-hot encode nominals
  • add an option to ignore some features, or all STRING attributes
  • support returning dataframes (with what API?)
  • separate out a load_arff function?
  • use compressed HTTP requests as in https://stackoverflow.com/questions/3947120/does-python-urllib2-automatically-uncompress-gzip-data-fetched-from-webpage
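
For context, a minimal usage sketch of the loader added here; it uses only the fetch-by-name/version and fetch-by-id calls discussed in this PR, and treats the data_id for iris as an assumption.

from sklearn.datasets import fetch_openml

# Fetch by name and version; pinning the version avoids the ambiguity
# raised in the first TODO item above.
bunch = fetch_openml('iris', version=1)
X, y = bunch.data, bunch.target

# Fetching by OpenML dataset id is also possible (61 is assumed to be iris).
bunch_by_id = fetch_openml(data_id=61)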

@rth
Member

rth commented Aug 13, 2018

> rth or amueller or other, please have another look at this in terms of encoding nominals.

I am happy @jorisvandenbossche joined in on this, because I don't have much to say about it :)

@jnothman
Member

I feel very uncomfortable that we are blocking a release on these decisions, when all we need is a loader simply to facilitate examples. I wish we had resolved this much earlier.

There are a few of us who wish we could make dataframes and heterogeneous data first-class citizens, but it's a big change and we're not there yet. I look forward to being able to return a DataFrame (I don't think frame=True or fetch_openml_frame is a great imposition on the user), but I think returning an object array, or deciding whether or not to use object dtype on the basis of the presence of nominals, is a very poor solution in the interim, especially when applied to sparse matrices. Or we could return a dict of arrays, but this is not well facilitated by liac-arff, and not supported by ColumnTransformer; or a struct array, but these have issues with object dtypes, and are not supported by ColumnTransformer.

I think we can add a nominal_encoding={'ordinal', 'one-hot'} switch soon (where 'default' would be added when we support returning dataframes, and would use pd.Categorical), and perhaps a way of encoding strings too (but this is fraught; I'd rather just export them in dataframes).

I hope you come to agree with me that this is the best we can do for now in terms of compatibility and extensibility, while supporting our immediate needs. But please do raise alternatives (and required extra features before 0.20 final) if you have them.
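
Until such a nominal_encoding switch exists, here is a user-side sketch of one-hot encoding the ordinally coded nominal columns with existing tools; the dataset name and the column indices are illustrative assumptions, not taken from this PR.

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder

# 'credit-g' stands in for any dataset whose nominal features come back
# ordinally encoded; the indices in nominal_cols are assumptions.
bunch = fetch_openml('credit-g', version=1)
nominal_cols = [0, 2, 3]

encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(categories='auto', handle_unknown='ignore'),
      nominal_cols)],
    remainder='passthrough')
X_onehot = encoder.fit_transform(bunch.data)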

@jnothman
Member

In seeing the enormous difference between 'titanic' version 1 and 2, I think we need to require version when fetching by name...

@jnothman
Member

Instead we can have an issue after merge: Raise a warning along the lines of "Multiple dataset versions match the name BLAH. Versions may be very different. Getting version x."
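
A sketch of the kind of warning being proposed; the helper name and signature are hypothetical, not the merged code.

import warnings

def warn_multiple_versions(name, matching_versions, chosen_version):
    # Hypothetical helper: warn when a dataset name matches several versions.
    if len(matching_versions) > 1:
        warnings.warn(
            "Multiple dataset versions match the name %r: %s. Versions may be "
            "very different. Getting version %d."
            % (name, sorted(matching_versions), chosen_version))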

@jnothman
Member

I think this is acceptable as is, and I would like to merge this. Can I hear +1/-1 for merge?

I'll also prepare a deprecation of fetch_mldata based on this... I hope that's not hard to do.

@jnothman
Member

A problem: scikit-learn doesn't support string targets for multilabel data. So I can't get the yeast example to work... Do we therefore also need to encode targets, or only when there are multiple? :(

@rth
Member

rth commented Aug 15, 2018

I get a few DeprecationWarnings at import with Python 3.6,

In [1]: from sklearn.datasets import fetch_openml
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n        ,      ' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n        ,      ' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n        ,      ' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n        (?:^\\s*' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n        (?:^\\s*' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n        (?:^\\s*' (truncated)
  ''' % {'value_re': value_re})
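
For reference, these warnings come from Python 3.6 deprecating inline regex flags such as (?x) when they are not at the very start of the pattern; a minimal illustration of the usual fix (not the liac-arff code itself):

import re

# On Python 3.6, compiling a pattern whose (?x) flag is not at the very start,
# e.g. re.compile(r',\s*' + r'(?x) "[^"]*"'), emits this DeprecationWarning
# (newer Python versions reject such patterns outright).
# The usual fix is to drop the inline flag and pass it to re.compile instead:
pattern = re.compile(r',\s*"[^"]*"', re.VERBOSE)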

Of the datasets I tried that are currently used in examples: titanic v0 works, v1 raises an exception about an unsupported string field (as discussed above); "mnist_784", "shuttle", "iris" and "leukemia" work.

For "yeast" I get,

{...
 'target': array(['MIT', 'MIT', 'MIT', ..., 'ME2', 'NUC', 'CYT'], dtype=object)
 ...
}

with 
In [15]: np.unique(data['target'])
Out[15]: 
array(['CYT', 'ERL', 'EXC', 'ME1', 'ME2', 'ME3', 'MIT', 'NUC', 'POX',
       'VAC'], dtype=object)

is this just the first label?

Overall I share the sentiment of #11419 (comment): some things are currently not that great (e.g. the default nominal encoding, sparse dataset loading is very slow, etc.), but this is a good basis, it's marked as experimental, and we need it to fix the broken examples CI. We can improve various points in subsequent PRs after the examples are migrated, even if that means changing the API a bit. +1 for merge.
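
On the yeast question, a sketch of asking for several target columns explicitly, assuming the loader's target_column parameter accepts a list; the data_id and column names below are placeholders, not taken from this thread.

from sklearn.datasets import fetch_openml

# Placeholder id and label-column names for a multilabel variant of yeast;
# substitute the real ones from openml.org.
bunch = fetch_openml(data_id=40597, target_column=['Class1', 'Class2'])
Y = bunch.target    # one column per requested label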

@jnothman
Member

jnothman commented Aug 15, 2018 via email

@rth
Member

rth commented Aug 15, 2018

OK, the version parameter can really change the output completely...

Maybe leave the multilabel targets as they are, and do the manual encoding in the single example where we need them for now? I am also not so keen on doing default encoding if we can avoid it...
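
A sketch of that manual encoding, assuming Y_str is the object array of 'TRUE'/'FALSE' strings such a dataset currently yields.

import numpy as np

# Hypothetical multilabel targets as returned for a {FALSE, TRUE} nominal dataset.
Y_str = np.array([['TRUE', 'FALSE', 'TRUE'],
                  ['FALSE', 'FALSE', 'TRUE']], dtype=object)

# Manual {FALSE, TRUE} -> {0, 1} encoding, done in the example rather than the loader.
Y = (Y_str == 'TRUE').astype(int)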

@jnothman
Member

jnothman commented Aug 15, 2018 via email

@jnothman
Member

jnothman commented Aug 15, 2018 via email

@jnothman
Member

This, and examples based off it, are passing in all tested platforms. Merge?

@rth
Member

rth commented Aug 15, 2018

> This, and examples based off it, are passing in all tested platforms. Merge?

Great! Someone needs to press the green button.

> If the targets all have value {FALSE, TRUE} then it's fairly clear we should encode it.

Fair enough, also running some detection on targets is probably not too expensive.

> Should we error if no version is specified and there are multiple? Or simply warn?

+1 for warn (in subsequent PR)

@jnothman
Member

> Fair enough, also running some detection on targets is probably not too expensive.

It's negligible as the {FALSE, TRUE} information is in metadata.
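
A sketch of that metadata-based check; the only assumption is that the ARFF header provides, for each nominal column, its list of declared values.

def is_boolean_nominal(nominal_values):
    # True when a nominal attribute declares exactly the values FALSE/TRUE,
    # so its column can be mapped to {0, 1} without scanning the data.
    return set(v.upper() for v in nominal_values) == {'FALSE', 'TRUE'}

# Declared values as they would appear in an ARFF header.
assert is_boolean_nominal(['FALSE', 'TRUE'])
assert not is_boolean_nominal(['CYT', 'NUC', 'MIT'])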

@jnothman
Member

I'd like to squash and merge, but I don't know who GitHub will attribute the commit to! I'd like it to be @janvanrijn. I can go squash and merge manually...

@jnothman
Member

According to someone somewhere, "GitHub takes the info of the PR author". Let's do it.

@jnothman jnothman merged commit ab82f57 into scikit-learn:master Aug 15, 2018
scikit-learn 0.20 automation moved this from Blockers to Done Aug 15, 2018
@jnothman
Member

Thank you @janvanrijn!!

@rth
Member

rth commented Aug 15, 2018

Wonderful! Thanks @janvanrijn, and also @jnothman for doing major fixes on the fly in related projects!

@janvanrijn
Contributor Author

Thanks guys, it was a nice experience to work with you :)

See you all in Paris this year!
