
[MRG] Openml data loader #11419

Merged
merged 153 commits into scikit-learn:master on Aug 15, 2018

Conversation

janvanrijn
Contributor

@janvanrijn janvanrijn commented Jul 3, 2018

Reference Issues/PRs

Fixes #9543 Fixes #9908
This is the work I performed on issue #9543,
building upon the work that @amueller wants to merge in #9908

What does this implement/fix? Explain your changes.

Adds an OpenML data loader

Any other comments?

Builds upon the work of @amueller. The latest version of the liac-arff package is bundled in externals, and the unit tests are more extensive.

TO DO AFTER MERGE:

  • require version when fetching by name, or at least raise a warning along the lines of "Multiple dataset versions match the name BLAH. Versions may be very different. Getting version x."
  • Multilabel datasets which encode their targets as nominals do not work with scikit-learn estimators, since Y consists of strings. We should maybe identify multilabel data, at least when its values are all {FALSE, TRUE} and encode it to {0, 1}.
  • complete [MRG+1] Deprecate fetch_mldata #11466 by using fetch_openml in examples in place of fetch_mldata
  • new issue about caching decoded ARFF
  • new issue about checking dataset integrity with respect to stated number of instances, etc. or just use an MD5 check
  • resolve TODOs/FIXMEs
  • support return_X_y
  • add an option to one-hot encode nominals
  • add an option to ignore some features, or all STRING attributes
  • support returning dataframes (with what API?)
  • separate out a load_arff function?
  • use compressed HTTP requests as in https://stackoverflow.com/questions/3947120/does-python-urllib2-automatically-uncompress-gzip-data-fetched-from-webpage
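
For context, a minimal usage sketch of the loader added here; it uses only the fetch-by-name/version and fetch-by-id calls discussed in this PR, and treats the data_id for iris as an assumption.

from sklearn.datasets import fetch_openml

# Fetch by name and version; pinning the version avoids the ambiguity
# raised in the first TODO item above.
bunch = fetch_openml('iris', version=1)
X, y = bunch.data, bunch.target

# Fetching by OpenML dataset id is also possible (61 is assumed to be iris).
bunch_by_id = fetch_openml(data_id=61)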

@rth
Member

rth commented Aug 13, 2018

> rth or amueller or other, please have another look at this in terms of encoding nominals.

I am happy @jorisvandenbossche joined in on this, because I don't have much to say about it :)

@jnothman
Member

I feel very uncomfortable that we are blocking a release on these decisions, when all we need is a loader simply to facilitate examples. I wish we had resolved this much earlier.

There are a few of us who wish we could make dataframes and heterogeneous data first-class citizens, but it's a big change and we're not there yet. I look forward to being able to return a DataFrame (I don't think frame=True or fetch_openml_frame is a great imposition on the user), but I think returning an object array, or deciding whether or not to use object dtype on the basis of the presence of nominals, is a very poor solution in the interim, especially when applied to sparse matrices. Or we could return a dict of arrays, but this is not well facilitated by liac-arff, and not supported by ColumnTransformer; or a struct array, but these have issues with object dtypes, and are not supported by ColumnTransformer.

I think we can add a nominal_encoding={'ordinal', 'one-hot'} switch soon (where 'default' would be added when we support returning dataframes, and would use pd.Categorical), and perhaps a way of encoding strings too (but this is fraught; I'd rather just export them in dataframes).

I hope you come to agree with me that this is the best we can do for now in terms of compatibility and extensibility, while supporting our immediate needs. But please do raise alternatives (and required extra features before 0.20 final) if you have them.
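
Until such a nominal_encoding switch exists, here is a user-side sketch of one-hot encoding the ordinally coded nominal columns with existing tools; the dataset name and the column indices are illustrative assumptions, not taken from this PR.

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder

# 'credit-g' stands in for any dataset whose nominal features come back
# ordinally encoded; the indices in nominal_cols are assumptions.
bunch = fetch_openml('credit-g', version=1)
nominal_cols = [0, 2, 3]

encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(categories='auto', handle_unknown='ignore'),
      nominal_cols)],
    remainder='passthrough')
X_onehot = encoder.fit_transform(bunch.data)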

@jnothman
Member

In seeing the enormous difference between 'titanic' version 1 and 2, I think we need to require version when fetching by name...

@jnothman
Member

Instead we can have an issue after merge: Raise a warning along the lines of "Multiple dataset versions match the name BLAH. Versions may be very different. Getting version x."
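
A sketch of the kind of warning being proposed; the helper name and signature are hypothetical, not the merged code.

import warnings

def warn_multiple_versions(name, matching_versions, chosen_version):
    # Hypothetical helper: warn when a dataset name matches several versions.
    if len(matching_versions) > 1:
        warnings.warn(
            "Multiple dataset versions match the name %r: %s. Versions may be "
            "very different. Getting version %d."
            % (name, sorted(matching_versions), chosen_version))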

@jnothman
Member

I think this is acceptable as is, and I would like to merge this. Can I hear +1/-1 for merge?

I'll also prepare a deprecation of fetch_mldata based on this... I hope that's not hard to do.

@jnothman
Member

A problem: scikit-learn doesn't support string targets for multilabel data. So I can't get the yeast example to work... Do we therefore also need to encode targets, or only when there are multiple? :(

@rth
Member

rth commented Aug 15, 2018

I get a few DeprecationWarnings at import with Python 3.6,

In [1]: from sklearn.datasets import fetch_openml
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n        ,      ' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n        ,      ' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n        ,      ' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n        (?:^\\s*' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n        (?:^\\s*' (truncated)
  ''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n        (?:^\\s*' (truncated)
  ''' % {'value_re': value_re})
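
For reference, these warnings come from Python 3.6 deprecating inline regex flags such as (?x) when they are not at the very start of the pattern; a minimal illustration of the usual fix (not the liac-arff code itself):

import re

# On Python 3.6, compiling a pattern whose (?x) flag is not at the very start,
# e.g. re.compile(r',\s*' + r'(?x) "[^"]*"'), emits this DeprecationWarning
# (newer Python versions reject such patterns outright).
# The usual fix is to drop the inline flag and pass it to re.compile instead:
pattern = re.compile(r',\s*"[^"]*"', re.VERBOSE)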

Of the datasets I tried that are currently used in examples: titanic v0 works, v1 raises an exception about an unsupported string field (as discussed above); "mnist_784", "shuttle", "iris" and "leukemia" work.

For "yeast" I get,

{...
 'target': array(['MIT', 'MIT', 'MIT', ..., 'ME2', 'NUC', 'CYT'], dtype=object)
 ...
}

with 
In [15]: np.unique(data['target'])
Out[15]: 
array(['CYT', 'ERL', 'EXC', 'ME1', 'ME2', 'ME3', 'MIT', 'NUC', 'POX',
       'VAC'], dtype=object)

is this just the first label?

Overall I share the sentiment of #11419 (comment): some things are currently not that great (e.g. the default nominal encoding, sparse dataset loading is very slow, etc.), but this is a good basis, it's marked as experimental, and we need it to fix the broken examples CI. We can improve various points in subsequent PRs after the examples are migrated, even if that means changing the API a bit. +1 for merge.
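
On the yeast question, a sketch of asking for several target columns explicitly, assuming the loader's target_column parameter accepts a list; the data_id and column names below are placeholders, not taken from this thread.

from sklearn.datasets import fetch_openml

# Placeholder id and label-column names for a multilabel variant of yeast;
# substitute the real ones from openml.org.
bunch = fetch_openml(data_id=40597, target_column=['Class1', 'Class2'])
Y = bunch.target    # one column per requested label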

@jnothman
Member

jnothman commented Aug 15, 2018 via email

@rth
Member

rth commented Aug 15, 2018

OK, the version parameter can really change the output completely...

Maybe leave the multilabel targets as they are, and do the manual encoding in the single example where we need them for now? I am also not so keen on doing default encoding if we can avoid it...
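
A sketch of that manual encoding, assuming Y_str is the object array of 'TRUE'/'FALSE' strings such a dataset currently yields.

import numpy as np

# Hypothetical multilabel targets as returned for a {FALSE, TRUE} nominal dataset.
Y_str = np.array([['TRUE', 'FALSE', 'TRUE'],
                  ['FALSE', 'FALSE', 'TRUE']], dtype=object)

# Manual {FALSE, TRUE} -> {0, 1} encoding, done in the example rather than the loader.
Y = (Y_str == 'TRUE').astype(int)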

@jnothman
Member

jnothman commented Aug 15, 2018 via email

@jnothman
Member

jnothman commented Aug 15, 2018 via email

@jnothman
Member

This, and examples based off it, are passing in all tested platforms. Merge?

@rth
Member

rth commented Aug 15, 2018

> This, and examples based off it, are passing in all tested platforms. Merge?

Great! Someone needs to press the green button.

> If the targets all have value {FALSE, TRUE} then it's fairly clear we should encode it.

Fair enough, also running some detection on targets is probably not too expensive.

> Should we error if no version is specified and there are multiple? Or simply warn?

+1 for warn (in subsequent PR)

@jnothman
Member

> Fair enough, also running some detection on targets is probably not too expensive.

It's negligible as the {FALSE, TRUE} information is in metadata.
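
A sketch of that metadata-based check; the only assumption is that the ARFF header provides, for each nominal column, its list of declared values.

def is_boolean_nominal(nominal_values):
    # True when a nominal attribute declares exactly the values FALSE/TRUE,
    # so its column can be mapped to {0, 1} without scanning the data.
    return set(v.upper() for v in nominal_values) == {'FALSE', 'TRUE'}

# Declared values as they would appear in an ARFF header.
assert is_boolean_nominal(['FALSE', 'TRUE'])
assert not is_boolean_nominal(['CYT', 'NUC', 'MIT'])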

@jnothman
Member

I'd like to squash and merge, but I don't know who GitHub will attribute the commit to! I'd like it to be @janvanrijn. I can go squash and merge manually...

@jnothman
Member

According to someone somewhere, "GitHub takes the info of the PR author". Let's do it.

@jnothman jnothman merged commit ab82f57 into scikit-learn:master Aug 15, 2018
scikit-learn 0.20 automation moved this from Blockers to Done Aug 15, 2018
@jnothman
Member

Thank you @janvanrijn!!

@rth
Member

rth commented Aug 15, 2018

Wonderful! Thanks @janvanrijn, and also @jnothman for doing major fixes on the fly in related projects!

@janvanrijn
Contributor Author

Thanks guys, it was a nice experience to work with you :)

See you all in Paris this year!
