[MRG] Openml loader #9908

Closed
wants to merge 32 commits into base: master

Conversation

10 participants
@amueller
Member

amueller commented Oct 11, 2017

Fixes #9543.

Todo:

  • docstrings
  • userguide
  • tests
  • handle data set versions
  • caching
@lesteve
Member

lesteve commented Oct 11, 2017

Looking at the flake8 errors, there seem to be genuine problems, e.g.:

./sklearn/datasets/openml.py:81:15: F821 undefined name 'urlretrieve'
    json_dl = urlretrieve(json_loc.format(name))[0]
              ^

Not familiar at all with openml; do you have any inputs you can suggest for name_or_id if we want to play with the fetcher?
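A minimal sketch of the fix for the F821 above: `urlretrieve` moved between Python 2 and Python 3, so it needs a version-aware import (at the time, scikit-learn typically handled this with a try/except or its vendored `six`); the try/except form is an illustration, not necessarily what the PR ended up using:

```python
try:
    # Python 3: urlretrieve lives in urllib.request
    from urllib.request import urlretrieve
except ImportError:
    # Python 2: urlretrieve lived directly in urllib
    from urllib import urlretrieve

print(callable(urlretrieve))  # True
```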

@amueller

Member

amueller commented Oct 11, 2017

this is not ready for review yet ;) I wanted to share in case @vrishank97 was working on it. Give me maybe 3 more hours; I'm running between meetings and sprints.

@amueller

Member

amueller commented Oct 11, 2017

you can try "iris" for name_or_id

@amueller amueller changed the title from Openml loader to [WIP] Openml loader Oct 11, 2017

@amueller

Member

amueller commented Oct 12, 2017

hm, not sure how to catch the error on python2. It seems there are different errors thrown on python2 and python3.
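One common way around the Python 2/3 split mentioned above: the relevant exception class moved from `urllib2` to `urllib.error`, but it can be imported conditionally and then caught under a single name. `fetch_or_none` and the 404-means-missing policy here are illustrative, not the PR's actual code:

```python
try:
    from urllib.error import HTTPError  # Python 3
except ImportError:
    from urllib2 import HTTPError       # Python 2

def fetch_or_none(url, opener):
    """Call `opener(url)`, treating a 404 as 'dataset not found'."""
    try:
        return opener(url)
    except HTTPError as err:
        if err.code == 404:
            return None  # soft failure: the dataset does not exist
        raise            # any other HTTP error is a real problem

# Demo with a fake opener that always raises a 404.
def fake_opener(url):
    raise HTTPError(url, 404, "Not Found", None, None)

print(fetch_or_none("http://example/api", fake_opener))  # None
```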

@amueller

Member

amueller commented Oct 12, 2017

Hm, I thought by default returning an active dataset might be a good idea, but the caching kind of interferes with that. Either way, the caching might interfere with warning people about deactivated datasets. Not sure what to do about it. We could invalidate the cache after a certain amount of time, or only use it if there's no internet connection (at least for the metadata, which can change).

@amueller

Member

amueller commented Oct 12, 2017

The handling of versions will be much easier if there's another filter added to openml, which will hopefully be done today or tomorrow.

@vrishank97

Contributor

vrishank97 commented Oct 12, 2017

I've also made some progress on this issue; should I make a separate PR or work on this one? And what are the benefits of using arff over csv with json?

@amueller

Member

amueller commented Oct 12, 2017

Sure, can you send a PR so we can merge efforts? It would have been great if you'd sent a PR earlier so we had less duplicate work. I'd really like to get this into some reasonable state this week.
arff possibly has more meta-information; maybe not really necessary.

@amueller

Member

amueller commented Oct 12, 2017

I'm not entirely sure what to do for tests. We have done some mocking for mldata, but here the requests are somewhat more complex. I could do the mocking for all of them if you think that's necessary.

@amueller

Member

amueller commented Oct 12, 2017

I'm also not sure what would be the best datatype to return for mixed data. If there are strings in there, we can either use a recarray, cast everything to string, or make it object. By default things become strings, which doesn't seem great :-/

@albertcthomas

Contributor

albertcthomas commented Oct 12, 2017

I don't know if that helps but for fetch_kddcup99, where data also contains strings, we have dtype=object.

from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99()
dataset.data
array([[0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       ...,
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0]], dtype=object)
@amueller

Member

amueller commented Oct 12, 2017

thanks @albertcthomas. The question was more how to detect that easily, but I could pull it out of the arff data.
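A hypothetical sketch of the detection amueller mentions: deciding the output dtype from the ARFF attribute declarations rather than inspecting the values. ARFF attribute types are NUMERIC/REAL/INTEGER, nominal (declared as a list of values), STRING, or DATE; `choose_dtype` is a made-up helper name, not code from this PR:

```python
def choose_dtype(attributes):
    """attributes: list of (name, type) pairs, in the shape liac-arff
    returns them: type is a string for NUMERIC/REAL/INTEGER/STRING/DATE,
    or a list of values for a nominal attribute."""
    numeric = {"NUMERIC", "REAL", "INTEGER"}
    if all(isinstance(t, str) and t.upper() in numeric
           for _, t in attributes):
        return "float64"   # purely numeric data -> plain float array
    return "object"        # any nominal/string/date column -> object array

print(choose_dtype([("a", "NUMERIC"), ("b", "REAL")]))         # float64
print(choose_dtype([("a", "NUMERIC"), ("c", ["yes", "no"])]))  # object
```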

@amueller

Member

amueller commented Oct 12, 2017

@lesteve now you can have a look ;) No tests yet, though, and the version stuff is not supported yet. Will bug Jan to make it easier.

@amueller

Member

amueller commented Oct 12, 2017

maybe miceprotein is not good and I should use adult instead... (miceprotein is trivial to classify)

@jnothman
Member

jnothman commented Oct 15, 2017

@amueller

Member

amueller commented Oct 17, 2017

@jnothman this is actually what the openml-python lib does. Right now I opted to convert to object dtype here. I like the dataset fetcher not to do any preprocessing, as that better reflects what people will encounter in "the real world".

@jnothman jnothman added the Blocker label Feb 19, 2018

@jnothman

Member

jnothman commented Feb 19, 2018

We need to support reading sparse ARFF-formatted CSV... or reject sparse datasets.
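For context, ARFF's sparse encoding writes each data row as {index value} pairs for the non-zero entries only. A minimal hand-rolled expansion of one such row (real code would use a proper ARFF reader such as liac-arff; this helper is illustrative only):

```python
def parse_sparse_arff_row(row, n_attributes):
    """Expand a sparse ARFF data row like '{0 1.5, 3 2}' to a dense list.

    Entries not listed in the row are implicitly zero per the ARFF spec.
    """
    dense = [0.0] * n_attributes
    body = row.strip().lstrip("{").rstrip("}")
    if body:
        for pair in body.split(","):
            idx, val = pair.split()
            dense[int(idx)] = float(val)
    return dense

print(parse_sparse_arff_row("{0 1.5, 3 2}", 5))  # [1.5, 0.0, 0.0, 2.0, 0.0]
```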

@jnothman

Member

jnothman commented Feb 19, 2018

Given that I have reasons to doubt numpy.genfromtxt and reasons to doubt openml's CSV provisions, I'm strongly in favour of investigating the quality of liac-arff and vendoring it as a separately maintained and tested reader of OpenML's most supported format.

@joaquinvanschoren

joaquinvanschoren commented Feb 19, 2018

To be honest I'm not sure how quickly we'll have that sparse CSV export implemented (or whether you can even read it without adding extra dependencies?).

I'd go ahead and either:

  • Go the CSV route without supporting sparse data for now. It's a simple check and we can add support later.
  • Go the ARFF route with liac_arff.
@jnothman

Member

jnothman commented Feb 19, 2018

Given all the issues with np.genfromtxt, I'd rather steer clear of generic CSV handling.

@jnothman

Member

jnothman commented Apr 19, 2018

@amueller can we get someone else to finish this?

@amueller

Member

amueller commented Jun 4, 2018

@jnothman so you suggest adding a dependency on liac_arff? And you wanted to vendor it? Why vendor? @raghavrv had started a cython arff reader, but I don't think he finished it. It might be interesting?

@jnothman
Member

jnothman commented Jun 5, 2018

@joaquinvanschoren

joaquinvanschoren commented Jun 5, 2018

@mfeurer: you have been working a lot with/on liac_arff. What is your opinion?

@jnothman

Member

jnothman commented Jun 14, 2018

Unless I am much mistaken, I think we can solve this for now with liac-arff, and support the sparse and dense cases seamlessly. As long as we also exploit any typing information provided by openml as metadata, I would think we can make the interface quite stable, and the fact that we're using liac-arff internally will remain an implementation detail. I'm keen to have this finished and released.

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2018

@mfeurer

Contributor

mfeurer commented Jun 14, 2018

Sorry @joaquinvanschoren, I didn't receive an email for this. Could you please let me know what exactly you want my opinion on? A generic answer: arff has a very bad format specification and many undocumented edge cases. To the best of my knowledge, liac-arff has better support for those than the scipy arff reader, but is also a lot slower. It's not actively developed, and I try to devote some time to maintaining it. If liac-arff gets broader attention through its usage in scikit-learn, that would be amazing. @jnothman do you want to integrate liac-arff the same way you're using joblib?

@jnothman

Member

jnothman commented Jun 14, 2018

@jorisvandenbossche

Contributor

jorisvandenbossche commented Jun 14, 2018

It was probably discussed above, but why not simply download and read in the csv file?

@jorisvandenbossche

Contributor

jorisvandenbossche commented Jun 14, 2018

Ah, OK, reading for myself :-) So the problem with that is the sparse datasets.

@jnothman

Member

jnothman commented Jun 14, 2018

@jorisvandenbossche

Contributor

jorisvandenbossche commented Jun 14, 2018

Regardless of how it is read in, IMO not returning pandas DataFrames for the dense datasets from this functionality would make it a lot less functional ... (but that's probably for another discussion ;-))

@jorisvandenbossche

Contributor

jorisvandenbossche commented Jun 14, 2018

BTW, I wanted to give this another try (although it will probably need to be reworked quite a bit to use liac-arff), but the current version clearly has a bug in that it simply drops non-numeric columns in the .data array.
For example with mice = fetch_openml('miceprotein', version=4), mice.data is missing 4 columns (and there are 4 nominal features).

One of the problems with returning an object array in case of mixed dtypes is that if you then simply convert it to a dataframe with pd.DataFrame(mice.data, columns=mice.feature_names), you still have object dtype. E.g.:

In [24]: pd.DataFrame(np.array([[1, 'a'], [2, 'b']], dtype=object)).dtypes
Out[24]: 
0    object
1    object
dtype: object

You would then need to do .infer_objects() to convert the numerical features to actual numerical columns, which is a bit annoying for the user to have to do (and also inefficient).

I understand relying fully on pandas here is not fully desirable, but would it be possible to add an option to return a DataFrame?
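A small sketch of the workaround described above: building a DataFrame from a mixed-dtype object array, then using infer_objects() to recover the numeric columns. The column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Mixed-dtype object array, as the fetcher would return it.
data = np.array([[1, "a"], [2, "b"]], dtype=object)
df = pd.DataFrame(data, columns=["x", "label"])

# Both columns start out as object dtype...
print(df.dtypes.tolist())                  # [dtype('O'), dtype('O')]

# ...and infer_objects() soft-converts the numeric one back to int64.
print(df.infer_objects().dtypes.tolist())  # [dtype('int64'), dtype('O')]
```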

@jnothman

Member

jnothman commented Jun 14, 2018

@amueller

Member

amueller commented Jun 20, 2018

OK, fine with liac-arff vendoring; I'll let @janvanrijn take this over.

@jnothman

Member

jnothman commented Jun 20, 2018

@ogrisel ogrisel added this to PRs tagged in scikit-learn 0.20 Jul 16, 2018

@amueller amueller moved this from PRs tagged to Blockers in scikit-learn 0.20 Jul 16, 2018

@jorisvandenbossche jorisvandenbossche removed this from Blockers in scikit-learn 0.20 Jul 16, 2018

@qinhanmin2014

Member

qinhanmin2014 commented Jul 21, 2018

Closed for #11419
