[MRG] Memory usage of OpenML fetcher: use generator from arff #13312

Merged
adrinjalali merged 5 commits into scikit-learn:master from jorisvandenbossche:openml-generator on Feb 28, 2019

5 participants
@jorisvandenbossche
Contributor

jorisvandenbossche commented Feb 27, 2019

This reduces the memory usage of fetch_openml (for the case when arrays are returned, not for sparse matrices) by consuming the arff data as a generator instead of a list of lists.

My main question / concern is whether this 'NumberOfFeatures' / 'NumberOfInstances' metadata from OpenML is guaranteed to always be available.

This indirectly should fix #13287 (the hypothesis is that the doc building on Circle CI is failing because of a memory issue when running an example fetching data from OpenML).
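To illustrate the idea described above, here is a minimal sketch of filling a NumPy array directly from a row generator when the shape is known in advance. The `rows` generator is only a stand-in for whatever the ARFF parser yields, and the shape values are assumed to come from metadata; this is not necessarily the exact code in the PR.

```python
import itertools

import numpy as np


def rows():
    """Stand-in for the row generator an ARFF parser could yield."""
    for i in range(4):
        yield [float(i), float(i) * 2.0, float(i) * 3.0]


# Assumed to be known up front, e.g. from dataset metadata
n_samples, n_features = 4, 3

# Flatten the rows lazily and fill a preallocated 1-D array, then reshape.
# No intermediate list of lists is ever built.
flat = itertools.chain.from_iterable(rows())
X = np.fromiter(flat, dtype="float64", count=n_samples * n_features)
X = X.reshape(n_samples, n_features)
```

Passing `count` lets `np.fromiter` allocate the output once, which is why the number of samples and features has to be known before the data is consumed.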

@jorisvandenbossche

Contributor Author

jorisvandenbossche commented Feb 27, 2019

@janvanrijn @jnothman @rth any idea how to generate the needed test data?
I need a new response from the openml server (https://github.com/scikit-learn/scikit-learn/pull/13312/files#diff-ea672b15dd808c88257c58681d17bb6aR341), and so also need to generate those responses for the offline tests.

But running them 'online' makes the tests fail (as offline we use truncated versions), and it also does not seem to generate the responses in my local scikit-learn data home.

@janvanrijn

Contributor

janvanrijn commented Feb 27, 2019

The number of observations and features is usually calculated several minutes after a dataset is uploaded, provided we can parse the dataset on the server.

@jorisvandenbossche

Contributor Author

jorisvandenbossche commented Feb 27, 2019

@janvanrijn Thanks. So we can assume that this information will always be available then? (what happens if the data could not be parsed? then the dataset is also not available to download?)

Further, do you remember how you originally constructed the gzipped responses included in the tests/data?

@janvanrijn

Contributor

janvanrijn commented Feb 27, 2019

(on my phone so can't quote nicely)

There's a status field in the dataset description. If the status is 'active', we were able to parse the dataset on the server. (Alternatively, the status could be 'in_preparation' or 'deactivated'.)

Can you rephrase your second question? I downloaded them by hand from the OpenML server, removed some records and gzipped them using the Unix gzip command, but I guess that's not the answer you're looking for.

@jorisvandenbossche

Contributor Author

jorisvandenbossche commented Feb 27, 2019

> Can you rephrase your second question? I downloaded them by hand from the OpenML server, removed some records and gzipped them using the Unix gzip command, but I guess that's not the answer you're looking for.

I think that is exactly the answer I was looking for (I only hoped there would be an easier way :-))
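The manual procedure described above (download, drop some records, gzip) could be scripted roughly as follows. This is only a sketch: `truncate_arff` is a hypothetical helper, it assumes a simple dense ARFF text with no comment lines in the data section, and it does not cover the JSON responses.

```python
import gzip


def truncate_arff(text, n_records):
    """Hypothetical helper: keep the header and the first n_records data rows."""
    lines = text.splitlines()
    # Locate the @DATA marker (ARFF keywords are case-insensitive)
    data_idx = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    )
    header = lines[: data_idx + 1]
    records = [line for line in lines[data_idx + 1:] if line.strip()]
    return "\n".join(header + records[:n_records]) + "\n"


# Toy ARFF content, made up for illustration
arff_text = """@RELATION toy
@ATTRIBUTE a NUMERIC
@ATTRIBUTE b NUMERIC
@DATA
1,2
3,4
5,6
"""

# Compress the truncated text in memory; writing the bytes to a .gz file
# would give a result comparable to running gzip by hand.
compressed = gzip.compress(truncate_arff(arff_text, 2).encode("utf-8"))
```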

@jnothman

Member

jnothman commented Feb 28, 2019

@jorisvandenbossche

Contributor Author

jorisvandenbossche commented Feb 28, 2019

Yes, so there are certain datasets whose data can be downloaded but which, because of a processing error, might not have the "qualities" filled in (which is where we obtain the expected shape from).

Currently, we warn in this case and still return the data.
We could keep that behaviour by falling back to the old dense list of lists in such a case. It's not hard to do, but of course it adds some extra complexity to the code.

@jorisvandenbossche changed the title from "Memory usage of OpenML fetcher: use generator from arff" to "[MRG] Memory usage of OpenML fetcher: use generator from arff" Feb 28, 2019

@jorisvandenbossche

Contributor Author

jorisvandenbossche commented Feb 28, 2019

OK, I added a workaround to still process the data when the data qualities are not available, to keep the existing behaviour.

This should be ready to review now.
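A rough sketch of what such a fallback could look like, assuming a hypothetical helper `rows_to_array` (the name and signature are illustrative, not taken from the PR): when the shape is known, the generator is consumed directly into an array; when it is not, the rows are first materialized as a list of lists.

```python
import itertools

import numpy as np


def rows_to_array(rows, shape=None):
    """Hypothetical helper: build a 2-D float array from an iterable of rows."""
    if shape is None:
        # Fallback: shape unknown, so materialize every row first
        # (higher peak memory, but it keeps the old behaviour).
        return np.array(list(rows), dtype="float64")
    n_samples, n_features = shape
    # Memory-friendly path: stream the values into a preallocated array.
    flat = itertools.chain.from_iterable(rows)
    X = np.fromiter(flat, dtype="float64", count=n_samples * n_features)
    return X.reshape(n_samples, n_features)
```

Both paths should return the same array for well-formed data; only the peak memory use differs.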

@jorisvandenbossche jorisvandenbossche added this to To do in Sprint Paris 2019 via automation Feb 28, 2019

@jorisvandenbossche jorisvandenbossche moved this from To do to Needs review in Sprint Paris 2019 Feb 28, 2019

@jnothman
Member

jnothman left a comment

Otherwise LGTM

(I wonder if it's worth squeezing this and the arff update into 0.20.3)

        return None
    for d in data_qualities:
        if d['name'] == 'NumberOfFeatures':
            n_features = int(float(d['value']))

@jnothman

jnothman Feb 28, 2019

Member

Are we doing float because there's a . in the data or something??

@jorisvandenbossche

jorisvandenbossche Feb 28, 2019

Author Contributor

Yes ..
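For context, a small illustration of why the intermediate `float` is needed when the value is a string containing a decimal point. The literal `"784.0"` is an assumed example, not a value quoted from the OpenML response.

```python
value = "784.0"  # assumed example of a quality value stored with a decimal point

# int() does not accept a string with a decimal point ...
try:
    int(value)
    parsed_directly = True
except ValueError:
    parsed_directly = False

# ... so the string is parsed as a float first and then truncated to int.
n_features = int(float(value))
```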

    # the number of samples / features
    if data_qualities is None:
        return None
    for d in data_qualities:

@jnothman

jnothman Feb 28, 2019

Member

How about we make the code more readable/pythonic (?) by starting with data_qualities = {d['name']: d['value'] for d in data_qualities}?

@jorisvandenbossche

jorisvandenbossche Feb 28, 2019

Author Contributor

Ah, that's better! (it is just a very annoying way that the data is stored .. :-))
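A sketch of how the suggested dict comprehension could be used to look up the shape. The function name `get_data_shape`, the `KeyError` handling, and the sample values are assumptions for illustration; only the quality names 'NumberOfFeatures' and 'NumberOfInstances' come from the discussion.

```python
def get_data_shape(data_qualities):
    """Hypothetical helper: return (n_samples, n_features) or None."""
    if data_qualities is None:
        return None
    # Turn the list of {'name': ..., 'value': ...} records into a plain mapping
    qualities = {d["name"]: d["value"] for d in data_qualities}
    try:
        return (
            int(float(qualities["NumberOfInstances"])),
            int(float(qualities["NumberOfFeatures"])),
        )
    except KeyError:
        # One of the two qualities is missing: the shape is unknown
        return None


# Made-up sample in the list-of-records layout discussed above
example = [
    {"name": "NumberOfFeatures", "value": "5.0"},
    {"name": "NumberOfInstances", "value": "150.0"},
]
```

With the mapping in place, each quality is a single key lookup instead of a loop with `if` branches.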

if not return_sparse:
    data_qualities = _get_data_qualities(data_id, data_home)
    shape = _get_data_shape(data_qualities)
    # if the data qualities were not available, we cans still get the

@jnothman

jnothman Feb 28, 2019

Member

cans -> can

@jorisvandenbossche jorisvandenbossche requested a review from adrinjalali Feb 28, 2019

@adrinjalali
Member

adrinjalali left a comment

LGTM, thanks @jorisvandenbossche !

@jnothman I don't think it's a must to have this in 0.20.3, but it certainly doesn't hurt to have it.

@glemaitre
Contributor

glemaitre left a comment

LGTM


@adrinjalali adrinjalali merged commit 1f75ffa into scikit-learn:master Feb 28, 2019

10 checks passed

LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 94.73% of diff hit (target 92.55%)
codecov/project 92.57% (+0.01%) compared to afc6cc5
continuous-integration/travis-ci/pr The Travis CI build passed

Sprint Paris 2019 automation moved this from Needs review to Done Feb 28, 2019

@jorisvandenbossche jorisvandenbossche deleted the jorisvandenbossche:openml-generator branch Feb 28, 2019

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Feb 28, 2019

MNT Memory usage of OpenML fetcher: use generator from arff (scikit-learn#13312)

* Memory usage of OpenML fetcher: use generator from arff

* fix actually getting data qualities in all cases

* Add qualities responses

* add workaround for cases where data qualities are not available

* feedback joel

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Feb 28, 2019

jnothman added a commit that referenced this pull request Mar 1, 2019

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019
