ENH buffer openml stream rather than reading all at once #16084
Conversation
Quick memory profiling on master:
[memory profile plot: master]
This PR:
[memory profile plot: this PR]

Thanks @thomasjpfan, though that's not the relevant portion, since now the content isn't even buffered until ...

[memory profile plots: master, this branch]

On master and PR:
[memory profile plots: master, this PR]
This PR works as expected.
I am happy with this. All it needs is a what's new entry in 0.23 tagged Efficiency.
sklearn/datasets/_openml.py (outdated)
return_type = _arff.DENSE_GEN

arff_file = _arff.load((line.decode('utf-8')
                        for line in response),
@shashanksingh28 if this PR is merged before #14800, I'd suggest just making a _check_md5 helper which takes an iterable of bytes as input and generates an iterable of bytes, checking the md5 on the way.
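For illustration, a minimal sketch of such a pass-through checker (hypothetical names; where the expected checksum comes from is an assumption, not part of this PR):

import hashlib

def _check_md5(byte_chunks, expected_md5):
    # Yield the chunks unchanged while accumulating an md5 digest on the way.
    md5 = hashlib.md5()
    for chunk in byte_chunks:
        md5.update(chunk)
        yield chunk
    # Only once the stream has been fully consumed can the digest be verified.
    if md5.hexdigest() != expected_md5:
        raise ValueError("md5 checksum mismatch for the downloaded file")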
Sounds good. I would rather wait for this to go in first and do the md5 check after...
sklearn/datasets/_openml.py (outdated)
@@ -784,6 +789,8 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
    elif y.shape[1] == 0:
        y = None

    fp.close()  # explicitly close HTTP connection after parsing
If any of the parsing fails, the connection would remain open. Maybe we can move the _convert_arff_data_dataframe and _convert_arff_data logic up:
fp, arff = _download_data_arff(data_description['file_id'], return_sparse, ...)
nominal_attributes = None
frame = None
with closing(fp):
    if as_frame:
        columns = data_columns + target_columns
        frame = _convert_arff_data_dataframe(arff, columns, features_dict)
    else:
        X, y = _convert_arff_data(arff['data'], col_slice_x, ...)

if as_frame:
    ...
else:
    ...
An alternative would be to indent the whole parsing logic.
data_downloaded = np.array(list(data_arff['data']), dtype='O')
fp.close()
So the connection closes even when np.array(...) fails:
with closing(fp):
    data_downloaded = np.array(list(data_arff['data']), dtype='O')
hmmm good points. I also need to review how this interacts with the caching.

It looks like it should work okay with caching actually.

I've realised that the progressive loading of the file also breaks some of the usefulness of ...
We most likely need to refactor a little to get something like:

def fetch_openml(...):
    ...
    if as_frame:
        parse_arff = partial(_convert_arff_data_dataframe, columns=columns,
                             features_dict=features_dict)
    else:
        parse_arff = partial(_convert_arff_data, col_slice_x=..., ...)
    result = _download_and_parse_data_arff(..., parse_arff)
    ...

def _download_and_parse_data_arff(file_id, sparse, data_home, as_frame, parse_arff):
    url = _DATA_FILE.format(file_id)

    @_retry_with_clean_cache(url, data_home)
    def _download_parse_inner():
        arff_file = _download_data_arff(...)
        if as_frame:
            return parse_arff(arff_file)
        else:
            return parse_arff(arff_file['data'])

    return _download_parse_inner()

With this type of refactor ...
Here's my little refactor that ensures the failure will be retried with appropriate scope. Not that the retry business is tested (should I bother??)

Another reviewer would be very welcome here!
The retry logic is independently tested, but not together with ...

@NicolasHug Most of the diff of this PR is moving code around to allow the stream to be closed with a context manager. This is done to accommodate ...
Looks good, minor concern about the retry logic now
doc/whats_new/v0.23.rst (outdated)
@@ -58,6 +58,10 @@ Changelog
  :func:`datasets.make_moons` now accept two-element tuple.
  :pr:`15707` by :user:`Maciej J Mikulski <mjmikulski>`.

- |Efficiency| :func:`datasets.fetch_openml` no longer stores the full dataset
Suggested change (replace):
    - |Efficiency| :func:`datasets.fetch_openml` no longer stores the full dataset
with:
    - |Efficiency| :func:`datasets.fetch_openml` has reduced memory usage because it no longer stores the full dataset
Addressed in latest commit
sklearn/datasets/_openml.py (outdated)
# Note that if the data is dense, no reading is done until the data
# generator is iterated.
I'm not sure how to interpret this comment
Do you mean e.g. "Since we pass a generator, load() will read lines one by one and avoid excessive memory usage" ?
That'd be a useful comment IMHO
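For what it's worth, a self-contained toy example of the behaviour that comment would describe, where a generator of decoded lines is passed so nothing is buffered up front (io.BytesIO stands in for the HTTP response; this assumes the vendored sklearn.externals._arff module behaves like upstream liac-arff):

import io
from sklearn.externals import _arff

raw = (b"@RELATION demo\n"
       b"@ATTRIBUTE x NUMERIC\n"
       b"@ATTRIBUTE y NUMERIC\n"
       b"@DATA\n"
       b"1,2\n"
       b"3,4\n")
response = io.BytesIO(raw)  # stand-in for the urlopen response

# Only the header is read here; with DENSE_GEN the data rows are produced lazily.
arff = _arff.load((line.decode('utf-8') for line in response),
                  return_type=_arff.DENSE_GEN)
for row in arff['data']:  # rows are decoded one by one as we iterate
    print(row)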
sklearn/datasets/_openml.py (outdated)
col_slice_y = [int(features_dict[col_name]['index'])
               for col_name in target_columns]

col_slice_x = [int(features_dict[col_name]['index'])
               for col_name in data_columns]
for col_idx in col_slice_y:
    feat = features_list[col_idx]
    nr_missing = int(feat['number_of_missing_values'])
    if nr_missing > 0:
        raise ValueError('Target column {} has {} missing values. '
                         'Missing values are not supported for target '
                         'columns. '.format(feat['name'], nr_missing))
The column validation is moved into this function, which is decorated by _retry_with_clean_cache. So the ValueError raised here will cause the decorator to issue warn("Invalid cache, redownloading file", RuntimeWarning) and re-run the function after clearing the cache, which is not necessary.
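For reference, a rough sketch of how this kind of retry-with-clean-cache decorator behaves (pieced together from the diff discussed below for illustration, not the exact scikit-learn code; the cache-path layout is an assumption):

import os
from functools import wraps
from urllib.error import HTTPError
from warnings import warn

def _retry_with_clean_cache(openml_path, data_home):
    # Retry the wrapped function once after deleting the cached file.
    # HTTPErrors are re-raised immediately; any other exception is treated
    # as a possibly corrupted cache, so the local file is removed and the
    # function is called again, triggering a fresh download.
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kw):
            if data_home is None:
                return f(*args, **kw)
            try:
                return f(*args, **kw)
            except HTTPError:
                raise
            except Exception:
                warn("Invalid cache, redownloading file", RuntimeWarning)
                local_path = os.path.join(data_home, openml_path)  # assumed layout
                if os.path.exists(local_path):
                    os.unlink(local_path)
                return f(*args, **kw)
        return wrapper
    return decorator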
This is a fair point. I'll try to get some user-input validation out of the retry.
Thanks for the review @NicolasHug. I've tried to address your comments, but haven't been able to test locally thanks to an OS update breaking my conda build...
sklearn/datasets/_openml.py (outdated)
@@ -51,6 +51,8 @@ def wrapper(*args, **kw):
            return f(*args, **kw)
        except HTTPError:
            raise
        except ValueError:
except (HTTPError, ValueError):
    raise
Should we also pass through ArffException as well?
Should we also pass through ArffException as well?
No, because they may be due to data corruption.
That was my concern about not letting ValueError through too...
There are some ValueErrors in _arff.py as well. In principle, we only want to retry when there is a parsing error or an error from _open_openml_url, which is scoped to:
def _load_arff(...):
    response = _open_openml_url(url, data_home)
    with closing(response):
        arff = _arff.load((line.decode('utf-8')
                           for line in response),
                          return_type=return_type,
                          encode_nominal=not as_frame)
    return arff
Can we place this in its own function and use the retry wrapper on this new function?
No, that code only reads the headers and returns a generator, which is the basis of this change. The parsing errors will actually occur when that generator is iterated during conversion to arrays.
If the ValueErrors in _arff.py are not possible to raise due to data corruption, then the current solution is fine
The only ValueError I see from corruption may be:

scikit-learn/sklearn/externals/_arff.py, lines 243 to 249 in fa9cf22:
def _escape_sub_callback(match):
    s = match.group()
    if len(s) == 2:
        try:
            return _ESCAPE_SUB_MAP[s]
        except KeyError:
            raise ValueError('Unsupported escape sequence: %s' % s)
On a side note, we can define _load_arff as follows:
def _load_arff(..., as_frame):
    response = _open_openml_url(url, data_home)
    with closing(response):
        arff = _arff.load((line.decode('utf-8')
                           for line in response),
                          return_type=return_type,
                          encode_nominal=not as_frame)
        if as_frame:
            return _convert_arff_data_dataframe(arff, columns, features_dict)
        else:
            return _convert_arff_data(arff['data'], col_slice_x,
                                      col_slice_y, shape)
I'm not entirely sure what you're suggesting... The loader needs to also return metadata from arff, not just data. That's why it's all processed here, but you're right that it might be possible to simplify.
To be concrete, I was thinking of this: jnothman#8
PR updated such that only errors from the downloading and parsing will trigger a redownload.
still LGTM
Perhaps we can get another review here? Not essential, but a nice memory boost for fetch_openml, and unblocking work on the checksum PR.

As an aside, this may help resolve #16629 and allow us to turn back on the memory profiler for gallery examples.
Thanks @jnothman! LGTM, after a somewhat superficial review. I think our openml fetcher code is non-trivial and sometimes difficult to follow. I wonder if type annotations could help somewhat with readability, and whether we can move part of this logic upstream.
Maybe I should have synced master, but CI was green so I merged it. Looking into a fix in #17047
Great! Thanks!
I've not benchmarked yet. This should reduce memory requirements when fetching from OpenML.
This no longer explicitly closes the open URL handler, since it is required until the ARFF has been completely read. We could return the stream object from _download_data_arff if we want to close it.
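For illustration, a hypothetical sketch of that option: have the download helper return the raw HTTP response alongside the lazily parsed ARFF so the caller can close the stream once the data generator has been consumed. The signature and the COO/DENSE_GEN selection are assumptions here; _DATA_FILE, _open_openml_url and _arff are taken to be the module-level helpers already used in _openml.py.

def _download_data_arff(file_id, sparse, data_home, encode_nominal=True):
    url = _DATA_FILE.format(file_id)
    response = _open_openml_url(url, data_home)
    # Sparse data must be materialised; dense data can be streamed row by row.
    return_type = _arff.COO if sparse else _arff.DENSE_GEN
    arff = _arff.load((line.decode('utf-8') for line in response),
                      return_type=return_type,
                      encode_nominal=encode_nominal)
    # The caller is responsible for closing `response` once arff['data']
    # has been fully consumed.
    return response, arff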