ENH buffer openml stream rather than reading all at once #16084
Conversation
Quick memory profiling on master:
[memory profile plot: master]
This PR:
[memory profile plot: this PR]

Thanks @thomasjpfan, though that's not the relevant portion, since now the content isn't even buffered until ...

[memory profile plots: master, this branch]

On master and PR:
[memory profile plots: master, this PR]
This PR works as expected.
I am happy with this. All it needs is a what's new entry in 0.23 tagged Efficiency.
sklearn/datasets/_openml.py (outdated)
return_type = _arff.DENSE_GEN

arff_file = _arff.load((line.decode('utf-8')
                        for line in response),
@shashanksingh28 if this PR is merged before #14800, I'd suggest just making a _check_md5 helper which takes an iterable of bytes as input and generates an iterable of bytes, checking the md5 on the way.
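For illustration, a minimal sketch of such a pass-through checker (hypothetical names; where the expected checksum comes from is an assumption, not part of this PR):

import hashlib

def _check_md5(byte_chunks, expected_md5):
    # Yield the chunks unchanged while accumulating an md5 digest on the way.
    md5 = hashlib.md5()
    for chunk in byte_chunks:
        md5.update(chunk)
        yield chunk
    # Only once the stream has been fully consumed can the digest be verified.
    if md5.hexdigest() != expected_md5:
        raise ValueError("md5 checksum mismatch for the downloaded file")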
Sounds good. I would rather wait for this to go in first and do the md5 check after...
sklearn/datasets/_openml.py (outdated)
@@ -784,6 +789,8 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
    elif y.shape[1] == 0:
        y = None

    fp.close()  # explicitly close HTTP connection after parsing
If any of the parsing fails, the connection would remain open. Maybe we can move the _convert_arff_data_dataframe and _convert_arff_data logic up:
fp, arff = _download_data_arff(data_description['file_id'], return_sparse, ...)
nominal_attributes = None
frame = None
with closing(fp):
    if as_frame:
        columns = data_columns + target_columns
        frame = _convert_arff_data_dataframe(arff, columns, features_dict)
    else:
        X, y = _convert_arff_data(arff['data'], col_slice_x, ...)

if as_frame:
    ...
else:
    ...
An alternative would be to indent the whole parsing logic.
data_downloaded = np.array(list(data_arff['data']), dtype='O')
fp.close()
So the connection closes even when np.array(...) fails:
with closing(fp):
    data_downloaded = np.array(list(data_arff['data']), dtype='O')
hmmm good points. I also need to review how this interacts with the caching.

It looks like it should work okay with caching actually.

I've realised that the progressive loading of the file also breaks some of the usefulness of ...
We most likely need to refactor a little to get something like:

def fetch_openml(...):
    ...
    if as_frame:
        parse_arff = partial(_convert_arff_data_dataframe, columns=columns,
                             features_dict=features_dict)
    else:
        parse_arff = partial(_convert_arff_data, col_slice_x=..., ...)
    result = _download_and_parse_data_arff(..., parse_arff)
    ...

def _download_and_parse_data_arff(file_id, sparse, data_home, as_frame, parse_arff):
    url = _DATA_FILE.format(file_id)

    @_retry_with_clean_cache(url, data_home)
    def _download_parse_inner():
        arff_file = _download_data_arff(...)
        if as_frame:
            return parse_arff(arff_file)
        else:
            return parse_arff(arff_file['data'])

    return _download_parse_inner()

With this type of refactor ...
Here's my little refactor that ensures the failure will be retried with appropriate scope. Not that the retry business is tested (should I bother??)

Another reviewer would be very welcome here!
The retry logic is independently tested, but not together with ...

@NicolasHug Most of the diff of this PR is moving code around to allow the stream to be closed with a context manager. This is done to accommodate ...
Looks good, minor concern about the retry logic now
doc/whats_new/v0.23.rst (outdated)
@@ -58,6 +58,10 @@ Changelog
  :func:`datasets.make_moons` now accept two-element tuple.
  :pr:`15707` by :user:`Maciej J Mikulski <mjmikulski>`.

- |Efficiency| :func:`datasets.fetch_openml` no longer stores the full dataset
Suggested change (replace):
    - |Efficiency| :func:`datasets.fetch_openml` no longer stores the full dataset
with:
    - |Efficiency| :func:`datasets.fetch_openml` has reduced memory usage because it no longer stores the full dataset
Addressed in latest commit
sklearn/datasets/_openml.py (outdated)
# Note that if the data is dense, no reading is done until the data
# generator is iterated.
I'm not sure how to interpret this comment
Do you mean e.g. "Since we pass a generator, load() will read lines one by one and avoid excessive memory usage" ?
That'd be a useful comment IMHO
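For what it's worth, a self-contained toy example of the behaviour that comment would describe, where a generator of decoded lines is passed so nothing is buffered up front (io.BytesIO stands in for the HTTP response; this assumes the vendored sklearn.externals._arff module behaves like upstream liac-arff):

import io
from sklearn.externals import _arff

raw = (b"@RELATION demo\n"
       b"@ATTRIBUTE x NUMERIC\n"
       b"@ATTRIBUTE y NUMERIC\n"
       b"@DATA\n"
       b"1,2\n"
       b"3,4\n")
response = io.BytesIO(raw)  # stand-in for the urlopen response

# Only the header is read here; with DENSE_GEN the data rows are produced lazily.
arff = _arff.load((line.decode('utf-8') for line in response),
                  return_type=_arff.DENSE_GEN)
for row in arff['data']:  # rows are decoded one by one as we iterate
    print(row)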
sklearn/datasets/_openml.py (outdated)
col_slice_y = [int(features_dict[col_name]['index'])
               for col_name in target_columns]

col_slice_x = [int(features_dict[col_name]['index'])
               for col_name in data_columns]
for col_idx in col_slice_y:
    feat = features_list[col_idx]
    nr_missing = int(feat['number_of_missing_values'])
    if nr_missing > 0:
        raise ValueError('Target column {} has {} missing values. '
                         'Missing values are not supported for target '
                         'columns. '.format(feat['name'], nr_missing))
The column validation is moved into this function, which is decorated by _retry_with_clean_cache. So the ValueError raised here will cause the decorator to issue warn("Invalid cache, redownloading file", RuntimeWarning) and re-run the function after clearing the cache, which is not necessary.
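For reference, a rough sketch of how this kind of retry-with-clean-cache decorator behaves (pieced together from the diff discussed below for illustration, not the exact scikit-learn code; the cache-path layout is an assumption):

import os
from functools import wraps
from urllib.error import HTTPError
from warnings import warn

def _retry_with_clean_cache(openml_path, data_home):
    # Retry the wrapped function once after deleting the cached file.
    # HTTPErrors are re-raised immediately; any other exception is treated
    # as a possibly corrupted cache, so the local file is removed and the
    # function is called again, triggering a fresh download.
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kw):
            if data_home is None:
                return f(*args, **kw)
            try:
                return f(*args, **kw)
            except HTTPError:
                raise
            except Exception:
                warn("Invalid cache, redownloading file", RuntimeWarning)
                local_path = os.path.join(data_home, openml_path)  # assumed layout
                if os.path.exists(local_path):
                    os.unlink(local_path)
                return f(*args, **kw)
        return wrapper
    return decorator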
This is a fair point. I'll try to get some user-input validation out of the retry.
Thanks for the review @NicolasHug. I've tried to address your comments, but haven't been able to test locally thanks to an OS update breaking my conda build...
sklearn/datasets/_openml.py (outdated)
@@ -51,6 +51,8 @@ def wrapper(*args, **kw):
            return f(*args, **kw)
        except HTTPError:
            raise
        except ValueError:
except (HTTPError, ValueError):
    raise
Should we also pass through ArffException as well?
Should we also pass through ArffException as well?
No, because they may be due to data corruption.
That was my concern about not letting ValueError through too...
There are some ValueErrors in _arff.py as well. In principle, we only want to retry when there is a parsing error or an error from _open_openml_url, which is scoped to:
def _load_arff(...):
    response = _open_openml_url(url, data_home)
    with closing(response):
        arff = _arff.load((line.decode('utf-8')
                           for line in response),
                          return_type=return_type,
                          encode_nominal=not as_frame)
    return arff
Can we place this in its own function and use the retry wrapper on this new function?
No, that code only reads the headers and returns a generator, which is the basis of this change. The parsing errors will actually occur when that generator is iterated during conversion to arrays.
If the ValueErrors in _arff.py are not possible to raise due to data corruption, then the current solution is fine
The only ValueError I see from corruption may be:

scikit-learn/sklearn/externals/_arff.py, lines 243 to 249 in fa9cf22:
def _escape_sub_callback(match):
    s = match.group()
    if len(s) == 2:
        try:
            return _ESCAPE_SUB_MAP[s]
        except KeyError:
            raise ValueError('Unsupported escape sequence: %s' % s)
On a side note, we can define _load_arff as follows:
def _load_arff(..., as_frame):
    response = _open_openml_url(url, data_home)
    with closing(response):
        arff = _arff.load((line.decode('utf-8')
                           for line in response),
                          return_type=return_type,
                          encode_nominal=not as_frame)
        if as_frame:
            return _convert_arff_data_dataframe(arff, columns, features_dict)
        else:
            return _convert_arff_data(arff['data'], col_slice_x,
                                      col_slice_y, shape)
I'm not entirely sure what you're suggesting... The loader needs to also return metadata from arff, not just data. That's why it's all processed here, but you're right that it might be possible to simplify.
To be concrete, I was thinking of this: jnothman#8
PR updated such that only errors from the downloading and parsing will trigger a redownload.
still LGTM
Perhaps we can get another review here? Not essential, but a nice memory boost for fetch_openml, and unblocking work on the checksum PR.

As an aside, this may help resolve #16629 and allow us to turn back on the memory profiler for gallery examples.
Thanks @jnothman! LGTM, after a somewhat superficial review. I think our openml fetcher code is non-trivial and sometimes difficult to follow. I wonder if type annotations could help somewhat with readability, and whether we can move part of this logic upstream.
Maybe I should have synced master, but CI was green so I merged it. Looking into a fix in #17047
Great! Thanks!
I've not benchmarked yet. This should reduce memory requirements when fetching from OpenML.
This no longer explicitly closes the open URL handler, since it is required until the ARFF has been completely read. We could return the stream object from _download_data_arff if we want to close it.
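For illustration, a hypothetical sketch of that option: have the download helper return the raw HTTP response alongside the lazily parsed ARFF so the caller can close the stream once the data generator has been consumed. The signature and the COO/DENSE_GEN selection are assumptions here; _DATA_FILE, _open_openml_url and _arff are taken to be the module-level helpers already used in _openml.py.

def _download_data_arff(file_id, sparse, data_home, encode_nominal=True):
    url = _DATA_FILE.format(file_id)
    response = _open_openml_url(url, data_home)
    # Sparse data must be materialised; dense data can be streamed row by row.
    return_type = _arff.COO if sparse else _arff.DENSE_GEN
    arff = _arff.load((line.decode('utf-8') for line in response),
                      return_type=return_type,
                      encode_nominal=encode_nominal)
    # The caller is responsible for closing `response` once arff['data']
    # has been fully consumed.
    return response, arff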