ENH Verify md5-checksums received from openml arff file metadata #14800
Conversation
Force-pushed from 6281ea2 to df4c049
Most files in the local tests sklearn.datasets.tests.data.openml folder do not match their checksums. This may be because of differences between the metadata version and the actual file downloaded.
Those files were artificially shortened to fit within the package. Feel free to calculate and store their checksums rather than those on OpenML.
@thomasjpfan ready for your review; incorporates what we discussed last time.
sklearn/datasets/openml.py
Outdated
    BytesIO stream with the same content as input_stream for consumption
    """
    with closing(input_stream):
        bytes_content = input_stream.read()
Can we do this without consuming the whole stream? https://docs.python.org/3/library/hashlib.html?highlight=hashlib#examples
I don't think so. I could only think of this: to verify the hash we need to read the stream, so we read it and then return a new stream to keep the method's API consistent.
Did you mean consuming the stream in chunks and updating the hashlib.md5 instance with the chunks instead of reading it all in one go? I don't think that would make any difference.
Yes, I mean in chunks.
Currently, with this PR the stream will be read twice: once during the md5 check and again by the caller of _open_openml_url.
Another option is to have _open_openml_url return the byte content so that the callers do not need to call read.
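For reference, the chunked hashing pattern being discussed looks roughly like this (a minimal sketch; the helper name and chunk_size are illustrative, not the code in this PR):

import hashlib

def md5_of_stream(fsrc, chunk_size=512 * 1024):
    # Feed the stream to the digest in fixed-size chunks so the whole
    # payload never has to be held in memory just for hashing.
    file_md5 = hashlib.md5()
    while True:
        data = fsrc.read(chunk_size)
        if not data:
            break
        file_md5.update(data)
    return file_md5.hexdigest()

hashlib itself only keeps a fixed-size internal state, so memory use stays flat regardless of file size.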
Went with the second option: it now reads the stream once and passes the content along.
@ingrid88 @kellycarmody
Force-pushed from d53fe3d to 831d78b
@thomasjpfan changes ready for your review. Thanks.
sklearn/datasets/openml.py
Outdated
    -------
    bytes
    """
    actual_md5_checksum = hashlib.md5(bytes_content).hexdigest()
I'm wondering if we should checksum in chunks. Could you please take a medium sized OpenML dataset (e.g. MNIST https://www.openml.org/d/554) and see whether computing the checksum the above way is still reasonable with respect to memory usage (see memory_profiler).
It does not make any difference (see test gist)
I guess it makes sense, from the wiki page:
The main algorithm then uses each 512-bit message block in turn to modify the state
It sounds like a state machine that uses fixed memory (states) irrespective of the size of the stream.
The only optimization that can be done here afaik is that if the md5checksum flag is enabled (default), we could download the arff file in chunks and keep updating the md5, which would save us some time. It might add some code complexity though (we will have to manage the chunking with urllib). Do the maintainers think it is worth it?
Currently the data is traversed twice. First when the stream is read into memory. Second when hashlib.md5 goes through the data.
Would the goal of this optimization make it so we read the data twice?
In your testing do you find that the checksum increases the time it loads the data?
Added the optimization (it now reads the stream only once, and keeps updating the md5 while reading).
In terms of memory, it will not add any significant overhead as I mentioned, since md5 uses a state-machine.
In terms of time / cpu, here is a quick test result for a medium size dataset:
Existing openml (master branch)
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None)"
1 loop, best of 5: 12.4 sec per loop
This PR:
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=False)"
1 loop, best of 5: 12.3 sec per loop
$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=True)"
1 loop, best of 5: 12.6 sec per loop
This is a single iteration, but it looks like the checksum is not making it significantly slower.
What was the time before the optimization? (831d78b)
I do not think this is actually better than before. The way to get a benefit from chunking is to verify the data at the same time it is being decoded by the arff parser.
The chunking this PR is doing still ends up reading the data twice: once for checking, and again when the data is parsed by the arff parser.
Data was already being read twice, once here and once again here.
This PR maintains the number of times the stream is read.
Making this validation occur while decoding the arff would:
- Be wasteful if done each time decode is called (currently validation is done only when the data is downloaded from the internet; future decodes read from a locally cached file)
- Involve adding a check in the arff decode logic, which lives in the externals module.
Would we prefer to modify this?
I would prefer not to change the arff module. I suspect the parsing of the arff file needs the whole thing in memory. I am happy with the previous non-chunked version.
831d78b would make the stream be read 3 times: once at download (non-chunked), then to validate (all in one go) and then to decode (arff).
In my testing the overhead of the extra validation pass is insignificant in terms of memory or CPU, so I would not worry too much about it. If you also agree then I will push a revert commit.
Okay, let's keep the current version.
sklearn/datasets/openml.py
Outdated
    fsrc = gzip.GzipFile(fileobj=fsrc, mode='rb')
    bytes_content = fsrc.read()
    if expected_md5_checksum:
        # validating checksum reads and consumes the stream
Not a stream anymore
Force-pushed from 2b18909 to 10ecf9a
sklearn/datasets/openml.py
Outdated
            break
        if expected_md5:
            file_md5.update(data)
        fsrc_bytes += data
I am concerned that every assignment here will create a new bytes object, since bytes is immutable. Would bytearray and bytearray.extend be better in this case?
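A minimal sketch of the bytearray-based accumulation being suggested (io.BytesIO stands in for the real downloaded stream; chunk_size is illustrative):

import hashlib
import io

fsrc = io.BytesIO(b"example payload " * 1000)  # stand-in for the gzip stream
chunk_size = 512 * 1024
file_md5 = hashlib.md5()
fsrc_bytes = bytearray()                       # mutable buffer, grown in place
while True:
    data = fsrc.read(chunk_size)
    if not data:
        break
    file_md5.update(data)
    fsrc_bytes.extend(data)                    # no new bytes object per chunk
content = bytes(fsrc_bytes)                    # convert once at the end if needed

With an immutable bytes accumulator, fsrc_bytes += data generally has to allocate a new object and copy everything accumulated so far on each iteration; bytearray.extend appends to the existing buffer.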
I don't know why the CI is failing now (some installation step in py35_ubuntu_atlas). Has anyone seen this before?
return bytearray instead of new bytes Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
Any thoughts on this PR? I have updated with master...
@cmarmo Can we add this PR to your radar as well?
@rth @thomasjpfan are all the comments addressed here? Are you happy enough? :)
Seeing the memory usage:
On master
from sklearn.datasets import fetch_openml
%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1266.97 MiB, increment: 1175.53 MiB
this PR
%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1403.14 MiB, increment: 1313.07 MiB
sklearn/datasets/_openml.py
Outdated
        if expected_md5 does not match actual md5-checksum of stream
    """
    fsrc_bytes = bytearray()
    file_md5 = hashlib.md5() if expected_md5_checksum else None
With the early exiting above:
file_md5 = hashlib.md5() if expected_md5_checksum else None
file_md5 = hashlib.md5()
sklearn/datasets/_openml.py
Outdated
    ------
    ValueError :
        if expected_md5 does not match actual md5-checksum of stream
    """
We can early exit here:
if expected_md5 is None:
    return fsrc.read()
sklearn/datasets/_openml.py
Outdated
        data = fsrc.read(chunk_size)
        if not data:
            break
        if expected_md5:
With the early exiting:
file_md5.update(data)
modified_gzip.write(data)

# simulate request to return modified file
mocked_openml_url = sklearn.datasets._openml.urlopen
Leave a comment that sklearn.datasets._openml.urlopen is already mocked by _monkey_patch_webbased_functions.
Please add an entry to the change log at
flake8 Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…h28/scikit-learn into md5checksum_validate_openml
assert exc.match("1666876")

# cleanup fake local file
os.remove(corrupt_copy)
The removal of the tmpdir is handled by pytest. (Also, this is causing the Windows builds to fail.)
os.remove(corrupt_copy)
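For context, pytest's tmp_path fixture hands each test its own temporary directory and cleans up stale ones automatically, so no explicit os.remove is needed. A rough illustration with a hypothetical test and file name:

import gzip

def test_fetch_openml_detects_corrupt_cache(tmp_path):
    # tmp_path is a per-test pathlib.Path managed by pytest; leaving the
    # file in place is fine because pytest removes old temp directories.
    corrupt_copy = tmp_path / "corrupt.arff.gz"
    corrupt_copy.write_bytes(gzip.compress(b"not the expected payload"))
    assert corrupt_copy.exists()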
Force-pushed from 8c049a0 to c55f64a
#17053 is merged now!
I have merged master in to resolve conflicts and will try to review in the next few days. (Though don't wait for merging if there are enough reviewers). Thanks for your patience @shashanksingh28!
Actually LGTM, thanks @shashanksingh28 !
LGTM
@shashanksingh28 Thank you for your patience on this!
This is an accomplishment! This PR is one of the last two from the Aug 2019 NYC sprint. (I just pinged on the other one.) Congrats, everyone. @thomasjpfan can we remove the "Waiting for Reviewer" label on this one? Thanks!
Nice! Great work, @shashanksingh28, and a much neater solution than some of those we tried on the way!!
Thanks @jnothman, @thomasjpfan and @rth for your help with this :) Feels nice to have my first contribution to scikit-learn!
…it-learn#14800) Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Reference Issues/PRs
Fixes #11816
What does this implement/fix? Explain your changes.
When fetching an OpenML dataset, metadata about the dataset is fetched via https://openml.org/api/v1/json/data/, which includes the latest file version to download, its md5-checksum, etc. This PR adds functionality to verify the md5-checksum of the downloaded file against the one provided via the API. If the validation fails, it raises a ValueError stating the same.
Any other comments?
Most files in the local tests sklearn.datasets.tests.data.openml folder do not match their checksums. This may be because of differences between the metadata version and the actual file downloaded.
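Conceptually, the verification added here boils down to something like the following (a simplified sketch; the function name and error message wording are illustrative, not the exact code in the PR):

import hashlib

def verify_md5(bytes_content, expected_md5_checksum):
    # Compare the md5 of the downloaded payload against the checksum
    # reported in the OpenML metadata and fail loudly on a mismatch.
    actual_md5_checksum = hashlib.md5(bytes_content).hexdigest()
    if actual_md5_checksum != expected_md5_checksum:
        raise ValueError(
            "md5 checksum of downloaded file ({}) does not match the "
            "expected checksum ({})".format(actual_md5_checksum,
                                            expected_md5_checksum))
    return bytes_content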