ENH Verify md5-checksums received from openml arff file metadata #14800

Merged

Conversation

@shashanksingh28 (Contributor) commented Aug 24, 2019

Reference Issues/PRs

Fixes #11816

What does this implement/fix? Explain your changes.

When fetching an OpenML dataset, metadata about the dataset is fetched via https://openml.org/api/v1/json/data/, which includes the latest file version to download, its md5 checksum, etc.

This PR adds functionality to verify the md5 checksum of the downloaded file against the one provided via the API. If the validation fails, a ValueError is raised saying so.
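
For illustration, a minimal sketch of such a check (the helper name and error message below are hypothetical, not the actual scikit-learn code):

import hashlib

def _verify_md5(bytes_content, expected_md5_checksum):
    """Raise ValueError if the downloaded bytes do not match the expected md5."""
    actual_md5_checksum = hashlib.md5(bytes_content).hexdigest()
    if actual_md5_checksum != expected_md5_checksum:
        raise ValueError(
            "md5 checksum of downloaded file does not match the one provided "
            "by the OpenML API: {} != {}".format(actual_md5_checksum,
                                                 expected_md5_checksum))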

Any other comments?

  • Validation is done by default unless explicitly overridden.
  • Most files in the local test folder sklearn.datasets.tests.data.openml do not match their checksums. This may be because the cached metadata and the actual downloaded files come from different versions.

@jnothman (Member) commented Aug 25, 2019 via email

@shashanksingh28 (Contributor Author)

@thomasjpfan ready for your review; this incorporates what we discussed last time.

@thomasjpfan thomasjpfan self-requested a review September 5, 2019 12:14
BytesIO stream with the same content as input_stream for consumption
"""
with closing(input_stream):
    bytes_content = input_stream.read()
Member:

Can we do this without consuming the whole stream? https://docs.python.org/3/library/hashlib.html?highlight=hashlib#examples
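
(For reference, the chunked pattern described in those docs looks roughly like this; the file name is only an example:)

import hashlib

md5 = hashlib.md5()
with open("mnist_784.arff.gz", "rb") as f:  # example file name
    for chunk in iter(lambda: f.read(8192), b""):  # hash 8 KiB at a time
        md5.update(chunk)
print(md5.hexdigest())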

Contributor Author:

I don't think so. I could only think of this: to verify the hash we need to read the stream, so we read it and then return a new stream to keep the method's API consistent.

Did you mean consuming the stream in chunks and updating the hashlib.md5 instance with the chunks instead of reading it all in one go? I don't think that would make any difference.

Member:

Yes, I mean in chunks.

Currently, with this PR the stream will be read twice: once during the md5 check and again by the caller of _open_openml_url.

Another option is to have _open_openml_url return the byte content such that the callers do not need to call read.

Contributor Author:

Done; went with the second option. The stream is now read once and its content is passed along.

@reshamas (Member) commented Sep 9, 2019

@ingrid88 @kellycarmody
Let's keep an eye on this PR and complete if needed.

@shashanksingh28 (Contributor Author)

@thomasjpfan changes ready for your review. Thanks.

-------
bytes
"""
actual_md5_checksum = hashlib.md5(bytes_content).hexdigest()
Member:

I'm wondering if we should checksum in chunks. Could you please take a medium-sized OpenML dataset (e.g. MNIST, https://www.openml.org/d/554) and see whether computing the checksum the above way is still reasonable with respect to memory usage (see memory_profiler).
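
(One way to measure this with memory_profiler; the snippet below is only a suggestion, not part of the PR:)

from memory_profiler import memory_usage
from sklearn.datasets import fetch_openml

# peak memory (in MiB) while fetching MNIST (OpenML data_id=554)
usage = memory_usage((fetch_openml, (), {"data_id": 554, "data_home": None}))
print("peak memory: %.1f MiB" % max(usage))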

Contributor Author:

It does not make any difference (see test gist)

I guess it makes sense, from the wiki page:

The main algorithm then uses each 512-bit message block in turn to modify the state

It sounds like a state machine that uses fixed memory (states) irrespective of the size of the stream.

The only optimization that can be done here afaik is that, if the md5 checksum flag is enabled (the default), we could download the arff file in chunks and keep updating the md5 as we go, which would save us some time. It might add some code complexity though (we would have to manage the chunking with urllib); do the maintainers think it is worth it?
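
A rough sketch of that chunked-download idea (the function name and chunk size are made up for illustration, not the PR implementation):

import hashlib
from urllib.request import urlopen

def _download_with_md5(url, chunk_size=512 * 1024):
    """Download url in chunks, updating the md5 digest as the bytes arrive."""
    file_md5 = hashlib.md5()
    chunks = []
    with urlopen(url) as response:
        while True:
            data = response.read(chunk_size)
            if not data:
                break
            file_md5.update(data)
            chunks.append(data)
    return b"".join(chunks), file_md5.hexdigest()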

Member:

Currently the data is traversed twice. First when the stream is read into memory. Second when hashlib.md5 goes through the data.

Would the goal of this optimization be to only read the data once?

In your testing, do you find that the checksum increases the time it takes to load the data?

Contributor Author:

Added the optimization (it now reads the stream only once, and keeps updating the md5 while reading).

In terms of memory, it will not add any significant overhead as I mentioned, since md5 uses a state-machine.

In terms of time / cpu, here is a quick test result for a medium size dataset:

Existing openml (master branch)

$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None)"
1 loop, best of 5: 12.4 sec per loop

This PR:

$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=False)"
1 loop, best of 5: 12.3 sec per loop

$ python -m timeit -c "from sklearn.datasets import fetch_openml; fetch_openml(data_id=554, data_home=None, verify_checksum=True)"
1 loop, best of 5: 12.6 sec per loop

This is a single iteration, but it does not look like it slows things down significantly.

Member:

What was the time before the optimization? (831d78b)

Member:

I do not think this is actually better than before. The way to get a benefit from chunking is to verify the data at the same time it is being decoded by the arff parser.

The chunking this PR is doing still ends up reading the data twice: once for checking, and again when the data is parsed by the arff parser.

Contributor Author:

Data was already being read twice, once here and once again here.

This PR maintains the number of times the stream is read.

To make this validation occur while decoding the arff would:

  1. Be wasteful if done each time decode is called (currently validation is done only when data is downloaded from the internet); future decodes read from a locally cached file.
  2. Involve adding the check in the arff decode logic, which is in the externals module. Would we prefer to modify this?

Member:

I would prefer not to change the arff module. I suspect parsing the arff file needs the whole thing in memory. I am happy with the previous non-chunked version.

Contributor Author:

831d78b would make the stream be read 3 times: once at download (non-chunked), once to validate (all in one go), and once to decode (arff).

In my testing the extra validation pass is insignificant in terms of memory or CPU, so I would not worry too much about it. If you agree, I will push a revert commit.

Member:

Okay, let's keep the current version.

@thomasjpfan thomasjpfan self-assigned this Sep 16, 2019
fsrc = gzip.GzipFile(fileobj=fsrc, mode='rb')
bytes_content = fsrc.read()
if expected_md5_checksum:
    # validating checksum reads and consumes the stream
Member:

Not a stream anymore


        break
    if expected_md5:
        file_md5.update(data)
    fsrc_bytes += data
Member:

I am concerned that every assignment here will create a new bytes object, since bytes is immutable. Would bytearray and bytearray.extend be better in this case?
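
(Illustration of the concern; += on an immutable bytes object versus extending a bytearray:)

chunks = [b"a" * 1024, b"b" * 1024, b"c" * 1024]

fsrc_bytes = b""
for data in chunks:
    fsrc_bytes += data        # generally allocates a new bytes object each time

buf = bytearray()
for data in chunks:
    buf.extend(data)          # grows the same buffer in place
fsrc_bytes = bytes(buf)       # one final copy, if immutable bytes are needed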

@shashanksingh28 (Contributor Author)

I don't know why the CI is failing now (some installation step in py35_ubuntu_atlas); has anyone seen this before?

return bytearray instead of new bytes

Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
@shashanksingh28 (Contributor Author)

Any thoughts on this PR? I have updated with master...

@reshamas (Member)

@cmarmo Can we add this PR to your radar as well?
It's from the NYC Aug 2019 sprint. Thanks.

@cmarmo (Member) commented Nov 26, 2019

@rth @thomasjpfan are all the comments addressed here? Are you happy enough? :)

@thomasjpfan (Member) left a comment:

Seeing the memory usage:

On master

from sklearn.datasets import fetch_openml

%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1266.97 MiB, increment: 1175.53 MiB

this PR

%memit fetch_openml(data_id=554, data_home="data_home")
# peak memory: 1403.14 MiB, increment: 1313.07 MiB

if expected_md5 does not match actual md5-checksum of stream
"""
fsrc_bytes = bytearray()
file_md5 = hashlib.md5() if expected_md5_checksum else None
Member:

With the early exiting above:

Suggested change:
- file_md5 = hashlib.md5() if expected_md5_checksum else None
+ file_md5 = hashlib.md5()

------
ValueError :
if expected_md5 does not match actual md5-checksum of stream
"""
Member:

We can early exit here:

if expected_md5 is None:
    return fsrc.read()

    data = fsrc.read(chunk_size)
    if not data:
        break
    if expected_md5:
Member:

With the early exiting:

file_md5.update(data)
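
Taken together, these suggestions point to a helper of roughly the following shape (a sketch only; names such as _read_with_checksum are hypothetical, not the PR code):

import hashlib

def _read_with_checksum(fsrc, expected_md5=None, chunk_size=512 * 1024):
    """Read a file-like object, verifying its md5 checksum when one is given."""
    if expected_md5 is None:
        # early exit: nothing to verify
        return fsrc.read()
    file_md5 = hashlib.md5()
    fsrc_bytes = bytearray()
    while True:
        data = fsrc.read(chunk_size)
        if not data:
            break
        file_md5.update(data)
        fsrc_bytes.extend(data)
    if file_md5.hexdigest() != expected_md5:
        raise ValueError("md5 checksum of local file does not match the value "
                         "from the OpenML metadata")
    return bytes(fsrc_bytes)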

sklearn/datasets/tests/test_openml.py (outdated review thread, resolved)
modified_gzip.write(data)

# simulate request to return modified file
mocked_openml_url = sklearn.datasets._openml.urlopen
Member:

Leave a comment that sklearn.datasets._openml.urlopen is already mocked by _monkey_patch_webbased_functions.

@thomasjpfan (Member)

Please add an entry to the change log at doc/whats_new/v0.24.rst with the |Feature| tag. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

assert exc.match("1666876")

# cleanup fake local file
os.remove(corrupt_copy)
Member:

The removal of the tmpdir is handled by pytest. (Also, this is causing the Windows builds to fail.)

Suggested change:
- os.remove(corrupt_copy)

@thomasjpfan (Member)

#17053 is merged now!

@rth (Member) commented Jun 25, 2020

I have merged master in to resolve conflicts and will try to review in the next few days. (Though don't wait for my review before merging if there are enough reviewers.) Thanks for your patience @shashanksingh28!

@rth rth self-requested a review June 25, 2020 19:38
@rth (Member) left a comment:

Actually LGTM, thanks @shashanksingh28 !

@thomasjpfan (Member) left a comment:

LGTM

@shashanksingh28 Thank you for your patience on this!

@thomasjpfan changed the title from "[MRG] Verify md5-checksums received from openml arff file metadata" to "ENH Verify md5-checksums received from openml arff file metadata" on Jun 26, 2020
@thomasjpfan thomasjpfan merged commit 1e08459 into scikit-learn:master Jun 26, 2020
7 checks passed
@reshamas (Member)

This is an accomplishment! This PR is one of the last two from the Aug 2019 NYC sprint. (I just pinged on the other one.) Congrats, everyone.

@thomasjpfan can we remove the "Waiting for Reviewer" label on this one? Thanks!

cc: @amueller @cmarmo

@jnothman (Member) left a comment:

Nice! Great work, @shashanksingh28, and a much neater solution than some of those we tried on the way!!

@shashanksingh28 (Contributor Author)

Thanks @jnothman, @thomasjpfan and @rth for your help with this :) Feels nice to have my first contribution to scikit-learn!

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
…it-learn#14800)

Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…it-learn#14800)

Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

Successfully merging this pull request may close these issues.

Validate MD5 checksum of downloaded ARFF in fetch_openml