warcio doesn't verify digests on read #18
I coded this up by hacking LimitReader so that recordloader.py can configure it to test digests. If I turn it on for all of the existing tests, the only ones that fail are the truncated WARC (expected) and the block_digest revisit record in example.warc (which might be your bug, not mine... I'll figure it out). Because of your enthusiasm for streaming, the digest check only fires if the user reads to the end of the record. So be it. I have a few style questions:
New try: #54
Done by "warcio check"
I was experimenting with injecting digests from my crawler, so that digests aren't computed twice, and noticed that records with a bad WARC-Payload-Digest don't raise an exception on read. No code for it, so I suppose this is a feature request.
It should be possible to disable the check, and "warcio index" and recompress ought to have a command-line flag to ignore digest errors.
Lacking this feature, I don't think that warcio currently has any test to ensure that it's correctly computing digests.
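To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written against plain file-like objects rather than warcio's internals. The names `DigestCheckingReader`, `DigestMismatch`, `warc_digest` and the `check` flag are all hypothetical and not part of warcio; I'm also assuming the usual `algorithm:BASE32` digest form (for example `sha1:...`). The wrapper hashes bytes as they stream through and only compares once the caller has read to the end.

```python
import base64
import hashlib
import io


class DigestMismatch(Exception):
    """Raised when the computed digest differs from the expected one."""


def warc_digest(payload, algo="sha1"):
    """Return a digest string in the assumed 'algo:BASE32' form."""
    raw = hashlib.new(algo, payload).digest()
    return algo + ":" + base64.b32encode(raw).decode("ascii")


class DigestCheckingReader:
    """Hypothetical wrapper: hash bytes as they are read, compare at EOF."""

    def __init__(self, stream, expected, check=True):
        self.stream = stream
        self.check = check  # set to False to skip verification entirely
        self.algo, _, self.expected_value = expected.partition(":")
        self.hasher = hashlib.new(self.algo)
        self.verified = False

    def read(self, size=-1):
        if size == 0:
            return b""
        data = self.stream.read(size)
        if data:
            self.hasher.update(data)
        # EOF is reached after an unbounded read, or when a bounded read
        # returns nothing. A partial read never triggers the check.
        at_eof = size is None or size < 0 or not data
        if at_eof:
            self._verify()
        return data

    def _verify(self):
        if self.verified or not self.check:
            return
        self.verified = True
        actual = base64.b32encode(self.hasher.digest()).decode("ascii")
        if actual != self.expected_value.upper():
            raise DigestMismatch(
                "expected %s:%s, computed %s:%s"
                % (self.algo, self.expected_value, self.algo, actual)
            )
```

With something like this, a reader that stops partway through a record never sees an error, and passing `check=False` gives the "ignore digest errors" behaviour that an index or recompress flag could map to.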