Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No block digest written for warcinfo records #87

Closed
JustAnotherArchivist opened this issue Jul 26, 2019 · 3 comments · Fixed by #115
Closed

No block digest written for warcinfo records #87

JustAnotherArchivist opened this issue Jul 26, 2019 · 3 comments · Fixed by #115

Comments

@JustAnotherArchivist
Copy link
Contributor

I noticed today that warcio doesn't generate a block digest for warcinfo records:

NO_BLOCK_DIGEST_TYPES = ('warcinfo')

This seems to have been introduced in a791617, but I was unable to figure out why. The spec permits block digests on any record (whereas a payload digest would make no sense on a warcinfo record due to the content type used normally), and it seems good practice to me to always store a digest to allow for integrity checks.

JustAnotherArchivist added a commit to JustAnotherArchivist/qwarc that referenced this issue Aug 26, 2019
…ebrecorder/warcio#87)

The length has to be set manually because otherwise warcio will automatically remove the header again.
@JustAnotherArchivist
Copy link
Contributor Author

Side note: NO_BLOCK_DIGEST_TYPES is actually a string, not a tuple as certainly intended. That won't have any negative effect though unless someone is writing custom records with a WARC-Type that is a substring of warcinfo.

@wumpus
Copy link
Collaborator

wumpus commented Jul 13, 2020

The ('warcinfo') aspect of the bug is old. It was (accidentally) introduced in

25194bc
warcwriter: if writing a record with no Content-Length, buffer and compute length, as well as digests, as needed

You're right that any recordtype that's a subset of the string 'warcinfo' will match, fortunately there is no such standard recordtype.

As for that explicit code that stops writing block digests for warcinfo records, that code was in the original checkin, when the code was split out of webrecorder. I'm kind of surprised that I didn't notice it while working on "warcio test".

@JustAnotherArchivist
Copy link
Contributor Author

I dug around a bit more. a791617 is not the commit that introduced this behaviour. It appears that it was already like that back when warcinfo records were first added in webrecorder/pywb@d40edfc2; at the time, digests were only written for responses as far as I can tell.

I can't really think of any good reason why block digests shouldn't simply always be present for verification purposes. The only exception would be zero-length records, where the digest is useless, but those are so rare and odd that worrying about them isn't worth it. I'll create a PR shortly.

JustAnotherArchivist added a commit to JustAnotherArchivist/warcio that referenced this issue Aug 4, 2020
ikreymer pushed a commit that referenced this issue Aug 11, 2020
* Enable writing block digests for warcinfo records

Fixes #87

* Fix test_recompress_arc2warc failure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants