"warcio check" incorrectly reporting payload digest failures for non-HTTP WARCs #156

Open
acidus99 opened this issue Aug 24, 2023 · 2 comments

@acidus99

acidus99 commented Aug 24, 2023

I'm using WARC files with non-HTTP traffic, specifically the Gemini protocol. I'm setting the WARC-Content-Type appropriately to reflect this.

warcio check has been helpful for finding problems with WARCs, such as incorrect block digests or records with invalid content lengths. However, warcio check is incorrectly reporting payload digest failures on these records:

$ warcio check gemini.warc
gemini.warc
  offset 702 WARC-Record-ID <urn:uuid:c0f9d8fd-d27e-43dc-80be-4fbd864c128d> response
    payload digest failed sha256:20670b53ae319b676698eb1aec228b492328574d78c1425b6b68a77876763403

If warcio doesn't understand the protocol defined by a record's WARC-Content-Type field (in this case application/gemini; msgtype=response), it can't know what constitutes the payload for that record, and thus cannot check the WARC-Payload-Digest field. To my knowledge (and from a quick check of the source code) warcio has no concept of the Gemini protocol, so I'm unclear on how it would know what the payload is, or whether the digest is valid. Section 6.3.3 of the WARC spec even says the contents of a response record aren't defined for non-HTTP URI schemes.

Perhaps I misunderstand what can be in a payload digest header, but reporting payload digest failures for unknown protocols seems like a bug? At the very least it's cluttering the output.

Attached is an example WARC with a request and response records for Gemini.
gemini.warc.gz

Without getting too detailed, Gemini protocol responses contain a single response line with a status code and MIME type, a single CRLF, and then the body of the response. This body is the Gemini equivalent of HTTP's entity-body per section 6.3.2. In the WARC example, the body of the response begins at offset 1338 in the uncompressed version of the WARC file (at the '#' character). The body ends at the end of the record, before the final double CRLF that marks the end of the record. The sha256 of this body is 20670b53ae319b676698eb1aec228b492328574d78c1425b6b68a77876763403, which I use in the payload digest field so I can do deduping and generate indexes.
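
To make the payload definition above concrete, here is a minimal sketch of how such a digest could be computed, assuming the payload is everything after the first CRLF of the Gemini response. The function name `gemini_payload_digest` and the sample response bytes are hypothetical and are not taken from warcio or from the attached WARC.

```python
import hashlib


def gemini_payload_digest(response: bytes) -> str:
    """Return 'sha256:<hex>' for the body that follows the first CRLF.

    Assumption: the response is one header line (status code and MIME
    type), a single CRLF, and then the body, as described in this issue.
    """
    _header, sep, body = response.partition(b"\r\n")
    if not sep:
        # No CRLF at all, so there is no complete response line to split on.
        raise ValueError("no CRLF found in Gemini response")
    return "sha256:" + hashlib.sha256(body).hexdigest()


# Hypothetical response, for illustration only.
sample = b"20 text/gemini\r\n# Hello\n"
print(gemini_payload_digest(sample))
```

A tool that has no such protocol-specific rule has no way to know where the body starts, which is why it cannot meaningfully verify the digest.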

My suggestion would be that warcio check should not check the payload digest for records whose WARC-Content-Type is an unknown protocol. This would leave room for future PRs that add support for other protocols to warcio.
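
A rough sketch of what that decision could look like, assuming the checker only knows how to locate an HTTP payload. This is not warcio's actual code: `should_check_payload_digest` and `KNOWN_PROTOCOL_MEDIA_TYPES` are hypothetical names, and the sketch only considers request/response records, not record types where the payload is the whole block.

```python
from typing import Optional

# Hypothetical: media types whose payload boundaries the checker understands.
KNOWN_PROTOCOL_MEDIA_TYPES = {"application/http"}


def should_check_payload_digest(content_type: Optional[str]) -> bool:
    """Decide whether a request/response record's payload digest is checkable.

    content_type is the record-level content type value, for example
    'application/gemini; msgtype=response'.
    """
    if not content_type:
        return False
    # Drop parameters such as '; msgtype=response' and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in KNOWN_PROTOCOL_MEDIA_TYPES


print(should_check_payload_digest("application/http; msgtype=response"))
print(should_check_payload_digest("application/gemini; msgtype=response"))
```

Supporting a new protocol would then mean adding its media type to the set along with a matching rule for finding the payload.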

@acidus99
Author

acidus99 commented Aug 24, 2023

FWIW @JustAnotherArchivist touched on parts of this "what is a well-defined payload" here:
#93

Perhaps there is another discussion to be had here on "iipc/warc-specifications". I would suggest that the payload definition shouldn't be codified into the WARC spec. Tools should be able to work with new protocols and payloads, and shouldn't make assumptions about what constitutes a payload for protocols/URIs they don't understand.

@wumpus
Collaborator

wumpus commented Aug 27, 2023

I 100% agree that I didn't take this into account when writing the check code! Thanks for the analysis.
