S3's support for additional checksum algorithms #1005

Open
alexwlchan opened this issue Jun 10, 2022 · 0 comments

AWS recently released support for new checksum algorithms: SHA-1, SHA-256, CRC-32, and CRC-32C. You can specify a checksum on object upload, and AWS will tell you if the checksum doesn't match. It also stores the checksum as part of the object attributes, which you can retrieve using GetObjectAttributes. This is obviously intriguing as a possible way to do bag verification.
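For concreteness, this is roughly what the feature looks like with boto3 (the bucket and key are placeholders, not anything in our storage service):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to compute and store a SHA-256 checksum at upload time.  You can
# also pass a precomputed digest (ChecksumSHA256=...), in which case the
# upload fails if it doesn't match what S3 computes.
s3.put_object(
    Bucket="example-bucket",
    Key="example.txt",
    Body=b"hello world",
    ChecksumAlgorithm="SHA256",
)

# Later, retrieve the stored checksum without downloading the object.
resp = s3.get_object_attributes(
    Bucket="example-bucket",
    Key="example.txt",
    ObjectAttributes=["Checksum"],
)
print(resp["Checksum"]["ChecksumSHA256"])  # base64-encoded digest
```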

These are my thoughts on the feature, and how we might use it in the storage service.

tl;dr: I think it's an interesting feature we'd be able to use as an additional check in some cases, but it can't replace the bag verifier.

**It only supports a limited set of checksum algorithms.** We already support SHA-512 in the storage service, and we have at least a few bags that use it. Until AWS adds support for that algorithm, we can't verify those checksums with this feature.

**It only supports a single checksum per object.** A bag can contain multiple payload manifests (and we have at least a few bags that do), and the bag verifier will verify every checksum in every manifest. S3 can only verify one of those checksums.

Although I don't think we're bringing in any new bags with multiple checksums, I can imagine it happening for some born-digital content. If a donor supplies, say, MD5 checksums, it'd be nice to create an MD5 manifest as well as the SHA-256 manifest, so we get end-to-end verification as far back as the donor.
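For example, a bag with both manifests would have two files in its root, each listing a digest for every payload file (the path and digests here are made up):

```
manifest-sha256.txt:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  data/report.txt

manifest-md5.txt:
098f6bcd4621d373cade4e832627b4f6  data/report.txt
```

The bag verifier checks every line of both files; S3 could only store and verify one of these values per object.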

It uses a "checksum of checksums" for large objects. If you use multi-part uploads (which you have to use for objects >5GB), what you get isn't a checksum of the object as a whole, but of the parts:

> The AWS SDKs now take advantage of client-side parallelism and compute checksums for each part of a multipart upload. The checksums for all of the parts are themselves checksummed and this checksum-of-checksums is transmitted to S3 when the upload is finalized.

The bag verifier gets a checksum for the object as a whole.
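As I understand the docs, the composite value is a digest of the concatenated binary part digests, with the part count appended – so you can only reproduce it if you know what part size the uploader chose. A sketch of the difference (the part size is an assumption):

```python
import base64
import hashlib

PART_SIZE = 8 * 1024 * 1024  # assumption: must match the uploader's part size

def whole_object_sha256(path):
    """What the bag verifier computes: a digest over every byte of the file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(PART_SIZE), b""):
            h.update(chunk)
    return base64.b64encode(h.digest()).decode()

def multipart_sha256(path):
    """What S3 stores for a multipart upload: a digest of the concatenated
    per-part digests, with the part count appended."""
    part_digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(PART_SIZE), b""):
            part_digests.append(hashlib.sha256(chunk).digest())
    composite = hashlib.sha256(b"".join(part_digests)).digest()
    return f"{base64.b64encode(composite).decode()}-{len(part_digests)}"
```

The two values are unrelated, so even when S3 has a SHA-256 attribute for a large object, it isn't the SHA-256 that appears in a payload manifest.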

**We have millions of objects that pre-date this feature.** Unless we re-upload every object already in the storage service (an expensive and unnecessary risk), those objects will never have these checksum attributes. Although they're already written, we need to be able to re-verify them, e.g. if a future version of a bag refers to them in a fetch.txt.
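For those objects, the only option is what the bag verifier already does: stream the object back and compute the digest ourselves. A rough sketch (bag manifests store hex digests, so that's what we compare against):

```python
import hashlib

import boto3

s3 = boto3.client("s3")

def verify_object(bucket, key, expected_sha256_hex):
    """Re-verify an object with no checksum attributes by streaming it
    and hashing client-side."""
    h = hashlib.sha256()
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):
        h.update(chunk)
    return h.hexdigest() == expected_sha256_hex
```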

I can imagine us using this as an additional check on objects that (1) use a SHA-256 checksum algorithm and (2) are small enough not to require multi-part upload, but the bag verifier provides more robust checks than this feature.
