Create valid BDBags with external data #518
Also compute the hash on the fly by downloading external data if it's missing for some reason. Switch the default algorithms to md5 and sha512; the latter is native to Girder.
This avoids a situation where we end up with a bag that bdbag can't handle...
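As a rough illustration of computing both default checksums on the fly, here is a minimal sketch that hashes a stream in one pass. It is not the code from this PR: `compute_checksums` and its parameters are hypothetical names, and an in-memory `io.BytesIO` stands in for the downloaded external data.

```python
import hashlib
import io


def compute_checksums(stream, algorithms=("md5", "sha512"), chunk_size=1024 * 1024):
    """Read a binary file-like object once and return hex digests per algorithm.

    Hypothetical helper for illustration only; not part of Girder or bdbag.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Read fixed-size chunks so a large download never sits fully in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Assumption: a real caller would pass the raw body of an HTTP response
# here instead of an in-memory buffer.
digests = compute_checksums(io.BytesIO(b"hello world"))
print(digests["md5"])  # 5eb63bbbe01eeed093cb22bb8f5acdc3
```

Feeding every hasher from the same chunk means the data is downloaded and read only once, however many algorithms are requested.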
Sad that I didn't catch this eons ago. The fix works for D1, Zenodo, Dataverse. My test case:
What are your thoughts on Globus/MDF? I can't even register http://dx.doi.org/doi:10.18126/M2301J right now:
Globus unfortunately doesn't provide checksums. There might be a way to query them, but I'm still investigating.
Codecov Report
@@ Coverage Diff @@
## master #518 +/- ##
==========================================
- Coverage 93.41% 92.84% -0.58%
==========================================
Files 54 54
Lines 4179 4265 +86
==========================================
+ Hits 3904 3960 +56
- Misses 275 305 +30
This has exposed how inadequate our testing has been for bags.
Responding to your points:
That was a simple mistake on my end. It's been fixed in 1c2a068
All of the above,
which is a part of doi:10.18126/M2301J and has a perfectly nice folder URI.
e.g.
I reverted it for now (e805118), since I don't really need it atm. Fun fact though: apparently each manifest- doesn't have to contain all the files. Right now I dump only md5 fully. Sha256 contains only hash for
It's calculated during registration only once, by manually downloading and running md5() on it. Result is stored in
To summarize what happened in this PR, because there are a few nuances that made it work:
- `aggregates` section, only files (see 95d5b36). This avoids the situation where 1) an object doesn't have a `uri` or 2) we end up with `doi:something data/folder` in the bag's manifest. The latter, while neat, wasn't even working with BDBag, because it doesn't have a generic DOI resolver.
- `dataSet` generation during Tale import. This is important for a roundtrip of Tales that have folders or subfolders from external datasets in `data/` (see 46a6f30).
- `TaleExporter.verify_aggregate_checksums` in ce194f1). I haven't checked how it's going to behave for "very large data"™. Globus simply doesn't work right now. I'll handle it as a separate PR (remind me to file an issue).
- `identifier` that looks like `doi:doi:<>`.
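The duplicated-prefix identifier mentioned in the last item can be illustrated with a small sketch. `normalize_doi` is a hypothetical helper, not the fix made in this PR, and the choice to add a prefix to a bare DOI is an assumption made for the example.

```python
def normalize_doi(identifier: str) -> str:
    """Collapse any repeated leading 'doi:' prefixes into exactly one.

    Hypothetical helper for illustration; not taken from this PR.
    """
    prefix = "doi:"
    value = identifier.strip()
    # Strip every leading prefix (case-insensitively), then add one back.
    while value.lower().startswith(prefix):
        value = value[len(prefix):]
    return prefix + value


print(normalize_doi("doi:doi:10.18126/M2301J"))  # doi:10.18126/M2301J
print(normalize_doi("10.18126/M2301J"))          # doi:10.18126/M2301J
```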
tl;dr Now any exported Tale with arbitrary external data (from Zenodo, DataONE, Dataverse or raw HTTP) should work with:
How to test?