How should one get a data download URL from a DOI? #1
Comments
I'm not sure how this would solve the multiple file problem. There are lots of repos with multiple files under one DOI. One answer would be to download a zipped directory either in all cases (simplest), or just in the case of multiple files. But it seems useful to be able to target one file as well. |
100% agree that content negotiation would be a great way to sort this out. doesn't seem like that's possible right now. short of that, i can envision client-side mappings between data providers and how to get URLs for actual data files - those mappings could be in JSON, e.g., so they are language agnostic |
Re targeting one file: I agree this would be nice; it's basically up to the repository. For instance, Dryad gives a DOI to the whole package and individual DOIs to each of the component parts, so the user could use the package DOI if they want the whole package, or the DOI for a particular csv if they want that csv. (Currently they still have to resolve download urls though, as those DOIs all resolve only to HTML landing pages). In this case, DataCite actually does list each of the parts as related works in the datacite-xml version from CN, e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.5061/dryad.2k462 However, that's not always the case of course. e.g. KNB gives unique IDs to all the parts (e.g. to each csv file), but I believe only the package as a whole gets an actual registered DOI, and thus DataCite record has no information about these additional parts (e.g. compare, on DataCite https://search.datacite.org/works/10.5063/f1bz63z8 vs on knb: https://knb.ecoinformatics.org/#view/doi:10.5063/F1BZ63Z8). In both cases, there's a metadata file we can parse for identifiers to the components, but it's different structures and actually in both cases it's not obvious how to translate those into downloads. (Actually all Dryad entries are in DataONE anyway, so we can download any of these using the Basically, I'd like to see this same ability DataONE has to return Of course with zenodo / github you're just stuck downloading the whole |
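To illustrate the Dryad case described above, where the DataCite XML lists the component parts as related works, here is a minimal Python sketch of pulling those part identifiers out of a record. The sample record, its DOIs, and the function name `part_dois` are made up for illustration, and the sketch assumes parts appear as `relatedIdentifier` elements with `relationType="HasPart"`; a real record should be checked before relying on that.

```python
import xml.etree.ElementTree as ET

# Illustrative record only: the DOIs below are placeholders, not real datasets.
SAMPLE_XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example.pkg</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">10.1234/example.pkg/1</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">10.1234/example.pkg/2</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCitedBy">10.1234/example.paper</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def part_dois(datacite_xml):
    """Return the DOIs that a DataCite XML record lists as its parts."""
    root = ET.fromstring(datacite_xml)
    parts = []
    for element in root.iter():
        # Drop the "{namespace}" prefix so any schema kernel version matches.
        local_name = element.tag.rsplit("}", 1)[-1]
        if (
            local_name == "relatedIdentifier"
            and element.get("relationType") == "HasPart"
            and element.get("relatedIdentifierType") == "DOI"
        ):
            parts.append((element.text or "").strip())
    return parts


print(part_dois(SAMPLE_XML))
```

This only recovers identifiers for the parts. As noted above, those DOIs still resolve to HTML landing pages, so a download URL has to come from somewhere else.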
For a Zenodo-Github repo one could in theory get the individual file from GitHub. Of course, that's no guarantee the file will still be there, but the commit hash should at least ensure that if it is, it's the right one. There could be a failsafe otherwise. |
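A rough Python sketch of the idea above: build a URL for a single file pinned to a commit hash, using GitHub's raw-content host. The owner, repository, commit, and path values are placeholders, and `github_raw_url` is a hypothetical helper, not part of any existing package.

```python
from urllib.parse import quote


def github_raw_url(owner, repo, commit, path):
    """Build a raw-file URL pinned to one commit (hypothetical helper)."""
    # quote() keeps "/" by default, so nested paths such as "data/a.csv" survive.
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        quote(owner), quote(repo), quote(commit), quote(path)
    )


# Placeholder values for illustration only.
url = github_raw_url("example-owner", "example-repo", "0123abc", "data/table 1.csv")
print(url)
```

Pinning to the commit means that if the file is still there, it is the archived version. A client would still need the failsafe mentioned above, such as falling back to the full archive, when the request fails.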
Very interesting discussion and good timing. DataCite will work on this very topic in the coming months thanks to a new grant, and my initial thoughts and feedback from @maxogden and @cameronneylon are at datacite/freya#2. My ideas are around content negotiation using the |
I would treat DOIs for software hosted in code repositories differently, as they have standard ways to get to the content, and we should support that in DOI metadata, e.g. adding the commit hash as |
@mfenner Hooray, thanks! Content negotiation w/ Being able to identify that an object is SoftwareSourceCode and get a |
(If we need to do Bagit creation client side we already have https://github.com/ropensci/datapack) |
@sckott datapack looks great! As @cameronneylon pointed out in the issue I referenced, there is both a data consumer and data producer side to this. I would do two small modifications to the bagit standard: include the DataCite metadata as XML file (to avoid extra effort), and zip the bagit as We will have a project kickoff meeting next Wednesday and I can report on any progress. You can also follow along via datacite/freya#2 and related issues. |
@noamross you raise an excellent point here that despite a DOI being the canonical way to refer to / access a dataset, there really is no well-defined / machine-readable way to determine a download URL for the data object: DOIs usually redirect to human-readable pages only. I'm really curious what @mfenner thinks about this; seems to me that this would ideally be addressed by the DOI system itself; but perhaps there's a good argument against that.
To me, the ideal solution would just be some content negotiation directly against the DOI, e.g. including
"Accept: text/csv"
(or maybe something more generic) in your GET header would get the data resource itself.
Alternately, it would be nice if the metadata returned by DataCite's existing Content Negotiation system included the data download URL, e.g. following the http://schema.org/Dataset which DataCite already uses, it could add (as Google recommends):
from which we could infer the download url(s). A third option would be for this kind of structured information to be provided in a machine-readable way but from the data repository (e.g. after resolving the DOI to its HTML redirect) rather than at DataCite's end.
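The snippet that originally followed "as Google recommends" did not survive in this copy of the post. As a hedged illustration of the second option, the Python sketch below assumes the metadata is schema.org `Dataset` JSON-LD carrying a `distribution` of `DataDownload` entries with `contentUrl` fields. The sample record, its URLs, and the `download_urls` function are invented for illustration.

```python
import json

# Illustrative record only: identifiers and URLs are placeholders.
SAMPLE_JSONLD = """
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.1234/example",
  "distribution": [
    {"@type": "DataDownload", "encodingFormat": "text/csv",
     "contentUrl": "https://repository.example.org/files/table.csv"},
    {"@type": "DataDownload", "encodingFormat": "application/zip",
     "contentUrl": "https://repository.example.org/files/all.zip"}
  ]
}
"""


def download_urls(jsonld_text, encoding_format=None):
    """Collect contentUrl values, optionally filtered by encodingFormat."""
    record = json.loads(jsonld_text)
    distribution = record.get("distribution", [])
    if isinstance(distribution, dict):
        # A single distribution may be given as an object instead of a list.
        distribution = [distribution]
    urls = []
    for item in distribution:
        url = item.get("contentUrl")
        if url and encoding_format in (None, item.get("encodingFormat")):
            urls.append(url)
    return urls


print(download_urls(SAMPLE_JSONLD, "text/csv"))
```

Filtering on `encodingFormat` is one way a client could target a single file or the zipped package, assuming the repository supplies that field.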
Absent any of these server-side solutions, I agree that there would be immediate value in an R package to provide a way for R users at least to script access to data from the DOI without having to research each repository's API or available packages first.
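For comparison, the content-negotiation option would need very little client code. The sketch below is in Python rather than R and only constructs the requests without sending them. The `build_doi_request` helper is hypothetical, the metadata media type is the one that appears in the data.datacite.org URL quoted in the thread, and the `text/csv` request reflects the proposal in this post, not behavior any resolver is known to support.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"


def build_doi_request(doi, accept):
    """Build (but do not send) a GET request for a DOI with an Accept header."""
    url = DOI_RESOLVER + quote(doi, safe="/")
    return Request(url, headers={"Accept": accept})


# Metadata request, using the media type seen in the thread's DataCite URL.
metadata_req = build_doi_request(
    "10.5061/dryad.2k462", "application/vnd.datacite.datacite+xml"
)

# Proposed data request: an assumption about a future service, not current behavior.
data_req = build_doi_request("10.5061/dryad.2k462", "text/csv")

print(data_req.full_url, data_req.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the call. Today the data request would most likely just redirect to the HTML landing page.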