
How should one get a data download URL from a DOI? #1

Open

cboettig opened this issue Jan 3, 2018 · 9 comments

cboettig commented Jan 3, 2018

@noamross you raise an excellent point here: despite a DOI being the canonical way to refer to and access a dataset, there is really no well-defined, machine-readable way to determine a download URL for the data object, since DOIs usually redirect to human-readable landing pages only. I'm really curious what @mfenner thinks about this; it seems to me this would ideally be addressed by the DOI system itself, but perhaps there's a good argument against that.

To me, the ideal solution would just be some content negotiation directly against the DOI, e.g. including "Accept: text/csv" (or maybe something more generic) in the GET request headers would return the data resource itself.
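A minimal sketch of what that could look like from R, assuming the resolver honored such a request (it does not today; for datasets, doi.org only returns an HTML redirect):

    # Hypothetical: ask the DOI resolver for the data itself via the Accept header.
    # This is NOT currently supported; the DOI below is purely for illustration.
    library(httr)
    resp <- GET("https://doi.org/10.5061/dryad.2k462",
                add_headers(Accept = "text/csv"))
    stop_for_status(resp)
    writeBin(content(resp, as = "raw"), "dataset.csv")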

Alternately, it would be nice if the metadata returned by DataCite's existing content negotiation system included the data download URL; e.g. following the http://schema.org/Dataset vocabulary that DataCite already uses, it could add (as Google recommends):

 "distribution":[
     {
        "@type":"DataDownload",
        "encodingFormat":"CSV",
        "contentUrl":"http://www.ncdc.noaa.gov/stormevents/ftp.jsp"
     },

from which we could infer the download URL(s). A third option would be for this kind of structured information to be provided in a machine-readable way by the data repository itself (i.e. after resolving the DOI to its HTML redirect) rather than at DataCite's end.
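For illustration, a rough sketch of consuming that from R, using the schema.org JSON-LD that DataCite's content negotiation already serves; the distribution/contentUrl lookup is the assumption here, since most records don't carry it yet:

    # Fetch schema.org JSON-LD for a DOI via DataCite content negotiation,
    # then look for a distribution/contentUrl (absent from most records today).
    library(httr)
    library(jsonlite)
    resp <- GET("https://doi.org/10.5061/dryad.2k462",
                add_headers(Accept = "application/vnd.schemaorg.ld+json"))
    stop_for_status(resp)
    meta <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    meta$distribution$contentUrl  # NULL when no distribution is listed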

Absent any of these server-side solutions, I agree there would be immediate value in an R package that gives R users, at least, a way to script access to data from a DOI without having to research each repository's API or available packages first.


noamross commented Jan 3, 2018

I'm not sure how this would solve the multiple-file problem. There are lots of repos with multiple files under one DOI. One answer would be to download a zipped directory, either in all cases (simplest) or just in the case of multiple files. But it seems useful to be able to target one file as well.


sckott commented Jan 3, 2018

100% agree that content negotiation would be a great way to sort this out. doesn't seem like that's possible right now.

short of that, i can envision client-side mappings between data providers and how to get URLs for actual data files - those mappings could be in JSON, e.g., so they are language-agnostic
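e.g. such a mapping file might look something like this (purely hypothetical - every field name and pattern below is invented for illustration):

    {
      "providers": [
        {
          "name": "example-repo",
          "doi_prefixes": ["10.9999"],
          "download_url_template": "https://repo.example.org/files/{doi}",
          "supports_single_file": true
        }
      ]
    }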


cboettig commented Jan 3, 2018

Re targeting one file: I agree this would be nice; it's basically up to the repository.

For instance, Dryad gives a DOI to the whole package and individual DOIs to each of the component parts, so the user could use the package DOI if they want the whole package, or the DOI for a particular csv if they want that csv. (Currently they still have to resolve download URLs though, as those DOIs all resolve only to HTML landing pages.)

In this case, DataCite actually does list each of the parts as related works in the datacite-xml version from CN, e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.5061/dryad.2k462

However, that's not always the case, of course. E.g. KNB gives unique IDs to all the parts (e.g. to each csv file), but I believe only the package as a whole gets an actual registered DOI, and thus the DataCite record has no information about these additional parts (e.g. compare, on DataCite, https://search.datacite.org/works/10.5063/f1bz63z8 vs on KNB: https://knb.ecoinformatics.org/#view/doi:10.5063/F1BZ63Z8).

In both cases, there's a metadata file we can parse for identifiers to the components, but the structures differ, and in both cases it's actually not obvious how to translate those identifiers into downloads. (All Dryad entries are in DataONE anyway, so we can download any of these using the dataUrl provided by a DataONE solr query of the DOI / other identifier.)
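That DataONE lookup is roughly the following (a sketch against the DataONE CN solr endpoint; it assumes the identifier is indexed and exposes a dataUrl field):

    # Sketch: ask the DataONE coordinating node's solr index for the dataUrl
    # of a given identifier (works only for objects in the DataONE network).
    library(httr)
    library(jsonlite)
    resp <- GET("https://cn.dataone.org/cn/v2/query/solr/",
                query = list(q = 'identifier:"doi:10.5063/F1BZ63Z8"',
                             fl = "identifier,dataUrl", wt = "json"))
    stop_for_status(resp)
    res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    res$response$docs$dataUrl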

Basically, I'd like to see this same ability DataONE has to return dataUrl given an identifier implemented at the DataCite level, so that it worked on any DataCite DOI and not just those in the DataONE network.

Of course with Zenodo / GitHub you're just stuck downloading the whole .zip anyway, which simplifies things but makes it impossible to request just one file.


noamross commented Jan 3, 2018

For a Zenodo-GitHub repo one could in theory get the individual file from GitHub. Of course, there's no guarantee the file will still be there, but the commit hash should at least ensure that if it is, it's the right one. There could be a failsafe otherwise.
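Something along these lines, with owner, repo, hash, and path all standing in as placeholders:

    # Sketch: fetch a single file from GitHub pinned to a commit hash.
    # "owner", "repo", the hash, and the file path are all placeholders.
    library(httr)
    url <- paste0("https://raw.githubusercontent.com/",
                  "owner/repo/abc1234def5678/data/file.csv")
    resp <- GET(url)
    stop_for_status(resp)  # errors if the repo or file has disappeared
    writeBin(content(resp, as = "raw"), "file.csv")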


mfenner commented Jan 3, 2018

Very interesting discussion and good timing. DataCite will work on this very topic in the coming months thanks to a new grant; my initial thoughts, plus feedback from @maxogden and @cameronneylon, are at datacite/freya#2. My ideas are around content negotiation using the application/zip content type and using BagIt for a basic description and checksums.


mfenner commented Jan 3, 2018

I would treat DOIs for software hosted in code repositories differently, as they have standard ways to get to the content, and we should support that in DOI metadata, e.g. by adding the commit hash as a related_identifier.


cboettig commented Jan 3, 2018

@mfenner Hooray, thanks! Content negotiation w/ application/zip type + BagIt sounds great to me.

Being able to identify that an object is SoftwareSourceCode and get a related_identifier from which it could be installed is nice; though of course it would also be good to always be able to just get the BagIt zip file of the source code directly from the data repository (e.g. for archived software that ceases to be available from those more standard channels).


sckott commented Jan 3, 2018

(If we need to do BagIt creation client-side, we already have https://github.com/ropensci/datapack.)


mfenner commented Jan 3, 2018

@sckott datapack looks great! As @cameronneylon pointed out in the issue I referenced, there is both a data consumer and data producer side to this.

I would make two small modifications to the BagIt standard: include the DataCite metadata as an XML file (to avoid extra effort), and zip the bag as application/zip (I think this is not part of the BagIt spec). And I like a low-tech approach that doesn't create hurdles, so no schema.org or other JSON-LD. I would use content negotiation with application/zip for backwards compatibility, but would like to explore other ways (e.g. providing a contentUrl in the metadata or using a Content-Location header).
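For concreteness, the zipped bag might then be laid out roughly like this (a sketch: bagit.txt, the manifest, and data/ come from the BagIt spec; the datacite.xml tag file is the proposed addition):

    bag.zip
    ├── bagit.txt            # BagIt declaration (version, encoding)
    ├── manifest-md5.txt     # checksums for everything under data/
    ├── datacite.xml         # proposed addition: the DataCite metadata
    └── data/
        └── mydata.csv       # the actual content file(s)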

We will have a project kickoff meeting next Wednesday and I can report on any progress. You can also follow along via datacite/freya#2 and related issues.
