Distribution composed of more than one file, but not packaged #482

Closed
dr-shorthair opened this issue Oct 22, 2018 · 27 comments
@dr-shorthair
Contributor

dr-shorthair commented Oct 22, 2018

A Distribution may be composed of multiple files which cannot be used independently, such as a shapefile and its attendant sidecars (index and database files). These might not be packaged into a single distributable artefact, such as a tar or zip archive (see #54 and #259). So a dataset's distribution, while a single entity, is composed of multiple artefacts. We need to show patterns for how these will appear in a catalog.
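To make the problem concrete, here is a minimal Turtle sketch (all URIs hypothetical) of a shapefile distribution forced into the current model by repeating dcat:downloadURL -- which sits uneasily with downloadURL's single-file definition:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

ex:roads-dataset a dcat:Dataset ;
    dct:title "Road network" ;
    dcat:distribution ex:roads-shapefile .

# One logical distribution, physically three files (.shp geometry,
# .shx index, .dbf attributes) that cannot be used independently.
# Repeating dcat:downloadURL is one way to force this into the current
# model, but downloadURL is defined for a single downloadable file.
ex:roads-shapefile a dcat:Distribution ;
    dct:title "Road network (ESRI Shapefile)" ;
    dcat:downloadURL <http://example.org/files/roads.shp> ,
                     <http://example.org/files/roads.shx> ,
                     <http://example.org/files/roads.dbf> .
```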

@agbeltran
Member

One particular case of a distribution with multiple files is that of checksum files. This is something we want to include in DATS (see datatagsuite/schema#11) and it would be good to have a specific vocabulary to refer to them.
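As a rough sketch of what that could look like (URIs hypothetical, and borrowing spdx:checksum from the SPDX vocabulary rather than anything currently in DCAT):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix spdx: <http://spdx.org/rdf/terms#> .
@prefix ex:   <http://example.org/> .

ex:data-csv a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/data.csv> ;
    # checksum carried as structured metadata rather than a sidecar file
    # (checksum value illustrative)
    spdx:checksum [
        a spdx:Checksum ;
        spdx:algorithm spdx:checksumAlgorithm_sha256 ;
        spdx:checksumValue "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    ] .
```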

@agreiner
Contributor

agreiner commented Nov 7, 2018

I have a use case where we release log files as datasets. One "log" could contain hundreds of raw files, making it impractical to treat each as a separate dataset. They are often grouped in multiple layers of directories, such as a directory per cabinet, within that a directory per chassis, within that a directory per slot, and so on. We've been releasing them as a single gzip, but they are getting so big that a simple HTTP download is impractical. (We offer Globus instead.)

@makxdekkers
Contributor

This seems to be a completely new requirement. And, given the resource and time constraints, I think it will be hard to incorporate a solution in the current revision of DCAT. I also think that it might be difficult to define a single approach/vocabulary that covers all possible relationships -- even the three examples from @dr-shorthair, @agbeltran and @agreiner seem to require different solutions.
In the current revision, you could use dcat:accessURL to point to a landing page that links to the various files and explains their relationships. The use of dcat:downloadURL is only possible for a single file according to its definition: "The URL of the downloadable file in a given format".
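For example (hypothetical URIs), the distinction would look like this:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# downloadURL: a single downloadable file
ex:dist-zip a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/data.zip> .

# accessURL: a page that lists the individual files and explains
# how they relate to each other
ex:dist-files a dcat:Distribution ;
    dcat:accessURL <http://example.org/data/files/index.html> .
```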
Should we leave this for future work?

@dr-shorthair
Contributor Author

dr-shorthair commented Nov 7, 2018

@agreiner I wonder if this is just another case of part-whole, as discussed in #411 (proposed new property dcat:componentDistribution)?

@makxdekkers Yes - I tend to agree - our docket is quite full. However, perhaps we can recommend the solution agreed for #256 (i.e. use dct:relation) and leave stronger semantics to another day.
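A sketch of the two options (all URIs hypothetical; the second uses the property proposed in #411, which is not part of DCAT, and its exact domain/range may differ from what is shown):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# Option 1: weak semantics, available today (per #256)
ex:shapefile-dist a dcat:Distribution ;
    dct:relation ex:shp-file , ex:shx-file , ex:dbf-file .

# Option 2: stronger part-whole semantics, using the property proposed
# in #411 (not part of DCAT)
ex:shapefile-dist2 a dcat:Distribution ;
    dcat:componentDistribution ex:shp-component , ex:shx-component , ex:dbf-component .
```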

@agreiner
Contributor

agreiner commented Nov 7, 2018

dcat:componentDistribution seems like it would work for my use case. dct:relation, if it says nothing about how the files are related, would not. The tricky bit is finding a place to use componentDistribution. I don't fancy giving people separate RDF descriptions for each of the hundreds of files in a set. I would really like to have a property of a distribution that is a list of component parts, by relative path, sort of a manifest, or maybe even just a URI for a manifest. But we are creating our own vocabulary for handling log files, so we will cover this ourselves if DCAT can't.
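To make the manifest idea concrete, a sketch with a purely hypothetical ex:manifest property (URIs hypothetical, nothing here is in DCAT):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:log-dist a dcat:Distribution ;
    dcat:accessURL <http://example.org/logs/2018/> ;
    # hypothetical property: a single URI for a manifest listing the
    # component files by relative path, instead of a separate RDF
    # description for each of the hundreds of files
    ex:manifest <http://example.org/logs/2018/MANIFEST.txt> .
```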

@makxdekkers
Contributor

@agreiner You can always use dcat:accessURL and link to a page or directory where your files live, with a README file to explain what the files are. Or even use dcat:landingPage on the Dataset description to link to such a page.
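For example (hypothetical URIs):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:log-dataset a dcat:Dataset ;
    # landing page with a README-style explanation of the files
    dcat:landingPage <http://example.org/logs/2018/README.html> ;
    dcat:distribution [
        a dcat:Distribution ;
        # directory listing where the individual files live
        dcat:accessURL <http://example.org/logs/2018/>
    ] .
```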

@agreiner
Contributor

agreiner commented Nov 7, 2018

That would certainly work for the purposes of getting things into a catalog in a helpful way. What I've been hoping for is a way to reason about these things using RDF. I realize the main purpose of DCAT is just to let people catalog datasets, but it is so close to being useful for this use case, and there are now several related use cases that would benefit, that it seems a shame not to seize the opportunity.

@davebrowning
Contributor

While it's clear that we have quite a bit to do, I'd prefer that we don't absolutely rule a problem area out quite yet. We have a list of high-priority requirements referenced here, as discussed and agreed at the F2F, and a target date of mid-January for the rec-track work. We've also talked about generating more examples and/or a primer after that date. That plan gives us some flexibility in how we address this issue - extend the timescales of the rec-track work, provide examples and suggestions, or leave it for a further iteration. (All subject to agreement within the WG and the broader W3M.)

On this specific case, I agree with @agreiner's comment - it would be good to 'seize the opportunity' - and it would be great if @dr-shorthair's pithy summary, "a dataset's distribution, while a single entity, is composed of multiple artefacts", found its way into the recommendation, even if just as a note. I don't think it has higher priority than the requirements that we're focussing on now, but I hope we have the luxury of deferring any final inclusion decisions to January.

@smrgeoinfo
Contributor

For distributions that are aggregations, pointing to something like an OAI-ORE resource map would be a solution. See also the DataONE discussion of packaging for an implementation.
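A rough sketch of that pattern, combining DCAT with the OAI-ORE vocabulary (URIs hypothetical):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ore:  <http://www.openarchives.org/ore/terms/> .
@prefix ex:   <http://example.org/> .

ex:dist a dcat:Distribution ;
    # point at an ORE resource map describing the aggregation of files
    dcat:accessURL ex:resource-map .

ex:resource-map a ore:ResourceMap ;
    ore:describes ex:aggregation .

ex:aggregation a ore:Aggregation ;
    ore:aggregates <http://example.org/files/roads.shp> ,
                   <http://example.org/files/roads.shx> ,
                   <http://example.org/files/roads.dbf> .
```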

@makxdekkers
Contributor

Based on the discussion at https://www.w3.org/2019/02/05-dxwgdcat-minutes, I will draft a proposal suggesting that the situation where several files need to be considered together could be handled by using dcat:accessURL to link to a page that lists the files and their relationships.

@makxdekkers
Contributor

I think this is taken care of in #730?

@davebrowning added the due for closing label Feb 26, 2019
@davebrowning
Contributor

Addressed in #746 and closing as agreed at https://www.w3.org/2019/02/27-dxwgdcat-minutes#x08.

@davebrowning removed the due for closing label Feb 28, 2019
@agreiner
Contributor

Hm, #746 doesn't really address my use case. Indicating the compression and packaging algorithms used to combine multiple files into one doesn't help with the case where we have multiple separate files and don't want to combine them.
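For contrast, here is a sketch of what #746 does cover versus my case (URIs and media-type URIs illustrative, assuming the packaging/compression properties discussed there):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# What #746 covers: many files combined into one packaged, compressed file
ex:packaged-dist a dcat:Distribution ;
    dcat:downloadURL    <http://example.org/logs/2018.tar.gz> ;
    dcat:packageFormat  <https://www.iana.org/assignments/media-types/application/x-tar> ;
    dcat:compressFormat <https://www.iana.org/assignments/media-types/application/gzip> .

# The open case: many separate files, deliberately not combined
ex:unpackaged-dist a dcat:Distribution ;
    dcat:accessURL <http://example.org/logs/2018/> .
```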

@makxdekkers
Contributor

@agreiner I agree that #746 doesn't address your use case. As far as I understand, a solution using dcat:accessURL or dcat:landingPage would give you a way to link to a set of files, although I admit that this places the solution outside of DCAT.
Would something like an rdf:Bag help here?
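Something along these lines (hypothetical URIs; rdf:Bag is plain RDF, not a DCAT construct, and the dct:relation link is just one possible way to attach it):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

ex:log-dist a dcat:Distribution ;
    dct:relation ex:log-files .

# unordered container holding the individual files
ex:log-files a rdf:Bag ;
    rdf:_1 <http://example.org/logs/cabinet1/chassis1/slot1.log> ;
    rdf:_2 <http://example.org/logs/cabinet1/chassis1/slot2.log> ;
    rdf:_3 <http://example.org/logs/cabinet1/chassis2/slot1.log> .
```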

@davebrowning reopened this Mar 1, 2019
@davebrowning
Contributor

davebrowning commented Mar 1, 2019

Re-opening issue to track discussion between @agreiner and @makxdekkers - unless you want to open a separate new issue to discuss that use case.

@agreiner
Contributor

agreiner commented Mar 1, 2019

I'm interested in finding a way that specific files within a dataset can be described in DCAT such that they can be reasoned about as individual entities. That would allow us to store metadata for each file from a set of log files and perform queries about temporal coverage or subject. Since we've now agreed that distributions can be informationally different, it seems that they come very close to fitting the bill. We would just need to add attributes to a distribution that distinguish them from each other. Perhaps temporal coverage and subject.
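A sketch of the kind of description I have in mind (URIs hypothetical; note that dct:subject and dct:temporal on a Distribution are not something DCAT currently sanctions, and the start/end date properties are those in the DCAT 2 draft):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

ex:system-logs a dcat:Dataset ;
    dcat:distribution ex:node-17-logs , ex:node-18-logs .

# each "distribution" describes the log file for one node, so it can be
# queried individually for temporal coverage and subject
ex:node-17-logs a dcat:Distribution ;
    dcat:downloadURL <http://example.org/logs/node17.log> ;
    dct:subject ex:node-17 ;
    dct:temporal [ a dct:PeriodOfTime ;
        dcat:startDate "2018-11-01"^^xsd:date ;
        dcat:endDate   "2018-11-30"^^xsd:date ] .

# ex:node-18-logs would be described analogously
```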

@makxdekkers
Contributor

@agreiner I agree with you that this is an interesting discussion, and we have addressed it in several issues over the last year, e.g. #52, #317, #411, #531.
@dr-shorthair came to a conclusion at #531 (comment): "the consensus is that anything short of losslessly-convertible would be use-case specific". @davebrowning wrote at #411 (comment): "As it stands this appears to be an issue best addressed using profiles".
I don't know what more we can do on this, and I'd like to hand it back to the editors @agbeltran, @davebrowning, @dr-shorthair and @pwin.

@agbeltran
Member

While we agreed that we shouldn't require distributions to be informationally equivalent, since in some cases we need more flexibility, we also discussed that the distinction between dataset and distribution is important.

Ensuring that distributions are informationally equivalent would potentially require automated transformations between them. So we still leave it to the data providers' judgement to determine what can be distributions of a dataset, but we don't want to encourage a dataset having multiple unrelated distributions.

Including at the distribution level properties that currently belong to the dataset level (such as subject and temporal coverage) would blur the distinction between dataset and distribution. Thus, if you need to specify the subject or temporal coverage (or similar properties) of a distribution, you should consider whether they are actually distributions of different datasets.

@agbeltran
Member

@agreiner maybe you have in mind a specific example where the distinction between dataset and distribution is maintained and there is still a need to provide more details about the distribution. If so, please let us know; otherwise I think that adding more properties to the distribution will not be helpful.

@agreiner
Contributor

agreiner commented Mar 2, 2019

My use case here is the one I described at the top of the thread, where we have log data for supercomputing systems. I realize this is a very specific domain, so I wanted to suggest properties that would be more general. By "subject", I meant to identify the particular node for which the log file records data. It is "subject" in the sense of a thing being operated upon, not an abstract topic or domain. The nodes are all part of the same system, so they are indeed very closely related. A set of files for different nodes all record the same fields. You could think of the nodes as parts of a whole, so maybe "part" or "partOfWhole" would make more sense; others may have better ideas as to what to name such a property. I just didn't want to be as domain-specific as to say "node", of course. I suppose similar issues could be addressed in geographic datasets by allowing locales to be identified as parts. Back when distributions just varied by media type, they had a property to describe the media type. It just seems that once we choose to allow them to differ in more ways, we should provide analogous properties to describe these new ways of differing. We seem to be okay with adding spatial and temporal resolution properties to fulfill that need; why not enable other common differences as well, if it can be done cleanly?

@makxdekkers
Contributor

@agreiner As far as I can see, there are two ways to understand "not informationally equivalent". You can read it as "don't have to be the same data", as in your case where the same kind of data is recorded for different entities, such as 'nodes', 'sensors', 'stations' or what have you; or you can read it in the sense of "not exactly the same", for example as a result of profiling or lossy transformation.
From my understanding of the earlier discussion, the consensus seemed to be the second interpretation, because we felt that requiring exact equivalence was too strict -- but I don't think we agreed that distributions can contain different data.

@agbeltran
Member

agbeltran commented Mar 5, 2019

Further to @makxdekkers' comment: indeed, by saying that distributions don't need to be strictly, fully informationally equivalent, we don't mean that they can hold totally different data (and I am not talking about the data type, as in @agreiner's example of log files, but about the data itself). So, in your use case @agreiner, those would be different datasets rather than different distributions of the same dataset.

The ED currently states:

In some cases all distributions of a dataset will be fully informationally equivalent, in the sense that lossless transformations between the representations are possible. An example would be different serializations of an RDF graph using RDF/XML, Turtle, N3, JSON-LD. However, in other cases the distributions might have different levels of fidelity to the underlying data. For example, a graphical representation alongside a CSV file. The question of whether different representations can be understood to be distributions of the same dataset is use-case specific, so the judgement is the responsibility of the provider.

In my opinion, that text clarifies the points made a few times in this discussion. The example of a CSV file alongside a graphical representation shows that different representations may not convey identical information, but we are not implying that they can be totally different.
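For instance, the lossless case from that note would look something like this (hypothetical URIs):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# informationally equivalent distributions: the same RDF graph
# serialized in two different formats
ex:vocab a dcat:Dataset ;
    dcat:distribution ex:vocab-ttl , ex:vocab-jsonld .

ex:vocab-ttl a dcat:Distribution ;
    dcat:mediaType   <https://www.iana.org/assignments/media-types/text/turtle> ;
    dcat:downloadURL <http://example.org/vocab.ttl> .

ex:vocab-jsonld a dcat:Distribution ;
    dcat:mediaType   <https://www.iana.org/assignments/media-types/application/ld+json> ;
    dcat:downloadURL <http://example.org/vocab.jsonld> .
```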

@agreiner do you think we need to add further clarifications on this? If so, can you please suggest some text? Thanks

@agbeltran
Member

Link to the relevant section in the ED: https://w3c.github.io/dxwg/dcat/#Class:Distribution

@agbeltran
Member

I added a bit more detail in the note about distributions - see PR: #789

@agreiner
Contributor

agreiner commented Mar 7, 2019

I think this is much clearer now. Thanks! It looks like our log files use case is going to be awkward for DCAT, but we are developing our own extension that should work.

@davebrowning modified the milestones: DCAT Backlog, DCAT CR Mar 14, 2019
@davebrowning added the due for closing label Apr 2, 2019
@davebrowning
Contributor

This now looks ready to close - the ED is clear enough about what it means. There might be additional requirements around the log files use case, but that would be future work.

@davebrowning removed the due for closing label Apr 8, 2019