Further clarification for distributions #789

agbeltran · 2019-03-05T21:56:14Z

Extended the clarification on not fully informationally equivalent distributions related to discussion in #482

Preview in the note after the Distribution definition: https://rawgit.com/w3c/dxwg/agb-issue-482/dcat/index.html#Class:Distribution

makxdekkers · 2019-03-06T08:58:45Z

I am just wondering if we would want to make the note even more clear by explicitly saying that all distributions of a dataset should contain (be about?) the same data. Maybe even give an example like "For example, budget data for different years or observations from different sensors should be modelled as different datasets"?

dr-shorthair

The clarification is good. Makx's suggestion could be an additional note, or perhaps just a sentence following?

agbeltran · 2019-03-06T10:42:59Z

Yes, thanks, I agree and will add Makx's sentence as a further clarification in the same note.

andrea-perego · 2019-03-07T00:17:06Z

@makxdekkers, I'm not sure I agree with the recommendation you propose. IMO, also in the scenario you mention, the decision is up to the data provider, and it ends up depending on the dataset granularity used in different communities, and on how the data are supposed to be used.

makxdekkers · 2019-03-07T09:26:25Z

@andrea-perego Yes, I see what you mean. It is basically the tension between interoperability and flexibility. Not including this clarification means that people can argue that the specification does not explicitly recommend against having distributions with different data. In any case, even if the specification includes the clarification, people will still do what they want. The clarification just intends to help people who are looking for advice on how to do it to ensure maximum interoperability.

andrea-perego · 2019-03-07T21:10:04Z

@makxdekkers , I totally agree in providing guidance, but in cases like this one I think we should provide alternatives.

Specifically about time series, there's both the option of having different datasets or different distributions of the same dataset. And there are also cases where there's only one dataset with one distribution that is updated every year.

Moreover, the problem I see when we recommend using different datasets is that a user, finding just one of them, has no clue that other datasets exist about earlier / later years, unless the metadata include a specific relationship that makes this explicit. But this is not part of the current DCAT spec, and such a feature is not commonly supported in existing catalogue platforms.

Note that I'm not recommending against this approach. Only, if we provide guidance, it should be clear which are the pros and cons.
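The two time-series options discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Dataset` and `Distribution` classes, the `related_titles` field, and the `example.org` URLs are hypothetical stand-ins, not DCAT terms or any library's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Distribution:
    title: str
    download_url: str


@dataclass
class Dataset:
    title: str
    years: Tuple[int, int]  # first and last year covered
    distributions: List[Distribution] = field(default_factory=list)
    # Hypothetical explicit link to sibling datasets (not a DCAT property).
    related_titles: List[str] = field(default_factory=list)


def csv_distribution(year: int) -> Distribution:
    # Placeholder URL, for illustration only.
    return Distribution(f"Budget {year} (CSV)", f"https://example.org/budget-{year}.csv")


def one_dataset_per_year(years: List[int]) -> List[Dataset]:
    """Option A: one dataset per year, each pointing at its siblings."""
    datasets = [Dataset(f"Budget {y}", (y, y), [csv_distribution(y)]) for y in years]
    for ds in datasets:
        ds.related_titles = [other.title for other in datasets if other is not ds]
    return datasets


def one_dataset_many_distributions(years: List[int]) -> Dataset:
    """Option B: a single dataset with one distribution per year."""
    return Dataset("Budget", (min(years), max(years)), [csv_distribution(y) for y in years])
```

In Option A a user who finds "Budget 2018" can discover the other years only through the explicit link; in Option B all years sit under one record, but the distributions no longer contain the same data.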

makxdekkers · 2019-03-07T22:45:46Z

@andrea-perego As far as I understand, the discussion that we've had reached a consensus that distributions under a dataset should all be about the same data. We established that differences between distributions might be the result of lossy translations, different profiles or different representations (e.g. spreadsheet versus graphic visualisation). If that was the consensus, my proposal was to make that consensus explicit in the clarification. In fact, the "for example" was just to reinforce the sentences right before it that state that distributions should be about the same data.
If you don't agree with the consensus, you probably also don't agree with the rest of the note at https://rawgit.com/w3c/dxwg/agb-issue-482/dcat/index.html#Class:Distribution.

agreiner · 2019-03-07T22:50:51Z

Hm, "the same data" isn't what you get with a different profile. It might be, but you might get completely disjoint sets with two different profiles. I'm pretty agnostic about how we ultimately define distributions, because both options have their own resulting trickle-down effects, some of which I like and some of which I don't. IMHO, one of the effects of saying that distributions must have the same data is that profiles can't define distributions.

smrgeoinfo · 2019-03-07T23:12:56Z

so maybe DCAT could recognize a 'series' as another kind of resource type?

andrea-perego · 2019-03-07T23:51:39Z

Thanks for pointing this out, @makxdekkers . Yes, I guess I'm not completely happy also with other points of the note. And, actually, I think there's another thing to be fixed, concerning dcat:Distribution vs dcat:DataDistributionService.

As you say, since your revision is implementing the current consensus, it should indeed be merged. I'll open two separate issues for further discussion.

makxdekkers · 2019-03-08T08:25:39Z

@agreiner When I wrote "same data", I was not implying that the output from conversion or profiling is the same -- you are right, the result could look quite different -- but that the input to the conversion or profiling is the same.
In other words, the conversion or profiling process is applied to the same data and might produce a result that is not exactly informationally equivalent.
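A small sketch may make the "same input, different output" point concrete. It assumes a made-up table of budget rows and two hypothetical conversion functions: one produces a full CSV serialization, the other a lossy per-department summary in JSON. Both start from the same data, yet the results are not informationally equivalent.

```python
import csv
import io
import json

# Made-up source data, for illustration only.
SOURCE_ROWS = [
    {"department": "health", "year": 2018, "amount": 120.5},
    {"department": "health", "year": 2019, "amount": 130.0},
    {"department": "roads", "year": 2018, "amount": 80.25},
]


def to_csv(rows):
    """Full-fidelity serialization: every row and column is kept."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["department", "year", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_totals_json(rows):
    """Lossy serialization: the per-year detail is dropped, only totals remain."""
    totals = {}
    for row in rows:
        totals[row["department"]] = totals.get(row["department"], 0.0) + row["amount"]
    return json.dumps(totals, sort_keys=True)
```

The CSV can be turned back into the original rows; the JSON summary cannot, which is the kind of difference the note describes between distributions of one dataset.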

makxdekkers · 2019-03-08T08:58:22Z

@agbeltran can you maybe have a look at where we are on this? If there are serious concerns about what I thought was consensus, and both @agreiner and @andrea-perego seem to disagree, maybe we should consider to either fall back to the silence of DCAT-2014, or see if we can, in the next few days, come up with text that gives advice for various approaches?
I am biased because of a discussion around the EU DCAT-AP where no consensus could be reached and people argued that whatever they did was legal. What I am trying to avoid is situations like for example https://datahub.io/sports-data/spanish-la-liga, where a bunch of files are thrown in distributions without descriptions -- in the example, a user would need to infer the season from the file name. But I must admit that this is what some publishers have, and recommending one approach or the other makes it harder for them to produce DCAT-conformant data. As I wrote earlier, it's the tension between interoperability -- trying to make the landscape more coherent -- and flexibility -- making it easier for publishers to do what they want and declare conformance with DCAT.

agreiner · 2019-03-08T20:03:27Z

To be clear, I'm not pushing back against saying that distributions can be informationally nonequivalent. I think the text now does a much better job than before of clarifying what we mean by that. I just meant to point out that using a different profile returns something more different than a distribution. I wonder if it would be helpful to think about profiles in terms of data services. In a way, they return subsets the way a service does. The service serves the dataset in its entirety, but individual queries return subsets.

andrea-perego · 2019-03-08T22:08:47Z

@makxdekkers said:

[...] maybe we should consider to [...] fall back to the silence of DCAT-2014 [...]

This is indeed an option. Re-thinking about this, and considering the possible different cases and approaches we can provide as examples, I wonder whether providing guidance on something that deals with data management practices is in the scope of DCAT. In DCAT-AP this has been done separately, with the work on the DCAT-AP Implementation Guidelines. So, it might be more in scope of a DCAT primer (although we may not be able to prepare it).

agbeltran · 2019-03-12T21:44:42Z

it seems that the controversial phrase is the example that @makxdekkers proposed, i.e. "For example, budget data for different years or observations from different sensors should be modeled as different datasets" - or @andrea-perego are you against the whole clarification about distributions?

andrea-perego · 2019-03-12T23:31:57Z

Re-reading the note, I think my main concern is on the proposed clarification:

For example, budget data for different years or observations from different sensors should be modeled as different datasets

This looks to me possibly conflicting with the preceding and following sentences:

In any case, all distributions of a dataset should broadly contain the same data.

The question of whether different representations can be understood to be distributions of the same dataset is use-case specific, so the judgement is the responsibility of the provider.

The clarification seems to say that "budget data for different years" and "observations from different sensors" are both to be considered as different data, which might be questionable (time series are supposed to follow the same data schema, data collection methodology, etc., whereas data from different sensors may have nothing in common).

Moreover, the clarification seems to contradict the second sentence - the one saying that it's eventually up to the data provider to decide.

That said, I do think the content of the note would be more in scope of a primer. The current definition and usage note of distribution are, IMO, good enough to clarify what it should be used for:

A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).

This represents a general availability of a dataset. It implies no information about the actual access method of the data, i.e. whether by direct download, API, or through a Web page. The use of dcat:downloadURL property indicates directly downloadable distributions.

BTW, I also have a concern about the last two paragraphs of the note. I reported this in a separate issue (#809).

dr-shorthair · 2019-03-16T01:15:40Z

@smrgeoinfo you might be right about 'Data Series' - this is perhaps a common application (I know it has its own slot in ISO 19115, for example). Perhaps you could write a new UC for this and we can put it on the backlog for the next (soon) revision.

But in general I agree with the original intention that different years-worth would typically correspond with different datasets. It's just that these datasets have a rather predictable relationship between them - i.e. they are part of a series in which only (e.g.) the temporal extent is different. We do have a general mechanism to deal with 'relationships between datasets' (i.e. qualified relations) but data-series is probably a special case that is worth giving special treatment.
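The "only the temporal extent is different" idea can be written as a simple check. This is a rough sketch, assuming dataset metadata is held in plain dictionaries with made-up keys (`title`, `publisher`, `temporal`); the helper `differs_only_in` is hypothetical and not part of DCAT or any tool.

```python
def differs_only_in(a: dict, b: dict, key: str) -> bool:
    """True when two metadata records match everywhere except under `key`."""
    other_keys = (set(a) | set(b)) - {key}
    same_elsewhere = all(a.get(k) == b.get(k) for k in other_keys)
    return same_elsewhere and a.get(key) != b.get(key)


# Made-up records, for illustration only.
budget_2018 = {"title": "Budget", "publisher": "Example Ministry", "temporal": "2018"}
budget_2019 = {"title": "Budget", "publisher": "Example Ministry", "temporal": "2019"}
sensor_data = {"title": "Air quality", "publisher": "Example Agency", "temporal": "2019"}

# Candidates for a series: same description, different temporal extent.
print(differs_only_in(budget_2018, budget_2019, "temporal"))
# Not a series: the records differ in more than the temporal extent.
print(differs_only_in(budget_2019, sensor_data, "temporal"))
```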

dr-shorthair · 2019-03-16T01:19:19Z

Where @makxdekkers had written

For example, budget data for different years or observations from different sensors should be modeled as different datasets

perhaps we could just tweak it to

"For example, budget data for different years would typically be modeled as different datasets"

(as a sensor-observations guy I think the discussion of different sensors is potentially much bigger and out of scope here.)

smrgeoinfo · 2019-03-16T14:53:12Z

@dr-shorthair, I'll get a UC into the backlog. Another common 'series' is satellite remotely sensed data: same sensor, but different spatial and temporal extents.

dr-shorthair · 2019-03-17T10:15:42Z

And @pwin (Scottish Government) has many!

pwin

looks clearer

Remove paras refering to change of scope

dr-shorthair · 2019-03-19T20:54:26Z

I've attempted to address these matters in #832

dr-shorthair · 2019-03-20T21:43:25Z

I would change my review to 'approve' if #832 is accepted

But I guess with a PR on the PR this will require two cycles of plenary approval to get through ...

davebrowning

Good addition, even better with @dr-shorthair 's additions (#832).

Does this mean we can drop the subsequent note about "intention of the phrase "informationally equivalent" needs to be clarified"? #411 is actually already closed and this PR improves on our description by discussing different levels of fidelity, making it explicit that it's down to the data provider to judge, and providing a counter example.

davebrowning · 2019-03-25T09:51:33Z

#411 note tidy up is now in #839, but it would be good to add @dr-shorthair's merge. I can do this if everyone (particularly @agbeltran) is okay with it?

agbeltran · 2019-03-26T18:24:58Z

this PR now also includes the changes by @dr-shorthair

Further clarification for distributions

aaa8718

agbeltran added dcat dcat:Distribution labels Mar 5, 2019

agbeltran requested review from dr-shorthair, agreiner, pwin and davebrowning March 5, 2019 21:56

agbeltran mentioned this pull request Mar 5, 2019

Distribution composed of more than one file, but not packaged #482

Closed

dr-shorthair requested changes Mar 6, 2019

agbeltran added 2 commits March 6, 2019 18:36

Further clarification based on Makx's comment

8bafb8d

Typo and removed repetition

eb2e574

davebrowning added this to the DCAT CR milestone Mar 14, 2019

davebrowning removed dcat labels Mar 14, 2019

davebrowning added dcat and removed dcat labels Mar 14, 2019

pwin approved these changes Mar 19, 2019

Simon Cox added 2 commits March 20, 2019 07:27

Merge branch 'gh-pages' into agb-issue-482

95d39bb

Clarify counter-example for distributions related to the same dataset

897d827

Remove paras refering to change of scope

dr-shorthair mentioned this pull request Mar 19, 2019

(Further) Clarifying notes about distribution #832

Merged

Simon Cox added 2 commits March 20, 2019 07:58

dropped a phrase

72e1922

Merge branch 'gh-pages' into agb-issue-482-simon

9ee11b5

Simon Cox added 2 commits March 21, 2019 08:43

Merge branch 'gh-pages' into agb-issue-482

a8577d3

Merge branch 'agb-issue-482' into agb-issue-482-simon

cc666e5

dr-shorthair added the critical defects that must be completed for CR label Mar 20, 2019

davebrowning requested changes Mar 21, 2019

dr-shorthair approved these changes Mar 26, 2019

davebrowning approved these changes Mar 26, 2019

Simon Cox and others added 2 commits March 27, 2019 08:28

Merge branch 'gh-pages' into agb-issue-482

3cf99c1

Merge branch 'gh-pages' into agb-issue-482

d742fb2

agbeltran merged commit 3c196e6 into gh-pages Mar 27, 2019

dr-shorthair mentioned this pull request Mar 31, 2019

Dataset series #868

Closed

aidig mentioned this pull request Oct 26, 2021

usage guide on dataset - distribution - data service SEMICeu/DCAT-AP#204

Closed
