Further clarification for distributions #789

Merged
merged 11 commits into from
Mar 27, 2019

Conversation

agbeltran
Member

Extended the clarification on distributions that are not fully informationally equivalent, related to the discussion in #482.

Preview in the note after the Distribution definition: https://rawgit.com/w3c/dxwg/agb-issue-482/dcat/index.html#Class:Distribution

@makxdekkers
Contributor

I am just wondering if we would want to make the note even more clear by explicitly saying that all distributions of a dataset should contain (be about?) the same data. Maybe even give an example like "For example, budget data for different years or observations from different sensors should be modelled as different datasets"?

Contributor

@dr-shorthair left a comment

The clarification is good. Makx's suggestion could be an additional note, or perhaps just a sentence following?

@agbeltran
Member Author

Yes, thanks, I agree and will add Makx's sentence as a further clarification in the same note.

@andrea-perego
Contributor

@makxdekkers, I'm not sure I agree with the recommendation you propose. IMO, even in the scenario you mention, the decision is up to the data provider, and it ends up depending on the dataset granularity used in different communities and on how the data are supposed to be used.

@makxdekkers
Contributor

@andrea-perego Yes, I see what you mean. It is basically the tension between interoperability and flexibility. Not including this clarification means that people can argue that the specification does not explicitly recommend against having distributions with different data. In any case, even if the specification includes the clarification, people will still do what they want. The clarification is just intended to help people who are looking for advice on how to do it in a way that ensures maximum interoperability.

@andrea-perego
Contributor

@makxdekkers, I totally agree with providing guidance, but in cases like this one I think we should provide alternatives.

Specifically about time series, there's the option of having either different datasets or different distributions of the same dataset. And there are also cases where there's only one dataset with one distribution that is updated every year.

Moreover, the problem I see when we recommend using different datasets is that a user who finds just one of them has no clue that other datasets exist for earlier or later years, unless the metadata include a specific relationship that makes this explicit. But this is not part of the current DCAT spec, and such a feature is not commonly supported in existing catalogue platforms.

Note that I'm not recommending against this approach. Only, if we provide guidance, it should be clear what the pros and cons are.

@makxdekkers
Contributor

@andrea-perego As far as I understand, the discussion that we've had reached a consensus that distributions under a dataset should all be about the same data. We established that differences between distributions might be the result of lossy translations, different profiles or different representations (e.g. spreadsheet versus graphic visualisation). If that was the consensus, my proposal was to make that consensus explicit in the clarification. In fact, the "for example" was just to reinforce the sentences right before it that state that distributions should be about the same data.
If you don't agree with the consensus, you probably also don't agree with the rest of the note at https://rawgit.com/w3c/dxwg/agb-issue-482/dcat/index.html#Class:Distribution.

@agreiner
Contributor

agreiner commented Mar 7, 2019

Hm, "the same data" isn't what you get with a different profile. It might be, but you might get completely disjoint sets with two different profiles. I'm pretty agnostic about how we ultimately define distributions, because both options have their own resulting trickle-down effects, some of which I like and some of which I don't. IMHO, one of the effects of saying that distributions must have the same data is that profiles can't define distributions.

@smrgeoinfo
Contributor

so maybe DCAT could recognize a 'series' as another kind of resource type?

@andrea-perego
Contributor

Thanks for pointing this out, @makxdekkers. Yes, I guess I'm also not completely happy with other points of the note. And, actually, I think there's another thing to be fixed, concerning dcat:Distribution vs dcat:DataDistributionService.

As you say, since your revision is implementing the current consensus, it should indeed be merged. I'll open two separate issues for further discussion.

@makxdekkers
Contributor

@agreiner When I wrote "same data", I was not implying that the output from conversion or profiling is the same -- you are right, the result could look quite different -- but that the input to the conversion or profiling is the same.
In other words, the conversion or profiling process is applied to the same data and might produce a result that is not exactly informationally equivalent.

@makxdekkers
Contributor

@agbeltran can you maybe have a look at where we are on this? If there are serious concerns about what I thought was consensus, and both @agreiner and @andrea-perego seem to disagree, maybe we should consider to either fall back to the silence of DCAT-2014, or see if we can, in the next few days, come up with text that gives advice for various approaches?
I am biased because of a discussion around the EU DCAT-AP where no consensus could be reached and people argued that whatever they did was legal. What I am trying to avoid is situations like for example https://datahub.io/sports-data/spanish-la-liga, where a bunch of files are thrown in distributions without descriptions -- in the example, a user would need to infer the season from the file name. But I must admit that this is what some publishers have, and recommending one approach or the other makes it harder for them to produce DCAT-conformant data. As I wrote earlier, it's the tension between interoperability -- trying to make the landscape more coherent -- and flexibility -- making it easier for publishers to do what they want and declare conformance with DCAT.

@agreiner
Contributor

agreiner commented Mar 8, 2019

To be clear, I'm not pushing back against saying that distributions can be informationally nonequivalent. I think the text now does a much better job than before of clarifying what we mean by that. I just meant to point out that using a different profile returns something more different than a distribution. I wonder if it would be helpful to think about profiles in terms of data services. In a way, they return subsets the way a service does. The service serves the dataset in its entirety, but individual queries return subsets.

@andrea-perego
Contributor

@makxdekkers said:

[...] maybe we should consider to [...] fall back to the silence of DCAT-2014 [...]

This is indeed an option. Rethinking this, and considering the possible different cases and approaches we can provide as examples, I wonder whether providing guidance on something that deals with data management practices is in the scope of DCAT. In DCAT-AP this has been done separately, with the work on the DCAT-AP Implementation Guidelines. So, it might be more in the scope of a DCAT primer (although we may not be able to prepare it).

@agbeltran
Member Author

it seems that the controversial phrase is the example that @makxdekkers proposed, i.e. "For example, budget data for different years or observations from different sensors should be modeled as different datasets" - or @andrea-perego are you against the whole clarification about distributions?

@andrea-perego
Contributor

Re-reading the note, I think my main concern is on the proposed clarification:

For example, budget data for different years or observations from different sensors should be modeled as different datasets

This looks to me possibly conflicting with the preceding and following sentences:

In any case, all distributions of a dataset should broadly contain the same data.

The question of whether different representations can be understood to be distributions of the same dataset is use-case specific, so the judgement is the responsibility of the provider.

The clarification seems to say that "budget data for different years" and "observations from different sensors" are both to be considered as different data, which might be questionable (time series are supposed to follow the same data schema, data collection methodology, etc., whereas data from different sensors may have nothing in common).

Moreover, the clarification seems to contradict the second sentence - the one saying that it's eventually up to the data provider to decide.

That said, I do think the content of the note would be more in the scope of a primer. The current definition and usage note of distribution are, IMO, good enough to clarify what it should be used for:

A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).

This represents a general availability of a dataset. It implies no information about the actual access method of the data, i.e. whether by direct download, API, or through a Web page. The use of dcat:downloadURL property indicates directly downloadable distributions.

BTW, I also have a concern about the last two paragraphs of the note. I reported this in a separate issue (#809).

@davebrowning added this to the DCAT CR milestone Mar 14, 2019
@davebrowning added dcat and removed dcat labels Mar 14, 2019
@dr-shorthair
Contributor

@smrgeoinfo you might be right about 'Data Series' - this is perhaps a common application (I know it has its own slot in ISO 19115, for example). Perhaps you could write a new UC for this and we can put it on the backlog for the next (soon) revision.

But in general I agree with the original intention that different years-worth would typically correspond with different datasets. It's just that these datasets have a rather predictable relationship between them - i.e. they are part of a series in which only (e.g.) the temporal extent is different. We do have a general mechanism to deal with 'relationships between datasets' (i.e. qualified relations) but data-series is probably a special case that is worth giving special treatment.

@dr-shorthair
Contributor

Where @makxdekkers had written

For example, budget data for different years or observations from different sensors should be modeled as different datasets

perhaps we could just tweak it to

"For example, budget data for different years would typically be modeled as different datasets"

(as a sensor-observations guy I think the discussion of different sensors is potentially much bigger and out of scope here.)

@smrgeoinfo
Contributor

@dr-shorthair, I'll get a UC into the backlog. Another common 'series' is satellite remotely sensed data: same sensor, but different spatial and temporal extents.

@dr-shorthair
Contributor

And @pwin (Scottish Government) has many!

Contributor

@pwin left a comment

looks clearer

@dr-shorthair
Contributor

I've attempted to address these matters in #832

@dr-shorthair
Contributor

I would change my review to 'approve' if #832 is accepted

But I guess with a PR on the PR this will require two cycles of plenary approval to get through ...

@dr-shorthair added the critical defects that must be completed for CR label Mar 20, 2019
Contributor

@davebrowning left a comment

Good addition, even better with @dr-shorthair's additions (#832).

Does this mean we can drop the subsequent note about "intention of the phrase 'informationally equivalent' needs to be clarified"? #411 is actually already closed, and this PR improves on our description by discussing different levels of fidelity, making it explicit that it's down to the data provider to judge, and providing a counterexample.

@davebrowning
Contributor

The #411 note tidy-up is now in #839, but it would be good to add @dr-shorthair's merge. I can do this if everyone (particularly @agbeltran) is okay with it?

@agbeltran
Member Author

this PR now also includes the changes by @dr-shorthair

Labels
critical defects that must be completed for CR dcat:Distribution dcat
8 participants