Improving Dataset descriptions #1083

Closed
danbri opened this Issue Apr 6, 2016 · 22 comments


@danbri
Contributor
danbri commented Apr 6, 2016

Talking with Natasha Noy about possible improvements around dataset description. Some things to look into:

  • coverageStart and coverageEnd (currently, datasetTimeInterval has DateTime, not an interval, as its expected type, which I think is not correct, or at least doesn't allow us to specify the coverage interval)
  • timestep (dct:accrualPeriodicity)
  • bibliographic reference: many datasets refer to the paper that describes them
  • Main variables measured, without necessarily knowing which ones are dimensions and which ones are measures (qb:DimensionProperty and qb:MeasureProperty)

Related work

This all starts to get into the business of looking inside the dataset, which was discussed at schema.org previously - e.g. see Looking inside tables thread from Omar. Subsequently in W3C CSVW some of these ideas went standards track, in particular a templating mechanism to map tabular data into RDF.

@danbri danbri self-assigned this Apr 6, 2016
@darobin
Contributor
darobin commented Apr 6, 2016

This is also related to #975 for versioning dependencies, particularly on datasets (discussed in more detail in https://research.science.ai/article/web-first-data-citations).

@danbri
Contributor
danbri commented Apr 6, 2016

See also #1066 for a quick bugfix (spotted by Natasha too)

@natashafn

A couple of follow-up comments:

  • there is a citation property on CreativeWork that is probably the right property to use for a bibliographic reference
  • Another property that is missing, however, is any description of how a dataset was created. In some cases I would imagine this would be just a text field, and in others a structured provenance record. Maybe a property that could be either?
@danbri
Contributor
danbri commented May 16, 2016

Notes from a F2F meeting on lifescience datasets

@danbri
Contributor
danbri commented May 16, 2016

See also http://scholarly.vernacular.io/ w.r.t. data citation /cc @darobin

@trypuz
Contributor
trypuz commented May 17, 2016

Hi!
There is something wrong with:

http://meta.schema.org
http://pending.schema.org
http://health-lifesci.schema.org

I get "The requested URL / was not found on this server".

Best,
Robert Trypuz

@danbri
Contributor
danbri commented May 31, 2016

Filed #1189 re datasetTimeInterval

@danbri danbri pushed a commit that referenced this issue Jul 15, 2016
Dan Brickley Added variablesMeasured proposal to pending extension. See #1083 422acd8
@danbri danbri pushed a commit that referenced this issue Jul 15, 2016
Dan Brickley Changes towards Dataset improvements (and documenting these in releases page).

See #1083 for context.
9a746f3
@danbri danbri pushed a commit that referenced this issue Jul 15, 2016
Dan Brickley Noted that we also add a pending property: variablesMeasured.
See #1083
eac8145
@danbri danbri referenced this issue Jul 15, 2016
Merged

Sdo datasets2 #1247

@danbri danbri pushed a commit that referenced this issue Jul 15, 2016
Dan Brickley Added Dublin Core mappings.
See #1083.
also #84
a11ef62
@danbri
Contributor
danbri commented Jul 15, 2016

Most of these suggestions are now implemented/committed and published on our draft webschemas.org site for review: http://webschemas.org/docs/releases.html#g1083

The corresponding pull request was #1247

I copy here some supporting notes. Of all these points, only the overlap with releasedEvent remains unexplored.

CHANGES

1.) for temporal and spatial coverage.

As of v3.0 we have:

Relating to Dataset specifically,

http://schema.org/spatial (Dataset -> Place),
"The range of spatial applicability of a dataset, e.g. for a dataset of New York weather, the state of New York."

http://schema.org/temporal (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

The temporal property is superseded by the awkwardly named http://schema.org/datasetTimeInterval:

http://schema.org/datasetTimeInterval (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

Relating to CreativeWork,

http://schema.org/contentLocation (CreativeWork -> Place),
"The location depicted or described in the content. For example, the location in a photograph or painting"

http://schema.org/locationCreated (CreativeWork -> Place),
"The location where the CreativeWork was created, which may not be the same as the location depicted in the CreativeWork."

Note also http://schema.org/releasedEvent which structures things a little differently, grouping place/time within an Event.

PROPOSAL:

1a. a minor detail re releasedEvent, but documenting here:
For works (most typically media broadcasts but potentially e.g. datasets too) whose publication is structured in terms of documented releases, it is reasonable to expect the release information in a http://schema.org/PublicationEvent to match direct contentLocation or spatial[Coverage] properties if the latter are present. A startDate property of the event would match http://schema.org/dateCreated of the published item.

1b.
Create spatialCoverage and temporalCoverage properties as successors to the (vaguely and/or awkwardly named) datasetTimeInterval, spatial and temporal properties.

1c.
Broaden spatialCoverage and temporalCoverage so that they apply to CreativeWork rather than just Dataset.

1d.
Update their textual definitions to accommodate their broader scope, and to address any confusion about related properties.
Proposed text:

spatialCoverage: "The spatialCoverage of a CreativeWork indicates the place(s) which are the focus of some work. It is a subproperty of contentLocation intended for more technical and specific materials. For example with a Dataset, it indicates areas that the dataset describes: a dataset of New York weather would have spatialCoverage which was the place: the state of New York."

temporalCoverage: "The temporalCoverage of a CreativeWork indicates the period that the content applies to, i.e. that it describes. In the case of a Dataset it will typically indicate the relevant time period in a precise notation (e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)). Other forms of content e.g. ScholarlyArticle, Book, TVSeries or TVEpisode may indicate their temporalCoverage in broader terms - textually or via well-known URL."
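
To make the "precise notation" case concrete, here is a minimal sketch of how a consumer might tell an ISO 8601 "start/end" interval apart from the broader textual or URL values that the proposed text also allows. The function name parse_temporal_coverage and the sample values are assumptions for illustration only. The sketch handles only the explicit start/end form, not durations or reduced-precision values such as a bare year.

```python
from datetime import datetime
from typing import Optional, Tuple


def parse_temporal_coverage(value: str) -> Optional[Tuple[datetime, datetime]]:
    """Parse an ISO 8601 "start/end" interval; return None for anything else.

    Hypothetical helper: free text ("the 1920s") and URLs are valid
    temporalCoverage values under the proposed definition, so they are
    not treated as errors, just as "not a precise interval".
    """
    parts = value.split("/")
    if len(parts) != 2:
        return None
    try:
        start, end = (datetime.fromisoformat(p) for p in parts)
        if end < start:
            return None
    except (ValueError, TypeError):
        # ValueError: not an ISO date; TypeError: mixed naive/aware datetimes.
        return None
    return start, end


print(parse_temporal_coverage("2011-01-01/2011-12-31"))
print(parse_temporal_coverage("the 1920s"))
```

A caller can then fall back to displaying the raw string whenever the function returns None.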

1e.
Update RDFS assertions.

spatialCoverage subPropertyOf contentLocation.
temporal supersededBy temporalCoverage. (rather than by datasetTimeInterval as now)
datasetTimeInterval supersededBy temporalCoverage.
Add mappings,

temporalCoverage equivalentProperty http://purl.org/dc/terms/temporal
spatialCoverage equivalentProperty http://purl.org/dc/terms/spatial
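
As a rough illustration of what the supersededBy assertions mean for existing markup, the sketch below renames the older properties in a JSON-LD style description. It is not part of the schema.org tooling: the function name migrate_coverage and the sample dataset are hypothetical, and the spatial -> spatialCoverage mapping is inferred from item 1b rather than stated in the 1e assertions.

```python
import json

# Successor names taken from proposal items 1b and 1e above.
# "spatial" is an assumption based on 1b; 1e lists only the temporal pair.
SUCCESSORS = {
    "temporal": "temporalCoverage",
    "datasetTimeInterval": "temporalCoverage",
    "spatial": "spatialCoverage",
}


def migrate_coverage(description: dict) -> dict:
    """Return a copy with superseded coverage properties renamed."""
    # Copy everything that is not a superseded property first, so that an
    # explicitly given successor value is never overwritten.
    migrated = {k: v for k, v in description.items() if k not in SUCCESSORS}
    for old, new in SUCCESSORS.items():
        if old in description:
            migrated.setdefault(new, description[old])
    return migrated


# Hypothetical dataset description using the older property names.
legacy = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "New York weather",
    "spatial": {"@type": "Place", "name": "New York"},
    "datasetTimeInterval": "2011-01-01/2011-12-31",
}

print(json.dumps(migrate_coverage(legacy), indent=2))
```

The older names remain valid to publish; the sketch only shows how a consumer could normalise to the newer terms.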

@joshsh
Contributor
joshsh commented Jul 19, 2016

So we have arrived at the names spatialCoverage and temporalCoverage, after all. Agreed that they are appropriate for other CreativeWorks, and it's nice to have the explicit mapping into DCMI Terms.

@danbri
Contributor
danbri commented Jul 20, 2016

Yes, I think this terminology bridges well with usage elsewhere, as well as better connecting schema.org dataset description with the approach for other kinds of CreativeWork. Does this work ok for others following along here?

This was referenced Aug 2, 2016
@danbri danbri closed this Aug 10, 2016
@danbri
Contributor
danbri commented Sep 14, 2016

On reflection, and after further feedback, I believe variableMeasured would be a more appropriate name for this property. I'll work on migrating unless anyone objects.

@agbeltran
agbeltran commented Sep 15, 2016 edited

In addition to the change to singular, it seems that the variableMeasured property is missing PropertyValue in the range to comply with the definition.

@Aaranged
Aaranged commented Sep 23, 2016 edited

In addition to the comment from @agbeltran, note that Google's use of variableMeasured extends the expected type from Text to include URL.

@danbri danbri pushed a commit that referenced this issue Oct 3, 2016
Dan Brickley Renamed variablesMeasured to be variableMeasured, to fit our plurality pattern.

See #1083
f62a9ce
@ypriverol ypriverol referenced this issue in BD2K-DDI/ddi-web-app Nov 1, 2016
Open

First test of dataset in Schema.org #135

0 of 4 tasks complete
@agbeltran

@danbri should we open a new issue about the two problems with variablesMeasured reported above?

@danbri
Contributor
danbri commented Nov 1, 2016

@agbeltran I believe they're fixed ok in our next release, previewable at http://webschemas.org/variableMeasured - can you confirm?

@agbeltran

Thanks @danbri - I can see that it now complies with the definition as its range is Text or PropertyValue.

Maybe what remains to be fixed is the documentation at
https://developers.google.com/search/docs/data-types/datasets
which indicates Text, URL rather than Text, PropertyValue?

@natashafn

@danbri: is the description actually correct about PropertyValue as range?


@agbeltran

Checking this again, both properties, singular and plural, are live in the pending version:

http://pending.schema.org/variablesMeasured
http://pending.webschemas.org/variableMeasured

The documentation (https://developers.google.com/search/docs/data-types/datasets) refers to the singular variableMeasured, which is the one we had agreed was the better option, right?

What is the conclusion about the range?

@danbri
Contributor
danbri commented Nov 8, 2016

@agbeltran I'm sorry the site doesn't make this clear enough, but roughly: schema.org is the official site, updated in named releases several times a year; webschemas.org is the editor's working draft of the proposed next release, typically edited several times a week. In the webschemas version, if you look up the obsolete plural variablesMeasured, you will find yourself redirected from http://pending.webschemas.org/variablesMeasured to http://attic.webschemas.org/variablesMeasured, which is an area we have made for things that are "as good as removed", for complete transparency.

For range, yes PropertyValue should be in the range - looks like it needs adding on the Google side.
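
For anyone following along, a small sketch of markup that uses both members of the range (Text and PropertyValue) may help. The helper name variable_measured, the dataset name and the variables are made up for illustration, and this is not meant as the recommended markup for any particular consumer.

```python
import json
from typing import Union


def variable_measured(name: str, unit_text: str = "") -> Union[str, dict]:
    """Return plain Text when only a name is known, else a PropertyValue."""
    if not unit_text:
        return name
    return {"@type": "PropertyValue", "name": name, "unitText": unit_text}


# Hypothetical dataset description mixing both forms.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example weather observations",
    "variableMeasured": [
        variable_measured("wind direction"),
        variable_measured("air temperature", "degree Celsius"),
    ],
}

print(json.dumps(dataset, indent=2))
```

The first variable serialises as a plain string and the second as a nested PropertyValue object.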

@agbeltran

Thanks! (I was aware of the releases/working draft distinction but had missed the attic redirection.)
