Improving Dataset descriptions #1083

Closed
danbri opened this Issue Apr 6, 2016 · 26 comments

10 participants
@danbri
Contributor

danbri commented Apr 6, 2016

Talking with Natasha Noy about possible improvements around dataset description. Some things to look into:

  • coverageStart and coverageEnd (currently, datasetTimeInterval has DateTime, not an interval, as its expected type, which I think is not correct, or at least doesn't allow us to specify the coverage interval)
  • timestep (dct:accrualPeriodicity)
  • bibliographic reference: many datasets refer to the paper that describes them
  • Main variables measured -- without necessarily knowing which ones are dimensions and which ones are measures (qb:DimensionProperty and qb:MeasureProperty)
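To make the first point concrete, here is a minimal sketch of why an interval carries more than a single DateTime: one ISO 8601 "start/end" string yields both ends of the coverage. The function name split_interval is made up for illustration, and the sketch only handles intervals written as two full calendar dates, not durations or reduced-precision dates.

```python
from datetime import date


def split_interval(interval):
    """Split a 'start/end' ISO 8601 interval of full calendar dates.

    Hypothetical helper: durations (e.g. 'P1Y') and reduced-precision
    dates (e.g. '2011') are deliberately not handled.
    """
    start_text, end_text = interval.split("/")
    return date.fromisoformat(start_text), date.fromisoformat(end_text)


# The two values play the role of coverageStart and coverageEnd.
coverage_start, coverage_end = split_interval("2011-01-01/2011-12-31")
print(coverage_start, coverage_end)
```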

Related work

This all starts to get into the business of looking inside the dataset, which was discussed at schema.org previously, e.g. see the "Looking inside tables" thread from Omar. Subsequently, in W3C CSVW, some of these ideas went standards track, in particular a templating mechanism to map tabular data into RDF.

Contributor

darobin commented Apr 6, 2016

This is also related to #975 for versioning dependencies, particularly on datasets (discussed in more detail in https://research.science.ai/article/web-first-data-citations).

Contributor

danbri commented Apr 6, 2016

See also #1066 for a quick bugfix (spotted by Natasha too)

natashafn commented Apr 18, 2016

A couple of follow up comments:

  • there is a citation property on CreativeWork that is probably the right property to use for bibliographic reference
  • Another property that is missing however is any description of how a dataset was created. In some cases, I would imagine this would be just a text field and in some cases, a structured provenance record. Maybe a property that could be either?
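As a rough illustration of the first point, a Dataset description could point at the paper that describes it through citation. This is only a sketch: the dataset name, article title and URL are made-up placeholders, and the markup is built in Python purely so it can be serialized and checked.

```python
import json

# Hypothetical dataset description; every name and URL is a placeholder.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example rainfall observations",
    # citation is defined on CreativeWork, so Dataset inherits it.
    "citation": {
        "@type": "ScholarlyArticle",
        "name": "A paper describing the example rainfall observations",
        "url": "https://example.org/papers/rainfall",
    },
}

print(json.dumps(dataset, indent=2))
```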

Contributor

danbri commented May 16, 2016

Notes from a F2F meeting on lifescience datasets

Contributor

danbri commented May 16, 2016

See also http://scholarly.vernacular.io/ w.r.t. data citation /cc @darobin

Contributor

trypuz commented May 17, 2016

Hi!
There is something wrong with:

http://meta.schema.org
http://pending.schema.org
http://health-lifesci.schema.org

I get "The requested URL / was not found on this server".

Best,
Robert Trypuz

Contributor

danbri commented May 31, 2016

Filed #1189 re datasetTimeInterval

Contributor

danbri commented Jul 15, 2016

Most of these suggestions are now implemented/committed and published on our draft webschemas.org site for review: http://webschemas.org/docs/releases.html#g1083

The corresponding pull request was #1247

I copy here some supporting notes. Of all these points, only the overlap with releasedEvent remains unexplored.

CHANGES

1.) for temporal and spatial coverage.

As of v3.0 we have:

Relating to Dataset specifically,

http://schema.org/spatial (Dataset -> Place),
"The range of spatial applicability of a dataset, e.g. for a dataset of New York weather, the state of New York."

http://schema.org/temporal (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

The temporal property is superseded by the awkwardly named http://schema.org/datasetTimeInterval -

http://schema.org/datasetTimeInterval (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

Relating to CreativeWork,

http://schema.org/contentLocation (CreativeWork -> Place),
"The location depicted or described in the content. For example, the location in a photograph or painting"

http://schema.org/locationCreated (CreativeWork -> Place),
"The location where the CreativeWork was created, which may not be the same as the location depicted in the CreativeWork."

Note also http://schema.org/releasedEvent which structures things a little differently, grouping place/time within an Event.

PROPOSAL:

1a. a minor detail re releasedEvent, but documenting here:
For works (most typically media broadcasts but potentially e.g. datasets too) whose publication is structured in terms of documented releases, it is reasonable to expect the release information in a http://schema.org/PublicationEvent to match direct contentLocation or spatial[Coverage] properties if the latter are present. A startDate property of the event would match http://schema.org/dateCreated of the published item.

1b.
Create spatialCoverage and temporalCoverage properties as successors to the (vaguely and/or awkwardly named) datasetTimeInterval, spatial and temporal properties.

1c.
Broaden spatialCoverage and temporalCoverage so that they apply to CreativeWork rather than just Dataset.

1d.
Update their textual definitions to accommodate their broader scope, and to address any confusion about related properties.
Proposed text:

spatialCoverage: "The spatialCoverage of a CreativeWork indicates the place(s) which are the focus of some work. It is a subproperty of contentLocation intended for more technical and specific materials. For example with a Dataset, it indicates areas that the dataset describes: a dataset of New York weather would have spatialCoverage which was the place: the state of New York."

temporalCoverage: "The temporalCoverage of a CreativeWork indicates the period that the content applies to, i.e. that it describes. In the case of a Dataset it will typically indicate the relevant time period in a precise notation (e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)). Other forms of content e.g. ScholarlyArticle, Book, TVSeries or TVEpisode may indicate their temporalCoverage in broader terms - textually or via well-known URL."

1e.
Update RDFS assertions.

spatialCoverage subPropertyOf contentLocation.
temporal supersededBy temporalCoverage. (rather than by datasetTimeInterval as now)
datasetTimeInterval supersededBy temporalCoverage.
Add mappings,

temporalCoverage equivalentProperty http://purl.org/dc/terms/temporal
spatialCoverage equivalentProperty http://purl.org/dc/terms/spatial
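For publishers with existing markup, the renames proposed above could be applied mechanically. The following is a minimal sketch, not part of the schema.org tooling: the migrate function and the sample description are hypothetical, only top-level keys of a flat JSON-LD dict are handled, and the spatial to spatialCoverage mapping is inferred from 1b rather than stated in the 1e assertions.

```python
import json

# Successor names. The first two come from the supersededBy assertions
# in 1e; spatial -> spatialCoverage is an assumption based on 1b.
SUPERSEDED_BY = {
    "temporal": "temporalCoverage",
    "datasetTimeInterval": "temporalCoverage",
    "spatial": "spatialCoverage",
}


def migrate(description):
    """Return a copy of a flat JSON-LD dict with superseded keys renamed.

    If the successor key is also present, its value wins over any value
    carried by a superseded key.
    """
    migrated = {}
    for key, value in description.items():
        new_key = SUPERSEDED_BY.get(key, key)
        if key in SUPERSEDED_BY and new_key in migrated:
            # A value for the successor is already recorded; keep it.
            continue
        migrated[new_key] = value
    return migrated


# Hypothetical legacy description using the older property names.
legacy = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "New York weather",
    "spatial": {"@type": "Place", "name": "New York"},
    "datasetTimeInterval": "2011-01-01/2011-12-31",
}

print(json.dumps(migrate(legacy), indent=2))
```

Nested descriptions and JSON-LD contexts that alias property names would need more care than this flat rename.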

Contributor

joshsh commented Jul 19, 2016

So we have arrived at the names spatialCoverage and temporalCoverage, after all. Agreed that they are appropriate for other CreativeWorks, and it's nice to have the explicit mapping into DCMI Terms.

Contributor

danbri commented Jul 20, 2016

Yes, I think this terminology bridges well with usage elsewhere, as well as better connecting schema.org dataset description with the approach for other kinds of CreativeWork. Does this work ok for others following along here?

@danbri danbri closed this Aug 10, 2016

Contributor

danbri commented Sep 14, 2016

On reflection, and after further feedback, I believe variableMeasured would be a more appropriate name for this property (currently the plural variablesMeasured). I'll work on migrating unless anyone objects.

agbeltran commented Sep 15, 2016

In addition to the change to singular, it seems that the variableMeasured property is missing PropertyValuePair in the range to comply with the definition.

Aaranged commented Sep 23, 2016

In addition to the comment from @agbeltran, note that Google's use of variableMeasured extends the expected type from Text to include URL.

@ypriverol ypriverol referenced this issue Nov 1, 2016

Closed

First test of dataset in Schema.org #135

0 of 4 tasks complete

agbeltran commented Nov 1, 2016

@danbri should we open a new issue about the two problems with variablesMeasured reported above?

Contributor

danbri commented Nov 1, 2016

@agbeltran I believe they're fixed ok in our next release, previewable at http://webschemas.org/variableMeasured - can you confirm?

agbeltran commented Nov 1, 2016

Thanks @danbri - I can see that it now complies with the definition as its range is Text or PropertyValue.

Maybe what remains to be fixed is the documentation at
https://developers.google.com/search/docs/data-types/datasets
which indicates Text, URL rather than Text, PropertyValue?

natashafn commented Nov 1, 2016

@danbri: is the description actually correct about PropertyValue as range?


agbeltran commented Nov 8, 2016

Checking this again, both properties singular and plural are live in the pending version:

http://pending.schema.org/variablesMeasured
http://pending.webschemas.org/variableMeasured

The documentation (https://developers.google.com/search/docs/data-types/datasets) refers to the singular variableMeasured, which is the one we had agreed was the better option. Right?

What is the conclusion about the range?

Contributor

danbri commented Nov 8, 2016

@agbeltran I'm sorry the site doesn't make this clear enough, but roughly: schema.org is the official site, updated in named releases several times a year; webschemas.org is the editor's working draft of the proposed next release, typically edited several times a week. In the webschemas version, if you look up the obsolete plural variablesMeasured, you will find yourself directed from http://pending.webschemas.org/variablesMeasured to http://attic.webschemas.org/variablesMeasured, an area we have made for things that are "as good as removed", for complete transparency.

For range, yes PropertyValue should be in the range - looks like it needs adding on the Google side.
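To show what the two forms in the range look like side by side, here is a small sketch that reads variableMeasured whether a publisher used plain Text or a structured PropertyValue. The dataset, the variable names and the variable_names helper are all hypothetical, and URL values (as in the Google documentation mentioned above) are not treated specially.

```python
import json

# Hypothetical description mixing both forms of variableMeasured.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example station readings",
    "variableMeasured": [
        # Plain Text form.
        "air temperature",
        # Structured PropertyValue form.
        {"@type": "PropertyValue", "name": "precipitation", "unitText": "mm"},
    ],
}


def variable_names(description):
    """Collect variable names from Text or PropertyValue entries."""
    values = description.get("variableMeasured", [])
    if not isinstance(values, list):
        # A single value may appear without a surrounding list.
        values = [values]
    names = []
    for value in values:
        if isinstance(value, str):
            names.append(value)
        elif isinstance(value, dict) and "name" in value:
            names.append(value["name"])
    return names


print(json.dumps(variable_names(dataset)))
```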

agbeltran commented Nov 8, 2016

Thanks! (I was aware of the releases/working draft distinction but had missed the attic redirection.)

dr-shorthair commented May 9, 2017

BTW - the use of the word 'Measured' also has this problem - 'Measure' usually applies to data collection activities with quantitative, but not categorical results. So variableMeasured has the risk that it implicitly excludes datasets where the 'values' are categories rather than numbers.

There are precedents from several scientific domains to use the more general term 'Observed' and 'Observation' (rather than Measured and Measurement) to allow for both categories and quantities. SSN [1] & O&M [2] use 'observedProperty' and OBOE [3] has 'ofCharacteristic'.

[1] http://w3c.github.io/sdw/ssn/
[2] https://en.wikipedia.org/wiki/Observations_and_Measurements https://dx.doi.org/10.13140/2.1.1142.3042
[3] https://dx.doi.org/10.5063/F11C1TTM

Contributor

danbri commented Jul 18, 2017

I realize I didn't reply explicitly here @dr-shorthair. I'd like to bring most of SOSA into schema.org (as discussed with SpatialWeb WG) and hope it will address the topic more thoroughly. @agbeltran any thoughts from a bioschemas/lifesci perspective?

thadguidry commented Apr 4, 2018

@dr-shorthair But Simon, I would prefer we still give publishers the ability to collect both quantitative and categorical results. Doing that makes data flow tooling easier, and systems have a bit more information provided to them for proper analysis by machine learning and humans. I think your stance is primarily from a collection-effort perspective. However, my stance is that we should consider the data after the collection efforts, which is where value in the data is finally extracted for publishers and mankind.

@dr-shorthair Could this be anything, like say "loss of life" as a Result? http://w3c.github.io/sdw/ssn/#SOSAResult didn't really specify a "kind" of result, and I found the description a bit lacking to determine whether there are any limits on its usage.
