Improvements to Dataset #688

Closed
vholland opened this Issue Jul 31, 2015 · 20 comments

Comments

@vholland
Contributor

vholland commented Jul 31, 2015

As it stands, http://schema.org/Dataset allows one to describe the metadata for a dataset, but not the actual data. I propose we:

  1. Move Dataset out from CreativeWork and make it a child of ItemList (similar to the change to BreadcrumbList made in version 1.92).
  2. Create a new type Thing > Intangible > ListItem > DataItem.
  3. Expand the domain of http://schema.org/dateCreated to include DataItem.

This would allow people to create data catalogs like:

{
  "@context": "http://schema.org/",
  "@type": "Dataset",
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "Person",
      "name": "Alice",
      "email": "alice@example.com",
      "dateCreated": "2014-07-01"
    },
    {
      "@type": "Person",
      "name": "Bob",
      "email": "bob@example.com",
      "dateCreated": "2015-01-02"
    }
  ]
}

Note that in the above example, dateCreated is the date the record was created, not the date the person joined the company.

One could describe simple datasets by using Number or Text instead of a richer type.

Contributor

danbri commented Jul 31, 2015

I'm not quite sure what problem this solves. Wouldn't a document containing these assertions be the dataset? Is every entity mentioned in the dataset considered a value of itemListElement? e.g. if Bob here had an affiliation of some Organization, is that Organization an itemListElement of the dataset too?

I don't think moving it under ItemList works well, in that Dataset covers a great many kinds of dataset - not all of which have a single obvious conceptualization as a list of items. For example, audio recordings (see http://grh.mur.at/sites/default/files/mir_datasets_0.html), pre-trained artificial neural nets (https://github.com/BVLC/caffe/wiki/Model-Zoo), geo data (http://opendata.arcgis.com/ http://wiki.osgeo.org/wiki/Public_Geodata_for_the_UK ), space data incl. imagery and sensor readings (https://data.nasa.gov/data) etc etc.

It's important that we keep this type open and inclusive for all these kinds of data sharing + more. But it is worth taking a closer look at an important subset: datasets whose content can be seen as a set of assertions about the properties of entities. That seems to be where you're heading here. There is some related work over at W3C in the CSV group, see http://www.w3.org/blog/news/archives/4830 especially http://www.w3.org/TR/2015/CR-csv2rdf-20150716/ which includes a framework for mapping table rows (from CSV and similar tabular data) into triples. This is a different approach to "the actual data", but shares with your proposal a concern for treating that data as triples/assertions. Maybe there's some common ground here?

Contributor

vholland commented Jul 31, 2015

I had imagined multi-dimensional sets as a list of lists, but maybe that is too complicated.

Perhaps instead folks use both Dataset and ItemList as necessary, but we still have DataItem to allow for metadata about individual items. The above example becomes:

{
  "@context": "http://schema.org/",
  "@type": ["Dataset", "ItemList"],
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "Person",
      "name": "Alice",
      "email": "alice@example.com",
      "dateCreated": "2014-07-01"
    },
    {
      "@type": "Person",
      "name": "Bob",
      "email": "bob@example.com",
      "dateCreated": "2015-01-02"
    }
  ]
}

Contributor

vholland commented Jul 31, 2015

I realize there is an error in my JSON-LD. It should be:

{
  "@context": "http://schema.org/",
  "@type": ["Dataset", "ItemList"],
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "DataItem",
      "dateCreated": "2014-07-01",
      "item": {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    },
    {
      "@type": "DataItem",
      "dateCreated": "2015-01-02",
      "item": {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    }
  ]
}

Contributor

danbri commented Jul 31, 2015

I'm still missing something. Aren't all docs carrying schema.org datasets already? What is the value in explicitly saying "hey, I'm a dataset" and "hey, this is a thing mentioned in the dataset" all the way through? It feels like the overuse of WebPage we've seen, ... an awkward form of reification where you're not entirely sure what is being described or how deep into the sub-graph the properties apply. If you just want to wrap provenance metadata around chunks of schema.org-flavoured RDF, perhaps JSON-LD named graphs are worth a look? http://www.w3.org/TR/json-ld/#named-graphs

For multidimensional numeric data, http://www.w3.org/TR/vocab-data-cube/ could be a good fit.

Contributor

vholland commented Jul 31, 2015

This is more for data feeds that are not necessarily web pages or email messages. In some cases, the full data set is not sent at once, so it is useful to know the creation time of individual items.

Contributor

danbri commented Jul 31, 2015

Here's a quick attempt at using JSON-LD named graphs. Try it in http://json-ld.org/playground/

[
  {
    "@context": "http://schema.org/",
    "@id": "#dataitem-73",
    "generatedAt": "2014-07-01",
    "@graph": [
      {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    ]
  },
  {
    "@context": "http://schema.org/",
    "@id": "#dataitem-74",
    "generatedAt": "2014-07-02",
    "@graph": [
      {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    ]
  }
]

The quads that come back are:

<http://json-ld.org/playground/#dataitem-73> <http://schema.org/generatedAt> "2014-07-01" .
<http://json-ld.org/playground/#dataitem-74> <http://schema.org/generatedAt> "2014-07-02" .
_:b0 <http://schema.org/email> "alice@example.com" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://schema.org/name> "Alice" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://json-ld.org/playground/#dataitem-73> .
_:b1 <http://schema.org/email> "bob@example.com" <http://json-ld.org/playground/#dataitem-74> .
_:b1 <http://schema.org/name> "Bob" <http://json-ld.org/playground/#dataitem-74> .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://json-ld.org/playground/#dataitem-74> .

... where the final value in each row is a graph id (data item, in your terminology).
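
To make the role of that fourth term concrete, here is a minimal Python sketch that buckets N-Quads lines like the ones above by graph id. The `TERM` regex and the `group_by_graph` helper are assumptions made for illustration; they only handle the simple IRIs, blank nodes, and plain literals shown in this thread, not the full N-Quads grammar.

```python
import re
from collections import defaultdict

# Matches an IRI (<...>), a blank node (_:b0), or a plain quoted literal.
TERM = re.compile(r'<[^>]*>|_:\w+|"(?:[^"\\]|\\.)*"')


def group_by_graph(nquads: str) -> dict:
    """Group (subject, predicate, object) triples by their graph id.

    Statements with only three terms belong to the default graph,
    which is keyed here as None.
    """
    graphs = defaultdict(list)
    for line in nquads.splitlines():
        terms = TERM.findall(line)
        if len(terms) not in (3, 4):
            continue  # skip blank or unrecognised lines
        graph = terms[3] if len(terms) == 4 else None
        graphs[graph].append(tuple(terms[:3]))
    return dict(graphs)


# A shortened copy of the quads from the playground output above.
QUADS = '''
<http://json-ld.org/playground/#dataitem-73> <http://schema.org/generatedAt> "2014-07-01" .
_:b0 <http://schema.org/email> "alice@example.com" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://schema.org/name> "Alice" <http://json-ld.org/playground/#dataitem-73> .
_:b1 <http://schema.org/name> "Bob" <http://json-ld.org/playground/#dataitem-74> .
'''

grouped = group_by_graph(QUADS)
for graph, triples in grouped.items():
    print(graph, len(triples))
```

The statement about `generatedAt` lands in the default graph (key `None`), while the statements about Alice and Bob are filed under their respective graph ids.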

Contributor

vholland commented Jul 31, 2015

I'm not sure I understand. "generatedAt" is not valid schema.org, so we would need to change the context to include another vocabulary.

Contributor

danbri commented Jul 31, 2015

Yeah, that was the example property name used in the W3C spec; I didn't tweak it. We probably have something appropriate in schema.org, or could add it, or use a different context. But does the quads / named graph approach look worth consideration?

Contributor

danbri commented Jul 31, 2015

    [
      {
        "@context": "http://schema.org/",
        "@id": "#dataitem-73",
        "dateCreated": "2014-07-01",
        "@graph": [
          {
            "@type": "Person",
            "name": "Alice",
            "email": "alice@example.com"
          }
        ]
      },
      {
        "@context": "http://schema.org/",
        "@id": "#dataitem-74",
        "dateCreated": "2014-07-02",
        "@graph": [
          {
            "@type": "Person",
            "name": "Bob",
            "email": "bob@example.com"
          }
        ]
      }
    ]

Contributor

vholland commented Aug 4, 2015

I am not sure I understand. As written, I have two disconnected graphs: Alice's graph and Bob's. I still need something to say they are actually parts of a larger graph. Do you disagree with adding something to Dataset to join the graphs?

Contributor

chaals commented Aug 5, 2015

can't we use the collections stuff for that?

Contributor

vholland commented Aug 11, 2015

I spoke with @danbri offline to better understand his concerns. It is probably too much to take on modeling all data sets in one go, so I would like to refocus the discussion on supporting data feeds, which may come as JSON-LD instead of web pages. To that end, I propose:

  • Add a new type Thing > CreativeWork > Dataset > DataFeed
  • DataFeed would have the property dataFeedElement which expects Text, Thing, or DataFeedItem.
  • Add the new type Thing > Intangible > DataFeedItem
  • DataFeedItem has the following properties:
    • item: An entity represented by an entry in a list or data feed. (Note item already exists for ListItem.)
    • dateCreated: The datetime the data feed item was created.
    • dateModified: The last time the data feed item was modified.
    • dateDeleted: The datetime the item was removed from the data feed.

The properties http://schema.org/dateCreated and http://schema.org/dateModified exist on http://schema.org/CreativeWork. The proposal is to expand their domains to include DataFeedItem.

The sample JSON-LD becomes:

{
  "@context": "http://schema.org/",
  "@type": "DataFeed",
  "name": "Company directory",
  "dateModified": "2015-01-02",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "dateCreated": "2014-07-01",
      "item": {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    },
    {
      "@type": "DataFeedItem",
      "dateModified": "2015-01-02",
      "item": {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    }
  ]
}
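
As a rough illustration of why per-item dates help a consumer, the following Python sketch picks out the entries of a feed like the one above that were created or modified on or after a cutoff date. The helper name `changed_since` is hypothetical, and the sketch assumes every dataFeedElement of interest is a DataFeedItem object; plain Text or Thing values, which the proposal also allows, are simply skipped.

```python
import json
from datetime import date

FEED = json.loads("""
{
  "@context": "http://schema.org/",
  "@type": "DataFeed",
  "name": "Company directory",
  "dateModified": "2015-01-02",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "dateCreated": "2014-07-01",
      "item": {"@type": "Person", "name": "Alice", "email": "alice@example.com"}
    },
    {
      "@type": "DataFeedItem",
      "dateModified": "2015-01-02",
      "item": {"@type": "Person", "name": "Bob", "email": "bob@example.com"}
    }
  ]
}
""")


def changed_since(feed: dict, cutoff: date) -> list:
    """Return the `item` of each DataFeedItem created or modified on/after cutoff."""
    changed = []
    for entry in feed.get("dataFeedElement", []):
        # Skip Text or bare Thing entries; only DataFeedItem carries feed dates.
        if not isinstance(entry, dict) or entry.get("@type") != "DataFeedItem":
            continue
        stamps = [entry[key] for key in ("dateCreated", "dateModified") if key in entry]
        # Compare on the date part only (first 10 characters, YYYY-MM-DD).
        if any(date.fromisoformat(stamp[:10]) >= cutoff for stamp in stamps):
            changed.append(entry["item"])
    return changed


names = [item["name"] for item in changed_since(FEED, date(2015, 1, 1))]
print(names)  # ['Bob']
```

A consumer that last synced at the start of 2015 would only need to reprocess Bob's record here.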

Contributor

danbri commented Aug 14, 2015

Thanks @vholland this is a lot clearer. Can we just run through the date-related properties. At first glance they seem more alike than on 2nd reading, at least for me:

  • dateCreated: "The datetime the data feed item was created." - is this the same as date when added to the feed? or the actual end object/item actually created?
  • dateModified: "The last time the data feed item was modified." - this sounds like the thing itself being modified, not the feed entry i.e. DataFeedItem ( = its metadata). It could be either though.
  • dateDeleted: "The datetime the item was removed from the data feed." - definitely all about the proxy item in the feed.

The problem here is probably just wording: "item" in the definitions could either mean the DataFeedItem, or the actual real world item (e.g. the Person "Bob") that is the value of the item property. Let's try to rephrase so it is clearer on first reading.

vholland added a commit to vholland/schemaorg that referenced this issue Sep 11, 2015

Contributor

vholland commented Sep 11, 2015

Good point, regarding wording. In all cases, the dates apply to the proxy item in the feed.

I created pull request #765 with the listed changes. I took the liberty of extending the range for dateCreated and dateModified to also accept DateTime, as feeds (and, increasingly, online content) have creation dates that include times.
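
For consumers, accepting both Date and DateTime means a value may or may not carry a time component. A minimal Python sketch of handling both, assuming ISO 8601 strings and a hypothetical helper named `parse_schema_date`:

```python
from datetime import date, datetime


def parse_schema_date(value: str):
    """Parse a schema.org Date or DateTime string.

    Returns a `date` for date-only values and a `datetime` when a
    time component (marked by "T") is present.
    """
    if "T" in value:
        return datetime.fromisoformat(value)
    return date.fromisoformat(value)


created = parse_schema_date("2015-01-02")
modified = parse_schema_date("2015-01-02T09:30:00+00:00")
print(type(created).__name__, created)
print(type(modified).__name__, modified)
```

Returning two different types keeps the distinction visible to the caller, who can then decide how to compare a date-only value with a timestamp.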

vholland added a commit to vholland/schemaorg that referenced this issue Sep 11, 2015

@vholland vholland added this to the sdo-phobos release milestone Sep 11, 2015

Contributor

danbri commented Sep 14, 2015

This looks good. I'm merging it in so people have a concrete target to review...

danbri added a commit that referenced this issue Sep 14, 2015

Merge pull request #765 from vholland/datafeed
Issue #688: Added DataFeed and DataFeedItem including examples and

Contributor

danbri commented Sep 14, 2015

/cc @chaals @ajax-als @tilid @pmika @mfhepp @shankarnat @rvguha

Ok, please take a look here: http://sdo-phobos.appspot.com/DataFeed

There's a JSON-LD example (thanks, Vicki).

The idea is, within the constraints of a normal schema.org description (no fancy multi-graph stuff) to provide more feed-like metadata on the items described, to aid consumption, aggregation etc. I looked into some other options and have ended up more convinced than when I started it :) this is useful...

Contributor

vholland commented Oct 1, 2015

I forgot to add that one use for this is the supporting data for a software application. (For example, configuration data.)

I'll create a new pull request shortly.

vholland added a commit to vholland/schemaorg that referenced this issue Oct 1, 2015

Contributor

vholland commented Oct 1, 2015

Implemented in pull request #822.
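
A hedged sketch of what such markup might look like, built in Python and serialized to JSON-LD. It assumes the `supportingData` property named in the merge commit for that pull request takes a DataFeed as its value; the application name, the feed contents, and the use of PropertyValue for a configuration entry are illustrative assumptions, not taken from the pull request.

```python
import json

# Hypothetical application whose configuration is published as a DataFeed.
app = {
    "@context": "http://schema.org/",
    "@type": "SoftwareApplication",
    "name": "Example app",  # illustrative name
    "supportingData": {
        "@type": "DataFeed",
        "name": "Configuration data",
        "dataFeedElement": [
            {
                "@type": "DataFeedItem",
                "dateModified": "2015-10-01",
                # PropertyValue is one possible way to model a setting.
                "item": {"@type": "PropertyValue", "name": "theme", "value": "dark"},
            }
        ],
    },
}

document = json.dumps(app, indent=2)
print(document)
```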

danbri added a commit that referenced this issue Oct 1, 2015

Merge pull request #822 from vholland/supporting
Issue #688: Added supportingData to SoftwareApplication.

Contributor

elf-pavlik commented Oct 3, 2015

> This is more for data feeds that are not necessarily web pages or email messages. In some cases, the full data set is not sent at once, so it is useful to know the creation time of individual items.

Will those 'DataFeeds' need paging? I see 4 independent (uncoordinated) developments here

I don't know if the more the merrier applies here 😉

Contributor

danbri commented Nov 6, 2015

Fixed in http://schema.org/docs/releases.html#v2.2 - thanks all. Closing as main issue is addressed, feel free to continue discussions!
