Improvements to Dataset #688

Closed
vholland opened this Issue Jul 31, 2015 · 20 comments
@vholland
Contributor

As it stands, http://schema.org/Dataset allows one to describe the metadata for a dataset, but not the actual data. I propose we:

  1. Move Dataset out from CreativeWork and make it a child of ItemList (similar to the change to BreadcrumbList made in version 1.92).
  2. Create a new type Thing > Intangible > ListItem > DataItem.
  3. Expand the range for http://schema.org/dateCreated to include DataItem.

This would allow people to create data catalogs like:

{
  "@context": "http://schema.org/",
  "@type": "Dataset",
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "Person",
      "name": "Alice",
      "email": "alice@example.com",
      "dateCreated": "2014-07-01"
    },
    {
      "@type": "Person",
      "name": "Bob",
      "email": "bob@example.com",
      "dateCreated": "2015-01-02"
    }
  ]
}

Note that in the above example, dateCreated is the date the record was created, not the date the person joined the company.

One could describe simple datasets by using Number or Text instead of a richer type.

@vholland vholland self-assigned this Jul 31, 2015
@danbri
Contributor
danbri commented Jul 31, 2015

I'm not quite sure what problem this solves. Wouldn't a document containing these assertions be the dataset? Is every entity mentioned in the dataset considered a value of itemListElement? e.g. if Bob here had an affiliation of some Organization, is that Organization an itemListElement of the dataset too?

I don't think moving it under ItemList works well, in that Dataset covers a great many kinds of dataset - not all of which have a single obvious conceptualization as a list of items. For example, audio recordings (see http://grh.mur.at/sites/default/files/mir_datasets_0.html), pre-trained artificial neural nets (https://github.com/BVLC/caffe/wiki/Model-Zoo), geo data (http://opendata.arcgis.com/ http://wiki.osgeo.org/wiki/Public_Geodata_for_the_UK ), space data incl. imagery and sensor readings (https://data.nasa.gov/data) etc etc.

It's important that we keep this type open and inclusive for all these kinds of data sharing + more. But it is worth taking a closer look at an important subset: datasets whose content can be seen as a set of assertions about the properties of entities. That seems to be where you're heading here. There is some related work over at W3C in the CSV group, see http://www.w3.org/blog/news/archives/4830 especially http://www.w3.org/TR/2015/CR-csv2rdf-20150716/ which includes a framework for mapping table rows (from CSV and similar tabular data) into triples. This is a different approach to "the actual data", but shares with your proposal a concern for treating that data as triples/assertions. Maybe there's some common ground here?

@vholland
Contributor

I had imagined multi-dimensional sets as a list of lists, but maybe that is too complicated.

Perhaps instead folks use both Dataset and ItemList as necessary, but we still have DataItem to allow for metadata about individual items. The above example becomes:

{
  "@context": "http://schema.org/",
  "@type": ["Dataset", "ItemList"],
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "Person",
      "name": "Alice",
      "email": "alice@example.com",
      "dateCreated": "2014-07-01"
    },
    {
      "@type": "Person",
      "name": "Bob",
      "email": "bob@example.com",
      "dateCreated": "2015-01-02"
    }
  ]
}
@vholland
Contributor

I realize there is an error in my JSON-LD. It should be:

{
  "@context": "http://schema.org/",
  "@type": ["Dataset", "ItemList"],
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "DataItem",
      "dateCreated": "2014-07-01",
      "item": {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    },
    {
      "@type": "DataItem",
      "dateCreated": "2015-01-02",
      "item": {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    }
  ]
}

@danbri
Contributor
danbri commented Jul 31, 2015

I'm still missing something. Aren't all docs carrying schema.org datasets already? What is the value in explicitly saying "hey, I'm a dataset" and "hey, this is a thing mentioned in the dataset" all the way through? It feels like the overuse of WebPage we've seen, ... an awkward form of reification where you're not entirely sure what is being described or how deep into the sub-graph the properties apply. If you just want to wrap provenance metadata around chunks of schema.org-flavoured RDF, perhaps JSON-LD named graphs are worth a look? http://www.w3.org/TR/json-ld/#named-graphs

For multidimensional numeric data, http://www.w3.org/TR/vocab-data-cube/ could be a good fit.

@vholland
Contributor

This is more for data feeds that are not necessarily web pages or email messages. In some cases, the full data set is not sent at once, so it is useful to know the creation time of individual items.

@danbri
Contributor
danbri commented Jul 31, 2015

Here's a quick attempt at using JSON-LD named graphs. Try it in http://json-ld.org/playground/

[
  {
    "@context": "http://schema.org/",
    "@id": "#dataitem-73",
    "generatedAt": "2014-07-01",
    "@graph": [
      {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    ]
  },
  {
    "@context": "http://schema.org/",
    "@id": "#dataitem-74",
    "generatedAt": "2014-07-02",
    "@graph": [
      {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    ]
  }
]

The quads that come back are:

<http://json-ld.org/playground/#dataitem-73> <http://schema.org/generatedAt> "2014-07-01" .
<http://json-ld.org/playground/#dataitem-74> <http://schema.org/generatedAt> "2014-07-02" .
_:b0 <http://schema.org/email> "alice@example.com" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://schema.org/name> "Alice" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://json-ld.org/playground/#dataitem-73> .
_:b1 <http://schema.org/email> "bob@example.com" <http://json-ld.org/playground/#dataitem-74> .
_:b1 <http://schema.org/name> "Bob" <http://json-ld.org/playground/#dataitem-74> .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://json-ld.org/playground/#dataitem-74> .

... where the final value in each row is a graph id (data item, in your terminology).
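
The grouping by graph id described above can be sketched in Python. This is a minimal illustration, not a full N-Quads parser: the regular expression only handles IRIs, blank nodes, and plain string literals, and the name `group_by_graph` is an assumption made for this example. Rows without a fourth term land in the default graph (keyed as `None`).

```python
import re
from collections import defaultdict

# One N-Quads term: an IRI, a blank node, or a plain string literal.
# Simplified on purpose: no language tags, datatypes, or escaped quotes.
TERM = r'(<[^>]*>|_:\w+|"[^"]*")'
QUAD = re.compile(rf'^{TERM}\s+{TERM}\s+{TERM}(?:\s+{TERM})?\s*\.$')

def group_by_graph(nquads: str) -> dict:
    """Map each graph name (None for the default graph) to its triples."""
    graphs = defaultdict(list)
    for line in nquads.splitlines():
        line = line.strip()
        if not line:
            continue
        match = QUAD.match(line)
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        subject, predicate, obj, graph = match.groups()
        graphs[graph].append((subject, predicate, obj))
    return dict(graphs)

# A few of the quads from the playground output above.
SAMPLE = """\
<http://json-ld.org/playground/#dataitem-73> <http://schema.org/generatedAt> "2014-07-01" .
_:b0 <http://schema.org/name> "Alice" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://schema.org/email> "alice@example.com" <http://json-ld.org/playground/#dataitem-73> .
"""

graphs = group_by_graph(SAMPLE)
for graph_id, triples in graphs.items():
    print(graph_id, len(triples))
```

The metadata about each data item (here `generatedAt`) sits in the default graph, while the statements about the person sit inside the named graph that the metadata describes.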

@vholland
Contributor

I'm not sure I understand. "generatedAt" is not valid schema.org, so we would need to change the context to include another vocabulary.

@danbri
Contributor
danbri commented Jul 31, 2015

Yeah, that was the example property name used in the W3C spec; I didn't tweak it. We probably have something appropriate in schema.org, or could add one, or use a different context. But does the quads / named graph approach look worth consideration?

@danbri
Contributor
danbri commented Jul 31, 2015
    [
      {
        "@context": "http://schema.org/",
        "@id": "#dataitem-73",
        "dateCreated": "2014-07-01",
        "@graph": [
          {
            "@type": "Person",
            "name": "Alice",
            "email": "alice@example.com"
          }
        ]
      },
      {
        "@context": "http://schema.org/",
        "@id": "#dataitem-74",
        "dateCreated": "2014-07-02",
        "@graph": [
          {
            "@type": "Person",
            "name": "Bob",
            "email": "bob@example.com"
          }
        ]
      }
    ]
@vholland
Contributor
vholland commented Aug 4, 2015

I am not sure I understand. As written, I have two disconnected graphs: Alice's graph and Bob's. I still need something to say they are actually parts of a larger graph. Do you disagree with adding something to Dataset to join the graphs?

@chaals
Contributor
chaals commented Aug 5, 2015

can't we use the collections stuff for that?

@vholland
Contributor

I spoke with @danbri offline to better understand his concerns. It is probably too much to take on modeling all data sets in one go. To that end, I would like to refocus the discussion on supporting data feeds, which may come as JSON-LD instead of web pages. Specifically, I propose:

  • Add a new type Thing > CreativeWork > Dataset > DataFeed
  • DataFeed would have the property dataFeedElement which expects Text, Thing, or DataFeedItem.
  • Add the new type Thing > Intangible > DataFeedItem
  • DataFeedItem has the following properties:
    • item: An entity represented by an entry in a list or data feed. (Note item already exists for ListItem.)
    • dateCreated: The datetime the data feed item was created.
    • dateModified: The last time the data feed item was modified.
    • dateDeleted: The datetime the item was removed from the data feed.

The properties http://schema.org/dateCreated and http://schema.org/dateModified exist on http://schema.org/CreativeWork. The proposal is to expand their domains to include DataFeedItem.

The sample JSON-LD becomes:

{
  "@context": "http://schema.org/",
  "@type": "DataFeed",
  "name": "Company directory",
  "dateModified": "2015-01-02",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "dateCreated": "2014-07-01",
      "item": {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    },
    {
      "@type": "DataFeedItem",
      "dateModified": "2015-01-02",
      "item": {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    }
  ]
}
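
As a rough illustration of how a consumer might use the per-item dates, here is a minimal Python sketch that picks out the entries of a DataFeed created or modified on or after a cutoff. The function name `changed_since` and the cutoff value are assumptions made for this example, not part of the proposal; the feed is a syntactically valid copy of the sample.

```python
import json

FEED = json.loads("""
{
  "@context": "http://schema.org/",
  "@type": "DataFeed",
  "name": "Company directory",
  "dateModified": "2015-01-02",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "dateCreated": "2014-07-01",
      "item": {"@type": "Person", "name": "Alice", "email": "alice@example.com"}
    },
    {
      "@type": "DataFeedItem",
      "dateModified": "2015-01-02",
      "item": {"@type": "Person", "name": "Bob", "email": "bob@example.com"}
    }
  ]
}
""")

def changed_since(feed: dict, cutoff: str) -> list:
    """Return the wrapped items whose feed entry changed on or after cutoff.

    Dates are compared as ISO 8601 strings of the same precision
    (YYYY-MM-DD here), which sort in chronological order.
    """
    changed = []
    for entry in feed.get("dataFeedElement", []):
        # The proposal also allows bare Text or Thing values; they carry
        # no feed-level dates, so this sketch skips them.
        if not isinstance(entry, dict) or entry.get("@type") != "DataFeedItem":
            continue
        stamp = entry.get("dateModified") or entry.get("dateCreated")
        if stamp is not None and stamp >= cutoff:
            changed.append(entry["item"])
    return changed

recent = changed_since(FEED, "2015-01-01")
print([person["name"] for person in recent])
```

Because the dates live on the DataFeedItem wrapper rather than on the Person, the consumer never has to guess whether a date describes the record or the real-world entity.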
@danbri
Contributor
danbri commented Aug 14, 2015

Thanks @vholland, this is a lot clearer. Can we just run through the date-related properties? At first glance they seem more alike than on 2nd reading, at least for me:

  • dateCreated: "The datetime the data feed item was created." - is this the date when it was added to the feed, or when the actual end object/item was created?
  • dateModified: "The last time the data feed item was modified." - this sounds like the thing itself being modified, not the feed entry i.e. DataFeedItem ( = its metadata). It could be either though.
  • dateDeleted: "The datetime the item was removed from the data feed." - definitely all about the proxy item in the feed.

The problem here is probably just wording: "item" in the definitions could either mean the DataFeedItem, or the actual real world item (e.g. the Person "Bob") that is the value of the item property. Let's try to rephrase so it is clearer on first reading.

@vholland vholland added a commit to vholland/schemaorg that referenced this issue Sep 11, 2015
@vholland vholland Issue #688: Added DataFeed and DataFeedItem including examples and
release notes.
8315f5b
@vholland
Contributor

Good point, regarding wording. In all cases, the dates apply to the proxy item in the feed.

I created pull request #765 with the listed changes. I took the liberty of extending the range for dateCreated and dateModified to also accept DateTime, as feeds (and, increasingly, online content) have creation dates that include times.
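
For consumers, accepting both Date and DateTime means a value may or may not carry a time part. A minimal Python sketch of handling both follows; the helper name `parse_feed_date` is hypothetical, and it assumes values are ISO 8601 strings without a trailing "Z" (which `datetime.fromisoformat` only accepts from Python 3.11 on).

```python
from datetime import date, datetime

def parse_feed_date(value: str):
    """Parse a schema.org Date ("2015-01-02") or DateTime ("2015-01-02T09:30:00")."""
    if "T" in value:
        # A time part is present, so treat the value as a DateTime.
        return datetime.fromisoformat(value)
    return date.fromisoformat(value)

created = parse_feed_date("2014-07-01")
modified = parse_feed_date("2015-01-02T09:30:00")
print(type(created).__name__, type(modified).__name__)
```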

@vholland vholland added a commit to vholland/schemaorg that referenced this issue Sep 11, 2015
@vholland vholland Issue #688: Fixed typo in range for dataFeedElement. dda4d3f
@vholland vholland added this to the sdo-phobos release milestone Sep 11, 2015
@danbri
Contributor
danbri commented Sep 14, 2015

This looks good. I'm merging it in so people have a concrete target to review...

@danbri
Contributor
danbri commented Sep 14, 2015

/cc @chaals @ajax-als @tilid @pmika @mfhepp @shankarnat @rvguha

Ok, please take a look here: http://sdo-phobos.appspot.com/DataFeed

There's a JSON-LD example (thanks, Vicki).

The idea is, within the constraints of a normal schema.org description (no fancy multi-graph stuff) to provide more feed-like metadata on the items described, to aid consumption, aggregation etc. I looked into some other options and have ended up more convinced than when I started it :) this is useful...

@vholland
Contributor
vholland commented Oct 1, 2015

I forgot to add that one use for this is the supporting data for a software application. (For example, configuration data.)

I'll create a new pull request shortly.

@vholland vholland added a commit to vholland/schemaorg that referenced this issue Oct 1, 2015
@vholland vholland Issue #688: Added supportingData to SoftwareApplication. 9b6a2b1
@vholland
Contributor
vholland commented Oct 1, 2015

Implemented in pull request #822.

@elf-pavlik
Contributor

This is more for data feeds that are not necessarily web pages or email messages. In some cases, the full data set is not sent at once, so it is useful to know the creation time of individual items.

Will those 'DataFeeds' need paging? I see 4 independent (uncoordinated) developments here

I don't know if the more the merrier applies here 😉

@danbri
Contributor
danbri commented Nov 6, 2015

Fixed in http://schema.org/docs/releases.html#v2.2 - thanks all. Closing as main issue is addressed, feel free to continue discussions!

@danbri danbri closed this Nov 6, 2015