Improvements to Dataset #688

Closed
vickitardif opened this issue Jul 31, 2015 · 20 comments
Assignees
Labels
schema.org vocab General top level tag for issues on the vocabulary

Comments

@vickitardif
Contributor

As it stands, http://schema.org/Dataset allows one to describe the metadata for a dataset, but not the actual data. I propose we:

  1. Move Dataset out from CreativeWork and make it a child of ItemList (similar to the change to BreadcrumbList made in version 1.92).
  2. Create a new type Thing > Intangible > ListItem > DataItem.
  3. Expand the range for http://schema.org/dateCreated to include DataItem.

This would allow people to create data catalogs like:

{
  "@context": "http://schema.org/",
  "@type": "Dataset",
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "Person",
      "name": "Alice",
      "email": "alice@example.com",
      "dateCreated": "2014-07-01"
    },
    {
      "@type": "Person",
      "name": "Bob",
      "email": "bob@example.com",
      "dateCreated": "2015-01-02"
    }
  ]
}

Note that in the above example, dateCreated is the date the record was created, not the date the person joined the company.

One could describe simple datasets by using Number or Text instead of a richer type.
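
A minimal sketch of that simple case, assuming the Dataset-as-ItemList shape proposed in this comment (not current schema.org). The helper name `simple_dataset` and the temperature values are made up for illustration:

```python
import json


def simple_dataset(name, values):
    """Build a Dataset whose itemListElement holds plain Number or Text values."""
    return {
        "@context": "http://schema.org/",
        "@type": "Dataset",
        "name": name,
        # Plain numbers or strings stand in for richer types such as Person.
        "itemListElement": list(values),
    }


# Hypothetical sample data: three daily high temperatures.
temperatures = simple_dataset("Daily high temperatures", [21.5, 23.0, 19.8])
print(json.dumps(temperatures, indent=2))
```

With bare values there is nowhere to hang per-record metadata such as dateCreated, which is the gap DataItem is meant to fill.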

@vickitardif vickitardif self-assigned this Jul 31, 2015
@vickitardif vickitardif added schema.org vocab General top level tag for issues on the vocabulary type:exact proposal labels Jul 31, 2015
@danbri
Contributor

danbri commented Jul 31, 2015

I'm not quite sure what problem this solves. Wouldn't a document containing these assertions be the dataset? Is every entity mentioned in the dataset considered a value of itemListElement? e.g. if Bob here had an affiliation of some Organization, is that Organization an itemListElement of the dataset too?

I don't think moving it under ItemList works well, in that Dataset covers a great many kinds of dataset - not all of which have a single obvious conceptualization as a list of items. For example, audio recordings (see http://grh.mur.at/sites/default/files/mir_datasets_0.html), pre-trained artificial neural nets (https://github.com/BVLC/caffe/wiki/Model-Zoo), geo data (http://opendata.arcgis.com/ http://wiki.osgeo.org/wiki/Public_Geodata_for_the_UK ), space data incl. imagery and sensor readings (https://data.nasa.gov/data) etc etc.

It's important that we keep this type open and inclusive for all these kinds of data sharing + more. But it is worth taking a closer look at an important subset: datasets whose content can be seen as a set of assertions about the properties of entities. That seems to be where you're heading here. There is some related work over at W3C in the CSV group, see http://www.w3.org/blog/news/archives/4830 especially http://www.w3.org/TR/2015/CR-csv2rdf-20150716/ which includes a framework for mapping table rows (from CSV and similar tabular data) into triples. This is a different approach to "the actual data", but shares with your proposal a concern for treating that data as triples/assertions. Maybe there's some common ground here?

@vickitardif
Contributor Author

I had imagined multi-dimensional sets as a list of lists, but maybe that is too complicated.

Perhaps instead folks use both Dataset and ItemList as necessary, but we still have DataItem to allow for metadata about individual items. The above example becomes:

{
  "@context": "http://schema.org/",
  "@type": ["Dataset", "ItemList"],
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "Person",
      "name": "Alice",
      "email": "alice@example.com",
      "dateCreated": "2014-07-01"
    },
    {
      "@type": "Person",
      "name": "Bob",
      "email": "bob@example.com",
      "dateCreated": "2015-01-02"
    }
  ]
}

@vickitardif
Contributor Author

I realize there is an error in my JSON-LD. It should be:

{
  "@context": "http://schema.org/",
  "@type": ["Dataset", "ItemList"],
  "name": "Company directory",
  "itemListElement": [
    {
      "@type": "DataItem",
      "dateCreated": "2014-07-01",
      "item": {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    },
    {
      "@type": "DataItem",
      "dateCreated": "2015-01-02",
      "item": {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    }
  ]
}

@danbri
Contributor

danbri commented Jul 31, 2015

I'm still missing something. Aren't all docs carrying schema.org datasets already? What is the value in explicitly saying "hey, I'm a dataset" and "hey, this is a thing mentioned in the dataset" all the way through? It feels like the overuse of WebPage we've seen, ... an awkward form of reification where you're not entirely sure what is being described or how deep into the sub-graph the properties apply. If you just want to wrap provenance metadata around chunks of schema.org-flavoured RDF, perhaps JSON-LD named graphs are worth a look? http://www.w3.org/TR/json-ld/#named-graphs

For multidimensional numeric data, http://www.w3.org/TR/vocab-data-cube/ could be a good fit.

@vickitardif
Contributor Author

This is more for data feeds that are not necessarily web pages or email messages. In some cases, the full data set is not sent at once, so it is useful to know the creation time of individual items.

@danbri
Contributor

danbri commented Jul 31, 2015

Here's a quick attempt at using JSON-LD named graphs. Try it in http://json-ld.org/playground/

[
  {
    "@context": "http://schema.org/",
    "@id": "#dataitem-73",
    "generatedAt": "2014-07-01",
    "@graph": [
      {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    ]
  },
  {
    "@context": "http://schema.org/",
    "@id": "#dataitem-74",
    "generatedAt": "2014-07-02",
    "@graph": [
      {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    ]
  }
]

The quads that come back are:

<http://json-ld.org/playground/#dataitem-73> <http://schema.org/generatedAt> "2014-07-01" .
<http://json-ld.org/playground/#dataitem-74> <http://schema.org/generatedAt> "2014-07-02" .
_:b0 <http://schema.org/email> "alice@example.com" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://schema.org/name> "Alice" <http://json-ld.org/playground/#dataitem-73> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://json-ld.org/playground/#dataitem-73> .
_:b1 <http://schema.org/email> "bob@example.com" <http://json-ld.org/playground/#dataitem-74> .
_:b1 <http://schema.org/name> "Bob" <http://json-ld.org/playground/#dataitem-74> .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> <http://json-ld.org/playground/#dataitem-74> .

... where the final value in each row is a graph id (data item, in your terminology).
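
To make the role of that fourth term concrete, here is a rough sketch that groups quads by graph id. It assumes simple N-Quads lines like the ones above (no escaped quotes, language tags, or datatype suffixes) and uses a hand-rolled regex rather than a real RDF parser; `group_by_graph` and `sample` are illustrative names:

```python
import re
from collections import defaultdict

# subject, predicate, object (quoted literal or single token), optional graph id
QUAD = re.compile(r'^(\S+) (\S+) (".*?"|\S+)(?: (\S+))? \.$')


def group_by_graph(nquads):
    """Map each graph id (None for the default graph) to its triples."""
    graphs = defaultdict(list)
    for line in nquads.strip().splitlines():
        match = QUAD.match(line.strip())
        if match is None:
            continue  # skip lines this simplified pattern cannot read
        subject, predicate, obj, graph = match.groups()
        graphs[graph].append((subject, predicate, obj))
    return dict(graphs)


sample = """
<http://json-ld.org/playground/#dataitem-73> <http://schema.org/generatedAt> "2014-07-01" .
_:b0 <http://schema.org/name> "Alice" <http://json-ld.org/playground/#dataitem-73> .
_:b1 <http://schema.org/name> "Bob" <http://json-ld.org/playground/#dataitem-74> .
"""

grouped = group_by_graph(sample)
for graph_id, triples in grouped.items():
    print(graph_id, triples)
```

Statements about a graph itself (such as generatedAt) land in the default graph under the key None, while the Person triples are filed under their graph id.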

@vickitardif
Contributor Author

I'm not sure I understand. "generatedAt" is not valid schema.org, so we would need to change the context to include another vocabulary.

@danbri
Contributor

danbri commented Jul 31, 2015

Yeah, that was the example property name used in the W3C spec; I didn't tweak it. We probably have something appropriate in schema.org, or could add it, or use a different context. But does the quads / named graph approach look worth considering?

@danbri
Contributor

danbri commented Jul 31, 2015

    [
      {
        "@context": "http://schema.org/",
        "@id": "#dataitem-73",
        "dateCreated": "2014-07-01",
        "@graph": [
          {
            "@type": "Person",
            "name": "Alice",
            "email": "alice@example.com"
          }
        ]
      },
      {
        "@context": "http://schema.org/",
        "@id": "#dataitem-74",
        "dateCreated": "2014-07-02",
        "@graph": [
          {
            "@type": "Person",
            "name": "Bob",
            "email": "bob@example.com"
          }
        ]
      }
    ]

@vickitardif
Contributor Author

I am not sure I understand. As written, I have two disconnected graphs: Alice's graph and Bob's. I still need something to say they are actually parts of a larger graph. Do you disagree with adding something to Dataset to join the graphs?

@chaals
Contributor

chaals commented Aug 5, 2015

can't we use the collections stuff for that?

@vickitardif
Contributor Author

I spoke with @danbri offline to better understand his concerns. It is probably too much to take on modeling all data sets in one go, so I would like to refocus the discussion on supporting data feeds, which may come as JSON-LD instead of web pages. To that end, I propose:

  • Add a new type Thing > CreativeWork > Dataset > DataFeed
  • DataFeed would have the property dataFeedElement which expects Text, Thing, or DataFeedItem.
  • Add the new type Thing > Intangible > DataFeedItem
  • DataFeedItem has the following properties:
    • item: An entity represented by an entry in a list or data feed. (Note item already exists for ListItem.)
    • dateCreated: The datetime the data feed item was created.
    • dateModified: The last time the data feed item was modified.
    • dateDeleted: The datetime the item was removed from the data feed.

The properties http://schema.org/dateCreated and http://schema.org/dateModified exist on http://schema.org/CreativeWork. The proposal is to expand their domains to include DataFeedItem.

The sample JSON-LD becomes:

{
  "@context": "http://schema.org/",
  "@type": "DataFeed",
  "name": "Company directory",
  "dateModified": "2015-01-02",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "dateCreated": "2014-07-01",
      "item": {
        "@type": "Person",
        "name": "Alice",
        "email": "alice@example.com"
      }
    },
    {
      "@type": "DataFeedItem",
      "dateModified": "2015-01-02",
      "item": {
        "@type": "Person",
        "name": "Bob",
        "email": "bob@example.com"
      }
    }
  ]
}
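
On the consuming side, a feed like this could be filtered for recent changes roughly as follows. This is only a sketch: it assumes every date is a plain YYYY-MM-DD string (so string comparison matches date order), and the function name `changed_since` is made up here:

```python
def changed_since(feed, cutoff):
    """Return the wrapped items of entries created or modified on/after cutoff."""
    changed = []
    for entry in feed.get("dataFeedElement", []):
        # dataFeedElement may also hold Text or a bare Thing; skip those here.
        if not isinstance(entry, dict) or entry.get("@type") != "DataFeedItem":
            continue
        # Prefer the modification date, fall back to the creation date.
        stamp = entry.get("dateModified") or entry.get("dateCreated")
        item = entry.get("item")
        if stamp is not None and item is not None and stamp >= cutoff:
            changed.append(item)
    return changed


feed = {
    "@context": "http://schema.org/",
    "@type": "DataFeed",
    "name": "Company directory",
    "dateModified": "2015-01-02",
    "dataFeedElement": [
        {
            "@type": "DataFeedItem",
            "dateCreated": "2014-07-01",
            "item": {"@type": "Person", "name": "Alice", "email": "alice@example.com"},
        },
        {
            "@type": "DataFeedItem",
            "dateModified": "2015-01-02",
            "item": {"@type": "Person", "name": "Bob", "email": "bob@example.com"},
        },
    ],
}

print([person["name"] for person in changed_since(feed, "2015-01-01")])
```

The dates are read from the DataFeedItem wrapper, not from the Person, so the check is about the feed entry rather than the real-world entity.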

@danbri
Contributor

danbri commented Aug 14, 2015

Thanks @vholland, this is a lot clearer. Can we just run through the date-related properties? At first glance they seem more alike than on second reading, at least for me:

  • dateCreated: "The datetime the data feed item was created." - is this the date when it was added to the feed, or the date the actual end object/item was created?
  • dateModified: "The last time the data feed item was modified." - this sounds like the thing itself being modified, not the feed entry i.e. DataFeedItem ( = its metadata). It could be either though.
  • dateDeleted: "The datetime the item was removed from the data feed." - definitely all about the proxy item in the feed.

The problem here is probably just wording: "item" in the definitions could either mean the DataFeedItem, or the actual real world item (e.g. the Person "Bob") that is the value of the item property. Let's try to rephrase so it is clearer on first reading.

vickitardif added a commit to vickitardif/schemaorg that referenced this issue Sep 11, 2015
@vickitardif
Contributor Author

Good point, regarding wording. In all cases, the dates apply to the proxy item in the feed.

I created pull request #765 with the listed changes. I took the liberty of extending the range for dateCreated and dateModified to also accept DateTime, as feeds (and, increasingly, online content) have creation dates that include times.
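
For consumers, accepting both Date and DateTime means handling two string shapes. A small sketch, assuming ISO 8601 values without a timezone suffix (values with offsets are outside this sketch); `parse_feed_date` is a hypothetical helper, not part of any schema.org tooling:

```python
from datetime import date, datetime


def parse_feed_date(value):
    """Parse a Date ("2014-07-01") or DateTime ("2015-01-02T09:30:00") string."""
    if "T" in value:
        return datetime.fromisoformat(value)
    # Date-only values are treated as midnight so both shapes compare cleanly.
    day = date.fromisoformat(value)
    return datetime(day.year, day.month, day.day)


created = parse_feed_date("2014-07-01")
modified = parse_feed_date("2015-01-02T09:30:00")
print(created, modified, created < modified)
```

Normalizing to one type keeps comparisons between date-only and timestamped entries straightforward.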

@danbri
Contributor

danbri commented Sep 14, 2015

This looks good. I'm merging it in so people have a concrete target to review...

danbri added a commit that referenced this issue Sep 14, 2015
Issue #688: Added DataFeed and DataFeedItem including examples and
@danbri
Contributor

danbri commented Sep 14, 2015

/cc @chaals @ajax-als @tilid @pmika @mfhepp @shankarnat @rvguha

Ok, please take a look here: http://sdo-phobos.appspot.com/DataFeed

There's a JSON-LD example (thanks, Vicki).

The idea is, within the constraints of a normal schema.org description (no fancy multi-graph stuff), to provide more feed-like metadata on the items described, to aid consumption, aggregation etc. I looked into some other options and have ended up more convinced than when I started :) that this is useful...

@vickitardif
Contributor Author

I forgot to add that one use for this is the supporting data for a software application. (For example, configuration data.)

I'll create a new pull request shortly.

vickitardif added a commit to vickitardif/schemaorg that referenced this issue Oct 1, 2015
@vickitardif
Contributor Author

Implemented in pull request #822.

danbri added a commit that referenced this issue Oct 1, 2015
Issue #688: Added supportingData to SoftwareApplication.
@elf-pavlik
Contributor

This is more for data feeds that are not necessarily web pages or email messages. In some cases, the full data set is not sent at once, so it is useful to know the creation time of individual items.

Will those 'DataFeeds' need paging? I see 4 independent (uncoordinated) developments here

I don't know if the more the merrier applies here 😉

@danbri
Contributor

danbri commented Nov 6, 2015

Fixed in http://schema.org/docs/releases.html#v2.2 - thanks all. Closing as main issue is addressed, feel free to continue discussions!
