Assisting vocabulary selection

timrdf edited this page Jan 25, 2012 · 28 revisions

How does DataFAQs play a role in vocabulary selection? Would DataFAQs be used as part of an iterative process?

Yes. And Yes.

The vocabulary that one chooses to model their domain is critically important. Although many vocabularies may adequately communicate the topic of our interests, some vocabularies have more practical value than others.

To take an example from our most recent conversion, consider two alternate RDF forms of the same tabular row:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix prov:    <http://www.w3.org/ns/prov-o/> .

@prefix local_vocab: 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/vocab/> .
@prefix e1: 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/vocab/enhancement/1/> .
@prefix biographical-directory-of-the-united-states-congress: 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/> .
@prefix value_of_state: 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/value-of/state/> .
@prefix :      
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04/> .


:congressperson_49 

   dcterms:isReferencedBy 
    <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;
   void:inDataset 
    <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;

   a local_vocab:Congressperson , foaf:Person ;

   foaf:firstName   "John" ;
   foaf:family_name "BULL" ;

   e1:congress   biographical-directory-of-the-united-states-congress:congress_0 ;
   foaf:memberOf biographical-directory-of-the-united-states-congress:congress_0 ; # sic
   foaf:workInfoHomepage <http://bioguide.congress.gov/scripts/biodisplay.pl?index=B001047> , 
                         <http://bioguide.congress.gov/scripts/guidedisplay.pl?index=B001047> , 
                         <http://bioguide.congress.gov/scripts/bibdisplay.pl?index=B001047> ;

   con:preferredURI      biographical-directory-of-the-united-states-congress:B001047 ;
   prov:specializationOf biographical-directory-of-the-united-states-congress:B001047 ;

   e1:doc "2012-01-04T02:12:01" ;
   dbpediaprop:state value_of_state:SC; 
.

value_of_state:SC 
   dcterms:identifier "SC" ;
   rdfs:label         "SC" ;
   owl:sameAs dbpedia:South_Carolina , 
             <http://sws.geonames.org/4597040/> , 
             govtrackusgov:SC .

Many semantic web developers would agree that some of the modeling above is slightly better than the modeling that follows:

@prefix : 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04/> .
@prefix raw: 
  <http://logd.tw.rpi.edu/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/vocab/raw/> .

:thing_49 
  dcterms:isReferencedBy 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;
  void:inDataset 
  <http://localhost/source/congress-gov/dataset/biographical-directory-of-the-united-states-congress/version/2012-Jan-04> ;

   raw:first_name "John" ;
   raw:last_name  "BULL" ;
   raw:congress   "0" ;
   raw:p_url      "http://bioguide.congress.gov/scripts/biodisplay.pl?index=B001047" ;
   raw:doc        "2012-01-04T02:12:01" ;
   raw:state      "SC" ;
   raw:death      "1802" ;
   raw:birth      "1740c" ;
   raw:party      " " ;
   raw:position   "ContCong" ;
   raw:c_yr       "" ;
   ov:csvRow      "49"^^xsd:integer .

But what, exactly, is better about it? Well, lots of things. Different people care about different aspects of the difference shown above. Some claims about quality may include:

  • foaf:firstName is way better than raw:first_name because 400 systems recognize it and display it.
  • raw:p_url as a URI and label is incomprehensible to anyone that did not build this database. And it's a literal, which means that RDF agents will not know that it can be resolved on the web. Using foaf:workInfoHomepage is way better because it already exists to associate a person with their work homepages. And systems recognize foaf already. And people know foaf already.
  • e1:congress is way better than raw:congress because its value is a URI that can be further described. Being stuck with raw:congress's value "0" is very uninformative. What do I do with zero? At the very least, we can type the biographical-directory-of-the-united-states-congress:congress_0 and start describing its temporal interval, etc.
  • ACK! Someone started using foaf:memberOf, when that URI is not defined in the foaf namespace! That violates Linked Data principles. On the other hand, it's pretty obvious what it is -- it's the inverse of foaf:member and we can use it and have systems recognize it even without the FOAF Elite defining it in their vocabulary. Practicality can trump principles. Depending on who you ask.
  • We might not know what local_vocab:Congressperson is, but at least we know it's a kind of person (foaf:Person). We can work with that.
  • dbpediaprop:state value_of_state:SC is way better than raw:state "SC" because lots of people run to dbpedia for example data, so more people will start using dbpediaprop:state. But when more people start using it without clear, established rules, they'll use it inconsistently. So the relation will have many meanings and run the risk of becoming meaningless.
  • That is so redundant! dcterms:isReferencedBy AND void:inDataset?! Well, some recognize one, some recognize the other. What if we want to talk to both of them? We say both.
  • Hey! http://www.w3.org/ns/prov-o/ 404s. What gives? The W3C working group isn't done yet.
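
One of the claims above (that predicates from shared vocabularies are worth more than dataset-local ones) can be turned into a number. The following is a minimal sketch in Python, not part of DataFAQs itself: the predicate lists are copied from the two examples above, while the function name `shared_vocabulary_ratio` and the choice of which prefixes count as "dataset-local" are assumptions made only for illustration.

```python
# Predicates used on the row in each modeling, copied from the two examples.
# ("a" in Turtle is shorthand for rdf:type.)
ENHANCED = [
    "dcterms:isReferencedBy", "void:inDataset", "rdf:type",
    "foaf:firstName", "foaf:family_name", "e1:congress", "foaf:memberOf",
    "foaf:workInfoHomepage", "con:preferredURI", "prov:specializationOf",
    "e1:doc", "dbpediaprop:state",
]
RAW = [
    "dcterms:isReferencedBy", "void:inDataset",
    "raw:first_name", "raw:last_name", "raw:congress", "raw:p_url",
    "raw:doc", "raw:state", "raw:death", "raw:birth", "raw:party",
    "raw:position", "raw:c_yr", "ov:csvRow",
]

# Assumption: these prefixes name vocabularies defined only for this dataset.
DATASET_LOCAL_PREFIXES = {"raw", "e1", "local_vocab"}


def shared_vocabulary_ratio(predicates, local_prefixes=DATASET_LOCAL_PREFIXES):
    """Return the fraction of predicates drawn from non-local vocabularies."""
    if not predicates:
        return 0.0
    shared = [p for p in predicates if p.split(":", 1)[0] not in local_prefixes]
    return len(shared) / len(predicates)


enhanced_ratio = shared_vocabulary_ratio(ENHANCED)
raw_ratio = shared_vocabulary_ratio(RAW)
print(f"enhanced: {enhanced_ratio:.2f}, raw: {raw_ratio:.2f}")
# prints "enhanced: 0.83, raw: 0.21"
```

This captures only one perspective. Someone who cares about foaf:memberOf being undefined, or about dbpediaprop:state being used inconsistently, would write a different measure and might rank the two modelings differently.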

DataFAQs: the evaluation framework that gives you a voice.

DataFAQs is not designed to declare authoritative quality of the datasets it comes by. Instead, it is a framework to allow interested stakeholders to express, survey, and understand the aspects of quality that they and others value. This increased community understanding -- accelerated by automated, asynchronous feedback -- provides the basis for stakeholders to make better, more informed decisions about the vocabulary that they use. Those decisions are based on concrete, qualitative information that is provided by the community, for the community. DataFAQs just connects all of the dots, accumulates perspectives on datasets, and allows you to explore what the community thinks about your dataset.

DataFAQs can and will be used to assist vocabulary selection.

It is important to remember that DataFAQs is not only a resource that provides "grades" for datasets that you point it to. More importantly, it is a framework that allows any stakeholder to reflect their needs, interests, or preferences when it comes to the quality of any dataset.

How to use DataFAQs to assist vocabulary selection.

How can we help stakeholders find high-quality vocabularies for the linked data they plan to publish, and subsequently evaluate the resulting quality of that linked data?

DataFAQs connects data publishers with potential data consumers.

  • Data publishers list their datasets on CKAN, an existing dataset collection infrastructure that is available at http://thedatahub.org.
  • Datasets are evaluated by evaluation services that data consumers and curators deploy in their part of the web and register in an evaluation service catalog (e.g. here). The evaluation services follow the existing SADI Semantic Web Services framework, which accepts RDF descriptions of a dataset and returns an RDF description of its evaluation.
  • The dataset evaluations are periodically accumulated, which creates a three-dimensional basis (dataset, evaluation service, and time) for community analysis.
  • DataFAQs exposes these accumulated evaluations through a website that provides custom views tailored to the data publisher, consumer, or curator. Specific quality measures can be viewed over time to see how the LOD cloud changes.
  • Stakeholders interested in finding high-quality vocabularies can use the accumulated analysis to find and compare real uses of the vocabularies that they are considering. Further, they can consider the quality measures already provided by the community and see how different vocabularies compare with respect to each measure.
  • After choosing a particular vocabulary, the stakeholder can create (or endorse existing) evaluation services to reflect the characteristics that they valued when deciding which vocabulary to use. This not only communicates to potential consumers what the publisher valued when making design decisions, but also allows the publisher to monitor the quality measures of their data. They will also be able to see how consumers view their published data, and can methodically select and respond to this automated feedback.
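
The three-dimensional basis described above (dataset, evaluation service, and time) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: real DataFAQs evaluations are RDF descriptions returned by SADI services, whereas here each one is reduced to a single hypothetical numeric score, and the `Evaluation` class, both helper functions, and all dataset and service names are made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evaluation:
    dataset: str   # dataset identifier (hypothetical values below)
    service: str   # evaluation service that produced the result
    day: date      # when the evaluation was accumulated
    score: float   # hypothetical numeric quality measure


def measure_over_time(evaluations, dataset, service):
    """Return (day, score) pairs for one dataset and one service, oldest first."""
    rows = [e for e in evaluations if e.dataset == dataset and e.service == service]
    return [(e.day, e.score) for e in sorted(rows, key=lambda e: e.day)]


def latest_scores(evaluations, service):
    """Return each dataset's most recent score from one service."""
    latest = {}
    for e in sorted(evaluations, key=lambda e: e.day):
        if e.service == service:
            latest[e.dataset] = e.score  # later days overwrite earlier ones
    return latest


# Made-up accumulated evaluations, for illustration only.
accumulated = [
    Evaluation("dataset-a", "shared-vocab-ratio", date(2012, 1, 4), 0.21),
    Evaluation("dataset-a", "shared-vocab-ratio", date(2012, 1, 25), 0.83),
    Evaluation("dataset-b", "shared-vocab-ratio", date(2012, 1, 25), 0.50),
    Evaluation("dataset-a", "resolvable-uris", date(2012, 1, 25), 1.0),
]

print(measure_over_time(accumulated, "dataset-a", "shared-vocab-ratio"))
print(latest_scores(accumulated, "shared-vocab-ratio"))
```

Fixing the dataset and service gives a publisher a view of one measure over time; fixing the service and taking the latest scores lets a stakeholder compare datasets (and the vocabularies they use) against a single measure.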