Tim L edited this page Jan 17, 2014 · 30 revisions

What is first

What we will cover

This page describes the concept and implementation of a technique to summarize arbitrary RDF graphs. We'll summarize the named graphs in http://ieeevis.tw.rpi.edu/sparql as a running example.

Let's get to it!

Invoking the summarizer

vsr-spo-balance.sh wraps the Java invocation using the situate shell paths pattern.

In a separate Prizms node, we set up the dataset "sparql" with source "ieeevis-tw-rpi-edu" at directory data/source/ieeevis-tw-rpi-edu/sparql.

bash-3.2$ vsr-spo-balance.sh 
usage: RepositorySummarizer { -(sysin) [reportURI | .] |
                              -r(emote) serverURL repositoryID <reportURI | .> [context-to-summarize ...] |
                              -d(irectory) path/to/sesame-native-dir/ [context-to-summarize ...] |
                              -f(ile) path/to/a.rdf <reportURI | .> }
where:
   -(sysin):     Summarize the RDF on standard in; print summary report to standard out.
                 If reportURI or . are provided, print TRiG instead of RDF/XML.
   -r(remote):   Summarize listed specimenContexts in repositoryID at serverURL. 
                 If no specimenContexts listed, summarize all contexts in repository.
   -d(irectory): Summarize listed specimenContexts in sesame native directory. 
                 If no specimenContexts listed, summarize all contexts in directory.
   -f(ile):      Summarize the RDF in file; print summary report to standard out.

( version: 2013-Apr-03 )

SPARQLRepository

Summary description

The following is a sketch of the summary description; the implementation's actual output differs slightly.

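@prefix sio:  <http://semanticscience.org/resource/> .
@prefix sd:   <http://www.w3.org/ns/sparql-service-description#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix vsr:  <http://purl.org/twc/vocab/vsr#> .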

# We analyzed a graph with name http://xmlns.com/foaf/0.1 that was
# provided by a SPARQL endpoint healthdata.tw.rpi.edu/sparql

<http://healthdata.tw.rpi.edu/sparql?query=PREFIX+sd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fsparql-service-description%23%3E+CONSTRUCT+%7B+%3Fendpoints_named_graph+%3Fp+%3Fo+%7D+WHERE+%7B+GRAPH+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%3E+%7B+%5B%5D+sd%3Aurl+%3Chttp%3A%2F%2Fhealthdata.tw.rpi.edu%2Fsparql%3E%3B+sd%3AdefaultDatasetDescription+%5B+sd%3AnamedGraph+%3Fendpoints_named_graph+%5D+.+%3Fendpoints_named_graph+sd%3Aname+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%3E%3B+%3Fp+%3Fo+.+%7D+%7D>
   a sd:NamedGraph;
   sd:name <http://xmlns.com/foaf/0.1>;
   prov:hadLocation <http://healthdata.tw.rpi.edu/sparql>;
.

# We derived a few datasets during our analysis.

<spo_balance_for_foaf_graph>
   a void:Dataset, vsr:SPOBalanceSet;
   void:subset <subjects>, <predicates>, <objects>;
   prov:wasDerivedFrom <http://healthdata.tw.rpi.edu/sparql?query=PREFIX+sd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fsparql-service-description%23%3E+CONSTRUCT+%7B+%3Fendpoints_named_graph+%3Fp+%3Fo+%7D+WHERE+%7B+GRAPH+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%3E+%7B+%5B%5D+sd%3Aurl+%3Chttp%3A%2F%2Fhealthdata.tw.rpi.edu%2Fsparql%3E%3B+sd%3AdefaultDatasetDescription+%5B+sd%3AnamedGraph+%3Fendpoints_named_graph+%5D+.+%3Fendpoints_named_graph+sd%3Aname+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%3E%3B+%3Fp+%3Fo+.+%7D+%7D>;
.
<resources> # This needs to be split up into S and O...
   a void:Dataset, vsr:ResourceSet;
   sio:count 99;
   sio:has-member <http://xmlns.com/foaf/0.1/workplaceHomepage>,
                  <http://xmlns.com/foaf/0.1/maker>,
                  <http://purl.org/dc/elements/1.1/description>,
                  <http://www.w3.org/2002/07/owl#Class>,
                  <http://xmlns.com/foaf/0.1/page>,
                  <http://xmlns.com/foaf/0.1/birthday>,
                  ... 93 more ...
.
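To make the idea of the summary concrete, here is a minimal Python sketch of the kind of counting that could sit behind such a description. It assumes that the "SPO balance" compares the numbers of distinct subjects, predicates, and objects in a graph, and that the per-predicate bins count the triples using each predicate. The function name `spo_balance` and the toy triples are hypothetical; this is not the RepositorySummarizer implementation.

```python
from collections import Counter

def spo_balance(triples):
    """Count distinct subjects, predicates, and objects, plus triples per predicate."""
    subjects, objects = set(), set()
    predicates = Counter()
    for s, p, o in triples:
        subjects.add(s)
        predicates[p] += 1  # number of triples that use this predicate
        objects.add(o)
    return {
        "subjects": len(subjects),
        "predicates": len(predicates),
        "objects": len(objects),
        "predicate_occurrences": dict(predicates),
    }

# Hypothetical toy graph for illustration only; not the contents of the FOAF graph.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
toy_triples = [
    (FOAF + "Person", RDF_TYPE, OWL_CLASS),
    (FOAF + "Agent", RDF_TYPE, OWL_CLASS),
    (FOAF + "maker", RDFS_RANGE, FOAF + "Agent"),
]

balance = spo_balance(toy_triples)
print(balance)
```

Keeping the per-predicate counts next to the three totals mirrors the split into subject, predicate, and object subsets in the sketch above.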

src/spo-balance.sh wraps the call to RepositorySummarizer.java

We use a Sesame repository served by Tomcat, which can be started by running apache-tomcat-7.0.34/bin/startup.sh.

log.rtf contains implementation details.

src/spo-balance.sh --help

RepositorySummarizer version: 2013-Jan-14
usage: RepositorySummarizer { -(sysin) [reportURI | .] |
                              -r(emote) serverURL repositoryID <reportURI | .> [context-to-summarize ...] |
                              -d(irectory) path/to/sesame-native-dir/ [context-to-summarize ...] |
                              -f(ile) path/to/a.rdf <reportURI | .> }
where:
   -(sysin):     Summarize the RDF on standard in; print summary report to standard out.
                 If reportURI or . are provided, print TRiG instead of RDF/XML.
   -r(remote):   Summarize listed specimenContexts in repositoryID at serverURL. 
                 If no specimenContexts listed, summarize all contexts in repository.
   -d(irectory): Summarize listed specimenContexts in sesame native directory. 
                 If no specimenContexts listed, summarize all contexts in directory.
   -f(ile):      Summarize the RDF in file; print summary report to standard out.

( version: 2013-Jan-14 )

Stereotyping predicate counts by preferred vocabularies

Color the predicate counts according to whether each predicate occurs in a given curated list of vocabulary namespaces.
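As a rough illustration, the Python sketch below labels each predicate as "preferred" or "other" depending on whether its namespace appears in a curated set. The namespace is assumed to be everything up to the last `#` or `/` in the predicate URI, which is a simplification. The helper names and the one-entry curated list are hypothetical.

```python
def namespace_of(uri):
    """Return the part of the URI up to and including its last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]

def stereotype(predicate_counts, preferred_namespaces):
    """Map each predicate to a (label, count) pair based on its namespace."""
    return {
        predicate: (
            "preferred" if namespace_of(predicate) in preferred_namespaces else "other",
            count,
        )
        for predicate, count in predicate_counts.items()
    }

# Hypothetical curated list of preferred vocabulary namespaces.
preferred = {"http://purl.org/dc/terms/"}

# Example predicate counts for illustration.
counts = {
    "http://purl.org/dc/terms/created": 8,
    "http://usefulinc.com/ns/doap#browse": 2,
}

labels = stereotype(counts, preferred)
for predicate, (label, count) in labels.items():
    print(label, count, predicate)
```

A renderer could then pick one color for "preferred" predicates and another for the rest.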

Preferred vocabulary "word clouds"

Decorate the SPO balance with a "word cloud" of prefixes for the [preferred] namespaces that the graph uses. This aggregated information should be derivable from the SPO repository summary RDF description.

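The following query, run at http://opendap.tw.rpi.edu/sparql, lists each predicate with its count and, where available, its defining vocabulary: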

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX vsr: <http://purl.org/twc/vocab/vsr#>
PREFIX void: <http://rdfs.org/ns/void#>

select ?vocabulary ?predicate ?count
where {
   <http://purl.org/twc/vocab/vsr#RepositorySummarizer_2014-Jan-15_15-44_1389800671057_ms/spo>
       a vsr:SPODataset;
       void:subset [
          a vsr:PredicatesDataset;
          void:subset [
             a vsr:PredicateOccurrenceDataset;
             owl:hasValue ?predicate;
             sio:count    ?count
          ];
       ] .
    optional { ?predicate rdfs:isDefinedBy ?vocabulary }
}
order by ?vocabulary ?predicate ?count

The following query returns the predicates and their counts for the summary whose URI contains 1389898600145:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX sio:  <http://semanticscience.org/resource/>
PREFIX vsr:  <http://purl.org/twc/vocab/vsr#>
PREFIX void: <http://rdfs.org/ns/void#>

select distinct ?predicate ?count
where {
   ?spo
       a vsr:SPODataset;
       void:subset [    # </spo/p>
          a vsr:PredicatesDataset;
          void:subset [ # </spo/p/bin/1>, </spo/p/bin/2>, ...
             a vsr:PredicateOccurrenceDataset;
             owl:onProperty rdf:predicate;
             owl:hasValue  ?predicate;
             sio:count     ?count
          ];
       ] .
   filter(regex(str(?spo),'1389898600145'))
}
order by ?predicate ?count
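Queries like these are typically sent to a SPARQL endpoint over HTTP by form-encoding the query text into a `query` parameter. The Python sketch below builds such a request URL without sending it. The helper name `sparql_get_url` and the short placeholder query are hypothetical, and the endpoint is assumed to accept GET requests in this form.

```python
from urllib.parse import urlencode

def sparql_get_url(endpoint, query):
    """Build a GET request URL by form-encoding the query into a 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

# Placeholder query for illustration; substitute one of the queries above.
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = sparql_get_url("http://opendap.tw.rpi.edu/sparql", query)
print(url)
```

The encoding turns spaces into `+` and reserved characters into percent escapes, the same form seen in the long named-graph URI in the summary description sketch.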

If a dataset uses the following properties with the following frequencies, then we can model its vocabulary usage as the following RDF, using void:vocabulary and void:propertyPartition.

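@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix sio:  <http://semanticscience.org/resource/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix vsr:  <http://purl.org/twc/vocab/vsr#> .

# This was already provided by the SPO summary calculation: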
<spo/p/bin/1>
    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <http://usefulinc.com/ns/doap#anon-root>;
    sio:count "1"^^xsd:int;
.
<spo/p/bin/2>
    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <http://usefulinc.com/ns/doap#audience>;
    sio:count "1"^^xsd:int;
.
<spo/p/bin/3>
    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <http://usefulinc.com/ns/doap#browse>;
    sio:count "2"^^xsd:int;
.
<spo/p/bin/4>
    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <http://purl.org/dc/terms/author>;
    sio:count "1"^^xsd:int;
.
<spo/p/bin/5>
    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <http://purl.org/dc/terms/contributor>;
    sio:count "6"^^xsd:int;
.
<spo/p/bin/6>
    a vsr:Bin, vsr:Dataset, vsr:PredicateOccurrenceDataset;
    owl:onProperty rdf:predicate;
    owl:hasValue  <http://purl.org/dc/terms/created>;
    sio:count "8"^^xsd:int;
.

<spo/p/ns/doap> # Start a new branch; name it by the namespace's prefix when we have one, otherwise by a hash of the namespace.
   owl:onProperty rdfs:isDefinedBy;
   owl:hasValue    <http://usefulinc.com/ns/doap#>;
   sio:count 4;
   a void:Dataset;
   void:vocabulary <http://usefulinc.com/ns/doap#>;
   void:propertyPartition <spo/p/bin/1>, # These predicate bins are already defined.
                          <spo/p/bin/2>,
                          <spo/p/bin/3>;
.

<spo/p/ns/dcterms>
   owl:onProperty rdfs:isDefinedBy;
   owl:hasValue    <http://purl.org/dc/terms/>;
   sio:count 15;
   a void:Dataset;
   void:vocabulary <http://purl.org/dc/terms/>;
   void:propertyPartition <spo/p/bin/4>, # These predicate bins are already defined.
                          <spo/p/bin/5>,
                          <spo/p/bin/6>;
.
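The roll-up from predicate bins to namespace datasets can be sketched in Python as follows. This assumes the vocabulary of a predicate is found by cutting its URI at the last `#` or `/`, whereas the model above ties predicates to vocabularies through rdfs:isDefinedBy. The function names are hypothetical; the bin identifiers, predicates, and counts are copied from the RDF above.

```python
from collections import defaultdict

def namespace_of(uri):
    """Return the part of the URI up to and including its last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]

def partition_by_namespace(bins):
    """Group (bin_id, predicate, count) rows by namespace, summing the counts."""
    partitions = defaultdict(lambda: {"count": 0, "bins": []})
    for bin_id, predicate, count in bins:
        part = partitions[namespace_of(predicate)]
        part["count"] += count          # becomes sio:count of the namespace dataset
        part["bins"].append(bin_id)     # becomes void:propertyPartition
    return dict(partitions)

# The six predicate bins from the RDF above.
bins = [
    ("spo/p/bin/1", "http://usefulinc.com/ns/doap#anon-root", 1),
    ("spo/p/bin/2", "http://usefulinc.com/ns/doap#audience", 1),
    ("spo/p/bin/3", "http://usefulinc.com/ns/doap#browse", 2),
    ("spo/p/bin/4", "http://purl.org/dc/terms/author", 1),
    ("spo/p/bin/5", "http://purl.org/dc/terms/contributor", 6),
    ("spo/p/bin/6", "http://purl.org/dc/terms/created", 8),
]

partitions = partition_by_namespace(bins)
for namespace, part in partitions.items():
    print(namespace, part["count"], part["bins"])
```

The summed counts agree with the sio:count values on the two namespace datasets: 4 for doap and 15 for dcterms.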

Feature space to cluster graphs by similarity

The node

What is next