This page is about our (Jim and Tim) submission to IEEE Intelligent Systems Special Issue on Linked Open Government Data, which applies the Functional Requirements for Bibliographic Records and cryptographic digests to address some provenance challenges when integrating data in distributed, uncoordinated environments with tools like csv2rdf4lod-automation. (This page has a purl that was mentioned in the paper).
The use case begins with two different data integrators retrieving the URLs:
- http://www.data.gov/download/1554/csv and
- http://explore.data.gov/download/5gah-bvex/CSV, respectively.
Each integrator does something with what they got and offers up the results to a data consumer, who must choose among them for their analysis, application, demonstration, sales pitch, funding decision, voting decision, etc. Some of the things the data integrators did were the same; other things differed. How can data consumers easily recognize these commonalities and distinctions so that they can make an informed decision about which data product is best suited to their needs? Conversely, what can the data integrators do to provide assurances that their results are not just as good as, but better than, the original government data?
The use case is centered around the following events:
- Event 1: Data Integrator E retrieves http://explore.data.gov/download/5gah-bvex/CSV
- Event 2: Data Integrator W retrieves http://www.data.gov/download/1554/csv
- Event 3: Data Integrator E converts its local us_economic_assistance.csv to raw RDF
- Event 4: Data Integrator E reserializes the raw RDF from Turtle to RDF/XML syntax.
- Event 5: Data Integrator E converts its local us_economic_assistance.csv to enhanced RDF
- Event 6: Both Data Integrators (E and W) publish their results on the web.
- Event 7: Data Consumer C must choose from among the following:
- Choice 1: http://explore.data.gov/download/5gah-bvex/CSV
- Choice 2: http://www.data.gov/download/1554/csv
- Choice 3: E's us_economic_assistance.csv
- Choice 4: W's us_economic_assistance.csv
- Choice 5: E's us_economic_assistance.csv.raw.ttl
- Choice 6: E's us_economic_assistance.csv.raw.ttl.rdf
- Choice 7: E's us_economic_assistance.csv.e1.ttl
Which would you choose? Did you make a mistake? How do you know?
Let's see how adding FRBR stacks named with cryptographic hashes can help.
Real, concrete example (the details)
This page walks through the technical details that needed to be summarized in the paper. It includes pointers to the ontology, code, sample output, and enormous diagrams.
The github repository contains the automation required to [automatically retrieve](Automated creation of a new Versioned Dataset) the datasets yourself. Have csv2rdf4lod-automation installed and run:

- 1554-frbr-demo-www (creates a version from
- 1554-frbr-demo-explore (creates a version from

FRBR-specific versions were created so that we can compare the original provenance to the new FRBR-inspired (crypto-digested) provenance. The retrieve.sh scripts will recreate the steps of the use case. They are available at:
The Functional Requirements for Information Resources (FRIR) ontology is available at http://purl.org/twc/ontology/frir.owl#; we encourage using the frir prefix, which we pronounce "friar". The ontology extends:
- frbrcore - Ian Davis' FRBR ontology,
- nfo - Nepomuk's File Ontology, and
- prov - the W3C Provenance Working Group's current draft OWL ontology.
You can browse the ontology using Manchester's Ontology Browser.
Events and the provenance metadata they produce
First, we'd like to show an overview of the events in the use case, the provenance metadata they produce, and where they end up (on the [file system](conversion cockpit)). For illustration purposes, we saved slices for each step along the way:
- Event 1 by Data Integrator E
- Event 2 by Data Integrator W
- Event 3 by Data Integrator E
- Event 4 by Data Integrator E
- Event 5 by Data Integrator E
- Event 6, both Data Integrators (E, W) publish their data on the web (links above).
- Event 7, the Data Consumer chooses from the data file links, which would be informed by the FRBR provenance metadata listed here.
Using FRBR provenance to compare two Data Integrators' retrievals of two different URLs that resolve to the same data file ("comparison file 1" in paper)
FRBR provenance when Data Integrators E and W retrieve two different URLs. The relations among the requested URLs become apparent: URL 1 (eventually) redirects to URL 2, which redirects to URL 3. Although retrieved independently, the files share the same Manifestation and Expression because the message digest and content digest were used to name them, respectively. The unlabeled dashed lines are rdf:type triples.
FRBR provenance of converting the obtained CSV to RDF ("comparison file 2" in paper)
FRBR provenance when Data Integrator E converts the CSV to raw RDF. Although the files' Manifestations differ, the Expression is the same. By this, we know that no new content was created (or lost) in the conversion.
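One way to see how a content-level digest can agree while file-level digests differ is the following sketch. The cell/row separators and canonicalization here are illustrative assumptions, not fstack.py's actual TabularDigest algorithm; the two byte-different CSV variants are made up to show the effect of line endings.

```python
import csv
import hashlib
import io

def message_digest(data: bytes) -> str:
    # Manifestation-level name: digest of the literal bytes.
    return hashlib.sha256(data).hexdigest()

def tabular_digest(data: bytes) -> str:
    # Expression-level name: digest of the cell contents, independent
    # of serialization details such as line endings.
    h = hashlib.sha256()
    for row in csv.reader(io.StringIO(data.decode("utf-8"))):
        for cell in row:
            h.update(cell.encode("utf-8"))
            h.update(b"\x00")   # cell separator (an arbitrary choice here)
        h.update(b"\x01")       # row separator
    return h.hexdigest()

# Same table, different bytes (CRLF vs LF line endings).
crlf = b"country,amount\r\nGhana,100\r\n"
lf   = b"country,amount\nGhana,100\n"
```

Run against these two inputs, `message_digest` differs (different Manifestations) while `tabular_digest` matches (one shared Expression).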
FRBR provenance of re-serializing RDF; it does not change the Expression, only the Manifestation ("comparison file 3" in paper)
FRBR provenance of the CSV, raw RDF, and a conversion of the raw RDF into RDF/XML. Although the RDF/XML is not stated to be derived from the original, their common Expression permits us to view them as content-equivalent.
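The claim that re-serialization preserves the Expression can be sketched by digesting a canonically sorted triple list: however the triples were ordered on disk (Turtle vs RDF/XML), the digest of the abstract graph is the same. This toy ignores blank nodes, which require real graph canonicalization, and the ex: triples are invented for illustration.

```python
import hashlib

def graph_digest(triples):
    # Expression-level name for an RDF graph: digest over the sorted
    # triples, independent of serialization order or concrete syntax.
    canonical = "\n".join(sorted("%s %s %s ." % t for t in triples))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The same two triples as a parser might yield them from Turtle and
# from RDF/XML: different order, same abstract graph.
from_turtle = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:q", "ex:c")]
from_rdfxml = [("ex:b", "ex:q", "ex:c"), ("ex:a", "ex:p", "ex:b")]
```

Both inputs yield one digest, so both serializations map to the same Expression, which is exactly why the RDF/XML file can be treated as content-equivalent to the Turtle it came from.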
Enhanced RDF has a different Expression than that of the CSV and Raw RDF ("comparison file 4" in paper):
FRBR provenance applying enhancement parameters to the CSV's conversion to RDF. Although the raw RDF's Expression was recognized as tabular and mapped to the same Expression as the CSV, the enhanced RDF is a new graph. This new content structure results in a digest derived from the RDF Abstract Model instead of a table. The new Expression is associated with a new derived Work.
pcurl.py and fstack.py are the two utilities that add the features described in the paper. We describe each here, with example usage. While each can be used on its own, csv2rdf4lod-automation doesn't apply them unless
Tool 1 of 2: pcurl.py
For usage, run:

```
$ pcurl.py --help
usage: pcurl.py [--help|-h] [--format|-f xml|turtle|n3|nt] [url ...]

Download a URL and compute Functional Requirements for Bibliographic
Resources (FRBR) stacks using cryptograhic digests for the resulting content.

optional arguments:
  url           url to compute a FRBR stack for.
  -h, --help    Show this help message and exit
  -f, --format  File format for FRBR stacks. One of xml, turtle, n3, or nt.
```
To download http://explore.data.gov/download/5gah-bvex/CSV while remembering how you got the results, run:
```
$ cd source/data-gov/1554-frbr-demo-explore/version/2011-Sep-19-frbr/source/
$ pcurl.py http://explore.data.gov/download/5gah-bvex/CSV
$ ls
us_economic_assistance.csv  us_economic_assistance.csv.prov.ttl
```
Looking at us_economic_assistance.csv.prov.ttl (see its diagram), we see a FRBR stack, rooted from the file (Item) to the URL retrieved (Work). The HTTP redirects from the requested URL to the retrieved URL are described. The Manifestation and Expression are named by performing a cryptographic digest at the file level (MessageDigest) and at the content level (TabularDigest, in this case), respectively. The file (Item) is named by hashing the machine, file path, and file's last-modified date and appending the local file name (i.e., filed://sha(moddate+machine)/sha(path)/filename). All of the FRBR Endeavors (Work, Expression, Manifestation, and Item) are also prov:Entities that contextualize the state at the time of retrieval. The HTTP response (an Item) is also represented, and is associated to the file using the prov:wasGeneratedBy and prov:used associations.
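The filed:// Item naming scheme just described can be sketched as follows. This is a simplified illustration, not fstack.py's exact canonicalization: the hostname stands in for "machine", the digests are truncated for readability, and the demo file is a throwaway stand-in for the CSV.

```python
import hashlib
import os
import platform
import tempfile

def item_uri(path):
    # Item-level name: filed://sha(moddate+machine)/sha(path)/filename,
    # per the scheme described above (details here are assumptions).
    moddate = str(os.stat(path).st_mtime)
    machine = platform.node()  # hostname as a stand-in for "machine"
    context = hashlib.sha256((moddate + machine).encode()).hexdigest()[:12]
    pathhash = hashlib.sha256(os.path.abspath(path).encode()).hexdigest()[:12]
    return "filed://%s/%s/%s" % (context, pathhash, os.path.basename(path))

# Demonstrate on a temporary file.
fd, demo = tempfile.mkstemp(suffix="-us_economic_assistance.csv")
os.close(fd)
uri = item_uri(demo)
os.remove(demo)
```

Because the last-modified date and machine feed the hash, the same bytes at the same path still get a new Item name after the file is touched, matching FRBR's spatial-temporal reading of Items.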
Having this bit of metadata is interesting, but how is it useful? What if we have files on disk and we don't know how we got them, or if you and someone else have similar files but you're not sure how they relate? Let's start by running pcurl.py again and comparing the results. Then, we'll use fstack.py to see how we can reconcile any pcurl.py'd file with any file on disk (whether it has provenance or not).
```
$ cd source/data-gov/1554-frbr-demo-www/version/2011-Sep-19-frbr/source
$ pcurl.py http://www.data.gov/download/1554/csv
$ ls
us_economic_assistance.csv  us_economic_assistance.csv.prov.ttl
```
Comparing this consolidated view to the first view above, we see that a few more Works were added and associated to other Works by irw:redirectsTo. This shows a larger picture of how these URLs are related; it turns out that requesting http://www.data.gov/download/1554/csv actually passes through http://explore.data.gov/download/5gah-bvex/CSV on its way to the final redirection location http://gbk.eads.usaidallnet.gov/data/files/us_economic_assistance.csv. Although we might have considered the two results different because they came from different locations, that is not actually the case because we end up requesting the same URL.
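The redirect chain above can be replayed as plain data, with one irw:redirectsTo triple emitted per hop, roughly as pcurl.py records them. The table below mirrors the chain observed in the use case; no network access is involved, and the function is a sketch rather than pcurl.py's implementation.

```python
def redirect_chain(start, redirects):
    # Follow a pre-recorded redirect table, emitting one
    # irw:redirectsTo triple per hop.
    triples, url = [], start
    while url in redirects:
        triples.append((url, "irw:redirectsTo", redirects[url]))
        url = redirects[url]
    return triples, url

# The chain observed in the use case, recorded as a lookup table.
hops = {
    "http://www.data.gov/download/1554/csv":
        "http://explore.data.gov/download/5gah-bvex/CSV",
    "http://explore.data.gov/download/5gah-bvex/CSV":
        "http://gbk.eads.usaidallnet.gov/data/files/us_economic_assistance.csv",
}
triples, final = redirect_chain("http://www.data.gov/download/1554/csv", hops)
```

Starting from either integrator's URL, the walk bottoms out at the same final URL, which is why the two retrievals turn out to describe one Work.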
Looking towards the bottom of the diagram, we see a lot of duplication of Items, which is caused by the HTTP requests and the two files on different parts of disk. This distinction corresponds to the spatial-temporal aspect of FRBR Items -- if they are in different places, then they are different things. Their physical structure, however, is the same. This commonality is expressed by both files (Items) referencing the same Manifestation, which is named using the message digest of their literal binary contents.
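The Manifestation-sharing just described is straightforward to sketch: the Manifestation name is a message digest of the file's literal bytes, so two copies at different paths (distinct Items) get one Manifestation URI. The hash:Manifestation/SHA256- prefix follows the fstack.py output shown later; the file content here is a made-up stand-in.

```python
import hashlib
import os
import tempfile

def manifestation_uri(path):
    # Manifestation-level name: message digest of the literal bytes.
    with open(path, "rb") as f:
        return "hash:Manifestation/SHA256-" + hashlib.sha256(f.read()).hexdigest()

# Write the same bytes to two different locations on disk.
content = b"country,amount\nGhana,100\n"
paths = []
for _ in range(2):
    fd, p = tempfile.mkstemp()
    os.write(fd, content)
    os.close(fd)
    paths.append(p)

uris = [manifestation_uri(p) for p in paths]
for p in paths:
    os.remove(p)
```

The two paths differ, so the Items differ, yet the two Manifestation URIs are identical, which is exactly the commonality the diagram expresses.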
In FRIR, when FRBR stacks share a more concrete endeavor, they implicitly share all more abstract endeavors. For example, both files on disk share the same Manifestation (filed://e3b5-11e0...), so they also have to share the same Expression (
The informational power of FRBR stacks comes when using topic-specific cryptographic hashes to name the individual Endeavors (Work, Expression, Manifestation, and Item). Although the view shown is the result of two independent events, the names used to describe the URL retrieval align without prior coordination.
(Beyond the use case, this diagram shows the FRIR resulting from running pcurl.py against a Linked Data URI.)
Tool 2 of 2: fstack.py
```
$ fstack.py --help
usage: fstack.py [--help|-h] [--stdout|-c] [--format|-f xml|turtle|n3|nt] [-] [file ...]

Compute Functional Requirements for Bibliographic Resources (FRBR) stacks
using cryptograhic digests.

optional arguments:
  file          file to compute a FRBR stack for.
  -             read content from stdin and print FRBR stack to stdout.
  -h, --help    Show this help message and exit
  -c, --stdout  Print frbr stacks to stdout.
  --no-paths    Only output path hashes, not actual paths (for security purposes).
  -f, --format  File format for FRBR stacks. One of xml, turtle, n3, or nt.
```
```
$ fstack.py --print-item --print-manifestation --print-expression --print-work us_economic_assistance.csv
filed://e34d-11e0-a62f-64b9e8c87fbe/SHA256-6d95b1bb167bc23716...3841ec23a4aff67c6e8a/us_economic_assistance.csv
hash:Manifestation/SHA256-d7a4442f613aa3a4a8e4a1527a917c4cfa23501ee87e0514a0f10300736b0f57
hash:Expression/SHA256-d7a4442f613aa3a4a8e4a1527a917c4cfa23501ee87e0514a0f10300736b0f57
uuid:55103bd3-8dac-4ae7-90c6-b44744a0119f
```
Graham Klyne: This reminds me that there's a proposal for a ni: URI scheme that identifies by way of cryptographic hash, which might be useful for some aspects of provenance ... http://www.ietf.org/proceedings/81/slides/decade-3.pdf, http://tools.ietf.org/html/draft-farrell-decade-ni-04