Script: pcurl.py

timrdf edited this page Jul 2, 2012 · 46 revisions
csv2rdf4lod-automation is licensed under the Apache License, Version 2.0

$CSV2RDF4LOD_HOME/bin/util/pcurl.py is Jim McCusker's reimplementation of pcurl.sh to include FRBR stacks and HTTP-in-RDF. He has included it as part of csv2rdf4lod-automation. Applications of this utility are described in the following publications:

Usage

bash-3.2$ pcurl.py --help
usage: pcurl.py [--help|-h] [--format|-f xml|turtle|n3|nt] [url ...]

Download a URL and compute Functional Requirements for Bibliographic Resources
(FRBR) stacks using cryptograhic digests for the resulting content.

Refer to http://purl.org/twc/pub/mccusker2012parallel
for more information and examples.

optional arguments:
 url            url to compute a FRBR stack for.
 -h, --help     Show this help message and exit,
 -f, --format   File format for FRBR stacks. One of xml, turtle, n3, or nt.

fstack.py is closely related to pcurl.py. While pcurl.py retrieves a URL and computes the FRBR stack of the retrieved content, fstack.py creates a FRBR stack for an existing local file.

bash-3.2$ fstack.py --help
usage: fstack.py [--help|-h] [--stdout|-c] [--format|-f xml|turtle|n3|nt] [--print-item] [--print-manifesation] [--print-expression] [--print-work] [-] [file ...]

Compute Functional Requirements for Bibliographic Resources (FRBR)
stacks using cryptograhic digests.

Refer to http://purl.org/twc/pub/mccusker2012parallel
for more information and examples.

optional arguments:
 file                  File to compute a FRBR stack for.
 -                     Read content from stdin and print FRBR stack to stdout.
 -h, --help            Show this help message and exit,
 -c, --stdout          Print frbr stacks to stdout.
 --no-paths            Only output path hashes, not actual paths.
 -f, --format          File format for FRBR stacks. xml, turtle, n3, or nt.
 --print-item          Print URI of the Item and quit.
 --print-manifestation Print URI of the Manifestation and quit.
 --print-expression    Print URI of the Expression and quit.
 --print-work          Print URI of the Work and quit.

Example

The following command retrieves the latest pcurl.py script and stores it in a file in your current directory. The script also writes a second file describing the provenance of the one retrieved.

bash-3.2$ pcurl.py https://raw.github.com/timrdf/csv2rdf4lod-automation/master/bin/util/pcurl.py
bash-3.2$ ls
pcurl.py.prov.ttl       pcurl.py

If something happens to the file you retrieved (e.g., a file copy or rename), $CSV2RDF4LOD_HOME/bin/util/fstack.py can be used to recognize the association between the downloaded file and the one we see now:

bash-3.2$ cp pcurl.py mypcurl.py
bash-3.2$ fstack.py mypcurl.py
bash-3.2$ ls
pcurl.py.prov.ttl   pcurl.py        mypcurl.py      mypcurl.py.prov.ttl

To see that the different files pcurl.py and mypcurl.py have the same bitstream, we can look at the snippets of the FRBR stacks shown below and compare the frbr:Manifestation referenced by the frbr:exemplarOf predicate. pcurl.py and mypcurl.py are different frbr:Items with the same frbr:Manifestation.

# from pcurl.py.prov.ttl:
<tag:tw.rpi.edu,2011:filed:SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw=/sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU=/pcurl.py>
   a frbr:Item;
   nfo:fileUrl <file:////Users/lebot/pcurl.py>,
               <pcurl.py>;
   dcterms:modified "2012-01-03T11:05:33"^^xsd:dateTime;
   frbr:exemplarOf <tag:tw.rpi.edu,2011:Manifestation:sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU=>;
...

# from mypcurl.py.prov.ttl:
<tag:tw.rpi.edu,2011:filed:SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw=/sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU=/mypcurl.py>
   a frbr:Item;
   nfo:fileUrl <file:////Users/lebot/mypcurl.py>,
               <mypcurl.py>;
   dcterms:modified "2012-01-03T11:05:33"^^xsd:dateTime;
   frbr:exemplarOf <tag:tw.rpi.edu,2011:Manifestation:sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU=>;
...

How to name a file as a frbr:Item

A file's absolute directory path and modification date are used to name the frbr:Item. If either changes, a new name is given. The file's directory path includes the machine that is hosting the directory.

The name for the frbr:Item tag:tw.rpi.edu,2011:filed:SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw=/sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU=/pcurl.py is constructed by concatenating:

  • tag:tw.rpi.edu,2011:
  • filed:
  • SVbQMPyfteayT_XeWKRnygrxhqoAMncsgdRwexQtugw= (a hash of the machine hosting the directory)
  • /
  • sha-256-gvr2NDAF7C0HOGuGFEoYwIbs7mQit_TABy8hQJHIlhU= (a hash of the directory and the modification date of the file)
  • /
  • pcurl.py (the file name)
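
The concatenation above can be sketched in Python. This is only an illustration, not the code in fstack.py: this page does not say exactly what is hashed for the machine or for the directory and date, so the inputs below (a caller-supplied machine identifier, and the directory path joined with the modification time) are assumptions, as are the helper names b64_sha256 and item_name. The hash encoding (URL-safe base64 of a SHA-256 digest) is inferred from the shape of the names shown in the snippets.

```python
import base64
import hashlib
import os
import socket
import tempfile

TAG_PREFIX = "tag:tw.rpi.edu,2011:"


def b64_sha256(data: bytes) -> str:
    """URL-safe base64 of a SHA-256 digest (44 characters, '=' padded)."""
    return base64.urlsafe_b64encode(hashlib.sha256(data).digest()).decode("ascii")


def item_name(path: str, machine_id: str) -> str:
    """Assemble an Item name from the parts listed above.

    ASSUMPTION: the exact inputs to the two hashes are not stated on this
    page; the ones used here are illustrative only.
    """
    absolute = os.path.abspath(path)
    directory, file_name = os.path.split(absolute)
    modified = str(os.path.getmtime(absolute))
    machine_hash = b64_sha256(machine_id.encode("utf-8"))
    location_hash = b64_sha256((directory + modified).encode("utf-8"))
    # "sha-256-" prefix as it appears in the .prov.ttl snippets.
    return (TAG_PREFIX + "filed:" + machine_hash + "/"
            + "sha-256-" + location_hash + "/" + file_name)


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    example = os.path.join(tmp, "pcurl.py")
    with open(example, "w") as handle:
        handle.write("print('hello')\n")
    print(item_name(example, socket.gethostname()))
```

Moving the file to another directory, touching it, or running on another machine changes one of the two hashes, which is why each of those events yields a new frbr:Item name.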

How to name a file's frbr:Manifestation

(todo)

<tag:tw.rpi.edu,2011:Manifestation:sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU=>
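
Until this section is written, here is a hedged guess at the construction, based only on the shape of the URI above and on the earlier remark that files with the same bitstream share a Manifestation. It assumes the name is the URL-safe base64 encoding of a SHA-256 digest over the file's bytes; the function name manifestation_uri is hypothetical and the fstack.py source should be treated as the authority.

```python
import base64
import hashlib
import os
import shutil
import tempfile


def manifestation_uri(path: str) -> str:
    """Name a Manifestation from the SHA-256 digest of the file's bytes.

    ASSUMPTION: digest input and encoding are inferred from the URI
    above, not taken from the fstack.py source.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    encoded = base64.urlsafe_b64encode(digest.digest()).decode("ascii")
    return "tag:tw.rpi.edu,2011:Manifestation:sha-256-" + encoded


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "pcurl.py")
    with open(original, "w") as handle:
        handle.write("print('hello')\n")
    copy = shutil.copy(original, os.path.join(tmp, "mypcurl.py"))
    # A byte-for-byte copy gets the same name, whatever the file is called.
    print(manifestation_uri(original) == manifestation_uri(copy))
```

Under this assumption the name depends only on content, so a copy or rename keeps the Manifestation while a single changed byte produces a new one.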

More than just message digests (md5, sha1, etc): Adding frbr:Expressions

If any character of mypcurl.py changes, the derived frbr:Item will have a different frbr:Manifestation and a different frbr:Expression from those of pcurl.py, because we cannot automatically identify these more abstract notions for procedural Python instructions.

However, this shortcoming can be overcome when your files encode RDF instead of procedural code. To demonstrate this, we use $CSV2RDF4LOD_HOME/bin/util/tic.sh to obtain an (incomplete) RDF description of the Python script, such as its author.

bash-3.2$ tic.sh mypcurl.py > mypcurl.py.ttl
bash-3.2$ cat mypcurl.py.ttl | grep "doap:developer"
    doap:developer twi:JamesMcCusker ;

Although changing the serialization of the Turtle describing mypcurl.py results in a new frbr:Manifestation, the new frbr:Item is associated with the same frbr:Expression as the first.

bash-3.2$ rapper -q -g -o rdfxml-abbrev mypcurl.py.ttl > mypcurl.py.ttl.rdf
bash-3.2$ fstack.py --no-paths mypcurl.py.ttl
bash-3.2$ fstack.py --no-paths mypcurl.py.ttl.rdf
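
Why two serializations can share an frbr:Expression can be illustrated with a small sketch. It is a simplification and not the algorithm fstack.py uses (this page does not describe that algorithm): it hashes a sorted, de-duplicated list of statements, so the digest depends on the statements and not on the order or syntax in which a file wrote them. It handles ground triples only and makes no attempt at blank nodes. The function name expression_digest and the example triples are made up for illustration.

```python
import base64
import hashlib


def expression_digest(triples) -> str:
    """Hash a graph so that statement order and repetition do not matter.

    ASSUMPTION: a simplification for ground triples only; blank nodes
    would need a canonical labelling that is not attempted here.
    """
    canonical = sorted({" ".join(triple) + " ." for triple in triples})
    data = "\n".join(canonical).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha256(data).digest()).decode("ascii")


# Illustrative statements (hypothetical URIs), standing in for what a
# parser would yield from a Turtle file and from its RDF/XML counterpart.
from_turtle = [
    ("<http://example.org/mypcurl.py>",
     "<http://usefulinc.com/ns/doap#developer>",
     "<http://example.org/JamesMcCusker>"),
    ("<http://example.org/mypcurl.py>",
     "<http://purl.org/dc/terms/title>",
     '"pcurl.py"'),
]
from_rdfxml = list(reversed(from_turtle))

# Same statements in a different order give the same digest.
print(expression_digest(from_turtle) == expression_digest(from_rdfxml))
```

The file-level digest, by contrast, is taken over the serialized bytes, which is why the .ttl and .rdf files get different frbr:Manifestations.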

Discovering aligned FRBR stacks

Some endeavors in the FRBR stack are named in the tag scheme. This was done to use a reserved namespace that people could compute hashes into, allowing for "serendipitous" URI collision. Since the namespace isn't dereferenceable, there is no chance for it to be "taken over" by someone who would put misleading information into it.

Using the tag scheme is a pure approach, but it hinders discoverability.

We may also want HTTP so the RDF around the "pure, serendipitous" tag URIs can be discoverable. It's cute that the URIs of your file and my file align conceptually (and even physically on each of our computers). Now, let's get to actually finding the connection so we can learn something!

Perhaps we mix both in?

We automatically generate the tag URI AND an HTTP URI within our own namespace, then relate the two?

generated on Jim's machine:

tag:THE_HASH prov:alternateOf http://jimbo.org/id/frir/THE_HASH .

generated on Nick's machine:

tag:THE_HASH prov:alternateOf http://nick-o-roonie.org/id/frir/THE_HASH .
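
As a sketch of this idea (still only a proposal here, not something pcurl.py or fstack.py is stated to do), each machine could emit one N-Triples statement that links the shared tag URI to an HTTP URI in its own namespace. The function name alternate_of_statement is hypothetical, and the /id/frir/ path simply follows the two examples above.

```python
PROV_ALTERNATE_OF = "http://www.w3.org/ns/prov#alternateOf"


def alternate_of_statement(tag_uri: str, the_hash: str, base: str) -> str:
    """Link a tag URI to an HTTP URI minted in the caller's own namespace.

    ASSUMPTION: the /id/frir/ path mirrors the examples above; nothing
    here is confirmed behaviour of the existing scripts.
    """
    http_uri = base.rstrip("/") + "/id/frir/" + the_hash
    return "<%s> <%s> <%s> ." % (tag_uri, PROV_ALTERNATE_OF, http_uri)


the_hash = "sha-256-81X-JdHSWIdGwDaFk8Mlv8iW_TqlUpG2UCZh1ue04HU="
tag_uri = "tag:tw.rpi.edu,2011:Manifestation:" + the_hash

# Same subject on both machines, different dereferenceable object.
print(alternate_of_statement(tag_uri, the_hash, "http://jimbo.org"))
print(alternate_of_statement(tag_uri, the_hash, "http://nick-o-roonie.org"))
```

Both statements share the tag URI as subject, so merging the two graphs connects Jim's and Nick's HTTP descriptions through it.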

owl:hasKey the hashes (and use tag: for them) and have the entities land where they will?