Content augmentation
Tim L edited this page Mar 9, 2014
- Content augmentation can be performed after Recovering Data from View.
- This process is discussed in our paper Content-preserving Graphics.
How to use the vsr-follow.sh script to crawl Linked Data.
Usage:
$ vsr-follow.sh
usage: vsr-follow.sh [-w] [-od <directory>] <seed-file> [--no-sameness] [--start-to] [--instances-of <rdfs-class>+] [--follow <rdf-property>+]
-w : write the output to file.
-od <directory> : write the outputs into the given directory.
<seed-file> : the RDF file to start augmentation. Can be an RDF file, or a GRDDL-annotated XML file.
--no-sameness : do not follow owl:sameAs, prov:specialization, or prov:alternateOf on followed objects.
--start-to : clear the visit list before beginning augmentation. Will re-visit everything at least once.
--instances-of <rdfs-class>+ : dereference instances of these classes.
--follow <rdf-property>+ : after dereferencing the depictions, also resolve all objects of the given RDF property.
In this use case, we have formats but not their labels. We can get them by resolving the FileFormat URIs.
When we have some RDF descriptions of things with file formats:
$ head formatted-things.ttl
<http://opendap.tw.rpi.edu/source/us/file/cr-sparql-sd/version/latest/source/sparql-sd.rdf>
dcterms:format <http://provenanceweb.org/formats/pronom/fmt/101> .
<http://opendap.tw.rpi.edu/source/us/file/cr-full-dump/version/latest/conversion/opendap-tw-rpi-edu.nt.gz>
dcterms:format <http://www.w3.org/ns/formats/N-Triples> .
<http://opendap.tw.rpi.edu/source/us/file/opendap-components/version/2014-Jan-07/conversion/us-opendap-components-2014-Jan-07.e1.sample.ttl>
dcterms:format <http://www.w3.org/ns/formats/Turtle> .
... we can get the labels for those formats using the following command.
- `--start-to` clears the visit history that might exist. The visit history lets us avoid revisiting the same URIs.
- `--follow` tells us which RDF properties to walk to resolve all of their object URIs.
- The file `formatted-things.ttl.ttl` is written.
- Notice that the script 'fills "sameness" relation' because some of the FileFormats are prov:alternateOf other URIs; this is equivalent to including "prov:alternateOf, prov:specializationOf, and owl:sameAs" in the `--follow` parameter. We can avoid these extra traversals by specifying the `--no-sameness` parameter.
- The status reporting shows how many properties it is following `(1 of 1)`, and how many triples it has collected so far (starting at `886` and ending at `983`). Lines such as `| http://provenanceweb.org/formats/pronom/fmt/101` indicate that we've already requested the URI and are avoiding a repeat request.
$ vsr-follow.sh formatted-things.ttl --start-to --follow dcterms:format
following http://purl.org/dc/terms/format (1 / 1) from subjects in manual/formatted-things.ttl.ttl
886 < http://provenanceweb.org/format/mime/application/gzip (1)
895 < http://provenanceweb.org/format/mime/text/plain (2)
908 < http://provenanceweb.org/formats/pronom/fmt/101 (3)
923 < http://provenanceweb.org/formats/pronom/fmt/11 (4)
936 < http://provenanceweb.org/formats/pronom/fmt/282 (5)
949 < http://provenanceweb.org/formats/pronom/fmt/471 (6)
961 < http://provenanceweb.org/formats/pronom/x-fmt/266 (7)
961 < https://github.com/tetherless-world/opendap/issues/45 (for #datadds) (8)
961 < https://github.com/tetherless-world/opendap/wiki/OPeNDAP-Vocabulary (for #wiki-abstract-netcdf) (9)
976 < http://www.w3.org/ns/formats/N-Triples (10)
992 < http://www.w3.org/ns/formats/Turtle (11)
filling "sameness" relation http://www.w3.org/2002/07/owl#sameAs for objects of http://purl.org/dc/terms/format (1 / 1)
filling "sameness" relation http://www.w3.org/ns/prov#alternateOf for objects of http://purl.org/dc/terms/format (1 / 1)
| http://provenanceweb.org/formats/pronom/fmt/101 (12)
| http://provenanceweb.org/formats/pronom/fmt/11 (13)
| http://provenanceweb.org/formats/pronom/fmt/282 (14)
| http://provenanceweb.org/formats/pronom/fmt/471 (15)
| http://provenanceweb.org/formats/pronom/x-fmt/266 (16)
992 < http://www.nationalarchives.gov.uk/pronom/fmt/101 (17)
992 < http://www.nationalarchives.gov.uk/pronom/fmt/11 (18)
992 < http://www.nationalarchives.gov.uk/pronom/fmt/282 (19)
992 < http://www.nationalarchives.gov.uk/pronom/fmt/471 (20)
992 < http://www.nationalarchives.gov.uk/pronom/x-fmt/266 (21)
983
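The traversal described above (walk a property, dereference its objects, skip anything already in the visit list, then optionally follow the "sameness" properties) can be sketched in a few lines of Python. This is only an illustration of the idea, not how vsr-follow.sh is implemented: the `follow` function, the `FAKE_WEB` dictionary standing in for HTTP requests, and the example.org URIs are all hypothetical.

```python
DCTERMS_FORMAT = "http://purl.org/dc/terms/format"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
PROV_ALTERNATE_OF = "http://www.w3.org/ns/prov#alternateOf"
SAMENESS = {
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/ns/prov#specializationOf",
    PROV_ALTERNATE_OF,
}

# Hypothetical stand-in for the Web: URI -> triples returned when it is
# dereferenced. A real crawl would issue HTTP requests instead.
FAKE_WEB = {
    "http://example.org/fmt/a": [
        ("http://example.org/fmt/a", RDFS_LABEL, "Format A"),
        ("http://example.org/fmt/a", PROV_ALTERNATE_OF, "http://example.org/alt/a"),
    ],
    "http://example.org/alt/a": [
        ("http://example.org/alt/a", RDFS_LABEL, "Alternate A"),
    ],
}


def follow(seed_triples, properties, visited, sameness=True):
    """Dereference objects of `properties`; optionally follow sameness links."""
    collected = list(seed_triples)
    log = []

    def visit(uri):
        if uri in visited:
            # Already requested: record the skip and avoid a repeat request.
            log.append(("skipped", uri))
            return
        visited.add(uri)
        collected.extend(FAKE_WEB.get(uri, []))
        log.append(("fetched", uri))

    targets = [o for (_, p, o) in seed_triples if p in properties]
    for uri in targets:
        visit(uri)

    if sameness:
        followed = set(targets)
        same = [o for (s, p, o) in list(collected)
                if s in followed and p in SAMENESS]
        for uri in same:
            visit(uri)
    return collected, log


# Two hypothetical files that share one format URI.
seeds = [
    ("http://example.org/file/1", DCTERMS_FORMAT, "http://example.org/fmt/a"),
    ("http://example.org/file/2", DCTERMS_FORMAT, "http://example.org/fmt/a"),
]
visited = set()  # an empty visit list, analogous to passing --start-to
triples, log = follow(seeds, {DCTERMS_FORMAT}, visited)
for kind, uri in log:
    print(kind, uri)
print(len(triples), "triples collected")
```

Passing `sameness=False` in this sketch plays the role of `--no-sameness`: the alternate URI is never requested, so fewer triples are collected.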
Alternatively, we might have some instances that we want to know more about. For example, we want to augment all instances of FileFormat.
<http://www.w3.org/ns/formats/Turtle>
a dcterms:FileFormat .
<http://www.w3.org/ns/formats/N-Triples>
a dcterms:FileFormat .
<http://provenanceweb.org/formats/pronom/fmt/101>
a dcterms:FileFormat .
<http://provenanceweb.org/formats/pronom/x-fmt/266>
a dcterms:FileFormat .
$ vsr-follow.sh formats.ttl --no-sameness --start-to --instances-of dcterms:FileFormat
filling http://purl.org/dc/terms/FileFormat (class 1 / 1) from subjects in formats.ttl.ttl
27 < http://www.w3.org/ns/formats/Turtle (1)
42 < http://www.w3.org/ns/formats/N-Triples (2)
55 < http://provenanceweb.org/formats/pronom/fmt/101 (3)
67 < http://provenanceweb.org/formats/pronom/x-fmt/266 (4)
82 < http://provenanceweb.org/formats/pronom/fmt/11 (5)
95 < http://provenanceweb.org/formats/pronom/fmt/282 (6)
108 < http://provenanceweb.org/formats/pronom/fmt/471 (7)
155 < http://provenanceweb.org/format/mime/application/gzip (8)
164 < http://provenanceweb.org/format/mime/text/plain (9)
164 < https://github.com/tetherless-world/opendap/wiki/OPeNDAP-Vocabulary (for #wiki-abstract-netcdf) (10)
164 < https://github.com/tetherless-world/opendap/issues/45 (for #datadds) (11)
151
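Selecting the starting points for `--instances-of` amounts to finding every subject typed with one of the given classes. A minimal Python sketch of that selection step follows, using the four FileFormat URIs from formats.ttl above. The `instances_of` helper and the extra label triple are assumptions made for illustration, and the sketch says nothing about how vsr-follow.sh itself performs the lookup.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FILE_FORMAT = "http://purl.org/dc/terms/FileFormat"


def instances_of(triples, classes):
    """Return subjects typed as any of `classes`, in first-seen order."""
    found = []
    for s, p, o in triples:
        if p == RDF_TYPE and o in classes and s not in found:
            found.append(s)
    return found


# The seed data from formats.ttl, written out as (subject, predicate, object).
formats = [
    ("http://www.w3.org/ns/formats/Turtle", RDF_TYPE, FILE_FORMAT),
    ("http://www.w3.org/ns/formats/N-Triples", RDF_TYPE, FILE_FORMAT),
    ("http://provenanceweb.org/formats/pronom/fmt/101", RDF_TYPE, FILE_FORMAT),
    ("http://provenanceweb.org/formats/pronom/x-fmt/266", RDF_TYPE, FILE_FORMAT),
    # Hypothetical extra triple: not a type statement, so it is ignored.
    ("http://www.w3.org/ns/formats/Turtle", RDFS_LABEL, "Turtle"),
]

to_dereference = instances_of(formats, {FILE_FORMAT})
for uri in to_dereference:
    print(uri)
```

Each URI returned here would then be dereferenced, subject to the same visit list that guards `--follow`.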
- Prizms' pr-neighborlod also uses vsr-follow.sh.
- Prizms' pr-aggregate-pingback also uses vsr-follow.sh.