Content augmentation

Tim L edited this page Mar 9, 2014 · 18 revisions

What is first

What we will cover

How to use the vsr-follow.sh script to crawl Linked Data.

Let's get to it

Usage:

$ vsr-follow.sh
usage: vsr-follow.sh [-w] [-od <directory>] <seed-file> [--no-sameness] [--start-to] [--instances-of <rdfs-class>+]  [--follow <rdf-property>+]

                           -w : write the output to file.
              -od <directory> : write the outputs into the given directory.
                  <seed-file> : the RDF file to start augmentation. Can be an RDF file, or a GRDDL-annotated XML file.
                --no-sameness : do not follow owl:sameAs, prov:specialization, or prov:alternateOf on followed objects.
                   --start-to : clear the visit list before beginning augmentation. Will re-visit everything at least once.
 --instances-of <rdfs-class>+ : dereference instances of these classes.
 --follow <rdf-property>+     : after dereferencing the depictions, also resolve all objects of the given RDF property.

Example: Get labels for FileFormats

In this use case, we have formats but not their labels. We can get them by resolving the FileFormat URIs.

When we have some RDF descriptions of things with file formats:

$ head formatted-things.ttl

<http://opendap.tw.rpi.edu/source/us/file/cr-sparql-sd/version/latest/source/sparql-sd.rdf>
    dcterms:format <http://provenanceweb.org/formats/pronom/fmt/101> .

<http://opendap.tw.rpi.edu/source/us/file/cr-full-dump/version/latest/conversion/opendap-tw-rpi-edu.nt.gz>
    dcterms:format <http://www.w3.org/ns/formats/N-Triples> .

<http://opendap.tw.rpi.edu/source/us/file/opendap-components/version/2014-Jan-07/conversion/us-opendap-components-2014-Jan-07.e1.sample.ttl>
    dcterms:format <http://www.w3.org/ns/formats/Turtle> .

... we can get the labels for those formats with the command below. Some notes on what it does:

  • --start-to clears any visit history that already exists. The visit history records the URIs that have been requested so that they are not requested again.
  • --follow names the RDF properties to walk; the script resolves every object URI of those properties.
  • The file formatted-things.ttl.ttl is written.
  • The script reports filling "sameness" relation because some of the FileFormats are prov:alternateOf other URIs. By default it also follows these relations on the followed objects, which is equivalent to including prov:alternateOf, prov:specializationOf, and owl:sameAs in the --follow parameter. Specify --no-sameness to avoid these extra traversals.
  • The status report shows which property is being followed out of how many (1 / 1), and how many triples have been collected so far (starting at 886 and ending at 983). Lines such as | http://provenanceweb.org/formats/pronom/fmt/101 indicate that the URI has already been requested, so the repeat request is skipped.
$ vsr-follow.sh formatted-things.ttl --start-to --follow dcterms:format
following http://purl.org/dc/terms/format (1 / 1) from subjects in manual/formatted-things.ttl.ttl
          886 < http://provenanceweb.org/format/mime/application/gzip  (1)
          895 < http://provenanceweb.org/format/mime/text/plain  (2)
          908 < http://provenanceweb.org/formats/pronom/fmt/101  (3)
          923 < http://provenanceweb.org/formats/pronom/fmt/11  (4)
          936 < http://provenanceweb.org/formats/pronom/fmt/282  (5)
          949 < http://provenanceweb.org/formats/pronom/fmt/471  (6)
          961 < http://provenanceweb.org/formats/pronom/x-fmt/266  (7)
          961 < https://github.com/tetherless-world/opendap/issues/45 (for #datadds) (8)
          961 < https://github.com/tetherless-world/opendap/wiki/OPeNDAP-Vocabulary (for #wiki-abstract-netcdf) (9)
          976 < http://www.w3.org/ns/formats/N-Triples  (10)
          992 < http://www.w3.org/ns/formats/Turtle  (11)
   filling "sameness" relation http://www.w3.org/2002/07/owl#sameAs for objects of http://purl.org/dc/terms/format (1 / 1)
   filling "sameness" relation http://www.w3.org/ns/prov#alternateOf for objects of http://purl.org/dc/terms/format (1 / 1)
              | http://provenanceweb.org/formats/pronom/fmt/101  (12)
              | http://provenanceweb.org/formats/pronom/fmt/11  (13)
              | http://provenanceweb.org/formats/pronom/fmt/282  (14)
              | http://provenanceweb.org/formats/pronom/fmt/471  (15)
              | http://provenanceweb.org/formats/pronom/x-fmt/266  (16)
          992 < http://www.nationalarchives.gov.uk/pronom/fmt/101  (17)
          992 < http://www.nationalarchives.gov.uk/pronom/fmt/11  (18)
          992 < http://www.nationalarchives.gov.uk/pronom/fmt/282  (19)
          992 < http://www.nationalarchives.gov.uk/pronom/fmt/471  (20)
          992 < http://www.nationalarchives.gov.uk/pronom/x-fmt/266  (21)
983
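
The < and | markers in the output above reflect the visit history: a URI is requested the first time it is seen and skipped afterwards. A minimal sketch of that idea follows, assuming a plain text file with one visited URI per line. The visit function and the temporary file are hypothetical names for illustration, not necessarily how vsr-follow.sh is implemented, and no request is actually made.

```shell
#!/bin/sh
# Sketch of a visit list: handle a URI only the first time it is seen.
# The file and the function name are hypothetical; nothing is fetched here.

visit_list=$(mktemp)

visit() {
  uri=$1
  if grep -qxF -- "$uri" "$visit_list"; then
    # Already in the visit list: report it with "|" and skip the request.
    printf '| %s\n' "$uri"
  else
    # First sighting: record it, then (in a real crawler) dereference it.
    printf '%s\n' "$uri" >> "$visit_list"
    printf '< %s\n' "$uri"
  fi
}

visit http://provenanceweb.org/formats/pronom/fmt/101
visit http://www.w3.org/ns/formats/Turtle
visit http://provenanceweb.org/formats/pronom/fmt/101   # repeat, so skipped
```

The first two calls print a < line and the third prints a | line. Emptying the file before a run would correspond to what --start-to is described as doing.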

Example: Augmenting instances of a certain type

Alternatively, we might have some instances that we want to know more about. For example, suppose we want to augment all instances of dcterms:FileFormat, given descriptions such as:

<http://www.w3.org/ns/formats/Turtle>
    a dcterms:FileFormat .

<http://www.w3.org/ns/formats/N-Triples>
    a dcterms:FileFormat .

<http://provenanceweb.org/formats/pronom/fmt/101>
    a dcterms:FileFormat .

<http://provenanceweb.org/formats/pronom/x-fmt/266>
    a dcterms:FileFormat .
$ vsr-follow.sh formats.ttl --no-sameness --start-to --instances-of dcterms:FileFormat
filling  http://purl.org/dc/terms/FileFormat (class 1 / 1) from subjects in formats.ttl.ttl
           27 < http://www.w3.org/ns/formats/Turtle  (1)
           42 < http://www.w3.org/ns/formats/N-Triples  (2)
           55 < http://provenanceweb.org/formats/pronom/fmt/101  (3)
           67 < http://provenanceweb.org/formats/pronom/x-fmt/266  (4)
           82 < http://provenanceweb.org/formats/pronom/fmt/11  (5)
           95 < http://provenanceweb.org/formats/pronom/fmt/282  (6)
          108 < http://provenanceweb.org/formats/pronom/fmt/471  (7)
          155 < http://provenanceweb.org/format/mime/application/gzip  (8)
          164 < http://provenanceweb.org/format/mime/text/plain  (9)
          164 < https://github.com/tetherless-world/opendap/wiki/OPeNDAP-Vocabulary (for #wiki-abstract-netcdf) (10)
          164 < https://github.com/tetherless-world/opendap/issues/45 (for #datadds) (11)
151
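
Status output like the listings above can be summarised with a short filter. The sketch below assumes the layout shown on this page: a running triple count followed by < for a requested URI, and | for a URI that was skipped. The summarise function is a hypothetical helper, not part of vsr-follow.sh, and the sample lines are copied from the first example.

```shell
#!/bin/sh
# Summarise vsr-follow.sh-style status lines read from stdin.
# Assumes "<count> < <uri>" for a request and "| <uri>" for a skipped URI.

summarise() {
  awk '
    $2 == "<" { requested++; last = $1 }   # a requested URI, with the running triple count
    $1 == "|" { skipped++ }                # a URI already in the visit list
    END { printf "requested=%d skipped=%d last_count=%s\n", requested, skipped, last }
  '
}

summarise <<'EOF'
          908 < http://provenanceweb.org/formats/pronom/fmt/101  (3)
          976 < http://www.w3.org/ns/formats/N-Triples  (10)
              | http://provenanceweb.org/formats/pronom/fmt/101  (12)
          992 < http://www.nationalarchives.gov.uk/pronom/fmt/101  (17)
EOF
```

For the four sample lines this prints requested=3 skipped=1 last_count=992. Lines that match neither pattern, such as the "following ..." header or the final total, are ignored.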

What is next