Extracting RDF from HTML

semsol edited this page Mar 11, 2011 · 3 revisions

There are several structured formats evolving on the Web. ARC provides dedicated parsers for core serializations such as RDF/XML or Turtle, but this approach does not suit HTML-embedded formats such as eRDF, microformats, or RDFa: running a full parse per format would be slow, especially since a single document can combine any number of embedded formats.

Instead, ARC parses the source document (whether valid, well-formed, or tag soup) into a tree structure only once, and then applies any number of extractors to that tree. We are working on extractors for eRDF, RDFa, the core 15-20 microformats, OpenID hooks, basic Dublin Core, and other formats (and mappings) that may be of interest.
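The parse-once, extract-many idea can be illustrated with a small sketch. This is not ARC's internal code: it uses PHP's built-in DOM extension as a stand-in for ARC's tree, and the function names (`extract_dc_sketch`, `extract_rel_tag_sketch`, `extract_all_sketch`) as well as the simple name/value pairs they return are hypothetical, chosen only for illustration.

```php
<?php
// Conceptual sketch only, NOT ARC's implementation.
// PHP's DOM extension stands in for ARC's internal tree structure.

/* Hypothetical extractor: <title> and <meta name="DC.*"> as name/value pairs. */
function extract_dc_sketch(DOMDocument $doc): array {
    $out = array();
    foreach ($doc->getElementsByTagName('title') as $el) {
        $out[] = array('title', trim($el->textContent));
    }
    foreach ($doc->getElementsByTagName('meta') as $el) {
        $name = $el->getAttribute('name');
        if (stripos($name, 'dc.') === 0) {
            $out[] = array(strtolower($name), $el->getAttribute('content'));
        }
    }
    return $out;
}

/* Hypothetical extractor: links carrying rel="tag". */
function extract_rel_tag_sketch(DOMDocument $doc): array {
    $out = array();
    foreach ($doc->getElementsByTagName('a') as $el) {
        $rels = preg_split('/\s+/', trim($el->getAttribute('rel')));
        if (in_array('tag', $rels, true)) {
            $out[] = array('tag', $el->getAttribute('href'));
        }
    }
    return $out;
}

/* Parse the markup once, then hand the same tree to every extractor. */
function extract_all_sketch(string $html, array $extractors): array {
    libxml_use_internal_errors(true); // tolerate tag soup without warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $results = array();
    foreach ($extractors as $name => $fn) {
        $results[$name] = $fn($doc);
    }
    return $results;
}

$html = '<html><head><title>Home</title>'
      . '<meta name="DC.creator" content="Alice"></head>'
      . '<body><a rel="tag" href="http://example.com/tags/rdf">rdf</a>'
      . '<p>unclosed paragraph</body></html>';

$results = extract_all_sketch($html, array(
    'dc' => 'extract_dc_sketch',
    'rel-tag' => 'extract_rel_tag_sketch',
));
```

The point of the design is that adding another format costs one more pass over an already-built tree, not another full parse of the source document.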

Many formats do not map to a single RDF vocabulary. For example, a social graph aggregator might prefer FOAF as the target format, while an address book application may be based on a particular vCard-RDF mapping. Additionally, only a subset of all possible triples in a document will be needed in a given context.

ARC therefore lets you specify the desired formats, and also the preferred mapping (if available). Currently supported formats are listed under "Available Extractors" at the end of this page.

Extracting a single format

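include_once('path/to/arc/ARC2.php');

/* turn off automatic extraction so the format can be selected explicitly */
$config = array('auto_extract' => 0);
/* pass the configuration to the parser, otherwise it has no effect */
$parser = ARC2::getSemHTMLParser($config);
$parser->parse('http://example.com/home.html');
$parser->extractRDF('rdfa');

$triples = $parser->getTriples();
$rdfxml = $parser->toRDFXML($triples);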
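To inspect what `getTriples()` returned, it can help to print each triple on one line. The sketch below assumes that each triple is an associative array with `s`, `p`, `o`, `s_type`, and `o_type` keys, and that the type values are `uri`, `bnode`, or `literal`; check this against your ARC version. The helper names and the sample data are hypothetical, and the output is a rough debugging format, not a substitute for ARC's own serializers.

```php
<?php
// Debugging sketch. Assumption: triples are associative arrays with
// 's', 'p', 'o', 's_type', 'o_type' keys. Helper names are hypothetical.

/* Render one term according to its (assumed) type. */
function format_term_sketch(string $value, string $type): string {
    if ($type === 'uri') {
        return '<' . $value . '>';
    }
    if ($type === 'bnode') {
        return $value; // bnode ids are assumed to already carry the "_:" prefix
    }
    // anything else is treated as a plain literal; escape quotes and backslashes
    return '"' . addcslashes($value, "\"\\") . '"';
}

/* Render one triple as a single N-Triples-like line. */
function triple_to_line_sketch(array $t): string {
    $s_type = isset($t['s_type']) ? $t['s_type'] : 'uri';
    $o_type = isset($t['o_type']) ? $t['o_type'] : 'literal';
    return format_term_sketch($t['s'], $s_type)
         . ' <' . $t['p'] . '> '
         . format_term_sketch($t['o'], $o_type) . ' .';
}

// Hand-written sample data for illustration, not real extractor output.
$sample = array(
    array(
        's' => 'http://example.com/home.html', 's_type' => 'uri',
        'p' => 'http://purl.org/dc/elements/1.1/title',
        'o' => 'Home "page"', 'o_type' => 'literal',
    ),
    array(
        's' => '_:b1', 's_type' => 'bnode',
        'p' => 'http://xmlns.com/foaf/0.1/homepage',
        'o' => 'http://example.com/', 'o_type' => 'uri',
    ),
);

$lines = array_map('triple_to_line_sketch', $sample);
echo implode("\n", $lines), "\n";
```

In real code, `$sample` would be replaced by the array returned from `$parser->getTriples()`.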

Extracting multiple formats

The extractRDF method expects a single string parameter which can contain multiple space-separated entries:

$parser->extractRDF('erdf openid microformats');
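If the list of formats comes from user input or application settings, a small helper can build the space-separated string and reject unknown names before they reach `extractRDF`. This is a sketch: `build_format_string_sketch` is a hypothetical helper, not part of ARC, and its whitelist simply copies the extractor names listed under "Available Extractors" on this page.

```php
<?php
// Sketch: build the space-separated format string for extractRDF().
// build_format_string_sketch() is a hypothetical helper, not an ARC function.

function build_format_string_sketch(array $wanted): string {
    // Extractor names copied from the "Available Extractors" list on this page.
    $available = array('dc', 'erdf', 'microformats', 'openid', 'posh-rdf', 'rdfa');
    $unknown = array_diff($wanted, $available);
    if ($unknown) {
        throw new InvalidArgumentException(
            'Unknown extractor(s): ' . implode(', ', $unknown)
        );
    }
    // Drop duplicates but keep the caller's order.
    return implode(' ', array_unique($wanted));
}

$formats = build_format_string_sketch(array('erdf', 'openid', 'microformats', 'erdf'));
echo $formats, "\n";
```

The resulting string could then be passed on as `$parser->extractRDF($formats);`.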

Setting application-wide defaults

Sooner or later, you are probably going to switch from the low-level method above to code such as:

$store->query('LOAD <http://example.com/home.html>');

In this case you can specify the formats to extract as a configuration setting. This setting will be used by all components invoked by the instantiated class:

$config = array(
  /* db */
  ...
  /* store */
  ...
  /* sem html extraction */
  'sem_html_formats' => 'openid dc rdfa',
);
$store = ARC2::getStore($config);
...

Available Extractors

  • dc (title, link, and meta tags)
  • erdf
  • microformats (xfn, rel-tag, rel-bookmark, rel-nofollow, rel-directory, rel-license, hcard, hcalendar, hatom, hreview, xfolk, hresume, address, and geolocation)
  • openid
  • posh-rdf (custom definitions)
  • rdfa