RDFa Parser for Java


Welcome to java-rdfa

The cruftiest RDFa parser in the world, I'll bet. Apologies that there isn't much documentation. Things may explode: you have been warned.

It currently passes all conformance tests for XHTML, and all but one of the HTML 4 and HTML 5 tests.

This was written by Damian Steer. It is an offshoot of the Stars Project, which was funded by JISC.

Useful Links

Basic Use

$ ls
htmlparser-1.2.1.jar    java-rdfa-0.4.jar

$ java -jar java-rdfa-0.4.jar http://examples.tobyinkster.co.uk/hcard
<http://examples.tobyinkster.co.uk/hcard> <http://xmlns.com/foaf/0.1/primaryTopic> <http://examples.tobyinkster.co.uk/hcard#jack> .

or (equivalent):

$ java -cp '*' rdfa.simpleparse http://examples.tobyinkster.co.uk/hcard
<http://examples.tobyinkster.co.uk/hcard> <http://xmlns.com/foaf/0.1/primaryTopic> <http://examples.tobyinkster.co.uk/hcard#jack> .

For HTML sources, add the --format argument. You will also need the validator.nu parser:

$ java -cp '*' rdfa.simpleparse --format HTML http://www.slideshare.net/intdiabetesfed/world-diabetes-day-2009
<http://www.slideshare.net/intdiabetesfed/world-diabetes-day-2009> <http://www.w3.org/1999/xhtml/vocab#stylesheet> <http://public.slidesharecdn.com/v3/styles/combined.css?1265372095> .

The output of simpleparse is N-Triples, which is hard to read. If you have Jena, try adding it to your classpath and using rdfa.parse instead:

$ java -cp '*:/path/to/jena/lib/*' rdfa.parse --format HTML http://www.slideshare.net/intdiabetesfed/world-diabetes-day-2009
@prefix dc:      <http://purl.org/dc/terms/> .
@prefix hx:      <http://purl.org/NET/hinclude> .
... nice turtle output ...

Java Use

To use the parser directly, without the assistance of an RDF toolkit (a bold choice), implement a StatementSink to collect the triples, then use a parser from the Factory to make a reader:

XMLReader reader = ParserFactory.createReaderForFormat(sink, Format.XHTML); // or HTML, still an XMLReader
reader.parse(source); // Your sink will be sent triples
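
As a rough illustration of what "a sink that collects triples" could look like, here is a minimal sketch. It does not use the real StatementSink: the `TripleSink` interface, its two method names and their parameters are assumptions made up for this example so that it compiles on its own, and the actual interface may differ. The sink simply renders each triple it is handed as one N-Triples line.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectingSinkSketch {

    // Hypothetical stand-in for java-rdfa's StatementSink. The real
    // interface may declare different or additional methods.
    interface TripleSink {
        void addObject(String subject, String predicate, String object);
        void addLiteral(String subject, String predicate, String lex, String lang, String datatype);
    }

    // Keeps every triple it receives as one N-Triples line.
    static class CollectingSink implements TripleSink {
        private final List<String> lines = new ArrayList<>();

        @Override
        public void addObject(String subject, String predicate, String object) {
            lines.add(term(subject) + " " + term(predicate) + " " + term(object) + " .");
        }

        @Override
        public void addLiteral(String subject, String predicate, String lex, String lang, String datatype) {
            String literal = "\"" + escape(lex) + "\"";
            if (lang != null && !lang.isEmpty()) {
                literal += "@" + lang;                // language-tagged literal
            } else if (datatype != null) {
                literal += "^^<" + datatype + ">";    // typed literal
            }
            lines.add(term(subject) + " " + term(predicate) + " " + literal + " .");
        }

        List<String> lines() {
            return lines;
        }

        // Assumption: blank nodes arrive as "_:label"; anything else is treated as an IRI.
        private static String term(String t) {
            return t.startsWith("_:") ? t : "<" + t + ">";
        }

        // Escapes backslashes, quotes and line breaks for an N-Triples literal.
        private static String escape(String s) {
            return s.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
        }
    }

    public static void main(String[] args) {
        CollectingSink sink = new CollectingSink();
        // In real use the parser would make these calls; here they are made by hand.
        sink.addObject("http://examples.tobyinkster.co.uk/hcard",
                "http://xmlns.com/foaf/0.1/primaryTopic",
                "http://examples.tobyinkster.co.uk/hcard#jack");
        // Illustrative literal only, not taken from the page above.
        sink.addLiteral("http://example.org/doc", "http://purl.org/dc/terms/title",
                "An example title", "en", null);
        sink.lines().forEach(System.out::println);
    }
}
```

With the real library, a class like `CollectingSink` would implement StatementSink instead and be passed to `ParserFactory.createReaderForFormat` as shown above.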

java-rdfa can be used from jena. Simply invoke:


This hooks the two readers into Jena, after which you will be able to:

model.read(url, "XHTML"); // xml parsing
model.read(other, "HTML"); // html parsing

java-rdfa is available from the Maven Central repository. Note that it does not depend on Jena.

A sesame reader provided by Henry Story is also available.

Open Graph Protocol

A very simple OGP reader is provided. This follows what (I think) Toby Inkster did:

    Map<String, String> prop =


    title => 'Kick-Ass'
    http://www.facebook.com/2008/fbml#app_id => '326803741017'
    http://www.w3.org/1999/xhtml/vocab#icon => 'http://images.rottentomatoes.com/images/icons/favicon.ico'
    http://www.w3.org/1999/xhtml/vocab#stylesheet => 'http://images.rottentomatoes.com/files/inc_beta/generated/css/mob.css'
    image => 'http://images.rottentomatoes.com/images/movie/custom/00/1217700.jpg'
    site_name => 'Rotten Tomatoes'
    type => 'movie'
    url => 'http://www.rottentomatoes.com/m/1217700-kick_ass/'
    http://www.facebook.com/2008/fbml#admins => '1106591'
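
The keys in the listing above suggest a simple convention: Open Graph terms appear under their short name (title, image), while every other property keeps its full URI. The sketch below shows one way such keys might be derived. It is not the reader's actual code; the class name, the `key` and `toProperties` helpers, and the two namespace URIs are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OgpKeySketch {

    // Namespaces assumed to identify Open Graph terms; the real reader
    // may recognise a different set.
    static final String[] OGP_NAMESPACES = {
        "http://opengraphprotocol.org/schema/",
        "http://ogp.me/ns#"
    };

    // Shortens an Open Graph predicate to its local name and leaves
    // any other predicate URI untouched.
    static String key(String predicate) {
        for (String ns : OGP_NAMESPACES) {
            if (predicate.startsWith(ns)) {
                return predicate.substring(ns.length());
            }
        }
        return predicate;
    }

    // Builds a property map from {predicate, value} pairs, keeping insertion order.
    static Map<String, String> toProperties(String[][] statements) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String[] st : statements) {
            props.put(key(st[0]), st[1]);
        }
        return props;
    }

    public static void main(String[] args) {
        // Predicate URIs here are assumed; the values are taken from the listing above.
        String[][] statements = {
            {"http://opengraphprotocol.org/schema/title", "Kick-Ass"},
            {"http://www.facebook.com/2008/fbml#app_id", "326803741017"},
            {"http://opengraphprotocol.org/schema/site_name", "Rotten Tomatoes"}
        };
        for (Map.Entry<String, String> e : toProperties(statements).entrySet()) {
            System.out.println(e.getKey() + " => '" + e.getValue() + "'");
        }
    }
}
```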

Form Mode

There is a secret form mode (that prompted the development of this parser). In this mode you can generate basic graph patterns by including ?variables where CURIEs are allowed, and INPUT tags generate @name variables.

Simple example (from the tests) and the query that results.



  • (Finally) support overlapping literals. No one noticed this didn't work!
  • Added turtle-ish output. Slightly less nasty than N-Triples.
  • Bug fixes...
  • Turned OFF html 5 streaming. Such a bad idea on my part.
  • Started RDFa 1.1 support.
  • Added simple OGP reader.


  • Updated to current conformance tests
  • Switched validator.nu to streaming mode (may live to regret this).
  • Created very simple n-triple and rdf/xml streaming serialisers.
  • Usual bug fixes etc.
  • Jena is now a provided maven dependency. Using java-rdfa won't pull in jena.
  • Sesame reader created by Henry Story added. It can't be added to the central Maven repository since Sesame isn't available there, so it was spun out into a small module.
  • Tests for query, and some utilities.