RDFa Parser for java

Welcome to java-rdfa

The cruftiest RDFa parser in the world, I'll bet. Apologies that there isn't much documentation. Things may explode: you have been warned.

Currently passing all conformance tests for XHTML, and the HTML 4 and 5 tests with one exception.

This was written by Damian Steer. It is an offshoot of the Stars Project, which was funded by JISC.

Useful Links

Basic Use

$ ls
htmlparser-1.2.1.jar    java-rdfa-0.4.jar

$ java -jar java-rdfa-0.4.jar
<> <> <> .

or (equivalent):

$ java -cp '*' rdfa.simpleparse
<> <> <> .

For HTML sources add the format argument. You will also need the HTML parser (the htmlparser jar listed above) on the classpath:

$ java -cp '*' rdfa.simpleparse --format HTML
<> <> <> .

The output of simpleparse is N-Triples, which is hard to read. If you have jena, try adding it to your classpath and using rdfa.parse instead:

$ java -cp '*:/path/to/jena/lib/*' rdfa.parse --format HTML
@prefix dc:      <> .
@prefix hx:      <> .
... nice turtle output ...

Java Use

To use the parser directly, without the assistance of an RDF toolkit (a bold choice), implement a StatementSink to collect the triples, then use the ParserFactory to make a reader for your sink:

XMLReader reader = ParserFactory.createReaderForFormat(sink, Format.XHTML); // or HTML, still an XMLReader
reader.parse(source); // Your sink will be sent triples
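
If you do go the bold route, a collecting sink might look roughly like the sketch below. Note that TripleSink here is a hypothetical stand-in declared in the example itself so that it compiles on its own; the real StatementSink interface may have different method names and more callbacks, so check the source before copying. The escaping is also minimal, and blank nodes are assumed to arrive as "_:label".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the library's StatementSink; the real interface may differ. */
interface TripleSink {
    void addObject(String subject, String predicate, String object);
    void addLiteral(String subject, String predicate, String lexical, String lang, String datatype);
}

/** Collects every triple it is sent as one N-Triples-style line. */
class CollectingSink implements TripleSink {
    private final List<String> lines = new ArrayList<>();

    @Override
    public void addObject(String subject, String predicate, String object) {
        lines.add(term(subject) + " " + term(predicate) + " " + term(object) + " .");
    }

    @Override
    public void addLiteral(String subject, String predicate, String lexical, String lang, String datatype) {
        StringBuilder literal = new StringBuilder("\"").append(escape(lexical)).append('"');
        if (lang != null && !lang.isEmpty()) {
            literal.append('@').append(lang);                     // language-tagged literal
        } else if (datatype != null) {
            literal.append("^^<").append(datatype).append('>');   // typed literal
        }
        lines.add(term(subject) + " " + term(predicate) + " " + literal + " .");
    }

    List<String> lines() {
        return lines;
    }

    // Assumption: blank nodes arrive as "_:label"; anything else is treated as an IRI.
    private static String term(String t) {
        return t.startsWith("_:") ? t : "<" + t + ">";
    }

    // Minimal escaping only: backslash, double quote and newline.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}

class SinkDemo {
    public static void main(String[] args) {
        CollectingSink sink = new CollectingSink();
        // In real use the parser would make these calls while reading the document.
        sink.addObject("http://example.org/doc", "http://xmlns.com/foaf/0.1/maker", "_:a");
        sink.addLiteral("_:a", "http://xmlns.com/foaf/0.1/name", "Damian", "en", null);
        sink.lines().forEach(System.out::println);
    }
}
```

Keeping the triples as plain strings avoids pulling in any RDF toolkit, which is the point of using the parser directly.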

java-rdfa can be used from jena. Once its two readers are hooked in to jena, you will be able to read into a model using the format name "XHTML" (xml parsing) or "HTML" (html parsing).

java-rdfa is available in the maven central repositories. Note that it does not depend on jena.

A sesame reader provided by Henry Story is also available.

Open Graph Protocol

A very simple OGP reader is provided. This follows what (I think) Toby Inkster did:

    Map<String, String> prop =


    title => 'Kick-Ass' => '326803741017' => '' => ''
    image => ''
    site_name => 'Rotten Tomatoes'
    type => 'movie'
    url => '' => '1106591'
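
Since the reader hands back a plain Map<String, String>, working with the result is ordinary map handling. The sketch below prints a map in the same key => 'value' style as the listing above. The map is built by hand as a stand-in for whatever the OGP reader returns, and the class and method names (OgpDemo, render) are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OgpDemo {
    /** Renders each property as: key => 'value', one per line. */
    static String render(Map<String, String> props) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.append(e.getKey()).append(" => '").append(e.getValue()).append("'\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the map the OGP reader would return.
        Map<String, String> prop = new LinkedHashMap<>();
        prop.put("title", "Kick-Ass");
        prop.put("site_name", "Rotten Tomatoes");
        prop.put("type", "movie");
        System.out.print(render(prop));
    }
}
```

A LinkedHashMap is used here only so the output order is predictable; nothing above says which Map implementation the reader itself uses.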

Form Mode

There is a secret form mode (which prompted the development of this parser). In this mode you can generate basic graph patterns by including ?variables wherever CURIEs are allowed, and INPUT tags generate @name variables.

Simple example (from the tests) and the query that results.



  • (Finally) support overlapping literals. No one noticed this didn't work!
  • Added turtle-ish output. Slightly less nasty than N-Triples.
  • Bug fixes...
  • Turned OFF html 5 streaming. Such a bad idea on my part.
  • Started RDFa 1.1 support.
  • Added simple OGP reader.


  • Updated to current conformance tests
  • Switched to streaming mode (may live to regret this).
  • Created very simple n-triple and rdf/xml streaming serialisers.
  • Usual bug fixes etc.
  • Jena is now a provided maven dependency. Using java-rdfa won't pull in jena.
  • Sesame reader created by Henry Story added. It can't be added to the central maven repository since Sesame isn't available there, so it is spun out into a small module.
  • Tests for query, and some utilities.