Chase Tingley edited this page Feb 11, 2014 · 2 revisions

In a nutshell

To parse XML with SNAX, you construct a NodeModel, which is essentially a state machine. You do this by using a NodeModelBuilder to specify chains of selectors, to which you attach ElementHandler instances.

Building a simple NodeModel

Selectors are created by using the NodeModelBuilder EDSL. It's easiest to explain this with an example:

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("xml", "foo")
        .element("bar", with("attr").equalTo("someVal"))
            .attach(new BarHandler());
}}.build();

A few things to note about this example:

  • After the model is created, SNAX won't modify it further. It's also stateless, which means it's reusable and threadsafe. You can define a model once for your application and then parse lots of different documents against it.
  • If the model is stateless, where does the state go? In this case, into MyData. SNAX interfaces are generic on user classes called data objects. A data object passed to SNAXParser.parse() will be passed back to each handler that receives a callback during that parse.
  • The {{ ... }} idiom is the standard pattern for model building, but you don't have to do it that way. This is equivalent:
NodeModelBuilder<MyData> builder = new NodeModelBuilder<MyData>();
builder.elements("xml", "foo")
    .element("bar", builder.with("attr").equalTo("someVal"))
        .attach(new BarHandler());
NodeModel<MyData> model = builder.build();
  • The elements() call is shorthand for multiple individual calls to element(). Each call to element() selects for a single element in the document by name. The document node is assumed to be the root of the selection chain. In this case, elements("xml", "foo") means "foo elements that are children of xml elements that are children of the document node".
  • You can pass these element names as Strings (in which case they are assumed to have no namespace), or you can pass a javax.xml.namespace.QName.
  • The element() call uses additional syntax to constrain the element selection based on attribute value. In this case, it looks for bar elements with an attribute @attr that has the value someVal. Again, attributes without namespaces can be passed as strings; you can also pass a QName.
  • BarHandler is a class implementing ElementHandler<MyData>. It will receive callbacks of various types whenever a selected element (equivalent to the XPath /xml/foo/bar[@attr='someVal']) is encountered.
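
To make the XPath comparison concrete, here is a minimal sketch that uses only the JDK's javax.xml.xpath API (not SNAX) to count the elements that /xml/foo/bar[@attr='someVal'] selects. The class name, method name, and sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it is not part of SNAX.
class SelectorXPathSketch {
    // Counts the elements selected by the XPath that the selector chain
    // is described as being equivalent to.
    static int countMatches(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "/xml/foo/bar[@attr='someVal']",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Only the first <bar> qualifies: the second has the wrong attribute
        // value, and the third is a child of <xml> rather than <foo>.
        String doc = "<xml><foo><bar attr='someVal'/><bar attr='other'/></foo>"
                   + "<bar attr='someVal'/></xml>";
        System.out.println(countMatches(doc)); // prints 1
    }
}
```

The difference in practice is that XPath evaluation here builds a whole DOM tree first, while a SNAX model is matched against streaming events.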

Writing an ElementHandler

The typical way to implement the ElementHandler interface is to subclass DefaultElementHandler. Note that ElementHandler is generic in the type of the data object.

class SomeHandler extends DefaultElementHandler<MyData> {
    static final QName Q_SOME_ATTR = new QName("http://some/namespace", "someAttr");

    @Override
    public void startElement(StartElement element, MyData data) {
        // `element` is the StAX event for this element, which you
        // can use to query attributes and namespaces.
        // `data` is the object that was passed to the parser when parsing began.
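        // StartElement exposes attributes as Attribute events
        // (javax.xml.stream.events.Attribute), looked up by QName.
        Attribute attr = element.getAttributeByName(Q_SOME_ATTR);
        if (attr != null) {
            data.setCurrentValue(attr.getValue());
        }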
    }
}
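
The attribute lookup inside a handler is plain StAX, so it can be tried without SNAX. The following sketch assumes only the JDK's javax.xml.stream API; the class and method names are hypothetical. It reads events until it finds an element carrying the namespaced attribute and returns that attribute's value.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class; it is not part of SNAX.
class StartElementSketch {
    static final QName Q_SOME_ATTR = new QName("http://some/namespace", "someAttr");

    // Returns the value of the namespaced attribute on the first element
    // that carries it, or null if no element does.
    static String firstAttrValue(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement element = event.asStartElement();
                    // QName equality ignores the prefix, so any prefix bound
                    // to the same namespace URI will match.
                    Attribute attr = element.getAttributeByName(Q_SOME_ATTR);
                    if (attr != null) {
                        return attr.getValue();
                    }
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

For example, firstAttrValue("<xml xmlns:n='http://some/namespace'><foo n:someAttr='v'/></xml>") returns "v", while an unqualified someAttr attribute would not match.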

ElementHandler also supports a build() method that allows the nested assembly of models. More on this later.

Using a NodeModel to parse content

XMLInputFactory factory = XMLInputFactory.newInstance();
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{ ... }}.build();
SNAXParser<MyData> parser = SNAXParser.createParser(factory, model);
MyData data = new MyData();
parser.parse(new StringReader("<xml>....</xml>"), data);

Things to note:

  • The XMLInputFactory and NodeModel are both injected so that they can be used repeatedly with different parser instances.
  • The data instance will be passed back to the handlers that receive events. The data object can hold arbitrary runtime state. This allows concurrent use of the NodeModel by multiple threads. If you just want something quick and dirty, you're free to use stateful handlers and just pass null as the data object.
  • parse() will consume the input to completion. If you want to be able to control the rate at which the XML input is consumed, see the processEvent() example below.

Controlling the rate of parsing

The parse() method just consumes all the XML available. This is poorly suited to some uses, such as wrapping XML parsing inside some sort of object iterator. The SNAXParser also exposes the underlying event-by-event processing of StAX:

SNAXParser<MyData> parser = SNAXParser.createParser(factory, model);
MyData data = new MyData();
parser.startParsing(new StringReader("<xml>....</xml>"), data);
while (parser.hasMoreEvents()) {
    do {
        parser.processEvent();
    }
    while (!data.foundSomethingUseful() && parser.hasMoreEvents());
    // Do something useful with the data object
}
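
To illustrate the object-iterator use case, here is a rough sketch of the same pull pattern written directly against StAX's XMLEventReader rather than SNAXParser. The ItemIterator class and the <item> element name are assumptions made for this example. Each call to next() consumes only as many events as it needs to find the following item.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical iterator over the text of <item> elements; not part of SNAX.
class ItemIterator implements Iterator<String> {
    private final XMLEventReader reader;
    private String next;

    ItemIterator(String xml) throws XMLStreamException {
        reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        advance();
    }

    // Pulls events until the next <item> start tag, reads its text, and stops.
    // The rest of the input stays unread until the caller asks for more.
    private void advance() throws XMLStreamException {
        next = null;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "item".equals(event.asStartElement().getName().getLocalPart())) {
                next = reader.getElementText();
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        try {
            advance();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }
}
```

A SNAX-based version would replace the loop in advance() with repeated processEvent() calls that stop once the data object reports something useful.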

Building a more complicated NodeModel

You're not limited to just one selection chain. You can attach handlers at an arbitrary number of places in the model, including by using the wildcard selectors child() and descendant().

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("xml", "foo").child()
        .attach(new ChildOfFooHandler());
    element("xml")
        .descendant(with("action").equalTo("process"))
            .attach(new ActionHandler());
}}.build();

child() is roughly equivalent to /* in XPath, and descendant() is roughly equivalent to //.
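
As a rough illustration of those two XPath analogues, the sketch below evaluates /xml/foo/* and /xml//*[@action='process'] with the JDK's XPath API. It does not use SNAX, and the class name and sample document are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it is not part of SNAX.
class WildcardXPathSketch {
    // Counts the nodes that an XPath expression selects in the given document.
    static int count(String expression, String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<xml><foo><a/><b action='process'/></foo>"
                   + "<c action='process'/><d><e action='process'/></d></xml>";
        // Analogue of elements("xml", "foo").child(): direct children of <foo>.
        System.out.println(count("/xml/foo/*", doc)); // prints 2
        // Analogue of element("xml").descendant(with("action").equalTo("process")):
        // matching elements at any depth below <xml>.
        System.out.println(count("/xml//*[@action='process']", doc)); // prints 3
    }
}
```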

You can also use the addTransition() method to link node states together directly. This allows for backwards or circular transitions that aren't otherwise possible in the SNAX syntax.

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    ElementSelector<MyData> fooSelector = elements("xml", "foo");
    ElementSelector<MyData> barSelector = fooSelector.element("bar");
    // <bar> can contain nested <foo> elements, to arbitrary depth
    barSelector.addTransition("foo", fooSelector);
    // ....
}}.build();

Trickier stuff: Rule Ordering and Evaluation

Only a single selector will be matched against a given element. Selectors are tested in the order they are declared. If none of them matches, any outstanding descendant() selectors are tested against the element, searching upwards through the stack.

The implications of this can be subtle, but they are consistent. Consider this model:

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements(with("foo").equalTo("bar")).attach(new Handler1());
    element("xml").attach(new Handler2());
}}.build();

If this model is used to parse the string <xml foo="bar"/>, the Handler1 instance will receive callbacks, but the Handler2 instance will not, because as soon as the xml element is matched against the first rule, the selection process ends.
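
The first-match-wins behavior can be pictured with a toy dispatcher. This is only an illustration of the ordering rule described above and says nothing about how SNAX is actually implemented. The Element and Rule types are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of "selectors are tested in declaration order"; not SNAX code.
class FirstMatchSketch {
    // Hypothetical stand-in for an XML element: a name plus its attributes.
    static final class Element {
        final String name;
        final Map<String, String> attrs;

        Element(String name, Map<String, String> attrs) {
            this.name = name;
            this.attrs = attrs;
        }
    }

    // Hypothetical stand-in for a selector with an attached handler.
    static final class Rule {
        final String handler;
        final Predicate<Element> test;

        Rule(String handler, Predicate<Element> test) {
            this.handler = handler;
            this.test = test;
        }
    }

    // Returns the handler of the first rule that matches, or null if none does.
    // Later rules are never consulted once a match is found.
    static String select(List<Rule> rules, Element element) {
        for (Rule rule : rules) {
            if (rule.test.test(element)) {
                return rule.handler;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("Handler1", e -> "bar".equals(e.attrs.get("foo"))));
        rules.add(new Rule("Handler2", e -> "xml".equals(e.name)));

        // Models <xml foo="bar"/>: both rules would match, but only the first fires.
        Element xml = new Element("xml", Collections.singletonMap("foo", "bar"));
        System.out.println(select(rules, xml)); // prints Handler1
    }
}
```

Swapping the order of the two rules.add() calls makes the same element select Handler2, which is why declaration order matters when selectors overlap.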

This case demonstrates how explicit rules take precedence over descendant() rules:

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("foo", "bar").attach(new Handler1());
    descendant("bar").attach(new Handler2());
}}.build();

When applied to <foo><bar/></foo>, the Handler1 instance will get callbacks, but the Handler2 instance will not, because the explicit selectors are tested first.

Lastly, more recent descendant() rules take precedence over ones declared further up the tree. This could become an issue in a situation like this:

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    element("xml").descendant("bar").attach(new Handler1());
    descendant("bar").attach(new Handler2());
}}.build();

When used to parse the string <xml><bar /></xml>, the Handler1 instance will get the callbacks for <bar>, while the Handler2 instance will not -- the more specific descendant() rule masks the less-specific one.

Nesting build selectors

As mentioned above, ElementHandler supports a method called build(). Unlike the other ElementHandler methods that receive callbacks during the parse operation, build() is called during model construction. When an ElementHandler is attached, its build() method is called, passing the NodeModelBuilder instance that is constructing the model.

This allows some limited encapsulation and reuse of model logic:

NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("xml", "foo").attach(new FooHandler());
}}.build();

// ... elsewhere ...
class FooHandler extends DefaultElementHandler<MyData> {
    @Override
    public void build(NodeModelBuilder<MyData> builder) {
        builder.element("bar").attach(new BarHandler());
        builder.element("quux", builder.with("id")).attach(new QuuxHandler());
    }
}

This feature is somewhat experimental, and it's not clear how useful it actually is. In particular, the generic type may get in the way. (Or possibly the APIs need to be switched to bounded generics.)