Getting Started
In order to parse XML with SNAX, you will need to construct a NodeModel, which is basically a state machine. The way to do this is by using a NodeModelBuilder to specify chains of selectors, to which you attach instances of ElementHandler.
Selectors are created by using the NodeModelBuilder EDSL. It's easiest to explain this with an example:
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("xml", "foo")
        .element("bar", with("attr").equalTo("someVal"))
        .attach(new BarHandler());
}}.build();
A few things to note about this example:
- After the model is created, SNAX won't modify it further. It's also stateless, which means it's reusable and threadsafe. You can define a model once for your application and then parse lots of different documents against it.
- If the model is stateless, where does the state go? In this case, into MyData. SNAX interfaces are generic on user classes called data objects. A data object passed to SNAXParser.parse() will be passed back to each of its handlers.
- The {{ ... }} idiom is the standard pattern for model building, but you don't have to do it that way. This is equivalent:
NodeModelBuilder<MyData> builder = new NodeModelBuilder<MyData>();
builder.elements("xml", "foo")
    .element("bar", builder.with("attr").equalTo("someVal"))
    .attach(new BarHandler());
NodeModel<MyData> model = builder.build();
- The elements() call is shorthand for multiple individual calls to element(). Each call to element() selects for a single element in the document by name. The document node is assumed to be the root of the selection chain. In this case, elements("xml", "foo") means "foo elements that are children of xml elements that are children of the document node".
- You can pass these element names as Strings (in which case they are assumed to have no namespace), or you can pass a javax.xml.namespace.QName.
- The element() call uses additional syntax to constrain the element selection based on attribute value. In this case, it looks for bar elements with an attribute @attr that has the value someVal. Again, attributes without namespaces can be passed as strings; you can also pass a QName.
- BarHandler is a class implementing ElementHandler<MyData>. It will receive callbacks of various types whenever a selected element (equivalent to the XPath /xml/foo/bar[@attr='someVal']) is encountered.
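To make the meaning of that selection chain concrete, here is a minimal sketch that finds the same elements using only the JDK's StAX API, without SNAX. The class and method names (SelectorChainSketch, countMatches) are hypothetical, and this is not how SNAX is implemented; it only shows which elements the chain is meant to select.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration only: hand-written matching for
// /xml/foo/bar[@attr='someVal'] on top of plain StAX.
class SelectorChainSketch {
    static int countMatches(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            // Names of the currently open elements, innermost first.
            Deque<String> path = new ArrayDeque<>();
            int matches = 0;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    path.push(start.getName().getLocalPart());
                    // The path must be exactly document -> xml -> foo -> bar.
                    boolean onPath = new ArrayList<>(path)
                            .equals(Arrays.asList("bar", "foo", "xml"));
                    Attribute attr = start.getAttributeByName(new QName("attr"));
                    if (onPath && attr != null && "someVal".equals(attr.getValue())) {
                        matches++;
                    }
                } else if (event.isEndElement()) {
                    path.pop();
                }
            }
            return matches;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The bookkeeping here (a path stack plus an attribute check) is the kind of work the selector chain is meant to replace with a declaration.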
The typical way to implement the ElementHandler interface is to subclass DefaultElementHandler. Note that ElementHandler is generic in the type of the data object.
class SomeHandler extends DefaultElementHandler<MyData> {
    static final QName Q_SOME_ATTR = new QName("http://some/namespace", "someAttr");

    @Override
    public void startElement(StartElement element, MyData data) {
        // `element` is the StAX event for this element, which you
        // can use to query attributes and namespaces.
        // `data` is the object that was passed to the parser when parsing began.
        // A StAX StartElement exposes attributes through getAttributeByName(),
        // which returns a javax.xml.stream.events.Attribute, or null if absent.
        Attribute attr = element.getAttributeByName(Q_SOME_ATTR);
        if (attr != null) {
            data.setCurrentValue(attr.getValue());
        }
    }
}
ElementHandler also supports a build() method that allows the nested assembly of models. More on this later.
XMLInputFactory factory = XMLInputFactory.newInstance();
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{ ... }}.build();
SNAXParser<MyData> parser = SNAXParser.createParser(factory, model);
MyData data = new MyData();
parser.parse(new StringReader("<xml>....</xml>"), data);
Things to note:
- The XMLInputFactory and NodeModel are both injected so that they can be used repeatedly with different parser instances.
- The data instance will be passed back to the handlers that receive events. The data object can hold arbitrary runtime state. This allows concurrent use of the NodeModel by multiple threads. If you just want something quick and dirty, you're free to use stateful handlers and just pass null as the data object.
- parse() will consume the input to completion. If you want to be able to control the rate at which the XML input is consumed, see the processEvent() example below.
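The reason a separate data object enables concurrent use can be sketched without SNAX at all. In the toy below, every name (DataObjectSketch, Counts, Callback, countBars, countInParallel) is hypothetical and none of it is the SNAX API: a single stateless callback is shared by two threads, and each simulated parse carries its own state in its own data object.

```java
import java.util.List;

class DataObjectSketch {
    /** Hypothetical data object: all per-parse state lives here. */
    static final class Counts {
        int bars;
    }

    /** Hypothetical stand-in for a handler callback; not the SNAX interface. */
    interface Callback<T> {
        void onElement(String name, T data);
    }

    /** One shared callback with no fields of its own, so it is safe to reuse. */
    static final Callback<Counts> SHARED = (name, data) -> {
        if (name.equals("bar")) {
            data.bars++;
        }
    };

    /** Simulates one parse: a fresh data object is passed to every callback. */
    static int countBars(List<String> elementNames) {
        Counts data = new Counts();
        for (String name : elementNames) {
            SHARED.onElement(name, data);
        }
        return data.bars;
    }

    /** Runs two simulated parses concurrently against the same shared callback. */
    static int[] countInParallel(List<String> first, List<String> second) {
        int[] results = new int[2];
        Thread a = new Thread(() -> results[0] = countBars(first));
        Thread b = new Thread(() -> results[1] = countBars(second));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

Because the callback never stores anything in its own fields, the two runs cannot interfere with each other. A stateful handler combined with a null data object would lose that property.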
The parse() method just consumes all the XML available. This is poorly suited to some uses, such as wrapping XML parsing inside some sort of object iterator. The SNAXParser also exposes the underlying event-by-event processing of StAX:
SNAXParser<MyData> parser = SNAXParser.createParser(factory, model);
MyData data = new MyData();
parser.startParsing(new StringReader("<xml>....</xml>"), data);
while (parser.hasMoreEvents()) {
    do {
        parser.processEvent();
    }
    while (!data.foundSomethingUseful() && parser.hasMoreEvents());
    // Do something useful with the data object
}
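The object-iterator use case mentioned above can be sketched with the JDK's StAX event reader alone. This is not SNAX code: the class name ItemIteratorSketch and the item element name are assumptions made for illustration. The same pull-one-result-at-a-time structure would apply if the inner loop called processEvent() instead.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration: yields the text of each <item> element lazily,
// reading only as much XML as is needed to find the next one.
class ItemIteratorSketch implements Iterator<String> {
    private final XMLEventReader reader;
    private String lookahead; // next value to return, or null when exhausted

    ItemIteratorSketch(String xml) {
        this.reader = open(xml);
        advance();
    }

    private static XMLEventReader open(String xml) {
        try {
            return XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Consumes events until the next <item> is found or the input ends. */
    private void advance() {
        lookahead = null;
        try {
            while (lookahead == null && reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement() && "item".equals(
                        event.asStartElement().getName().getLocalPart())) {
                    lookahead = reader.getElementText();
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return lookahead != null;
    }

    @Override
    public String next() {
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        String result = lookahead;
        advance();
        return result;
    }
}
```

The iterator keeps one value of look-ahead so that hasNext() can answer without consuming input on every call.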
You're not limited to just one selection chain. You can attach handlers to an arbitrary number of places in the model, including using the wildcard selectors child() and descendant().
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("xml", "foo").child()
        .attach(new ChildOfFooHandler());
    element("xml")
        .descendant(with("action").equalTo("process"))
        .attach(new ActionHandler());
}}.build();
child() is roughly equivalent to /* in XPath, and descendant() is roughly equivalent to //.
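For readers who want to check the XPath side of that analogy, the following sketch evaluates the two rough equivalents, /xml/foo/* and /xml//*[@action='process'], with the JDK's built-in XPath engine. It does not use SNAX; the class and method names (XPathAnalogySketch, count) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of the XPath expressions only.
class XPathAnalogySketch {
    /** Returns how many nodes the XPath expression selects in the given XML. */
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Given a document such as `<xml><foo><a/><b action="process"><c action="process"/></b></foo><d action="process"/></xml>`, the first expression selects the direct children of foo (a and b), while the second selects every matching element at any depth below xml (b, c and d).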
You can also use the addTransition() method to link node states together directly. This allows for backwards or circular transitions that aren't otherwise possible in the SNAX syntax.
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    ElementSelector<MyData> fooSelector = elements("xml", "foo");
    ElementSelector<MyData> barSelector = fooSelector.element("bar");
    // <bar> can contain nested <foo> elements, to arbitrary depth
    barSelector.addTransition("foo", fooSelector);
    // ....
}}.build();
Only a single selector will be matched against a given element. Selectors are tested in the order they are declared. If no selector is found, any outstanding descendant() selectors are tested against the element, searching upwards through the stack.
The implications of this can be subtle, but they are consistent. Consider this model:
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements(with("foo").equalTo("bar")).attach(new Handler1());
    element("xml").attach(new Handler2());
}}.build();
If this model is used to parse the string <xml foo="bar"/>, the Handler1 instance will receive callbacks, but the Handler2 instance will not, because as soon as the xml element is matched against the first rule, the selection process ends.
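The "first declared match wins" behaviour can be modelled in a few lines. The sketch below is a toy, not SNAX's implementation: FirstMatchSketch, Rule, select and exampleRules are hypothetical names, and the rules are plain predicates over an element name and its attributes. It covers only the ordering of explicit rules, not the descendant() fallback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Toy model of declaration-order matching; not the SNAX implementation.
class FirstMatchSketch {
    static final class Rule {
        final String handler;
        final BiPredicate<String, Map<String, String>> matches;

        Rule(String handler, BiPredicate<String, Map<String, String>> matches) {
            this.handler = handler;
            this.matches = matches;
        }
    }

    /** Returns the handler of the first rule that matches, or null if none does. */
    static String select(List<Rule> rules, String name, Map<String, String> attrs) {
        for (Rule rule : rules) {
            if (rule.matches.test(name, attrs)) {
                return rule.handler; // selection stops at the first match
            }
        }
        return null;
    }

    /** Mirrors the two rules of the model above, in the same order. */
    static List<Rule> exampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("Handler1", (name, attrs) -> "bar".equals(attrs.get("foo"))));
        rules.add(new Rule("Handler2", (name, attrs) -> name.equals("xml")));
        return rules;
    }
}
```

With these rules, an xml element carrying foo="bar" selects Handler1 and never reaches the second rule, while an xml element without that attribute falls through to Handler2.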
This case demonstrates how explicit rules take precedence over descendant() rules:
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("foo", "bar").attach(new Handler1());
    descendant("bar").attach(new Handler2());
}}.build();
When applied to <foo><bar/></foo>, the Handler1 instance will get callbacks, but the Handler2 instance will not, because the explicit selectors are tested first.
Lastly, more recent descendant() rules take precedence over ones declared further up the tree. This could become an issue in a situation like this:
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    element("xml").descendant("bar").attach(new Handler1());
    descendant("bar").attach(new Handler2());
}}.build();
When used to parse the string <xml><bar /></xml>, the Handler1 instance will get the callbacks for <bar>, while the Handler2 instance will not: the more specific descendant() rule masks the less specific one.
As mentioned above, ElementHandler supports a method called build(). Unlike the other ElementHandler methods that receive callbacks during the parse operation, build() is called during model construction. When an ElementHandler is attached, its build() method is called, passing the NodeModelBuilder instance that is constructing the model.
This allows some limited encapsulation and reuse of model logic:
NodeModel<MyData> model = new NodeModelBuilder<MyData>() {{
    elements("xml", "foo").attach(new FooHandler());
}}.build();
// ... elsewhere ...
class FooHandler extends DefaultElementHandler<MyData> {
    @Override
    public void build(NodeModelBuilder<MyData> builder) {
        builder.element("bar").attach(new BarHandler());
        // Outside the {{ ... }} initializer, with() is called on the builder.
        builder.element("quux", builder.with("id")).attach(new QuuxHandler());
    }
}
This feature is somewhat experimental, and it's not clear how useful it actually is. In particular, the generic type may get in the way. (Or possibly the APIs need to be switched to bounded generics.)