Using the Reader API

McBen edited this page Sep 2, 2016 · 17 revisions

Note: This was written years back, and needs to be checked against git head.

This post describes the Reader API of ruby-libxml.

## Introduction

Several techniques exist to parse XML documents. You can read up on them in this Wikipedia article. Reader provides a StAX-style API for parsing XML documents.

The Reader API provides a "cursor" that moves forward through the XML document node by node, and you process the data in a node while the cursor is on it. This paradigm is also called "pull parsing". You can initialize an XML document from a file, a string, a URI or an IO object and then call XML::Reader#read to move through the document. The read method returns false when there are no more nodes to read. Optionally, you can pass a hash when initializing the document to control how parsing is done. Typically, you would do something like this:

doc = XML::Reader.file("trees.xml", :options => XML::Parser::Options::NOENT)
process(doc) while doc.read

The possible parsing options are the constants defined under XML::Parser::Options. Multiple options can be combined using bitwise OR ( | ).

After a document is parsed you should free the resources by calling doc.close.

Getting Information from the current node.

While the cursor is at one of the nodes, you can query it for:

  1. Node Type: doc.node_type returns the type of the node as one of the following integers:
    • Start of an element : 1
    • Attributes : 2
    • Text : 3
    • CDATA : 4
    • Entity References : 5
    • Entity Declarations : 6
    • Processing Instruction : 7
    • Comments : 8
    • Document : 9
    • DTD/Doctype : 10
    • Document Fragment : 11
    • Notation : 12
    • Whitespace : 13
    • Significant Whitespace : 14
    • End of an element : 15
    • End entity : 16
    • XML Declaration : 17
      See this for a description of all the node types. Constants are defined for the node types under the XML::Reader class.
  2. Name: doc.name returns the qualified name of the node (prefix + local name).
    • Local Name: doc.local_name returns the local name of the node (the name without its prefix).
    • Prefix: doc.prefix returns the namespace prefix associated with the node.
  3. Namespace: doc.namespace_uri returns the URI of the node's namespace.
    • Namespace declarations are also considered nodes, in line with the DOM API. While the cursor is on an attribute node, use doc.namespace_declaration? to find out whether that attribute is a namespace declaration.
    • Given a prefix (see 2), you can look up the associated namespace with doc.lookup_namespace("prefix"); pass nil to get the default namespace.
  4. Value: doc.value returns the text value of the node if present, else nil. You can also check whether the node has a text value with doc.has_value?.
  5. Empty: doc.empty_element? tells you whether the node is an empty element. Empty elements are those that are closed in their start tag itself, such as <otu id="t1"/>.
  6. Depth: doc.depth returns the depth of the node in the tree, counted from the root element.

Reading attributes.

To find out if a node has an attribute or not, use doc.has_attributes?. You can find the attribute count of the node with doc.attribute_count. Even though attributes are also nodes, doc.read does not move the cursor to an attribute node.

  • Attributes can be accessed in a hash-like manner with the [] method. [] can be called with the attribute's name or its index (the first attribute has index 0).
  • With doc.move_to_next_attribute you can move the cursor to the next attribute. It returns 1 if the cursor moved to the next attribute and 0 if there is no attribute to move to. While the cursor is on an attribute node you can query it like any other node (for name, value, node type, depth) as described above. You must remember to move back to the element node with doc.move_to_element. Alternatively, you can call move_to_attribute on the cursor with an attribute's name as the argument to move to that attribute node. I prefer the array notation.
  • read_attribute_value is a related method whose use I have not fully understood. Refer to the documentation if you need it.

Validation.

To check whether the XML document conforms to a schema definition, call the schema_validate method on the reader object and pass it the location of the schema file. It returns 0 if the document validates and -1 in case of an error. Note that this method should be called right after you instantiate the Reader object. Trying to validate an XML document after you have started reading (that is, after calling read on the document object) is an error.

doc.schema_validate("schema.xsd")

There are a few more API calls, which you can refer to here.

Code Example

Below is a "hello world" code example using the Reader API, a sample XML file and the result of parsing it. Since I have described the technicalities above, I am not going to walk you through the code.

require "rubygems"
require "xml"

#parse the sample.xml ignoring whitespaces and
#performing entity substitution.
doc = XML::Reader.file("sample.xml", :options => XML::Parser::Options::NOBLANKS |
                                            XML::Parser::Options::NOENT
                      )

#display a node's name: local and prefix
def display_name( node )
    puts "\tName: #{node.name}"
    # prefix is nil for nodes without a namespace prefix
    if node.prefix
        puts "\t\tPrefix: #{node.prefix}"
        puts "\t\tLocal: #{node.local_name}"
    end
end

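#display attributes of a node
def display_attributes( node )
    node.attribute_count.times do | index |
        puts "Attribute # #{index + 1}"
        node.move_to_next_attribute
        #print the attribute directly: calling display here would re-enter
        #display_attributes, whose move_to_element resets the cursor, so
        #every iteration would land on the first attribute again
        display_name node
        puts "\tDepth: #{node.depth}"
        puts "\tValue: #{node.value}" if node.has_value?
        print "\n"
    end
    node.move_to_element
end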

#process a node
def display( node )
    display_name node
    puts "\tDepth: #{node.depth}"
    puts "\tEmpty Element" if node.empty_element?
    puts "\tValue: #{node.value}" if node.has_value?
    display_attributes node
    print "\n"
end

#shift through the document.
i = 1
while doc.read
    unless doc.node_type == XML::Reader::TYPE_END_ELEMENT
        puts "Node # #{i}"
        display doc
        i += 1
    end
end

#free the resources
doc.close

Sample input: a NeXML file.

<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml 
	version="0.8"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.nexml.org/1.0 ../xsd/nexml.xsd"
	xmlns:nex="http://www.nexml.org/1.0"
	generator="mesquite"
	xmlns:xlink="http://www.w3.org/1999/xlink"
	xmlns="http://www.nexml.org/1.0">
	<otus 
		id="taxa1" 
		label="My taxa block" 
		xml:base="http://example.org/" 
		xml:id="taxa1" 
		class="taxset1" 
		xml:lang="EN" 
		xlink:href="#taxa1">
		<!--  
			The taxon element is analogous to a single label in 
			a nexus taxa block. It may have the same additional
			attributes (label, xml:base, xml:lang, xml:id, xlink:href
			and class) as the taxa element.
		-->
		<otu id="t1"/>
		<otu id="t2"/>
		<otu id="t3"/>
		<otu id="t4"/>
		<otu id="t5"/>
	</otus>
</nex:nexml>

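Output, as originally recorded. Each attribute block repeats the element's first attribute because, in that run, display was called for every attribute; display ends by calling display_attributes, whose move_to_element resets the cursor to the element, so the next move_to_next_attribute lands on the first attribute again: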

Node # 1
	Name: nex:nexml
		Prefix: nex
		Local: nexml
	Depth: 0
Attribute # 1
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance

Attribute # 2
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance

Attribute # 3
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance

Attribute # 4
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance

Attribute # 5
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance

Attribute # 6
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance

Attribute # 7
	Name: xmlns:xsi
		Prefix: xmlns
		Local: xsi
	Depth: 1
	Value: http://www.w3.org/2001/XMLSchema-instance


Node # 2
	Name: otus
	Depth: 1
Attribute # 1
	Name: id
	Depth: 2
	Value: taxa1

Attribute # 2
	Name: id
	Depth: 2
	Value: taxa1

Attribute # 3
	Name: id
	Depth: 2
	Value: taxa1

Attribute # 4
	Name: id
	Depth: 2
	Value: taxa1

Attribute # 5
	Name: id
	Depth: 2
	Value: taxa1

Attribute # 6
	Name: id
	Depth: 2
	Value: taxa1

Attribute # 7
	Name: id
	Depth: 2
	Value: taxa1


Node # 3
	Name: #comment
	Depth: 2
	Value:   
			The taxon element is analogous to a single label in 
			a nexus taxa block. It may have the same additional
			attributes (label, xml:base, xml:lang, xml:id, xlink:href
			and class) as the taxa element.


Node # 4
	Name: otu
	Depth: 2
	Empty Element
Attribute # 1
	Name: id
	Depth: 3
	Value: t1


Node # 5
	Name: otu
	Depth: 2
	Empty Element
Attribute # 1
	Name: id
	Depth: 3
	Value: t2


Node # 6
	Name: otu
	Depth: 2
	Empty Element
Attribute # 1
	Name: id
	Depth: 3
	Value: t3


Node # 7
	Name: otu
	Depth: 2
	Empty Element
Attribute # 1
	Name: id
	Depth: 3
	Value: t4


Node # 8
	Name: otu
	Depth: 2
	Empty Element
Attribute # 1
	Name: id
	Depth: 3
	Value: t5

XML::Reader is primarily a streaming interface, but it also provides convenience methods for mixing in the DOM API (XML::Parser). XPath queries can then be used. Perhaps I will write about it in some future post, after I have tried it out. You can find good info here.