<meta name="author" content="Bernhard R. Fischer">
<meta name="date" content="2012-01-09T12:06:00+0200">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>libhpxml &ndash; A High Performance XML Stream Parser</title>
<style type="text/css"> p { text-align:justify; } </style>
<h1>libhpxml &ndash; A High Performance XML Stream Parser</h1>
libhpxml is currently not a shared library. It is provided as a set of
source files that can be compiled and linked directly into your project.
The current version is <a href="download/">available here</a>.
libhpxml is a high-performance XML stream parser library written in C with a simple API. It
is intended to parse
XML files very fast. This may be required when processing
huge XML files
like the <a href="">OSM
planet file</a>, which is currently larger than 260 GB.
The development goals are <span style="font-weight:bold">speed
efficiency</span> and
<span style="font-weight:bold">simple memory handling</span> to reduce the
risk of memory leaks.
These objectives are achieved through
<li>avoidance of system calls, and of library calls that may trigger them (such as <span
style="font-family:monospace">malloc()</span>), as much as possible,</li>
<li>usage of (nearly) static memory buffers, and</li>
<li>avoidance of copying memory.</li>
Being a stream parser, libhpxml returns one XML element after
the other in a loop. It uses a
result buffer which is initialized once and then reused for every
element. Thus, repeated calls to <span style="font-family:monospace">malloc()</span> and
<span style="font-family:monospace">free()</span> are avoided. The input data is read
in blocks. The result buffer does not contain the data itself but just
pointers to the XML elements
within the input buffer. Thus, data is not copied, it is just pointed to.
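The idea of pointing into the input buffer instead of copying can be illustrated with a small, self-contained sketch. It is not libhpxml's implementation: the type <span style="font-family:monospace">demo_view_t</span> and the function <span style="font-family:monospace">demo_next_elem()</span> are hypothetical names chosen for this illustration, and the scanner is deliberately simplified (it does not handle a '&gt;' inside attribute values or comments).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type: it points into the input buffer and owns nothing. */
typedef struct demo_view
{
   const char *buf;  /* start of the element inside the input buffer */
   size_t len;       /* length of the element in bytes */
} demo_view_t;

/* Return the next element (a tag or the text between two tags) starting at
 * *pos. Nothing is copied; the view only points into buf. Returns 1 if an
 * element was found, 0 at the end of the buffer or if a tag is unterminated.
 * Simplification: a '>' inside an attribute value or comment is not handled. */
int demo_next_elem(const char *buf, size_t size, size_t *pos, demo_view_t *out)
{
   assert(buf != NULL && pos != NULL && out != NULL);
   if (*pos >= size)
      return 0;

   const char *start = buf + *pos;
   size_t rest = size - *pos;
   const char *end;

   if (*start == '<')
   {
      /* a tag extends up to and including the next '>' */
      end = memchr(start, '>', rest);
      if (end == NULL)
         return 0;
      end++;
   }
   else
   {
      /* literal text extends up to the next '<' or the end of the buffer */
      end = memchr(start, '<', rest);
      if (end == NULL)
         end = start + rest;
   }

   out->buf = start;
   out->len = (size_t) (end - start);
   *pos += out->len;
   return 1;
}
```

Calling <span style="font-family:monospace">demo_next_elem()</span> in a loop until it returns 0 yields one element per call, and every returned pointer lies inside the original buffer. This is also why such pointers become invalid as soon as the buffer is refilled.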
libhpxml provides a set of functions and structures. <span
style="font-family:monospace">hpx_ctrl_t</span> is a control structure which
contains all relevant information for an XML stream. The contents of
the structure are used internally by the library and should not be
modified in any way. The structure is initialized with a call to
<span style="font-family:monospace">hpx_init()</span> and must be freed again
with <span style="font-family:monospace">hpx_free()</span>.
Note that <span style="font-family:monospace">hpx_free()</span> will not close the file descriptor.
hpx_ctrl_t *hpx_init(int fd, int len);
void hpx_free(hpx_ctrl_t *ctl);
The arguments to <span style="font-family:monospace">hpx_init()</span>
are a file descriptor of an open XML file and the length
of the block buffer. It initializes a <span
style="font-family:monospace">hpx_ctrl_t</span> structure and returns a pointer
to it. In case of error <span style="font-family:monospace">NULL</span>
is returned and <span style="font-family:monospace">errno</span> is set appropriately.
The buffer size must be at least as large as the longest XML element
in the file but it is recommended to be much larger. The larger the
buffer, the fewer
reads are needed. If there is enough system memory available, it is
safe to choose 100 MB or even more.
<h4 id="mmap">Memory Mapping</h4>
libhpxml now supports memory mapping through the system call
<span style="font-family:monospace">mmap()</span>. This is activated if
<span style="font-family:monospace">hpx_init()</span> is called with a negative
<span style="font-face:underline;">len</span> parameter. In case of memory mapping, len must
be the negated total length of the XML input file.
Memory mapping of files greater than 2 GB is currently only supported on 64-bit architectures
(see manpage <span style="font-family:monospace">mmap(2)</span> or POSIX manpage
<span style="font-family:monospace">mmap(3)</span>, respectively).
Memory mapping is mainly useful if libhpxml is not just used as a stream parser but
XML objects are kept in memory during the whole runtime. This is necessary if on-the-fly object
processing is not possible, which is typically the case if XML objects are nested or depend
on each other. An example is the rendering process of OSM data.
Keeping pointers valid (see <a href="#hpx_get_elem"><span style="font-family:monospace">hpx_get_elem()</span></a>)
is still possible without memory mapping, but it requires that the buffer
is as large as the file itself because the whole file has to be pulled in at once. Thus, this
works only if the system has enough memory. Memory mapping, in contrast, does not require the whole file to reside in physical memory at once,
hence even a file of several hundred GB may be used.
Note that the preprocessor macro <span style="font-family:monospace">WITH_MMAP</span>
must be defined at compile time to compile libhpxml with <span style="font-family:monospace">mmap()</span>
support. If it was not compiled with <span style="font-family:monospace;">WITH_MMAP</span>, a call to <span style="font-family:monospace">hpx_init()</span> that requests memory mapping
will fail, in which case <span style="font-family:monospace;">NULL</span> is returned
and <span style="font-family:monospace;">errno</span> is set to
<span style="font-family:monospace;">EINVAL</span>.
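For readers unfamiliar with memory mapping, the following sketch shows the general POSIX mechanism on its own. It is not taken from libhpxml: the function name <span style="font-family:monospace">demo_count_tags_mmap()</span> is hypothetical, and counting '&lt;' characters merely stands in for real parsing. The point is that the file contents are accessed like an ordinary array while the kernel pages data in on demand.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and count the '<' characters in it.
 * Returns the count, or -1 on error. */
long demo_count_tags_mmap(const char *path)
{
   assert(path != NULL);

   int fd = open(path, O_RDONLY);
   if (fd == -1)
      return -1;

   struct stat st;
   if (fstat(fd, &st) == -1)
   {
      close(fd);
      return -1;
   }
   if (st.st_size == 0)
   {
      /* mmap() rejects a length of 0, an empty file has no tags */
      close(fd);
      return 0;
   }

   char *data = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
   if (data == MAP_FAILED)
   {
      close(fd);
      return -1;
   }
   /* the mapping stays valid after the descriptor is closed */
   close(fd);

   long n = 0;
   for (off_t i = 0; i < st.st_size; i++)
      if (data[i] == '<')
         n++;

   munmap(data, (size_t) st.st_size);
   return n;
}
```

Because the mapping covers the whole file, pointers into it remain valid until <span style="font-family:monospace">munmap()</span> is called, which is the property that makes memory mapping attractive when parsed objects have to be kept around.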
<h3>Supporting Functions</h3>
While parsing an XML file libhpxml returns pointers to the elements
within the input buffer. C strings are usually '\0'-terminated but this is not applicable here
because it would
require that '\0' characters are inserted after each element, resulting
in huge data
movement. Thus, libhpxml uses "B strings" which are held in the <span
style="font-family:monospace">bstring_t</span> structure. The structure
contains a pointer to the string and its length. Additionally, a
set of functions is provided to handle those strings.
typedef struct bstring
{
   int len;
   char *buf;
} bstring_t;
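Working with length-delimited strings mostly means passing the length explicitly to standard functions. The sketch below shows two such helpers. Their names (<span style="font-family:monospace">demo_bs_equals()</span>, <span style="font-family:monospace">demo_bs_print()</span>) are hypothetical and are not the helper functions shipped with libhpxml; only the structure layout is taken from the definition above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same layout as the bstring_t shown above. */
typedef struct bstring
{
   int len;
   char *buf;
} bstring_t;

/* Hypothetical helper: compare a bstring with a '\0'-terminated C string.
 * Returns 1 if both have the same length and content, 0 otherwise. */
int demo_bs_equals(bstring_t b, const char *s)
{
   assert(s != NULL);
   size_t n = strlen(s);
   if (b.len < 0 || (size_t) b.len != n)
      return 0;
   return n == 0 || memcmp(b.buf, s, n) == 0;
}

/* Hypothetical helper: print a bstring. The precision in "%.*s" limits the
 * output to len bytes, so no terminating '\0' is needed. */
int demo_bs_print(FILE *f, bstring_t b)
{
   return fprintf(f, "%.*s", b.len, b.buf);
}
```

Neither helper writes to the underlying buffer, so a bstring can refer to a tag name in the middle of the input data without disturbing the characters that follow it.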
<h3>Processing Elements</h3>
After initializing the control structure, XML elements are
retrieved by repeated calls to <a id="hpx_get_elem"><span
style="font-family:monospace">hpx_get_elem()</span></a>.
int hpx_get_elem(hpx_ctrl_t *ctl, bstring_t *b, int *in_tag, size_t *lno);
The function processes the buffer and fills out the bstring so that it points to
the next XML element. <span style="font-family:monospace">ctl</span> is
the pointer to the control structure. <span
style="font-family:monospace">in_tag</span>
is filled with either 0 or 1, depending on whether the XML element is a tag
(&lt;...&gt;) or literal text between tags. <span
style="font-family:monospace">lno</span> is filled
with the line number at which this element starts. Both <span
style="font-family:monospace">in_tag</span> and <span
style="font-family:monospace">lno</span> may be <span
style="font-family:monospace">NULL</span> if
they are not needed. <span
style="font-family:monospace">hpx_get_elem()</span> returns the length of the
bstring, 0 on EOF, and -1 in case of error.
Such an element can now be parsed with a call to <span
style="font-family:monospace">hpx_process_elem()</span>.
int hpx_process_elem(bstring_t b, hpx_tag_t *p);
typedef struct hpx_tag
{
   bstring_t tag;       // name of tag
   int type;            // type of tag
   int nattr;           // number of attributes
   int mattr;           // size of attr array
   hpx_attr_t attr[];   // array containing attributes
} hpx_tag_t;
typedef struct hpx_attr
{
   bstring_t name;      //! name of attribute
   bstring_t value;     //! value of attribute
   char delim;          //! delimiter character of attribute value
} hpx_attr_t;
<span style="font-family:monospace">hpx_process_elem()</span> takes a bstring which contains an XML element and parses it into
the <span style="font-family:monospace">hpx_tag_t</span> structure.
This structure may be initialized using
<span style="font-family:monospace">hpx_tm_create()</span> but it
may also be initialized manually. In the latter
case the structure member <span
style="font-family:monospace">mattr</span> must contain the size of the
attribute array. Otherwise the program may segfault. The argument to <span
style="font-family:monospace">hpx_tm_create()</span> is the
maximum number of expected attributes. The tag structure should be
freed again with <span
style="font-family:monospace">hpx_tm_free()</span> after use. It is recommended
to reuse the tag structure. This avoids unnecessary memory management
calls.
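To make the manual initialization more concrete, here is a sketch of how a structure ending in a flexible array member can be allocated so that <span style="font-family:monospace">mattr</span> matches the real array size. The structure definitions are repeated from this page to keep the example self-contained; <span style="font-family:monospace">demo_tag_alloc()</span> is a hypothetical name, and whether <span style="font-family:monospace">hpx_tm_create()</span> works the same way internally is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Definitions repeated from this page. */
typedef struct bstring
{
   int len;
   char *buf;
} bstring_t;

typedef struct hpx_attr
{
   bstring_t name;
   bstring_t value;
   char delim;
} hpx_attr_t;

typedef struct hpx_tag
{
   bstring_t tag;
   int type;
   int nattr;
   int mattr;
   hpx_attr_t attr[];
} hpx_tag_t;

/* Hypothetical helper: allocate a zeroed tag structure with room for n
 * attributes and record that capacity in mattr. Returns NULL on failure. */
hpx_tag_t *demo_tag_alloc(int n)
{
   assert(n >= 0);
   /* one block holds the fixed part plus n array entries */
   hpx_tag_t *t = calloc(1, sizeof(*t) + (size_t) n * sizeof(t->attr[0]));
   if (t == NULL)
      return NULL;
   t->mattr = n;
   return t;
}
```

A structure obtained this way is a single heap block and is released with plain <span style="font-family:monospace">free()</span>. Allocating it once before the parsing loop and reusing it for every element keeps allocation out of the hot path.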
Please note that a call to <span
style="font-family:monospace">hpx_get_elem()</span> may invalidate the
pointers within previously filled-out tag structures because it might read in the
next block of the input file. Thus, the tag must be processed
<span style="font-weight:bold;">before</span> the next call to <span
style="font-family:monospace">hpx_get_elem()</span>.
The <span style="font-family:monospace">type</span> member of <span
style="font-family:monospace">hpx_tag_t</span>
defines the type of this XML element.
Currently, the following types are known.
<table>
<tr><td style="font-family:monospace">HPX_ILL</td><td>Element
unknown. This may indicate a syntax error.</td></tr>
<tr><td style="font-family:monospace">HPX_OPEN</td><td>An XML
opening tag.</td><td style="font-family:monospace">&lt;tagname ...&gt;</td></tr>
<tr><td style="font-family:monospace">HPX_SINGLE</td><td>A
single, closed XML tag.</td><td style="font-family:monospace">&lt;tagname .../&gt;</td></tr>
<tr><td style="font-family:monospace">HPX_CLOSE</td><td>An XML
closing tag.</td><td style="font-family:monospace">&lt;/tagname&gt;</td></tr>
<tr><td style="font-family:monospace">HPX_LITERAL</td><td>No tag,
just text between tags.</td><td style="font-family:monospace"></td></tr>
<tr><td style="font-family:monospace"></td><td>A markup declaration.</td><td
style="font-family:monospace">&lt;! ..... &gt;</td></tr>
<tr><td style="font-family:monospace"></td><td>A processing instruction.</td><td
style="font-family:monospace">&lt;? .... ?&gt;</td></tr>
<tr><td style="font-family:monospace"></td><td>A comment.</td><td
style="font-family:monospace">&lt;!-- .... --&gt;</td></tr>
</table>
hpx_tag_t *hpx_tm_create(int n);
void hpx_tm_free(hpx_tag_t *t);
The tag structure further contains an array of attributes. The member <span
style="font-family:monospace">nattr</span>
contains the actual number of attributes parsed. It is always at most <span
style="font-family:monospace">mattr</span>
elements. If an XML tag has more than <span
style="font-family:monospace">mattr</span> attributes, the surplus ones are ignored. In
the current version there is no feedback to the calling function. This will be
improved in future releases.
The attributes themselves are each stored in an <span
style="font-family:monospace">hpx_attr_t</span> structure. It contains two
bstrings, one for the name and one for the value of the attribute. The third
member <span style="font-family:monospace">delim</span> keeps the delimiter of
the value which is either '\'' (single quote, 0x27) or '"' (double quote,
0x22).
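The behaviour described above (a fixed-size attribute array, surplus attributes dropped silently, the delimiter remembered per value) can be sketched as follows. The structure definitions are repeated from this page; <span style="font-family:monospace">demo_parse_attrs()</span> is a hypothetical function written for this illustration and is not the parser used inside libhpxml. As a simplification it does not accept whitespace around '='.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Definitions repeated from this page. */
typedef struct bstring
{
   int len;
   char *buf;
} bstring_t;

typedef struct hpx_attr
{
   bstring_t name;
   bstring_t value;
   char delim;
} hpx_attr_t;

/* Parse the attributes of a tag such as <node id="1" name='x'> into attr[],
 * which has room for mattr entries. Attributes beyond mattr are skipped.
 * Returns the number of attributes stored. Nothing is copied. */
int demo_parse_attrs(char *s, int len, hpx_attr_t *attr, int mattr)
{
   assert(s != NULL && attr != NULL && mattr >= 0);
   int i = 0, n = 0;

   /* skip '<' and the tag name */
   if (i < len && s[i] == '<')
      i++;
   while (i < len && !isspace((unsigned char) s[i]) && s[i] != '>' && s[i] != '/')
      i++;

   for (;;)
   {
      while (i < len && isspace((unsigned char) s[i]))
         i++;
      if (i >= len || s[i] == '>' || s[i] == '/')
         break;

      /* the attribute name runs up to '=' */
      int name = i;
      while (i < len && s[i] != '=' && s[i] != '>')
         i++;
      if (i >= len || s[i] != '=')
         break;
      int name_len = i - name;
      i++;

      /* the value must be enclosed in ' or " */
      if (i >= len || (s[i] != '\'' && s[i] != '"'))
         break;
      char delim = s[i++];
      int value = i;
      while (i < len && s[i] != delim)
         i++;
      if (i >= len)
         break;

      if (n < mattr)
      {
         attr[n].name.buf = s + name;
         attr[n].name.len = name_len;
         attr[n].value.buf = s + value;
         attr[n].value.len = i - value;
         attr[n].delim = delim;
         n++;
      }
      i++;   /* skip the closing delimiter */
   }
   return n;
}
```

A caller that needs to detect dropped attributes could count all parsed attributes separately from the stored ones. The sketch mirrors the documented behaviour instead and returns only the stored count.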
This example parses an XML file and outputs some statistics about each XML
element.
You can download the example <a href="example.c">directly here</a>.
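<?php
$gs = "geshi/geshi.php";
$s = file_get_contents("example.c");
if (file_exists($gs))
{
   // assumption: the GeSHi class is loaded from the file checked above
   include_once($gs);
   $h = new GeSHi($s, "C");
   echo $h->parse_code();
}
else
   echo "<pre>" . htmlspecialchars($s) . "</pre>";
?>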
<h2>Bugs and Caveats</h2>
libhpxml does not validate the XML file against, for example, a DTD. Thus, it does not
care about semantic errors. Syntactical ones are of course reported.
In the current version, libhpxml is not thread-safe. The interface to the
functions may change because it is in early development. The array of
attributes within
the <span style="font-family:monospace">hpx_tag_t</span> structure has a
static size and is not resized if an XML tag has more attributes
than array entries are available. Currently, <span
style="font-family:monospace">hpx_process_elem()</span>
does not report if the number of attributes would exceed the array (of
course, it does not overflow it).
libhpxml is developed and maintained by <a
href="">Bernhard R. Fischer</a>, 2048R/5C5FFD47.
Latest update 2012/01/09.
libhpxml is released under GNU GPLv3.