Skip to content

salsify/salsify_libxml_ruby

 
 

Repository files navigation

Salsify LibXML Ruby

The libxml_ruby gem provides bindings for libxml library. The Salsify version of this gem adds static linking to a locally-compiled version of libxml2 whose version can be set in extconf.rb. This isolates it from changes to the underlying native library, which follows the model of a much better known Ruby library binding to libxml2 (Nokogiri) and is especially important in this case due to the use of bindings to undocumented APIs.

SECURITY

The maintainers will periodically provide security updates for libxml, but YOU SHOULD NOT USE this library in production environments.

INSTALLATION

Add a reference to “salsify_libxml_ruby” in your Gemfile. This assumes that bundler is configured to use gems.salisfy.com.

Original library

Original README offering more information about this library follows.

LibXML Ruby

Overview

The libxml gem provides Ruby language bindings for GNOME’s Libxml2 XML toolkit. It is free software, released under the MIT License.

We think libxml-ruby is the best XML library for Ruby because:

  • Speed - Its much faster than REXML and Hpricot

  • Features - It provides an amazing number of featues

  • Conformance - It passes all 1800+ tests from the OASIS XML Tests Suite

The gem also includes a Microsoft VC++ 2012 solution and XCode 5 project - these are very useful for debugging.

libxml-ruby’s source codes lives on GitHub.

Getting Started

Using libxml is easy. First decide what parser you want to use:

  • Generally you’ll want to use the LibXML::XML::Parser which provides a tree based API.

  • For larger documents that don’t fit into memory, or if you prefer an input based API, use the LibXML::XML::Reader.

  • To parse HTML files use LibXML::XML::HTMLParser.

  • If you are masochistic, then use the LibXML::XML::SaxParser, which provides a callback API.

Once you have chosen a parser, choose a datasource. Libxml can parse files, strings, URIs and IO streams. For each data source you can specify an LibXML::XML::Encoding, a base uri and various parser options. For more information, refer the LibXML::XML::Parser.document, LibXML::XML::Parser.file, LibXML::XML::Parser.io or LibXML:::XML::Parser.string methods (the same methods are defined on all four parser classes).

Advanced Functionality

Beyond the basics of parsing and processing XML and HTML documents, libxml provides a wealth of additional functionality.

Most commonly, you’ll want to use its LibXML::XML::XPath support, which makes it easy to find data inside an XML document. Although not as popular, LibXML::XML::XPointer provides another API for finding data inside an XML document.

Often times you’ll need to validate data before processing it. For example, if you accept user generated content submitted over the Web, you’ll want to verify that it does not contain malicious code such as embedded scripts. This can be done using libxml’s powerful set of validators:

  • DTDs (LibXML::XML::Dtd)

  • Relax Schemas (LibXML::XML::RelaxNG)

  • XML Schema (LibXML::XML::Schema)

Finally, if you’d like to use XSL Transformations to process data, then install the libxslt gem.

Usage

For information about using libxml-ruby please refer to its documentation. Some tutorials are also available.

All libxml classes are in the LibXML::XML module. The easiest way to use libxml is to require 'xml'. This will mixin the LibXML module into the global namespace, allowing you to write code like this:

require 'xml'
document = XML::Document.new

However, when creating an application or library you plan to redistribute, it is best to not add the LibXML module to the global namespace, in which case you can either write your code like this:

require 'libxml'
document = LibXML::XML::Document.new

Or you can utilize a namespace for your own work and include LibXML into it. For example:

require 'libxml'

module MyApplication
  include LibXML

  class MyClass
    def some_method
      document = XML::Document.new
    end
  end
end

For simplicity’s sake, the documentation uses the xml module in its examples.

Threading

libxml-ruby fully supports native, background Ruby threads. This of course only applies to Ruby 1.9.x and higher since earlier versions of Ruby do not support native threads.

Tests

To run tests you first need to build the shared libary:

rake compile

Once you have build the shared libary, you can then run tests using rake:

rake test

+Travis build status: <img src=“https://travis-ci.org/xml4r/libxml-ruby.svg?branch=master” alt=“Build Status” />

Performance

In addition to being feature rich and conformation, the main reason people use libxml-ruby is for performance. Here are the results of a couple simple benchmarks recently blogged about on the Web (you can find them in the benchmark directory of the libxml distribution).

From depixelate.com/2008/4/23/ruby-xml-parsing-benchmarks

              user     system      total        real
libxml    0.032000   0.000000   0.032000 (  0.031000)
Hpricot   0.640000   0.031000   0.671000 (  0.890000)
REXML     1.813000   0.047000   1.860000 (  2.031000)

From svn.concord.org/svn/projects/trunk/common/ruby/xml_benchmarks/

              user     system      total        real
libxml    0.641000   0.031000   0.672000 (  0.672000)
hpricot   5.359000   0.062000   5.421000 (  5.516000)
rexml    22.859000   0.047000  22.906000 ( 23.203000)

Documentation

Documentation is available via rdoc, and is installed automatically with the gem.

libxml-ruby’s online documentation is generated using Hanna, which is a development gem dependency.

Note that older versions of Rdoc, which ship with Ruby 1.8.x, will report a number of errors. To avoid them, install Rdoc 2.1 or higher. Once you have installed the gem, you’ll have to disable the version of Rdoc that Ruby 1.8.x includes. An easy way to do that is rename the directory ruby/lib/ruby/1.8/rdoc to ruby/lib/ruby/1.8/rdoc_old.

Support

If you have any questions about using libxml-ruby, please report an issue on GitHub.

Memory Management

libxml-ruby automatically manages memory associated with the underlying libxml2 library. The bindings create a one-to-one mapping between Ruby objects and libxml documents and libxml parent nodes (ie, nodes that do not have a parent and do not belong to a document). In these cases, the bindings manage the memory. They do this by installing a free function and storing a back pointer to the Ruby object from the xmlnode using the _private member on libxml structures. When the Ruby object goes out of scope, the underlying libxml structure is freed. Libxml itself then frees all child nodes (recursively).

For all other nodes (the vast majority), the bindings create temporary Ruby objects that get freed once they go out of scope. Thus there can be more than one Ruby object pointing to the same xml node. To mostly hide this from a programmer on the Ruby side, the #eql? and #== methods are overriden to check if two Ruby objects wrap the same xmlnode. If they do, then the methods return true. During the mark phase, each of these temporary objects marks its owning document, thereby keeping the Ruby document object alive and thus the xmldoc tree.

In the sweep phase of the garbage collector, or when a program ends, there is no order to how Ruby objects are freed. In fact, the Ruby document object is almost always freed before any Ruby objects that wrap child nodes. However, this is ok because those Ruby objects do not have a free function and are no longer in scope (since if they were the document would not be freed).

License

See LICENSE for license information.

Packages

No packages published

Languages

  • C 58.3%
  • Ruby 40.0%
  • HTML 1.6%
  • XSLT 0.1%