[Feature] Allow SAX Parser to be run in recovery mode #776

Closed
rampion opened this Issue Oct 16, 2012 · 1 comment

rampion commented Oct 16, 2012

The underlying libxml2 C library that Nokogiri uses allows SAX parsers to run in recovery mode. When the recovery field of xmlParserCtxt is non-zero, parsing continues past errors in the XML.

Currently there is no way in Nokogiri to enable recovery mode, except through use of a C extension.

For example, &#8; is not a legal XML character reference, so it makes the document not well-formed. Note how it affects the parsing of the XML document in this example:

#!/usr/bin/env ruby

require 'rubygems'
require 'nokogiri'
require 'inline'

EXAMPLE_DOCUMENT = <<-XML
  <?xml version="1.0" encoding="UTF-8"?>
  <foo color="red">
    <!-- this is legal -->
    <legal>I like cheese</legal>
    <!-- this is not -->
    <illegal>I &#8; cheese</illegal>
    <!-- this is legal -->
    <legal>I like crackers</legal>
  </foo>
XML

include Nokogiri::XML::SAX

# hack in a recovery option
class ParserContext
  inline do |builder|
    builder.add_compile_flags("-I/usr/include/libxml2") # nonportable
    builder.include "<libxml/parser.h>"
    builder.struct_name = 'xmlParserCtxt'
    builder.accessor :recovery, 'int'
  end
end

# SAX parser callbacks
class Inspector < Document
  [ :start_document,
    :start_element,
    :characters,
    :end_element,
    :end_document,
  ].each do |method|
    define_method(method) do |*args|
      p [method, *args]
    end
  end
end

puts "Without recovery mode"
Parser.new(Inspector.new).parse(EXAMPLE_DOCUMENT)

puts "With recovery mode"
Parser.new(Inspector.new).parse(EXAMPLE_DOCUMENT) { |ctxt| ctxt.recovery = 1 } 

This produces:

Without recovery mode
[:start_document]
[:characters, "\n    "]
[:characters, "\n    "]
[:characters, "I like cheese"]
[:characters, "\n    "]
[:characters, "\n    "]
[:characters, "I "]
[:characters, " cheese"]
[:characters, "\n    "]
[:characters, "\n    "]
[:characters, "I like crackers"]
[:characters, "\n  "]
[:end_document]
With recovery mode
[:start_document]
[:start_element, "foo", [["color", "red"]]]
[:characters, "\n    "]
[:characters, "\n    "]
[:start_element, "legal", []]
[:characters, "I like cheese"]
[:end_element, "legal"]
[:characters, "\n    "]
[:characters, "\n    "]
[:start_element, "illegal", []]
[:characters, "I "]
[:characters, " cheese"]
[:end_element, "illegal"]
[:characters, "\n    "]
[:characters, "\n    "]
[:start_element, "legal", []]
[:characters, "I like crackers"]
[:end_element, "legal"]
[:characters, "\n  "]
[:end_element, "foo"]
[:end_document]

jvshahid commented Nov 20, 2013

Duplicate of #453; this was merged into master in 4b7a7fb.

jvshahid closed this Nov 20, 2013
