Correcting bad HTML markup #2056

activklaus · 2020-07-30T23:28:06Z

I have to handle an invalid HTML document:

<html>
    <body>
      <p>1</p> 
      <p>
        <div>2</div>
      </p> 
      <p>3</p>
  </body>
</html>

First I get the XPath with a JavaScript function like the one Firebug used (https://github.com/firebug/firebug/blob/master/extension/content/firebug/lib/xpath.js -> Xpath.getElementTreeXPath).

With this script I get something like /p[4] when I select the last p. This is because the browser "fixes" the invalid p (i.e. the second one, with the div inside) by adding a closing p right before the div and a second opening p right after the closing div, just before the closing p. This results in an extra p, making the last p the fourth one, although there are only three p elements in the original document.

The referenced JavaScript function uses previousSibling, so in the browser it walks through all the p elements (including the extra one).
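To make the counting concrete, here is a rough Ruby port of the sibling-counting idea. It is only a sketch: it uses REXML (shipped with Ruby) so that both trees can be written as well-formed XML, and the name `element_tree_xpath` and the two sample strings are illustrative assumptions, not Firebug's or Nokogiri's code. The second string is the tree as the browser is described to repair it above.

```ruby
require 'rexml/document'

# Build an absolute XPath by counting preceding siblings with the same name,
# in the spirit of Firebug's Xpath.getElementTreeXPath (hypothetical port).
def element_tree_xpath(element)
  steps = []
  # REXML::Document is itself an Element subclass, so stop explicitly there.
  while element && !element.is_a?(REXML::Document)
    index = 0
    sibling = element.previous_element
    while sibling
      index += 1 if sibling.name == element.name
      sibling = sibling.previous_element
    end
    # Like the Firebug function, omit the index for the first element of its name.
    steps.unshift(index.zero? ? element.name : "#{element.name}[#{index + 1}]")
    element = element.parent
  end
  steps.empty? ? nil : "/" + steps.join("/")
end

# The markup as written: three p elements.
as_written = "<html><body><p>1</p><p><div>2</div></p><p>3</p></body></html>"
# The tree after the browser's repair as described above: four p elements.
as_repaired = "<html><body><p>1</p><p></p><div>2</div><p></p><p>3</p></body></html>"

written_path = element_tree_xpath(REXML::XPath.match(REXML::Document.new(as_written), "//p").last)
repaired_path = element_tree_xpath(REXML::XPath.match(REXML::Document.new(as_repaired), "//p").last)

puts written_path
puts repaired_path
```

The same element ends up with a different positional index depending on which tree the path is computed against, which is why a path taken from the browser does not have to resolve in a tree built by a different parser.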

After getting the XPath (with /p[4]), I try to fetch its content by invoking at_xpath, as in the following test:

require 'nokogiri'
require 'minitest/autorun'

class Test < Minitest::Spec
  describe "Node#at_xpath" do
    it "should add an extra p after div" do
      html = <<~HTML
        <html>
          <body>
            <p>1</p>
            <p>
              <div>2</div>
            </p>
            <p>3</p>
          </body>
        </html>
      HTML

      doc = Nokogiri::HTML::Document.parse(html)

      # This p is added by the browser and should therefore be empty
      assert_equal '', doc.at_xpath("/html/body/p[3]").text
      # The last p of the source is the fourth p in the browser's DOM
      assert_equal '3', doc.at_xpath("/html/body/p[4]").text
    end
  end
end

The problem is that Nokogiri simply drops the closing p after the div instead of repairing it the way the browsers do (I checked Chrome, Firefox and IE, which all behave the same way and add an extra p).

While Nokogiri fixes many other mistakes in the markup (https://nokogiri.org/tutorials/ensuring_well_formed_markup.html), I don't understand why it behaves differently from the mainstream browsers in this case, and there is no parse option to change this behavior. Is this a bug or a "feature"?


flavorjones · 2020-08-03T15:52:55Z

Hi, thanks for asking this question, and sorry you're having problems.

HTML and XML parsers, generally speaking, will parse well-formed markup identically, because there's a spec. There is no spec and no formal W3C guidance on how to correct or "fix up" malformed markup, and so every parser seems to do it differently. You're likely seeing differences between your browser's parser and libxml2 (which is what Nokogiri uses). There's nothing we can easily do to change this behavior, unfortunately.

flavorjones closed this as completed Aug 3, 2020

flavorjones mentioned this issue Nov 6, 2021

Note in the documentation that broken markup may be "fixed" differently, or docs may be serialized differently, by libxml than by your browser or by xerces sparklemotion/nokogiri.org#41

Open
