Correcting bad HTML markup #2056

Closed
activklaus opened this issue Jul 30, 2020 · 1 comment

Comments

@activklaus

I have to handle an invalid HTML document:

<html>
    <body>
      <p>1</p> 
      <p>
        <div>2</div>
      </p> 
      <p>3</p>
  </body>
</html>

First I get the XPath with a JavaScript function like the one Firebug used (https://github.com/firebug/firebug/blob/master/extension/content/firebug/lib/xpath.js -> Xpath.getElementTreeXPath).

Using this script I get something like /p[4] when I select the last p. This is because the browser "fixes" the invalid p (i.e. the second one, with the div inside) by adding a closing p right before the div and a second opening p right after the closing div, just before the closing p. This results in an extra p, making the last p the fourth one, although there are only three p elements in the original document.

The referenced JavaScript function uses previousSibling, so in the browser it walks through all the p elements (including the extra p).
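To make the index arithmetic concrete, here is a rough Ruby sketch of the sibling counting. It is not the Firebug code: SketchNode, sketch_xpath and last_body_child are hypothetical names made up for illustration, and the tree is built by hand rather than parsed.

```ruby
# Hypothetical minimal DOM node, just enough to count siblings.
class SketchNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end
end

# Build an absolute XPath by locating the node among its same-named siblings.
# An index is emitted only when more than one sibling shares the name.
def sketch_xpath(node)
  steps = []
  while node
    siblings = node.parent ? node.parent.children : [node]
    same_name = siblings.select { |s| s.name == node.name }
    position = same_name.index { |s| s.equal?(node) } + 1
    steps.unshift(same_name.size > 1 ? "#{node.name}[#{position}]" : node.name)
    node = node.parent
  end
  "/" + steps.join("/")
end

# Build html > body with the given child element names and return the last child.
def last_body_child(child_names)
  html = SketchNode.new("html")
  body = SketchNode.new("body", html)
  child_names.map { |name| SketchNode.new(name, body) }.last
end

# Body children as the browser repairs them: p, p, div, p (the extra one), p.
BROWSER_PATH = sketch_xpath(last_body_child(%w[p p div p p]))

# Body children as written in the source (the div nested in the second p
# does not affect the count, so it is left out here).
SOURCE_PATH = sketch_xpath(last_body_child(%w[p p p]))

puts BROWSER_PATH
puts SOURCE_PATH
```

The same element ends up with a different index depending on which tree the siblings are counted in, which is why a path computed in the browser does not have to match a tree built by another parser.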

Once I have the XPath (with /p[4]), I try to get its content by invoking at_xpath, as in the following test:

require 'nokogiri'
require 'minitest/autorun'

class Test < Minitest::Spec
  describe "Node#at_xpath" do
    it "should add an extra p after div" do
      html = <<~HTML
        <html>
          <body>
            <p>1</p> 
            <p>
              <div>2</div>
            </p> 
            <p>3</p>
          </body>
        </html> 
      HTML
      
      doc = Nokogiri::HTML::Document.parse(html)
      
      assert_equal '', doc.at_xpath("/html/body/p[3]").text       # This p is added by the browser and should therefore be empty
      assert_equal '3', doc.at_xpath("/html/body/p[4]").text
    end
  end
end

The problem is that Nokogiri simply drops the closing p after the div instead of fixing it like the browser does (I checked Chrome, Firefox and IE, which all act the same way and add an extra p).

While Nokogiri fixes many other mistakes in the markup (https://nokogiri.org/tutorials/ensuring_well_formed_markup.html), I don't understand why Nokogiri acts differently from the mainstream browsers in this case, and there is no parse option to change this behavior. Is this a bug or a "feature"?

@flavorjones
Member

Hi, thanks for asking this question, and sorry you're having problems.

HTML and XML parsers, generally speaking, will parse well-formed markup identically, because there's a spec. There is no spec and no formal W3C guidance on how to correct or "fix up" malformed markup, and so every parser seems to do it differently. You're likely seeing differences between your browser's parser and libxml2 (which is what Nokogiri uses). There's nothing we can easily do to change this behavior, unfortunately.
