[bug] libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character #2461

Closed
5 tasks done
flavorjones opened this issue Feb 21, 2022 · 5 comments · Fixed by #2462

Comments

@flavorjones flavorjones commented Feb 21, 2022

Summary

Nokogiri v1.13.2 shipped libxml2 2.9.13. That version of libxml2 changed how the HTML4 parser recovers when it sees a bare (ill-formed) < character, that is, one that is not part of a start tag. As the repro in the comments shows, the < and the text that follows it are dropped from the parsed text node.

I've opened an issue upstream at https://gitlab.gnome.org/GNOME/libxml2/-/issues/339

Immediate next steps

Less-urgent next steps

@flavorjones flavorjones added the state/needs-triage label Feb 21, 2022
@flavorjones flavorjones changed the title from "[bug] Nokogiri v1.13.2 is buggy with respect to sanitization and entities" to "[bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks some sanitization and entity behavior" Feb 21, 2022

@flavorjones flavorjones commented Feb 21, 2022

OK, I have a repro which seems common across the rails-html-sanitizer failures as well as my day job CI failures:

      it "handles < character" do
        input = %{<div> this < that </div>}
        expected = %{<div> this &lt; that </div>}
        actual = Loofah.scrub_fragment(input, :escape)
        assert_equal(expected, actual.to_html)
      end

With Nokogiri v1.13.1, this passes. With Nokogiri v1.13.2, it fails:

Expected: "<div> this &lt; that </div>"
  Actual: "<div> this </div>"

@flavorjones flavorjones added upstream/libxml2 and removed state/needs-triage labels Feb 21, 2022

@flavorjones flavorjones commented Feb 21, 2022

Without Loofah, here's the core problem:

# nokogiri 1.13.1
$ ruby -rnokogiri -e 'pp Nokogiri::HTML4::Document.parse("<div> this < that </div>")'
#(Document:0x3c {
  name = "document",
  children = [
    #(DTD:0x50 { name = "html" }),
    #(Element:0x64 {
      name = "html",
      children = [
        #(Element:0x78 {
          name = "body",
          children = [
            #(Element:0x8c {
              name = "div",
              children = [ #(Text " this < that ")]
              })]
          })]
      })]
  })

# nokogiri 1.13.2
$ ruby -rnokogiri -e 'pp Nokogiri::HTML4::Document.parse("<div> this < that </div>")'
#(Document:0x3c {
  name = "document",
  children = [
    #(DTD:0x50 { name = "html" }),
    #(Element:0x64 {
      name = "html",
      children = [
        #(Element:0x78 {
          name = "body",
          children = [
            #(Element:0x8c { name = "div", children = [ #(Text " this ")] })]
          })]
      })]
  })

@flavorjones flavorjones commented Feb 21, 2022

I've opened an issue upstream: https://gitlab.gnome.org/GNOME/libxml2/-/issues/339

I'm going to explore reverting the related commits in a patch to see if I can get a fast-follow release of Nokogiri for y'all.

@flavorjones flavorjones commented Feb 21, 2022

I've updated this issue's description with a punch list of next steps.

@flavorjones flavorjones changed the title from "[bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks some sanitization and entity behavior" to "[bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character" Feb 21, 2022
@flavorjones flavorjones changed the title from "[bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character" to "[bug] libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character" Feb 21, 2022
flavorjones added a commit to flavorjones/loofah that referenced this issue Feb 21, 2022
@flavorjones flavorjones reopened this Feb 22, 2022

@flavorjones flavorjones commented Feb 22, 2022

v1.13.3 has been released to address this: https://github.com/sparklemotion/nokogiri/releases/tag/v1.13.3
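For anyone who wants a guard in their own test suite or boot code, here is a rough sketch that flags the affected release. It relies only on what this thread states (v1.13.2 shipped the change, v1.13.3 addresses it) and assumes the packaged libxml2 is in use; a build against a system libxml2 2.9.13 is not covered. `affected_nokogiri?` and `AFFECTED_NOKOGIRI` are hypothetical names, not part of Nokogiri.

```ruby
require "rubygems"

# Assumption from this thread only: v1.13.2 is the release that shipped
# the affected libxml2, and v1.13.3 was released to address it.
AFFECTED_NOKOGIRI = Gem::Version.new("1.13.2")

# Hypothetical helper: true when the given Nokogiri version string is the
# release this issue identifies as affected.
def affected_nokogiri?(version_string)
  Gem::Version.new(version_string) == AFFECTED_NOKOGIRI
end

puts affected_nokogiri?("1.13.1") # false
puts affected_nokogiri?("1.13.2") # true
puts affected_nokogiri?("1.13.3") # false
```

In an application, the argument would typically be `Nokogiri::VERSION`.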
