Collapsing adjacent text nodes is too aggressive #1578

sglaser · 2017-01-09T06:50:16Z

Ruby code demonstrating the problem: bug_nokogiri.rb

Output from ruby code demonstrating the problem: bug_nokogiri.txt

What problems are you experiencing?
The example runs the offending code 3 times.

When path == 1, this is my initial version, converted from a JavaScript version that's been running for a long time.
When path == 2, this is my first workaround. It works for two consecutive text nodes but fails for more.
When path == 3, this is my final workaround.

This code is trying to convert traditional HTML to use HTML5 <section> tags.

It does this by locating <h1>, inserting an empty <section> before it, and then moving everything up to the next <h1> into that section. It then repeats this for the other header levels.

Earlier code (unwrap) causes the DOM tree to have multiple text nodes that are adjacent to each other.

In path 1 and path 2 the result is corrupt. It appears that when add_child appends a text node to the <section> and the last child was already a text node, these are combined into a single text node (good). The problem is that the pointer to the next node is broken.

In other words

If:

sect.last_child.text? && sect.next_sibling == n1 && n1.text? && n2.text? && n2 == n1.next_sibling

then the sequence

n1.unlink
sect.add_child(n1)

sometimes causes n2 to become nil.

What's the output from nokogiri -v?
$ nokogiri -v

Nokogiri (1.7.0.1)

---
warnings: []
nokogiri: 1.7.0.1
ruby:
  version: 2.4.0
  platform: x86_64-darwin16
  description: ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/usr/local/lib/ruby/gems/2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-apple-darwin16.3.0/libxml2/2.9.4"
  libxslt_path: "/usr/local/lib/ruby/gems/2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-apple-darwin16.3.0/libxslt/1.1.29"
  libxml2_patches: []
  libxslt_patches: []
  compiled: 2.9.4
  loaded: 2.9.4

Can you provide a self-contained script that reproduces what you're seeing?
See attached


flavorjones · 2017-01-09T14:00:36Z

Thanks for reporting this. I'll take a look as soon as I can.

flavorjones · 2017-01-13T15:20:01Z

The code example provided here is complex, and I don't have a ton of time at the moment to dig in and make sure I understand it. Can you help me understand what document structure you are trying to generate?

Text node merging is beyond the control of Nokogiri, and is done by the underlying XML library. There are places in Nokogiri where we merge nodes during certain operations, but we do it defensively, because if we let libxml2 do it we are left with dangling pointers and the ensuing memory corruption.

I can go into more detail if you like, but understanding what you're trying to do will perhaps focus the conversation on your immediate problem.

sglaser · 2017-01-13T16:39:34Z

I'm trying to convert a sequence of <h\d> headings into HTML5 by adding <section> tags.

<section>
  <h1>heading 1</h1>
  <section>
    <h2>heading 2</h2>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
  </section>
  <section>
    <h2>heading 2a</h2>
  </section>
</section>

The original looks exactly the same if you remove the <section> tags.

The code converts the <p> items to text nodes (the unwrap routine in the code). These end up as distinct text nodes (I would be OK if they got merged but they didn't).

For each <h1> element it then:

1. Inserts <section> in front of the <h1> element.
2. It then loops through nodes beginning with the <h1> element following the new <section> until it hits the next <h1> element or the end of the enclosing element (nil returned from next_sibling). For each node, it unlinks it and uses add_child to move it into the new section.
3. Repeat steps 1 and 2 for <h2>, <h3>, ...

This makes sure that the hierarchy is maintained. When doing <h2>, the loop stops at the end of the enclosing <h1>. When doing <h3>, the loop stops at the end of the enclosing <h2>, etc.

When it moves the text node containing "paragraph 1", that node gets moved correctly. When it moves the text node containing "paragraph 2", that gets merged with the "paragraph 1" node (perfectly legal), but something screws up and also merges "paragraph 3" or sets the loop pointer used in step 2 to nil. The resulting nil is too early and causes loop 2 to exit early. Sometimes "paragraph 3" gets duplicated in the output (one copy inside the <section> and another one after the <section>).

The path == 3 workaround does a manual "merge" of the text nodes in loop 2 by appending adjacent text nodes to the first text node of the sequence.

Steveg


sglaser · 2017-01-13T18:01:35Z

In step 3 I misspoke. It should say that the loop stops at the enclosing <section>, <body> or whatever. No <h\d> elements are changed.

Steveg


flavorjones · 2017-05-10T05:42:08Z

Hi @sglaser,

Sorry for the slow response. I think I understand now what you're trying to do; but I'm having trouble following the code, and so I'll respond to your description above.

My advice would be to avoid operating on the text nodes, and instead operate on the p nodes. This will avoid having text nodes get merged by libxml2 (or nokogiri) while the structure is still being changed.

You may also want to consider building a second document based on the content of the first (perhaps even leveraging the SAX parser on the first).

I hope I haven't misunderstood the problem again; if so, happy to try once more. Otherwise, I'll close this out soon.

flavorjones added the needs/more-info label Jan 13, 2017

flavorjones added state/will-close meta/user-help and removed needs/more-info labels May 10, 2017

flavorjones closed this as completed Jan 5, 2019

flavorjones removed the state/will-close label Feb 16, 2021
