Collapsing adjacent text nodes is too aggressive #1578
Comments
Thanks for reporting this. I'll take a look as soon as I can.
The code example provided here is complex, and I don't have a ton of time at the moment to dig in and make sure I understand it. Can you help me understand what document structure you are trying to generate?
Text node merging is beyond the control of Nokogiri, and is done by the underlying XML library. There are places in Nokogiri where we merge nodes during certain operations, but we do it defensively, because if we let libxml2 do it we are left with dangling pointers and the ensuing memory corruption.
I can go into more detail if you like, but understanding what you're trying to do will perhaps focus the conversation on your immediate problem.
I'm trying to convert a sequence of <h\d> headings into HTML5 by adding <section> tags.
<section>
  <h1>heading 1</h1>
  <section>
    <h2>heading 2</h2>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
  </section>
  <section>
    <h2>heading 2a</h2>
  </section>
</section>
The original looks exactly the same if you remove the <section> tags.
The code converts the <p> items to text nodes (the unwrap routine in the code). These end up as distinct text nodes (I would be OK if they got merged but they didn't).
For each <h1> element it then:
1. Inserts <section> in front of the <h1> element.
2. It then loops through nodes beginning with the <h1> element following the new <section> until it hits the next <h1> element or the end of the enclosing element (nil returned from next_sibling). For each node, it unlinks it and uses add_child to move it into the new section.
3. Repeat steps 1 and 2 for <h2>, <h3>, ... This makes sure that the hierarchy is maintained. When doing <h2>, the loop stops at the end of the enclosing <h1>. When doing <h3>, the loop stops at the end of the enclosing <h2>, etc.
When it moves the text node containing "paragraph 1", that node gets moved correctly. When it moves the text node containing "paragraph 2" that gets merged with the "paragraph 1" node (perfectly legal), but something screws up and also merges "paragraph 3" or sets the loop pointer used in step 2 to nil. The resulting nil is too early and causes loop 2 to exit early. Sometimes "paragraph 3" gets duplicated in the output (one copy inside the <section> and another one after the <section>).
The path == 3 workaround does a manual "merge" of the text nodes in loop 2 by appending adjacent text nodes to the first text node of the sequence.
Steveg
In step 3 I misspoke. It should say that the loop stops at the enclosing <section>, <body> or whatever. No <h\d> elements are changed.
Steveg
Hi @sglaser, Sorry for the slow response. I think I understand now what you're trying to do; but I'm having trouble following the code, and so I'll respond to your description above. My advice would be to avoid operating on the text nodes, and instead operate on the You may also want to consider building a second document based on the content of the first (perhaps even leveraging the SAX parser on the first). I hope I haven't misunderstood the problem again; if so, happy to try once more. Otherwise, I'll close this out soon.
Ruby code demonstrating the problem: bug_nokogiri.rb
Output from ruby code demonstrating the problem: bug_nokogiri.txt
What problems are you experiencing?
The example runs the offending code 3 times.
path == 1: this is my initial version, converted from a JavaScript version that's been running for a long time.
path == 2: this is my first workaround. It works for two consecutive text nodes but fails for more.
path == 3: this is my final workaround.
This code is trying to convert traditional HTML to use HTML5 <section> tags.
It does this by locating each <h1>, inserting an empty <section> before it, and then moving everything up to the next <h1> into that section. It then repeats this for the other heading levels.
Earlier code (unwrap) causes the DOM tree to have multiple text nodes that are adjacent to each other.
In path 1 and path 2 the result is corrupt. It appears that when add_child appends a text node to the <section> and the last child was already a text node, these are combined into a single text node (good). The problem is that the pointer to the next node is broken.
In other words
If:
then the sequence
sometimes causes n2 to become nil
What's the output from nokogiri -v?
$ nokogiri -v
Nokogiri (1.7.0.1)
Can you provide a self-contained script that reproduces what you're seeing?
See attached