Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collapsing adjacent text nodes is too aggressive #1578

Closed
sglaser opened this issue Jan 9, 2017 · 5 comments
Closed

Collapsing adjacent text nodes is too aggressive #1578

sglaser opened this issue Jan 9, 2017 · 5 comments

Comments

@sglaser
Copy link

sglaser commented Jan 9, 2017

Ruby code demonstrating the problem: bug_nokogiri.rb

Output from ruby code demonstrating the problem: bug_nokogiri.txt

What problems are you experiencing?
The example runs the offending code 3 times.

  • When path == 1, this is my initial version, converted from a JavaScript version that's been running for a long time.
  • When path == 2, this is my first workaround. It works for two consecutive text nodes but fails for more.
  • When path == 3, this is my final workaround

This code is trying to convert traditional HTML to use HTML5 <section> tags.

It does this by locating <h1>, inserting an empty <section> before it and then moving everything up to the next <h1> into that section. (then repeat for other header levels.)

Earlier code (unwrap) causes the DOM tree to have multiple text nodes that are adjacent to each other.

In path 1 and path 2 the result is corrupt. It appears that when add_child appends a text node to the <section> and the last child was already a text node, these are combined into a single text node (good). The problem is that the pointer to the next node is broken.

In other words

If:

sect.last_child.text? && sect.next_sibling == n1 && n1.text? && n2.text? && n2 == n1.next_sibling

then the sequence

n1.unlink
sect.add_child(n1)

sometimes causes n2 to become nil

What's the output from nokogiri -v?
$ nokogiri -v

Nokogiri (1.7.0.1)

---
warnings: []
nokogiri: 1.7.0.1
ruby:
  version: 2.4.0
  platform: x86_64-darwin16
  description: ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-darwin16]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/usr/local/lib/ruby/gems/2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-apple-darwin16.3.0/libxml2/2.9.4"
  libxslt_path: "/usr/local/lib/ruby/gems/2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-apple-darwin16.3.0/libxslt/1.1.29"
  libxml2_patches: []
  libxslt_patches: []
  compiled: 2.9.4
  loaded: 2.9.4

Can you provide a self-contained script that reproduces what you're seeing?
See attached

@flavorjones
Copy link
Member

Thanks for reporting this. I'll take a look as soon as I can.

@flavorjones
Copy link
Member

The code example provided here is complex, and I don't have a ton of time at the moment to dig in and make sure I understand it. Can you help me understand what document structure you are trying to generate?

Text mode merging is beyond the control of Nokogiri, and is done by the underlying XML library. There are places in Nokogiri where we merge nodes during certain operations, but we do it defensively, because if we let libxml2 do it we are left with dangling pointers and the ensuing memory corruption.

I can go into more detail if you like, but understanding what you're trying to do will perhaps focus the conversation on your immediate problem.

@sglaser
Copy link
Author

sglaser commented Jan 13, 2017 via email

@sglaser
Copy link
Author

sglaser commented Jan 13, 2017 via email

@flavorjones
Copy link
Member

Hi @sglaser,

Sorry for the slow response. I think I understand now what you're trying to do; but I'm having trouble following the code, and so I'll respond to your description above.

My advice would be to avoid operating on the text nodes, and instead operate on the p nodes. This will avoid having text nodes get merged by libxml2 (or nokogiri) while the structure is still being changed.

You may also want to consider building a second document based on the content of the first (perhaps even leveraging the SAX parser on the first).

I hope I haven't misunderstood the problem again; if so, happy to try once more. Otherwise, I'll close this out soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants