Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Very long lines (or maybe very long attributes?) are silently corrupted, on parsing and immediately saving file #2201

Closed
matkoniecz opened this issue Mar 4, 2021 · 9 comments

Comments

@matkoniecz
Copy link
Contributor

matkoniecz commented Mar 4, 2021

Please describe the bug

Very long lines (or maybe very long attributes?) are silently corrupted, on parsing and immediately saving file.

"/> at end of element with long attribute is turned into "/>

Help us reproduce what you're seeing

require 'nokogiri'

text = File.open('bug.svg').read
svg_merged = Nokogiri::XML(text, &:noblanks)
code = svg_merged.to_xml(indent:3, indent_text:" ")
code_not_pretty = svg_merged.to_xml()
puts text[-26..-10]
puts code[-26..-10]
puts code_not_pretty[-26..-10]

To run this code with data triggering it at least on my computer:

git clone https://github.com/matkoniecz/nokogiri_testcase.git
cd nokogiri_testcase
ruby bug.rb

Expected output that demonstrates bug (note "/> vs "/>):

2.75 355.79"/>
</
 355.79"/&gt;
</g
 355.79"/&gt;
</g

There is also sample file not triggering it. In general once line is less outrageously long, corruption is disappearing.

Expected behavior

No corruption or at least a crash rather than a silent corruption of data.

Environment

# Nokogiri (1.11.1)
    ---
    warnings: []
    nokogiri:
      version: 1.11.1
      cppflags:
      - "-I/home/mateusz/.gem/ruby/2.7.0/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri"
      - "-I/home/mateusz/.gem/ruby/2.7.0/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri/include"
      - "-I/home/mateusz/.gem/ruby/2.7.0/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri/include/libxml2"
    ruby:
      version: 2.7.0
      platform: x86_64-linux-gnu
      gem_platform: x86_64-linux
      description: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      - 0006-htmlParseComment-treat-as-if-it-closed-the-comment.patch
      - 0007-use-new-htmlParseLookupCommentEnd-to-find-comment-en.patch
      - '0008-use-glibc-strlen.patch'
      - '0009-avoid-isnan-isinf.patch'
      libxml2_path: "/home/mateusz/.gem/ruby/2.7.0/gems/nokogiri-1.11.1-x86_64-linux/ext/nokogiri"
      iconv_enabled: true
      compiled: 2.9.10
      loaded: 2.9.10
    libxslt:
      source: packaged
      precompiled: true
      patches: []
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11

Additional context

It is file from actual use, not some deliberately silly attempt.

I have taken ocean data from OpenStreetMap, converted to geojson and later generated SVG.

I am trying to use nokogiri to merge this SVG file with another parts of the map (as one of final steps toward burning this map on wood using laser cutter).

Lubuntu Image Viewer also refuses to work with this file, but at least clearly crashes

XML parse error: Error domain 1 code 73 on line 4 column 10000011 of data: Couldn't find end of Start Tag path line 4

Inkscape opens both files without complaints.

For my use case I would be fine with crash (simplifying data input was on my todo list anyway, 10MB in one line is quite weird), but silent data corruption is irritating. Though supporting such input would be probably nice.

@matkoniecz matkoniecz added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Mar 4, 2021
@matkoniecz
Copy link
Contributor Author

matkoniecz commented Mar 4, 2021

Sadly, it appears to be triggered by long attributes, so it is not feasible to further reduce it.

Please let me know if further detail, testing or a better test case is needed.

@flavorjones
Copy link
Member

Hi, @matkoniecz, thanks for opening up this issue, and I'm sorry that you're having trouble. I'll try to help.

I can't reproduce what you're seeing at the moment without a version of the "bug.svg" file. Can you share that?

One guess is that you may be running up against libxml2 size limitations. You should be able to verify that by looking at svg_merged.errors to see if it contains any helpful error messages.

You may also want to try setting the HUGE parse option, which instructs libxml2 to ignore the hard-coded size limits (note: there will be a performance penalty). You can see more info at https://nokogiri.org/tutorials/parsing_an_html_xml_document.html#parse-options or https://nokogiri.org/rdoc/Nokogiri/XML/ParseOptions

@flavorjones flavorjones added needs/more-info meta/user-help and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Mar 5, 2021
@matkoniecz
Copy link
Contributor Author

matkoniecz commented Mar 5, 2021

I can't reproduce what you're seeing at the moment without a version of the "bug.svg" file. Can you share that?

It is at https://github.com/matkoniecz/nokogiri_testcase/blob/master/bug.svg

Is it not fetched by git clone https://github.com/matkoniecz/nokogiri_testcase.git ?

svg_merged.errors is it a file that would be placed at the same folder where script is located? Then it is not created.

And yes, huge removes the problem. Is it expected that not enabling huge may result in a silent corruption of data? I would consider crash as preferable in such case.

BTW, any idea how can I modify svg_merged = Nokogiri::XML(text, &:noblanks) to enable both noblanks and huge?

@flavorjones
Copy link
Member

Is it not fetched by git clone https://github.com/matkoniecz/nokogiri_testcase.git ?

Ah, sorry, I was looking for a hyperlink in your original post and missed this. Thanks.

svg_merged.errors is it a file that would be placed at the same folder where script is located? Then it is not created.

svg_merged is the Nokogiri::XML::Document object in your example. I'm asking you to look at the #errors attribute of that object, which will be an array of Nokogiri::XML::SyntaxError objects if there were problems during parsing.

yes, huge removes the problem. Is it expected that not enabling huge may result in a silent corruption of data?

Well, again, if there's a problem parsing then it should show up in Document#errors and raise an exception. If it's not doing that, it's probably a bug.

any idea how can I modify svg_merged = Nokogiri::XML(text, &:noblanks)

This is covered pretty well in the tutorial I linked to, https://nokogiri.org/tutorials/parsing_an_html_xml_document.html#parse-options

Try something like:

svg_merged = Nokogiri::XML(text) { |config| config.noblanks.huge }

@matkoniecz
Copy link
Contributor Author

matkoniecz commented Mar 5, 2021

Maybe https://nokogiri.org/tutorials/parsing_an_html_xml_document.html should mention checking for errors if it is happening silently and may damage data? I am really surprised that it happened silently.

And yes, there is

4:9: FATAL: AttValue length too long
4:9: FATAL: attributes construct error
4:9: FATAL: Couldn't find end of Start Tag path line 4

Try something like:

Thanks for a help! My code will now have

    svg_merged = Nokogiri::XML(text) { |config| config.noblanks.huge }
    raise svg_merged.errors if svg_merged.errors != []

hopefully there is no need to check other places for silent failures.

@flavorjones
Copy link
Member

Maybe https://nokogiri.org/tutorials/parsing_an_html_xml_document.html should mention checking for errors if it is happening silently and may damage data? I am really surprised that it happened silently.

In the Parse Options section, you'll find this description:

RECOVER - Attempt to recover from errors. Recommended for parsing malformed or invalid documents. This is set by default!

If you want to change Nokogiri's behavior to raise an exception when warnings or errors are detected, set the norecover parse option:

    svg_merged = Nokogiri::XML(text) { |config| config.noblanks.huge.norecover }

I'm glad we were able to figure this out!

@flavorjones
Copy link
Member

Note I've created an issue to better document these parse options: sparklemotion/nokogiri.org#34

@matkoniecz
Copy link
Contributor Author

Note I've created an issue to better document these parse options: sparklemotion/nokogiri.org#34

Thanks! Is it possible to at least log some errors when one hits HUGE limits without huge option? Or is it not a valid thing to do? (changing behaviour to crashing on documents with super-long lines is likely result in worse confusion and damage overall)

RECOVER - Attempt to recover from errors. Recommended for parsing malformed or invalid documents. This is set by default!

I noticed it, but I have not thought that Nokogiri will corrupt XML on its own :/

One of first things that I tried was checking with another tool whatever xml input is corrupt or not.

@flavorjones
Copy link
Member

Is it possible to at least log some errors when one hits HUGE limits without huge option?

It would be challenging to figure out whether an error registered by libxml2 during parse time is due to the size limits. Because it's hard to determine the source of the error, we handle them all the same way -- by adding them to the #errors array, and by raising an exception if you've set norecover.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants