Document.to_xml is reformatting tags inconsistently #415

Closed
pragdave opened this Issue Feb 7, 2011 · 10 comments

pragdave commented Feb 7, 2011

The code at https://gist.github.com/815162 reads an XML document and then writes it back out. It produces different results with libxml Nokogiri and the pure-Java version. The first and last tags are split onto multiple lines by the Java version, but left intact by the libxml version.

libxml nokogiri

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book SYSTEM "unused.dtd">
<book code="_authorinfo" in-beta="yes">

<processedcode language="ruby" size="normal">
<codeline><cokw>class</cokw> Fred &lt; Preprocessor</codeline>
<codeline>  <cokw>def</cokw> map(line)</codeline>
<codeline>    line.gsub(%r{&lt;ror/&gt;}, <costring>"Ruby on Rails"</costring>)</codeline>
<codeline>  <cokw>end</cokw></codeline>
<codeline><cokw>end</cokw></codeline>
</processedcode>
</book>

Java nokogiri

<?xml version="1.0" encoding="utf-8"?>
<book code="_authorinfo" in-beta="yes">

<processedcode language="ruby" size="normal">
<codeline>
  <cokw>class</cokw> Fred &lt; Preprocessor
</codeline>

<codeline>  <cokw>def</cokw> map(line)</codeline>
<codeline>    line.gsub(%r{&lt;ror/&gt;}, <costring>"Ruby on Rails"</costring>)</codeline>
<codeline>  <cokw>end</cokw></codeline>
<codeline>
  <cokw>end</cokw>
</codeline>

</processedcode>

</book>

Is there any configuration I can use to turn off this behavior? It's currently preventing me from switching our toolchain to JRuby.

Dave

Owner

yokolet commented Feb 8, 2011

Hello! Thank you for testing pure Java Nokogiri.

Inconsistent spaces and newlines are very difficult to resolve in the pure Java version. Xerces needs a schema to handle those. However, the difference shown here looks like it comes from a bug. I'll look into what causes it.

pragdave commented Feb 8, 2011

Let me know if I can help.

Dave

Owner

flavorjones commented Feb 8, 2011

Hi Dave,

As @yokolet mentioned, it is extraordinarily hard to keep formatting behavior consistent between implementations.

So, I'm wondering, can you explain a bit more about why this sort of non-semantic change is a blocker for you?

Generally when people have pointed this out, it's because their test suites are asserting that the serialized document is identical to what they expect. Is this what you're up against?

A more semantic (and thus portable) way to do this sort of testing is to assert against the document structure. The gems lorax or nokogiri-diff may be able to help in that case.

Regardless, it would help us all if you'd give us some insight into what your particular blocker is here.

Thanks for using Nokogirl! (Aaron made me say that. ;))

pragdave commented Feb 8, 2011

This isn't a question of a failing test. The problem is that this generated XML has additional whitespace that, when formatted using FO, results in extra spaces in the printed book.

I'm representing a line of source code to be formatted in a book.

So, the source code

class Fred < Preprocessor

gets converted to

<codeline><cokw>class</cokw> Fred &lt; Preprocessor</codeline>

However, the pure Java version converts it to

<codeline>
   <cokw>class</cokw> Fred &lt; Preprocessor
</codeline>

Now, in a <codeline>, whitespace is significant. The libxml version correctly puts no whitespace before the class keyword, while the Java version inserts it. As a result, the code listings format incorrectly.

Even more confusingly, though, the Java version treats the codelines differently: the first and last are wrapped, while the rest are formatted in the same way that libxml formats them.

If we can stop the wrapping of the first and last, I think the problem would be solved.

Dave

pragdave commented Mar 9, 2011

I can fix this by overriding the default formatting:

@doc.to_xml(:save_with => 0)

Owner

flavorjones commented Mar 9, 2011

Dave - should this be closed? If this is as simple as turning off Node::SaveOptions::FORMAT by default on JRuby then maybe we should make that the default? I'm going to reopen.

pragdave commented Mar 9, 2011

I think it makes more sense to have it off by default.

Owner

flavorjones commented Mar 9, 2011

Mise en place refactor: fa671aa

Owner

flavorjones commented Mar 9, 2011

default output of XML on JRuby is no longer formatted due to inconsistent whitespace handling. Closed by 4337005

Owner

yokolet commented Mar 9, 2011

The change to the default of Node::SaveOptions::FORMAT makes sense. I've almost fixed this, but the format option was doing something wrong. That confused me.

I've already fixed the problem that a doctype decl was missing.

nathanl pushed a commit to nathanl/nokogiri that referenced this issue May 17, 2011

flavorjones: default output of XML on JRuby is no longer formatted due to inconsistent whitespace handling. Closes #415
4337005

nathanl pushed a commit to nathanl/nokogiri that referenced this issue May 17, 2011

yokolet: Fix for formatted output along with the default format option change, issue #415.
3dce889

This issue was closed.
