When using Nokogiri with Capybara on JRuby 1.6.5 (1.9 mode), page.html returns something that looks like:
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"...."
We can clearly see that <?xml version="1.0" encoding="UTF-8"?> has been inserted, which changes the resulting output.
The DOCTYPE is also HTML 4.01 Transitional, which is wrong as well, because the page was generated with the HTML5 DOCTYPE.
This doesn't happen on MRI 1.9.3.
Any clues how to fix/workaround it?
(Originally opened against Capybara, but I was pointed to Nokogiri.)
For some reason, it seems the XML declaration switch has been turned on.
Can you share a reproducible snippet?
I tried to put together a simple repro, but it led me back to the issue with Capybara. See jnicklas/capybara#570 (comment)
I'm not exactly sure whose issue it is now. I tend to think it's Capybara using Nokogiri incorrectly.
This is happening to me, too. I just upgraded from Capybara 0.4 to 1.1. Everything works OK on my laptop (ruby 1.9.2p180), but my specs fail on my CI server (ruby 1.9.2-p0, 64-bit machine). I'll try to make a simple failing test.
The problem, as you said, is that there's an extra "<?xml version=\"1.0\" encoding=\"UTF-8\"?>", so some parsing by Capybara breaks.
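As a stopgap, the declaration could be stripped from the string before it is compared or re-parsed. This is only a minimal sketch in plain Ruby, assuming the declaration appears at the very start of the output; `strip_xml_declaration` is a hypothetical helper, not part of Capybara or Nokogiri, and it does nothing about the rewritten DOCTYPE:

```ruby
# Hypothetical helper (not a Capybara or Nokogiri API).
# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?> plus any whitespace after it.
# Assumption: the declaration, if present, is at the start of the string.
XML_DECLARATION = /\A\s*<\?xml\b[^>]*\?>\s*/

def strip_xml_declaration(html)
  # String#sub returns a new string; the input is left untouched.
  html.sub(XML_DECLARATION, "")
end

html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE html>\n<html></html>"
puts strip_xml_declaration(html)
```

Strings without a leading declaration are returned unchanged, so the helper should be safe to apply on MRI as well.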
I've added more information in jnicklas/capybara#597, with a failing spec in https://gist.github.com/1578175
Is the Capybara test still failing? The issue filed on Capybara's side has been closed, though.
Unfortunately I'm off JRuby and can't tell. Maybe others can?
@dnagir Thanks for responding. I'll leave this issue open since somebody might hit this again. I'm hoping I can get an updated report in such a case.
I hit this with Nokogiri 1.5.4.rc3 on JRuby 18.104.22.168. Here's my code and output: https://gist.github.com/2965699. The doctype should be HTML5.
If I set the "insert-doctype" property to false here: https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/internals/HtmlDomParserContext.java#L103, then the DOCTYPE does not get stripped. Indeed, the docs for the Neko HTML parser state, "Also, setting this feature to true will cause the parser to ignore any document type declaration that appears in the document." http://nekohtml.sourceforge.net/settings.html
When the insert-doctype property is set to false, Neko leaves a space character after the doctype declaration in the markup: <!DOCTYPE html >. I'm still digging into this.
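Until the cause is found, the stray space could be tidied on the Ruby side after serialization. A rough sketch, assuming the only defect is whitespace directly before the closing `>` of the doctype; `tidy_doctype` is a hypothetical helper, not something Nokogiri or Neko provides:

```ruby
# Hypothetical helper (not a Nokogiri or NekoHTML API).
# Captures the doctype up to, but not including, any whitespace that
# sits directly before the closing ">".
DOCTYPE_TRAILING_SPACE = /(<!DOCTYPE\b[^>]*?)\s+>/i

def tidy_doctype(html)
  # '\1' re-inserts the captured doctype text; only the first match is changed.
  html.sub(DOCTYPE_TRAILING_SPACE, '\1>')
end

puts tidy_doctype("<!DOCTYPE html >\n<html></html>")
```

A doctype with no whitespace before its closing `>` does not match the pattern and is left as it is.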
Seems like Neko requires a second parameter in the DOCTYPE (https://github.com/sparklemotion/nokogiri/wiki/pure-java-nokogiri-for-jruby), although HTML 5 does not (http://www.w3.org/TR/html5-diff/#doctype).
This doesn't appear to be a problem at this point in time, so I'm closing. Please let me know if you feel it should remain open.