HTML includes XML declaration and DOCTYPE on JRuby #590

Closed
dnagir opened this Issue Jan 3, 2012 · 11 comments

@dnagir

When using Nokogiri with Capybara on JRuby 1.6.5 (1.9 mode), page.html returns something that looks like:

> page.html

"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"...."

We can clearly see that <?xml version="1.0" encoding="UTF-8"?> has been inserted, which changes the resulting output.
The DOCTYPE is also HTML 4.01 Transitional, which is not correct either, because the page was generated with an HTML5 DOCTYPE.

This doesn't happen on MRI 1.9.3.

Any clues on how to fix it or work around it?

(Originally opened against Capybara, but I was pointed to Nokogiri.)
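
One possible stopgap for the question above, sketched in plain Ruby: strip a leading XML declaration from the serialized string before asserting on it. The helper name `strip_xml_declaration` and the sample markup are assumptions made for illustration; nothing in this thread confirms that Capybara or Nokogiri ship such a helper, and this does not address the rewritten DOCTYPE.

```ruby
# Hypothetical helper, not part of Capybara or Nokogiri.
# Matches an XML declaration only at the very start of the string.
XML_DECLARATION = /\A\s*<\?xml[^>]*\?>\s*/

def strip_xml_declaration(html)
  # sub returns a new string and leaves the input untouched.
  html.sub(XML_DECLARATION, "")
end

# Sample input modeled on the output reported above (assumed markup).
html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE html>\n<html></html>"
puts strip_xml_declaration(html)
```

Because the pattern is anchored at the start of the string, input that has no declaration is returned unchanged, so the same assertion could run on both MRI and JRuby.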

@yokolet
Sparkle Motion member

Hello,

For some reason, it seems the XML declaration switch has been turned on.

Can you share a reproducible snippet?

@dnagir

I tried to put up a simple repro. But it led me back to the issue with Capybara. See jnicklas/capybara#570 (comment)

I'm not exactly sure whose issue it is now. I tend to think it's Capybara using Nokogiri incorrectly.

@gaizka

Hi there!

This is happening to me, too. I just upgraded from Capybara 0.4 to 1.1. Everything works OK on my laptop (ruby 1.9.2p180), but my specs fail on my CI server (ruby 1.9.2-p0, 64-bit machine). I'll try to make a simple failing test.

The problem, as you said, is that there's an extra "<?xml version=\"1.0\" encoding=\"UTF-8\"?>", so some parsing by Capybara breaks.

@gaizka

I've added more information in jnicklas/capybara#597, with a failing spec in https://gist.github.com/1578175

@yokolet
Sparkle Motion member

Hello,

Is the Capybara test still failing? The issue filed on the Capybara side has been closed, though.

@dnagir

Unfortunately, I'm off JRuby and can't tell. Maybe others will?

@yokolet
Sparkle Motion member

@dnagir Thanks for responding. I'll leave this issue open since somebody might hit this again. If that happens, I hope to get an updated report.

@statonjr

I hit this with Nokogiri 1.5.4.rc3 on JRuby 1.6.7.2. Here's my code and output: https://gist.github.com/2965699. The doctype should be HTML5.

@statonjr

If I set the "insert-doctype" property to false here: https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/internals/HtmlDomParserContext.java#L103, then the DOCTYPE does not get stripped. Indeed, the docs for the Neko HTML parser state, "Also, setting this feature to true will cause the parser to ignore any document type declaration that appears in the document." http://nekohtml.sourceforge.net/settings.html

When the insert-doctype property is set to false, Neko leaves a space character after the doctype declaration in the markup: <!DOCTYPE html >. I'm still digging into this.
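
To illustrate the stray space described above, here is a minimal Ruby sketch that rewrites a leading `<!DOCTYPE html >` to `<!DOCTYPE html>` in an already serialized string. The helper name `normalize_html5_doctype` is an assumption for illustration only; it post-processes output and is not how Nokogiri or Neko handle the DOCTYPE internally.

```ruby
# Hypothetical helper, not part of Nokogiri or NekoHTML.
# Only touches a bare HTML5 doctype at the start of the string;
# doctypes with a public identifier are left alone.
def normalize_html5_doctype(html)
  html.sub(/\A(\s*)<!DOCTYPE\s+html\s+>/i, '\1<!DOCTYPE html>')
end

# Sample input modeled on the markup quoted above (assumed surrounding tags).
puts normalize_html5_doctype("<!DOCTYPE html >\n<html></html>")
```

The pattern requires only whitespace between `html` and `>`, so an HTML 4.01 Transitional declaration would pass through unchanged.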

@statonjr

Seems like Neko requires a second parameter in the DOCTYPE (https://github.com/sparklemotion/nokogiri/wiki/pure-java-nokogiri-for-jruby), although HTML 5 does not (http://www.w3.org/TR/html5-diff/#doctype).

Related: #547

@flavorjones
Sparkle Motion member

This doesn't appear to be a problem at this point in time, so I'm closing. Please let me know if you feel it should remain open.
