parse_io should respect @encoding when no encoding is passed to it #757

Closed
wants to merge 3 commits into
from

Projects

None yet

3 participants

@soulcutter

I assumed that #756 would not be controversial (which is why this pull request is based on that branch).

This change may be slightly more controversial since it changes existing functioning behavior from always defaulting to ASCII (and never using @encoding) to using @encoding by default (and @encoding defaults to UTF-8).

The existing behavior doesn't make sense to me and is not what I would expect, hence this separate pull request. One small issue remains here, existing behavior is that calling parse_io with an encoding has the side-effect of changing the encoding for all uses of the Parser instance, and I did not address that.

@travisbot

This pull request fails (merged 6bcf9a47 into c3953c2).

@soulcutter

Any thoughts on this? @flavorjones ? Also, travis failing cannot possibly be because of this change.

@flavorjones
Sparkle Motion member

Will be going through the backlog of Nokogiri pull requests and issues this weekend. Thanks for your patience.

@soulcutter

waiting...

Bump... I guess since #756 was accepted I can rebase this to make it easier to accept. Would like to know if there's any interest in this, though. If other solutions would be preferable, I'm open to discussion. If there's no interest in solving this perceived problem, I will drop it.

@flavorjones
Sparkle Motion member

@soulcutter - Noticed that the only tests in this PR are the same as the tests in #756. Can you rebase off current master and add some test coverage for this behavior?

@soulcutter soulcutter closed this Oct 1, 2012
@soulcutter

@flavorjones Ug, I did a rebase, deleted remote branch, re-pushed branch, and looks like it closed this pull request (wasn't sure it would do that - now I know). I will open a new PR with the added test coverage.

@soulcutter soulcutter reopened this Oct 1, 2012
@soulcutter

Ok, I created a test which exercises this, however it runs into the problem I reported in #671 so maybe this whole PR is moot until that is sorted out (and I have no clue how to address that one since it seems to be on the libxml side).

@soulcutter

As you mentioned, you don't see the libxml issue when using jruby, so I at least got this test passing.

However I noticed that while it does exercise the way encoding is set on the parser, this test actually passes on jruby whether you set the encoding or not. At first I thought xerces may have been using the encoding from the xml decl, however when I removed that it -still- worked. Xerces is too smart.

The failure you see on the ruby/libxml side is more representative of the problem - libxml is not as smart as xerces and cannot parse "こんにちは" without a properly-set encoding.

Run options: -n test_parser_respects_encoding_ivar --seed 63186

# Running tests:

parser error : xmlSwitchToEncoding : no input
F

Finished tests in 0.001295s, 772.2008 tests/s, 1544.4015 assertions/s.

  1) Failure:
test_parser_respects_encoding_ivar(Nokogiri::XML::SAX::TestParser) [./test/xml/sax/test_parser.rb:167]:
Expected ["\n  ", "This is a Shift_JIS File", "\n  ", "\n"] to include "こんにちは".

1 tests, 2 assertions, 1 failures, 0 errors, 0 skips
@flavorjones
Sparkle Motion member

This test looks like it passes on Nokogiri 1.5.9 with libxml 2.8.0. Is there anything still actionable here?

@soulcutter

It's been a long time since I looked at this... #671 still seems valid to me, so I think this would remain a sort of open issue since there is no point in passing the encoding to libxml2 if it will not actually take the encoding.

@flavorjones
Sparkle Motion member

As I mentioned in my comment from April 2013, the test in this PR passes on master right now (both MRI and JRuby), and so I don't think there's anything actionable here anymore. Closing.

Happy to reopen if I've misunderstood, just make a comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment