I assumed that #756 would not be controversial (which is why this pull request is based on that branch).
This change may be slightly more controversial since it changes existing functioning behavior from always defaulting to ASCII (and never using @encoding) to using @encoding by default (and @encoding defaults to UTF-8).
The existing behavior doesn't make sense to me and is not what I would expect, hence this separate pull request. One small issue remains here, existing behavior is that calling parse_io with an encoding has the side-effect of changing the encoding for all uses of the Parser instance, and I did not address that.
This pull request fails (merged 6bcf9a47 into c3953c2).
Any thoughts on this? @flavorjones ? Also, travis failing cannot possibly be because of this change.
Will be going through the backlog of Nokogiri pull requests and issues this weekend. Thanks for your patience.
Bump... I guess since #756 was accepted I can rebase this to make it easier to accept. Would like to know if there's any interest in this, though. If other solutions would be preferable, I'm open to discussion. If there's no interest in solving this perceived problem, I will drop it.
@soulcutter - Noticed that the only tests in this PR are the same as the tests in #756. Can you rebase off current master and add some test coverage for this behavior?
parse_io respects @encoding when no encoding is passed to it
@flavorjones Ug, I did a rebase, deleted remote branch, re-pushed branch, and looks like it closed this pull request (wasn't sure it would do that - now I know). I will open a new PR with the added test coverage.
Added a test for sax parsing with a specified encoding
Ok, I created a test which exercises this, however it runs into the problem I reported in #671 so maybe this whole PR is moot until that is sorted out (and I have no clue how to address that one since it seems to be on the libxml side).
Moved encoding test next to other encoding test, fixed incorrect use of
@parser rather than parser
As you mentioned, you don't see the libxml issue when using jruby, so I at least got this test passing.
However I noticed that while it does exercise the way encoding is set on the parser, this test actually passes on jruby whether you set the encoding or not. At first I thought xerces may have been using the encoding from the xml decl, however when I removed that it -still- worked. Xerces is too smart.
The failure you see on the ruby/libxml side is more representative of the problem - libxml is not as smart as xerces and cannot parse "こんにちは" without a properly-set encoding.
Run options: -n test_parser_respects_encoding_ivar --seed 63186
# Running tests:
parser error : xmlSwitchToEncoding : no input
Finished tests in 0.001295s, 772.2008 tests/s, 1544.4015 assertions/s.
Expected ["\n ", "This is a Shift_JIS File", "\n ", "\n"] to include "こんにちは".
1 tests, 2 assertions, 1 failures, 0 errors, 0 skips
This test looks like it passes on Nokogiri 1.5.9 with libxml 2.8.0. Is there anything still actionable here?
It's been a long time since I looked at this... #671 still seems valid to me, so I think this would remain a sort of open issue since there is no point in passing the encoding to libxml2 if it will not actually take the encoding.
As I mentioned in my comment from April 2013, the test in this PR passes on master right now (both MRI and JRuby), and so I don't think there's anything actionable here anymore. Closing.
Happy to reopen if I've misunderstood, just make a comment.