ParseException could not get message when xml with invalid characters #29

Closed
kewudu opened this issue Apr 24, 2020 · 3 comments · Fixed by #123

kewudu commented Apr 24, 2020

I get the following backtrace when I load an XML file:

incompatible character encodings: UTF-8 and ASCII-8BIT
/usr/local/rvm/rubies/ruby-2.6.3/lib/ruby/gems/2.6.0/gems/rexml-3.2.2/lib/rexml/parseexception.rb:32:in `to_s'

The XML is declared as UTF-8 but contains invalid characters. ParseException#to_s uses ASCII-8BIT encoding for part of its message, so to_s itself raises an encoding error and the user never gets the actual error information for the XML.

Member

kou commented Apr 25, 2020

Could you show a Ruby script and XML that reproduce this problem?

Author

kewudu commented Apr 30, 2020

> Could you show a Ruby script and XML that reproduce this problem?

My XML file contains invalidly encoded bytes. Part of the file is:

<?xml version="1.0" encoding="UTF-8"?>

<environmentblock>
  <userconf>
     <!--LocalHLTConfig.jsonτݾһˇҘѫքìիɧڻզ՚LocalHLTConfig.jsonτݾìȒτݾŚɝא"jsonfirst": "true"ʱìܡԅЈʹԃL -->
    .... lots of content...
  </userconf>
</environmentblock>

My Ruby script is simple:

require 'rexml/document'
include REXML
require 'json'

# $config_path is the full path of the XML file I want to load
xml_file = Document.new(File.open($config_path))

The XML file is UTF-8 encoded, and I know it contains invalid characters. After I load the file, Ruby raises <REXML::ParseException: #<ArgumentError: invalid byte sequence in UTF-8> and I can't get the exact error information from the exception message. If I temporarily change line 32 of the ParseException to_s method to use UTF-8, like this:
err << @source.buffer[0..80].force_encoding("UTF-8")
then I get the exact error information for the XML file:
...
Exception parsing
Line: 5
Position: 270
Last 80 unconsumed characters:
^M
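
While the exception message cannot be read, one way to locate the offending bytes is to check the text yourself before handing it to REXML. This is only a rough sketch that assumes the file fits in memory. `first_invalid_line` is a hypothetical helper, not part of REXML, and the sample string stands in for the real file contents.

```ruby
# Hypothetical helper (not part of REXML): returns the 1-based number of the
# first line that is not valid in the string's encoding, or nil if every
# line is valid.
def first_invalid_line(text)
  text.each_line.with_index(1) do |line, number|
    return number unless line.valid_encoding?
  end
  nil
end

# Sample input standing in for File.read(path, encoding: "UTF-8").
# The byte 0xFF can never appear in valid UTF-8.
sample = "<a>\n<!-- \xFF -->\n</a>\n"

bad_line = first_invalid_line(sample)

# String#scrub replaces each invalid byte sequence, here with "?".
cleaned = sample.scrub("?")
```

Scrubbing changes the content, so it is better suited to diagnosing the problem than to silently repairing the file.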

@iangreenleaf

Here's a very simple reproduction of this bug (the base64 stuff is just there to make sure the special characters in the string come through):

require 'rexml/document'
require 'base64'
include REXML

begin
  REXML::Document.new(Base64.decode64("YT08YSDigIs+4oCL\n"))
  # Equivalent to:
  # REXML::Document.new "a=<a ​>​"
rescue => e
  e.to_s
end

The input is invalid XML and rightly triggers a ParseException, but then reading the exception's attributes raises another error: incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError).

It looks like this is a bug in the ParseException code. Line 21 initializes an empty string, which defaults to UTF-8 encoding. Then line 32 forces a string's encoding to ASCII-8BIT and tries to append it to the UTF-8 string, which triggers the encoding mismatch:

err << @source.buffer[0..80].force_encoding("ASCII-8BIT").gsub(/\n/, ' ')
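
The mismatch can be shown with plain strings, without REXML. This is a minimal sketch with made-up message text. It shows that appending a binary-tagged string works while the receiver is ASCII-only, and fails once the receiver already holds a non-ASCII UTF-8 character, as happens when the message includes a tag with Unicode in it.

```ruby
# The binary-tagged context, as produced by force_encoding("ASCII-8BIT").
# The bytes E2 80 8B are the UTF-8 encoding of U+200B (zero-width space).
context = "\xE2\x80\x8B>".b

# Case 1: the message so far is ASCII-only, so Ruby allows the append and
# the result takes on the binary encoding.
ascii_message = +"Last 80 unconsumed characters:\n"
ascii_message << context
ascii_result = ascii_message.encoding

# Case 2: the message already contains a non-ASCII UTF-8 character, so the
# append is rejected with Encoding::CompatibilityError.
unicode_message = +"Invalid tag: <a \u200B>\n"
unicode_error = begin
  unicode_message << context
  nil
rescue Encoding::CompatibilityError => e
  e
end
```

The exact wording of the error message differs between Ruby versions, so the sketch only captures the exception object.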

naitoh added a commit to naitoh/rexml that referenced this issue May 3, 2024
…etrieved if the error content contained Unicode characters.

## Why?
If the XML tag contains Unicode characters when the error occurs, an `Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT` exception is raised and the ParseException error message cannot be retrieved.

See: ruby#29
kou closed this as completed in #123 on May 3, 2024
kou pushed a commit that referenced this issue May 3, 2024
…alid encoding XML (#123)

## Why?

If the XML tag contains Unicode characters and an error occurs for
the tag, an incompatible encoding error is raised, because our parse
exception message has a UTF-8 part (which includes the target tag
information) and an ASCII-8BIT part (which includes the error context input).

Fix GH-29

Reported by DuKewu. Thanks!!!
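
One way to combine the two parts safely is to bring them to a single encoding before joining them. The sketch below is an illustration only and is not necessarily how #123 fixed it. `build_parse_message` is a hypothetical helper, and the summary and raw bytes are made-up sample data.

```ruby
# Hypothetical helper, not REXML's code: joins a UTF-8 summary with raw
# context bytes without mixing encodings.
def build_parse_message(summary, context_bytes)
  # Re-tag a copy of the bytes as UTF-8, then replace anything invalid
  # with "?" so later string operations cannot fail.
  context = context_bytes.dup.force_encoding(Encoding::UTF_8).scrub("?")
  "#{summary}\nLast 80 unconsumed characters:\n#{context.gsub(/\n/, ' ')}"
end

summary = "Invalid tag: <a \u200B>"
raw = "\xE2\x80\x8B>\xFF\n".b   # valid U+200B, ">", one invalid byte, newline

message = build_parse_message(summary, raw)
```

Scrubbing loses the exact invalid bytes. An alternative would be to convert the whole message to binary instead, at the cost of a message that no longer displays its Unicode characters.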