[Bug]: Get data from XML: encoding setting doesn't work; it seems to always use UTF-8 #5222

Open
@RonnyRen

Description

Apache Hop version?

2.12

Java version?

18

Operating system

Windows

What happened?

I used the "Get data from XML" transform to process a Windows-1252 encoded file that contains a special character. The XML declaration in the file carries no encoding information. Whichever encoding I selected in the transform, the error below occurred; the only workaround was to declare the encoding inside the XML file itself.
Error:
org.dom4j.DocumentException: Error on line 13 of document file:///C:/workspace/hop/windows-1252 : Invalid byte 1 of 1-byte UTF-8 sequence.

I looked at the source code and I think I found the root cause.
In the code linked below, the read function of SAXReader appears to be called incorrectly.


According to the documentation, the second parameter is the systemId, not the encoding.

It should call setEncoding to specify the encoding of the input source before calling read.
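
To illustrate the idea, here is a minimal sketch that uses only the JDK's built-in parser rather than dom4j or the Hop code itself. It assumes that dom4j's SAXReader.setEncoding ends up setting the encoding on the underlying org.xml.sax.InputSource, which is what the sketch does directly. The class name EncodingSketch and the helper rootText are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EncodingSketch {

    // Hypothetical helper: parses the bytes and returns the root element's text.
    // Pass null as the encoding to let the parser auto-detect it, which means
    // UTF-8 when the document has no BOM and no XML declaration.
    static String rootText(byte[] xml, String encoding) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            InputSource source = new InputSource(new ByteArrayInputStream(xml));
            if (encoding != null) {
                // The encoding is set on the input source before parsing,
                // not passed as an extra argument to the parse/read call.
                source.setEncoding(encoding);
            }
            return builder.parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // A Windows-1252 document without an XML declaration; \u00e9 is
        // encoded as the single byte 0xE9, which is not valid UTF-8 here.
        byte[] xml = "<root>caf\u00e9</root>".getBytes(Charset.forName("windows-1252"));

        // With the encoding set on the input source, the document parses.
        System.out.println(rootText(xml, "windows-1252"));

        // Without it, the parser falls back to UTF-8 and rejects the byte.
        try {
            rootText(xml, null);
        } catch (IllegalStateException e) {
            System.out.println("Without encoding: " + e.getMessage());
        }
    }
}
```

With dom4j, the equivalent change would presumably be to call setEncoding on the SAXReader instance with the configured encoding and then call read without passing the encoding as the second argument.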

Please feel free to correct me if I got something wrong.

Note: the "XML input stream (StAX)" transform works correctly with the specified encoding.

Issue Priority

Priority: 2

Issue Component

Component: Transforms
