XML::Parser screws up non-ascii text (and does so inconsistently) [rt.cpan.org #11899] #30

Open
toddr opened this issue Sep 24, 2019 · 0 comments

toddr commented Sep 24, 2019

Migrated from rt.cpan.org#11899 (status was 'new')

Requestors:

Attachments:

From on 2005-03-16 08:25:18:
Noticed on Debian Sarge:
libexpat1 1.95.8-1
libxml-parser-perl 2.34-3
perl 5.8.4-6

and Gentoo:
expat-1.95.8
XML-Parser-2.34
perl-5.8.5

XML::Parser mangles non-ASCII characters: most of the time, accented characters are converted from UTF-8 down to ISO-8859-1. After much debugging, I determined that the conversion happens in Parser.pm rather than in Expat.so, despite the documentation saying that all text is returned as UTF-8.

The attached tar file contains two XML files and a sample Perl script. The two XML files differ by only one character, but Perl handles them differently. The perlunicode manpage says:

   If strings operating under byte semantics and strings with Unicode
   character data are concatenated, the new string will be created by
   decoding the byte strings as ISO 8859-1 (Latin-1) [...]
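
The rule quoted above can be reproduced with core Perl alone. The following is a minimal sketch, not taken from the attached files: `$bytes`, `$chars`, and `$joined` are illustrative names, and the input is assumed to be the two raw UTF-8 bytes for "é" joined to a string that already holds Unicode character data.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Raw UTF-8 bytes for "é" (0xC3 0xA9), never decoded into characters.
my $bytes = "\xC3\xA9";

# A character string containing a code point above 255.
my $chars = "\x{263A}";

# On concatenation Perl treats each byte of $bytes as a Latin-1 character,
# so "é" turns into the two characters U+00C3 and U+00A9.
my $joined = $bytes . $chars;

printf "characters:  %d\n", length($joined);                   # 3
printf "UTF-8 bytes: %d\n", length(encode('UTF-8', $joined));  # 7
```

The two original bytes become four once the joined string is encoded again, which is the familiar double-encoded form of an accented character.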

Anyway, putting "use encoding 'utf8';" at the top of XML::Parser made Perl keep the string as UTF-8 instead of munging the accented characters. Putting it at the top of the script that defines the Char handler also worked, but I think it really belongs in XML::Parser if the module is to always return UTF-8 as it claims.
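
For comparison, here is a minimal sketch of how the output layer alone changes the bytes written for the same character string. It is an assumption that this is related to the behaviour seen with the attached files; that has not been verified. `$text`, `$raw`, and `$utf8` are illustrative names, and in-memory filehandles stand in for STDOUT so the results can be inspected.

```perl
use strict;
use warnings;

# A character string whose only non-ASCII code point is below 256 ("é", U+00E9),
# standing in for text that a Char handler might receive.
my $text = "caf\x{E9}";

# No encoding layer: characters below 256 are written as single Latin-1 bytes.
open(my $raw_fh, '>', \my $raw) or die "open failed: $!";
print {$raw_fh} $text;
close $raw_fh;

# Explicit UTF-8 layer: the same string is written as UTF-8 bytes.
open(my $utf8_fh, '>:encoding(UTF-8)', \my $utf8) or die "open failed: $!";
print {$utf8_fh} $text;
close $utf8_fh;

printf "without layer: %d bytes\n", length($raw);   # 4
printf "with layer:    %d bytes\n", length($utf8);  # 5
```

In a real script the equivalent step would be something like `binmode(STDOUT, ':encoding(UTF-8)')`, which makes the output encoding explicit without relying on a pragma.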

John McPherson
