
Add support for decoding CESU-8 encoded strings. #17


Merged
merged 1 commit into from
Jun 29, 2017


nlitsme
Contributor

@nlitsme nlitsme commented Jun 29, 2017

This works around Java's broken UTF-8 implementation.

You will need the https://github.com/LuminosoInsight/python-ftfy module for the patch to have an effect.

The following code will now output a 😃 (U+1F603, i.e. "\U0001f603"), instead of raising a UnicodeDecodeError or outputting ??????.

from __future__ import division, print_function
from binascii import a2b_hex
import javaobj

b = a2b_hex("ACED0005740006EDA0BDEDB883")
print(javaobj.loads(b))

The problem with the byte sequence ED A0 BD ED B8 83 is that a plain UTF-8 decode yields D83D DE03. These are surrogate code points, which are invalid on their own, but together they form a valid UTF-16 surrogate pair. So you have to decode twice, first as UTF-8 and then as UTF-16, and you end up with the Unicode character U+1F603.
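The two-step decode described above can be sketched with the standard library alone. This is only an illustration and not the code from the patch, which relies on ftfy: the helper name `decode_cesu8` is hypothetical, Python 3 is assumed, and the sketch does not handle the C0 80 encoding that Java uses for NUL.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode CESU-8 style bytes by decoding twice (hypothetical helper)."""
    # Step 1: decode as UTF-8, but let surrogate code points through.
    # ED A0 BD -> U+D83D (high surrogate), ED B8 83 -> U+DE03 (low surrogate).
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: write the surrogates out as UTF-16 code units and decode them
    # again, which joins each high/low pair into a single code point:
    # 0x10000 + ((0xD83D - 0xD800) << 10) + (0xDE03 - 0xDC00) == 0x1F603
    utf16_units = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16_units.decode("utf-16-le")


smiley = decode_cesu8(bytes.fromhex("EDA0BDEDB883"))
print(hex(ord(smiley)))
```

A lone surrogate that has no partner makes the second step raise a UnicodeDecodeError, so unpaired surrogates are not silently passed through.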

@tcalmant tcalmant merged commit 07ca2a0 into tcalmant:master Jun 29, 2017
@tcalmant
Owner

Thanks for your contribution!
I'll add a note about the ftfy package to the README.

@nlitsme nlitsme deleted the itsme-cesu8 branch June 29, 2017 13:53
@voetsjoeba
Contributor

Note that the CESU-8/Java-UTF-8 decoder in ftfy.bad_codecs does not enforce correctness, and its documentation states explicitly that it is not intended to.

Here's an example of a byte sequence that is invalid CESU-8 and is rejected by Java, but is accepted by ftfy's decoder:

import ftfy.bad_codecs
print(b'\xf0\x90\x80\x80'.decode("java_utf8", errors="strict") == u"\U00010000") # True

So be careful not to rely on the codec to make accept/reject decisions about the validity of serialized objects ...
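If an accept/reject decision is needed anyway, one option is a separate pre-check before the lenient codec is called. The sketch below is an assumption on my part and is not part of javaobj or ftfy. The function name `has_four_byte_sequence` is hypothetical, Python 3 is assumed, and it only catches the 4-byte case shown above, so it is not a full CESU-8 validator.

```python
def has_four_byte_sequence(data: bytes) -> bool:
    """Return True if data contains a 4-byte UTF-8 lead byte (hypothetical check)."""
    # Continuation bytes are always 0x80-0xBF, so any byte >= 0xF0 is either
    # the lead byte of a 4-byte sequence or not valid in UTF-8 at all.
    # CESU-8 encodes supplementary characters as two 3-byte surrogates instead.
    return any(byte >= 0xF0 for byte in data)


print(has_four_byte_sequence(b"\xf0\x90\x80\x80"))          # plain UTF-8 for U+10000
print(has_four_byte_sequence(b"\xed\xa0\xbd\xed\xb8\x83"))  # CESU-8 surrogate pair
```

The first call flags the sequence that Java rejects, while the surrogate-pair form from the original example passes the check.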
