UnicodeDecodeError that cannot be caught in narrow unicode builds #45818

Closed
sbp mannequin opened this issue Nov 20, 2007 · 9 comments
@sbp
Mannequin

sbp mannequin commented Nov 20, 2007

BPO 1477
Nosy @doerwalter, @amauryfa
Files
  • raw-unicode-escape.patch
  • raw-unicode-escape2.patch: 2nd version
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/amauryfa'
    closed_at = <Date 2008-03-24.21:29:24.442>
    created_at = <Date 2007-11-20.21:17:38.980>
    labels = ['type-bug', 'expert-unicode']
    title = 'UnicodeDecodeError that cannot be caught in narrow unicode builds'
    updated_at = <Date 2008-03-24.21:29:24.440>
    user = 'https://bugs.python.org/sbp'

    bugs.python.org fields:

    activity = <Date 2008-03-24.21:29:24.440>
    actor = 'amaury.forgeotdarc'
    assignee = 'amaury.forgeotdarc'
    closed = True
    closed_date = <Date 2008-03-24.21:29:24.442>
    closer = 'amaury.forgeotdarc'
    components = ['Unicode']
    creation = <Date 2007-11-20.21:17:38.980>
    creator = 'sbp'
    dependencies = []
    files = ['9714', '9798']
    hgrepos = []
    issue_num = 1477
    keywords = ['patch']
    message_count = 9.0
    messages = ['57710', '63730', '63840', '64191', '64222', '64322', '64323', '64353', '64442']
    nosy_count = 5.0
    nosy_names = ['doerwalter', 'jafo', 'amaury.forgeotdarc', 'ggenellina', 'sbp']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1477'
    versions = ['Python 2.5']

    The following error is uncatchable:

    >>> try: ur'\U0010FFFF'
    ... except UnicodeDecodeError: pass
    ... 
    UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c 
    in position 0: \Uxxxxxxxx out of range

    This is in a narrow unicode build:

    >>> sys.version_info, hex(sys.maxunicode)
    ((2, 5, 1, 'final', 0), '0xffff')

    Of course the r in ur'...' is redundant in the test case above, but
    there are cases in which it isn't...

    >>> ur'\U0010FFFF\test'
    u'\U0010ffff\\test'
    - from a wide unicode build
    
    >>> ur'\U0010FFFF\test'
    UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c 
    in position 0: \Uxxxxxxxx out of range
    - from the narrow unicode build

    The problem occurs with .decode('raw-unicode-escape') too.

    >>> '\U0010FFFF\test'.decode('raw-unicode-escape')
    Traceback (most recent call last):
    [&c.]

    Most surprisingly of all, however, this problem doesn't occur when you
    don't use a raw string:

    >>> u'\U0010ffff\\test'
    u'\U0010ffff\\test'

    So there is at least a workaround for all cases, which is why this bug
    is marked as Severity: minor. It did take a while to work out that what
    manifests with ur might not apply to u, however; one's first thought is
    usually that the bug is in one's own code, not in Python.
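
The difference between the raw and non-raw forms described in this report can also be seen through the codecs directly. The following is a minimal sketch, assuming Python 3, where every build can represent code points above 0xFFFF, so the narrow-build failure itself does not occur. The variable names are illustrative only.

```python
# Python 3 sketch; the report itself concerns Python 2.5.
data = b"\\U0010FFFF\\test"  # the bytes \U0010FFFF\test

# raw-unicode-escape interprets only \uXXXX and \UXXXXXXXX escapes,
# so "\t" stays as a backslash followed by the letter t.
raw = data.decode("raw-unicode-escape")

# unicode-escape additionally interprets "\t" as a tab character.
cooked = data.decode("unicode-escape")

print(ascii(raw))
print(ascii(cooked))
```

This is why the r prefix is not redundant once the literal contains other backslash sequences: the two codecs agree on the \U escape but disagree on what follows it.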

    @sbp sbp mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Nov 20, 2007
    @jafo
    Mannequin

    jafo mannequin commented Mar 17, 2008

    Can someone comment on this, or bring it up on python-dev if it needs
    more discussion?

    @jafo jafo mannequin assigned doerwalter Mar 17, 2008
    @amauryfa
    Member

    The error is not uncatchable, but it is raised while compiling, like
    a SyntaxError. No bytecode is generated for the input, so the "except"
    clause never runs.
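
A minimal sketch of this point, assuming Python 3: the source below contains a try/except around a string literal with an out-of-range \U escape, and the failure happens when the source is compiled, before any handler exists. Python 3 reports it as a SyntaxError rather than the UnicodeDecodeError seen in 2.5. The names `source` and `caught` are illustrative.

```python
# Source text containing a try/except around a bad string literal.
source = (
    "try:\n"
    "    '\\U00110000'\n"   # code point above 0x10FFFF
    "except Exception:\n"
    "    pass\n"
)

try:
    # The error is raised here, at compile time; the inner
    # try/except is never turned into bytecode.
    compile(source, "<example>", "exec")
except SyntaxError as exc:
    caught = exc
else:
    caught = None

print(type(caught).__name__)
```

The error can be caught, but only by code that wraps the compilation step (compile, exec, eval, or import), not by a handler inside the same source.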

    OTOH, there is a bug in PyUnicode_DecodeRawUnicodeEscape(): it should
    accept code points > 0xffff. It has another problem:

    >>> ur'\U00010000'
    u'\x00'

    I attach a patch to make raw-unicode-escape similar to unicode-escape:
    characters outside the Basic Plane are encoded into a utf-16 surrogate
    pair; on decoding, utf-16 surrogates are decoded into \U00xxxxxx.
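
The surrogate-pair arithmetic mentioned here can be sketched as follows. This is the standard UTF-16 formula written in Python for illustration, not the code from the patch; `to_surrogate_pair` is a hypothetical helper name.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


pair_min = to_surrogate_pair(0x10000)    # (0xD800, 0xDC00)
pair_max = to_surrogate_pair(0x10FFFF)   # (0xDBFF, 0xDFFF)

# Storing 0x10000 in a 16-bit unit without splitting it keeps only the
# low 16 bits, which is consistent with the u'\x00' result shown above.
truncated = 0x10000 & 0xFFFF             # 0

print([hex(v) for v in pair_min], [hex(v) for v in pair_max], truncated)
```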

    @doerwalter
    Contributor

    For a wide build, the code
            if (x <= 0xffff)
                    *p++ = (Py_UNICODE) x;
            else {
                    *p++ = (Py_UNICODE) x;

    looks strange.

    Furthermore with the patch applied Python no longer complains about
    illegal code points:

    >>> ur'\U11111111'
    u'\u1c04\udd11'
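
The output above is what the surrogate formula produces when it is applied without a range check and each result is stored in a 16-bit unit. The sketch below reproduces the arithmetic in Python; it assumes that is what the first patch did on a narrow build, and `unchecked_pair_16bit` is a hypothetical name.

```python
def unchecked_pair_16bit(code_point):
    """Apply the UTF-16 split with no range check, truncating to 16 bits."""
    offset = code_point - 0x10000
    high = (0xD800 + (offset >> 10)) & 0xFFFF
    low = (0xDC00 + (offset & 0x3FF)) & 0xFFFF
    return high, low


high, low = unchecked_pair_16bit(0x11111111)
print("%04x %04x" % (high, low))  # prints "1c04 dd11"
```

Rejecting any value above 0x10FFFF before the split avoids producing such a meaningless pair.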

    @amauryfa
    Member

    The "strange" code is a copy of PyUnicode_DecodeUnicodeEscape. I find it
    easier to read. And the duplicate lines are likely to be optimized by
    the compiler.

    Here is a new version of the patch which:

    • correctly forbids illegal code points
    • computes the byte positions; this is important for error handlers.
      In Python 2.5, the end position was completely bogus:
    >>> try: '\U11111111'.decode("raw-unicode-escape")
    ... except Exception, e: print repr(e)
    UnicodeDecodeError('rawunicodeescape', '\\U11111111', 0, 504955452,
    '\\Uxxxxxxxx out of range')
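
To illustrate why the positions matter, here is a minimal sketch, assuming Python 3, that inspects the start and end attributes of the exception and then decodes the same input with the built-in "replace" error handler. The variable names are illustrative, and the exact end value may differ between versions, so none is asserted here.

```python
data = b"\\U11111111"  # the ten bytes \U11111111

try:
    data.decode("raw-unicode-escape")
except UnicodeDecodeError as exc:
    err = exc

# start and end index into the input bytes; a sane end lies within the input.
bad_bytes = data[err.start:err.end]
print(err.start, err.end, bad_bytes, err.reason)

# An error handler uses those positions to decide where decoding resumes,
# so the trailing "ok" is only recovered if end is correct.
replaced = (data + b"ok").decode("raw-unicode-escape", errors="replace")
print(ascii(replaced))
```

With an end position such as the 504955452 reported above, a handler has no valid place in the input to resume from.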

    @doerwalter
    Contributor

    The patch looks goog to me now. Go ahead and check it in.

    @doerwalter doerwalter assigned amauryfa and unassigned doerwalter Mar 22, 2008
    @doerwalter
    Contributor

    s/goog/good/g ;)

    @amauryfa
    Member

    Committed r61793. Will backport.

    @amauryfa
    Member

    backported to 2.5 branch as r61854

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022