Support PEP 383 on Windows: mbcs support of surrogateescape error handler #54030

vstinner · 2010-09-10T11:04:40Z

BPO	9821
Nosy	@malemburg, @loewis, @vstinner

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2010-09-10.21:36:45.337>
created_at = <Date 2010-09-10.11:04:40.374>
labels = ['interpreter-core', 'invalid', 'library', 'expert-unicode', 'OS-windows']
title = 'Support PEP 383 on Windows: mbcs support of\tsurrogateescape error handler'
updated_at = <Date 2010-09-10.21:36:45.336>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-09-10.21:36:45.336>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-09-10.21:36:45.337>
closer = 'vstinner'
components = ['Interpreter Core', 'Library (Lib)', 'Unicode', 'Windows']
creation = <Date 2010-09-10.11:04:40.374>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 9821
keywords = []
message_count = 4.0
messages = ['116001', '116006', '116011', '116044']
nosy_count = 3.0
nosy_names = ['lemburg', 'loewis', 'vstinner']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue9821'
versions = ['Python 3.2']

vstinner · 2010-09-10T11:04:39Z

It would be nice to support PEP 383 (surrogateescape) on Windows, but the mbcs codec doesn't support it for performance reasons. The Windows functions to encode/decode MBCS don't give the index of the unencodable/undecodable character/byte. For encoding, we can try to encode character by character (being careful with surrogate pairs) and check whether each character is a Python lone surrogate (a character in the range U+DC80..U+DCFF). For decoding, it is more complex because MBCS can be a multibyte encoding, e.g. cp65001 (Microsoft variant of UTF-8, see bpo-6058). So it's not possible to decode byte by byte, and we would have to write a heuristic to guess the right number of bytes for each call to the decode function.
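The character-by-character encoding idea could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses Python's own "cp1252" codec as a stand-in for the Windows MBCS functions, it assumes Python 3.3 or newer (where iterating over a str yields whole code points, so surrogate pairs need no special care), and the function name `encode_escaping_surrogates` is hypothetical.

```python
def encode_escaping_surrogates(text, encoding="cp1252"):
    """Encode text one character at a time (hypothetical helper).

    Lone surrogates in U+DC80..U+DCFF are turned back into the raw
    bytes 0x80..0xFF; every other character goes through the codec
    in strict mode.
    """
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Escaped byte: recover the original byte value.
            out.append(code - 0xDC00)
        else:
            # Regular character: may raise UnicodeEncodeError.
            out += ch.encode(encoding)
    return bytes(out)


escaped = encode_escaping_surrogates("abc\udce9def")
```

For this input the result should match what the built-in surrogateescape handler produces with the same codec, which is the behaviour the mbcs codec would need to reproduce.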

--

A completely different solution is to get the MBCS code page and use the Python code page codec (e.g. "cp1252") instead of the "mbcs" encoding, because the Python cpXXXX codecs support all Python error handlers. Example (with Python 2.6):

>>> print(u"abcŁdef".encode("cp1252", "replace"))
abc?def
>>> print(u"abcŁdef".encode("cp1252", "ignore"))
abcdef
>>> print(u"abcŁdef".encode("cp1252", "backslashreplace"))
abc\u0141def
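As a rough Python 3 sketch of the same idea: the helper below looks up the ANSI code page on Windows and maps it to a Python codec name. The helper name `ansi_code_page_codec` and the "cp1252" fallback are assumptions made for illustration; `GetACP()` is the Win32 call that returns the ANSI code page number.

```python
import codecs
import sys


def ansi_code_page_codec(default="cp1252"):
    """Return a Python codec name for the ANSI code page (hypothetical helper)."""
    if sys.platform == "win32":
        import ctypes
        name = "cp%d" % ctypes.windll.kernel32.GetACP()
        try:
            codecs.lookup(name)  # make sure Python ships this codec
            return name
        except LookupError:
            pass
    # Assumption: fall back to cp1252 when no matching codec is found.
    return default


codec_name = ansi_code_page_codec()

# The error handlers from the example above, with an explicit cp1252 codec:
replaced = "abc\u0141def".encode("cp1252", "replace")
ignored = "abc\u0141def".encode("cp1252", "ignore")
escaped = "abc\u0141def".encode("cp1252", "backslashreplace")
```

In Python 3 the results are bytes objects rather than printed text, but the handlers behave as in the Python 2.6 session.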

See also bpo-8611 for the problem if the Python path cannot be encoded to mbcs (work in progress, see bpo-9425).

malemburg · 2010-09-10T11:14:59Z

STINNER Victor wrote:
> 
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
> 
> It would be nice to support PEP 383 (surrogateescape) on Windows, but the mbcs codec doesn't support it for performance reasons. The Windows functions to encode/decode MBCS don't give the index of the unencodable/undecodable character/byte. For encoding, we can try to encode character by character (being careful with surrogate pairs) and check whether each character is a Python lone surrogate (a character in the range U+DC80..U+DCFF). For decoding, it is more complex because MBCS can be a multibyte encoding, e.g. cp65001 (Microsoft variant of UTF-8, see python/cpython#50308). So it's not possible to decode byte by byte, and we would have to write a heuristic to guess the right number of bytes for each call to the decode function.
> 
> --
> 
> A completely different solution is to get the MBCS code page and use the Python code page codec (e.g. "cp1252") instead of the "mbcs" encoding, because the Python cpXXXX codecs support all Python error handlers. Example (with Python 2.6):
> 
> >>> print(u"abcŁdef".encode("cp1252", "replace"))
> abc?def
> >>> print(u"abcŁdef".encode("cp1252", "ignore"))
> abcdef
> >>> print(u"abcŁdef".encode("cp1252", "backslashreplace"))
> abc\u0141def

That would certainly be a better approach, provided that our
cp-encodings are indeed compatible with the Windows variants
(which unfortunately tend to often use slightly different
mappings).

We could then also alias 'mbcs' to the cp-encoding (sort of
like the reverse of what we do in site.py:aliasmbcs()).

vstinner · 2010-09-10T12:16:20Z

Oh wait. PEP 383 is a solution to store undecodable bytes in a Unicode string, but for mbcs I'm trying to do the opposite: store unencodable Unicode characters in bytes, and this is not possible (at least with PEP 383).

Example with Python 3.1:

>>> print("abcŁdef".encode("cp1252", "surrogateescape"))
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 3: character maps to <undefined>

vstinner · 2010-09-10T21:36:45Z

Closing this issue: PEP 383 is specific to filesystems that use bytes, so it is useless on Windows (the problem on Windows is with encoding, not with decoding).

vstinner added the interpreter-core (Objects, Python, Grammar, and Parser dirs), stdlib (Python modules in the Lib dir), topic-unicode, and OS-windows labels Sep 10, 2010

malemburg changed the title ~~Support PEP 383 on Windows: mbcs support of surrogateescape error handler~~ Support PEP 383 on Windows: mbcs support of surrogateescape error handler Sep 10, 2010

vstinner closed this as completed Sep 10, 2010

vstinner added the invalid label Sep 10, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022
