python3.6.4 os.environ error when write chinese to file #87742

rushant001 · 2021-03-21T04:46:53Z

BPO	43576
Nosy	@terryjreedy, @vstinner, @ezio-melotti, @eryksun, @rushant001

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2021-03-26.23:12:04.502>
created_at = <Date 2021-03-21.04:46:52.916>
labels = ['interpreter-core', 'type-bug', 'expert-IO', 'invalid', 'library', 'expert-unicode']
title = 'python3.6.4 os.environ error when write chinese to file'
updated_at = <Date 2021-03-27.00:06:43.938>
user = 'https://github.com/rushant001'

bugs.python.org fields:

activity = <Date 2021-03-27.00:06:43.938>
actor = 'eryksun'
assignee = 'none'
closed = True
closed_date = <Date 2021-03-26.23:12:04.502>
closer = 'vstinner'
components = ['Interpreter Core', 'Library (Lib)', 'Unicode', 'IO']
creation = <Date 2021-03-21.04:46:52.916>
creator = 'rushant'
dependencies = []
files = []
hgrepos = []
issue_num = 43576
keywords = []
message_count = 6.0
messages = ['389215', '389560', '389568', '389571', '389572', '389579']
nosy_count = 5.0
nosy_names = ['terry.reedy', 'vstinner', 'ezio.melotti', 'eryksun', 'rushant']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43576'
versions = ['Python 3.6']

rushant001 · 2021-03-21T04:46:53Z

# -*- coding: utf-8 -*-
import os
job_name = os.environ['a']
print(job_name)
print(isinstance(job_name, str))
print(type(job_name))
with open('name.txt', 'w', encoding='utf-8') as fw:
    fw.write(job_name)

I set the environment variable with:
export a="中文"
The script then fails with this error:
中文
True
<class 'str'>
Traceback (most recent call last):
  File "aa.py", line 8, in <module>
    fw.write(job_name)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-5: surrogates not allowed

terryjreedy · 2021-03-26T19:09:18Z

3.6 only gets security patches. You or someone else needs to show an unfixed bug in master. Your code runs for me on Windows, whereas you appear to be using *nix. Replacing the write with job_name.encode() should show the same behavior. Do you see the same error with job_name="中文" at the top instead?

eryksun · 2021-03-26T20:45:35Z

I think this is a locale configuration problem, in which the locale encoding doesn't match the terminal encoding. If so, it can be closed as not a bug.

> export a="中文"

In POSIX, the shell reads "中文" from the terminal as bytes encoded in the terminal encoding, which could be UTF-8 or some legacy encoding. The value of a is set directly as this encoded text. There is no intermediate decode/encode stage in the shell. For a child process that decodes the value of the environment variable, as Python does, the locale's LC_CTYPE encoding should be the same or compatible with the terminal encoding.
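
To make the byte-level point concrete, here is a minimal sketch of what the shell would place in the environment for the same two characters under two different terminal encodings. GBK is used only as an example of a legacy encoding; the report doesn't say which terminal encoding was actually in use.

```python
# The same text yields different bytes depending on the terminal encoding.
text = "中文"

utf8_bytes = text.encode("utf-8")  # what a UTF-8 terminal would send
gbk_bytes = text.encode("gbk")     # what a GBK terminal would send (illustrative)

print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87' (6 bytes)
print(gbk_bytes)   # b'\xd6\xd0\xce\xc4' (4 bytes)

# Decoding with a mismatched encoding fails (or, in other cases, yields wrong text).
try:
    gbk_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError as exc:
    mismatch_detected = True
    print("mismatch:", exc.reason)
```

Six undecodable bytes would be consistent with the "position 0-5" range in the reported traceback, which suggests (but doesn't prove) a UTF-8 terminal combined with an ASCII-only locale.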

> job_name = os.environ['a']
> print(job_name)

In POSIX, sys.stdout.errors, as used by print(), will be "surrogateescape" if the default LC_CTYPE locale is a legacy locale -- which in 3.6 is the case for the "C" locale, since it's usually limited to 7-bit ASCII. "surrogateescape" is also the errors handler for decoding bytes os.environb (POSIX) as text os.environ. When decoding, "surrogateescape" handles non-ASCII byte values that can't be decoded by translating the value into the reserved surrogate range U+DC80 - U+DCFF. When encoding, it translates each surrogate code back to the original byte value in the range 0x80 - 0xFF.

Given the above setup, byte sequences in os.environb that can't be decoded with the default LC_CTYPE locale encoding will be surrogate-escaped in the decoded text. The surrogate-escaped values round-trip back to bytes when printed, presumably in the terminal encoding.
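
The round trip described above can be reproduced without touching the environment. This is a sketch under two assumptions that the report doesn't confirm: the variable holds the UTF-8 bytes of "中文", and the locale encoding is ASCII.

```python
# Assumption: the variable holds UTF-8 bytes and the locale encoding is ASCII.
raw = "中文".encode("utf-8")

# Decoding: each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
escaped = raw.decode("ascii", errors="surrogateescape")
print(ascii(escaped))  # '\udce4\udcb8\udcad\udce6\udc96\udc87'
print(len(escaped))    # 6

# Encoding with the same handler restores the original bytes.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True
```

This is why print() appears to work: the escaped string is turned back into the original bytes on the way out, and the terminal renders those bytes itself.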

> with open('name.txt', 'w', encoding='utf-8') as fw:
>     fw.write(job_name)

The default errors handler for open() is "strict" instead of "surrogateescape", so the surrogate-escaped values in job_name cause the encoding to fail.
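
The difference between the two error handlers can be sketched as follows. The value of job_name is simulated here (UTF-8 bytes decoded under an ASCII locale, which is an assumption), and the file is written to a temporary directory rather than the reporter's working directory.

```python
import os
import tempfile

# Assumption: job_name is what os.environ would return under an ASCII locale
# when the variable actually holds the UTF-8 bytes of "中文".
job_name = "中文".encode("utf-8").decode("ascii", errors="surrogateescape")

path = os.path.join(tempfile.mkdtemp(), "name.txt")

# Default errors="strict": lone surrogates cannot be encoded as UTF-8.
try:
    with open(path, "w", encoding="utf-8") as fw:
        fw.write(job_name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="surrogateescape" writes the original bytes back unchanged.
with open(path, "w", encoding="utf-8", errors="surrogateescape") as fw:
    fw.write(job_name)

with open(path, "rb") as fr:
    written = fr.read()

print(strict_failed)                      # True
print(written == "中文".encode("utf-8"))  # True
```

Passing errors="surrogateescape" only copies the original bytes through. The resulting file is valid UTF-8 only if those bytes were UTF-8 to begin with, so this works around the symptom rather than fixing the locale mismatch.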

> Your code runs for me on Windows

In Windows, Python uses the wide-character (16-bit wchar_t) environment of the process for os.environ, and, in 3.6+, it uses the console session's wide-character API for console files such as sys.std* when they aren't redirected to a pipe or disk file. Conventionally, wide-character strings should be valid UTF-16LE text. So getting "中文" from os.environ and printing it should 'just work'. The output will even be displayed correctly if the console session uses a font that supports "中文", or if it's a pseudoconsole (conpty) session that's attached to a terminal that supports automatic font fallback, such as Windows Terminal.

vstinner · 2021-03-26T23:12:04Z

Python works as expected: the UTF-8 codec doesn't allow encoding surrogate characters.

The surrogate characters come from os.environ['a'], because this environment variable contains bytes that cannot be decoded with the encoding returned by sys.getfilesystemencoding().

You should fix your system setup, especially the locale encoding. The string stored in the "a" environment variable was not encoded with the Python filesystem encoding:
https://docs.python.org/dev/glossary.html#term-filesystem-encoding-and-error-handler

If you are lost with locale encodings, you can attempt to encode everything in UTF-8 and enable the Python UTF-8 Mode:
https://docs.python.org/dev/library/os.html#python-utf-8-mode

Good luck with your setup ;-)

Hint: use print(ascii(job_name)) to dump the string content.
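
Building on that hint, a small diagnostic sketch can show which encodings Python has picked up. The function name describe_encodings is hypothetical; the calls it wraps are standard library functions.

```python
import locale
import sys


def describe_encodings():
    """Collect the encoding settings relevant to this issue (hypothetical helper)."""
    return {
        "filesystem encoding": sys.getfilesystemencoding(),
        "preferred encoding": locale.getpreferredencoding(False),
        # getattr guards against a replaced sys.stdout lacking these attributes.
        "stdout encoding": getattr(sys.stdout, "encoding", None),
        "stdout errors": getattr(sys.stdout, "errors", None),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")

# ascii() exposes surrogate escapes that print() would silently round-trip.
sample = b"\xe4\xb8\xad".decode("ascii", errors="surrogateescape")
print(ascii(sample))  # '\udce4\udcb8\udcad'
```

If the filesystem encoding is reported as ASCII while the terminal is sending UTF-8, that would match the mismatch described in this thread.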

vstinner · 2021-03-26T23:13:42Z

Oh, I forgot to note that Windows is not affected by this issue, since Windows provides environment variables directly as Unicode, so Python doesn't need to decode byte strings to read os.environ['a'] ;-)

eryksun · 2021-03-27T00:06:44Z

> If you are lost with locale encodings, you can attempt to encode
> everything in UTF-8 and enable the Python UTF-8 Mode:

rushant is using Python 3.6. UTF-8 mode was added in 3.7, so it's not an option without first upgrading to 3.7. Also, it's important to note that the suggestion to "attempt to encode everything in UTF-8" includes whatever terminal encoding or shell-script file encoding is used for export a="中文". If it's not using UTF-8, then setting the preferred encoding in Python to UTF-8 isn't going to help.
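
For code that has to keep running on 3.6 with a misconfigured locale, one possible workaround is to undo the surrogate-escaped decode and redo it with the encoding the terminal really used. The helper below is hypothetical and is only a sketch: both encodings must be known in advance, since neither can be detected reliably from the value itself.

```python
def recover_text(value, locale_encoding, actual_encoding="utf-8"):
    """Undo a surrogate-escaped decode and redo it with the right encoding.

    Hypothetical helper: `locale_encoding` is the encoding Python used to
    decode the environment, `actual_encoding` is what the terminal really
    sent. Both are assumptions supplied by the caller.
    """
    raw = value.encode(locale_encoding, errors="surrogateescape")
    return raw.decode(actual_encoding)


# Simulate os.environ['a'] under an ASCII locale with a UTF-8 terminal.
job_name = "中文".encode("utf-8").decode("ascii", errors="surrogateescape")
recovered = recover_text(job_name, "ascii")
print(recovered == "中文")  # True
```

On POSIX, os.fsencode(value) performs the first step using the filesystem encoding, and os.environb exposes the raw bytes directly, so either can replace the explicit encode call. Fixing the locale remains the cleaner solution.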

rushant001 (mannequin) added the topic-C-API and type-bug labels Mar 21, 2021

eryksun added the interpreter-core, stdlib, topic-unicode, and topic-IO labels and removed the topic-C-API label Mar 26, 2021

vstinner closed this as completed Mar 26, 2021

vstinner added the invalid label Mar 26, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022
