python3.6.4 os.environ error when write chinese to file #87742
Comments
# -*- coding: utf-8 -*-
import os
job_name = os.environ['a']
print(job_name)
print(isinstance(job_name, str))
print(type(job_name))
with open('name.txt', 'w', encoding='utf-8') as fw:
    fw.write(job_name)
I set the environment variable with:
export a="中文"
It returns this error:
中文
True
<class 'str'>
Traceback (most recent call last):
  File "aa.py", line 8, in <module>
    fw.write(job_name)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-5: surrogates not allowed |
3.6 only gets security patches, so you or someone else needs to show that the bug is still present in master. Your code runs for me on Windows, whereas you appear to be using *nix. Replacing the file write with job_name.encode() should show the same behavior. Do you see the same error with job_name="中文" at the top instead? |
I think this is a locale configuration problem, in which the locale encoding doesn't match the terminal encoding. If so, it can be closed as not a bug.
In POSIX, the shell reads "中文" from the terminal as bytes encoded in the terminal encoding, which could be UTF-8 or some legacy encoding. The value of
In POSIX, sys.stdout.errors, as used by print(), will be "surrogateescape" if the default LC_CTYPE locale is a legacy locale. In 3.6 that is the case for the "C" locale, since it's usually limited to 7-bit ASCII. "surrogateescape" is also the errors handler for decoding the bytes in os.environb (POSIX) to the text in os.environ. When decoding, "surrogateescape" handles non-ASCII byte values that can't be decoded by translating each value into the reserved surrogate range U+DC80 - U+DCFF. When encoding, it translates each surrogate code back to the original byte value in the range 0x80 - 0xFF. Given the above setup, byte sequences in os.environb that can't be decoded with the default LC_CTYPE locale encoding will be surrogate escaped in the decoded text. The surrogate-escaped values round-trip back to bytes when printed, presumably in the terminal encoding.
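To make the round trip concrete, here is a minimal sketch. It assumes the terminal sent "中文" as UTF-8 bytes and that the locale encoding is ASCII (the "C" locale case); it imitates the os.environ decoding with bytes.decode() rather than reading a real environment variable.

```python
# Assumption: the terminal encoded "中文" as UTF-8, giving these six bytes.
raw = b"\xe4\xb8\xad\xe6\x96\x87"

# Decode as ASCII with surrogateescape, imitating what os.environ does
# under an ASCII-only locale. Each undecodable byte 0xNN becomes U+DCNN.
escaped = raw.decode("ascii", errors="surrogateescape")
code_points = [hex(ord(ch)) for ch in escaped]

# Encoding with the same handler restores the original bytes.
round_tripped = escaped.encode("ascii", errors="surrogateescape")

print(len(escaped), code_points)
print(round_tripped == raw)
```

The six escaped characters line up with "position 0-5" in the reported UnicodeEncodeError.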
The default errors handler for open() is "strict" instead of "surrogateescape", so the surrogate-escaped values in job_name cause the encoding to fail.
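A sketch of the failure and two possible workarounds follows. The surrogate-escaped job_name is simulated with bytes.decode(), and the temporary file path is only for illustration. Option B assumes the original bytes really are UTF-8, which is a guess about the reporter's terminal, not something the thread confirms.

```python
import os
import tempfile

# Simulated value of os.environ['a'] under an ASCII locale (assumption).
job_name = b"\xe4\xb8\xad\xe6\x96\x87".decode("ascii", errors="surrogateescape")

# The "strict" handler, which open() uses by default, rejects surrogates.
try:
    job_name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Option A: write with the same handler so the original bytes pass through.
path = os.path.join(tempfile.mkdtemp(), "name.txt")
with open(path, "w", encoding="utf-8", errors="surrogateescape") as fw:
    fw.write(job_name)
with open(path, "rb") as fr:
    written = fr.read()

# Option B: recover the bytes, then decode them with the encoding
# the terminal actually used (assumed here to be UTF-8).
repaired = job_name.encode("ascii", errors="surrogateescape").decode("utf-8")

print(strict_failed, written, repaired == "\u4e2d\u6587")
```

On a real POSIX system, os.fsencode(job_name) or os.environb[b'a'] should yield the same raw bytes without hard-coding the locale encoding. Fixing the locale itself remains the cleaner solution.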
In Windows, Python uses the wide-character (16-bit wchar_t) environment of the process for os.environ, and, in 3.6+, it uses the console session's wide-character API for console files such as sys.std* when they aren't redirected to a pipe or disk file. Conventionally, wide-character strings should be valid UTF-16LE text. So getting "中文" from os.environ and printing it should 'just work'. The output will even be displayed correctly if the console session uses a font that supports "中文", or if it's a pseudoconsole (conpty) session that's attached to a terminal that supports automatic font fallback, such as Windows Terminal. |
Python works as expected: the UTF-8 codec doesn't allow encoding surrogate characters. The surrogate characters come from os.environ['a'], because this environment variable contains bytes which cannot be decoded with sys.getfilesystemencoding(). You should fix your system setup, especially the locale encoding. The string stored in the "a" environment variable was not encoded with the Python filesystem encoding. If you are lost with locale encodings, you can attempt to encode everything in UTF-8 and enable the Python UTF-8 Mode. Good luck with your setup ;-) Hint: use print(ascii(job_name)) to dump the string content. |
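The ascii() hint can be turned into a small diagnostic. This is only a sketch: describe_value is a hypothetical helper name, and the surrogate-escaped input is simulated rather than read from a misconfigured environment.

```python
import sys


def describe_value(value):
    """Return (ascii dump, True if the string holds surrogate-escaped bytes)."""
    escaped = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in value)
    return ascii(value), escaped


# Simulated os.environ['a'] under an ASCII locale (assumption).
broken = b"\xe4\xb8\xad\xe6\x96\x87".decode("ascii", errors="surrogateescape")
broken_dump, broken_flag = describe_value(broken)

# The same text when it was decoded correctly.
clean_dump, clean_flag = describe_value("\u4e2d\u6587")

print(sys.getfilesystemencoding())
print(broken_dump, broken_flag)
print(clean_dump, clean_flag)
```

A dump full of \udcXX escapes means the bytes were not decodable with the filesystem encoding, while \u4e2d\u6587 means the variable was decoded as intended.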
Oh, I forgot to note that Windows is not affected by this issue, since Windows provides environment variables directly as Unicode, so Python doesn't need to decode byte strings to read os.environ['a'] ;-) |
rushant is using Python 3.6. UTF-8 mode was added in 3.7, so it's not an option without first upgrading to 3.7. Also, it's important to note that the suggestion to "attempt to encode everything in UTF-8" includes whatever terminal encoding or shell-script file encoding is used for |