python3.6.4 os.environ error when write chinese to file #87742

Closed
rushant001 mannequin opened this issue Mar 21, 2021 · 6 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-IO topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@rushant001
Mannequin

rushant001 mannequin commented Mar 21, 2021

BPO 43576
Nosy @terryjreedy, @vstinner, @ezio-melotti, @eryksun, @rushant001

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2021-03-26.23:12:04.502>
created_at = <Date 2021-03-21.04:46:52.916>
labels = ['interpreter-core', 'type-bug', 'expert-IO', 'invalid', 'library', 'expert-unicode']
title = 'python3.6.4 os.environ error when write chinese to file'
updated_at = <Date 2021-03-27.00:06:43.938>
user = 'https://github.com/rushant001'

bugs.python.org fields:

activity = <Date 2021-03-27.00:06:43.938>
actor = 'eryksun'
assignee = 'none'
closed = True
closed_date = <Date 2021-03-26.23:12:04.502>
closer = 'vstinner'
components = ['Interpreter Core', 'Library (Lib)', 'Unicode', 'IO']
creation = <Date 2021-03-21.04:46:52.916>
creator = 'rushant'
dependencies = []
files = []
hgrepos = []
issue_num = 43576
keywords = []
message_count = 6.0
messages = ['389215', '389560', '389568', '389571', '389572', '389579']
nosy_count = 5.0
nosy_names = ['terry.reedy', 'vstinner', 'ezio.melotti', 'eryksun', 'rushant']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43576'
versions = ['Python 3.6']

@rushant001
Mannequin Author

rushant001 mannequin commented Mar 21, 2021

# -*- coding: utf-8 -*-
import os
job_name = os.environ['a']
print(job_name)
print(isinstance(job_name, str))
print(type(job_name))
with open('name.txt', 'w', encoding='utf-8') as fw:
    fw.write(job_name)

I set the environment variable with:

export a="中文"

The script fails with this error:

中文
True
<class 'str'>
Traceback (most recent call last):
  File "aa.py", line 8, in <module>
    fw.write(job_name)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-5: surrogates not allowed

@rushant001 rushant001 mannequin added topic-C-API type-bug An unexpected behavior, bug, or error labels Mar 21, 2021
@terryjreedy
Member

3.6 only gets security patches. You or someone else needs to show an unfixed bug in master. Your code runs for me on Windows, whereas you appear to be using *nix. Replacing the file write with job_name.encode() should show the same behavior. Do you see the same error with job_name="中文" at the top instead?

@eryksun
Contributor

eryksun commented Mar 26, 2021

I think this is a locale configuration problem, in which the locale encoding doesn't match the terminal encoding. If so, it can be closed as not a bug.

export a="中文"

In POSIX, the shell reads "中文" from the terminal as bytes encoded in the terminal encoding, which could be UTF-8 or some legacy encoding. The value of a is set directly as this encoded text. There is no intermediate decode/encode stage in the shell. For a child process that decodes the value of the environment variable, as Python does, the locale's LC_CTYPE encoding should be the same or compatible with the terminal encoding.

job_name = os.environ['a']
print(job_name)

In POSIX, sys.stdout.errors, as used by print(), will be "surrogateescape" if the default LC_CTYPE locale is a legacy locale -- which in 3.6 is the case for the "C" locale, since it's usually limited to 7-bit ASCII. "surrogateescape" is also the error handler used to decode the bytes in os.environb (POSIX) into the text in os.environ. When decoding, "surrogateescape" handles non-ASCII byte values that can't be decoded by translating each such byte into the reserved surrogate range U+DC80 - U+DCFF. When encoding, it translates each surrogate code back to the original byte value in the range 0x80 - 0xFF.
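
The decode/encode round trip described above can be reproduced without touching the environment. This is only a sketch: it assumes the terminal sent UTF-8 bytes and that the locale encoding is ASCII, and it calls the codecs directly instead of going through os.environ.

```python
# UTF-8 bytes for "中文", as a UTF-8 terminal would hand them to the shell.
raw = "中文".encode("utf-8")

# Assumption: the locale encoding is ASCII, so none of these bytes decode.
# "surrogateescape" maps each undecodable byte 0xNN to the code point U+DCNN.
escaped = raw.decode("ascii", errors="surrogateescape")
print(len(escaped))    # one lone surrogate per undecodable byte
print(ascii(escaped))

# Encoding with the same handler restores the original bytes.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)
```

The six escaped code points correspond to "characters in position 0-5" in the reported traceback.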

Given the above setup, byte sequences in os.environb that can't be decoded with the default LC_CTYPE locale encoding will be surrogate-escaped in the decoded text. The surrogate-escaped values roundtrip back to bytes when printed, presumably in the terminal encoding.

with open('name.txt', 'w', encoding='utf-8') as fw:
    fw.write(job_name)

The default errors handler for open() is "strict" instead of "surrogateescape", so the surrogate-escaped values in job_name cause the encoding to fail.
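
To illustrate the difference between the two error handlers, here is a minimal sketch. The surrogate-escaped string is built by hand (assuming UTF-8 bytes decoded under an ASCII locale), and the temporary file path is only for the example. Passing errors="surrogateescape" to open() is one possible workaround, not a substitute for fixing the locale.

```python
import os
import tempfile

# Stand-in for os.environ['a'] under a mismatched (ASCII) locale.
job_name = "中文".encode("utf-8").decode("ascii", errors="surrogateescape")
path = os.path.join(tempfile.mkdtemp(), "name.txt")

# Default errors="strict": the UTF-8 codec refuses lone surrogates.
strict_error = None
try:
    with open(path, "w", encoding="utf-8") as fw:
        fw.write(job_name)
except UnicodeEncodeError as exc:
    strict_error = exc
    print(exc.reason)

# errors="surrogateescape" writes the original bytes back out unchanged.
with open(path, "w", encoding="utf-8", errors="surrogateescape") as fw:
    fw.write(job_name)

# The file now holds valid UTF-8, so a strict read succeeds.
with open(path, encoding="utf-8") as fr:
    text = fr.read()
print(text)
```

This only yields readable text because the escaped bytes happened to be valid UTF-8 in the first place.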

Your code runs for me on Windows

In Windows, Python uses the wide-character (16-bit wchar_t) environment of the process for os.environ, and, in 3.6+, it uses the console session's wide-character API for console files such as sys.std* when they aren't redirected to a pipe or disk file. Conventionally, wide-character strings should be valid UTF-16LE text. So getting "中文" from os.environ and printing it should 'just work'. The output will even be displayed correctly if the console session uses a font that supports "中文", or if it's a pseudoconsole (conpty) session that's attached to a terminal that supports automatic font fallback, such as Windows Terminal.

@eryksun eryksun added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode topic-IO and removed topic-C-API labels Mar 26, 2021
@vstinner
Member

Python works as expected: the UTF-8 codec does not allow encoding surrogate characters.

The surrogate characters come from os.environ['a'] because this environment variable contains bytes that cannot be decoded using the encoding returned by sys.getfilesystemencoding().

You should fix your system setup, especially the locale encoding. The string stored in the "a" environment variable was not encoded with the Python filesystem encoding:
https://docs.python.org/dev/glossary.html#term-filesystem-encoding-and-error-handler

If you are lost with locale encodings, you can attempt to encode everything in UTF-8 and enable the Python UTF-8 Mode:
https://docs.python.org/dev/library/os.html#python-utf-8-mode
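
As a rough diagnostic, the following sketch prints the settings that decide how os.environ is decoded. The variable names are only illustrative. sys.flags.utf8_mode exists from Python 3.7 on, so the snippet falls back to None on older versions.

```python
import locale
import sys

# Encoding and error handler used for os.environ, file names, and argv.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()  # available since Python 3.6

# Locale encoding used as the default for open() when encoding is omitted.
preferred = locale.getpreferredencoding(False)

# 1 if UTF-8 Mode is enabled, 0 if not, None before Python 3.7.
utf8_mode = getattr(sys.flags, "utf8_mode", None)

print("filesystem encoding:", fs_encoding, "/", fs_errors)
print("preferred encoding: ", preferred)
print("UTF-8 Mode:         ", utf8_mode)
```

On Python 3.7 or later, UTF-8 Mode can be enabled with the -X utf8 command-line option or the PYTHONUTF8=1 environment variable.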

Good luck with your setup ;-)

Hint: use print(ascii(job_name)) to dump the string content.
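
A small sketch of what that hint reveals. The helper name has_lone_surrogates is hypothetical, and the escaped string is constructed by hand under the assumption of UTF-8 bytes decoded with an ASCII locale.

```python
def has_lone_surrogates(text):
    """Return True if text contains code points produced by surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

healthy = "中文"
# Stand-in for the value read from os.environ under a mismatched locale.
job_name = healthy.encode("utf-8").decode("ascii", errors="surrogateescape")

print(ascii(healthy))    # '\u4e2d\u6587'
print(ascii(job_name))   # escapes in the \udc80-\udcff range instead
print(has_lone_surrogates(healthy), has_lone_surrogates(job_name))
```

Both strings can look identical when printed to a terminal, which is why dumping them with ascii() is useful.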

@vstinner
Member

Oh, I forgot to note that Windows is not affected by this issue, since Windows provides environment variables directly as Unicode, so Python doesn't need to decode byte strings to read os.environ['a'] ;-)

@eryksun
Contributor

eryksun commented Mar 27, 2021

If you are lost with locale encodings, you can attempt to encode
everything in UTF-8 and enable the Python UTF-8 Mode:

rushant is using Python 3.6. UTF-8 mode was added in 3.7, so it's not an option without first upgrading to 3.7. Also, it's important to note that the suggestion to "attempt to encode everything in UTF-8" includes whatever terminal encoding or shell-script file encoding is used for export a="中文". If it's not using UTF-8, then setting the preferred encoding in Python to UTF-8 isn't going to help.
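
To make the last point concrete, here is a sketch that assumes a terminal using GBK, chosen only as an example of a legacy encoding. Decoding those bytes as UTF-8 still produces surrogate escapes, so forcing UTF-8 on the Python side alone does not recover the text.

```python
# Assumption: the terminal encoded "中文" as GBK when the variable was exported.
gbk_bytes = "中文".encode("gbk")

# Python decoding the environment as UTF-8 cannot make sense of these bytes.
decoded = gbk_bytes.decode("utf-8", errors="surrogateescape")
print(ascii(decoded))

# The text is only recoverable by undoing the escape and decoding as GBK.
recovered = decoded.encode("utf-8", errors="surrogateescape").decode("gbk")
print(recovered)
```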

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022