[subprocess] Better Unicode support for shell=True on Windows #77961

Closed
YoniRozenshein mannequin opened this issue Jun 6, 2018 · 5 comments
Labels
3.7 (EOL): end of life; 3.8: only security fixes; OS-windows; stdlib: Python modules in the Lib dir; type-feature: a feature request or enhancement

Comments

@YoniRozenshein
Mannequin

YoniRozenshein mannequin commented Jun 6, 2018

BPO 33780
Nosy @pfmoore, @vstinner, @giampaolo, @tjguk, @zware, @eryksun, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2021-03-16.03:46:03.307>
created_at = <Date 2018-06-06.10:01:53.017>
labels = ['3.7', '3.8', 'type-feature', 'library', 'OS-windows']
title = '[subprocess] Better Unicode support for shell=True on Windows'
updated_at = <Date 2021-03-16.03:46:03.307>
user = 'https://bugs.python.org/YoniRozenshein'

bugs.python.org fields:

activity = <Date 2021-03-16.03:46:03.307>
actor = 'eryksun'
assignee = 'none'
closed = True
closed_date = <Date 2021-03-16.03:46:03.307>
closer = 'eryksun'
components = ['Library (Lib)', 'Windows']
creation = <Date 2018-06-06.10:01:53.017>
creator = 'Yoni Rozenshein'
dependencies = []
files = []
hgrepos = []
issue_num = 33780
keywords = []
message_count = 5.0
messages = ['318807', '318868', '318870', '319810', '388805']
nosy_count = 8.0
nosy_names = ['paul.moore', 'vstinner', 'giampaolo.rodola', 'tim.golden', 'zach.ware', 'eryksun', 'steve.dower', 'Yoni Rozenshein']
pr_nums = []
priority = 'normal'
resolution = 'third party'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue33780'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

@YoniRozenshein
Mannequin Author

YoniRozenshein mannequin commented Jun 6, 2018

In subprocess, shell=True is implemented on Windows by launching a subprocess using {comspec} /c "{args}" (normally comspec=cmd.exe).

By default, the output of cmd is encoded with the "active" codepage. In Python 3.6, you can decode this using encoding='oem'.

However, this actually loses information. For example, try creating a file with a filename in a language that is not your active codepage, and then doing subprocess.check_output('dir', shell=True). In the output, the filename is replaced with question marks (not by Python, by cmd!).

To get the correct output, cmd has a "/u" switch (this switch has probably existed forever - at least since Windows NT 4.0, by my internet search). The output can then be decoded using encoding='utf-16-le', like any native Windows string.

Currently, Popen constructs the command line in this hardcoded format: {comspec} /c "{args}", so you can't get the /u in there with the shell=True shortcut, and have to write your own wrapping code.
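For illustration, a minimal sketch of such wrapping code. It assumes the same {comspec} /c "{args}" format described above, with /u inserted before /c. The helper names build_unicode_shell_command and check_output_unicode are hypothetical, and /u only affects the output of CMD's own built-in commands such as dir:

```python
import os
import subprocess
import sys


def build_unicode_shell_command(command):
    """Build a CMD command line like shell=True does, plus the /u switch."""
    comspec = os.environ.get("COMSPEC", "cmd.exe")
    # /u makes CMD write the output of its internal commands as UTF-16LE.
    return '{} /u /c "{}"'.format(comspec, command)


def check_output_unicode(command):
    """Run `command` through CMD with /u and decode its output."""
    raw = subprocess.check_output(build_unicode_shell_command(command))
    return raw.decode("utf-16-le")


if sys.platform == "win32":
    # Only meaningful on Windows; elsewhere there is no CMD to run.
    print(check_output_unicode("dir"))
```

Note that shell=True is not passed here, since the command line already names the shell explicitly.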

I suggest adding a feature to Popen where /u may be inserted before the /c within the shell=True shortcut. I've thought of several ways to implement this:

  1. A new argument to Popen, which indicates that we want Unicode shell output; if True, add the /u. Note that we already have a couple of Windows-only arguments to Popen, so this would not set a precedent.

  2. If the encoding argument is 'utf-16-le' or one of its aliases, then add the /u.

  3. If the encoding argument is not None, then add the /u.

@YoniRozenshein YoniRozenshein mannequin added the 3.7 (EOL), 3.8, stdlib, and type-feature labels Jun 6, 2018
@eryksun
Contributor

eryksun commented Jun 7, 2018

> To get the correct output, cmd has a "/u" switch (this switch has probably existed forever - at least since Windows NT 4.0, by my internet search). The output can then be decoded using encoding='utf-16-le', like any native Windows string.

However, the /u switch doesn't affect how CMD reads from stdin when it's a disk file or pipe. For example, set /p will stop at the first NUL byte. In general this is mismatched with subprocess, which provides a single encoding value for all 3 standard I/O streams. For example:

    >>> r = subprocess.run('cmd /d /v /u /c "set /p spam= & echo !spam!"',
    ...     capture_output=True, input='spam', encoding='oem')
    >>> r.stdout
    's\x00p\x00a\x00m\x00\n\x00\n\x00'

With UTF-16 input, CMD only reads up to "s" instead of reading the entire "s\x00p\x00a\x00m\x00" string that was written to stdin:

    >>> r = subprocess.run('cmd /d /v /u /c "set /p spam= & echo !spam!"',
    ...     capture_output=True, input='spam', encoding='utf-16le')
    >>> r.stdout
    's\n'

> 1. A new argument to Popen

This may lead to confusion and false bug reports by people who expect the setting to also affect external programs run via the shell (e.g. tasklist.exe). It's also not consistent with how CMD reads from stdin, as shown above.

I can see the use of adding a cross-platform get_shell_path() function that returns the fully-qualified path to the shell that's used by shell=True. This way programs don't have to figure it out on their own if they need custom shell options.

Common CMD shell options in Windows include /d (skip AutoRun commands), /v (enable delayed expansion of environment variables via "!"), /e (enable command extensions), /k (remain running after the command), and /u. I'd prefer subprocess to use /d by default. It's strange that the CRT's system() command doesn't use it.

Currently the shell path can be "/bin/sh" or "/system/bin/sh" in POSIX and os.environ.get("COMSPEC", "cmd.exe") in Windows. I'd prefer that Windows instead used:

    shell_path = os.path.abspath(os.environ.get('ComSpec',
                    os.path.join(_winapi.GetSystemDirectory(), 'cmd.exe')))

i.e. never use an unqualified, relative path such as "cmd.exe".

Instead of the single-use GetSystemDirectory function, it could use _winapi.SHGetKnownFolderPath(_winapi.FOLDERID_System), or _winapi.SHGetKnownFolderPath('{1AC14E77-02E7-4E5D-B744-2EB1AE5198B7}') if the GUID constants aren't added.
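A rough sketch of what the proposed get_shell_path() might look like. This is not an existing subprocess API. As an assumption, the Windows branch calls kernel32's GetSystemDirectoryW through ctypes as a stand-in for the _winapi functions discussed above:

```python
import ctypes
import os
import sys


def get_shell_path():
    """Return a fully-qualified path to the shell used by shell=True."""
    if sys.platform != "win32":
        # Mirrors the two POSIX paths mentioned above.
        if hasattr(sys, "getandroidapilevel"):
            return "/system/bin/sh"
        return "/bin/sh"
    comspec = os.environ.get("ComSpec")
    if not comspec:
        # Ask Windows for the system directory instead of falling back
        # to an unqualified "cmd.exe".
        buf = ctypes.create_unicode_buffer(260)
        length = ctypes.windll.kernel32.GetSystemDirectoryW(buf, len(buf))
        if length == 0 or length >= len(buf):
            raise OSError("GetSystemDirectoryW failed")
        comspec = os.path.join(buf.value, "cmd.exe")
    return os.path.abspath(comspec)


print(get_shell_path())
```

A program that needs custom shell options such as /d or /u could then build its own command line from this path instead of relying on shell=True.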

@eryksun
Contributor

eryksun commented Jun 7, 2018

> By default, the output of cmd is encoded with the "active" codepage. In Python 3.6, you can decode this using encoding='oem'.

FYI, the actual encoding is not necessarily "oem".

The console codepage may have been changed from the initial value by a SetConsoleCP call in the current process or another process (e.g. chcp.com, mode.com). For example, a batch script can switch to codepage 65001 to allow CMD to read a UTF-8 encoded batch file; or read UTF-8 from an external command in a for /f loop; or write UTF-8 to a disk file or pipe.

(Only switch to codepage 65001 temporarily. Using UTF-8 for legacy console I/O is buggy. CMD, PowerShell, and Python 3.6+ aren't affected since they use the wide-character API for console I/O. But a legacy console application that uses the codepage implicitly with ReadFile and WriteFile for byte-based I/O may get invalid results such as reading a non-ASCII character as NUL, or the entire read failing, or writing garbage to the console after output that contains non-ASCII characters.)

To accommodate applications that use the current console codepage for standard I/O, Python could add two encodings that correspond to the current value of GetConsoleCP and GetConsoleOutputCP (e.g. named "conin" and "conout").
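Until something like that exists, a program can query the two codepages itself. A minimal sketch, assuming ctypes access to kernel32. The function names are hypothetical, and not every Windows codepage has a matching Python codec, so the returned name may still need checking with codecs.lookup:

```python
import ctypes
import sys


def codepage_to_encoding(codepage):
    """Map a Windows codepage number to a Python codec name, or None."""
    if codepage == 0:
        # GetConsoleCP and GetConsoleOutputCP return 0 without a console.
        return None
    if codepage == 65001:
        return "utf-8"
    return "cp{}".format(codepage)


def console_encodings():
    """Return (input, output) codec names for the current console."""
    if sys.platform != "win32":
        return None, None
    kernel32 = ctypes.windll.kernel32
    return (codepage_to_encoding(kernel32.GetConsoleCP()),
            codepage_to_encoding(kernel32.GetConsoleOutputCP()))


print(console_encodings())
```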

Additionally, we can't assume the console codepage is initially OEM. It depends on settings in the registry or the shell shortcut for the application that allocated the console. In particular, if a new console window is allocated by a process (either explicitly via AllocConsole or implicitly for a console app that either hasn't inherited a console or was created with the CREATE_NEW_CONSOLE or CREATE_NO_WINDOW creation flag), then the console loads custom settings from either the registry key "HKCU\Console\<window title>" or the shell shortcut (LNK file) that started the application.

If the console uses the window-title registry key, it looks for a "CodePage" DWORD value. The key name is the normalized window title, which comes from the WindowTitle field of the process parameters. This can be set explicitly using the STARTUPINFO lpTitle field that's passed to CreateProcess. Otherwise the system uses the executable path as the default window title. The console normalizes the title string to create a valid key name by replacing backslash with underscore, and it also substitutes "%SystemRoot%" for the Windows directory, e.g. the default configuration key for CMD is "HKCU\Console\%SystemRoot%_system32_cmd.exe".
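The title normalization described above can be sketched as follows. This only illustrates the two substitutions mentioned (backslash to underscore, Windows directory to "%SystemRoot%"). The function name is hypothetical, and the case-insensitive prefix match is an assumption rather than documented console behavior:

```python
import os


def console_registry_key(window_title, system_root=None):
    """Return the per-title console settings key for a window title."""
    if system_root is None:
        system_root = os.environ.get("SystemRoot", r"C:\Windows")
    rest = window_title[len(system_root):]
    # Assumption: the Windows directory prefix is matched case-insensitively.
    if (window_title.lower().startswith(system_root.lower())
            and rest.startswith("\\")):
        window_title = "%SystemRoot%" + rest
    # Backslashes are not valid in a key name, so they become underscores.
    return "HKCU\\Console\\" + window_title.replace("\\", "_")


print(console_registry_key(r"C:\Windows\system32\cmd.exe", r"C:\Windows"))
```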

The codepage can also be set in a shell shortcut (LNK file). When an application is started from a shell shortcut, the shell sets the STARTUPINFO flag STARTF_TITLEISLINKNAME and the lpTitle string to the fully-qualified path of the LNK file. In this case, the console reads the LNK file to load its settings, rather than using the window-title subkey in the registry. But the "HKCU\Console" root key is still used for the default settings.

Finally, if CMD is run without a console (i.e. using the DETACHED_PROCESS creation flag), the default codepage is ANSI, not OEM. This isn't hard-coded in CMD. It happens that GetConsoleCP returns 0 (i.e. CP_ACP) in this case.

@YoniRozenshein
Mannequin Author

YoniRozenshein mannequin commented Jun 17, 2018

After reading your messages and especially after reading https://bugs.python.org/issue27179#msg267091 I admit I have been convinced this is much more complicated than I thought, and maybe more of a Windows bug than a Python bug :)

@eryksun
Contributor

eryksun commented Mar 16, 2021

The complexity of mixing standard I/O from the shell and external programs is a limitation of the Windows command line. Each program could choose to use the system (or process) ANSI or OEM code page, the console session's input or output code page, UTF-8, or UTF-16. There's no uniform way to enforce one consistent choice. So I'm closing this issue as a third-party limitation that cannot be addressed in general. The problem has to be handled on a case-by-case basis.

@eryksun eryksun closed this as completed Mar 16, 2021
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022