Another attempt to fix encoding issues #4486

pfmoore · 2017-05-16T11:55:18Z

One more attempt to fix the various encoding issues we see. The main thing here is that we use the errors="replace" parameter when decoding subprocess output, so we should hopefully avoid many of the Unicode errors people are seeing.

There's also an attempt to choose (and document) a better encoding that we expect from build tools. There are known issues with this approach (see the comments under #4280) but it should at least improve things.

Partial fix for #4110, #4003, #4212. Note that this PR does not address any issues reported as happening on Python 2; it only changes Python 3 behaviour.

pfmoore · 2017-05-16T11:59:48Z

@pypa/pip-committers I'd appreciate reviews of this, if possible. I have no easy means of testing on non-English systems (Unix or Windows), and I know our tests don't exercise situations like that, so this change is at best theoretically OK. I don't think we have any committers using non-English systems, but if anyone can give it a once-over that would be great.

pfmoore · 2017-05-20T09:15:08Z

Made this WIP. See discussions on distutils-sig under "PEP 517 - specifying build system in pyproject.toml" and specifically https://mail.python.org/pipermail/distutils-sig/2017-May/030442.html. In summary, Visual C can produce output with an inconsistent (mixed) encoding, so trying to deal with that (beyond simply not crashing) is pointless.

xavfernandez

It would also be nice to incorporate something like https://github.com/danilaml/pip/blob/test-non-ascii-user/appveyor.yml (see also http://help.appveyor.com/discussions/questions/3530-is-it-possible-to-somehow-specify-the-usernamehostname-with-which-ci-tests-are-run) to check for future regressions.

xavfernandez · 2017-05-20T11:39:35Z

pip/compat.py

-            return s.decode(sys.__stdout__.encoding)
-        except UnicodeDecodeError:
-            return s.decode('utf_8')
+        return s.decode(subprocess_encoding(), errors="replace")


Ideally, I'd like pip to first try decoding without error handling and on UnicodeDecodeError switch to errors="replace" with a nice warning

OK, I'll work on that.
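
A minimal sketch of the "strict first, then lossy with a warning" behaviour requested above. The function name `decode_with_fallback` and the logger are hypothetical and are not pip's actual code; the warning text is borrowed from the diff quoted later in this thread.

```python
import logging

logger = logging.getLogger(__name__)


def decode_with_fallback(data, encoding):
    """Decode strictly first; on failure, warn and decode lossily."""
    try:
        # Happy path: the bytes really are in the expected encoding.
        return data.decode(encoding)
    except UnicodeDecodeError:
        logger.warning(
            "Subprocess output does not appear to be encoded as %s", encoding
        )
        # Undecodable bytes become U+FFFD instead of raising.
        return data.decode(encoding, errors="replace")


clean = decode_with_fallback(b"caf\xc3\xa9", "utf-8")  # valid UTF-8
lossy = decode_with_fallback(b"caf\xe9", "utf-8")      # cp1252 bytes, not UTF-8
```

The second call is the case the PR targets: the build tool wrote bytes in a different code page, so the strict decode fails and the fallback keeps pip running.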

pfmoore · 2017-05-20T19:37:44Z

One further thought. We need the output to be printable, so it must only use characters representable in sys.stdout.encoding. To make sure that's the case, I think we probably need another round of transcoding, to strip out any characters that can't be printed.

I believe the required dance is:

1. Try decoding using the expected encoding.
2. If this fails, print a warning that the data is encoded incorrectly, and then decode with errors=replace.
3. Re-encode in sys.stdout.encoding with errors=replace, to ensure that no non-printable characters are included.
4. Decode back to Unicode using sys.stdout.encoding.

Result is a Unicode string that's a "best possible" representation of the subprocess output, assuming the expected encoding, using only characters that can be printed to sys.stdout without error.
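
The steps above can be sketched roughly as follows (Python 3 only). `sanitise_for_output` and its `output_encoding` parameter are hypothetical names used for illustration; the parameter exists only so the example does not depend on the real terminal encoding.

```python
import sys


def sanitise_for_output(data, expected_encoding, output_encoding=None):
    """Return text that is safe to print, following the decode/encode dance."""
    # Decode strictly, then fall back to a lossy decode with a warning.
    try:
        text = data.decode(expected_encoding)
    except UnicodeDecodeError:
        print(
            "warning: data is not valid %s" % expected_encoding,
            file=sys.stderr,
        )
        text = data.decode(expected_encoding, errors="replace")

    # Round-trip through the output encoding so that only characters
    # the terminal can represent survive.
    if output_encoding is None:
        output_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(output_encoding, errors="replace").decode(output_encoding)


# A check mark cannot be represented in Latin-1, so it is replaced.
result = sanitise_for_output("na\u00efve \u2713".encode("utf-8"), "utf-8", "latin-1")
```

With `errors="replace"` on the encoding side, unrepresentable characters come out as a plain question mark, which is the lossy behaviour the later backslashreplace discussion tries to improve on.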

At the moment, we're only doing this for Python 3. But for Python 2 it's still possible for the output to contain bytes that aren't valid in sys.stdout.encoding. Should we therefore do the same thing for Python 2? It'll need to be modified a bit to deal with the nasty Python 2 string model, but I think we do need to clean the data so it doesn't include characters that can't be printed without error.

pfmoore · 2017-05-22T17:48:48Z

Just pushed an update that more robustly tidies up subprocess output to avoid encoding errors. It includes @xavfernandez's suggestion to warn if we can't use the expected encoding without loss, and chooses a default encoding based on discussions on distutils-sig.

Also, this version sanitises the data even on Python 2, as it's possible to get Unicode errors on output there, as well.

Reviews appreciated, but I'm intending to get this change in for the next version of pip, as even if we get the encoding details slightly wrong, switching to errors=replace is better regardless. Backslashreplace would be better still, but that's only allowed for decoding since Python 3.5. I guess I could do something like error_strategy = "backslashreplace" if sys.version_info > (3, 5) else "replace". What do people think? Is it worth it?

dstufft · 2017-05-22T22:57:24Z

I could do something like error_strategy = "backslashreplace" if sys.version_info > (3, 5) else "replace". What do people think? Is it worth it?

That seems reasonable to me, though I'd maybe ask to put it into the compat module to keep the "random crap we can maybe delete as we drop support for Py versions" as contained to one module as we can.

ncoghlan · 2017-05-23T05:06:19Z

Regarding backslashreplace as an error handler: you can use that safely for the encoding step regardless of version. The only point where it needs to be conditional is on the initial decoding step, as the change made in 3.5 was to add support for handling UnicodeDecodeError in addition to the existing handling of UnicodeEncodeError.

For 3.4 and earlier, rather than falling back to replace, you also have the option of registering a custom backslashreplace_decode error handler based on the examples in https://stackoverflow.com/questions/25442954/how-should-i-decode-bytes-using-ascii-without-losing-any-junk-bytes-if-xmlch/25443356#25443356

Something like:

    import codecs

    def backslashreplace_decode(err):
        # bytearray() yields integers on both Python 2 and Python 3
        raw_bytes = bytearray(err.object[err.start:err.end])
        return u"".join(u"\\x%02x" % c for c in raw_bytes), err.end

    >>> codecs.register_error("backslashreplace_decode", backslashreplace_decode)
    >>> b'\xd3PS-90AC'.decode("ascii", "backslashreplace_decode")
    u'\\xd3PS-90AC'

pradyunsg · 2017-05-23T05:06:20Z

docs/reference/pip.rst

+requires that the output is written in a well-defined encoding, specifically
+the encoding returned by python's ``locale.getpreferredencoding`` function, or
+"utf8" if ``getpreferredencoding`` does not return a value (or returns "ASCII",
+which ).


This isn't a complete sentence... :/

Meh. Teach me to rush the doc update :-( I'll revise the docs to fix this and take into account Nick's comments.

ncoghlan · 2017-05-23T05:12:12Z

docs/reference/pip.rst

+requires that the output is written in a well-defined encoding, specifically
+the encoding returned by python's ``locale.getpreferredencoding`` function, or
+"utf8" if ``getpreferredencoding`` does not return a value (or returns "ASCII",
+which ).


The most reliable incantation I know for detecting ASCII as the preferred encoding is: codecs.lookup(locale.getpreferredencoding()).name == "ascii"

The reason for that is that going through the codec system and then looking at the codec name automatically deals with the fact that different platforms may report the same encoding under different aliases. Most importantly for this purpose, some Linux systems report ASCII by its official spec number, 'ANSI_X3.4-1968'.

(The patch itself handles this correctly, but it makes sense to include it here for the benefit of build tool developers as well)
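
A small illustration of the alias normalisation described above. The helper names `is_ascii_encoding` and `expected_build_encoding` are made up for this sketch, and the UTF-8 fallback rule is taken from the documentation text under review rather than from pip's source.

```python
import codecs
import locale


def is_ascii_encoding(name):
    """True if `name` is any alias of ASCII known to the codec registry."""
    # codecs.lookup() resolves aliases, so the canonical codec name can be
    # compared directly instead of listing every platform-specific spelling.
    return codecs.lookup(name).name == "ascii"


def expected_build_encoding():
    """Preferred locale encoding, or UTF-8 when the locale reports ASCII."""
    preferred = locale.getpreferredencoding(False)
    if not preferred or is_ascii_encoding(preferred):
        return "utf-8"
    return preferred
```

Comparing the raw string against `"ascii"` would miss spellings such as `ANSI_X3.4-1968` or `US-ASCII`, which is why the lookup goes through the codec system first.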

I don't think we need that much detail, I'll reword to make the intent clear (that we use UTF-8 if the locale encoding is ASCII).

ncoghlan · 2017-05-23T05:16:47Z

docs/reference/pip.rst

+produce output in the correct encoding. In practice - and in particular
+on Windows, where tools are inconsistent in their use of the "OEM" and
+"ANSI" codepages - this may not always be possible, so pip will attempt to
+recover cleanly if presented with incorrectly encoded build tool output.


Given the backslashreplace discussions, perhaps append ", by translating unexpected byte sequences to Python-style hexadecimal escape sequences (\x80\xff, etc)."

ncoghlan · 2017-05-23T05:18:38Z

docs/reference/pip.rst

+recover cleanly if presented with incorrectly encoded build tool output.
+However, pip cannot guarantee in that case that the displayed output will
+not be corrupted (mojibake, or characters replaced with the standard
+replacement character, often a question mark).


The backslashreplace adjustment should be reliable enough that it makes sense to drop this sentence (it's technically still true, but the cases where even backslashreplace is insufficient are obscure enough that I think including it would create more confusion than it clears up)

ncoghlan · 2017-05-23T05:19:44Z

news/4486.bugfix

@@ -0,0 +1 @@
+Improve handling of Unicode output from build tools under Python 3.


Given the discussions, "Improve handling of improperly encoded text output from build tools" would be a more accurate description now.

I'm leaving out the phrase "improperly encoded", as it's mostly text that is properly encoded, just not in the same encoding that we're assuming. (There's no standard yet for what encoding setuptools should be using for output).

ncoghlan · 2017-05-23T05:23:18Z

pip/compat.py

+        logger.warning(
+            "Subprocess output does not appear to be encoded as %s" %
+            encoding)
+        s = data.decode(encoding, errors="replace")


This is the step that would need to conditionally use the native backslashreplace on 3.5+, and a pip-provided emulation otherwise.

ncoghlan · 2017-05-23T05:23:38Z

pip/compat.py

+    # that won't fail).
+    output_encoding = sys.__stderr__.encoding
+    if output_encoding:
+        s = s.encode(output_encoding, errors="replace")


This step can unconditionally use backslashreplace
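
For illustration, the encoding side of `backslashreplace` behaves like this on any supported Python version. The sample string and the choice of ASCII as the output encoding are arbitrary.

```python
# Text containing characters that ASCII cannot represent.
text = "r\u00e9sum\u00e9 \u2713"

# Unencodable characters are turned into Python-style escapes
# rather than being dropped or replaced with "?".
encoded = text.encode("ascii", errors="backslashreplace")

# Decoding back gives a string that is safe to print on an ASCII terminal.
printable = encoded.decode("ascii")
```

Unlike `errors="replace"`, nothing is lost here: the original code points can still be read off the escape sequences.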

pfmoore · 2017-05-23T16:37:57Z

For 3.4 and earlier, rather than falling back to replace, you also have the option of registering a custom backslashreplace_decode error handler

That looks like a good option - I knew registering a handler was possible, but hadn't looked into how difficult it was. Using the built in handler for 3.5+ and registering our own implementation for older versions seems like a good way to go.

pfmoore · 2017-05-23T21:22:02Z

OK, I think I've dealt with all the review comments. @ncoghlan could you check if you're OK with the reworded docs in particular?

ncoghlan

This looks pretty good to me. Just the one actual bug (the version check), and a couple of minor comments inline.

ncoghlan · 2017-05-24T06:29:33Z

docs/reference/pip.rst

+requires that the output is written in a well-defined encoding, specifically
+the encoding the user has configured for text output (which can be obtained in
+Python using ``locale.getpreferredencoding``). If the configured encoding is
+ASCII, pip assumes UTF-8 (to match the behaviour of some Unix systems).


It's not so much about "matching the behaviour of" as "accounting for the misbehaviour of" :)

OK. I'll go for "accounting for the behaviour of" (don't want to be accused of being the Windows guy criticising Unix ;-))

ncoghlan · 2017-05-24T06:30:43Z

docs/reference/pip.rst

+this may not always be possible. Pip will therefore attempt to recover cleanly
+if presented with incorrectly encoded build tool output, by translating
+unexpected byte sequences to Python-style hexadecimal escape sequences
+(\x80\xff, etc). However, it is still possible for output to be displayed


Perhaps show this as an inline Python string literal? That is:

``"\x80\xff"``

ncoghlan · 2017-05-24T06:34:53Z

pip/compat.py

-            return s.decode(sys.__stdout__.encoding)
-        except UnicodeDecodeError:
-            return s.decode('utf_8')
+if sys.version_info > (3, 4):


This check will succeed for 3.4.x releases, which isn't what you want. sys.version_info >= (3, 5) will do the right thing.

Good catch, thanks! I'll fix this.
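
The reason the original check misfires is plain tuple comparison: a longer tuple that shares a prefix with a shorter one compares as greater. A short sketch, using hand-written version tuples in place of `sys.version_info` (the helper name is hypothetical):

```python
import sys

# Stand-ins for sys.version_info on different interpreters.
v346 = (3, 4, 6, "final", 0)
v350 = (3, 5, 0, "final", 0)

# Buggy check: (3, 4, 6, ...) is "greater than" (3, 4), so 3.4.x passes.
buggy_on_34 = v346 > (3, 4)

# Corrected check: only 3.5 and later pass.
fixed_on_34 = v346 >= (3, 5)
fixed_on_35 = v350 >= (3, 5)


def has_native_backslashreplace_decode(version_info=sys.version_info):
    """True when the built-in handler also supports decoding (3.5+)."""
    return tuple(version_info) >= (3, 5)
```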

ncoghlan · 2017-05-24T06:36:11Z

pip/compat.py

+if sys.version_info > (3, 4):
+    backslashreplace_decode = "backslashreplace"
+else:
+    def backslashreplace_decode_fn(err):


Probably worth adding a comment here to say that while backslashreplace exists on these earlier versions, it's only usable for encoding, so this implements a version that specifically handles decoding.
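
One way the commented, version-conditional handler could look, as a sketch only. It assumes the corrected `>= (3, 5)` check and reuses the emulation idea from earlier in this thread; the private function name is hypothetical and this is not the code that was merged.

```python
import codecs
import sys

if sys.version_info >= (3, 5):
    # The built-in handler supports decoding from 3.5 onwards.
    backslashreplace_decode = "backslashreplace"
else:
    # "backslashreplace" exists on earlier versions, but it only handles
    # encoding errors, so register an emulation that handles decoding.
    def _backslashreplace_decode_fn(err):
        raw_bytes = bytearray(err.object[err.start:err.end])
        return u"".join(u"\\x%02x" % c for c in raw_bytes), err.end

    codecs.register_error("backslashreplace_decode", _backslashreplace_decode_fn)
    backslashreplace_decode = "backslashreplace_decode"

# Callers use the same handler name regardless of Python version.
decoded = b"\xd3PS-90AC".decode("ascii", errors=backslashreplace_decode)
```

Keeping the branch in one place means the version-specific code can be deleted in one step once the older interpreters are no longer supported.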

pfmoore · 2017-05-24T07:51:49Z

Thanks for the review @ncoghlan - I'll update the PR (and fix the pep8 failures) this evening.

pfmoore · 2017-05-24T17:51:51Z

OK, I'm going to try to add some unit tests for console_to_str (specifically, just to exercise it to prove that it doesn't fail given bad input) - mostly because there was a typo in my code that the existing tests weren't catching.

Once I've done that, if the tests don't expose any more bugs, I'm going to merge this - if anyone has any further comments before then, just shout.

pypa-bot · 2017-05-25T09:20:44Z

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

shivan · 2017-06-02T11:07:15Z

I found a workaround until pip 10 is available.

Enter
chcp 65001
on the command line before installing a package via pip.

This sets the UTF-8 codepage.

Is there a plan for when this fix will be available to the public? No milestone is set here.

dsudduth · 2017-06-02T18:45:39Z

@pfmoore - Can you share the latest status? I'd like to give this a shot to see if this fixes my issue on Russian Operating Systems.

pfmoore · 2017-06-02T22:40:30Z

You can pull the latest development version from git, and try that. pip install https://github.com/pypa/pip should also work.

pradyunsg · 2017-10-30T11:29:59Z

tests/unit/test_compat.py

+
+    monkeypatch.setattr(locale, 'getpreferredencoding', lambda: 'utf-8')
+    monkeypatch.setattr(pip.compat.logger, 'warning', check_warning)
+    console_to_str(some_bytes)


Maybe it should be asserted that this monkey patched function is called?

Could do I guess. The idea of the monkeypatching is to fix the behaviour rather than to test those functions are called, but it wouldn't do any harm.
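
A sketch of what asserting on the patched warning could look like. It uses `unittest.mock` rather than pytest's `monkeypatch` so that it runs standalone, and `console_to_str_sketch` is a hypothetical stand-in for pip's `console_to_str`, not the real function.

```python
import logging
from unittest import mock

logger = logging.getLogger("example")


def console_to_str_sketch(data, encoding="utf-8"):
    """Hypothetical stand-in: strict decode, else warn and escape bad bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        logger.warning(
            "Subprocess output does not appear to be encoded as %s", encoding
        )
        return data.decode(encoding, errors="backslashreplace")


def run_with_recorded_warnings(data):
    """Return the decoded text and how many times warning() was called."""
    with mock.patch.object(logger, "warning") as fake_warning:
        result = console_to_str_sketch(data)
    return result, fake_warning.call_count


bad_result, bad_warnings = run_with_recorded_warnings(b"a\xff")
good_result, good_warnings = run_with_recorded_warnings(b"ok")
```

Checking the call count covers both directions: the warning fires for bad input and stays silent for well-formed input.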

lock · 2019-06-02T16:06:42Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

pfmoore changed the title ~~Another attempt to fix encoding issues~~ [WIP] Another attempt to fix encoding issues May 20, 2017

xavfernandez reviewed May 20, 2017

pradyunsg reviewed May 23, 2017

ncoghlan reviewed May 23, 2017

ncoghlan suggested changes May 24, 2017

pfmoore changed the title ~~[WIP] Another attempt to fix encoding issues~~ Another attempt to fix encoding issues May 24, 2017

pfmoore force-pushed the encoding branch 2 times, most recently from 19ab087 to 1cf7e84 Compare May 24, 2017 13:25

pypa-bot added the needs rebase or merge PR has conflicts with current master label May 25, 2017

pfmoore added 8 commits May 25, 2017 10:25

Another attempt to fix encoding issues

56503e3

Add a news entry

baceb4a

Make a more complete attempt to sanitise subprocess output

30f1de9

Revise docs in line with new code

f093346

Always use backslashreplace when encoding or decoding

9194d1a

Improve documentation

2fc94f6

Address review comments

60baa76

Added tests for console_to_str

4aa7954

pfmoore force-pushed the encoding branch from dd86551 to 4aa7954 Compare May 25, 2017 09:32

pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 25, 2017

pfmoore merged commit 8a10132 into pypa:master May 27, 2017

pfmoore mentioned this pull request Jul 5, 2017

AttributeError while installing package from PythonShell #4598

Closed

stuxcrystal mentioned this pull request Jul 6, 2017

Improve setup.py to use system vapoursynth.dll vapoursynth/vapoursynth#320

Merged

aleksey-kutepov mentioned this pull request Jul 30, 2017

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 82: invalid continuation byte py3minepi/py3minepi#15

Open

rmax mentioned this pull request Aug 24, 2017

Conda installation not working correctly chartbeat-labs/textacy#127

Closed

sashkab mentioned this pull request Sep 25, 2017

Create version of PyQ for Windows? KxSystems/pyq#1

Closed

pfmoore mentioned this pull request Oct 3, 2017

UnicodeDecodeError when install simplejson under python 3.6.0 #4761

Closed

pradyunsg mentioned this pull request Oct 4, 2017

UnicodeDecodeError When trying to install something with pip2 #2903

Closed

pradyunsg mentioned this pull request Oct 17, 2017

pip install #4788

Closed

pradyunsg mentioned this pull request Oct 30, 2017

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 24: invalid continuation byte #4825

Closed

pradyunsg reviewed Oct 30, 2017

pradyunsg mentioned this pull request Nov 14, 2017

Error with utf-8 filenames #4863

Closed

dhimmel mentioned this pull request Nov 20, 2017

pip insall on windows raises a UnicodeDecodeError dhimmel/obonet#9

Closed

pradyunsg mentioned this pull request Dec 4, 2017

When try to install shapely on Python3(32bit) on Windows 7, UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 24: invalid continuation byte #4903

Closed

pradyunsg mentioned this pull request Dec 31, 2017

I don't know what's happened #4948

Closed

pfmoore mentioned this pull request Jan 23, 2018

Installing packages fails if Python 3 installed into path with non-ASCII characters #4984

Closed

jakirkham mentioned this pull request Jun 16, 2018

Windows Python 2.7 build issues with pip conda/conda-build#2959

Open

lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 2, 2019

lock bot locked as resolved and limited conversation to collaborators Jun 2, 2019

pfmoore deleted the encoding branch January 24, 2021 11:13
