Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix defaultEncoding in base.py, uncovered by Buildbot regression tests #1502

Merged
merged 6 commits into from
Feb 10, 2021

Conversation

rodrigc
Copy link
Contributor

@rodrigc rodrigc commented Jan 4, 2021

Scope and purpose

In ​buildbot/buildbot#5591 , we found that Twisted trunk started breaking some of Buildbot's tests:

===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/internet/defer.py", line 1445, in _inlineCallbacks
    result = current_context.run(g.send, result)
  File "/buildbot/buildbot-job/build/master/buildbot/test/unit/scripts/test_start.py", line 117, in test_start_timeout_number_string
    res = yield self.runStart(start_timeout='10')
  File "/buildbot/buildbot-job/build/master/buildbot/test/unit/scripts/test_start.py", line 87, in runStart
    return getProcessOutputAndValue(sys.executable, args=args, env=env)
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/internet/utils.py", line 174, in getProcessOutputAndValue
    return _callProtocolWithDeferred(
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/internet/utils.py", line 28, in _callProtocolWithDeferred
    reactor.spawnProcess(p, executable, (executable,) + tuple(args), env, path)
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 386, in spawnProcess
    args, env = self._checkProcessArgs(args, env)
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/internet/base.py", line 1088, in _checkProcessArgs
    _arg = argChecker(arg)
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/internet/base.py", line 1074, in argChecker
    arg = arg.encode(defaultEncoding)
builtins.TypeError: encode() argument 'encoding' must be str, not None
buildbot.test.unit.scripts.test_start.TestStart.test_start_timeout_number_string
-------------------------------------------------------------------------------

Contributor Checklist:

work.

This is required for Buildbot's test infrastructure to work properly.
@rodrigc
Copy link
Contributor Author

rodrigc commented Jan 4, 2021

With help from @p12tic (Buildbot release manager), we found the following.

sys.stdout.encoding is always populated in CPython as per:

However, as part of it's test suite, Buildbot overrides sys.stdout with io.StringIO: https://github.com/buildbot/buildbot/blob/cf0b9095ae79e532ad9d51613041b6bdfd373921/master/buildbot/test/util/misc.py#L57

This is not ideal, but there could be other projects out there using Twisted which do the same.

So I'm taking the conservative approach here, and always populating defaultEncoding.

@rodrigc
Copy link
Contributor Author

rodrigc commented Jan 4, 2021

I tested Twisted's 10069-rodrigc-encoding branch against Buildbot here: buildbot/buildbot#5739
and the Buildbot tests passed.

@p12tic
Copy link
Contributor

p12tic commented Jan 4, 2021

The PR looks good to me. I'm not an official Twisted reviewer, just coming from Buildbot side.

Copy link
Contributor

@twm twm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rodrigc, thanks for continuing to work on this.

I'm afraid that this doesn't look right to me. In addition to the None issue highlighted inline, it's unclear to me why the stdout encoding is relevant for environment variables, and furthermore given that change was originally made for Windows, why are we encoding environment variables to bytes at all on that platform?

I've spelunked the CPython sources a bit, but have to head to bed now. I'll follow up with you again tomorrow.

# value of sys.stdout.encoding.
# If a client application patches sys.stdout so that encoding is not
# set properly, try to fall back to sys.__stdout__.encoding.
defaultEncoding = sys.stdout.encoding or sys.__stdout__.encoding
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sys.stdout and sys.__stdout__ can be None, so if the encoding comes from those places that case must be handled.

@twm
Copy link
Contributor

twm commented Jan 6, 2021

So it looks like there are two issues here:

  1. We appear to do this encoding-to-bytes dance on Windows. I'm no Windows expert, but it looks like the relevant APIs take UTF-16LE strings. See also the os.environ implementation.
  2. The encoding-to-bytes stuff is appropriate on POSIX, but the current implementation doesn't account for PEP 383. Specifically, it's missing the surrogateescape error handler, so inherited environment variables that contain arbitrary bytes may fail to round trip (e.g., if the user does reactor.spawnProcess(..., env=os.environ)). See os.execve and the os.environ implementation. I'm not sure if this has any implications for Windows with its personalities and such.

I'm a little concerned to release Twisted with the use of sys.stdout.encoding, as it doesn't seem to match the Python internals that argChecker is trying to imitate the behavior of. I think it may often be the same encoding — indeed that's very likely on Python 3.6+, where Python is aggressive about using UTF-8 on POSIX platforms — but it seems likely to cause regressions.

Overall I'm unsure if argChecker is actually worth the cost. Perhaps we should just delete it, since it is coupled to Python internals that have changed rather a lot, and it provides very little value in a UTF-8 world (versus the old locale encoding badtimes).

@rodrigc What do you want to do?

@adiroiban
Copy link
Member

Thanks for the update. The CI looks green, but I only see some mocked unit tests without an integration tests passing actual Unicode arguments.

So I think that this fixed the error from Buildbot, but I am not sure if this add proper UTF-8 / UTF-16 / Unicode handling in spawnProcess.

My suggestion is to use at least the "str with only ascii" argument passing tests and the "str with unicode" tests... and add another test in which the arguments have invalid UTF-8 encoding.

https://github.com/twisted/twisted/pull/1501/files#diff-c4d8a010d03a1063bfa9bd277af991ae132d176577511386fd22dfe1af49a4e4

In python 2.7 popen was complicated .... but I think that things are better in Python 3. So as suggested by Tom, maybe just remove the encoding handling in Twisted and let stdlib do its magic.

@rodrigc
Copy link
Contributor Author

rodrigc commented Jan 23, 2021

@twm The patch in this PR is minimal, and my preference is to merge it as-is. I do not want to complicate this
code any more by adding UTF handling. Over the summer,

I spent a lot of time testing the code in base.py and changing it due to the various problems I saw on Windows in different versions of Python from 3.5-3.7.
Making any assumptions about encoding is dangerous, because due to environment/codepage differences, there are subtle configuration problems that can be uncovered and break things. I found the best approach is to just defer to
how CPython does encoding and not make any assumptions otherwise.

After this is merged, I want to do a new release of Twisted.
Once there is a new Twisted release, I will:

  1. Drop support for Python 3.5 in trunk with this: Drop support for Python 3.5 #1386
  2. Revisit this code in base.py and look at dropping _checkProcessArgs() completely.

The original motivation for _checkProcessArgs() was because in the Python 2 days,
on Windows, the environment variables and command-line arguments were wchar but on POSIX they were bytes.
This required a lot of complex stuff to dance around the bytes/unicode issues.

In Python 3, environment variables and command-line arguments are Unicode strings on all platforms, and these are passed all the way down to the lowest levels of the reactor. So there is no need to play these bytes/unicode games when passing args/environment variables to the reactor.
However, dropping _checkProcessArgs() will require a separate review and testing cycle.

I do not want to hold up the current release cycle on that, but I want to get this change in to not break Buildbot with the latest release.

@rodrigc rodrigc requested a review from twm January 23, 2021 21:19
Copy link
Contributor

@twm twm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rodrigc I am okay with an incremental step that doesn't address the larger issues here. However, I think that some minor changes are still required here before we can land and release.

It would be a regression to do a release that doesn't work in a Windows GUI app, where sys.stdout may be None, as I noted before. Twisted 20.3.0 would not fail in this case. I do not think that we should merge without addressing this (and adding a test, of course).

Given the dubious value this codepath offers on Windows I think that it would be reasonable to address this case by skipping bytes coercion entirely if sys.stdout/sys.__stdout__ are not present. The argChecker codepath is only intended to give an early error, after all. Cryptic post-fork() errors aren't great, but they're probably better than not being able to spawn processes at all.

I'll file a ticket about the whole arg/environment encoding situation based on my comments above once this PR is in. While there's a minor mess here most of this code has been around a while, so I don't think that it qualifies as a release blocker. I also agree that it would be easier to address once Python 3.5 is no longer a factor.

I hope this clarifies my earlier feedback. Thanks for your work on this!

@rodrigc
Copy link
Contributor Author

rodrigc commented Feb 1, 2021

@twm Thanks for the feedback. I don't understand what you are asking for. Are you saying that if sys.stdout is None, then _checkProcessArgs() should be skipped?

@twm
Copy link
Contributor

twm commented Feb 1, 2021

@rodrigc Yes, I think that it is okay if _checkProcessArgs is skipped when sys.stdout is None. This is because:

  1. I believe that encoding to bytes is unnecessary on Windows, and sys.stdout is only documented as None on Windows
  2. Even if there are other cases where sys.stdout can be None, the encoding to bytes here is only an attempt to do that work before fork() for the purposes of improved error reporting.

@rodrigc
Copy link
Contributor Author

rodrigc commented Feb 1, 2021

@twm I'm still don't understand what you are suggesting. Are you suggesting that _checkProcessArgs should be skipped in these places if sys.stdout is None?

src/twisted/internet/iocpreactor/reactor.py:        args, env = self._checkProcessArgs(args, env)
src/twisted/internet/posixbase.py:        args, env = self._checkProcessArgs(args, env)

or are you suggesting that inside _checkProcessArgs(), if sys.stdout is None,
it should just return args and env unmodified?

@twm
Copy link
Contributor

twm commented Feb 2, 2021

or are you suggesting that inside _checkProcessArgs(), if sys.stdout is None, it should just return args and env unmodified?

Yes, that would be acceptable.

@rodrigc
Copy link
Contributor Author

rodrigc commented Feb 3, 2021

I wrote the following program test.py:

import sys
f = open("b.txt", "w")
print(sys.stdout, file=f)

I then ran it on Windows 10, with Python 3.5.4's pythonw.exe as:

pythonw.exe test.py

The contents of b.txt was:

None

You are indeed correct. Thanks for pointing out this corner case.

This scenario occurs in Windows GUI Applications which have no console.
@rodrigc rodrigc closed this Feb 3, 2021
@rodrigc rodrigc reopened this Feb 3, 2021
@rodrigc rodrigc requested a review from twm February 8, 2021 04:41
@rodrigc
Copy link
Contributor Author

rodrigc commented Feb 8, 2021

@twm I have added a check in _checkProcessArgs() is sys.stdout is None, and added a test.
Does this look OK to you?

Copy link
Contributor

@twm twm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rodrigc Sorry I missed your message until now. This looks okay to me.

I think we'll want to revisit all this in the next release cycle. Once we drop Python 3.5 support all of this will be simpler, but I don't think making it perfect need block the release any further.

@rodrigc rodrigc merged commit 7cf6c8b into trunk Feb 10, 2021
@rodrigc rodrigc deleted the 10069-rodrigc-encoding branch February 10, 2021 08:31
@rodrigc
Copy link
Contributor Author

rodrigc commented Feb 10, 2021

@twm thanks for the review, and especially for bringing up the corner cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants