test_site.StartupImportTests.test_startup_imports fails if default code page is cp65001 #80959
Comments
Windows desktop SKUs have a default ANSI code page (returned by GetACP()) of 1252 (Western European). Windows IoT Core and Windows Nano Server have a default code page of 65001 (UTF-8). This causes test_site.StartupImportTests.test_startup_imports to fail on Windows IoT Core and Windows Nano Server because cp65001.py is loaded at startup instead of the frozen cp1252.py. I tried changing the default code page to 65001 on my dev machine and rebuilding Python, and it had no effect that I could tell on the generated frozen importlibs. The simplest solution would be to skip test_startup_imports, or change it to pass, when locale.getpreferredencoding() returns 'cp65001'. |
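A minimal sketch of the "skip when the preferred encoding is cp65001" idea, for illustration only. The helper name `is_cp65001` and the test class are hypothetical and are not the code in Lib/test/test_site.py; the sketch also assumes that locale.getpreferredencoding() reports 'cp65001' on the affected systems, as described above.

```python
import locale
import unittest


def is_cp65001(encoding: str) -> bool:
    """Hypothetical helper: True if the encoding name is code page 65001."""
    return encoding.lower() == "cp65001"


class StartupImportsSketch(unittest.TestCase):
    # Illustrative stand-in for the real test; it is skipped when the
    # preferred encoding is cp65001, because that codec pulled in functools.
    @unittest.skipIf(
        is_cp65001(locale.getpreferredencoding(False)),
        "cp65001 codec imports functools at startup",
    )
    def test_startup_imports(self):
        # The real check would inspect the modules loaded by a child process.
        self.assertTrue(True)
```

The skip condition is evaluated when the class body runs, so the decision is made once per test session rather than inside the test.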
Could you paste how the test fails? |
======================================================================
Traceback (most recent call last):
File "c:\docker\pythond\lib\test\test_site.py", line 542, in test_startup_imports
self.assertFalse(modules.intersection(collection_mods), stderr)
AssertionError: {'operator', 'keyword', 'functools', 'heapq', 'collections', 'reprlib'} is not false : import _frozen_importlib # frozen
import _imp # builtin
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
import '_io' # <class '_frozen_importlib.BuiltinImporter'>
import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
import 'nt' # <class '_frozen_importlib.BuiltinImporter'>
import _thread # previously loaded ('_thread')
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import _weakref # previously loaded ('_weakref')
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
import 'winreg' # <class '_frozen_importlib.BuiltinImporter'>
# installing zipimport hook
import 'time' # <class '_frozen_importlib.BuiltinImporter'>
import 'zipimport' # <class '_frozen_importlib.FrozenImporter'>
# installed zipimport hook
# c:\docker\pythond\lib\encodings\__pycache__\__init__.cpython-38.pyc matches c:\docker\pythond\lib\encodings\__init__.py
# code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\__init__.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\codecs.cpython-38.pyc matches c:\docker\pythond\lib\codecs.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\codecs.cpython-38.pyc'
import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x01D9DBD0>
# c:\docker\pythond\lib\encodings\__pycache__\aliases.cpython-38.pyc matches c:\docker\pythond\lib\encodings\aliases.py
# code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\aliases.cpython-38.pyc'
import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFF900>
import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x01D9DA50>
# c:\docker\pythond\lib\encodings\__pycache__\utf_8.cpython-38.pyc matches c:\docker\pythond\lib\encodings\utf_8.py
# code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\utf_8.cpython-38.pyc'
import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x01D9DCC0>
import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
# c:\docker\pythond\lib\encodings\__pycache__\cp65001.cpython-38.pyc matches c:\docker\pythond\lib\encodings\cp65001.py
# code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\cp65001.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\functools.cpython-38.pyc matches c:\docker\pythond\lib\functools.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\functools.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\abc.cpython-38.pyc matches c:\docker\pythond\lib\abc.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\abc.cpython-38.pyc'
import '_abc' # <class '_frozen_importlib.BuiltinImporter'>
import 'abc' # <_frozen_importlib_external.SourceFileLoader object at 0x01F16FC0>
# c:\docker\pythond\lib\collections\__pycache__\__init__.cpython-38.pyc matches c:\docker\pythond\lib\collections\__init__.py
# code object from 'c:\\docker\\pythond\\lib\\collections\\__pycache__\\__init__.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\_collections_abc.cpython-38.pyc matches c:\docker\pythond\lib\_collections_abc.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\_collections_abc.cpython-38.pyc'
import '_collections_abc' # <_frozen_importlib_external.SourceFileLoader object at 0x01F423C0>
# c:\docker\pythond\lib\__pycache__\operator.cpython-38.pyc matches c:\docker\pythond\lib\operator.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\operator.cpython-38.pyc'
import '_operator' # <class '_frozen_importlib.BuiltinImporter'>
import 'operator' # <_frozen_importlib_external.SourceFileLoader object at 0x01F4D630>
# c:\docker\pythond\lib\__pycache__\keyword.cpython-38.pyc matches c:\docker\pythond\lib\keyword.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\keyword.cpython-38.pyc'
import 'keyword' # <_frozen_importlib_external.SourceFileLoader object at 0x01F58810>
# c:\docker\pythond\lib\__pycache__\heapq.cpython-38.pyc matches c:\docker\pythond\lib\heapq.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\heapq.cpython-38.pyc'
import '_heapq' # <class '_frozen_importlib.BuiltinImporter'>
import 'heapq' # <_frozen_importlib_external.SourceFileLoader object at 0x01F588D0>
import 'itertools' # <class '_frozen_importlib.BuiltinImporter'>
# c:\docker\pythond\lib\__pycache__\reprlib.cpython-38.pyc matches c:\docker\pythond\lib\reprlib.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\reprlib.cpython-38.pyc'
import 'reprlib' # <_frozen_importlib_external.SourceFileLoader object at 0x01F59900>
import '_collections' # <class '_frozen_importlib.BuiltinImporter'>
import 'collections' # <_frozen_importlib_external.SourceFileLoader object at 0x01F25810>
import '_functools' # <class '_frozen_importlib.BuiltinImporter'>
import 'functools' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFFCC0>
import 'encodings.cp65001' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFF9F0>
# c:\docker\pythond\lib\encodings\__pycache__\latin_1.cpython-38.pyc matches c:\docker\pythond\lib\encodings\latin_1.py
# code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\latin_1.cpython-38.pyc'
import 'encodings.latin_1' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFF810>
# c:\docker\pythond\lib\__pycache__\io.cpython-38.pyc matches c:\docker\pythond\lib\io.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\io.cpython-38.pyc'
import 'io' # <_frozen_importlib_external.SourceFileLoader object at 0x01D88DB0>
Python 3.8.0a3+ (heads/iot-merged-dirty:88716a51a3, Apr 5 2019, 11:11:18) [MSC v.1916 32 bit (ARM)] on win32
Type "help", "copyright", "credits" or "license" for more information.
# c:\docker\pythond\lib\__pycache__\site.cpython-38.pyc matches c:\docker\pythond\lib\site.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\site.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\os.cpython-38.pyc matches c:\docker\pythond\lib\os.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\os.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\stat.cpython-38.pyc matches c:\docker\pythond\lib\stat.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\stat.cpython-38.pyc'
import '_stat' # <class '_frozen_importlib.BuiltinImporter'>
import 'stat' # <_frozen_importlib_external.SourceFileLoader object at 0x01F25990>
# c:\docker\pythond\lib\__pycache__\ntpath.cpython-38.pyc matches c:\docker\pythond\lib\ntpath.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\ntpath.cpython-38.pyc'
# c:\docker\pythond\lib\__pycache__\genericpath.cpython-38.pyc matches c:\docker\pythond\lib\genericpath.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\genericpath.cpython-38.pyc'
import 'genericpath' # <_frozen_importlib_external.SourceFileLoader object at 0x01F9CDE0>
import 'ntpath' # <_frozen_importlib_external.SourceFileLoader object at 0x01F9C5D0>
import 'os' # <_frozen_importlib_external.SourceFileLoader object at 0x01F873F0>
# c:\docker\pythond\lib\__pycache__\_sitebuiltins.cpython-38.pyc matches c:\docker\pythond\lib\_sitebuiltins.py
# code object from 'c:\\docker\\pythond\\lib\\__pycache__\\_sitebuiltins.cpython-38.pyc'
import '_sitebuiltins' # <_frozen_importlib_external.SourceFileLoader object at 0x01F87FC0>
import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x01F16C60>
# cleanup[3] wiping _functools
# cleanup[3] wiping _collections
# cleanup[3] wiping heapq
# cleanup[3] wiping _heapq
# destroy _heapq
# cleanup[3] wiping _operator
# cleanup[3] wiping _collections_abc
# cleanup[3] wiping _abc
# cleanup[3] wiping encodings.utf_8
# cleanup[3] wiping encodings.aliases
# cleanup[3] wiping codecs
# cleanup[3] wiping _codecs
# cleanup[3] wiping winreg
# cleanup[3] wiping _weakref
# cleanup[3] wiping _thread
# cleanup[3] wiping nt
# cleanup[3] wiping marshal
# cleanup[3] wiping _io
# cleanup[3] wiping _frozen_importlib_external
# destroy io
# destroy nt
# destroy winreg
# destroy marshal
# cleanup[3] wiping _warnings
# cleanup[3] wiping _imp
# cleanup[3] wiping _frozen_importlib
# destroy _frozen_importlib_external
# destroy _imp
# destroy _warnings
# cleanup[3] wiping sys
# clear builtins._
# clear sys.path
# clear sys.argv
# clear sys.ps1
# clear sys.ps2
# clear sys.last_type
# clear sys.last_value
# clear sys.last_traceback
# clear sys.path_hooks
# clear sys.path_importer_cache
# clear sys.meta_path
# clear sys.__interactivehook__
# clear sys.flags
# clear sys.float_info
# restore sys.stdin
# restore sys.stdout
# restore sys.stderr
# cleanup[2] removing sys
# cleanup[2] removing builtins
# cleanup[2] removing _frozen_importlib
# cleanup[2] removing _imp
# cleanup[2] removing _warnings
# cleanup[2] removing _frozen_importlib_external
# cleanup[2] removing _io
# cleanup[2] removing marshal
# cleanup[2] removing nt
# cleanup[2] removing _thread
# cleanup[2] removing _weakref
# cleanup[2] removing winreg
# cleanup[2] removing time
# cleanup[2] removing zipimport
# destroy zipimport
# cleanup[2] removing _codecs
# cleanup[2] removing codecs
# cleanup[2] removing encodings.aliases
# cleanup[2] removing encodings
# destroy encodings
# cleanup[2] removing encodings.utf_8
# cleanup[2] removing _signal
# cleanup[2] removing __main__
# destroy __main__
# cleanup[2] removing _abc
# cleanup[2] removing abc
# cleanup[2] removing _collections_abc
# cleanup[2] removing _operator
# cleanup[2] removing operator
# destroy operator
# cleanup[2] removing keyword
# destroy keyword
# cleanup[2] removing _heapq
# cleanup[2] removing heapq
# cleanup[2] removing itertools
# cleanup[2] removing reprlib
# destroy reprlib
# cleanup[2] removing _collections
# cleanup[2] removing collections
# destroy collections
# cleanup[2] removing _functools
# cleanup[2] removing functools
# cleanup[2] removing encodings.cp65001
# cleanup[2] removing encodings.latin_1
# cleanup[2] removing io
# destroy io
# cleanup[2] removing _stat
# cleanup[2] removing stat
# cleanup[2] removing genericpath
# cleanup[2] removing ntpath
# cleanup[2] removing os.path
# cleanup[2] removing os
# cleanup[2] removing _sitebuiltins
# cleanup[2] removing site
# destroy site
# destroy time
# destroy _signal
# destroy itertools
# destroy _sitebuiltins
# destroy abc
# destroy ntpath
# destroy _stat
# destroy os
# destroy stat
# destroy genericpath
# cleanup[3] wiping encodings.latin_1
# cleanup[3] wiping encodings.cp65001
# destroy functools
# cleanup[3] wiping builtins
# destroy _functools
# destroy _collections_abc
# destroy _operator
# destroy heapq
# destroy _weakref
# destroy _collections
# destroy _thread
# destroy _abc
# destroy _frozen_importlib |
cp65001 is *not* utf-8: Microsoft decided to handle surrogates differently |
Do you mean valid UTF-16 surrogate pairs? For example:
>>> codecs.code_page_encode(65001, '\ud800\udc00')
(b'\xf0\x90\x80\x80', 2)
PyUnicode_AsUnicodeAndSize is neutral about storing surrogate codes in a 16-bit wchar_t string. In particular, the Python string in this case contains two surrogate codes, but they're passed to WideCharToMultiByte as a UTF-16 surrogate pair for the single character U+10000. Anyway, it seems to me this issue will be resolved if cp65001.py is rewritten without functools.partial. |
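The surrogate-pair behaviour described above can be approximated on any platform without calling the Windows API. The following is only a sketch: `utf8_via_utf16` is a hypothetical helper that round-trips the string through UTF-16 (with the surrogatepass error handler) to mimic how two surrogate code points end up being treated as one UTF-16 pair.

```python
def utf8_via_utf16(text: str) -> bytes:
    """Hypothetical helper: reinterpret surrogate code points as UTF-16
    code units, then encode the resulting string as strict UTF-8."""
    utf16 = text.encode("utf-16-le", "surrogatepass")
    # Decoding joins a valid high/low surrogate pair into one character.
    return utf16.decode("utf-16-le").encode("utf-8")


# Two surrogate code points that form a valid pair for U+10000.
pair = "\ud800\udc00"
print(utf8_via_utf16(pair))

# A lone surrogate is rejected by the strict UTF-8 codec ...
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# ... but can be forced through with the surrogatepass error handler.
print("\ud800".encode("utf-8", "surrogatepass"))
```

The pair case yields the same four bytes as encoding U+10000 directly, which matches the code_page_encode result quoted above.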
I think it is better to just make the check in the test conditional. It already contains some macOS-specific conditions. |
Okay. The test verifies work done to minimize interpreter startup time, but probably the relative cost of importing functools (and thus collections et al) isn't significant compared to the overall cost of spawning a process in a Windows desktop environment. That may not be the case for Nano Server and IoT Core. |
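For readers unfamiliar with the check, here is a rough sketch of what such a startup-import test looks at. It is not the actual test_site code: `startup_modules` and `COLLECTION_MODS` are illustrative names, and the module set is simply the one that appears in the assertion failure earlier in this thread.

```python
import subprocess
import sys

# Modules that showed up unexpectedly in the failure reported above.
COLLECTION_MODS = {"operator", "keyword", "functools", "heapq",
                   "collections", "reprlib"}


def startup_modules() -> set:
    """Return the names in sys.modules of a fresh, isolated interpreter."""
    code = "import sys; print('\\n'.join(sorted(sys.modules)))"
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.split())


unexpected = startup_modules() & COLLECTION_MODS
print("unexpected startup imports:", sorted(unexpected))
```

An empty result means none of the listed modules were imported during startup in this environment; what is imported depends on the platform, the Python version, and the active encoding.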
Paul Monson: I'm unable to reproduce your issue exactly, but I tried to reproduce it partially using PYTHONIOENCODING=cp65001. My PR 13110 avoids "import functools" at startup. Can you please try it and check whether it fixes test_site? |
Victor:
Eryk:
Code page 65001 handles lone surrogates differently on Windows XP and older. It changed in Windows Vista: Steve Dower removed support for Vista from test_codecs.py 3 years ago: commit f5aba58
Maybe it's time to remove Lib/encodings/cp65001.py and add an alias cp65001 => utf_8 in Lib/encodings/aliases.py? See bpo-32592. |
Is there an easy way to measure this?
I tried setting PYTHONIOENCODING=cp1252 on Windows IoT Core as a workaround and it didn't work. Victor> My PR 13110 avoids "import functools" at startup. Can you please try it and check if it fix test_site? I tried the PR and it fixes test_startup_imports, which seems promising. The PR breaks other test_site tests on Windows IoT Core. |
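One rough way to measure the startup cost asked about above is to time a child interpreter with and without the extra import. This is a sketch under simple assumptions (wall-clock timing, best of a few runs); `time_startup` is a hypothetical helper, and the numbers will vary by machine. Python 3.7 and later also accept `-X importtime`, which prints a per-module import timing breakdown to stderr.

```python
import subprocess
import sys
import time


def time_startup(code: str = "pass", runs: int = 5) -> float:
    """Return the best wall-clock time, in seconds, to start an isolated
    interpreter and run `code`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-I", "-c", code], check=True)
        best = min(best, time.perf_counter() - start)
    return best


baseline = time_startup("pass")
with_functools = time_startup("import functools")
print(f"baseline:         {baseline * 1000:.1f} ms")
print(f"import functools: {with_functools * 1000:.1f} ms")
```

Taking the minimum over several runs reduces noise from process scheduling, but the difference between the two figures is still only an estimate.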
FYI, I expect cp65001 will be used more widely in the near future. Today, Microsoft announced a new Terminal application. I think treating cp65001 as a proper "UTF-8" locale is better for everyone. |
cp65001 is the default codepage on Windows IoT Core and Windows NanoServer. There is also an option in control panel in Windows desktop 1809 (version 17763) and greater which changes the default codepage to cp65001.
If I read the docs correctly a lone surrogate is an error. I don't think a corner case like handling errors differently makes cp65001 not UTF-8. Am I misunderstanding this point? |
The XP/Vista change is just context - we don't have to worry about OS that old any more. If we remove the functools.partial call, does that help? |
Removing import functools from cp65001.py fixes test_startup_imports. Victor proposed this PR: #13110
I tried to build on Victor's change, but there is still one test failure I haven't tracked down yet: #13211
FAIL: test_incremental_surrogatepass (test.test_codecs.CP65001Test)
Traceback (most recent call last):
File "C:\master\pythond\lib\test\test_codecs.py", line 436, in test_incremental_surrogatepass
self.assertEqual(dec.decode(data[i:], True), '\uD901')
AssertionError: '' != '\ud901'
+ \ud901 |
Unless PYTHONLEGACYWINDOWSSTDIO is defined, Python 3.6+ doesn't use the console's codepage-based interface (except for low-level os.read and os.write). Console files use the wide-character console API internally, and have a "utf-8" encoding. "cp65001" isn't a factor in this context. This issue probably occurs due to the encoding returned by locale.getpreferredencoding(). This calls _locale._getdefaultlocale, which returns a tuple that mixes the user locale with the system ANSI codepage. For example, with ANSI set to UTF-8 (Windows 10):
>>> _locale._getdefaultlocale()
('en_GB', 'cp65001')
The Universal CRT special cases CP_UTF8 (codepage 65001) as "utf8" and accepts "utf-8" as an alias. For example, after setting the ANSI codepage to UTF-8:
>>> locale.setlocale(locale.LC_CTYPE, '')
'English_United Kingdom.utf8'
Python could similarly special case CP_UTF8 as "utf-8" in _locale._getdefaultlocale. |
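The special-casing suggested above could look roughly like the following. This is a pure-Python illustration of the naming rule only, not the C code in _localemodule.c; `codepage_to_encoding_name` is a hypothetical helper, and the "cp%d" pattern is taken from the cpNNN naming discussed in this thread.

```python
CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def codepage_to_encoding_name(acp: int) -> str:
    """Hypothetical helper: map an ANSI code page number to an encoding
    name, reporting CP_UTF8 as "utf-8" instead of "cp65001"."""
    if acp == CP_UTF8:
        return "utf-8"
    return "cp%d" % acp


print(codepage_to_encoding_name(1252))
print(codepage_to_encoding_name(65001))
```

With a rule like this, the reported encoding name would line up with what the CRT reports for the locale, at the cost of no longer exposing the raw code page number.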
@eryk I didn't say the new Terminal will cause this issue. I know about ConsoleIO too. I just meant that Microsoft uses cp65001 more widely for better UTF-8 support nowadays.
I like this idea too. |
I wrote PR 13230 to remove Lib/encodings/cp65001.py and simply reuse Lib/encodings/utf_8.py. |
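To see how a given interpreter resolves these names, a small probe such as the one below can help. It is a sketch: `resolve_codec` is a hypothetical helper built on codecs.lookup(), and the result for 'cp65001' depends on the Python version and platform, so no particular output is assumed for it here.

```python
import codecs


def resolve_codec(name: str):
    """Return the canonical codec name that `name` resolves to,
    or None if this interpreter has no such codec."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


for candidate in ("utf_8", "utf8", "cp65001"):
    print(candidate, "->", resolve_codec(candidate))
```

If 'cp65001' is handled purely as an alias, it resolves to the same canonical name as 'utf_8'; with a separate codec module it reports its own name or is unavailable.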
My PR 13110 (avoid functools) makes codecs.lookup('cp65001').encode() 2.7x slower. My PR 13230 (remove cp65001.py) makes it 1.5x faster :-) The reference is: 156 ns +- 3 ns. |
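A comparable micro-benchmark can be sketched with timeit. This is not the benchmark used for the numbers above; `bench_encode` is a hypothetical helper, it uses the 'utf-8' codec by default so that it runs on any platform, and the absolute figures depend on the machine.

```python
import codecs
import timeit


def bench_encode(name: str = "utf-8", text: str = "abc",
                 number: int = 100_000) -> float:
    """Return the mean time, in nanoseconds, of one codec encode() call."""
    encode = codecs.lookup(name).encode
    total = timeit.timeit(lambda: encode(text), number=number)
    return total / number * 1e9


print(f"utf-8 encode: {bench_encode():.0f} ns per call")
```

The lambda adds a small constant overhead to every call, so the result is best used for comparing codec implementations against each other rather than as an absolute cost.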
I dislike lying in the locale module. This change is basically useless with my PR 13230. |
Yes, functionally it's no different than using 'cp65001' as an alias. That said, the CRT special cases 65001 as "utf8":
>>> locale.setlocale(locale.LC_CTYPE, '')
'English_United Kingdom.utf8'
>>> crt_locale = ctypes.CDLL('api-ms-win-crt-locale-l1-1-0', use_errno=True)
>>> crt_locale.___lc_codepage_func()
65001
So the suggested change makes the locale module internally consistent on Windows and more transparent for anyone who doesn't know off the top of their head that "cp65001" is just UTF-8. |
I can verify that PR 13110 fixes the issue with test_startup_imports on Windows IoT Core ARM32 |
Sorry that was supposed to say: |
Note that Python, not Windows, produces the "cpNNN" encoding name. cpython/Modules/_localemodule.c Line 395 in 137be34
So I don't think it is a lie. It is just a question of which encoding name we should choose when GetACP() returns 65001. |
About the ANSI code page, Lib/encodings/__init__.py calls _winapi.GetACP() to avoid relying on locale.getpreferredencoding(), which lies when UTF-8 Mode is enabled:
import _winapi
ansi_code_page = "cp%s" % _winapi.GetACP()
if encoding == ansi_code_page:
    import encodings.mbcs
    return encodings.mbcs.getregentry()
INADA-san:
Well, feel free to propose a PR. I have no strong opinion on this level of detail :-) |
Paul Monson: Your initial issue has been fixed in the master branch. I'm not sure what Windows IoT Core and Windows Nano Server are. Do you care about Python 3.7? If someone wants to support running test_site with the ANSI code page set to 65001, I suggest fixing test_site directly, like PR 13072, in Python 3.7. My attempt to avoid functools made the cp65001 codec way slower. Fixing one specific test should not make Python that much slower ;-) |
Thanks Victor! Since we aren't backporting ARM32 changes, I don't think it's important to fix this test in 3.7. I am trying to get the buildbot tests for Windows ARM32 to zero errors. Windows IoT Core runs on Raspberry Pi and similar devices: https://developer.microsoft.com/en-us/windows/iot Windows NanoServer is a very small version of Windows Server for running in Docker containers hosted on Windows Server. |
OK, thanks. I'm closing the issue. |