Environment
- pip version: 18.1
- Python version: 3.7.1
- OS: Fedora 30 s390x
This bug manifests itself on big-endian architectures when the tests are run.
However, it can be examined on little-endian machines as well.
Description
This is the test failure on s390x:
=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________

self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>

    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"

tests/unit/test_utils.py:459:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le

src/pip/_internal/utils/encoding.py:25: LookupError
Expected behavior
The tests should pass on all architectures alike.
How to Reproduce
- Get a big-endian machine (a virtualized one may do).
- Run the tests.
More info
I've checked and pip has:
pip/src/pip/_internal/utils/encoding.py
Lines 6 to 14 in e5ab7f6
BOMS = [
    (codecs.BOM_UTF8, 'utf8'),
    (codecs.BOM_UTF16, 'utf16'),
    (codecs.BOM_UTF16_BE, 'utf16-be'),
    (codecs.BOM_UTF16_LE, 'utf16-le'),
    (codecs.BOM_UTF32, 'utf32'),
    (codecs.BOM_UTF32_BE, 'utf32-be'),
    (codecs.BOM_UTF32_LE, 'utf32-le'),
]
And:
pip/src/pip/_internal/utils/encoding.py
Lines 23 to 25 in e5ab7f6
for bom, encoding in BOMS:
    if data.startswith(bom):
        return data[len(bom):].decode(encoding)
So this has two problems:
- Why does this fail only on a big-endian architecture and not on all of them?
- pip tries to use nonexistent encodings.
I have a small reproducer here (run on my machine, x86_64):
>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
...
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le
This is the output on s390x:
b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le
Clearly, the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? The code should not reach those anyway?
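To check which spellings Python actually accepts, one can ask the codec registry directly. This is a minimal sketch; `codec_exists` is a hypothetical helper, not part of pip. It relies only on `codecs.lookup`, which raises `LookupError` for an unknown name. The hyphenated spellings such as `utf-16-be` are the ones listed in the standard library's table of encodings.

```python
import codecs


def codec_exists(name):
    """Return True if the codec registry can resolve this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Compare the spellings used in BOMS with the hyphenated standard names.
for name in ("utf16-be", "utf16-le", "utf-16-be", "utf-16-le",
             "utf-32-be", "utf-32-le"):
    print(name, codec_exists(name))
```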
The testing bytestring is:
b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
It starts with \xff\xfe and hence should be decoded by the first encoding in the list that has this BOM. On little endian, that is utf16: everything works, because we never reach the nonexistent encodings.
However, on a big-endian system the utf16 BOM is big endian, and hence the first item with the \xff\xfe BOM is utf16-le, so it blows up.
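The byte-order dependence can be simulated on any machine. The following is a minimal sketch, not pip's code: `boms_for` and `first_match` are hypothetical helpers that rebuild the UTF-16 part of the table for a chosen byte order. It relies on the documented behaviour that `codecs.BOM_UTF16` equals either `codecs.BOM_UTF16_LE` or `codecs.BOM_UTF16_BE`, depending on the platform's native byte order.

```python
import codecs


def boms_for(byteorder):
    """Rebuild the UTF-16 entries as they would look for a given byte order."""
    native = codecs.BOM_UTF16_LE if byteorder == "little" else codecs.BOM_UTF16_BE
    return [
        (native, "utf16"),                  # stands in for codecs.BOM_UTF16
        (codecs.BOM_UTF16_BE, "utf16-be"),
        (codecs.BOM_UTF16_LE, "utf16-le"),
    ]


def first_match(data, byteorder):
    """Return the encoding name of the first entry whose BOM prefixes data."""
    for bom, encoding in boms_for(byteorder):
        if data.startswith(bom):
            return encoding
    return None


data = b"\xff\xfeD\x00j\x00"
print(first_match(data, "little"))  # utf16
print(first_match(data, "big"))     # utf16-le
```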
To reproduce this problem on little-endian architectures, add a test_auto_decode_utf16_be test with:
def test_auto_decode_utf16_be(self):
    data = (
        b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
        b'=\x001\x00.\x004\x00.\x002\x00'
    )
    assert auto_decode(data) == "Django==1.4.2"
Or directly in the interpreter:
>>> data = (
... b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
... b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be
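One possible direction for a fix, offered only as a sketch under stated assumptions and not as pip's actual change: spell the codec names with hyphens and list only the explicit byte orders. `BOMS_SKETCH` and `auto_decode_sketch` are hypothetical names. The UTF-32 entries are placed before the UTF-16 ones, which differs from the original ordering, because `codecs.BOM_UTF32_LE` begins with the same two bytes as `codecs.BOM_UTF16_LE`.

```python
import codecs
import locale

# Hypothetical table: explicit byte orders, hyphenated codec names.
# UTF-32 comes first because BOM_UTF32_LE starts with the BOM_UTF16_LE bytes.
BOMS_SKETCH = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def auto_decode_sketch(data):
    """Decode bytes using a leading BOM, else the preferred locale encoding."""
    for bom, encoding in BOMS_SKETCH:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(locale.getpreferredencoding(False))


# Build sample payloads so the bytes after the BOM match the BOM's byte order.
little = codecs.BOM_UTF16_LE + "Django==1.4.2".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "Django==1.4.2".encode("utf-16-be")
print(auto_decode_sketch(little))
print(auto_decode_sketch(big))
```

Because the table no longer depends on `codecs.BOM_UTF16`, the lookup behaves the same on little-endian and big-endian hosts. Note that a big-endian test payload needs its code units in big-endian order after the BOM (`\x00D\x00j...`), which is why the sample above is generated with `.encode("utf-16-be")` rather than written by hand.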