LookupError: unknown encoding: utf16-le #6054

@hroncok

Environment

  • pip version: 18.1
  • Python version: 3.7.1
  • OS: Fedora 30 s390x

This bug manifests itself when the tests are run on a big-endian architecture.
However, it can be reproduced on little-endian machines as well.

Description

This is the test failure on s390x:

=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________
self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>
    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"
tests/unit/test_utils.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
    
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le
src/pip/_internal/utils/encoding.py:25: LookupError

Expected behavior

The tests should pass on all architectures alike.

How to Reproduce

  1. Get a big endian machine (virtualize maybe?)
  2. Run the tests.

More info

I've checked and pip has:

BOMS = [
    (codecs.BOM_UTF8, 'utf8'),
    (codecs.BOM_UTF16, 'utf16'),
    (codecs.BOM_UTF16_BE, 'utf16-be'),
    (codecs.BOM_UTF16_LE, 'utf16-le'),
    (codecs.BOM_UTF32, 'utf32'),
    (codecs.BOM_UTF32_BE, 'utf32-be'),
    (codecs.BOM_UTF32_LE, 'utf32-le'),
]

And:

for bom, encoding in BOMS:
    if data.startswith(bom):
        return data[len(bom):].decode(encoding)

So this raises two questions:

  • Why does this fail on a big-endian architecture and not on all architectures?
  • Why does pip try to use nonexistent encodings?

I have a small reproducer here (run on my machine, x86_64):

>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
... 
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

This is the output on s390x:

b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

Clearly, the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? Is the code never supposed to reach those entries anyway?
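
One way to check which spellings the codec registry accepts is `codecs.lookup`, which raises `LookupError` for unknown names. The sketch below probes the spellings from `BOMS` next to the hyphenated forms (`utf-16-le` and so on) that the Python documentation lists for these codecs. `codec_name` is a hypothetical helper for illustration, not part of pip.

```python
import codecs


def codec_name(spelling):
    """Return the canonical codec name for a spelling, or None if unknown."""
    try:
        return codecs.lookup(spelling).name
    except LookupError:
        return None


# Spellings from pip's BOMS table next to the documented hyphenated forms.
for spelling in ('utf16-le', 'utf-16-le', 'utf16-be', 'utf-16-be',
                 'utf32-le', 'utf-32-le', 'utf32-be', 'utf-32-be'):
    print(spelling, '->', codec_name(spelling))
```

If the hyphenated spellings resolve while the `BOMS` spellings print `None`, the failure is a naming problem in the table rather than a missing codec.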

The testing bytestring is:

b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

It starts with \xff\xfe and hence should be decoded by the first encoding in the list that has this BOM. On little endian, that is utf16: everything works, because we never reach the nonexistent encodings.

However, on a big-endian system the utf16 BOM is big endian (\xfe\xff), so the first item with the \xff\xfe BOM is utf16-le, and it blows up.
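
The first-match behavior described above can be simulated on any machine. This is a minimal sketch, assuming that the only platform-dependent entries are `codecs.BOM_UTF16` and `codecs.BOM_UTF32`, which take the native byte order. `boms_for` and `first_match` are hypothetical helpers, not pip functions.

```python
import codecs


def boms_for(byteorder):
    """Rebuild pip's BOMS table as it would look for the given byte order."""
    little = byteorder == 'little'
    native16 = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
    native32 = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE
    return [
        (codecs.BOM_UTF8, 'utf8'),
        (native16, 'utf16'),
        (codecs.BOM_UTF16_BE, 'utf16-be'),
        (codecs.BOM_UTF16_LE, 'utf16-le'),
        (native32, 'utf32'),
        (codecs.BOM_UTF32_BE, 'utf32-be'),
        (codecs.BOM_UTF32_LE, 'utf32-le'),
    ]


def first_match(data, boms):
    """Return the encoding name of the first entry whose BOM prefixes data."""
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


DATA = b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

print(first_match(DATA, boms_for('little')))  # utf16
print(first_match(DATA, boms_for('big')))     # utf16-le
```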

To reproduce this problem on little-endian architectures, add a test_auto_decode_utf16_be test with:

    def test_auto_decode_utf16_be(self):
        data = (
            b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
        assert auto_decode(data) == "Django==1.4.2"

>>> data = (
...     b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
...     b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be
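
One possible direction for a fix, sketched under assumptions and not taken from pip itself: spell the codecs with the hyphenated names and list only the explicit-endian variants, so the result no longer depends on the host byte order. `BOMS_SKETCH` and `auto_decode_sketch` are hypothetical names; the fallback follows the docstring of `auto_decode` quoted earlier.

```python
import codecs
import locale

# Hypothetical replacement table: explicit-endian codecs only, spelled with
# hyphenated names. UTF-32 comes first because BOM_UTF32_LE
# (b'\xff\xfe\x00\x00') starts with BOM_UTF16_LE (b'\xff\xfe').
BOMS_SKETCH = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]


def auto_decode_sketch(data):
    """Decode bytes using a leading BOM, else the locale's preferred encoding."""
    for bom, encoding in BOMS_SKETCH:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(locale.getpreferredencoding(False))


# Payloads encoded in the same byte order as their BOM.
le_data = codecs.BOM_UTF16_LE + 'Django==1.4.2'.encode('utf-16-le')
be_data = codecs.BOM_UTF16_BE + 'Django==1.4.2'.encode('utf-16-be')

print(auto_decode_sketch(le_data))
print(auto_decode_sketch(be_data))
```

Note that `be_data` is encoded big-endian after its BOM. The byte string in the reproducer above pairs a big-endian BOM with a little-endian payload, which is enough to trigger the `LookupError` but would not decode to "Django==1.4.2" under a working `utf-16-be` codec.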

Labels

  • C: encoding (related to text encoding and likely UnicodeErrors)
  • auto-locked (outdated issues that have been locked by automation)
  • type: bug (a confirmed bug or unintended behavior)
