
LookupError: unknown encoding: utf16-le #6054

Closed
hroncok opened this issue Nov 30, 2018 · 7 comments

@hroncok (Contributor) commented Nov 30, 2018

Environment

  • pip version: 18.1
  • Python version: 3.7.1
  • OS: Fedora 30 s390x

This is a bug that manifests on big-endian architectures when the tests are run.
However, it can be examined on little-endian systems as well.

Description

This is the test failure on s390x:

=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________
self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>
    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"
tests/unit/test_utils.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
    
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le
src/pip/_internal/utils/encoding.py:25: LookupError

Expected behavior

The tests should pass on all architectures alike.

How to Reproduce

  1. Get a big endian machine (virtualize maybe?)
  2. Run the tests.

More info

I've checked and pip has:

BOMS = [
    (codecs.BOM_UTF8, 'utf8'),
    (codecs.BOM_UTF16, 'utf16'),
    (codecs.BOM_UTF16_BE, 'utf16-be'),
    (codecs.BOM_UTF16_LE, 'utf16-le'),
    (codecs.BOM_UTF32, 'utf32'),
    (codecs.BOM_UTF32_BE, 'utf32-be'),
    (codecs.BOM_UTF32_LE, 'utf32-le'),
]

And:

for bom, encoding in BOMS:
    if data.startswith(bom):
        return data[len(bom):].decode(encoding)

So this has 2 problems:

  • why does this fail only on big-endian architectures and not on all?
  • pip tries to use nonexistent encodings

I have a small reproducer here (run on my machine, x86_64):

>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
... 
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

This is the output on s390x:

b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

Clearly, the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? Is the code simply never supposed to reach them?

The testing bytestring is:

b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

It starts with \xff\xfe and hence should be decoded by the first encoding that has this BOM. On little endian, that is utf16: everything works, and we never reach the nonexistent encodings.

However, on a big-endian system the utf16 BOM is big endian, so the first entry matching the \xff\xfe BOM is utf16-le, and it blows up.
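The platform dependence comes from codecs.BOM_UTF16 itself, which is defined as the BOM in the machine's native byte order. A small sketch to verify this on any host:

```python
import codecs
import sys

# codecs.BOM_UTF16 (and BOM_UTF32) are the byte-order marks in the
# *native* byte order, so the BOMS table differs between little- and
# big-endian hosts, which is why the match order changes on s390x.
if sys.byteorder == 'little':
    assert codecs.BOM_UTF16 == codecs.BOM_UTF16_LE == b'\xff\xfe'
else:
    assert codecs.BOM_UTF16 == codecs.BOM_UTF16_BE == b'\xfe\xff'

print(sys.byteorder, codecs.BOM_UTF16)
```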

To reproduce this problem on little-endian architectures, add a test_auto_decode_utf16_be test with:

    def test_auto_decode_utf16_be(self):
        data = (
            b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
        assert auto_decode(data) == "Django==1.4.2"
>>> data = (
...     b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
...     b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be
@hroncok (Contributor Author) commented Feb 28, 2019

Still happens on 19.x.

@cjerdonek (Member) commented Mar 1, 2019

It looks like that code was added in PR #3485. @xavfernandez, can you take a look?

@cjerdonek (Member) commented Mar 1, 2019

It looks like the fix might be as simple as changing utf16-be to utf-16-be and similarly for the others.

There should be a regression test to iterate over the BOMS list and check that its entries are valid.
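A minimal sketch of what such a regression test could look like (the test name is hypothetical; the corrected table uses the hyphenated spellings that Python's codec registry accepts):

```python
import codecs

# Proposed fix: hyphenated encoding names that codecs.lookup() recognizes.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF32, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
]

def test_all_boms_encodings_are_valid():
    # codecs.lookup() raises LookupError for an unknown encoding,
    # which is exactly the failure auto_decode() hit on s390x.
    for bom, encoding in BOMS:
        codecs.lookup(encoding)

test_all_boms_encodings_are_valid()
```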

@hroncok (Contributor Author) commented Mar 1, 2019

Indeed, utf-16-be seems to exist.

@hroncok (Contributor Author) commented Mar 1, 2019

I'll submit a PR with the fix and regression test.

hroncok added a commit to hroncok/pip that referenced this issue Mar 1, 2019

Fix utils.encoding.auto_decode() LookupError with invalid encodings
utils.encoding.auto_decode() was broken when decoding Big Endian BOM
byte-strings on Little Endian or vice versa.

The TestEncoding.test_auto_decode_utf16_le test was failing on Big Endian
systems, such as Fedora's s390x builders. A similar test, but with BE BOM
test_auto_decode_utf16_be was added in order to reproduce this on a Little
Endian system (which is much easier to come by).

A regression test was added to check that all listed encodings in
utils.encoding.BOMS are valid.

Fixes pypa#6054
@pfmoore (Member) commented Mar 1, 2019

The table of aliases here would seem to confirm that utf16-be isn't a valid alias (although utf-16be is...)
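This matches how Python's codec registry normalizes names: the name is lowercased and runs of punctuation collapse to underscores, so utf-16be maps to the registered alias utf_16be, while utf16-be maps to the unregistered utf16_be. A quick check:

```python
import codecs

# All of these spellings resolve to the same codec, whose canonical
# name is 'utf-16-be'.
for name in ('utf-16-be', 'utf_16_be', 'UTF-16BE', 'utf-16be'):
    assert codecs.lookup(name).name == 'utf-16-be'

# The spelling pip used is not among the registered aliases:
try:
    codecs.lookup('utf16-be')
except LookupError as e:
    print(e)  # unknown encoding: utf16-be
```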

@cjerdonek removed the needs triage label Mar 1, 2019
