Hypothesis: builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #153

Open

Description

@wsanchez (Contributor)

The Hypothesis strategies now shipping with Hyperlink are producing this error occasionally in Klein:

Traceback (most recent call last):
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/klein/test/test_request_compat.py", line 74, in test_uri
    def test_uri(self, url: DecodedURL) -> None:
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hypothesis/core.py", line 1163, in wrapped_test
    raise the_error_hypothesis_found
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/hypothesis.py", line 321, in decoded_urls
    return DecodedURL(draw(encoded_urls()))
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in path
    for p in self._url.path
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 2179, in <listcomp>
    for p in self._url.path
  File "/home/runner/work/klein/klein/.tox/coverage-py37-tw192/lib/python3.7/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

klein.test.test_request_compat.HTTPRequestWrappingIRequestTests.test_uri

Activity

wsanchez (Contributor, Author) commented on Jan 25, 2021

It would be helpful to catch this error and print the URL that produced it, so one might see what data is tripping us up.
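
A minimal sketch of that idea, using only the standard library rather than Hyperlink's own internals: wrap the decode step, and when it fails, re-raise with the offending bytes and the URL text attached. The names `URLDecodeError` and `decode_segment` are hypothetical and exist only for this illustration.

```python
from urllib.parse import unquote_to_bytes


class URLDecodeError(ValueError):
    """Hypothetical error type that carries the offending URL text."""


def decode_segment(segment: str, url_text: str) -> str:
    """Percent-decode one path segment, naming the URL if decoding fails."""
    raw = unquote_to_bytes(segment)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        # Attach both the raw bytes and the full URL so the failing
        # input is visible in the test log.
        raise URLDecodeError(
            f"cannot decode {raw!r} from URL {url_text!r}: {e}"
        ) from e


try:
    decode_segment("%80", "http://0.0/%80")
except URLDecodeError as e:
    message = str(e)
    print(message)
```

Chaining with `from e` keeps the original `UnicodeDecodeError` in the traceback, so no information is lost by wrapping.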

wsanchez (Contributor, Author) commented on Jan 29, 2021

Here are some failing examples:

error-causing bytes: b'\x80'
URL: URL.from_text('http://0.0/%80')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b'
URL: URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹pɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13𐬃\x94\x8e')
error-causing bytes: b'\xe1\x8c\x84\xc3\xa9\xf1\xb1\xa9\x9d\x9b0'
URL: URL.from_text('https://𐎹p1ɜ10貭.в.𢙑dɓ.ő𣫫á:51159/ጄé\U00071a5d%9b0/E7*\x13\U0004216a\x9d𠤈\x94\x8e')

wsanchez (Contributor, Author) commented on Jan 29, 2021

…which one can reproduce in the REPL:

>>> from hyperlink import EncodedURL, DecodedURL
>>> encodedURL = EncodedURL.from_text('http://0.0/%80')
>>> encodedURL
URL.from_text('http://0.0/%80')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
    [
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
    _percent_decode(p, raise_subencoding_exc=True)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> encodedURL = EncodedURL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> encodedURL
URL.from_text('https://ɓ.ő𣫫á:26/ጄé\U00071a5d%9b')
>>> DecodedURL(encodedURL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2046, in __init__
    self.host, self.userinfo, self.path, self.query, self.fragment
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2177, in path
    [
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 2178, in <listcomp>
    _percent_decode(p, raise_subencoding_exc=True)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/coverage-py38-twcurrent/lib/python3.8/site-packages/hyperlink/_url.py", line 766, in _percent_decode
    return unquoted_bytes.decode(subencoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 9: invalid start byte
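
The byte positions in the two errors above can be checked without Hyperlink. The sketch below assumes only the standard library: it percent-decodes the two failing path segments to bytes and records where UTF-8 decoding stops. In the second segment the three literal characters occupy 3 + 2 + 4 = 9 bytes of valid UTF-8, so the stray `%9b` lands at position 9.

```python
from urllib.parse import unquote_to_bytes

# Path segments from the two failing URLs; the second is written with
# escapes and is the same text as 'ጄé\U00071a5d%9b'.
segments = ["%80", "\u1304\u00e9\U00071a5d%9b"]

failures = []
for segment in segments:
    # Literal non-ASCII characters are encoded as UTF-8, and percent
    # escapes are turned into the raw bytes they name.
    raw = unquote_to_bytes(segment)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        failures.append((raw, e.start, raw[e.start]))

for raw, position, byte in failures:
    print(f"{raw!r}: byte {byte:#x} at position {position}")
```

Both 0x80 and 0x9b are UTF-8 continuation bytes, which are only valid after a lead byte; on their own they trigger the "invalid start byte" message.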

wsanchez (Contributor, Author) commented on Jan 29, 2021

@glyph @mahmoud I'm curious if you think this may suggest a bug in Hyperlink… that we have allowed the creation of an EncodedURL which cannot be decoded…?

self-assigned this
on Jan 29, 2021
glyph (Collaborator) commented on Jan 29, 2021

I think DecodedURL maybe has a bit of leeway with a URL like this to mangle it or make it not completely round-trip-able through every API. Browsers have to cope with this kind of a mess, and they definitely do some mangling. For example, if you try pasting https://example.com/%80é into Safari or Chrome, you get https://example.com/%80%C3%A9. Now, granted, that's a bit more like an EncodedURL, but you can deliver the percent-encoded text directly to the application in that case. Because if you manually delete the %80, you'll notice that you get https://example.com/é back again, visually.
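
The result described for that one input can be reproduced with the standard library. This is only a sketch of the observed behavior, not a claim about how Safari or Chrome work internally; `normalize_like_browser` is a hypothetical name.

```python
from urllib.parse import quote, unquote


def normalize_like_browser(path: str) -> str:
    """Percent-encode non-ASCII text but leave existing escapes alone."""
    # Listing "%" as safe keeps "%80" as-is instead of turning it
    # into "%2580"; "/" is kept so path separators survive.
    return quote(path, safe="/%")


normalized = normalize_like_browser("/%80\u00e9")
print(normalized)

# A lenient decode substitutes U+FFFD for the undecodable byte
# instead of raising.
lenient = unquote(normalized, errors="replace")
print(repr(lenient))
```

The lenient decode shows one form of mangling: the text is displayable, but the original `%80` byte cannot be recovered from it.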

glyph (Collaborator) commented on Jan 29, 2021

If you were to manipulate a busted URL like this, or manually create a copy via moving strings with DecodedURL, you'd get %2580%25C3%25A9 - but I think that's fine. Maybe there should be a switch about whether to raise or mangle on encoding errors when you create the object?
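
One way such a switch could look, sketched with the standard library only. The function `percent_decode` and its `on_error` parameter are hypothetical and are not Hyperlink's API; the "keep" mode hands back the still-encoded text when the bytes are not valid UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_decode(text: str, on_error: str = "raise") -> str:
    """Decode percent escapes; on bad UTF-8, either raise or keep the text."""
    if on_error not in ("raise", "keep"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if on_error == "raise":
            raise
        # "keep": treat the undecodable text as literal characters.
        return text


kept = percent_decode("%80%C3%A9", on_error="keep")

# Re-encoding the kept text as if it were decoded text escapes each
# "%" as "%25".
reencoded = quote(kept, safe="")
print(kept, "->", reencoded)
```

Running this prints the doubled escapes mentioned in the comment above: the kept text survives a round trip, but only in its double-encoded form.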

      Hypothesis: builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte · Issue #153 · python-hyper/hyperlink