"č" - UTF-8 UnicodeDecodeError #1674

Closed
jkatzer opened this Issue Oct 16, 2013 · 14 comments

jkatzer commented Oct 16, 2013

Traceback (most recent call last):
  File "pysnap/get_snaps.py", line 37, in <module>
    if not s.login().get('logged'):
  File "/Users/jason/github-random/pysnap/pysnap/pysnap.py", line 71, in login
    result = r.json()
  File "/Users/jason/github-random/pysnap/venv/lib/python2.7/site-packages/requests/models.py", line 650, in json
    return json.loads(self.content.decode(encoding), **kwargs)
  File "/Users/jason/github-random/pysnap/venv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 2134: invalid continuation byte

You can reproduce this by pulling the JSON version of this issue using requests and then calling r.json().

Owner

Lukasa commented Oct 16, 2013

What version of requests are you using? Using requests 2.0 in both Python 2.7 and Python 3.3 I can happily get the json version of this issue and call r.json(), no problems.

Owner

sigmavirus24 commented Oct 16, 2013

For context: (the original) pysnap seems to be a Snapchat API wrapper that is no longer on GitHub or PyPI. I'd be interested to see what version of requests it is looking for.

@jkatzer python -c 'import requests; print(requests.__version__)' will give you the version of requests you're using.

Owner

sigmavirus24 commented Oct 16, 2013

And I just noticed there's an issue on martinp/pysnap and that they're pinned to 1.2.3. I still have to wonder how that was happening there but ... okay.

jkatzer commented Oct 17, 2013

I updated to version 2.0 before submitting the issue. That stack trace is from 1.2.3; the 2.0.0 one is very similar. Is there a better way to dump the text to make sure the character is correct?

My friend's name has an accented e, but adding r.text put out the character in the title of this issue.

Owner

Lukasa commented Oct 17, 2013

Just to be clear: can you reproduce the bug in 2.0.0 using the JSON version of this issue?

Owner

sigmavirus24 commented Oct 20, 2013

@jkatzer can you tell us what self.encoding is when encountering this issue? Furthermore, can you tell us what guess_json_utf(self.content) returns? Something @mjpieters or @sburns contributed seems to be causing the issue here.

The stack trace seems to imply that self.encoding is None or some other falsey value (e.g., '') so we use guess_json_utf with self.content. self.content at that point is the raw bytes object we get from urllib3. So we use self.content.decode(encoding) which seems to be what's causing this issue. Judging by the stack trace (again) it seems that guess_json_utf is returning utf8.
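A minimal sketch of the failure described above, using only the standard library. The payload and the helper name `strict_json` are hypothetical stand-ins for the Snapchat response and for the decode-then-parse line in the traceback; the sketch assumes the body was really encoded with a single-byte Central European codec.

```python
import json

# Hypothetical payload: a JSON body encoded with Windows-1250 rather than a
# UTF codec, standing in for the response described in this issue.
payload = u'{"name": "\u010d"}'.encode("cp1250")


def strict_json(raw, encoding="utf-8"):
    """Decode raw bytes strictly, then parse, like the line in the traceback."""
    return json.loads(raw.decode(encoding))


try:
    strict_json(payload)
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # Show the single byte that the UTF-8 decoder rejected.
    print(repr(payload[error.start:error.start + 1]))
```

Decoding the same bytes with the codec that was actually used parses cleanly, which is why the guessed codec, not the JSON itself, is the suspect here.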

One other note is that on requests master (on python 2.7), when I use r.json() the title of this issue comes back replaced like so: u'"\u010d" - UTF-8 UnicodeDecodeError' which if I remember correctly is how the stdlib replaces errors and is a consequence of us always using errors='replace'. This suggests that the call to str.decode on line 692 needs an errors='replace' parameter passed in since that's what we do for self.text.
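For comparison, here is a small sketch of what errors='replace' does to the failing byte. The byte string is a hypothetical reconstruction of the mis-encoded title, not captured data. Note that the stdlib substitutes U+FFFD for undecodable bytes, whereas u'\u010d' is the escaped form of č itself.

```python
# Hypothetical mis-encoded title: the character sent as the single byte 0xE8.
raw = b'"\xe8" - UTF-8 UnicodeDecodeError'

# errors="replace" swaps the undecodable byte for U+FFFD (the replacement
# character) instead of raising UnicodeDecodeError.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the codec suspected in this thread recovers the real character.
intended = raw.decode("cp1250")

print(repr(replaced))
print(repr(intended))
```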

Objections? I feel like using that particular option is a bad idea but we'd break the API were we to change it now.

Contributor

mjpieters commented Oct 20, 2013

I'd say that is a bad idea. In this case, either the JSON was encoded with a non-standard codec (the RFC only allows UTF-8, UTF-16 or UTF-32) and sent without a charset parameter in the Content-Type header, or the data was encoded to UTF-8 by a faulty encoder.

I don't think requests should try to 'repair' either case; returning self.content.decode(encoding, errors='replace') will only mask such errors and result in more confusion, not less.

Contributor

mjpieters commented Oct 20, 2013

For cases where JSON has been encoded with a non-standard codec and no charset provided, I'd try to explicitly set response.encoding to a codec that does work for that response and leave it at that. Explicit is better than implicit.

In this case, I suspect a Windows Latin codepage variant was used instead of UTF-8; 0xE8 is è, not č in Latin-1. Codepage 1250 seems to fit instead. Set r.encoding = 'cp1250' before calling r.json() here and it could just work.

Or, (much) better still, fix the JSON generation to use a UTF codec instead.
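The codepage guess above can be checked with the standard library alone. This is a sketch: the candidate list is an assumption chosen for illustration (ISO-8859-2 is included as another Central European codec), not an exhaustive search.

```python
# The byte from the traceback, decoded with a few single-byte codecs.
byte = b"\xe8"
candidates = ["latin-1", "cp1252", "cp1250", "iso-8859-2"]

decoded = {name: byte.decode(name) for name in candidates}

# Codecs that map 0xE8 to U+010D (c with caron), the character in the title.
matches = [name for name in candidates if decoded[name] == u"\u010d"]

print(matches)
```

With a real response, the equivalent step would be setting r.encoding to one of the matching codecs before calling r.json(), as suggested above.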

Owner

sigmavirus24 commented Oct 21, 2013

> I don't think requests should try to 'repair' either case

I agree. But changing it now would result in what would likely be considered API breakage.

I wonder though, since self.encoding is None, why we don't fall back on the apparent_encoding. In this case, it guesses that the encoding is ISO-8859-2.

> Or, (much) better still, fix the JSON generation to use a UTF codec instead.

Unfortunately Snapchat is generating the JSON here so I doubt OP or anyone OP can contact can fix this.

Contributor

mjpieters commented Oct 21, 2013

It used to fall back to using self.text if decoding failed, but @kennethreitz removed that at some point, without explanation. guess_json_utf assumes that the content is correctly encoded with a UTF codec, and the exception handling was there to cover the edge cases where a non-RFC-compliant JSON response is received.
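To show why that assumption matters, here is a simplified sketch of the detection idea from RFC 4627 section 3: since the first two characters of a JSON text are ASCII, the position of NUL bytes in the first four bytes reveals the UTF codec. The function name `guess_utf_from_nulls` is hypothetical, and this is not the requests implementation (it ignores byte order marks, for instance).

```python
def guess_utf_from_nulls(data):
    """Guess a UTF codec from where NUL bytes fall in the first four bytes."""
    head = data[:4]
    if len(head) < 4:
        return None
    nulls = [b == 0 for b in bytearray(head)]
    if nulls == [True, True, True, False]:
        return "utf-32-be"
    if nulls == [False, True, True, True]:
        return "utf-32-le"
    if nulls == [True, False, True, False]:
        return "utf-16-be"
    if nulls == [False, True, False, True]:
        return "utf-16-le"
    if not any(nulls):
        # No NUL bytes at all: assumed to be UTF-8, even if it is really
        # some other 8-bit codec, which is the case in this issue.
        return "utf-8"
    return None
```

A body encoded with cp1250 contains no NUL bytes either, so a detector of this kind reports UTF-8 and the strict decode that follows fails.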

Contributor

mjpieters commented Oct 21, 2013

I think the first try suite should be reinstated for the 'guessed UTF' case:

    if not self.encoding and len(self.content) > 3:
        # No encoding set. JSON RFC 4627 section 3 states we should expect
        # UTF-8, -16 or -32. Detect which one to use; If the detection or
        # decoding fails, fall back to `self.text` (using chardet to make
        # a best guess).
        encoding = guess_json_utf(self.content)
        if encoding is not None:
            try:
                return json.loads(self.content.decode(encoding), **kwargs)
            except UnicodeDecodeError:
                # Wrong UTF codec detected, usually because the content is not
                # UTF-8 but some other 8-bit codec. This is an RFC violation, and
                # the server didn't bother to tell us what codec *was* used.
                pass
    return json.loads(self.text, **kwargs)
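The same try-then-fall-back pattern can be sketched outside of requests. In this sketch `loads_with_fallback` and its `fallback` parameter are hypothetical: the fixed fallback codec stands in for the character detection that `self.text` performs, which is not reproduced here.

```python
import json


def loads_with_fallback(raw, guessed="utf-8", fallback="cp1250"):
    """Try the guessed UTF codec first; on failure, retry with a fallback codec."""
    try:
        return json.loads(raw.decode(guessed))
    except UnicodeDecodeError:
        # The guess was wrong: the body is not valid in the guessed codec,
        # so decode it again with the fallback before parsing.
        return json.loads(raw.decode(fallback))


# Hypothetical bodies: one RFC-compliant, one encoded with Windows-1250.
good = u'{"name": "\u010d"}'.encode("utf-8")
bad = u'{"name": "\u010d"}'.encode("cp1250")

print(loads_with_fallback(good) == loads_with_fallback(bad))
```

Only UnicodeDecodeError is caught, so a body that decodes but is not valid JSON still raises, which keeps genuine parse errors visible.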

Owner

Lukasa commented Oct 21, 2013

That looks reasonable to me. I'd open a PR with that. =)

Owner

Lukasa commented Feb 3, 2014

Having not seen an associated PR, I'm marking this issue 'contributor friendly' and hoping someone will swing by and pick it up.

mjpieters added a commit to mjpieters/requests that referenced this issue Feb 3, 2014

Reinstate falling back to self.text for JSON responses
A JSON response that has no encoding specified will be decoded with a detected UTF codec (compliant with the JSON RFC), but if that fails, we guessed wrong and need to fall back to charade character detection (via `self.text`). Kenneth removed this functionality (by accident?) in 1451ba0; this reinstates it and adds a log warning.

Fixes #1674
5ee8b34

Contributor

mjpieters commented Feb 3, 2014

> Having not seen an associated PR, I'm marking this issue 'contributor friendly' and hoping someone will swing by and pick it up.

Mea Culpa; I created one now: #1900
