
Chardet sometimes fails and forces the wrong encoding #765

Closed
josemariaruiz opened this issue Aug 7, 2012 · 14 comments

Comments

@josemariaruiz

I have a problem that I've traced back to the "text" method/property of Response. Chardet is returning the wrong encoding, ISO-8859-2, when the string is UTF-8! The result is, as you can imagine, garbled strings everywhere you have something that is not ASCII. This is happening in production :(

What can I do?

You are shipping chardet with requests :(

@turicas

turicas commented Aug 7, 2012

@josemariaruiz, do you have sample strings to reproduce the bug?

@josemariaruiz
Author

I meant to delete a comment and accidentally closed the bug, sorry! The string contains my personal data... can I send it to someone by email? I don't want to see it published everywhere :-/

@turicas

turicas commented Aug 7, 2012

@josemariaruiz, reduce it to the characters that are creating the problem and post the representation of the object here (repr(my_string)).

@Lukasa
Member

Lukasa commented Aug 7, 2012

In the meantime, if you want a work-around, you can use the Response.content property and decode it yourself. This will work as long as you know that you're receiving UTF-8.

In short, instead of:

r = requests.get('http://example.com/')
print r.text

use:

r = requests.get('http://example.com/')
print r.content.decode('utf-8')
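Decoding the raw bytes yourself works the same way whether you want the text or want to parse it as JSON afterwards. A minimal sketch, with a made-up body literal standing in for the real response:

```python
import json

# Simulated raw response body; this assumes the server really sent UTF-8.
content = '{"name": ["Jos\u00e9 Mar\u00eda Ruiz"]}'.encode('utf-8')

# Decode explicitly, then parse, bypassing requests' encoding detection.
data = json.loads(content.decode('utf-8'))
print(data['name'][0])  # José María Ruiz
```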

@josemariaruiz
Author

Lukasa, this is exactly what I'm doing now: using r.content and json.loads().

The string is my own name: "José María Ruiz" (this is like in http://xkcd.com/327/)

>>> repr(response.content)
 '...."name":["Jos\\xc3\\xa9 Mar\\xc3\\xada Ruiz"]....'
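Those escaped bytes are in fact valid UTF-8: \xc3\xa9 decodes to 'é' and \xc3\xad to 'í'. A quick check using the byte literal from the repr above:

```python
# The escaped bytes from the repr, as a bytes literal.
raw = b'Jos\xc3\xa9 Mar\xc3\xada Ruiz'

# They decode cleanly as UTF-8, confirming the payload itself is fine.
print(raw.decode('utf-8'))  # José María Ruiz
```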

@josemariaruiz
Author

After installing the latest chardet version with pip I ran this test (the webservice is firewalled, sorry):

$ curl "WEBSERVICE URL" | chardet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100   967  100   967    0     0   1968      0 --:--:-- --:--:-- --:--:--  2571
<stdin>: ISO-8859-2 (confidence: 0.82)

If I dump the data to a file and use file:

$ file dump.txt 
dump.txt: UTF-8 Unicode text, with very long lines, with no line terminators

The problem is related to the file as a whole, not to my name. The responses that cause the problem contain my name once; when the name appears twice, the detected encoding is UTF-8.

@kennethreitz
Contributor

Requests uses ISO-8859-1 when the content-type begins with text, as the RFC states.

@kennethreitz
Contributor

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
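The rule quoted above can be sketched as a small helper: honor an explicit charset parameter when present, default text/* subtypes to ISO-8859-1, and return nothing otherwise — the last case being where requests falls back to chardet. This is an illustrative function, not requests' actual code:

```python
def charset_from_content_type(content_type):
    """Return the charset implied by a Content-Type header, per RFC 2616."""
    # An explicit charset parameter always wins.
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset':
            return value.strip('"\'')
    # text/* defaults to ISO-8859-1 over HTTP when no charset is given.
    if content_type.strip().lower().startswith('text/'):
        return 'ISO-8859-1'
    # Anything else (e.g. application/json) has no default here.
    return None

print(charset_from_content_type('text/html; charset=utf-8'))  # utf-8
print(charset_from_content_type('text/plain'))                # ISO-8859-1
print(charset_from_content_type('application/json'))          # None
```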

@josemariaruiz
Author

Hi Kenneth this is a dump of the headers from Amazon WS CloudSearch (the service I'm requesting):

{'connection': 'keep-alive',
 'content-length': '5344',
 'content-type': 'application/json',
 'date': 'Wed, 08 Aug 2012 08:39:36 GMT',
 'server': 'Server'}

Does that content-type begin with "text"? And the problem is not ISO-8859-1, but that the code then uses chardet and assigns ISO-8859-2 as the encoding!

@kennethreitz
Contributor

Ah, sorry I misread.

Unfortunately, this detection cannot be improved at this time. Amazon should really be providing their charset in the headers.

Luckily, you can actually set the value of encoding yourself to suit your needs.

@mjpieters
Contributor

Note that JSON is always supposed to be encoded in one of the UTF encodings; UTF-8 is the default, but -16 and -32 are allowed too.

The improvement, then, is to not use self.text (which uses chardet, and guesses wrong when no encoding has been set) but to detect the UTF encoding directly; the first 4 bytes are enough to determine this. Section 3 of the JSON RFC even tells you how to do it.

I'd say: when no encoding has been set, use self.content and use the RFC rules to detect the encoding used and pass that to the json.loads function. If that raises an error, or self.encoding has been set, fall back to self.text.

(Correction: json.loads only handles UTF-8 or Unicode objects).
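The first-four-bytes trick works because the first two characters of any JSON text are ASCII, so the positions of NUL bytes reveal the UTF family. A sketch of that heuristic (a hypothetical helper written from the RFC's description, not the code that landed in requests; BOM handling is omitted):

```python
def guess_json_utf(data):
    """Guess the UTF encoding of a JSON byte string per RFC 4627 section 3."""
    head = data[:4]
    if head.startswith(b'\x00\x00\x00'):
        return 'utf-32-be'   # 00 00 00 xx
    if head.startswith(b'\x00'):
        return 'utf-16-be'   # 00 xx 00 xx
    if head[1:4] == b'\x00\x00\x00':
        return 'utf-32-le'   # xx 00 00 00
    if head[1:2] == b'\x00':
        return 'utf-16-le'   # xx 00 xx 00
    return 'utf-8'           # no NULs in the first bytes

print(guess_json_utf('{"a": 1}'.encode('utf-16-le')))  # utf-16-le
print(guess_json_utf('{"a": 1}'.encode('utf-8')))      # utf-8
```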

@mjpieters
Contributor

Useful test case: it does not specify an encoding, the contained JSON is correctly encoded as UTF-8, but chardet pegs it as ISO-8859-2, and the end result is that you get mis-decoded content in .json. Using json.loads(r.content) gives correct results.

Work-around: set .encoding to UTF-8 before accessing .json.
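The mis-decoding is easy to reproduce locally: ISO-8859-2 assigns a character to every byte value, so decoding UTF-8 bytes with it silently produces mojibake instead of raising an error, which is why the bug goes unnoticed until the strings are displayed. A self-contained illustration:

```python
data = 'Jos\u00e9 Mar\u00eda Ruiz'.encode('utf-8')

# Wrong: chardet's ISO-8859-2 guess decodes without error but mangles the text.
wrong = data.decode('iso-8859-2')
# Right: the bytes were UTF-8 all along.
right = data.decode('utf-8')

print(wrong)  # mojibake
print(right)  # José María Ruiz
```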

@mjpieters
Contributor

0.14.2 was just released, which includes my proposed JSON UTF handling code from pull request #909. That release will now automatically detect what UTF encoding was used for a JSON response without an encoding set.

@kennethreitz
Contributor

✨ 🍰 ✨

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021

5 participants