
Chardet sometimes fails and forces the wrong encoding #765

Closed
josemariaruiz opened this issue Aug 7, 2012 · 14 comments

Comments

@josemariaruiz

I have a problem that I've traced back to the "text" method/property of Response. Chardet is returning the wrong encoding, ISO-8859-2, when the string is UTF-8! The result is, as you can imagine, garbled strings everywhere you have something that is not ASCII. This is happening in production :(

What can I do?

You are shipping chardet with requests :(

@turicas

turicas commented Aug 7, 2012

@josemariaruiz, do you have sample strings to reproduce the bug?

@josemariaruiz
Author

I meant to delete a comment and accidentally closed the bug, sorry! The string contains my personal data... can I send it to someone by email? I don't want to see it published everywhere :-/

@turicas

turicas commented Aug 7, 2012

@josemariaruiz, reduce it to the characters that are creating the problem and post the representation of the object here (repr(my_string)).

@Lukasa
Member

Lukasa commented Aug 7, 2012

In the meantime, if you want a work-around, you can use the Response.content property and decode it yourself. This will work as long as you know that you're receiving UTF-8.

In short, instead of:

r = requests.get('http://example.com/')
print r.text

use:

r = requests.get('http://example.com/')
print r.content.decode('utf-8')
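Decoding the raw bytes yourself works the same way whether you want the text or want to parse it as JSON afterwards. A minimal sketch, with a made-up body literal standing in for the real response:

```python
import json

# Simulated raw response body; this assumes the server really sent UTF-8.
content = '{"name": ["Jos\u00e9 Mar\u00eda Ruiz"]}'.encode('utf-8')

# Decode explicitly, then parse, bypassing requests' encoding detection.
data = json.loads(content.decode('utf-8'))
print(data['name'][0])  # José María Ruiz
```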

@josemariaruiz
Author

Lukasa, this is exactly what I'm doing now: using r.content and json.loads().

The string is my own name: "José María Ruiz" (this is like in http://xkcd.com/327/)

>>> repr(response.content)
 '...."name":["Jos\\xc3\\xa9 Mar\\xc3\\xada Ruiz"]....'
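Those escaped bytes are in fact valid UTF-8: \xc3\xa9 decodes to 'é' and \xc3\xad to 'í'. A quick check using the byte literal from the repr above:

```python
# The escaped bytes from the repr, as a bytes literal.
raw = b'Jos\xc3\xa9 Mar\xc3\xada Ruiz'

# They decode cleanly as UTF-8, confirming the payload itself is fine.
print(raw.decode('utf-8'))  # José María Ruiz
```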

@josemariaruiz
Author

After installing the latest chardet version with pip I ran this test (the webservice is firewalled, sorry):

$ curl "WEBSERVICE URL" | chardet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100   967  100   967    0     0   1968      0 --:--:-- --:--:-- --:--:--  2571
<stdin>: ISO-8859-2 (confidence: 0.82)

If I dump the data to a file and use file:

$ file dump.txt 
dump.txt: UTF-8 Unicode text, with very long lines, with no line terminators

The problem is related to the file as a whole, not to my name. The responses that cause the problem contain my name once; when the name appears twice, the detected encoding is UTF-8.

@kennethreitz
Contributor

Requests uses ISO-8859-1 when the content-type begins with text, as the RFC states.

@kennethreitz
Contributor

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
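The rule quoted above can be sketched as a small helper: honor an explicit charset parameter when present, default text/* subtypes to ISO-8859-1, and return nothing otherwise — the last case being where requests falls back to chardet. This is an illustrative function, not requests' actual code:

```python
def charset_from_content_type(content_type):
    """Return the charset implied by a Content-Type header, per RFC 2616."""
    # An explicit charset parameter always wins.
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset':
            return value.strip('"\'')
    # text/* defaults to ISO-8859-1 over HTTP when no charset is given.
    if content_type.strip().lower().startswith('text/'):
        return 'ISO-8859-1'
    # Anything else (e.g. application/json) has no default here.
    return None

print(charset_from_content_type('text/html; charset=utf-8'))  # utf-8
print(charset_from_content_type('text/plain'))                # ISO-8859-1
print(charset_from_content_type('application/json'))          # None
```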

@josemariaruiz
Author

Hi Kenneth this is a dump of the headers from Amazon WS CloudSearch (the service I'm requesting):

{'connection': 'keep-alive',
 'content-length': '5344',
 'content-type': 'application/json',
 'date': 'Wed, 08 Aug 2012 08:39:36 GMT',
 'server': 'Server'}

Does that content-type begin with "text"? And the problem is not ISO-8859-1, but that the code then uses chardet and assigns ISO-8859-2 as the encoding!

@kennethreitz
Contributor

Ah, sorry I misread.

Unfortunately, this detection cannot be improved at this time. Amazon should really be providing their charset in the headers.

Luckily, you can actually set the value of encoding yourself to suit your needs.

@mjpieters
Contributor

Note that JSON is always supposed to be encoded in one of the UTF encodings; UTF-8 is the default, but -16 and -32 are allowed too.

The improvement, then, is to not use self.text (which uses chardet, and guesses wrong when no encoding has been set) but to detect the UTF encoding directly; the first 4 bytes are enough to determine this. Section 3 of the JSON RFC even tells you how to do it.

I'd say: when no encoding has been set, use self.content and use the RFC rules to detect the encoding used and pass that to the json.loads function. If that raises an error, or self.encoding has been set, fall back to self.text.

(Correction: json.loads only handles UTF-8 or Unicode objects).
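The first-four-bytes trick works because the first two characters of any JSON text are ASCII, so the positions of NUL bytes reveal the UTF family. A sketch of that heuristic (a hypothetical helper written from the RFC's description, not the code that landed in requests; BOM handling is omitted):

```python
def guess_json_utf(data):
    """Guess the UTF encoding of a JSON byte string per RFC 4627 section 3."""
    head = data[:4]
    if head.startswith(b'\x00\x00\x00'):
        return 'utf-32-be'   # 00 00 00 xx
    if head.startswith(b'\x00'):
        return 'utf-16-be'   # 00 xx 00 xx
    if head[1:4] == b'\x00\x00\x00':
        return 'utf-32-le'   # xx 00 00 00
    if head[1:2] == b'\x00':
        return 'utf-16-le'   # xx 00 xx 00
    return 'utf-8'           # no NULs in the first bytes

print(guess_json_utf('{"a": 1}'.encode('utf-16-le')))  # utf-16-le
print(guess_json_utf('{"a": 1}'.encode('utf-8')))      # utf-8
```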

@mjpieters
Contributor

Useful test case: it does not specify an encoding, the contained JSON is correctly encoded as UTF-8, but chardet pegs it as ISO-8859-2, and the end result is that you get mis-decoded content in .json. Using json.loads(r.content) gives correct results.

Work-around: set .encoding to UTF-8 before accessing .json.
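The mis-decoding is easy to reproduce locally: ISO-8859-2 assigns a character to every byte value, so decoding UTF-8 bytes with it silently produces mojibake instead of raising an error, which is why the bug goes unnoticed until the strings are displayed. A self-contained illustration:

```python
data = 'Jos\u00e9 Mar\u00eda Ruiz'.encode('utf-8')

# Wrong: chardet's ISO-8859-2 guess decodes without error but mangles the text.
wrong = data.decode('iso-8859-2')
# Right: the bytes were UTF-8 all along.
right = data.decode('utf-8')

print(wrong)  # mojibake
print(right)  # José María Ruiz
```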

@mjpieters
Contributor

0.14.2 was just released, which includes my proposed JSON UTF handling code from pull request #909. That release will now automatically detect what UTF encoding was used for a JSON response without an encoding set.

@kennethreitz
Contributor

✨ 🍰 ✨

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2021

5 participants