Charset detection in text() can be pathologically slow #2359
    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library"""
        return chardet.detect(self.content)['encoding']
Unfortunately, chardet can be pathologically slow and memory-hungry to do its job. For example, processing the text property of a response with the following content:
causes python-requests to use 131.00 MB of memory and take 40+ seconds (for a 1 MB response!)
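For anyone who wants to check numbers like these on their own payload, here is a rough measurement sketch. The helper name `measure_detection` is made up for illustration, and it takes the detector as an argument so it works with any chardet-style callable. Note that `tracemalloc` only sees allocations made through Python's allocator and slows the run down noticeably, so treat the results as indicative only.

```python
import time
import tracemalloc


def measure_detection(detect, payload):
    """Return (result, elapsed_seconds, peak_bytes) for one detect(payload) call.

    `detect` is any callable that accepts bytes, e.g. chardet.detect.
    """
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = detect(payload)
        elapsed = time.perf_counter() - started
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical usage, assuming chardet is installed and `r` is a requests response:
#   import chardet
#   result, seconds, peak = measure_detection(chardet.detect, r.content)
#   print(result["encoding"], f"{seconds:.1f}s", f"{peak / 2**20:.1f} MiB")
```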
In case others are coming here, as I did, and wondering why cchardet isn't included in requests, well, I can provide an answer just gleaned by talking to one of the maintainers on IRC. There was, at one time, a conditional import for cchardet, but it has since been removed. I asked why. Two reasons. First, chardet and cchardet are not fully compatible and have different strengths and accuracies. So, having a conditional import means that requests wouldn't be deterministic. This is a very bad thing.
The second reason is that conditional imports have caused trouble elsewhere in requests, and the devs want to trim them down. I don't know the details here, exactly, but there's an import for simplejson that they say has caused trouble, so they're disinclined to add more conditional imports.
So, if you want faster processing and you're comfortable with the fact that cchardet can give different answers than chardet, you can do this in your code just before the first time you access
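One way to sketch that kind of workaround (not necessarily the exact snippet meant above): set `encoding` on the response yourself, so that the `text` property never falls back to `apparent_encoding`. The helper name `pin_encoding` is hypothetical. It assumes the detector follows the chardet-style interface, taking bytes and returning a dict with an `'encoding'` key, which is what `cchardet.detect` does.

```python
def pin_encoding(response, detect):
    """Set response.encoding from detect(response.content) if none was declared.

    `detect` is any chardet-style callable (bytes -> dict with an 'encoding'
    key), for example cchardet.detect. An encoding already set on the
    response is left untouched.
    """
    if response.encoding is None:
        guess = detect(response.content).get("encoding")
        if guess:
            # With encoding set, response.text decodes directly and does not
            # consult apparent_encoding (and therefore chardet).
            response.encoding = guess
    return response


# Hypothetical usage, assuming requests and cchardet are installed:
#   import cchardet
#   import requests
#   r = pin_encoding(requests.get(url), cchardet.detect)
#   body = r.text
```

If the detector returns no guess, `encoding` stays `None` and requests falls back to chardet as usual, so the slow path can still be hit for content cchardet cannot identify.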