Charset detection in text() can be pathologically slow #2359

Closed
marcocova opened this Issue Nov 27, 2014 · 9 comments

Comments

@marcocova

marcocova commented Nov 27, 2014

When response.text is accessed and no encoding was set or determined, requests relies on chardet to detect the encoding:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library"""
        return chardet.detect(self.content)['encoding']

Unfortunately, chardet can be pathologically slow and memory-hungry while doing its job. For example, processing the text property of a response with the following content:

    "a" * (1024 * 1024) + "\xa9"

causes python-requests to use 131 MB of memory and take 40+ seconds (for a 1 MB response!).
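
A minimal standalone sketch of that detection cost (assuming only that the chardet package is installed; the byte string mirrors the content above):

    # Timing sketch: one stray non-ASCII byte at the end of 1 MB of ASCII
    # forces chardet to scan the whole payload.
    import time
    import chardet

    content = b"a" * (1024 * 1024) + b"\xa9"

    start = time.time()
    result = chardet.detect(content)
    print(result, "in %.1f seconds" % (time.time() - start))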

@sigmavirus24

Member

sigmavirus24 commented Nov 27, 2014

We're aware of this. Do you have a proposal to replace it with something better?

@rsnair2

rsnair2 commented Nov 28, 2014

How about using cChardet?

@sigmavirus24

Member

sigmavirus24 commented Nov 28, 2014

We can't vendor anything that uses C extensions, so no, @rsnair2, we can't use cChardet.

@Terr

Terr commented Nov 28, 2014

Couldn't it be the solution in @marcocova's case, though? Instead of letting requests determine the charset, he could use cChardet himself and tell requests what charset to expect.

@sigmavirus24

Member

sigmavirus24 commented Nov 29, 2014

@Terr the solution of setting the encoding manually is well documented, so I would hope that @marcocova had considered that.
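
For anyone landing here, a minimal sketch of that documented workaround (the URL is just a placeholder):

    import requests

    r = requests.get("http://example.com/big-response")  # placeholder URL
    if r.encoding is None:
        # No charset in the Content-Type header: declare one yourself so
        # that accessing r.text never has to invoke chardet.
        r.encoding = "utf-8"
    text = r.text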

@marcocova

marcocova commented Dec 2, 2014

@sigmavirus24, yes, the workarounds are well understood. I was just concerned that the default behavior leaves the user exposed to this issue (with no warning in the docs, at least none that I could find).

@sigmavirus24

Member

sigmavirus24 commented Apr 7, 2015

Closing due to inactivity

@simion

simion commented Jul 1, 2015

@marcocova I had the same issue, then I found this: https://pypi.python.org/pypi/cchardet/
From 5 seconds (chardet), I went down to 1 millisecond :)
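
A rough sketch of that comparison (assumes both packages are installed; the payload mirrors the repro at the top of the thread):

    import time
    import chardet
    import cchardet

    content = b"a" * (1024 * 1024) + b"\xa9"

    # Same API, wildly different runtimes: cchardet wraps a C library.
    for detector in (chardet, cchardet):
        start = time.time()
        result = detector.detect(content)
        print(detector.__name__, result["encoding"],
              "%.3f seconds" % (time.time() - start))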

@mlissner

Contributor

mlissner commented Jul 30, 2015

In case others are coming here, as I did, and wondering why cchardet isn't included in requests, I can provide an answer gleaned from talking to one of the maintainers on IRC. There was, at one time, a conditional import for cchardet, but it has since been removed. I asked why. Two reasons. First, chardet and cchardet are not fully compatible and have different strengths and accuracies. So, having a conditional import means that requests wouldn't be deterministic. This is a very bad thing.

The second reason is that conditional imports are vaguely causing trouble in other areas of requests that the devs want to trim down. I don't know the details here, exactly, but there's an import for simplejson that they say has caused trouble, so they're disinclined to do more conditional imports.

So, if you want faster processing and you're comfortable with the fact that cchardet can return different results than chardet, you can do this in your code just before the first time you access r.text:

    import cchardet

    if r.encoding is None:
        # Requests detects the encoding when the item is GET'ed using
        # HTTP headers, and then when r.text is accessed, if the encoding
        # hasn't been set by that point. By setting the encoding here, we
        # ensure that it's done by cchardet, if it hasn't been done with
        # HTTP headers. This way it is done before r.text is accessed
        # (which would do it with vanilla chardet). This is a big
        # performance boon.
        r.encoding = cchardet.detect(r.content)['encoding']
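
One caveat: detect() may report None for the encoding on ambiguous or empty input, in which case r.encoding stays unset and requests will still fall back to chardet when r.text is read. A defensive variant of the last line:

    detected = cchardet.detect(r.content)['encoding']
    if detected:  # detect() may return None on ambiguous or empty input
        r.encoding = detected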

stevenschlansker added a commit to opentable/nagios-mesos that referenced this issue Oct 8, 2015

mhils added a commit to mitmproxy/netlib that referenced this issue Nov 14, 2015

improve http message content semantics
This commit improves the semantics of request and response contents.
We now have:

  - message.raw_content: The unmodified, raw content as seen on the wire
  - message.content: The HTTP Content-Type-decoded content. That is, gzip
    or deflate compression is removed.
  - message.text: The HTTP Content-Type-decoded and charset-decoded
    content. This also takes into account the text encoding and returns a
    unicode string.

Both .content and .text are engineered in a way that they shall not fail
under any circumstances. If there is an invalid HTTP Content-Type,
accessing `.content` will return the verbatim `.raw_content`.
For the text encoding, we first try the provided charset and then
immediately fall back to surrogate-escaped utf8. Using chardet would be an
alternative, but chardet is terribly slow in some cases
(requests/requests#2359).
While `surrogateescape` may cause issues at other places
(http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/),
it nicely preserves "everything weird" if we apply mitmproxy replacements -
this property probably justifies keeping it.
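
A rough sketch of that fallback strategy (the function name is illustrative, not netlib's actual API):

    def decode_body(content, charset):
        # First try the charset advertised by the message, if any.
        if charset:
            try:
                return content.decode(charset)
            except (LookupError, UnicodeDecodeError):
                pass  # unknown or lying charset header
        # Fall back to surrogate-escaped UTF-8, which round-trips arbitrary
        # bytes instead of raising on undecodable input.
        return content.decode('utf-8', errors='surrogateescape')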

GregBowyer added a commit to URXtech/dd-agent that referenced this issue Jan 8, 2016

Mesos: Prevent requests from using chardet
Requests internally (and irritatingly, imho) bundles `chardet`. It uses this library to figure out the encoding for a given response if there are no suitable HTTP headers detailing the character set of the response.

This library is pathologically slow (it's loop-heavy) on any Python implementation _not_ running on top of a JIT VM. There are better implementations of `chardet` for vanilla Pythons, but the requests library authors wish to avoid bundling C extensions. requests/requests#2359

Taking a leaf from opentable, we now force the encoding of the mesos response to always be UTF-8, which avoids requests pulling large mesos payloads through `chardet` and prevents the datadog agent from taking large amounts of CPU time to run the check.


kostas0 pushed a commit to DataDog/dd-agent that referenced this issue Feb 11, 2016

Mesos: Prevent requests from using chardet