Adding helper function to HTTPResponse for getting charset of response body #776

Open
wants to merge 1 commit into from

4 participants

@derekchiang

The body of HTTPResponse is bytes in Python 3, so I had to decode it before I could use it with other functions that expect a string object. However, I couldn't find an easy way to get the charset of the response body; I had to inspect the header manually. I therefore figured that adding a helper function might be useful.

@MartinMartimeo

Not all browsers send the charset in content-type, so I think there should be a fallback to 'latin1' or a better exception than IndexError.
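To make the IndexError concrete, here is a minimal sketch that applies the same parsing expression as the patch to a bare header string. The standalone function name `naive_charset` is hypothetical and exists only for this illustration; it is not part of Tornado.

```python
def naive_charset(content_type):
    # Same expression as the patch under review, applied to a plain string.
    return content_type.split(';')[1].split('=')[1]


# Works when a charset parameter directly follows the media type.
ok = naive_charset('application/json;charset=utf-8')

# With no parameters at all, split(';') yields a single element,
# so indexing [1] raises IndexError.
try:
    naive_charset('text/html')
    missing_raises = False
except IndexError:
    missing_raises = True
```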

@bdarnell
Owner

In addition to Martin's concern about missing charsets (is latin1 really the right default?), it is possible to include parameters other than charset so it's not correct to just look for the first equals sign.

The charset doesn't make much sense without the actual content-type, so maybe this function should return a (content_type, charset) tuple, or there should be a companion function to return the content type. The same construct is used elsewhere in HTTP (at least in the Accept header, where the interesting parameter is "q" instead of "charset"), so it might be nice if there was a more generic interface to parse this syntax (the mimetools and email modules have some code to do this but they don't appear to be easy to use in isolation).
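As a rough sketch of what a more generic interface could look like, the standard library's `email.message.Message` can be used to split a header value into its main value and a dictionary of parameters. The function name `parse_header_params` is hypothetical, and this only handles a single value (not a comma-separated list such as a full Accept header).

```python
from email.message import Message


def parse_header_params(value):
    """Split 'main/value; k1=v1; k2=v2' into (main_value, params_dict).

    Hypothetical helper for illustration; not part of Tornado.
    """
    msg = Message()
    msg['Content-Type'] = value
    # get_params() returns [(main_value, ''), (key, value), ...]
    params = msg.get_params(failobj=[])
    main = params[0][0].lower() if params else ''
    rest = {key.lower(): val for key, val in params[1:]}
    return main, rest


content_type, params = parse_header_params('text/html; boundary=x; charset=UTF-8')
charset = params.get('charset')  # found even though it is not the first parameter
```

Returning the media type together with the parameters lets the caller choose a default charset that suits the content type.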

@MartinMartimeo

I looked it up now: RFC 2045 says us-ascii is the default.

In another place I parse the content-charset via:

def get_content_encoding(self):
    content_type_args = {k.strip(): v for k, v in parse_qs(self.request.headers['Content-Type']).items()}
    if 'charset' in content_type_args and content_type_args['charset']:
        return content_type_args['charset'][0]
    else:
        return 'latin1'

But I think that is not beautiful at all either.

@bdarnell
Owner

The standards are complicated.

RFC 2045 (MIME) says that if there is no content-type at all, the default is "text/plain; charset=us-ascii". It doesn't say what should happen if a text/* content-type does not have a charset (although using us-ascii is a reasonable inference).

HTTP 1.1 (RFC 2616) said that the default is latin1 (section 3.4.1), although this is being dropped in the httpbis revisions (appendix B of http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-22). HTTPbis defers to the media type definition.

The media type definition for html is RFC 2854, which points to both of the previous documents and notes that most browsers ignore the specs and use a different default anyway. HTML5 defines a complicated algorithm for guessing and recommends utf8 as the most sane and universal default.

In general, I prefer a default of utf8. In addition to being a complete encoding of unicode, if a byte string can be decoded as utf8 without error it is very likely that this is the correct result. Decoding as latin1 will never raise an exception but will return incorrect results if it is not the intended encoding.
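The asymmetry between the two defaults can be seen in a small sketch (the sample string is arbitrary): decoding UTF-8 bytes as latin1 silently produces the wrong text, while decoding latin1 bytes as UTF-8 fails loudly.

```python
text = 'café'
utf8_bytes = text.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = text.encode('latin-1')  # b'caf\xe9'

# latin1 maps every byte to a character, so this never raises,
# but the result is not the original text.
wrong = utf8_bytes.decode('latin-1')

# The lone 0xe9 byte is not valid UTF-8, so the mismatch is detected.
try:
    latin1_bytes.decode('utf-8')
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
```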

@MartinMartimeo

From my experience with old Python, I find unicode support in Python <= 2.7 still a little bit ugly and would prefer the utf8 default only in Python >= 3, where strings are real unicode anyway. In Python <= 2.7 I would stick with latin1 or us-ascii encoded str objects (with the added benefit that they will not break existing code relying on str objects; I think there is still a lot of code out there that checks only for str instead of basestring) and use utf8 encoding only in Python >= 3.

@bdarnell bdarnell added the httpclient label
@mjpieters

The correct way to parse out parameters from an HTTP header is to use cgi.parse_header(); it returns the content type plus a dictionary of parameters. You should probably return None if no charset was set, since the default the application should use depends on the exact content type (like UTF-8 for application/json). I'd accept a default instead:

def get_content_charset(self, default=None):
    content_type, params = cgi.parse_header(self.request.headers['Content-Type'])
    return params.get('charset', default)
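
A standalone variant of the same idea, as a sketch only: the `cgi` module was deprecated in Python 3.11 and removed in 3.13, so this version assumes `email.message.Message` from the standard library as a substitute. It takes the raw header string instead of `self`, which is an assumption made to keep the example self-contained.

```python
from email.message import Message


def get_content_charset(content_type_header, default=None):
    """Return the charset parameter (lowercased), or `default` if absent.

    Illustrative stand-in for the method above; not Tornado's API.
    """
    msg = Message()
    msg['Content-Type'] = content_type_header
    return msg.get_content_charset(failobj=default)


# The caller picks the default that suits the content type.
body = '{"name": "café"}'.encode('utf-8')
charset = get_content_charset('application/json', default='utf-8')
decoded = body.decode(charset)
```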
Showing with 8 additions and 0 deletions.
  1. +8 −0 tornado/httpclient.py
8 tornado/httpclient.py
@@ -414,6 +414,14 @@ def _get_body(self):
     body = property(_get_body)
+    def get_content_charset(self):
+        """Gets the charset of the response body"""
+        content_type = self.headers['Content-Type']
+
+        # Example: 'application/json;charset=utf-8' -> 'utf-8'
+        charset = content_type.split(';')[1].split('=')[1]
+        return charset
+
     def rethrow(self):
         """If there was an error on the request, raise an `HTTPError`."""
         if self.error: