Adding helper function to HTTPResponse for getting charset of response body #776

Open

The body of HTTPResponse is bytes in Python 3, so I had to decode it before I could pass it to functions that expect a string object. However, I couldn't find an easy way to get the charset of the response body and had to inspect the header manually. I figured that adding a helper function might be useful.

Contributor

MartinMartimeo commented May 9, 2013

Not all browsers send the charset in content-type, so I think there should be a fallback to 'latin1', or at least a better exception than IndexError.

Owner

bdarnell commented May 11, 2013

In addition to Martin's concern about missing charsets (is latin1 really the right default?), it is possible to include parameters other than charset so it's not correct to just look for the first equals sign.

The charset doesn't make much sense without the actual content-type, so maybe this function should return a (content_type, charset) tuple, or there should be a companion function to return the content type. The same construct is used elsewhere in HTTP (at least in the Accept header, where the interesting parameter is "q" instead of "charset"), so it might be nice if there was a more generic interface to parse this syntax (the mimetools and email modules have some code to do this but they don't appear to be easy to use in isolation).
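
As a rough sketch of the tuple-returning variant, the standard library's `email.message.Message` can be used in isolation to split a header value into content type and charset. The function name `parse_content_type` is hypothetical and not part of this PR; note that `get_content_charset()` lowercases the charset and returns None when none is given.

```python
from email.message import Message


def parse_content_type(header_value):
    """Return a (content_type, charset) tuple for a Content-Type header value.

    Hypothetical helper for illustration; charset is None when absent.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() drops the parameters; get_content_charset()
    # looks up the "charset" parameter regardless of its position.
    return msg.get_content_type(), msg.get_content_charset()


if __name__ == "__main__":
    print(parse_content_type("text/html; charset=UTF-8"))
    print(parse_content_type("application/json"))
```

Because the lookup is by parameter name, other parameters before or after `charset` do not confuse it, which addresses the "first equals sign" problem.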

Contributor

MartinMartimeo commented May 11, 2013

I looked it up now: RFC 2045 says us-ascii is the default.

In another place I parse the content-charset via:

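from urllib.parse import parse_qs  # Python 2: from urlparse import parse_qs

def get_content_encoding(self):
    # Relies on parse_qs splitting on ';' (newer Python versions split only on '&' by default);
    # keys are stripped because the parameter name keeps the space that follows the ';'.
    content_type_args = {k.strip(): v for k, v in parse_qs(self.request.headers['Content-Type']).items()}
    if 'charset' in content_type_args and content_type_args['charset']:
        return content_type_args['charset'][0]
    else:
        # No charset parameter found: fall back to latin1
        return 'latin1'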

But I think that is not beautiful at all either.

Owner

bdarnell commented May 11, 2013

The standards are complicated.

RFC 2045 (MIME) says that if there is no content-type at all, the default is "text/plain; charset=us-ascii". It doesn't say what should happen if a text/* content-type does not have a charset (although using us-ascii is a reasonable inference).

HTTP 1.1 (RFC 2616) said that the default is latin1 (section 3.4.1), although this is being dropped in the httpbis revisions (appendix B of http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-22). HTTPbis defers to the media type definition.

The media type definition for html is RFC 2854, which points to both of the previous documents and points out that most browsers ignore the specs and use a different default anyway. HTML5 defines a complicated algorithm for guessing and recommends utf8 as the most sane and universal default.

In general, I prefer a default of utf8. In addition to being a complete encoding of unicode, if a byte string can be decoded as utf8 without error it is very likely that this is the correct result. Decoding as latin1 will never raise an exception but will return incorrect results if it is not the intended encoding.
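
A minimal sketch of that reasoning, assuming a hypothetical `decode_body` helper (not part of this PR): honor a declared charset if there is one, otherwise try utf8 first because a failure there is detectable, and only then fall back to latin1, which never fails.

```python
def decode_body(body, charset=None):
    """Decode response bytes, preferring utf8 when no charset is declared.

    Hypothetical helper for illustration only.
    """
    if charset:
        return body.decode(charset)
    try:
        # Invalid utf8 raises, so a successful decode is very likely correct.
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every byte to a code point, so this cannot raise,
        # but the result may be wrong if the data was in another encoding.
        return body.decode("latin-1")


if __name__ == "__main__":
    print(decode_body(b"caf\xc3\xa9"))  # valid utf8
    print(decode_body(b"caf\xe9"))      # not valid utf8, falls back to latin1
```

The reverse order would not work: decoding utf8 bytes as latin1 succeeds silently and produces mojibake.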

Contributor

MartinMartimeo commented May 11, 2013

From my experience with old Python, I find unicode support in P<=2.7 still a little bit ugly and would prefer the utf8 default only in P>=3, where strings are real unicode anyway. In P<=2.7 I would stick with latin1 or us-ascii encoded str objects (with the added benefit that they will not break existing code relying on str objects; I think there is still a lot of code out there that just checks for str instead of basestring) and only use utf8 encoding in P>=3.

bdarnell added the httpclient label Jul 16, 2014

The correct way to parse out parameters from an HTTP header is to use cgi.parse_header(); it returns the content type plus a dictionary of parameters. You should probably return None if no charset was set; which default the application should use depends on the exact content type (like UTF-8 for application/json). I'd accept a default instead:

import cgi

def get_content_charset(self, default=None):
    content_type, params = cgi.parse_header(self.request.headers['Content-Type'])
    return params.get('charset', default)
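
To illustrate how an application might pick the default per content type, here is a small sketch. The names `FALLBACK_CHARSETS` and `resolve_charset` are hypothetical, and the mapping holds only the one example mentioned above. It takes an already-parsed content type and charset, so it does not depend on the cgi module (which was removed in Python 3.13).

```python
# Hypothetical per-content-type fallbacks; extend as the application needs.
FALLBACK_CHARSETS = {"application/json": "utf-8"}


def resolve_charset(content_type, charset, fallbacks=FALLBACK_CHARSETS):
    """Return the declared charset, else a content-type-specific fallback.

    Returns None when neither is available, leaving the decision to the caller.
    """
    if charset:
        return charset
    return fallbacks.get(content_type.lower())


if __name__ == "__main__":
    print(resolve_charset("application/json", None))
    print(resolve_charset("text/html", "iso-8859-1"))
    print(resolve_charset("image/png", None))
```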