Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

[Suggestion] Simplify charset handling #1737

Open
itsadok opened this Issue Nov 14, 2013 · 6 comments

Comments

Projects
None yet
3 participants

itsadok commented Nov 14, 2013

To cut to the chase, here are my suggestions:

  • Remove automatic character set detection (charade) from the library
  • Have response.encoding represent the charset from the Content-Type header
  • Mention the caveat in the documentation

Long version

There seems to be a lot of confusion regarding how the .text property works. After getting into some trouble with it myself, I searched the issues list, and found a dozen or so issues, all boiling down to the same mismatch between users' expectations and the intent of the library designers.
#147 - bytecode string returned when page has charset=UTF-8
#156 - get_unicode_from_response does not check charsets from meta tags
#592 - Internal encoding detection doesn't match external chardet call
#654 - requests.get() ignores charset=UTF-8 and BOM
#765 - Chardet sometimes fails and force the wrong encoding
#861 - parsing encoding utf-8 page doesn't as expected
#1087 - Encodings from content
#1150 - On some pages requests detect encoding incorrectly
#1546 - use a default encoding in Response's text property
#1589 - Make sure content is checked when setting encoding
#1604 - Response.text returns improperly decoded text
#1683 - models.text Behaviour (encoding choice)

(It must be tiring to have the same conversation over and over again. I hope I'm being helpful here and not just piling on).

The argument seems to be:

  • As an HTTP library, requests should not know or care about HTML and META attributes
  • RFC 2616 states that if no charset is defined, "text/*" media types should be regarded as ISO-8859-1

I accept both these arguments. However, the documentation seems a bit coy, saying that "Requests makes an educated guess about the encoding", implying chardet/charade. In practice, for any content with a "text" media subtype, charade will not be used unless the user explicitly sets the response.encoding to None before reading the .text property.

Additionally, while ISO-8859-1 can be used as a default, won't it make more sense to handle that default in .text and not in the get_encoding_from_headers method? This way, the encoding property will be None if indeed no encoding is specified, allowing the user to make the decision on how to proceed.

If you're going to keep the .text property, I think it should do a simple decoding if the charset is specified in the headers, and throw an exception otherwise. This way is much less confusing than the state of affairs now. Additionally, the documentation should contain a warning not to use it for arbitrary web pages, and perhaps a code sample showing the proper way to do it.

import re
import charade
import requests

def get_encodings_from_content(content):
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    # FIXME: Does not work in python 3
    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))

r = requests.get('https://example.com/page.html')
if "charset" not in r.headers.get("content-type", ""):
    encodings = get_encodings_from_content(r.content)
    if encodings:
        r.encoding = encodings[0]
    else:
        r.encoding = charade.detect(r.content)['encoding']

html = r.text
Owner

Lukasa commented Nov 14, 2013

You've got lots of good suggestions here. I'll respond to some of them. In no particular order:

  1. Response.text should never ever ever ever throw an exception if we can possibly avoid it. Properties throwing exceptions is bad. If we were going to start throwing exceptions from Response.text I'd want it to become Response.text().
  2. Your observation that charade will never get called for any content with a text/* MIME type is totally correct, and by design. RFC 2616 is incredibly clear on this point.
  3. I am open to moving the ISO-8859-1 default to Response.text and out of get_encoding_from_headers.
  4. I am not open to removing flexibility from Response.text.

Stripping all the functionality from Response.text, as you suggest in your last point, seems silly to me. If we're going that far, we should remove Response.text altogether.

Response.text has carte blanche to do whatever it can to correctly decode the output. There are points of finesse here, but that will always be the case. No matter what Response.text does, someone will disapprove.

Owner

sigmavirus24 commented Nov 14, 2013

@Lukasa you were tricked into saying this:

Stripping all the functionality from Response.text, as you suggest in your last point, seems silly to me. If we're going that far, we should remove Response.text altogether.

This is clearly the agenda of this issue as you can tell by:

If you're going to keep the .text property

@itsadok clearly wants the .text property to disappear because issues have been filed regarding it in the past.

Let me address one other thing that @Lukasa didn't before I add my opinion.

Additionally, the documentation should contain a warning not to use it for arbitrary web pages, and perhaps a code sample showing the proper way to do it.

  1. charade works fairly well for well established codecs. There are new ones that subsume old ones which it doesn't support yet. Why? There aren't publicly available statistics for those encodings and that's what charade relies on. If you disagree with how something is being detected, why not file a bug report on charade?
  2. That code sample is not the proper way to do it. Using regular expressions on HTML is insanity and is never the correct answer.

With that addressed, let me address one more theme of this issue: Because some negligible percentage of all issues have been filed about x, x should be (changed|removed).

One thing to note is that all the issues with numbers lower than 1000 were filed before requests 1.0 which is when the API was finalized. If there were legitimate bugs in this attribute prior to that, I would be far from surprised. Also some of those issues are instead about the choice that chardet/charade made. Those are not bugs in requests or .text but instead in the underlying support.

Finally, after the release of 1.0 we had a lot of issues about the change from json being a property on a Response object to becoming a method. We didn't remove it or change it back for a good reason. It was a deliberate design decision.

The .text attribute is quite crucial to this library, especially to the json method, and it will likely never be removed. Can it be improved? Almost certainly. You provided a couple of good suggestions, but the overall tone this issue is meant to convince the reader that it should be removed and that will not happen. Without a reasonable guess at the encoding of the text, we cannot provide the user with the json method which also will not go away. Simply, the user is not the sole consumer of .text.

@itsadok itsadok closed this Nov 14, 2013

@itsadok itsadok reopened this Nov 14, 2013

itsadok commented Nov 14, 2013

Sorry about the closing and reopening, that was a mis-click.

@sigmavirus24 I'm sorry if it seems like I have an agenda. I'm honestly just trying to help. I read through the discussions in all of the issues I posted. They all really seem to revolve around the same basic confusion, with people expecting Response.text to be an all-encompassing solution where in reality it is not.

Let me just clarify some misunderstandings in what I wrote.

Additionally, the documentation should contain a warning not to use it for arbitrary web pages, and perhaps a code sample showing the proper way to do it.
charade works fairly well for well established codecs. There are new ones that subsume old ones which it doesn't support yet. Why? There aren't publicly available statistics for those encodings and that's what charade relies on. If you disagree with how something is being detected, why not file a bug report on charade?

I merely meant that it should be noted that Response.text should not be used willy-nilly on arbitrary web page, precisely because it avoids using charade in many places where it can be used.

That code sample is not the proper way to do it. Using regular expressions on HTML is insanity and is never the correct answer.

The get_encodings_from_content function is fully copied from requests.utils, with the added comment by me that it doesn't really work in Python 3. In any case the point was that this needs to be clarified, not my specific solution.

The .text attribute is quite crucial to this library, especially to the json method, and it will likely never be removed. Can it be improved? Almost certainly. You provided a couple of good suggestions, but the overall tone this issue is meant to convince the reader that it should be removed and that will not happen. Without a reasonable guess at the encoding of the text, we cannot provide the user with the json method which also will not go away. Simply, the user is not the sole consumer of .text.

This is a valid point, and perhaps serves to explain the weird nature of .text. Perhaps all that is needed is a note in the documentation.

itsadok commented Nov 14, 2013

OK, mea culpa. I missed the part where Response.json relies almost entirely on Response.text for charset issues. That explains some of the design, and I would have phrased my post differently if I had noticed.

It seems like I hit a nerve, which was really not my intention. I'm going to close the issue, but there is one pun that I just have to make.

  1. Your observation that charade will never get called for any content with a text/* MIME type is totally correct, and by design. RFC 2616 is incredibly clear on this point.

The thing is, it seems that in most cases where you'd even want to access the .text property, you would also have a text/* MIME type (application/json being the exception). That means that the cases where automatic charset detection is actually used are pretty rare, and unpredictable for the user. So why keep the charade?

@itsadok itsadok closed this Nov 14, 2013

@Lukasa Lukasa reopened this Nov 14, 2013

Owner

sigmavirus24 commented Nov 14, 2013

I merely meant that it should be noted that Response.text should not be used willy-nilly on arbitrary web page, precisely because it avoids using charade in many places where it can be used.

Just because something can be used somewhere does not mean it should be. charade is slow as a result of its accuracy and painstaking meticulousness. Using it when it can be used as opposed to when it must be used makes the performance difference in the user's eyes.

The get_encodings_from_content function is fully copied from requests.utils

But we never use it. It is cruft and should be removed. It is the wrong way to do things.

That means that the cases where automatic charset detection is actually used are pretty rare, and unpredictable for the user. So why keep the charade?

You're assuming everything behaves the same way and that RFCs are followed by servers. They're not. Charade is occasionally used. We keep it because it is essentially part of the API. Response's couldn't have an apparent_encoding attribute if we discarded charade. We can break the API if we ever release 3.0 but until then, the API is frozen except for backwards compatible changes. Also, that's an excellent pun.

It seems like I hit a nerve, which was really not my intention. I'm going to close the issue

I'm going to re-open it. You made valid points as @Lukasa pointed out. It just needs to be clear that this is not any agreement about removing the property.

Owner

Lukasa commented Nov 14, 2013

Heh, @sigmavirus24 both have the same reactions. Neither of us thinks this issue should be closed: I reopened it just before he could!

Neither of us is angry, or unhappy about having this feedback. We're both delighted. You just can't tell because of the limitations of textual communication! =D

The reality is that text encoding is hard, everyone is doing the wrong thing from time to time, and we cannot possibly please anyone. For that reason, we do the best we can, and then we expose the Response.encoding property for people who don't like the way we do it.

You've identified good issues with Response.text, and I plan to fix them (unless someone else does so first). They're not all backward compatible, so we'll need to sit on them for a bit, but they're good.

@sigmavirus24 sigmavirus24 added this to the 3.0.0 milestone Oct 5, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment