latin-1 codec can't encode characters #2733
Actually, the warning doesn't matter if I use flask instead of tornado to handle the POST request. The warning is still there, but flask (request.values.get) correctly extracts the key-value pair. Tornado's get_argument function, however, can't extract the params correctly and raises "Missing argument data" whenever there is a codec warning.
|
This is indeed a very confusing corner of tornado. It's using latin1 because "any bytes can be considered latin1 so we can smuggle bytes through latin1". See also #2572.
Here's a change that fixes your problem, but it's not really the right fix, as I'll explain:
--- a/tornado/escape.py
+++ b/tornado/escape.py
@@ -161,7 +161,8 @@ def parse_qs_bytes(
     )
     encoded = {}
     for k, v in result.items():
-        encoded[k] = [i.encode("latin1") for i in v]
+        # qs was decoded utf-8, so values are encoded utf-8
+        encoded[k] = [i.encode("utf-8") for i in v]
     return encoded
What you're actually supposed to do, for
and your example will print as expected:
|
The "smuggle bytes through latin1" comment refers to decoding bytes as latin1 (which is done by passing latin1 to Line 786 in 8e5ecad
This is a tricky case because the input is malformed (all the non-ASCII bytes should be percent-escaped), but there is a reasonable interpretation of the data (it works in flask, and I think it probably worked in tornado in Python 2). It's difficult to fix this while both using the stdlib's
If you need to support this malformed input, the simplest solution is probably for you to call urllib.parse.parse_qs yourself instead of using tornado's functions. |
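To make the latin1 discussion above concrete, here is a minimal stdlib-only sketch. It does not reproduce tornado's code path line for line; the `body` value is a hypothetical request body, and the steps only imitate the decode-then-encode sequence described in this thread. It shows that a latin1 round trip of raw bytes is lossless, that encoding real Chinese characters to latin1 fails with the error from the issue title, and that calling `urllib.parse.parse_qs` yourself on the utf-8 text gives the expected result.

```python
from urllib.parse import parse_qs

# Hypothetical body: non-ASCII bytes sent raw instead of percent-escaped.
body = "data=中文".encode("utf-8")

# latin1 maps each byte 0-255 to exactly one code point,
# so decoding and re-encoding arbitrary bytes is lossless.
assert body.decode("latin1").encode("latin1") == body

# Decoding the body as utf-8 first yields real Chinese characters,
# which cannot be encoded back to latin1.
parsed = parse_qs(body.decode("utf-8"), keep_blank_values=True, encoding="latin1")
try:
    [v.encode("latin1") for v in parsed["data"]]
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Parsing it yourself with the stdlib, treating the body as utf-8 text.
manual = parse_qs(body.decode("utf-8"), keep_blank_values=True)

print(error_message)
print(manual)
```

Parsing the body yourself means choosing the text encoding up front, so this only helps when you know (or can guess) what the clients send.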
I've got another idea here (maybe newly practical now that tornado is Python 3 only?):
--- a/tornado/httputil.py
+++ b/tornado/httputil.py
@@ -784,7 +784,7 @@ def parse_body_arguments(
         try:
-            uri_arguments = parse_qs_bytes(native_str(body), keep_blank_values=True)
+            uri_arguments = parse_qs_bytes(body.decode('latin1'), keep_blank_values=True)
         except Exception as e:
             gen_log.warning("Invalid x-www-form-urlencoded body: %s", e)
Tested with utf-8 with and without url encoding, and then tested with the "gbk" encoding with the following addition to the example:
class DataHandler(tornado.web.RequestHandler):
    def decode_argument(self, value, name=None):
        return value.decode('gbk')
|
Ah, that looks like it should work. |
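The idea behind the patch can be tried without tornado. The sketch below uses a hypothetical helper, `parse_form_bytes`, that imitates the patched behaviour with the stdlib only: decode the body as latin1, parse it, then encode each value back to latin1 so the original bytes come out untouched. It is an approximation under those assumptions, not tornado's implementation.

```python
from urllib.parse import parse_qs


def parse_form_bytes(body: bytes) -> dict:
    """Hypothetical helper: parse a form body into {str: [bytes]} via latin1."""
    text = body.decode("latin1")  # never fails: every byte is a latin1 code point
    parsed = parse_qs(text, keep_blank_values=True, encoding="latin1", errors="strict")
    # Encoding back to latin1 recovers the exact bytes the client sent.
    return {k: [v.encode("latin1") for v in vs] for k, vs in parsed.items()}


raw_utf8 = b"data=" + "中文".encode("utf-8")   # malformed: not percent-escaped
escaped_utf8 = b"data=%E4%B8%AD%E6%96%87"      # well-formed equivalent
raw_gbk = b"data=" + "中文".encode("gbk")      # malformed, gbk-encoded

print(parse_form_bytes(raw_utf8)["data"][0].decode("utf-8"))
print(parse_form_bytes(escaped_utf8)["data"][0].decode("utf-8"))
print(parse_form_bytes(raw_gbk)["data"][0].decode("gbk"))
```

Because the values stay as bytes until the last step, the choice of text encoding is deferred to the caller, which is the role `decode_argument` plays in the handler above.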
Thank you for your patience! Your comments are really helpful. I've contacted the owner of the clients and persuaded him to fix the malformed requests in future updates. For now I'm using
uri_arguments = parse_qs_bytes(body.decode('latin1'), keep_blank_values=True)
and
def decode_argument(self, value, name=None):
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        return value.decode('gbk')
The annoying part is that sometimes there are both gbk- and utf-8-encoded Chinese characters in a single request (I know that's weird...). Is there a way to decode such a request correctly? If not, how can I just ignore the few gbk-encoded characters and decode the rest of the request correctly? |
You could do |
Thank you. That works. I'm closing this. |
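For the mixed-encoding question above, one possible approach (a sketch, not necessarily what was suggested in this thread) is to decode each value independently and fall back step by step. The function names `decode_with_fallback` and `decode_ignoring_invalid` are hypothetical; they use only `bytes.decode` and its standard `errors` argument, and could be called from an overridden `decode_argument`.

```python
def decode_with_fallback(value: bytes) -> str:
    """Try utf-8, then gbk; as a last resort keep what utf-8 can read."""
    for encoding in ("utf-8", "gbk"):
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Neither codec accepts the whole value: mark the bad bytes with U+FFFD.
    return value.decode("utf-8", errors="replace")


def decode_ignoring_invalid(value: bytes) -> str:
    """Drop any bytes that are not valid utf-8 and keep the rest."""
    return value.decode("utf-8", errors="ignore")


utf8_value = "中文".encode("utf-8")
gbk_value = "中文".encode("gbk")
mixed_value = "中".encode("utf-8") + "文".encode("gbk")

print(decode_with_fallback(utf8_value))
print(decode_with_fallback(gbk_value))
print(decode_ignoring_invalid(mixed_value))
```

Guessing the encoding this way is a heuristic: some gbk byte sequences happen to be valid utf-8, and a value that mixes both encodings cannot be recovered reliably, only partially read.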
I'm using tornado to accept some POST data sent from clients I don't have access to. Everything works fine if only English characters appear in the data. When utf-8 encoded Chinese characters (3 bytes each) are in the data, Tornado gives me this warning and the 'get_argument' function can't get anything at all.
I debugged and reduced my code to the simplest possible case, yet the warning still comes up
The data the clients post looks like:
The data is x-www-form-urlencoded, and Wireshark shows that each Chinese character is a well-formed 3-byte utf-8 sequence whose lead byte starts with 0xE (binary 1110).
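As a quick illustration of that byte pattern (a small sketch using an arbitrary example character, not the actual captured data), a 3-byte utf-8 sequence has a lead byte of the form 1110xxxx followed by two continuation bytes of the form 10xxxxxx:

```python
encoded = "中".encode("utf-8")

lead = encoded[0]
# Lead byte of a 3-byte sequence: top four bits are 1110.
is_three_byte_lead = (lead & 0xF0) == 0xE0
# Continuation bytes: top two bits are 10.
continuations_ok = all((b & 0xC0) == 0x80 for b in encoded[1:])

print(encoded.hex(), is_three_byte_lead, continuations_ok)
```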
It has nothing to do with the print function, because I also tried even simpler code:
The warning is still there. Could anyone tell me where this encoding issue comes from, since I did nothing about encoding in my code?