latin-1 codec can't encode characters #2733

FlameSky-S · 2019-08-28T07:37:36Z

I'm using tornado to accept POST data sent from clients I don't have access to. Everything works fine if the data contains only English characters. When the data contains UTF-8 encoded Chinese characters (3 bytes each), Tornado gives me this warning and the 'get_argument' function can't get anything at all.

I debugged and reduced my code to the simplest form, yet the warning still comes up:

import tornado.ioloop
import tornado.web

class DataHandler(tornado.web.RequestHandler):
    def post(self):
        print("test")
        print(self.get_argument("data"))
        print("1")

application = tornado.web.Application([
    (r"/data", DataHandler),  # route must name the handler class defined above
])

application.listen(5000)
tornado.ioloop.IOLoop.instance().start()

The data the clients post is like:

data={"id":"00f1c423","name":"张三"}

The data is x-www-form-urlencoded, and Wireshark shows the Chinese characters as well-formed 3-byte UTF-8 sequences whose lead byte starts with E (1110).
It has nothing to do with the print function, because I also tried even simpler code:

class Data(tornado.web.RequestHandler):
    def post(self):
        return

The warning is still there. So could anyone tell me where this encoding step comes from, since I do nothing about encoding in my code?


FlameSky-S · 2019-08-28T10:05:39Z

Actually, the warning doesn't matter if I use flask instead of tornado to handle the POST request. The warning is still there, but flask (request.values.get) correctly extracts the key-value pair. The get_argument function in tornado, however, can't extract the params and raises "Missing argument data" whenever there is a codec warning.
So I think the issue contains two problems:

1. Why would RequestHandler automatically call encoding-related functions with the 'latin-1' codec when there's actually nothing to encode?
2. Why can't the get_argument function extract params correctly from a POST request (while request.values.get in Flask can) when an encoding warning occurs?

ploxiln · 2019-08-29T00:51:44Z

This is indeed a very confusing corner of tornado. It's using latin1 because "any bytes can be considered latin1 so we can smuggle bytes through latin1". See also #2572
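The "smuggle bytes through latin1" idea can be illustrated with plain Python. This is only a sketch of the principle, not tornado's code, and the variable names are made up for the example. It also shows where an error like the one in the issue title can come from.

```python
# Every byte value 0-255 maps to exactly one latin1 code point,
# so bytes -> str -> bytes round-trips without loss.
raw = "张三".encode("utf-8")            # six bytes of UTF-8
smuggled = raw.decode("latin1")         # a str of six code points, all <= U+00FF
assert smuggled.encode("latin1") == raw

# The trick breaks once the str holds real non-latin1 characters,
# for example because the bytes were decoded as UTF-8 first.
already_decoded = raw.decode("utf-8")   # '张三'
try:
    already_decoded.encode("latin1")
    error = None
except UnicodeEncodeError as exc:
    error = exc                         # 'latin-1' codec can't encode characters ...
```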

Here's a change that fixes your problem, but it's not really the right fix, as I'll explain:

--- a/tornado/escape.py
+++ b/tornado/escape.py
@@ -161,7 +161,8 @@ def parse_qs_bytes(
     )
     encoded = {}
     for k, v in result.items():
-        encoded[k] = [i.encode("latin1") for i in v]
+        # qs was decoded utf-8, so values are encoded utf-8
+        encoded[k] = [i.encode("utf-8") for i in v]
     return encoded

What you're actually supposed to do, for x-www-form-urlencoded, is to url-encode any non-ascii bytes (and some ascii bytes). Then you'll get bytes in your handler, which you can decode as utf-8. So use curl's --data-urlencode option:

curl -v http://localhost:5000/data --data-urlencode 'data={"id":"00f1c423","name":"张三"}'

and your example will print as expected:

$ python3 example.py 
test
{"id":"00f1c423","name":"张三"}
1
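For a client written in Python, the equivalent of curl's --data-urlencode could look roughly like the sketch below. It uses only the standard library; the payload is the one from this issue, and the variable names are illustrative.

```python
import json
from urllib.parse import parse_qs, urlencode

payload = json.dumps({"id": "00f1c423", "name": "张三"}, ensure_ascii=False)

# urlencode percent-escapes non-ASCII characters (as UTF-8 by default)
# along with reserved ASCII characters such as { } " : ,
body = urlencode({"data": payload})

# The body is now pure ASCII, and parsing it gives the original JSON text back
round_trip = parse_qs(body)["data"][0]
```

A body built this way is what a handler's get_argument expects to receive.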

bdarnell · 2019-08-29T01:10:58Z

The "smuggle bytes through latin1" comment refers to decoding bytes as latin1 (which is done by passing latin1 to parse_qs). Encoding back to latin1 is correct here if all the non-ascii bytes come from this latin1 decoding, but it's incorrect if some of them come from the native_str call here:

tornado/tornado/httputil.py

Line 786 in 8e5ecad

uri_arguments = parse_qs_bytes(native_str(body), keep_blank_values=True)

This is a tricky case because the input is malformed (all the non-ascii bytes should be percent escaped) but there is a reasonable interpretation of the data (it works in flask, and I think it probably worked in tornado in python 2). It's difficult to fix this while both using the stdlib's parse_qs function and supporting non-utf8 encodings which can be specified via get_argument.

If you need to support this malformed input, the simplest solution is probably for you to call urllib.parse.parse_qs yourself instead of using tornado's functions.
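A minimal sketch of that workaround, assuming the clients always send UTF-8. The helper name parse_form_body is hypothetical; inside a handler it would presumably be called with self.request.body.

```python
from urllib.parse import parse_qs

def parse_form_body(body: bytes, encoding: str = "utf-8") -> dict:
    """Parse a form body that may contain raw (unescaped) non-ASCII bytes."""
    # Decode the raw bytes first, then let parse_qs handle any %XX escapes
    # using the same encoding.
    text = body.decode(encoding)
    return parse_qs(text, keep_blank_values=True, encoding=encoding)

# The malformed body from this issue: raw UTF-8 bytes, no percent-escaping
raw_body = 'data={"id":"00f1c423","name":"张三"}'.encode("utf-8")
args = parse_form_body(raw_body)
```

Unlike get_argument, this returns a dict of lists and raises UnicodeDecodeError if the body is not valid in the assumed encoding.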

ploxiln · 2019-08-29T01:51:57Z

I've got another idea here (maybe newly practical now that tornado is python3 only?)

--- a/tornado/httputil.py
+++ b/tornado/httputil.py
@@ -784,7 +784,7 @@ def parse_body_arguments(
         try:
-            uri_arguments = parse_qs_bytes(native_str(body), keep_blank_values=True)
+            uri_arguments = parse_qs_bytes(body.decode('latin1'), keep_blank_values=True)
         except Exception as e:
             gen_log.warning("Invalid x-www-form-urlencoded body: %s", e)

Tested with utf-8 with and without url encoding, and then tested with the "gbk" encoding with the following addition to the example:

class DataHandler(tornado.web.RequestHandler):
    def decode_argument(self, value, name=None):
        return value.decode('gbk')

$ python3
>>> "张三".encode('gbk')
b'\xd5\xc5\xc8\xfd'

$ curl -v http://localhost:5000/data --data "data=$(printf '\xd5\xc5\xc8\xfd')"
$ curl -v http://localhost:5000/data --data-urlencode "data=$(printf '\xd5\xc5\xc8\xfd')"
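The effect of decoding the body as latin1 can be reproduced outside tornado with the stdlib alone. The helper below is a hypothetical stand-in that mirrors the flow described in this thread (decode as latin1, parse, encode back to latin1); it is not tornado's actual implementation.

```python
from urllib.parse import parse_qs

def parse_body_as_bytes(body: bytes) -> dict:
    # Hypothetical helper: every byte survives the latin1 round trip,
    # whether it arrived raw or as a %XX escape.
    text = body.decode("latin1")
    parsed = parse_qs(text, keep_blank_values=True,
                      encoding="latin1", errors="strict")
    return {k: [v.encode("latin1") for v in vs] for k, vs in parsed.items()}

raw_utf8 = b"data=" + "张三".encode("utf-8")   # unescaped UTF-8
escaped_utf8 = b"data=%E5%BC%A0%E4%B8%89"      # percent-escaped UTF-8
raw_gbk = b"data=\xd5\xc5\xc8\xfd"             # unescaped GBK

# The handler (decode_argument) then picks the real encoding
from_raw = parse_body_as_bytes(raw_utf8)["data"][0].decode("utf-8")
from_escaped = parse_body_as_bytes(escaped_utf8)["data"][0].decode("utf-8")
from_gbk = parse_body_as_bytes(raw_gbk)["data"][0].decode("gbk")
```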

bdarnell · 2019-08-29T01:53:31Z

Ah, that looks like it should work.

FlameSky-S · 2019-08-29T02:58:47Z

Thank you for your patience! Your comments are really helpful. I've contacted the owner of the clients and persuaded him to fix the malformed requests in future updates. For now I'm using

uri_arguments = parse_qs_bytes(body.decode('latin1'), keep_blank_values=True)

and

def decode_argument(self, value, name=None):
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        # fall back to gbk only when the bytes are not valid utf-8
        return value.decode('gbk')

Annoying part is, sometimes there are both gbk and utf-8 encoded Chinese characters in a single request (I know that's weird...). So is there a way to decode such a request correctly? If not, how can I just ignore the few gbk encoded characters and decode the rest of the request correctly?
Sorry for the noob question.

ploxiln · 2019-08-29T03:05:03Z

You could do value.decode('utf-8', errors='ignore') or errors='replace'
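A small sketch of the difference between the two error handlers, using a made-up value that mixes UTF-8 and GBK bytes (the GBK bytes are the ones shown earlier in this thread):

```python
# UTF-8 bytes for 张三, an ASCII separator, then GBK bytes for 张三
mixed = "张三".encode("utf-8") + b"/" + b"\xd5\xc5\xc8\xfd"

# 'ignore' silently drops every byte that is not valid UTF-8
ignored = mixed.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD for each undecodable piece,
# so the damage stays visible in the result
replaced = mixed.decode("utf-8", errors="replace")
```

Either way the GBK text itself is lost; only the UTF-8 portion is recovered.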

FlameSky-S · 2019-08-29T03:12:09Z

Thank you. That works. I'm closing this.

ploxiln mentioned this issue Aug 29, 2019

parse_body_arguments: allow incomplete url-escaping #2735

Merged

FlameSky-S closed this as completed Aug 29, 2019
