latin-1 codec can't encode characters #2733

Closed
FlameSky-S opened this issue Aug 28, 2019 · 8 comments

Comments

@FlameSky-S

I'm using Tornado to accept POST data sent from clients I don't have access to. Everything works fine if the data contains only English characters. When the data contains UTF-8 encoded Chinese characters (3 bytes each), Tornado logs this warning and the 'get_argument' function can't get anything at all.

I debugged and reduced my code to the simplest form, yet the warning still comes up:

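import tornado.ioloop
import tornado.web


class DataHandler(tornado.web.RequestHandler):
    def post(self):
        print("test")
        # Look up the "data" field of the form-encoded POST body
        print(self.get_argument("data"))
        print("1")

application = tornado.web.Application([
    (r"/data", DataHandler),  # route /data to the handler class defined above
])

application.listen(5000)
tornado.ioloop.IOLoop.instance().start()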

The data the clients post looks like this:

data={"id":"00f1c423","name":"张三"}

The data is sent as x-www-form-urlencoded, and Wireshark shows that the Chinese characters are valid 3-byte UTF-8 sequences whose lead byte starts with 0xE (binary 1110).
It has nothing to do with the print function, because I also tried even simpler code:

class Data(tornado.web.RequestHandler):
    def post(self):
        return

The warning is still there. So could anyone tell me where this encoding step comes from, since I did nothing about encoding in my code?

@FlameSky-S
Author

Actually, the warning doesn't matter if I use Flask instead of Tornado to handle the POST request. The warning is still there, but Flask (request.values.get) correctly extracts the key-value pair. However, the get_argument function in Tornado can't extract the params correctly and raises "Missing argument data" whenever there is a codec warning.
So I think the issue contains two problems:

  1. Why would RequestHandler automatically call encoding-related functions with the 'latin-1' codec when there's actually nothing to encode?
  2. Why can't the get_argument function extract params correctly from a POST request (while request.values.get in Flask can) when an encoding warning occurs?

@ploxiln
Contributor

ploxiln commented Aug 29, 2019

This is indeed a very confusing corner of tornado. It's using latin1 because "any bytes can be considered latin1 so we can smuggle bytes through latin1". See also #2572
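
To make the "smuggle bytes through latin1" idea concrete, here is a minimal standard-library sketch (not Tornado code; the variable names are only for illustration). It relies on the fact that Latin-1 maps each of the 256 byte values to exactly one code point, so decoding and re-encoding is lossless for arbitrary bytes:

```python
# Every byte value 0..255 maps to exactly one Latin-1 code point,
# so decode/encode with latin1 is a lossless round trip for any bytes.
raw = "张三".encode("utf-8")           # six UTF-8 bytes
smuggled = raw.decode("latin1")        # a str holding one code point per byte
restored = smuggled.encode("latin1")   # the original bytes again

# The same round trip holds for every possible byte value.
all_bytes = bytes(range(256))
roundtrip_ok = all_bytes.decode("latin1").encode("latin1") == all_bytes

print(restored == raw, roundtrip_ok, restored.decode("utf-8"))
```

The smuggled string is not meaningful text; it is only a carrier that lets str-based parsing functions handle data whose real encoding is not yet known.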

Here's a change that fixes your problem, but it's not really the right fix, as I'll explain:

--- a/tornado/escape.py
+++ b/tornado/escape.py
@@ -161,7 +161,8 @@ def parse_qs_bytes(
     )
     encoded = {}
     for k, v in result.items():
-        encoded[k] = [i.encode("latin1") for i in v]
+        # qs was decoded utf-8, so values are encoded utf-8
+        encoded[k] = [i.encode("utf-8") for i in v]
     return encoded

What you're actually supposed to do, for x-www-form-urlencoded, is to url-encode any non-ascii bytes (and some ascii bytes). Then you'll get bytes in your handler, which you can decode as utf-8. So use curl's --data-urlencode option:

curl -v http://localhost:5000/data --data-urlencode 'data={"id":"00f1c423","name":"张三"}'

and your example will print as expected:

$ python3 example.py 
test
{"id":"00f1c423","name":"张三"}
1
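
For clients written in Python, the equivalent of curl's --data-urlencode can be sketched with the standard library. This is only an illustration of the percent-escaping described above, assuming the client is free to build its body with urllib.parse.urlencode:

```python
from urllib.parse import parse_qs, urlencode

payload = '{"id":"00f1c423","name":"张三"}'

# urlencode percent-escapes non-ASCII characters (as UTF-8 by default)
# together with reserved ASCII characters such as '{', '"' and ':'.
body = urlencode({"data": payload})
print(body)

# A well-formed body is pure ASCII and parses back to the original value.
decoded = parse_qs(body)["data"][0]
print(decoded == payload)
```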

@bdarnell
Member

The "smuggle bytes through latin1" comment refers to decoding bytes as latin1 (which is done by passing latin1 to parse_qs). Encoding back to latin1 is correct here if all the non-ascii bytes come from this latin1 decoding, but it's incorrect if some of them come from the native_str call here:

uri_arguments = parse_qs_bytes(native_str(body), keep_blank_values=True)

This is a tricky case because the input is malformed (all the non-ascii bytes should be percent escaped) but there is a reasonable interpretation of the data (it works in flask, and I think it probably worked in tornado in python 2). It's difficult to fix this while both using the stdlib's parse_qs function and supporting non-utf8 encodings which can be specified via get_argument.
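
The failure can be reproduced with the standard library alone. The following is a rough approximation of the three steps described above, not Tornado's actual code; it assumes native_str decodes the body as UTF-8 on Python 3:

```python
from urllib.parse import parse_qs

# A malformed body: the Chinese characters are raw UTF-8, not percent-escaped.
body = 'data={"name":"张三"}'.encode("utf-8")

# Step 1: approximately what native_str(body) does (decode as UTF-8).
text = body.decode("utf-8")

# Step 2: parse_qs applies encoding="latin1" only to percent-escapes,
# so the literal Chinese characters pass through unchanged.
parsed = parse_qs(text, keep_blank_values=True, encoding="latin1")

# Step 3: encoding back to latin1 fails, because these characters
# lie outside the Latin-1 range.
try:
    parsed["data"][0].encode("latin1")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)

print(error)
```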

If you need to support this malformed input, the simplest solution is probably for you to call urllib.parse.parse_qs yourself instead of using tornado's functions.
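
A minimal sketch of that workaround, assuming the clients always send UTF-8. The helper name parse_lenient_form is hypothetical; inside a handler it would be given self.request.body:

```python
from urllib.parse import parse_qs


def parse_lenient_form(body: bytes, encoding: str = "utf-8") -> dict:
    """Hypothetical helper: parse a form body that may contain raw
    (unescaped) non-ASCII bytes as well as percent-escapes."""
    # Decode first, so raw non-ASCII bytes become real characters;
    # parse_qs then decodes any percent-escapes with the same encoding.
    text = body.decode(encoding)
    return parse_qs(text, keep_blank_values=True, encoding=encoding)


raw = 'data={"id":"00f1c423","name":"张三"}'.encode("utf-8")
escaped = b"data=%E5%BC%A0%E4%B8%89"

print(parse_lenient_form(raw)["data"][0])
print(parse_lenient_form(escaped)["data"][0])
```

This gives up the per-argument encoding flexibility of get_argument, which is the trade-off mentioned above.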

@ploxiln
Contributor

ploxiln commented Aug 29, 2019

I've got another idea here (maybe newly practical now that tornado is python3 only?)

--- a/tornado/httputil.py
+++ b/tornado/httputil.py
@@ -784,7 +784,7 @@ def parse_body_arguments(
         try:
-            uri_arguments = parse_qs_bytes(native_str(body), keep_blank_values=True)
+            uri_arguments = parse_qs_bytes(body.decode('latin1'), keep_blank_values=True)
         except Exception as e:
             gen_log.warning("Invalid x-www-form-urlencoded body: %s", e)

Tested with utf-8 with and without url encoding, and then tested with the "gbk" encoding with the following addition to the example:

class DataHandler(tornado.web.RequestHandler):
    def decode_argument(self, value, name=None):
        return value.decode('gbk')

$ python3
>>> "张三".encode('gbk')
b'\xd5\xc5\xc8\xfd'

$ curl -v http://localhost:5000/data --data "data=$(printf '\xd5\xc5\xc8\xfd')"
$ curl -v http://localhost:5000/data --data-urlencode "data=$(printf '\xd5\xc5\xc8\xfd')"
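
The effect of decoding the body as latin1 can be checked without a running server. This standard-library sketch mirrors the idea of the patch rather than Tornado's exact code (parse_body_as_bytes is a hypothetical name): the argument values come out as the original bytes, whether or not the client percent-escaped them, and the handler chooses the real encoding afterwards:

```python
from urllib.parse import parse_qs


def parse_body_as_bytes(body: bytes) -> dict:
    """Hypothetical stand-in for the patched parsing path."""
    # latin1 decoding never fails and keeps one code point per byte.
    text = body.decode("latin1")
    parsed = parse_qs(text, keep_blank_values=True,
                      encoding="latin1", errors="strict")
    # Encoding back to latin1 is now always safe and restores the bytes.
    return {k: [v.encode("latin1") for v in vs] for k, vs in parsed.items()}


gbk_raw = b"data=\xd5\xc5\xc8\xfd"        # like curl --data
gbk_escaped = b"data=%D5%C5%C8%FD"        # like curl --data-urlencode
utf8_raw = "data=张三".encode("utf-8")

print(parse_body_as_bytes(gbk_raw)["data"][0].decode("gbk"))
print(parse_body_as_bytes(gbk_escaped)["data"][0].decode("gbk"))
print(parse_body_as_bytes(utf8_raw)["data"][0].decode("utf-8"))
```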

@bdarnell
Member

Ah, that looks like it should work.

@FlameSky-S
Author

FlameSky-S commented Aug 29, 2019

Thank you for your patience! Your comments are really helpful. I've contacted the owner of the clients and persuaded him to fix the malformed requests in future updates. For now I'm using

uri_arguments = parse_qs_bytes(body.decode('latin1'), keep_blank_values=True)

and

def decode_argument(self, value, name=None):
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to GBK
        return value.decode('gbk')

The annoying part is that sometimes there are both GBK and UTF-8 encoded Chinese characters in a single request (I know that's weird...). Is there a way to decode such a request correctly? If not, how can I just ignore the few GBK encoded characters and decode the rest of the request correctly?
Sorry for the noob question.

@ploxiln
Contributor

ploxiln commented Aug 29, 2019

You could do value.decode('utf-8', errors='ignore') or errors='replace'
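
A small sketch of the difference between the two error handlers, using a made-up value that mixes UTF-8 and GBK bytes as described in the question:

```python
# Made-up mixed value: UTF-8 bytes, a space, then GBK bytes for the same text.
mixed = "张三".encode("utf-8") + b" " + "张三".encode("gbk")

# errors="ignore" silently drops the bytes that are not valid UTF-8.
ignored = mixed.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD for them, so the loss stays visible.
replaced = mixed.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Either way the GBK part is lost rather than recovered, so 'replace' may be the safer choice when it matters to notice that something was dropped.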

@FlameSky-S
Author

Thank you. That works. I'm closing this.
