
Unicode character parsing #1329

Closed
sjsadowski opened this issue Sep 26, 2018 · 32 comments

@sjsadowski
Contributor

Per #539 there may be a continued problem in parsing unicode characters due to a dependency.

Per @vltr: /中国 or even /jacaré would break in request.py, mostly because of httptools.parse_url, which receives bytes and is unable to parse non-ASCII chars. I'll need to dig deeper to see whether this is a restriction in httptools itself or just in the Python bindings.

@vltr
Member

vltr commented Sep 26, 2018

Just running my eyes over Node's http-parser source code (the base of httptools), I've seen in their tests that some level of unicode characters should be usable. So, the next step would be to test httptools itself - in the next chapter 😉

@yunstanford
Member

I remember there was a PR addressing this: #1081

@ahopkins
Member

@huge-success/sanic-release-managers add this to 18.12LTS milestone?

@sjsadowski
Contributor Author

We need to re-test; per #1081 this may have been solved, as it was merged back in January.

@yunstanford
Member

There is a unit test verifying this.

@vltr
Member

vltr commented Sep 28, 2018

@yunstanford would this be a valid test?

from sanic.request import Request

uri = "/jacaré"

Request(
    url_bytes=uri.encode("utf-8"),
    headers={},
    version=None,
    method='GET',
    transport=None
)

The result:

$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xc3\xa9'

@sjsadowski
Contributor Author

Test looks valid to me.

After digging, it looks like the issue still exists in the dependency. It actually seems to exist upstream from there, in Node's http-parser (see here, where they still rely on RFC 2616 for valid URL characters - which are latin-1 only). While RFC 2616 was updated by RFC 3986 and obsoleted by RFC 7230, the latin-1 character sets (back-referenced to RFC 3986) and percent-encoding remain the only officially valid characters for URIs.

So we have two problems, one of which I don't care about. i18n and non-latin-1 characters are still technically disallowed - don't care. W3C allows internationalization of domain names using non-latin-1 characters, and so should we.

The second part is that our dependency chain DOES care. The http-parser will and does fail if it tries to parse any portion of the URL that is not latin-1 encoded. And I don't think we can fix this unless we eliminate the dependency, and I'm sure that's going to be a problem. So I do care about this issue.

I think there is a workaround, though it will be costly: if the unicode characters are intercepted before parsing and translated to their percent-encoded equivalents, that solves the URL parsing. But we would also have to do that for route registration. I'd like to get some more eyes on this issue to make sure I'm not crazy. @ahopkins @r0fls @seemethere @yunstanford
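The interception idea could be sketched roughly like this (a hypothetical helper, not anything in Sanic today), assuming the raw URL arrives as UTF-8 bytes:

```python
# Hypothetical sketch of the workaround described above: percent-encode any
# bytes the parser would reject before handing the URL to httptools, so the
# parser only ever sees characters that are legal per the RFC.
from urllib.parse import quote


def percent_encode_url(url_bytes: bytes) -> bytes:
    # quote() leaves unreserved ASCII characters alone and turns every other
    # byte into a %XX octet; safe= preserves the usual URL delimiters.
    # Note: a URL that is already percent-encoded would be double-encoded
    # by this naive version ('%' itself gets escaped).
    return quote(url_bytes, safe="/?#[]@!$&'()*+,;=:").encode("ascii")


print(percent_encode_url("/jacaré".encode("utf-8")))  # b'/jacar%C3%A9'
```

This is where the overhead discussed below would come from: one extra pass over every incoming URL.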

@sjsadowski
Contributor Author

I'm also concerned because @vltr has a test that is clearly failing, yet the test from #1081 passes.

@vltr
Member

vltr commented Sep 28, 2018

@sjsadowski for this simple script I made, using latin1 produces the same error:

from sanic.request import Request

uri = "/jacaré"

Request(
    url_bytes=uri.encode("latin1"),
    headers={},
    version=None,
    method='GET',
    transport=None
)

Result:

$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xe9'

httptools version:

$ pip freeze | grep httptools
httptools==0.0.11

But, even if I create a simple server:

from sanic import Sanic
from sanic.response import text

app = Sanic(__name__)


@app.route("/test/<hello>")
async def myroute(request, hello):
    return text("hello, {}!".format(hello))


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000, workers=1)

It will raise (basically) the same error.

Server:

$ python test-sanic-unicode.py
[2018-09-28 10:28:54 -0300] [8411] [INFO] Goin' Fast @ http://0.0.0.0:8000
[2018-09-28 10:28:54 -0300] [8411] [INFO] Starting worker [8411]
[2018-09-28 10:29:05 -0300] [8411] [ERROR] Traceback (most recent call last):
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/server.py", line 216, in data_received
    self.parser.feed_data(data)
  File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserInvalidURLError: invalid URL

cURL:

$ curl -v http://127.0.0.1:8000/test/jacaré
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8000 (#0)
> GET /test/jacaré HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< Connection: close
< Content-Length: 18
< Content-Type: text/plain; charset=utf-8
< 
* Closing connection 0
Error: Bad Request

@sjsadowski
Contributor Author

sjsadowski commented Sep 28, 2018

I think the encoding name is supposed to be 'latin-1', but either way é is not valid per RFC 2616.

Here is a direct link to the RFC 2616 character table as implemented in http-parser.
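For illustration (this set comes from RFC 3986, not the linked table): unencoded URI characters are limited to a small ASCII repertoire, so a character like é can only appear in percent-encoded form:

```python
# The RFC 3986 "unreserved" set: the only characters that may appear in a
# URI without percent-encoding (besides reserved delimiters like '/', '?').
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

print("e" in UNRESERVED)  # True
print("é" in UNRESERVED)  # False -- must be sent as %C3%A9 (UTF-8) instead
```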

@vltr
Member

vltr commented Sep 28, 2018

Oh. Sorry, my bad. I'll pay more attention to the RFC 😁

Regarding the dash on latin-1, it works with or without:

>>> "é".encode("utf-8")
b'\xc3\xa9'

>>> "é".encode("utf8")
b'\xc3\xa9'

>>> "é".encode("latin1")
b'\xe9'

>>> "é".encode("latin-1")
b'\xe9'

@sjsadowski
Contributor Author

Yeah, it's just dumb. It's very outdated (1999 - almost 20 years old), and even though it's been superseded twice, they haven't updated the "legal" characters for a URI.

@vltr
Member

vltr commented Sep 28, 2018

We need to see whether any newer RFC revisits this question, since 2616 has been both updated and obsoleted.

@vltr
Member

vltr commented Sep 28, 2018

@sjsadowski wow, almost at the same time (the comments). I'll take a look into these other RFCs just as a precaution.

@sjsadowski
Contributor Author

@vltr you want 3986, which updates 2616, and 7230, which obsoletes it.

@vltr
Member

vltr commented Sep 28, 2018

Thanks, @sjsadowski, I'll read them carefully when I get some free time and will get back here with (or without) new conclusions 👍

@ahopkins
Member

I have not had a full chance to review the issue... but my first thought on reading the idea of catching non-latin-1 characters and translating them was the potential performance hit relative to the use case. I'll come back after the weekend with some more thoughts.

@vltr
Member

vltr commented Sep 28, 2018

@ahopkins indeed, it depends on how we address this issue. I would not want to address it inside Sanic because of the performance hit. If possible, I would like to address it in httptools, because it seems to work in Node's http-parser (see here).

@yunstanford
Member

I didn't get time to dig into this issue; I'll take a look when I get a chance.

If you've investigated and concluded it's because of httptools, we can pull @1st1 in for discussion.

@sjsadowski
Contributor Author

Yeah, it's not the httptools package itself; it's the upstream parser it relies on, Node's http-parser. I don't think the httptools package can implement a fix.

I wonder if we want to consider a workaround that would enable unicode parsing behind a flag.

@vltr
Member

vltr commented Oct 1, 2018

Everyone, I did some more digging into this and things got a little more complicated. See RFC 3986, section "2.4. When to Encode or Decode".

From what I could understand:

  1. Every non-US-ASCII character (let's pick a simple example I used a lot here: é) should be encoded to its "percent-encoded octets", which in this example would be %C3%A9;
  2. By doing this, httptools would correctly parse everything and all is good.
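The encoding in point 1 can be checked directly with the standard library; urllib.parse.quote/unquote implement exactly this percent-encoding:

```python
# é (UTF-8 bytes C3 A9) round-trips through RFC 3986 percent-encoding:
from urllib.parse import quote, unquote

print(quote("é"))         # '%C3%A9'
print(unquote("%C3%A9"))  # 'é'
```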

What happens:

  1. Some clients (like cURL) don't encode anything prior to the request; they just send the request "as is", which of course breaks inside Sanic for not being properly encoded first;
  2. We can see that some other web frameworks can handle these kinds of values, but the question is: are they doing it to "embrace the world" and make users happy, instead of properly handling the issue?

What can be done:

  • Nothing. Sanic works strictly under the terms of the RFC, so if anything comes in wrong, a bad request is raised;
  • Half embrace the world [1]: try to encode these "strange chars" (using urllib.parse.quote) prior to sending them to httptools, adding some overhead;
  • Half embrace the world [2]: try to decode these "strange chars" (using urllib.parse.unquote) from what httptools gives us (if it doesn't break); my personal opinion is that this should be considered (I can send /%C3%A9 and read /é); or
  • Full embrace the world: try to encode everything weird before and after, to keep the whole world satisfied.
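The second "half embrace" option might look roughly like this, using urllib.parse.urlsplit as a dependency-free stand-in for httptools.parse_url:

```python
# Sketch of decoding after parsing: the parser works on the percent-encoded,
# pure-ASCII form, and the path is unquoted only afterwards, before routing.
from urllib.parse import unquote, urlsplit

raw = "/jacar%C3%A9?x=1"     # what a well-behaved client sends
parts = urlsplit(raw)        # parses fine: everything is ASCII
path = unquote(parts.path)   # what the route handler wants to see

print(path)  # /jacaré
```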

Thoughts?

@sjsadowski
Contributor Author

I'd lean towards nothing. If clients are supposed to percent-encode any non-ASCII characters outside the RFC 2616 charmap, then bare unicode characters should not be parsed, and httptools/http-parser are doing the right thing.

So long as routes that contain unicode characters get percent-encoded as well, we have done (what I think is) our job, as the routes will match the correct client encoding.
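The route-side half of that argument, sketched with a hypothetical helper (not Sanic's router API):

```python
# If a route containing non-ASCII characters is registered in its
# percent-encoded form, it matches the path a spec-compliant client sends.
from urllib.parse import quote


def encoded_route(path: str) -> str:
    # quote() percent-encodes everything outside the unreserved set,
    # keeping '/' intact so the path structure survives.
    return quote(path, safe="/")


print(encoded_route("/test/jacaré"))  # /test/jacar%C3%A9
```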

@vltr
Member

vltr commented Oct 1, 2018

@sjsadowski that would be my opinion too. We could perhaps add a server option to unquote after httptools has returned correctly, but that's just an idea -- the developer can implement this as well. It may be more trouble than expected, because it requires overriding Request.__init__ and passing the new class to the application constructor if it's intended to work with the Router, so perhaps this should be a matter of heavy documentation on the subject?

@sjsadowski
Contributor Author

@vltr unless we get a chorus of 'no', I'm going to say we just need to document the hell out of it. A special section on unicode characters, or something on handling unicode in Request.

@vltr
Member

vltr commented Oct 1, 2018

@sjsadowski I completely agree with you.

@vltr
Member

vltr commented Oct 1, 2018

Perhaps we can even provide some pre-built examples of extending the Request class to deal with all these questions - even though I personally think this kind of parsing should be done inside the server, but anyway...

@sjsadowski
Contributor Author

I'm actually closing this per #1424

@vltr
Member

vltr commented Feb 8, 2019

@sjsadowski I'm sorry, I really forgot to document some examples with the Request class. Please leave it open and I'll create a PR for this in a couple of hours - I just have to revise what I have written here already. I'm really sorry.

@sjsadowski sjsadowski reopened this Feb 8, 2019
@sjsadowski
Contributor Author

@vltr no worries, just trying to do some housekeeping!

@vltr
Member

vltr commented Feb 8, 2019

@sjsadowski I know 😉 And I'm glad you're doing it, because I'm so full of stuff to do that this went straight through my todo list and I completely forgot about it.

@vltr
Member

vltr commented Feb 8, 2019

@sjsadowski I made a mess in my mind here - I thought this was related to httptools in the Request object, but it is in fact in the HttpProtocol. It's better to close this or take the bumpy road: adding optional (and configurable) hooks to the protocol, which might not even solve the issue 😕

@sjsadowski
Contributor Author

I'm going to close it. Solving the issue itself will require significant work and should be tracked separately - if we choose to take it on - and I think this issue number has run its course.
