Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decode headers as UTF-8 also in ASGI #2606

Merged
merged 17 commits into from Mar 20, 2023
Merged

Decode headers as UTF-8 also in ASGI #2606

merged 17 commits into from Mar 20, 2023

Conversation

ChihweiLHBird
Copy link
Member

Closes #2604

@ChihweiLHBird ChihweiLHBird marked this pull request as ready for review November 24, 2022 04:57
@ChihweiLHBird ChihweiLHBird requested a review from a team as a code owner November 24, 2022 04:57
@ChihweiLHBird ChihweiLHBird changed the title Decode header as utf8 in asgi Decode header with latin-1 Nov 24, 2022
@codecov
Copy link

codecov bot commented Nov 29, 2022

Codecov Report

Patch coverage: 100.000% and project coverage change: -0.079 ⚠️

Comparison is base (a245ab3) 88.617% compared to head (9483af8) 88.539%.

❗ Current head 9483af8 differs from pull request most recent head f792bf0. Consider uploading reports for the commit f792bf0 to get more accurate results

Additional details and impacted files
@@              Coverage Diff              @@
##              main     #2606       +/-   ##
=============================================
- Coverage   88.617%   88.539%   -0.079%     
=============================================
  Files           87        87               
  Lines         6844      6841        -3     
  Branches      1178      1176        -2     
=============================================
- Hits          6065      6057        -8     
- Misses         537       539        +2     
- Partials       242       245        +3     
Impacted Files Coverage Δ
sanic/asgi.py 91.729% <100.000%> (+0.190%) ⬆️

... and 5 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@ahopkins ahopkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On hold until v23

@Tronic
Copy link
Member

Tronic commented Jan 31, 2023

This should probably continued. Needs some practical testing with ASGI servers to see how they behave, but the likely outcome is that we need to use UTF-8 on that side as well, the same as with the integrated server.

Two points of interest:

  • Header encoding (values in particular, keys should stay ASCII).
  • Path encoding, and with that also how clients handle it.

I can help run those tests once I get my other PRs sorted out, but @ChihweiLHBird feel free to do your own tests if you can. If it looks like that UTF-8 is going to work identically with ASGI as it does with the Sanic server, we are all good to implement UTF-8 on ASGI as well.

@ChihweiLHBird
Copy link
Member Author

@Tronic Sure, I can try to test it.

@Tronic
Copy link
Member

Tronic commented Mar 8, 2023

@ChihweiLHBird Could you get this finished? We had a discussion and want to have this in the upcoming 23.3 release.

@Tronic Tronic changed the title Decode header with latin-1 Decode headers as UTF-8 also in ASGI Mar 8, 2023
@ChihweiLHBird
Copy link
Member Author

@Tronic Sorry, I was busy last few days. I am on it right now.

@ChihweiLHBird
Copy link
Member Author

ChihweiLHBird commented Mar 9, 2023

@Tronic Just done some manual testing, and it seems for the test cases I tried, there is not an issue for header decoding. If it uses UTF-8 decoding, it will successfully decode emojis and other UTF-8 characters but failed to decode some characters in latin-1 which isn't in UTF-8, like b"\x80". Similar result to path. And I think this is the expected results, right? I am not sure what are the more comprehensive test cases I should try.

Tried uvicorn and hypercorn

sanic/asgi.py Outdated Show resolved Hide resolved
@Tronic
Copy link
Member

Tronic commented Mar 9, 2023

Could you also encode("ASCII") for url_bytes, and add try-except to catch UnicodeDecodeError/UnicodeEncodeError in these locations, raising BadRequest("Header names can only contain US-ASCII characters"), and BadURL("URL can only contain US-ASCII characters.")

This will require some tests that reach those errors.

@Tronic
Copy link
Member

Tronic commented Mar 9, 2023

Note on further changes: (not necessarily in this PR)

  • http1 needs to use ASCII for url_bytes encoding too (and raise that same exception)
  • http3 alike needs to encode in ASCII, and also needs ASCII/surrogateescape field name/value as here.

After these changes all three modes should behave more closely to each other, and be stricter in RFC compliance.

@Tronic
Copy link
Member

Tronic commented Mar 9, 2023

PR #2710 now implements charsets in URL handling, so that should not be addressed here. But implement error handling for headers as instructed above.

@ChihweiLHBird
Copy link
Member Author

PR #2710 now implements charsets in URL handling, so that should not be addressed here. But implement error handling for headers as instructed above.

What about http1 headers?

raw_headers = head.decode(errors="surrogateescape")

@Tronic
Copy link
Member

Tronic commented Mar 11, 2023

What about http1 headers?

It will stay laxer with header names. The entire header is decoded at once because that is faster than decoding each field separately, and thus we cannot check that they are ASCII (without performance penalty).

sanic/asgi.py Outdated Show resolved Hide resolved
sanic/http/http3.py Outdated Show resolved Hide resolved
Tronic
Tronic previously approved these changes Mar 13, 2023
Copy link
Member

@Tronic Tronic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work :)

@ChihweiLHBird
Copy link
Member Author

@ahopkins May I have your review?

@ahopkins ahopkins merged commit 61aa16f into main Mar 20, 2023
30 checks passed
@ahopkins ahopkins deleted the zhiwei/asgi-header-decode branch March 20, 2023 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP 1 request headers decoded using default encoding instead of ISO-8859-1
4 participants