-
Notifications
You must be signed in to change notification settings - Fork 41
Unicode mixed percent decoding #59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #59 +/- ##
==========================================
+ Coverage 97.96% 97.97% +<.01%
==========================================
Files 8 8
Lines 1424 1431 +7
Branches 166 166
==========================================
+ Hits 1395 1402 +7
Misses 14 14
Partials 15 15
Continue to review full report at Codecov.
|
markrwilliams
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird to work with partially percent encoded strings, but I think it's OK because this is an internal API that converges intermediate values to a correct final output.
I don't understand subencoding=False. To be fair, It's just as limited now as it was when it was ASCII, so that's a concern that's not specific to this PR.
Please explain the state of sub encoding and write some additional tests and fix _percent_decode's docstring before merging.
| Args: | ||
| text (unicode): The ASCII text with percent-encoding present. | ||
| text (unicode): Text with percent-encoding present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An earlier part of the docstring contradicts this. Please rewrite it to document the new API.
hyperlink/_url.py
Outdated
| Returns: | ||
| unicode: The percent-decoded version of *text*, with UTF-8 | ||
| decoding applied. | ||
| unicode: The percent-decoded version of *text*, with decoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The percent-decoded version of text,
with decoding applieddecoded withsubencoding,
| normalize_case (bool): Whether undecoded percent segments, such | ||
| as encoded delimiters, should be uppercased, per RFC 3986 | ||
| Section 2.1. See :func:`_decode_path_part` for an example. | ||
| subencoding (unicode): The name of the encoding underlying the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subencoding=False will now return bytes that include UTF-8 sequences; before it would only contain ASCII. That seems worth documenting here.
hyperlink/_url.py
Outdated
| """ | ||
| try: | ||
| quoted_bytes = text.encode("ascii") | ||
| quoted_bytes = text.encode(subencoding or 'utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the actual API surface, with ascii as a stand in for an arbitrary non-UTF-8 encoding:
text |
subencoding |
quoted_bytes |
return value |
|---|---|---|---|
<unicode> |
"utf-8" (default) |
<bytes, UTF-8 encoded> |
<unicode, percent-encoded> |
<unicode> |
"ascii" |
UnicodeEncodeError |
<unicode, unaltered> |
<unicode, ASCII congruent> |
"ascii" |
<bytes, ASCII> |
<bytes, ASCII> |
<unicode, charmap?> |
False |
<bytes, UTF-8 encoded> |
<bytes, UTF-8 encoded> |
The last case seems weird - I want bytes back because I know there isn't actually a "subencoding" (and I ended up with unicode presumably because somewhere else there's something like .decode("charmap")), but instead I get UTF-8 encoded bytes back. What then is the purpose of subencoding=False if it always implies UTF-8?
I think splitting this line up into an explicit if statement will make the situation clearer, and also make it easier to write tests that cover each case.
…ng for _percent_decode, per @markrwilliams review
Enhance
_percent_decode()so that it properly decodes percents pairs even when non-ASCII is present, and extended docstring. Remove re-percent-encoding logic from DecodedURL. Now_percent_decode()actually does what the name says!