Member

@mahmoud mahmoud commented Jan 11, 2018

Enhance _percent_decode() so that it properly decodes percent-encoded pairs even when non-ASCII characters are present, and extend its docstring. Remove the re-percent-encoding logic from DecodedURL. Now _percent_decode() actually does what the name says!
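
To make the intended behavior concrete, here is a minimal standalone sketch. It is not hyperlink's actual _percent_decode(): the name `percent_decode_sketch`, the regex-based approach, and the fallback of returning the input unaltered are all assumptions for illustration. It decodes `%XX` pairs in text that also contains raw non-ASCII characters.

```python
import re

# Matches one percent-encoded byte such as %20 or %c3.
_PCT_PAIR = re.compile(rb"%([0-9A-Fa-f]{2})")


def percent_decode_sketch(text, subencoding="utf-8"):
    """Illustrative only: decode %XX pairs, tolerating raw non-ASCII text."""
    try:
        # Encode with the subencoding rather than ASCII, so a character
        # such as "é" no longer aborts the whole decode.
        quoted_bytes = text.encode(subencoding)
    except UnicodeEncodeError:
        # Assumption: text the subencoding cannot represent is returned as-is.
        return text
    # Replace each %XX pair with the single byte it stands for.
    unquoted = _PCT_PAIR.sub(
        lambda m: bytes([int(m.group(1), 16)]), quoted_bytes
    )
    try:
        return unquoted.decode(subencoding)
    except UnicodeDecodeError:
        # Assumption: bytes that are invalid in the subencoding leave the
        # input unaltered instead of raising.
        return text
```

With this sketch, `percent_decode_sketch("café%20%C3%A9")` yields `"café é"`, while a stray `%` that is not followed by two hex digits is left alone.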


codecov-io commented Jan 11, 2018

Codecov Report

Merging #59 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   97.96%   97.97%   +<.01%     
==========================================
  Files           8        8              
  Lines        1424     1431       +7     
  Branches      166      166              
==========================================
+ Hits         1395     1402       +7     
  Misses         14       14              
  Partials       15       15

| Impacted Files | Coverage Δ |
|---|---|
| hyperlink/test/test_decoded_url.py | 100% <100%> (ø) ⬆️ |
| hyperlink/_url.py | 96.14% <100%> (ø) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05cea28...4dd846d.

Member

@markrwilliams markrwilliams left a comment

It seems weird to work with partially percent-encoded strings, but I think it's OK because this is an internal API that converges intermediate values to a correct final output.

I don't understand subencoding=False. To be fair, it's just as limited now as it was when the input was ASCII-only, so that concern isn't specific to this PR.

Please explain the state of subencoding, write some additional tests, and fix _percent_decode's docstring before merging.

  Args:
- text (unicode): The ASCII text with percent-encoding present.
+ text (unicode): Text with percent-encoding present.
Member

An earlier part of the docstring contradicts this. Please rewrite it to document the new API.

  Returns:
- unicode: The percent-decoded version of *text*, with UTF-8
- decoding applied.
+ unicode: The percent-decoded version of *text*, with decoding
Member

Suggested wording: "The percent-decoded version of *text*, decoded with *subencoding*."

normalize_case (bool): Whether undecoded percent segments, such
as encoded delimiters, should be uppercased, per RFC 3986
Section 2.1. See :func:`_decode_path_part` for an example.
subencoding (unicode): The name of the encoding underlying the
Member

subencoding=False will now return bytes that include UTF-8 sequences; before it would only contain ASCII. That seems worth documenting here.
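
The change in return value can be shown with a small standalone sketch. It mirrors only the `text.encode(subencoding or 'utf-8')` line from the diff; the helper name `unquote_to_bytes_sketch` and the regex are assumptions, not hyperlink code.

```python
import re

# Matches one percent-encoded byte such as %41.
_PCT_PAIR = re.compile(rb"%([0-9A-Fa-f]{2})")


def unquote_to_bytes_sketch(text, subencoding=False):
    """Illustrative only: percent-decode to bytes with no text decoding."""
    # With subencoding=False, the `or` falls through to UTF-8.
    quoted_bytes = text.encode(subencoding or "utf-8")
    return _PCT_PAIR.sub(lambda m: bytes([int(m.group(1), 16)]), quoted_bytes)


# "\u00e9" is "é"; "%41" is a percent-encoded "A".
result = unquote_to_bytes_sketch("\u00e9%41")

# True when the returned bytes contain anything outside the ASCII range.
has_non_ascii = any(byte > 0x7F for byte in result)
```

Here `result` holds the two UTF-8 bytes for "é" followed by an ASCII "A". Under the previous `text.encode("ascii")`, the same input would raise `UnicodeEncodeError` before any percent-decoding took place, so a bytes result could only ever be ASCII.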

  """
  try:
- quoted_bytes = text.encode("ascii")
+ quoted_bytes = text.encode(subencoding or 'utf-8')
Member

I think this is the actual API surface, with ascii as a stand in for an arbitrary non-UTF-8 encoding:

| text | subencoding | quoted_bytes | return value |
|---|---|---|---|
| `<unicode>` | `"utf-8"` (default) | `<bytes, UTF-8 encoded>` | `<unicode, percent-encoded>` |
| `<unicode>` | `"ascii"` | `UnicodeEncodeError` | `<unicode, unaltered>` |
| `<unicode, ASCII congruent>` | `"ascii"` | `<bytes, ASCII>` | `<bytes, ASCII>` |
| `<unicode, charmap?>` | `False` | `<bytes, UTF-8 encoded>` | `<bytes, UTF-8 encoded>` |

The last case seems weird - I want bytes back because I know there isn't actually a "subencoding" (and I ended up with unicode presumably because somewhere else there's something like .decode("charmap")), but instead I get UTF-8 encoded bytes back. What then is the purpose of subencoding=False if it always implies UTF-8?

I think splitting this line up into an explicit if statement will make the situation clearer, and also make it easier to write tests that cover each case.
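
One possible shape for that explicit `if`, written as a standalone sketch rather than a proposed patch. The function name `percent_decode_explicit`, the helper `_unquote_bytes`, the fallback of returning `text` unaltered, and the choice to return unicode for every non-`False` subencoding are all assumptions made for illustration; they do not necessarily match the table above or the merged code.

```python
import re

# Matches one percent-encoded byte such as %20.
_PCT_PAIR = re.compile(rb"%([0-9A-Fa-f]{2})")


def _unquote_bytes(quoted_bytes):
    """Replace each %XX pair with the single byte it stands for."""
    return _PCT_PAIR.sub(lambda m: bytes([int(m.group(1), 16)]), quoted_bytes)


def percent_decode_explicit(text, subencoding="utf-8"):
    """Illustrative only: one branch per row of behavior to be tested."""
    if subencoding is False:
        # Case 1: the caller wants bytes back. UTF-8 is used here only to
        # get from unicode to bytes, which is exactly the implicit
        # behavior the review comment questions.
        return _unquote_bytes(text.encode("utf-8"))

    try:
        quoted_bytes = text.encode(subencoding)
    except UnicodeEncodeError:
        # Case 2: text is not representable in the subencoding.
        return text

    try:
        # Case 3: decode the unquoted bytes back to unicode.
        return _unquote_bytes(quoted_bytes).decode(subencoding)
    except UnicodeDecodeError:
        # Case 4: the unquoted bytes are invalid in the subencoding.
        return text
```

Each branch can then be exercised by its own test, for example `percent_decode_explicit("a%20b", "ascii")` for the ASCII path and `percent_decode_explicit("é%20", "ascii")` for the encode-failure path.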

@mahmoud mahmoud merged commit 28c908b into master Feb 24, 2018
@mahmoud mahmoud removed the request for review from alexwlchan February 24, 2018 22:53
@mahmoud mahmoud deleted the i58_mixed_percent_decoding branch April 8, 2019 07:08