Unicode mixed percent decoding #59

mahmoud · 2018-01-11T02:55:58Z

Enhance _percent_decode() so that it properly decodes percent-encoded pairs even when non-ASCII text is present, and extend its docstring. Remove the re-percent-encoding logic from DecodedURL. Now _percent_decode() actually does what the name says!
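To make "mixed" decoding concrete, here is a minimal standalone sketch. It is not hyperlink's implementation: the function name `percent_decode_mixed` and the regex-based approach are assumptions made for illustration, and it targets Python 3 only. It decodes each run of percent-encoded pairs and leaves any literal non-ASCII characters in the text untouched.

```python
import re

# One or more consecutive %XX pairs, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def percent_decode_mixed(text, encoding="utf-8"):
    """Hypothetical helper: decode %XX runs inside text that may also
    contain literal non-ASCII characters."""

    def _decode_run(match):
        # Turn "%C3%A9" into the raw bytes b"\xc3\xa9".
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # Not valid in the chosen encoding: leave the run as it was.
            return match.group(0)

    return _PCT_RUN.sub(_decode_run, text)


print(percent_decode_mixed("caf%C3%A9/é"))
```

A stray `%` that is not followed by two hex digits never matches the pattern, so it passes through unchanged in this sketch.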

codecov-io · 2018-01-11T02:58:37Z

Codecov Report

Merging #59 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   97.96%   97.97%   +<.01%     
==========================================
  Files           8        8              
  Lines        1424     1431       +7     
  Branches      166      166              
==========================================
+ Hits         1395     1402       +7     
  Misses         14       14              
  Partials       15       15

Impacted Files	Coverage Δ
hyperlink/test/test_decoded_url.py	`100% <100%> (ø)`	⬆️
hyperlink/_url.py	`96.14% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05cea28...4dd846d.

markrwilliams

It seems weird to work with partially percent-encoded strings, but I think it's OK because this is an internal API that converges intermediate values to a correct final output.

I don't understand subencoding=False. To be fair, it's just as limited now as it was when it was ASCII, so that's a concern that's not specific to this PR.

Please explain the state of subencoding, write some additional tests, and fix _percent_decode's docstring before merging.

markrwilliams · 2018-02-24T19:51:17Z

hyperlink/_url.py


    Args:
-       text (unicode): The ASCII text with percent-encoding present.
+       text (unicode): Text with percent-encoding present.


An earlier part of the docstring contradicts this. Please rewrite it to document the new API.

markrwilliams · 2018-02-24T19:52:44Z

hyperlink/_url.py

    Returns:
-       unicode: The percent-decoded version of *text*, with UTF-8
-         decoding applied.
+       unicode: The percent-decoded version of *text*, with decoding


The percent-decoded version of text, ~~with decoding applied~~ decoded with subencoding,

markrwilliams · 2018-02-24T19:56:50Z

hyperlink/_url.py

       normalize_case (bool): Whether undecoded percent segments, such
          as encoded delimiters, should be uppercased, per RFC 3986
          Section 2.1. See :func:`_decode_path_part` for an example.
+       subencoding (unicode): The name of the encoding underlying the


subencoding=False will now return bytes that include UTF-8 sequences; before it would only contain ASCII. That seems worth documenting here.
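The effect described here can be sketched outside the library. The helper below is hypothetical (`percent_decode_to_bytes` is not a hyperlink function) and assumes Python 3; it only imitates the "encode as UTF-8, then decode percent pairs, then return bytes" path to show how UTF-8 sequences end up in the result.

```python
import re

# A single %XX pair in a bytes string.
_PCT_PAIR = re.compile(rb"%([0-9A-Fa-f]{2})")


def percent_decode_to_bytes(text):
    """Hypothetical stand-in for the subencoding=False path:
    return bytes without applying any final text decoding."""
    # Literal non-ASCII characters become UTF-8 byte sequences here.
    quoted_bytes = text.encode("utf-8")
    # Each %XX pair becomes the single byte it names.
    return _PCT_PAIR.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")),
        quoted_bytes,
    )


# "%E9" is a Latin-1 style encoded byte; the literal "é" is encoded as UTF-8.
print(percent_decode_to_bytes("%E9/é"))
```

In this sketch the same character can appear as `\xe9` (from the percent pair) and as `\xc3\xa9` (from the literal text) in one result, which is the kind of mixture a docstring note would warn about.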

markrwilliams · 2018-02-24T20:17:39Z

hyperlink/_url.py

    """
    try:
-        quoted_bytes = text.encode("ascii")
+        quoted_bytes = text.encode(subencoding or 'utf-8')


I think this is the actual API surface, with ascii as a stand-in for an arbitrary non-UTF-8 encoding:

`text`	`subencoding`	`quoted_bytes`	return value
`<unicode>`	`"utf-8"` (default)	`<bytes, UTF-8 encoded>`	`<unicode, percent-encoded>`
`<unicode>`	`"ascii"`	`UnicodeEncodeError`	`<unicode, unaltered>`
`<unicode, ASCII congruent>`	`"ascii"`	`<bytes, ASCII>`	`<bytes, ASCII>`
`<unicode, charmap?>`	`False`	`<bytes, UTF-8 encoded>`	`<bytes, UTF-8 encoded>`

The last case seems weird - I want bytes back because I know there isn't actually a "subencoding" (and I ended up with unicode presumably because somewhere else there's something like .decode("charmap")), but instead I get UTF-8 encoded bytes back. What then is the purpose of subencoding=False if it always implies UTF-8?

I think splitting this line up into an explicit if statement will make the situation clearer, and also make it easier to write tests that cover each case.
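One way the suggested split could look is sketched below. The helper name `encode_for_decoding` is hypothetical and exists only to isolate the encoding step; the behavior matches `text.encode(subencoding or 'utf-8')` from the diff, with each branch spelled out so it can be tested on its own.

```python
def encode_for_decoding(text, subencoding="utf-8"):
    """Hypothetical helper isolating the encoding step of the diff."""
    if subencoding:
        # A named subencoding: may raise UnicodeEncodeError if the text
        # contains characters that encoding cannot represent.
        quoted_bytes = text.encode(subencoding)
    else:
        # subencoding=False (or any other falsy value): the text still
        # has to become bytes, and UTF-8 is used implicitly.
        quoted_bytes = text.encode("utf-8")
    return quoted_bytes


print(encode_for_decoding("é"))         # default "utf-8" branch
print(encode_for_decoding("abc", "ascii"))  # ASCII-congruent text
print(encode_for_decoding("é", False))  # falsy branch, same bytes as UTF-8
```

Written this way, the `False` branch makes the implicit UTF-8 choice visible, which is the point the table's last row raises.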

mahmoud added 2 commits January 10, 2018 18:48

enable _percent_decode to decode percent encodings within unicode text

5d4b542

remove excessive _encode_* from DecodedURL now that _percent_decode()…

a0cf6d5

… supports mixed decoding

mahmoud mentioned this pull request Jan 23, 2018

Decode percent-encoding in mixed text #58

Closed

mahmoud requested a review from alexwlchan February 16, 2018 22:41

markrwilliams approved these changes Feb 24, 2018

View reviewed changes

add a couple more tests around mixed percent decoding and fix docstri…

4dd846d

…ng for _percent_decode, per @markrwilliams review

mahmoud merged commit 28c908b into master Feb 24, 2018

mahmoud removed the request for review from alexwlchan February 24, 2018 22:53

mahmoud deleted the i58_mixed_percent_decoding branch April 8, 2019 07:08
