DecodedURL #54

mahmoud · 2017-12-31T08:27:53Z

This still needs some documentation, but to address #44 and the handful of issues surrounding the central problem, I present DecodedURL. It takes care of handling reserved characters so you don't have to.

From the docstring:

DecodedURL is a type meant to act as a higher-level interface to the URL. It is the unicode to URL's bytes. DecodedURL has almost exactly the same API as URL, but everything going in and out is in its maximally decoded state. All percent decoding is handled automatically.

Where applicable, a UTF-8 encoding is presumed. Be advised that some interactions can raise UnicodeEncodeErrors and UnicodeDecodeErrors, just like when working with bytestrings.

Examples of such interactions include handling query strings encoding binary data, and paths containing segments with special characters encoded with codecs other than UTF-8.
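
As a rough illustration of what "maximally decoded" means and where those Unicode errors come from, here is a minimal sketch. It uses only the standard library's urllib.parse rather than hyperlink itself, and the sample segments are made up for the example:

```python
from urllib.parse import quote, unquote

# A path segment whose special character was percent-encoded as UTF-8.
utf8_segment = "caf%C3%A9"
decoded = unquote(utf8_segment, encoding="utf-8", errors="strict")

# Re-encoding the decoded text (with no characters marked safe) gives
# back the encoded form we started from.
reencoded = quote(decoded, safe="")

# The same word encoded with Latin-1 instead of UTF-8. The lone byte 0xE9
# is not valid UTF-8, so a strict UTF-8 decode raises.
latin1_segment = "caf%E9"
try:
    unquote(latin1_segment, encoding="utf-8", errors="strict")
    failed = False
except UnicodeDecodeError:
    failed = True
```

The failure case is the kind of interaction the docstring warns about: the percent-encoded bytes are perfectly legal in a URL, but they cannot be turned into text under the presumed codec.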

It's tested, works, and seems practical, though, so take a look!

codecov-io · 2017-12-31T08:30:07Z

Codecov Report

Merging #54 into master will increase coverage by 0.13%.
The diff coverage is 98.59%.

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
+ Coverage    97.8%   97.94%   +0.13%     
==========================================
  Files           6        8       +2     
  Lines        1137     1408     +271     
  Branches      137      164      +27     
==========================================
+ Hits         1112     1379     +267     
- Misses         13       14       +1     
- Partials       12       15       +3

Impacted Files	Coverage Δ
hyperlink/test/test_parse.py	`100% <100%> (ø)`
hyperlink/test/test_decoded_url.py	`100% <100%> (ø)`
hyperlink/__init__.py	`100% <100%> (ø)`	⬆️
hyperlink/_url.py	`96.1% <97.57%> (+0.37%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 17dc8d3...e8616fa.

alexwlchan · 2017-12-31T10:22:45Z

hyperlink/_url.py

 _UNRESERVED_DECODE_MAP = dict([(k, v) for k, v in _HEX_CHAR_MAP.items()
                               if v.decode('ascii', 'replace')
                               in _UNRESERVED_CHARS])

 _ROOT_PATHS = frozenset(((), (u'',)))


+def _encode_reserved(text, maximal=True):
+    """A very comprehensive percent encoding for encoding all
+    delimeters. Used for arguments to DecodedURL, where a % means a


nit: "delimiters"

mahmoud · 2017-12-31

Oooh, this kind of typo is so very unlike me, I assure you! ;) Thanks!

alexwlchan · 2017-12-31T10:35:03Z

hyperlink/_url.py

+        return not self.__eq__(other)
+
+    def __hash__(self):
+        return hash((self.__class__, self.scheme, self.userinfo, self.host,


minor: Since equality is delegated upwards to URL, would it be better to do the same for hashing?

e.g.

def __hash__(self): return hash(self.normalize().to_uri())

mahmoud · 2017-12-31

That's a good point. I may have subtly changed my mind midway through development, and now I lean more toward not delegating up to URL for equality. I'll get that fixed, thanks!

glyph · 2018-01-02

Whenever you're delegating hash like this, remember you should also include a tweak, so that the DecodedURL and the URL representing the same data don't hash the same.

moshez · 2018-01-02

@glyph why is that? (Correctness shouldn't be affected since dictionaries compare by equality, and I'm not entirely sure what performance problem you're trying to forestall.)

markrwilliams · 2018-01-07

I share @moshez's curiosity.

@mahmoud - will you follow @alexwlchan's advice about delegating __hash__?
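
For readers following the hashing discussion, here is a minimal sketch of the "tweak" being described. The `Encoded` and `Decoded` classes are toy stand-ins invented for this example, not hyperlink's URL and DecodedURL, and the sketch takes no position on whether the tweak is actually necessary:

```python
class Encoded:
    """Toy stand-in for an encoded URL type (not hyperlink's URL)."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if not isinstance(other, Encoded):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        return hash(self.text)


class Decoded:
    """Toy wrapper that delegates equality and hashing to an Encoded."""

    def __init__(self, encoded):
        self._encoded = encoded

    def __eq__(self, other):
        if not isinstance(other, Decoded):
            return NotImplemented
        return self._encoded == other._encoded

    def __hash__(self):
        # The "tweak": mix the wrapper's class into the hashed tuple so the
        # hash is computed from different input than the wrapped object's.
        return hash((self.__class__, self._encoded))


encoded = Encoded("http://example.com/")
decoded = Decoded(encoded)
mapping = {encoded: "encoded", decoded: "decoded"}
```

Because the two types never compare equal, both keys survive in `mapping` whether or not their hashes collide, which is the point @moshez raises. The tweak only changes how likely the two objects are to land in the same hash bucket.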

glyph · 2017-12-31T23:11:10Z

hyperlink/__init__.py


 __all__ = [
    "URL",
+    "DecodedURL",


I want to say that we should just avoid exposing this entirely, but it probably needs to be exposed for type annotations.

However, as per #44, could we have a "decoded" property on URL that provides an interface to this, and an "encoded" property on DecodedURL that maps back to a URL?

(At this point I think I'm in favor of adding an EncodedURL alias for URL, then maybe adding a top-level entry point like hyperlink.parse() which takes a decoded kwarg flag which defaults to true, to make it easier to get started with DecodedURL which is what I think we all want most of the time. That can definitely be deferred to a separate ticket though.)

mahmoud · 2017-12-31

Yes, I am pretty much in favor of all those conveniences :) And I also agree that this probably needs to be exposed.

…vel API. fix a bug with userinfo where double-escapes were possible because % wasn't marked as safe. add lazy option to DecodedURL, also exposed in parse and URL's new get_decoded_url() method. add a few more notes and comment tweaks

markrwilliams

The approach seems good and it addresses my question:

>>> from hyperlink import DecodedURL, URL
>>> d = DecodedURL(URL())
>>> d.add(u'parameter', u'#value').to_text()
u'?parameter=%23value'

I think this means we should use DecodedURL in twisted.python.url. That's great!
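
The behaviour in the session above can be approximated with the standard library alone. This is only a sketch of the idea, not how DecodedURL is implemented; `encode_query_value` is a hypothetical helper named for this example:

```python
from urllib.parse import quote, unquote


def encode_query_value(value):
    # Hypothetical helper: percent-encode every reserved character,
    # including "#" and "%", so the value cannot be read as a delimiter.
    return quote(value, safe="")


# "#" would otherwise start the fragment, so it is escaped.
query = "?parameter=" + encode_query_value("#value")

# A literal "%" in the decoded value is escaped too, which lets the value
# survive a round trip through the encoded form.
escaped = encode_query_value("spa%ed")
restored = unquote(escaped)
```

The second case mirrors the `'spa%ed'` value used in the PR's tests: on the decoded side a `%` is just a character, so it has to be escaped on the way out.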

I've left comments about documentation improvements and lingering TODOs. I'd also like to see clarity around __hash__. Please address these issues with changes or PR comments before merging.

markrwilliams · 2018-01-07T01:09:44Z

hyperlink/_url.py

@@ -1040,7 +1083,7 @@ def from_text(cls, text):
                   rooted, userinfo, uses_netloc)

    def normalize(self, scheme=True, host=True, path=True, query=True,
-                  fragment=True):
+                  fragment=True, userinfo=True):


The docstring should mention userinfo.

There's a TODO for userinfo. Should that go?

markrwilliams · 2018-01-07T01:57:40Z

hyperlink/_url.py

+    handled automatically.
+
+    Where applicable, a UTF-8 encoding is presumed. Be advised that
+    some interactions, can raise UnicodeEncodeErrors and


comma splice:

some interactions , can raise UnicodeEncodeErrors...

Also, maybe you want backticks around UnicodeEncodeErrors and UnicodeDecodeErrors.

markrwilliams · 2018-01-07T01:59:43Z

hyperlink/_url.py

+    UnicodeDecodeErrors, just like when working with
+    bytestrings.
+
+    Examples of such interactions include handling query strings


Should this be part of the previous paragraph?

markrwilliams · 2018-01-07T02:02:59Z

hyperlink/_url.py

+    encoding binary data, and paths containing segments with special
+    characters encoded with codecs other than UTF-8.
+    """
+    def __init__(self, url, lazy=False):


If this is a public class, then its initializer should be documented. Please include an explanation of url and lazy in the docstring.

markrwilliams · 2018-01-07T02:04:07Z

hyperlink/_url.py

+
+    def click(self, href=u''):
+        "Return a new DecodedURL wrapping the result of :meth:`~hyperlink.URL.click()`"
+        return type(self)(self._url.click(href=href))


Why not self.__class__?
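
For context on the question, a small sketch with made-up classes (`Wrapper` and `SubWrapper` are not from the PR). For ordinary classes the two spellings construct the same type, including in subclasses; they can only disagree when an object overrides `__class__`, as some proxy or mock objects do:

```python
class Wrapper:
    def __init__(self, value):
        self.value = value

    def with_type(self, value):
        # Same spelling as the diff: build a new instance of the
        # runtime type of self.
        return type(self)(value)

    def with_dunder_class(self, value):
        # The reviewer's suggested spelling.
        return self.__class__(value)


class SubWrapper(Wrapper):
    pass


sub = SubWrapper(1)
via_type = sub.with_type(2)
via_dunder = sub.with_dunder_class(2)
```

Either way, a subclass gets instances of itself back rather than instances of the base class.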

markrwilliams · 2018-01-07T02:33:37Z

hyperlink/test/test_decoded_url.py

+        durl = durl.set(' ', 'spa%ed')
+        assert durl.get(' ') == ['spa%ed']
+
+        durl = DecodedURL(url=durl.to_uri())


What's the point of this? Round tripping?

markrwilliams · 2018-01-07T02:34:46Z

hyperlink/test/test_decoded_url.py

+
+        assert durl.set('arg', 'd').get('arg') == ['d']
+
+    def test_equivalences(self):


This really tests __eq__ and __hash__, so maybe test_equality_and_hashability?

markrwilliams · 2018-01-07T02:35:39Z

hyperlink/test/test_decoded_url.py

+
+        durl_map = {}
+        durl_map[durl] = durl
+        durl_map[durl2] = durl2


This should also test that a URL and a DecodedURL that represent the same underlying URL don't overlap in a dict (or set.)

markrwilliams · 2018-01-07T02:36:58Z

hyperlink/test/test_decoded_url.py

+
+        assert len(durl_map) == 1
+
+    def test_replace_roundtrip(self):


This is good, but there should also be a roundtrip test between TOTAL_URL and DecodedURL.

markrwilliams · 2018-01-07T02:39:33Z

hyperlink/test/test_parse.py

+
+BASIC_URL = 'http://example.com/#'
+TOTAL_URL = "https://%75%73%65%72:%00%00%00%00@xn--bcher-kva.ch:8080/a/nice%20nice/./path/?zot=23%25&zut#frég"
+UNDECODABLE_FRAG_URL = TOTAL_URL + '%C3'


It'd be good to note that this is undecodable because %C3 makes it invalid UTF-8.
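
A short sketch of why the trailing %C3 is undecodable. The fragment here is a simplified, fully percent-encoded stand-in for the one in TOTAL_URL, and it uses urllib.parse rather than hyperlink:

```python
from urllib.parse import unquote_to_bytes

# "fr%C3%A9g" is "frég" encoded as UTF-8; the extra "%C3" adds a lone
# lead byte of a two-byte sequence with no continuation byte after it.
raw = unquote_to_bytes("fr%C3%A9g%C3")

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Dropping the dangling byte makes the data valid UTF-8 again.
valid_text = raw[:-1].decode("utf-8")
```

In UTF-8, 0xC3 announces a two-byte character, so a string that ends right after it is truncated mid-character and a strict decode fails.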

mahmoud added 16 commits December 17, 2017 01:36

WIP DecodedURL with path and replace

98edfc6

DecodedURL.replace(query) and DecodedURL.query working. Fix up userinfo a bit.

dce971b

DecodedURL userinfo now working

2162c67

add new arguments to _percent_decode and use them in the DecodedURL

d1bfc68

added DecodedURL to the public API, started on tests, added DecodedURL.get()

adc7370

add host and port to DecodedURL

fc28ee3

add DecodedURL.scheme property, plus some more tests

358402f

add and test a bunch of DecodedURL passthrough methods

34d4212

test DecodedURL userinfo-related properties. Add generic percent encoder for any reserved characters, and used it in .child() and .sibling(), will add it in further methods shortly

161e93a

add, test, and slightly refactor query manipulation methods. still probably have the issue with not-yet-normalized query parameter names (mixed decoded and encoded query parameter names that overlap).

95ea9ea

fix and test the aforementioned issue with query parameters which are overlapping due to not being normalized

976c083

bit of housekeeping on DecodedURL, rearranging and adding a couple minor checks

02a841c

DecodedURL: made .query a tuple. more obvious errors on attempted mutation (.append, etc.). Also add and test __eq__, __ne__, and __hash__

9a0438e

DecodedURL.replace() now fixed and tested working for roundtripping URLs of some complexity. Had to add userinfo to URL.normalize() to help with equality checks.

662ec79

add some Twisted-style methods to DecodedURL for consistency with URL and other Twisted codebases

3aa4b61

some notes and docs on DecodedURL

82deb29

mahmoud requested a review from glyph December 31, 2017 08:27

mahmoud mentioned this pull request Dec 31, 2017

Recommended practice for adding reserved characters? #44

Closed

alexwlchan reviewed Dec 31, 2017

glyph reviewed Dec 31, 2017

mahmoud added 2 commits December 31, 2017 18:26

split out host decoding to its own function, remove duplicated code

0b311d7

fix delimiter typo in docstring, thanks alex!

30e19a6

glyph mentioned this pull request Jan 1, 2018

what if my password has a reserved delimiter in it? #11

Closed

mahmoud and others added 6 commits December 31, 2017 21:29

add lazy keyword to DecodedURL methods where appropriate. Add some coverage-oriented tests. Remove the password attributes of DecodedURL pending future discussion.

67ab0ec

DecodedURL pretty much fully covered by tests

6a90f4a

docstrings for all DecodedURL members

ad63b9b

cover another line in _percent_decode

afc907b

Merge branch 'master' into i44_decoded_url

9f3212b

mahmoud added 2 commits January 6, 2018 15:54

fixing test coverage for parse

ac3be79

Merge branch 'i44_decoded_url' of github.com:python-hyper/hyperlink into i44_decoded_url

dd8248c

mahmoud mentioned this pull request Jan 7, 2018

why is str() the same as repr()? #49

Closed

markrwilliams approved these changes Jan 7, 2018

mahmoud added 2 commits January 6, 2018 19:24

DecodedURL: address most of the comments about docstrings

6dd3272

add a few more DecodedURL/parse tests and a couple comments

e8616fa

mahmoud changed the title ~~WIP: DecodedURL~~ DecodedURL Jan 7, 2018

mahmoud merged commit a23a1a4 into master Jan 7, 2018

mahmoud mentioned this pull request Jan 11, 2018

Decode percent-encoding in mixed text #58

Closed

glyph deleted the i44_decoded_url branch January 23, 2018 00:15
