Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove "uses_netloc" parameter and attribute to .__init__ and .replace #24

Open
glyph opened this issue Jul 3, 2017 · 10 comments
Open

Comments

@glyph
Copy link
Collaborator

glyph commented Jul 3, 2017

In twisted.python.url I tried very hard to eliminate the use of the confusing, vague, and antiquated term "netloc". (NB: neither https://tools.ietf.org/html/rfc3986 nor https://url.spec.whatwg.org includes the string "netloc"). I would therefore like to eliminate its use in uses_netloc as well.

uses_netloc describes an attribute of a scheme. However, in its current incarnation it can be made nonsensically inconsistent with the scheme it's actually using. Additionally, schemes also have other attributes, such as a name, and a default port number. (Possibly more in the future?) As such, the presence of the scheme registry makes URL objects kinda/sorta mutable after the fact.

My suggestion for a replacement would be for URL to reference a structured scheme object; the external scheme registry could then be a collection of such objects. If a URL is created before the registry is populated, its scheme object could then be inconsistent with the global one, but it could be replaced. At this point, I believe schemes would just have the three attributes given above (i.e. the parameters to registerScheme.

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

WHATWG and 3986 love to avoid talking about the slashes, but I don't know what to call them other than "netloc slashes", and uses_netloc_slashes seems a bit much.

Naming aside, uses_netloc is there for roundtripping and "just works"-ness. While I tried very hard to create a best-in-class scheme registry, with ~60 preregistered schemes, I still see it as inevitable that someone will have a scheme that's _not_registered. (postgres and other SQLAlchemy conventional driver names are such examples.)

In those cases, as mentioned in #23, URL uses uses_netloc to pass forward whether the original URL passed to from_text() used the netloc slashes. Not only is explicit better than implicit, but it's real information that we would be dropping on the ground otherwise.

If we switched to an object-based scheme, we could generate Scheme objects on the fly for unrecognized schemes, but there's nothing stopping people from basically doing the same thing. It'd be slightly harder, but I'd contend that it's also currently quite hard to make this mistake.

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

WHATWG doesn't care much about the slashes since they're just "the things that come after http://" :) but 3986 specifically documents them; they are the slashes that come before the authority.

Which… actually raises the question. They always and only come before the authority. The scheme attribute is just whether or not it has an authority. Which is just a validation thing: the selected scheme should be able to prevent you from using an authority, but the information about the scheme isn't necessary past that point. So the attribute seems like it ought to be useless.

In fact, reading the implementation alerted me that there's a bug in this area (another failed asText / fromText round trip):

>>> URL().replace(scheme=u'http', path=[u'',u'',u'google.com'])
URL.from_text(u'http://google.com')

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

(My thinking on that particular bug is that scheme=u'anything', host=u'', path=[u'', u'', u'...'] is itself a validation error and should just raise an exception; it is unrepresentable in the text format.)

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

Haha, bravo, that's a good one!

One question that's been on my mind from the start is: what is the goal of the __init__ method? Like these are two different things:

  1. __init__ will only care about Python types and storage
  2. __init__ will not accept arguments that will create invalid URLs
  3. __init__ will raise on any arguments that can't be roundtripped through a valid URL

Because twisted.python.URL was definitely on 1, and I assumed it was spiritually intended to be there, with __init__ acting as a low-level interface.

But now we're definitely in 2 verging on 3. If that's the goal, then there's a lot of parsing logic spread around (e.g., in from_text and parse_host and to_text and elsewhere) that needs to move into __init__.

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

Are 2 and 3 meaningfully different?

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

__init__ is a public API, and we have a clear distinction between valid and invalid states; we're already well down that road with the changes to validating the path.

In other cases I might be swayed that __init__ is a "you break it you can keep both pieces" kind of API, but .replace is very definitely public, and is effectively just a light wrapper over __init__, and should be able to prevent you from shooting yourself in the foot.

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

So I think it's becoming clear that http://fsharpforfunandprofit.com/posts/designing-with-types-making-illegal-states-unrepresentable/ is __init__'s job for URL in this case.

@mahmoud
Copy link
Member

mahmoud commented Jul 3, 2017

2 and 3 are of course different, as you've shown in your case! You constructed a valid URL through creative means.

@glyph
Copy link
Collaborator Author

glyph commented Jul 3, 2017

Oh, I see what you mean. I would just consider it an "invalid" URL if it doesn't have a string representation, so I'd say both 2&3 :)

@glyph
Copy link
Collaborator Author

glyph commented Feb 16, 2020

I think I'm going to close this, because #112 deals with the practical problem here, as well as updating the documentation to avoid making the mistake of calling the path portion of mailto: URLs the "host".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants