Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add string UTF-8 percent encode #503

Merged
merged 4 commits into from
May 12, 2020
Merged

Conversation

annevk
Copy link
Member

@annevk annevk commented May 9, 2020

And use it internally. This is also an initial step for #369.

If reviewers have thoughts on how to resolve #296 that'd be appreciated. Renaming things is in scope for this PR modulo the caveats mentioned there. Resolving to continue using the existing names is also fine with me.


Preview | Diff

@annevk annevk requested review from domenic and rmisev May 9, 2020 08:47
@annevk
Copy link
Member Author

annevk commented May 10, 2020

@rmisev so what do you think? Keep the existing names and close #296?

@rmisev
Copy link
Member

rmisev commented May 10, 2020

I am fine with existing names. Although "conditionally-UTF-8-percent encode" isn't bad too.

@masinter
Copy link

I've been thinking about this a lot and I think part of the problem is the use of the word "string" to sometimes mean a sequence of bytes, a sequence of code points, a sequence of abstract characters.
The charset-named translations like utf-8 or us-ascii are mappings between sequence of abstract characters and sequence of bytes.

If you name the levels of encoding, some of the ambiguity drops away.

@annevk
Copy link
Member Author

annevk commented May 11, 2020

If we use string (as defined by https://infra.spec.whatwg.org/#string) to mean bytes anywhere that's a bug. With regards to UTF-8 percent encode there's two confusing things I think:

  • Not all input is necessarily percent encoded. This could be addressed by adding "conditional" if this is confusing enough.
  • Percent encoding doesn't always mean bytes > strings. It just means that the input will be transformed to use the %.. encoding, but the inputs and outputs in terms of bytes/strings differ and are mostly made to suit the callers. I'm not sure there's clean way out of this other than avoiding the word "encoding" altogether.

Copy link
Member

@saschanaz saschanaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As both will be used from other specs

url.bs Outdated Show resolved Hide resolved
url.bs Outdated Show resolved Hide resolved
url.bs Outdated Show resolved Hide resolved
@domenic
Copy link
Member

domenic commented May 11, 2020

I think it would help to have an example table with clear inputs and outputs. Something like:

Here is a summary, by way of example, of these various operations:

Operation Example Input Example Output
percent encode 0x23 "%23"
percent decode `blah` `blah`
string percent decode "blah" `blah`
UTF-8 percent encode with userinfo percent-encode set U+LARGENUMBER "%12%34"
UTF-8 percent encode with userinfo percent-encode set "blah" "%blah%blah"

I think having these examples of inputs and outputs, which use the appropriate typography, will convey more than simply stating the return types.

@annevk
Copy link
Member Author

annevk commented May 11, 2020

That sounds pretty good. Note to self: Fetch and HTML have dependencies on "string percent decode".

@sideshowbarker
Copy link
Contributor

sideshowbarker commented May 12, 2020

(Warning: This long comment may seem like getting into “how many angels can fit on head of pin” territory but is offered FWIW from my (long-ago) training and experience as a technical writer.)

As far as use of hyphens/dashes in the operation names, some options to consider:

  1. (which I recommend) Always hyphenate percent-encode and percent-decode, and:
    • Don’t put a hyphen before percent-encode or percent-decode in any operation name that just has a single qualifier; so “UTF-8 percent-encode” and “string percent-encode”.
    • Do put a hyphen before percent-encode or percent-decode in any operation name that has more than one qualifier; so ”conditionally UTF-8-percent-encode”.
  2. Hyphenate everything: “UTF-8-percent-encode” and “conditionally-UTF-8-percent-encode” and “UTF-8-percent-encode-with-userinfo-percent-encode-set”.
  3. Hyphenate nothing (except UTF-8 of course, and except when “percent-encode” or “percent-decode” are adjectives): “UTF-8 percent encode” and “conditionally UTF-8 percent encode” and “UTF-8 percent encode with userinfo percent-encode set.

Option 2 above is most-clearly not the best choice in this case, because if you apply it, you end up with that “UTF-8-percent-encode-with-userinfo-percent-encode-set” monstrosity.

Option 3 is also sub-optimal in this case, since in the same operation name, you end up with both “percent encode” (unhyphenated) and “percent-encode” (hyphenated). That risks making readers stumble — because at first reading, it might seem odd/inconsistent (even though it really isn’t).

Option 1 in this case therefore kind of hits the sweet spot (given all of the above).


Further rationale

While there aren’t set-in-stone rules for when to add hyphens — it’s always partly a judgement call — the main guiding principle (especially in technical docs) is to add hyphens to avoid ambiguity; that is, to avoid having readers mis-read or mis-parse or stumble over a particular phrase.

That’s especially important in cases where what you’re hyphenating is a compound adjective — because those are the cases where there’s the greatest risk of ambiguity and misreading. It’s less important in cases where what you’re hyphenating is a noun or verb — which is essentially the case here, because these are operation names.

But another principle is consistency: If you know you’re otherwise gonna end up with cases like “UTF-8 percent encode with userinfo percent-encode set” — one operation name that has the same combination of words both unhyphenated (“percent encode”) and hyphenated (“percent-encode”) — then avoid that by just always hyphenating it.

And the final principle is: When you’ve ended up with a term that’s itself referencing/qualifying some other multi-word term, use hyphens to make that more clear; so that’s what leads you to “conditionally UTF-8-percent-encode” rather than just “conditionally UTF-8 percent-encode”.

And use it internally. This is also an initial step for #369.
@annevk annevk force-pushed the annevk/string-utf-8-percent-encode branch from 6932e5b to bc67da4 Compare May 12, 2020 06:57
annevk added a commit to whatwg/fetch that referenced this pull request May 12, 2020
annevk added a commit to whatwg/html that referenced this pull request May 12, 2020
It's UTF-8 percent-encode (for code points) and percent-decode (for strings) now. See whatwg/url#503.
@annevk annevk merged commit 8e1c9e3 into master May 12, 2020
annevk added a commit to whatwg/html that referenced this pull request May 12, 2020
It's UTF-8 percent-encode (for code points) and percent-decode (for strings) now. See whatwg/url#503.
@annevk annevk deleted the annevk/string-utf-8-percent-encode branch May 13, 2020 10:13
annevk added a commit to whatwg/fetch that referenced this pull request May 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

"UTF-8 percent encode c using the path percent-e..."
6 participants