Add string UTF-8 percent encode #503

annevk · 2020-05-09T08:47:15Z

And use it internally. This is also an initial step for #369.

If reviewers have thoughts on how to resolve #296 that'd be appreciated. Renaming things is in scope for this PR modulo the caveats mentioned there. Resolving to continue using the existing names is also fine with me.

Preview | Diff

annevk · 2020-05-10T04:55:43Z

@rmisev so what do you think? Keep the existing names and close #296?

rmisev · 2020-05-10T18:08:35Z

I am fine with existing names. Although "conditionally-UTF-8-percent encode" isn't bad too.

masinter · 2020-05-10T18:30:32Z

I've been thinking about this a lot and I think part of the problem is the use of the word "string" to sometimes mean a sequence of bytes, a sequence of code points, a sequence of abstract characters.
The charset-named translations like utf-8 or us-ascii are mappings between sequence of abstract characters and sequence of bytes.

If you name the levels of encoding, some of the ambiguity drops away.

annevk · 2020-05-11T07:50:32Z

If we use string (as defined by https://infra.spec.whatwg.org/#string) to mean bytes anywhere that's a bug. With regards to UTF-8 percent encode there's two confusing things I think:

Not all input is necessarily percent encoded. This could be addressed by adding "conditional" if this is confusing enough.
Percent encoding doesn't always mean bytes > strings. It just means that the input will be transformed to use the %.. encoding, but the inputs and outputs in terms of bytes/strings differ and are mostly made to suit the callers. I'm not sure there's clean way out of this other than avoiding the word "encoding" altogether.

saschanaz

As both will be used from other specs

url.bs

domenic · 2020-05-11T17:13:55Z

I think it would help to have an example table with clear inputs and outputs. Something like:

Here is a summary, by way of example, of these various operations:

Operation	Example Input	Example Output
percent encode	0x23	"`%23`"
percent decode	``blah``	``blah``
string percent decode	"`blah`"	``blah``
UTF-8 percent encode with userinfo percent-encode set	U+LARGENUMBER	"`%12%34`"
UTF-8 percent encode with userinfo percent-encode set	"`blah`"	"`%blah%blah`"

I think having these examples of inputs and outputs, which use the appropriate typography, will convey more than simply stating the return types.

annevk · 2020-05-11T17:35:53Z

That sounds pretty good. Note to self: Fetch and HTML have dependencies on "string percent decode".

sideshowbarker · 2020-05-12T04:26:31Z

(Warning: This long comment may seem like getting into “how many angels can fit on head of pin” territory but is offered FWIW from my (long-ago) training and experience as a technical writer.)

As far as use of hyphens/dashes in the operation names, some options to consider:

(which I recommend) Always hyphenate percent-encode and percent-decode, and:
- Don’t put a hyphen before percent-encode or percent-decode in any operation name that just has a single qualifier; so “UTF-8 percent-encode” and “string percent-encode”.
- Do put a hyphen before percent-encode or percent-decode in any operation name that has more than one qualifier; so ”conditionally UTF-8-percent-encode”.
Hyphenate everything: “UTF-8-percent-encode” and “conditionally-UTF-8-percent-encode” and “UTF-8-percent-encode-with-userinfo-percent-encode-set”.
Hyphenate nothing (except UTF-8 of course, and except when “percent-encode” or “percent-decode” are adjectives): “UTF-8 percent encode” and “conditionally UTF-8 percent encode” and “UTF-8 percent encode with userinfo percent-encode set.

Option 2 above is most-clearly not the best choice in this case, because if you apply it, you end up with that “UTF-8-percent-encode-with-userinfo-percent-encode-set” monstrosity.

Option 3 is also sub-optimal in this case, since in the same operation name, you end up with both “percent encode” (unhyphenated) and “percent-encode” (hyphenated). That risks making readers stumble — because at first reading, it might seem odd/inconsistent (even though it really isn’t).

Option 1 in this case therefore kind of hits the sweet spot (given all of the above).

Further rationale

While there aren’t set-in-stone rules for when to add hyphens — it’s always partly a judgement call — the main guiding principle (especially in technical docs) is to add hyphens to avoid ambiguity; that is, to avoid having readers mis-read or mis-parse or stumble over a particular phrase.

That’s especially important in cases where what you’re hyphenating is a compound adjective — because those are the cases where there’s the greatest risk of ambiguity and misreading. It’s less important in cases where what you’re hyphenating is a noun or verb — which is essentially the case here, because these are operation names.

But another principle is consistency: If you know you’re otherwise gonna end up with cases like “UTF-8 percent encode with userinfo percent-encode set” — one operation name that has the same combination of words both unhyphenated (“percent encode”) and hyphenated (“percent-encode”) — then avoid that by just always hyphenating it.

And the final principle is: When you’ve ended up with a term that’s itself referencing/qualifying some other multi-word term, use hyphens to make that more clear; so that’s what leads you to “conditionally UTF-8-percent-encode” rather than just “conditionally UTF-8 percent-encode”.

And use it internally. This is also an initial step for #369.

See whatwg/url#503.

It's UTF-8 percent-encode (for code points) and percent-decode (for strings) now. See whatwg/url#503.

See whatwg/url#503.

annevk requested review from domenic and rmisev May 9, 2020 08:47

rmisev approved these changes May 9, 2020

View reviewed changes

annevk mentioned this pull request May 10, 2020

Editorial: use an encode set in the query state #508

Merged

annevk mentioned this pull request May 11, 2020

Adjust registerProtocolHandler() handling whatwg/html#5524

Merged

3 tasks

saschanaz suggested changes May 11, 2020

View reviewed changes

url.bs Outdated Show resolved Hide resolved

url.bs Outdated Show resolved Hide resolved

domenic reviewed May 11, 2020

View reviewed changes

url.bs Outdated Show resolved Hide resolved

annevk added 2 commits May 12, 2020 07:49

Add string UTF-8 percent encode

0d0bfe7

And use it internally. This is also an initial step for #369.

address feedback

bc67da4

annevk force-pushed the annevk/string-utf-8-percent-encode branch from 6932e5b to bc67da4 Compare May 12, 2020 06:57

link type

0c142bb

annevk added a commit to whatwg/fetch that referenced this pull request May 12, 2020

Meta: URL changed how it exports percent-decode

f66370b

See whatwg/url#503.

annevk mentioned this pull request May 12, 2020

Meta: URL changed how it exports percent-decode whatwg/fetch#1026

Merged

annevk added a commit to whatwg/html that referenced this pull request May 12, 2020

Meta: align with URL export changes

a3c5428

It's UTF-8 percent-encode (for code points) and percent-decode (for strings) now. See whatwg/url#503.

annevk mentioned this pull request May 12, 2020

Meta: align with URL export changes whatwg/html#5537

Merged

add note too

d0fc83f

annevk mentioned this pull request May 12, 2020

"UTF-8 percent encode c using the path percent-e..." #296

Closed

annevk merged commit 8e1c9e3 into master May 12, 2020

annevk added a commit to whatwg/html that referenced this pull request May 12, 2020

Meta: align with URL export changes

a11a96c

It's UTF-8 percent-encode (for code points) and percent-decode (for strings) now. See whatwg/url#503.

annevk deleted the annevk/string-utf-8-percent-encode branch May 13, 2020 10:13

annevk added a commit to whatwg/fetch that referenced this pull request May 13, 2020

Meta: URL changed how it exports percent-decode

e057e92

See whatwg/url#503.

annevk mentioned this pull request Sep 28, 2020

Is the first example (an octet 0x7F is percent encoded as "%23") correct? #545

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add string UTF-8 percent encode #503

Add string UTF-8 percent encode #503

annevk commented May 9, 2020 •

edited by pr-preview bot

Loading

annevk commented May 10, 2020

rmisev commented May 10, 2020

masinter commented May 10, 2020

annevk commented May 11, 2020

saschanaz left a comment

domenic commented May 11, 2020

annevk commented May 11, 2020

sideshowbarker commented May 12, 2020 •

edited

Loading

Add string UTF-8 percent encode #503

Add string UTF-8 percent encode #503

Conversation

annevk commented May 9, 2020 • edited by pr-preview bot Loading

annevk commented May 10, 2020

rmisev commented May 10, 2020

masinter commented May 10, 2020

annevk commented May 11, 2020

saschanaz left a comment

Choose a reason for hiding this comment

domenic commented May 11, 2020

annevk commented May 11, 2020

sideshowbarker commented May 12, 2020 • edited Loading

annevk commented May 9, 2020 •

edited by pr-preview bot

Loading

sideshowbarker commented May 12, 2020 •

edited

Loading