Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why should keys be restricted to lowercase ASCII? #508

Closed
aphillips opened this issue Oct 23, 2022 · 7 comments
Closed

Why should keys be restricted to lowercase ASCII? #508

aphillips opened this issue Oct 23, 2022 · 7 comments
Assignees
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.

Comments

@aphillips
Copy link

3.3.2.2.1 Key
https://www.w3.org/TR/trace-context-2/#key

A key MUST begin with a lowercase letter or a digit and contain up to 256 characters including lowercase letters (a-z), digits (0-9), underscores (_), dashes (-), asterisks (*), forward slashes (/), and at signs (@).

The definitions here should be more rigorously laid out. The key is part of the tracestate header's value and is described as "...an identifier that describes the vendor". The identifier is restricted to lowercase ASCII letters, digits, and the punctuation shown above in the quote.

Allowing keys/values that are non-ASCII is one possibility here, since vendors presumably can be named anything. Restricting the value to an ASCII token is understandable because of the need for encoding efficiency, but the text should be clearer about the reasoning behind the ASCII restriction for the key.

Note that key names like this are sometimes machine generated and a restriction to ASCII can be irksome when non-ASCII Unicode characters might be present in the data. The pre-computed portions probably should be well-known (ASCII !) tokens, but it's unclear why the names are otherwise restricted.

@aphillips aphillips added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Oct 23, 2022
@kalyanaj
Copy link
Contributor

Thanks for the review and for filing this issue. We discussed this in the Distributed Tracing working group meeting today and below captures the summary from that discussion.

Since this is Level 2 of the spec, the primary reason is back compatibility with existing Level 1 standard.

Some of the original reasons (from Level 1) for restricting the keys and values:

  1. Processing efficiency: For example, an implementation could split on commas to get the list of entries.
  2. If we extended the allowed character list, we would have had to have specific escape characters or disallowed characters (to meet the processing efficiency goal) which would have added more complexity to the specification.

We will update the specification to include the above rationale.

At this point, the main reason to not revisit this decision is backwards compatibility since Level 1 is already a standard.

@aphillips
Copy link
Author

Some of the original reasons (from Level 1) for restricting the keys and values:

  1. Processing efficiency: For example, an implementation could split on commas to get the list of entries.
  2. If we extended the allowed character list, we would have had to have specific escape characters or disallowed characters (to meet the processing efficiency goal) which would have added more complexity to the specification.

I think these reasons are questionable.

The ASCII comma is still a comma. Disallowing a small list of characters (comma, space, etc.) because they are syntax-meaningful would allow implementations to split on commas. There is no need to introduce escapes unless you want to allow restricted characters (comma, space) into the namespace.

At this point, the main reason to not revisit this decision is backwards compatibility since Level 1 is already a standard.

I can't find ANY record of I18N having reviewed trace-context 1. In fact, I see your transreq (w3c/transitions#133) does not mention us--only security and privacy. As you might imagine, I am unhappy to see you relying on back compat if we were not afforded a look at your first edition.

@plehegar was there a process failure here or did I overlook a review somehow? @r12a for visibility.

@basti1302
Copy link
Contributor

Restricting the value to an ASCII token is understandable because of the need for encoding efficiency, but the text should be clearer about the reasoning behind the ASCII restriction for the key.

#515 is an attempt to improve that partial aspect by providing the reason to restrict the allowed characters.

You have a point that the original design choice for this restriction could have been different. I also understand your unhappiness of level 1 not having gone through an i18n review (if that is what happened). I cannot speak on @plehegar behalf so I will let him comment on that.

But even if there was a process failure for level 1, the fact remains that there are implementations out there which we would break if we changed this now. We are adamant that we cannot break existing implementations by allowing more characters in tracestate in level 2.

@basti1302
Copy link
Contributor

I wonder if w3c/i18n-request#108 was meant to be the i18n review request for trace context? It fails to mention which spec it is for. But I'm guessing wildly here, nevermind.

@aphillips
Copy link
Author

aphillips commented Dec 2, 2022

@basti1302 I don't know how the process failed us here. I think I see a gap in our tooling and I'll take it up with @plehegar and company to help prevent repeats.

Regarding this spec/issue, I understand your commitment to backwards compatibility: it's important.

Regarding #515, that's the right intention, but the logic of that statement is flawed. It says:

This restriction ensures that the tracestate header can be parsed efficiently into individual list members by splitting the header value at each comma, and splitting the list members an implementation is interested in at the equals sign.

A string consisting of Unicode code points or a bytestring consisting of UTF-8 can still be parsed on commas and equal signs, since those are still characters--with the same byte values even (in the case of UTF-8). Consider the string: мзкез=日本語,文字化け,éèçéèç,etc, whose UTF-8 byte values are:

D0 BC D0 B7 D0 BA D0 B5 D0 B7 3D E6 97 A5 E6 9C AC E8 AA 9E 2C E6 96 87 E5 AD 97 E5 8C 96 E3 81 91 2C C3 A9 C3 A8 C3 A7 C3 A9 C3 A8 C3 A7 2C 65 74 63

Notice that 0x3D (=) and 0x2C (,) are still present...

I would tend to say instead:

This restriction ensures that the tracestate header is encoded in a compact and efficient manner and that values can be processed by even very old systems that do not support non-ASCII characters.

( I added a review just now to #515 that has a different wording suggestion. Use that one)

I suppose we could introduce a way to smuggle non-ASCII characters into the values (Punycode [RFC3492] looks like about the only commonly available option compatible with the character set you have), but given the nature and probable uses of tracestate, I think you might as well skip it for now.

For future reference: restricted-to-ASCII mechanisms have a long history of coming back to bite standardizers/implementers. The existence of Punycode to support IDNs is just one example of inventing convoluted schemes for getting Unicode code points into protocols or formats after the fact. But v1's day is done. If you fix your note above I think you can close this issue.

@basti1302
Copy link
Contributor

Closing this as satisfied as discussed in #515 (comment)

@aphillips
Copy link
Author

See discussion in PR. I'm satisfied by this outcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.
Projects
None yet
Development

No branches or pull requests

3 participants