
Keys don't permit non-ASCII? #99

Closed
aphillips opened this issue May 19, 2022 · 20 comments
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.

@aphillips

3.2.1.2 key
https://www.w3.org/TR/baggage/#key

A token which identifies a value in the baggage. token is defined in RFC7230, Section 3.2.6. Leading and trailing whitespaces (OWS) are allowed and are not considered to be a part of the key.

RFC7230 (which is quoted as the definition) only allows the (usual) restricted subset of ASCII for keys. It's not clear why this restriction applies to keys here, since the keys are part of the value space of the header. Should the full range of Unicode be allowed?

@aphillips aphillips added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label May 19, 2022
@kalyanaj
Contributor

kalyanaj commented Jun 7, 2022

We discussed this in the working group meeting:

Our rationale on why we don't currently support this is that we wanted to simplify the amount of work implementations have to do when processing these keys. Such implementations could be low-latency scenarios such as load balancers where having a smaller set of supported characters in the keys would simplify processing.

@SergeyKanzhelev @yurishkuro: Would love to get your thoughts as well. Have you heard of any reasons/feedback for supporting the full Unicode range of characters in the baggage keys?

@yurishkuro
Member

I think this is a question that needs to be answered by engineers working in countries where English is not the main language. Do people use non-English alphabets when writing code, defining variables, data schemas? I don't have first-hand experience, but I assume they do, given that Unicode support in languages and databases has been gradually improving over the years. So if I can use Unicode for naming variables, it's not a stretch to extend the reasoning to naming baggage keys similarly.

Processing is indeed a valid concern, but I don't think utf8 introduces any additional overhead there, especially for people who still stick with ASCII key names. If you want to use a Unicode character that takes 3 utf8 bytes, you pay the cost, but nobody else does.

Having said that, I remember having quite a hard time implementing baggage in Python2 with all kinds of weird encoding issues popping up. Things may have improved since then (and I had to deal with keys encoded in the HTTP header name uber-ctx-{key}, which was a different story).

@aphillips
Author

The baggage spec is pretty non-specific about what can appear in baggage. It just says that they propagate:

... user-supplied key-value pairs through a distributed request

This gives a wide range of potential uses. Lots of systems permit actual end-user defined values and these are not always set by engineers. The values permit Unicode: why not the keys? If you have a specific reason why the restrictions of RFC7230 should be applied to the key values, that should probably be stated. If this is intended as a general purpose mechanism, though, then not adding arbitrary restrictions might ultimately be easier.

One call-out: if you do expand the range of permitted characters, you'll probably want to define key matching in the simplest possible way (case-sensitive non-normalizing comparison, e.g. what WHATWG's Infra spec defines as identical to or is).

@yurishkuro
Member

This gives a wide range of potential uses. Lots of systems permit actual end-user defined values and these are not always set by engineers.

In the context of this spec, "user" is an engineer who's using baggage to solve a certain cross-cutting problem. Baggage is not meant to be manipulated by the end-users of the applications. This is why I equate baggage keys to variables in the code - it's effectively just another variable that is magically globally available on the path of a given request.

@basti1302
Contributor

We have discussed this issue once more in the working group and decided to not change the range of allowed characters for keys for the first version of the baggage spec. To be honest, the main motivation for this decision is that we want to get baggage to recommendation stage with the limited resources the working group has available. Therefore, reducing the scope is paramount.

As Yuri pointed out, the users of the specification are engineers, not end users. We believe that ASCII keys do not constitute a significant drawback for users of the specification; or, to phrase it the other way around, extending keys to the full range of Unicode does not provide a significant benefit.

We might revisit this decision in a later version of the specification.

@aphillips
Author

I'm going to push back on this a bit. I don't see how this helps you reduce scope?

@kalyanaj notes that "...we wanted to simplify the amount of work implementations have to do with processing these keys...". I would suggest that, since non-ASCII values are percent-encoded, the only change needed is to apply the encode (or decode) to the key token as well. Since the value portion already has to do this, the code presumably already exists and there is no material effort involved. The processed key is still an ASCII token (it's UTF-8 bytes having been rendered into percent sequences) and parsing the header is materially the same (split on =, perform a trim of OWS).
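Addison's suggestion can be sketched as follows. The function names (`encodeKey`, `decodeKey`, `parseListMember`) are illustrative, not from any SDK; the point is that the wire form of the key stays an ASCII token and the split-on-`=` parsing is untouched:

```javascript
// Illustrative sketch (hypothetical names, not from any SDK): reuse the
// value percent-encoding step for keys, so the on-the-wire key remains
// an ASCII token.
function encodeKey(key) {
  return encodeURIComponent(key); // non-ASCII -> UTF-8 bytes -> %XX
}

function decodeKey(rawKey) {
  return decodeURIComponent(rawKey);
}

// Parsing is materially unchanged: split on "=", trim OWS, and run the
// decode step over the key exactly as it might already be run over the
// value.
function parseListMember(member) {
  const eq = member.indexOf("=");
  return {
    key: decodeKey(member.slice(0, eq).trim()),
    value: decodeURIComponent(member.slice(eq + 1).trim()),
  };
}
```

For example, `parseListMember(encodeKey("ключ") + "=1")` yields `{ key: "ключ", value: "1" }`, even though the bytes on the wire are plain ASCII.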

So making this change would probably not measurably impact your progress to CR--but including the ASCII-only limit would create backwards-compatibility pressure that prevents later expansion of the key namespace. This is a one-way door.

The argument that "users are engineers" may be true, but many systems use data values to generate keys. An ASCII-only namespace can thus become inconvenient when working with non-ASCII datasets (you have to write code to check the data is in the proper subset of ASCII).

@basti1302
Contributor

basti1302 commented Oct 1, 2022

Sorry for not coming back to this earlier.

There are already implementations of the current state of the spec. I think we need to take these into account. I would even argue that they should be the deciding factor. The most prominent ones are the baggage APIs in the various OpenTelemetry SDKs for different runtimes. I took a closer look at a subset of those:

It might be worth checking other existing implementations as well.

So there definitely are existing, popular implementations which are based on the current wording and assume that they do not need to apply any encoding/decoding to keys. What are the consequences of expanding the key namespace now? Let's assume we expand the key namespace to utf-8 and require percent-encoding/-decoding of characters outside of the baggage-octet range, or outside the token range from RFC 7230, Section 3.2.6.

The following scenarios seem relevant:

  • One of the currently existing implementations without key encoding sets the baggage header and a newer implementation that applies decoding to keys receives it. This should just work, as the sender will only send keys from the token range.
  • A newer implementation (one that encodes keys) sends a baggage header containing list-members with non-ASCII characters in keys and one of the currently existing implementations without key decoding receives it. The list member would not be discarded as invalid (since the non-ASCII character has been percent encoded), but the receiver would observe a different key than the sender. When propagating that list-member downstream, the percent-encoded character sequence would be kept, and other (new) implementations that are downstream would decode it and observe the same key as the original sender.

I guess the question we need to answer is: Is the benefit of expanding the key namespace significant enough to introduce issues like this? Or, the other way around: Are the possible issues introduced by this change prohibitive, that is, do they break existing implementations badly enough that we should refrain from this change?

And this is basically the same question we would need to ask when making this change later in the lifecycle of the baggage spec (that is, when introducing it in version 2, etc.). IMO, either it is prohibitive already (in which case we also cannot easily make the change later, as Addison correctly pointed out), or we can make it now without introducing severe issues, but then that argument would also hold for later versions.

Actually, this is all more or less beside the point, since the request was not about allowing percent encoding for the keys (implementations can actually do that already anyway), but about extending the character set itself. My apologies for derailing the conversation.

@kalyanaj kalyanaj assigned dyladan and unassigned basti1302 Nov 22, 2022
@dyladan
Member

dyladan commented Dec 20, 2022

As Bastian mentioned, the request was not about requiring or preventing percent encoding in keys. This can already be done (and is done in some implementations) with the allowed character set, but it is not required, disallowed, encouraged, or discouraged. It is simply possible. The question was about extending the range of allowed characters in a key to the full Unicode range.

The reasons not to do this, as I understand them:

  1. As yuri said, the baggage keys are not meant to be authored and consumed by application end-users but by application developers. The baggage key is more akin to a variable name in a programming language or a column name in a database than it is to an end-user concern. If the application in question has a feature which allows the end-user to interact with baggage, it is up to the application developer to do any encoding of user input or decoding for display.
  2. If keys were allowed to use any unicode character, we would have to define a significantly more complex grammar in order to allow escaping of characters used by the protocol such as , and ;. If we allow almost all unicode and disallow specific characters to solve this problem then we complicate the spec significantly, and run into problems with human readability when considering there are many unicode glyphs which are not commas but look like commas (U+0326 and U+201A for example).
  3. This would also open the spec and implementations to a whole new category of questions around normalization and what constitutes two equal strings.
  4. It would significantly complicate parsing the string into key/value/metadata for reading, validation, and display. Currently it is possible to do this with simple splits.
  5. It also requires the spec and/or implementations to decide how to handle invalid unicode, something more common and more difficult than invalid ASCII.
  6. There are existing implementations already in use which support the character encoding we already have.
  7. Nothing is preventing you from using percent encoding to encode arbitrary values if your application needs it.

For these reasons, I think the extension of the allowed character set in keys is not worth the additional cost and complexity it would bring to the specification and its implementors.
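For reference, the "simple splits" parsing that reason 4 refers to can be sketched roughly like this (`parseBaggage` is an illustrative name; real SDKs also validate each part against the grammar):

```javascript
// Rough sketch of split-based parsing under the current ASCII grammar:
// list members are separated by ",", metadata by ";", and the key/value
// boundary is the first "=".
function parseBaggage(header) {
  return header.split(",").map((member) => {
    const [kv, ...metadata] = member.split(";");
    const eq = kv.indexOf("=");
    return {
      key: kv.slice(0, eq).trim(),   // trim optional whitespace (OWS)
      value: kv.slice(eq + 1).trim(),
      metadata: metadata.map((m) => m.trim()),
    };
  });
}
```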

edit: sorry @aphillips for the long wait for a reply. hope this resolves your concerns.

@dyladan
Member

dyladan commented Jan 31, 2023

@aphillips can you confirm if this sufficiently addresses your concerns? This issue will be closed in 2 weeks at the next meeting (14 February 2023).

@aphillips
Author

Thanks for the ping @dyladan: I had overlooked your reply previously. I have added this to I18N's agenda for this week.

(writing personally) I don't agree with most of your arguments. I (and the I18N WG) might end up supporting a subset of ASCII solution, but I sense that folks are looking for reasons not to support Unicode when it really is not that difficult 😉

the baggage keys are not meant to be authored and consumed by application end-users but by application developers.

I agree that application end-users don't author baggage keys--but application developers aren't necessarily the authors either. Many of these keys are assembled from program internal values or from runtime values. When systems or their components are localized, the names provided can be non-ASCII characters. In other words, we agree that the assignments aren't necessarily or even generally done by -- or intended for -- people, but that doesn't mean that implementations need to smash down identifiers into a subset of ASCII.

we would have to define a significantly more complex grammar in order to allow escaping of characters used by the protocol such as , and ;

Why? Parsing is not that hard and adding most Unicode characters would not change or break how it is done. The , and ; characters are as easy to split on in UTF-8 as they are in ASCII. Restricting only syntax-meaningful characters doesn't require any additional effort. You don't have to allow the syntax characters into the namespace. There are some Unicode ranges that should also be restricted (C1 controls, surrogates, and a few others). I'd suggest using the same EBNF that XML does (found here in our guidelines for spec developers)

... run into problems with human readability when considering there are many unicode glyphs which are not commas but look like commas (U+0326 and U+201A for example).

Visual spoofing can be a problem for sure. Note that machines will not be fooled by this--you won't get "wrong" tokens out, but some might look unusual. There can also be certain kinds of attacks using (for example) bidi controls.

This would also open the spec and implementations to a whole new category of questions around normalization and what constitutes two equal strings.

Not if you follow our guidance. In particular here and here. You don't case normalize now, for example.

It would significantly complicate parsing the string into key/value/metadata for reading, validation, and display. Currently it is possible to do this with simple splits.

There is no reason why the parsing would have to change. It is not more complicated if you maintain the (exact same) reserved set of characters and trim ASCII whitespace as already provided in the spec: in fact the exact same code will work! This is actually one of the core features of Unicode's encodings.
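That core property can be shown directly at the byte level (the header content below is purely illustrative): every byte of a multi-byte UTF-8 character is 0x80 or above, so ASCII delimiters such as "," (0x2C) and "=" (0x3D) can never occur inside an encoded non-ASCII character, and a byte-oriented split keeps working unmodified:

```javascript
// Hypothetical header with non-ASCII keys, encoded as UTF-8 bytes.
const header = "ключ=value,数=2";
const bytes = Buffer.from(header, "utf8");

// Byte-level split on 0x2C (","), exactly what an ASCII-only parser
// does. No byte of "ключ" or "数" can collide with the delimiter.
const members = [];
let start = 0;
for (let i = 0; i < bytes.length; i++) {
  if (bytes[i] === 0x2c) {
    members.push(bytes.subarray(start, i).toString("utf8"));
    start = i + 1;
  }
}
members.push(bytes.subarray(start).toString("utf8"));
```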

It also requires the spec and/or implementations to decide how to handle invalid unicode, something more common and more difficult than invalid ASCII.

I'm not sure what "invalid unicode" is? Do you mean incorrectly encoded UTF-8? There are clear rules on how to handle an invalid byte sequence, e.g. in Encoding. Can you explain?

Nothing is preventing you from using percent encoding to encode arbitrary values if your application needs it.

No, nothing prevents emitting some percent-encoded values, but since it isn't a standard, those values probably look like percent encoded garbage on the receiving side (how can I tell the difference between % and percent-encoded?? Maybe you need to prohibit % to avoid tripping up implementations that decode the whole header before trying to parse??).
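The ambiguity can be made concrete (`tryDecode` is a hypothetical receiver-side helper, not from any SDK): without a convention, the wire form "100%25" could mean either the literal characters "100%25" or an encoded "100%", and a bare "%" makes a naive whole-header decode throw:

```javascript
// Hypothetical helper a receiver might use if it speculatively
// percent-decodes everything.
function tryDecode(raw) {
  try {
    return decodeURIComponent(raw);
  } catch (e) {
    // A "%" not followed by two hex digits throws URIError, so any
    // implementation that decodes the whole header must handle this.
    return raw;
  }
}

tryDecode("100%25"); // "100%" -- but perhaps the sender meant "100%25"
tryDecode("100%");   // "100%" -- decode throws, falls back to the raw string
```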

I tend to view the percent-encoding of HTTP headers as part of the serialization for the transport layer, not necessarily part of the API that generates and reads the values found in headers.

There are existing implementations already in use which support the character encoding we already have

Fair enough. But they were written before the 1.0 standard was published.

@dyladan
Member

dyladan commented Mar 28, 2023

Thanks for your comments. We've decided to readdress this issue based on this feedback. We have decided to make a prototype of the specification allowing UTF-8 encoded Unicode characters in baggage keys. The values are currently also restricted to ASCII, which I will likewise extend to include Unicode in the prototype.

@dyladan
Member

dyladan commented Apr 12, 2023

We discussed this feedback in the working group meeting, but I'm going to transcribe our feedback here for visibility. In general, we decided to re-evaluate our ideas from the perspective of assuming Unicode and trying to decide whether each argument was a reason not to use Unicode.

Thanks for the ping @dyladan: I had overlooked your reply previously. I have added this to I18N's agenda for this week.

(writing personally) I don't agree with most of your arguments. I (and the I18N WG) might end up supporting a subset of ASCII solution, but I sense that folks are looking for reasons not to support Unicode when it really is not that difficult 😉

the baggage keys are not meant to be authored and consumed by application end-users but by application developers.

I agree that application end-users don't author baggage keys--but application developers aren't necessarily the authors either. Many of these keys are assembled from program internal values or from runtime values. When systems or their components are localized, the names provided can be non-ASCII characters. In other words, we agree that the assignments aren't necessarily or even generally done by -- or intended for -- people, but that doesn't mean that implementations need to smash down identifiers into a subset of ASCII.

Approaching this feedback from the stated perspective above, this is not a reason not to use Unicode. Application developers would find Unicode at least as useful as ASCII when authoring baggage keys. Indeed, many application developers are not English speakers, and this may improve accessibility in that regard. Most modern programming languages are also moving in this direction.

we would have to define a significantly more complex grammar in order to allow escaping of characters used by the protocol such as , and ;

Why? Parsing is not that hard and adding most Unicode characters would not change or break how it is done. The , and ; characters are as easy to split on in UTF-8 as they are in ASCII. Restricting only syntax-meaningful characters doesn't require any additional effort. You don't have to allow the syntax characters into the namespace. There are some Unicode ranges that should also be restricted (C1 controls, surrogates, and a few others). I'd suggest using the same EBNF that XML does (found here in our guidelines for spec developers)

For baggage keys we decided to restrict the allowed character set to the same character set used in ECMAScript identifiers. This character set already excludes everything meaningful to our specification.
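That restriction can be approximated with ECMAScript's own Unicode property escapes. This regex is a sketch of the IdentifierName production, not necessarily the exact grammar the spec will adopt:

```javascript
// Approximation of ECMAScript IdentifierName: an ID_Start character
// (or "$"/"_") followed by ID_Continue characters (plus "$", ZWNJ, ZWJ).
// None of the delimiters meaningful to baggage (",", ";", "=") are in
// these categories.
const identifierName = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u;

identifierName.test("userId"); // ASCII keys keep working
identifierName.test("ключ");   // non-ASCII letters are allowed
identifierName.test("a,b");    // protocol delimiters remain excluded
```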

... run into problems with human readability when considering there are many unicode glyphs which are not commas but look like commas (U+0326 and U+201A for example).

Visual spoofing can be a problem for sure. Note that machines will not be fooled by this--you won't get "wrong" tokens out, but some might look unusual. There can also be certain kinds of attacks using (for example) bidi controls.

We decided that the visual spoofing concern is not a high risk. As Addison pointed out, the machine will not be fooled and this is not something like a URL where it is important for a human to be able to read it accurately.

This would also open the spec and implementations to a whole new category of questions around normalization and what constitutes two equal strings.

Not if you follow our guidance. In particular here and here. You don't case normalize now, for example.

We decided simply not normalizing is the best way forward. Two strings which appear to a human to be identical may not be if they are formed with different characters.

It would significantly complicate parsing the string into key/value/metadata for reading, validation, and display. Currently it is possible to do this with simple splits.

There is no reason why the parsing would have to change. It is not more complicated if you maintain the (exact same) reserved set of characters and trim ASCII whitespace as already provided in the spec: in fact the exact same code will work! This is actually one of the core features of Unicode's encodings.

Given the character set used, this is not a concern.

It also requires the spec and/or implementations to decide how to handle invalid unicode, something more common and more difficult than invalid ASCII.

I'm not sure what "invalid unicode" is? Do you mean incorrectly encoded UTF-8? There are clear rules on how to handle an invalid byte sequence, e.g. in Encoding. Can you explain?

Yes incorrectly encoded UTF-8 is what was meant. Thank you for the link.
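For what it's worth, those rules are directly observable through the WHATWG Encoding API: a UTF-8 decoder either substitutes U+FFFD for a malformed sequence or, in fatal mode, rejects the input outright.

```javascript
// "a", an invalid UTF-8 byte, "b". 0xFF is never a valid UTF-8 byte.
const bad = new Uint8Array([0x61, 0xff, 0x62]);

// Default mode: the malformed byte becomes U+FFFD (replacement character).
new TextDecoder("utf-8").decode(bad); // "a\uFFFDb"

// Fatal mode: decoding malformed input throws instead.
// new TextDecoder("utf-8", { fatal: true }).decode(bad); // throws TypeError
```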

Nothing is preventing you from using percent encoding to encode arbitrary values if your application needs it.

No, nothing prevents emitting some percent-encoded values, but since it isn't a standard, those values probably look like percent encoded garbage on the receiving side (how can I tell the difference between % and percent-encoded?? Maybe you need to prohibit % to avoid tripping up implementations that decode the whole header before trying to parse??).

I tend to view the percent-encoding of HTTP headers as part of the serialization for the transport layer, not necessarily part of the API that generates and reads the values found in headers.

Not being able to tell the difference between a percent-encoded string and a string which happens to contain a percent is indeed an issue. We've sidestepped this a little by letting implementations decide what to do, but a fuller character set would extend the useful characters without needing to resort to tricks.

There are existing implementations already in use which support the character encoding we already have

Fair enough. But they were written before the 1.0 standard was published.

Fair enough. It's also probably a fairly easy migration.

@dyladan
Member

dyladan commented Jan 30, 2024

As a group we have considered this and made prototypes to test the viability and performance of this change, as well as looked at existing implementations and spoken with their implementors. Many of these implementations are already in widespread use and are considered stable by their authors. We have decided that, because this would be a significant breaking change for existing implementations, we are not going to implement this change. It provides a nice convenience for some users, but does not enable any use cases which are currently impossible, making it difficult to justify breaking existing implementations.

/cc @aphillips

@aphillips
Author

@dyladan Thanks for the update. I have added this issue to the I18N 2024-02-01 teleconference agenda.

@aphillips
Author

The Internationalization WG discussed this in our call of 2024-02-01 and we're not yet at the point of accepting your proposed resolution.

What is unclear to us is what "widespread use" means with regard to existing implementations. If this is code that was authored (let's say) ten years ago and you're just now documenting a de-facto standard, that suggests a certain level of flexibility on I18N's part regarding legacy character encoding restrictions. If, by contrast, this is relatively new work and implementations were created without regard for instability stemming from the standards creation process, it's a lot harder for us to accept restrictions of this sort. Our reading, which might be unfair, is that you're "discounting" the importance of using Unicode.

I'll reiterate the process here, so that you don't have to look it up. If you feel this issue is addressed and don't wish to discuss it further, you can close this issue. Because our matching issue is not closed, you will not be able to complete your CR transition without explaining this issue (and I'll be contacted about it as the I18N chair). I don't believe that I18N would raise a formal objection at that point.

If you do want to discuss it (and I recommend that we close it out), might I suggest that we join a teleconference with your group or schedule a side conversation? I think we're effectively saying the same things to each other in each iteration of the thread above. We can better (and quickly) reach the right outcome if we use a "higher bandwidth communications mechanism" 😄

@dyladan
Member

dyladan commented Feb 6, 2024

I agree it would be easier to discuss in person. Would it be easier for us to join your regular meeting or for you to join ours next week Tuesday?

@aphillips
Author

@dyladan Thanks. Either works for us. Our meetings are Thursdays at 1500 UTC. However, it might be easier to have a quorum from your group on your own call. Let me know your preference!

@kalyanaj
Contributor

kalyanaj commented Feb 7, 2024

Thanks @aphillips for the flexibility. Our group's next call is Feb 13 Tuesday noon Pacific Time. We would love to have you join us in that meeting - I will update the invite. Please let us know if that works for you.

@aphillips
Author

Thank you: that works for me and may work for other WG members. I have time constraints that day, but suspect this can be a quick conversation. Talk to you then!

@kalyanaj
Contributor

kalyanaj commented Feb 14, 2024

Thanks @aphillips for joining our WG meeting yesterday. Per our discussion, I have filed #128 to track the changes to document the reasoning behind the supported character set (thanks to @dyladan for volunteering to do that change and create a PR).

Key notes from the meeting:

  • I18N would ideally have liked to see this change, but will not object to the transition, given the technical reason of breaking existing implementations.
  • We will be sure to include the i18n review early in the cycle for future specs.
  • Addison to take this summary to the i18n working group & write a summary of the issue.
  • We can close this issue.
