Context
Smithy currently supports defining blob values for traits as base64-encoded strings in both the IDL and the JSON AST. Per both the 1.0 and 2.0 specs for trait node values, a blob value is "A string value that is base64 encoded."
Despite the spec being clear on the intended representation, existing trait validators and related code generators do not always follow it, because most blob values are better expressed as human-readable strings.
For example:
- Some trait validators, like the DefaultTraitValidator, only warn on invalid values.
- Some codegen implementations, like smithy-typescript's HttpProtocolTestGenerator, explicitly parse the blob value as an array of UTF-16 bytes.
Proposal
To support use cases where a human-readable blob value better expresses the model author's intent, we will support two new flavors of strings, referred to as byte string and byte text block. Both are written in the IDL as human-readable text and are automatically converted to the equivalent base64 value in the JSON AST. They are distinguished by a b prefix placed immediately before the opening double quotes of the standard quoted text or text block token, respectively. Parsing, whitespace handling, and escape sequences all work the same way as for the standard tokens.
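The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not Smithy's implementation: it assumes the string content has already been unescaped by the parser and that UTF-8 is the chosen encoding (which this proposal still lists as an open question). The helper name `byte_string_to_ast_value` is hypothetical.

```python
import base64


def byte_string_to_ast_value(text: str) -> str:
    """Encode already-unescaped IDL string content as UTF-8, then base64.

    Assumption: UTF-8 is the encoding chosen for byte strings.
    """
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# A byte string such as b"{}" would carry the content "{}".
inline_value = byte_string_to_ast_value("{}")

# Assumed text block content: a small JSON object indented by four spaces,
# ending with a final newline before the closing delimiter.
block_content = '{\n    "foo": "bar"\n}\n'
block_value = byte_string_to_ast_value(block_content)

print(inline_value)
print(block_value)
```

Whether the final newline of a text block is part of the content changes the last base64 group, so it is worth checking when comparing a hand-encoded value against a generated one.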
Usage will look like the following:
$version: "2.0"
namespace smithy.example
@default("e30=")
blob previousString
@default(b"{}")
blob newString
@default("ewogICAgImZvbyI6ICJiYXIiCn0=")
blob previousTextBlock
@default(
    b"""
    {
        "foo": "bar"
    }
    """
)
blob newTextBlock
This would translate into the following JSON AST:
{
    "smithy": "2.0",
    "shapes": {
        "smithy.example#newString": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "e30="
            }
        },
        "smithy.example#newTextBlock": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "ewogICAgImZvbyI6ICJiYXIiCn0K"
            }
        },
        "smithy.example#previousString": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "e30="
            }
        },
        "smithy.example#previousTextBlock": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "ewogICAgImZvbyI6ICJiYXIiCn0="
            }
        }
    }
}
Open Questions
- How should a byte string be converted into a byte array in order to be base64 encoded?
  - Per the spec, IDL files are UTF-8 encoded, and strings support Unicode characters, basic ASCII escape sequences, and a Unicode code-point escape sequence.
  - This feature has significant parallels with Python's bytes literals, which only support ASCII characters, with hex escape sequences to allow arbitrary byte sequences. Non-ASCII characters are rejected in those literals, and the Unicode escape sequences are not recognized.
  - In the end, we must choose to support either arbitrary Unicode text with a specific encoding (presumably UTF-8) or arbitrary byte sequences (by supporting hex-based escape sequences and not supporting Unicode characters or Unicode escape sequences). The two options are mutually exclusive for a given string flavor.
  - My current inclination is that we should support Unicode characters by encoding byte strings as UTF-8 bytes, given the prevalence of the UTF-8 encoding in modern computing. Arbitrary byte sequences can already be expressed by base64 encoding standard strings, and supporting them in byte strings is unlikely to make them any more understandable.
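The trade-off between the two options can be illustrated with Python, whose bytes literals take the "arbitrary bytes" route. This is only a sketch of the two behaviors side by side; the helper names `is_valid_utf8` and `bytes_literal_compiles` are made up for the illustration and say nothing about how Smithy would implement either option.

```python
import base64

# Option A: Unicode text, encoded as UTF-8 before base64 encoding.
text = "caf\u00e9"  # "café", written with a Unicode code-point escape
utf8_bytes = text.encode("utf-8")
option_a = base64.b64encode(utf8_bytes).decode("ascii")

# Option B: Python-style bytes literal, where hex escapes express
# arbitrary bytes that need not form valid text in any encoding.
raw_bytes = b"\xff\x00"
option_b = base64.b64encode(raw_bytes).decode("ascii")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def bytes_literal_compiles(source: str) -> bool:
    """Return True if Python accepts the given source as an expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(option_a, is_valid_utf8(utf8_bytes))
print(option_b, is_valid_utf8(raw_bytes))
# Python refuses a non-ASCII character inside a bytes literal.
print(bytes_literal_compiles('b"caf\u00e9"'))
```

Under option A every byte string is guaranteed to be valid UTF-8 text, while under option B values such as `b"\xff\x00"` become expressible at the cost of rejecting non-ASCII characters in the literal.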
Changes Required
- Two new IdlTokens must be defined for the new string representations
- The Tokenizer implementations must be updated to recognize the new string representations
- The IdlNodeParser and IdlTraitParser must be updated to properly re-encode the parsed tokens as their equivalent base64 value
- Two new TreeTypes must be defined for the new string representations
- The FormatVisitor, BracketFormatter, and CapturedToken classes must be updated to properly handle the new IdlTokens and TreeTypes
- Documentation must be updated to explain the behavior of the new string flavors
- The language server and any other IDL syntax-supporting projects must be updated
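As a rough illustration of the tokenizer change, the sketch below classifies the start of a string token by its prefix. It is written in Python for brevity and does not mirror Smithy's actual Tokenizer; the token names and the `classify_string_start` function are hypothetical.

```python
import re

# Hypothetical token kinds; the real IdlToken constants may be named differently.
BYTE_TEXT_BLOCK = "BYTE_TEXT_BLOCK"
TEXT_BLOCK = "TEXT_BLOCK"
BYTE_STRING = "BYTE_STRING"
STRING = "STRING"

# Order matters: text block openers must be tried before plain quotes,
# otherwise '"""' would be read as an empty string followed by a quote.
_PATTERNS = [
    (BYTE_TEXT_BLOCK, re.compile(r'b"""\n')),
    (TEXT_BLOCK, re.compile(r'"""\n')),
    (BYTE_STRING, re.compile(r'b"')),
    (STRING, re.compile(r'"')),
]


def classify_string_start(source: str, pos: int = 0):
    """Return the token kind that starts at pos, or None if no string starts there.

    Assumption: the caller only invokes this at a token boundary, so a 'b'
    that ends an identifier is never mistaken for a byte string prefix.
    """
    for kind, pattern in _PATTERNS:
        if pattern.match(source, pos):
            return kind
    return None


print(classify_string_start('b"{}"'))
print(classify_string_start('b"""\n{}\n"""'))
print(classify_string_start('"e30="'))
```

A formatter or language server would need the same distinction so that the b prefix survives a round trip through parsing and re-serialization.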
Additional Commentary
This idea originated from a comment made by @JordonPhillips as part of the conversation in #2821 about the current inconsistency in blob value literal behavior.
I plan to cut a draft PR shortly with a working proof of concept of this feature.