Feature request: Support for human-readable blob values in the IDL #2852

@brandondahler

Context

Smithy currently supports defining blob values for traits as base64 encoded strings in both the IDL and JSON AST. Both the 1.0 and 2.0 specs for trait node values describe a blob value as "A string value that is base64 encoded."

Although the spec is clear on the intended representation, existing trait validators and related code generators do not always follow it, because most uses of blob values are better expressed as human-readable strings.

For example:

Proposal

To support use cases where a human-readable blob value better expresses the model author's intent, we will support two new flavors of strings, referred to as byte string and byte text block. Both are written in the IDL as human-readable text and are automatically converted to the equivalent base64 value in the JSON AST. They are distinguished by a b prefix placed directly before the opening double quotes of the standard quoted text or text block token, respectively. Parsing, whitespace, and escape character semantics all work the same way as for the standard tokens.

Usage will look like the following:

$version: "2.0"

namespace smithy.example

@default("e30=")
blob previousString

@default(b"{}")
blob newString

@default("ewogICAgImZvbyI6ICJiYXIiCn0=")
blob previousTextBlock

@default(
    b"""
    {
        "foo": "bar"
    }
    """
)
blob newTextBlock

This would translate into the following JSON AST:

{
    "smithy": "2.0",
    "shapes": {
        "smithy.example#newString": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "e30="
            }
        },
        "smithy.example#newTextBlock": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "ewogICAgImZvbyI6ICJiYXIiCn0K"
            }
        },
        "smithy.example#previousString": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "e30="
            }
        },
        "smithy.example#previousTextBlock": {
            "type": "blob",
            "traits": {
                "smithy.api#default": "ewogICAgImZvbyI6ICJiYXIiCn0="
            }
        }
    }
}
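
The mapping from the IDL examples to the AST values above can be checked with a short Python sketch. It assumes the parsed string is encoded as UTF-8 before being base64 encoded, which is the option favored under Open Questions rather than a settled decision. `encode_byte_string` is a hypothetical helper name, not part of Smithy. The text block content is written out by hand to match what the AST value for newTextBlock decodes to, which ends with a newline, while the hand-written base64 for previousTextBlock decodes to the same text without it.

```python
import base64


def encode_byte_string(text: str) -> str:
    """Hypothetical helper: UTF-8 encode the parsed string, then base64 it."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# b"{}" from the newString example.
new_string = encode_byte_string("{}")

# Content implied by the AST value shown for newTextBlock
# (note the trailing newline after the closing brace).
new_text_block = encode_byte_string('{\n    "foo": "bar"\n}\n')

# The hand-written base64 for previousTextBlock decodes without that newline.
previous_text_block = base64.b64decode("ewogICAgImZvbyI6ICJiYXIiCn0=").decode("utf-8")

print(new_string)
print(new_text_block)
print(repr(previous_text_block))
```

The one-character difference between the two text block values (a trailing K versus =) comes entirely from that final newline.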

Open Questions

  • How should a byte string be converted into a byte array in order to be base64 encoded?
    • Per the spec, IDL files are UTF-8 encoded and strings support Unicode characters with basic ASCII escape sequences and a Unicode code-point escape sequence.
    • This feature closely parallels Python's bytes literals, which only allow ASCII characters and use hex escape sequences to express arbitrary byte sequences. Non-ASCII characters are rejected in those literals, and the Unicode escape sequences are not recognized.
    • In the end, we must choose between supporting arbitrary Unicode text with a specific encoding (presumably UTF-8) and supporting arbitrary byte sequences (by supporting hex-based escape sequences and not supporting Unicode characters or Unicode escape sequences). The two options are mutually exclusive for a given string flavor.
    • My current inclination is that we should support Unicode characters by encoding the byte strings into UTF-8 bytes, given the prevalence of the UTF-8 encoding in modern computing. The expression of arbitrary byte sequences is already supported by base64 encoding standard strings and supporting them in byte strings is unlikely to make them any more understandable.
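
The trade-off in this list can be made concrete with a small Python sketch. It compares the two interpretations for a single character, é (U+00E9): Unicode text encoded as UTF-8, versus a raw byte written with a hex escape in the style of Python's bytes literals. The `b64` helper and variable names are only illustrative; none of this is an existing Smithy API.

```python
import base64


def b64(data: bytes) -> str:
    """Base64 encode bytes and return the result as ASCII text."""
    return base64.b64encode(data).decode("ascii")


# Option 1: Unicode text encoded as UTF-8. U+00E9 becomes two bytes.
utf8_bytes = "\u00e9".encode("utf-8")

# Option 2: arbitrary bytes via hex escapes, as in Python bytes literals.
# Here \xe9 is the single byte 0xE9, which is not valid UTF-8 on its own.
raw_bytes = b"\xe9"

print(utf8_bytes, b64(utf8_bytes))
print(raw_bytes, b64(raw_bytes))

# Python rejects a non-ASCII character inside a bytes literal at compile time.
try:
    compile('b"\u00e9"', "<example>", "eval")
    non_ascii_allowed = True
except SyntaxError:
    non_ascii_allowed = False

print(non_ascii_allowed)
```

The same visible character produces different base64 values under the two options, which is why a single string flavor cannot offer both.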

Changes Required

  • Two new IdlTokens must be defined for the new string representations
  • The Tokenizer implementations must be updated to recognize the new string representations
  • The IdlNodeParser and IdlTraitParser must be updated to properly re-encode the parsed tokens as their equivalent base64 value
  • Two new TreeTypes must be defined for the new string representations
  • The FormatVisitor, BracketFormatter, and CapturedToken classes must be updated to properly handle the new IdlTokens and TreeTypes
  • Documentation must be updated to explain the behavior of the new string flavors
  • The language server and any other IDL syntax-supporting projects must be updated
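
As a rough illustration of the tokenizer change, the sketch below shows the lookahead needed to tell a byte string apart from an identifier that merely starts with b (such as blob). It is written in Python for brevity rather than in Smithy's Java, and the token names and the `classify_string_start` function are hypothetical, not names taken from the Smithy code base.

```python
def classify_string_start(source: str, pos: int = 0) -> str:
    """Classify the token beginning at pos by inspecting up to four characters.

    Token names are hypothetical placeholders for illustration only.
    """
    rest = source[pos:]
    # Check the longest prefixes first so b""" is not mistaken for b".
    if rest.startswith('b"""'):
        return "BYTE_TEXT_BLOCK"
    if rest.startswith('b"'):
        return "BYTE_STRING"
    if rest.startswith('"""'):
        return "TEXT_BLOCK"
    if rest.startswith('"'):
        return "STRING"
    # Anything else, including identifiers such as blob, is handled elsewhere.
    return "OTHER"


print(classify_string_start('b"{}"'))
print(classify_string_start('b"""\n{}\n"""'))
print(classify_string_start("blob newString"))
print(classify_string_start('@default(b"{}")', 9))
```

Because the b prefix is only significant when a double quote follows immediately, existing identifiers beginning with b keep their current meaning in this sketch.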

Additional Commentary

This idea originated from a comment made by @JordonPhillips in the conversation on #2821 about the current inconsistency in how blob value literals behave.

I plan to cut a draft PR shortly with a working proof of concept of this feature.

Labels: idl-2.1 (Features to be included in IDL version 2.1)
