Add message parse mode (code vs text) design doc #474

Merged Oct 25, 2023 (25 commits)

Conversation

Collaborator

@eemeli eemeli commented Sep 13, 2023

Submitting this initially as a draft, as it does not yet propose a solution, only alternatives.

The intent here is to document the choice we make, and to provide the basis for explaining it to others. This is also in response to comments made by @LeaVerou in discussions at W3C TPAC.

The use of the terms "most", "many", "sometimes", "some", and "rarely" in the use cases is intentional, and draws on my experience with localizable messages. If it's useful, I can dig up actual statistics from the corpus of Fluent messages at Mozilla, which as a format is relatively close to MF2. Are there other sets of comparable existing messages from which we could get any such data?

@LeaVerou

Nice start, thanks! Could you add examples of the current syntax so I can share it with a couple people? There are currently only examples of the alternative syntax.

@eemeli
Collaborator Author

eemeli commented Sep 13, 2023

@LeaVerou The "Start in code, encapsulate text" proposal is effectively our current syntax, amended by adding `input`, a change that we've found tentative agreement on this week.

I have presented it here initially as an "alternative", so that its selection as our choice going forward may be made based on its merits, rather than pre-existing conditions.

@aphillips
Member

@LeaVerou I would actually give us some time before pushing into this too far. A lot has happened in our F2F this week. I suspect that there will be a significant modification of both the syntax (for reasons unrelated to text mode as well as the discussion about authoring considerations)... say give it a week (i.e. by 2023-09-19)?

I do thank you for sparking what was a simmering issue in our group. I think we'd like to queue up our thinking and some backing material so that it's accessible. We super value fresh eyes, since it is easy to get into groupthink.

Limiting the range of characters that need to be escaped in plain text is important.
Following past precedent,
this design doc will only consider encapsulation styles which
start with `{` and end with `}`.
Member

The "characters that need to be escaped in plain text" refers to the text portion of the syntax. We should be able to consider whatever natural-to-use characters make sense in the non-textual parts of the syntax, if they would represent a significant improvement in usability. That is, this constraint applies mainly to the pattern portion of the syntax and might not apply outside that.

@LeaVerou

> @LeaVerou I would actually give us some time before pushing into this too far. A lot has happened in our F2F this week. I suspect that there will be a significant modification of both the syntax (for reasons unrelated to text mode as well as the discussion about authoring considerations)... say give it a week (i.e. by 2023-09-19)?
>
> I do thank you for sparking what was a simmering issue in our group. I think we'd like to queue up our thinking and some backing material so that it's accessible. We super value fresh eyes, since it is easy to get into groupthink.

Gotcha. FWIW our discussion sparked some ideas about new TAG design principles to guide the design of text-based syntaxes, which would hopefully be more broadly useful. I can ping you once there's more on that front, in case they also help this syntax redoing.

@eemeli
Collaborator Author

eemeli commented Sep 14, 2023

I think we’ve identified that there are messages that ought to have a maximally simple representation (e.g. “Hello world”) and that beyond some increase in complexity a more complex structure is needed (e.g. “You have 3 messages”). I think we’re mostly agreed on the simple representation (e.g. Hello {$place}), but have not yet found agreement on the complex representation (is text delimited vs. not).

As I see it, there are two relevant viewpoints from which to look at the complex message syntax:

  1. Should we look like a resource format, or a template format?
  2. Can/should we give special treatment to leading/trailing whitespace compared to other localized content?

Regarding the first viewpoint, I think that’s tantamount to asking if we’d like users to see MF2 as having one or more internal layers. Our current syntax has two such layers, where to start we’re explicitly in “code” and then we may enter “text”. This question is exacerbated by adopting “text” for simple messages; by dropping the {} from around them, we could end up with “simple”, “code”, and “text” as effectively separate conceptual layers within a single message.

My preference would be to work towards making the syntax feel more like a template format, and to have “simple” and “text” conceptually similar, as we have with our current syntax. Are there some explicit benefits from having a simple vs. text separation that I’m not aware of, other than whitespace representation?

Regarding the second viewpoint, I don’t think I have sufficient data to understand how messages with intentional and necessary leading or trailing whitespace are formed. My presumption and our discussion yesterday suggest that in a majority of cases this is a bug that’s easy for developers to make, so trimming all of it could prevent a decent amount of surprise. But when is that not the case? What do these messages look like, and most importantly, how are translators made aware that the spaces are required and should be retained?

I think we need to answer these questions to be able to explain whatever format we end up with for complex messages.

@aphillips
Member

@eemeli I think these are some key insights. Thanks for this.

I think we made a choice (we can reconsider it if necessary, although I don't think we need to) that MF2 is not a resource format. I'm not sure if "templating language" is the right characterization, but let's go with it for now. The reason it isn't a resource format is the same reason that we have all-on-a-line as something we support. The resource format wraps around the MF syntax and, necessarily, embeds us.

This is where many of the considerations surrounding escaping come from. We want users to be able to author and edit our format when it is hosted in a variety of tools/formats... and with the explicit recognition that they will edit messages in place using whatever passes for a text editor--or a resource editor that understands the wrapper format (and is trying to syntax highlight that, rather than trying to highlight our format nested inside it). This puts some extra pressure on us, witness | vs. ", in which we are trying to ensure that people editing in many formats are not faced with the "how many backslashes until I get what I want" issue.

> Regarding the second viewpoint, I don’t think I have sufficient data to understand how messages with intentional and necessary leading or trailing whitespace are formed. My presumption and our discussion yesterday suggest that in a majority of cases this is a bug that’s easy for developers to make, so trimming all of it could prevent a decent amount of surprise. But when is that not the case? What do these messages look like, and most importantly, how are translators made aware that the spaces are required and should be retained?

The thing I would want to convey here is that the world is a big place. There are many applications, runtime environments, UI frameworks, etc. Developers are trying to meet various different needs/demands at the same time and most have only marginal familiarity with I18N. We need to enable these folks to get work done because they are our primary customers--they will vote with their feet if we don't make their lives better and that is the main thing that will ensure the success or failure of MF2.

I am not saying we should bend over backwards to let developers write bad strings, notably "starts with space so I can concatenate". But there exist a variety of places where developers want to control whitespace and each is a special snowflake of developer need--some want tabs; some want some newlines; some want to mimic horizontal spacing; some are used for emphasis or to provide space for a visual element inserted as an overlay. But, frankly, these are all corner cases and the primary case is: I want to write an I18N bug.

When evaluating an application, one of the first things I look for is resource strings that start with a space, contain only a space, or contain only a period (or other punctuation)--these signal "string math".
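The signals listed above can be checked mechanically. Here is a minimal sketch in Python, assuming a hypothetical helper named `string_math_signals` and generalizing "space" to any Unicode whitespace; both are assumptions made for illustration, not part of any existing tool.

```python
import unicodedata


def string_math_signals(resource: str) -> list[str]:
    """Return the "string math" warning signs found in one resource string."""
    signals = []
    core = resource.strip()
    if resource and not core:
        signals.append("only whitespace")
    elif resource[:1].isspace():
        signals.append("starts with whitespace")
    # Unicode general categories that start with "P" are punctuation.
    if core and all(unicodedata.category(ch).startswith("P") for ch in core):
        signals.append("only punctuation")
    return signals


for sample in [" and", " ", ".", "Hello {$user}"]:
    print(repr(sample), string_math_signals(sample))
```

A check like this only flags candidates for review; it cannot tell an intentional leading space from a concatenation bug.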

So with regard to spacing, my line of questioning last night was intended to sound out how to enable those folks to get their job done with a minimum of fuss without over-encouraging the use of "the bug factory". My current thinking is that there are three proposals in play:

1. All textual space is meaningful

Any whitespace that appears outside "code mode" has meaning and must be preserved. That means, in our demo messages, that this message:

#local $foo = {42 :number}
Hello {$foo}

Produces the message:

\nHello 42

This is unsurprising to a very rigorous developer, but probably a surprise to newcomers and translators. Note that most developers see this in the resource format, btw, as some flavor of:

"myHelloString" = "#local $foo = {42 :number}\nHello {$foo}"; // newline is text is slightly less surprising here

2. Exterior whitespace has no meaning

This means that space around the pattern string must be escaped to be part of the pattern. Using the same example as above produces:

Hello 42

To get a space or newline, it must be quoted onto the string.

Option 2A: Character quoting

Quote the individual characters you want:

#local $foo = {42 :number}
{|\n  |}Hello {$foo}    <- put a newline and two spaces in front, using \n for visibility rather than an actual newline

Option 2B: Quote the pattern

#local $foo = {42 :number}
{\n  Hello {$foo}}

I think I currently prefer Option 2B. This makes any leading whitespace explicitly part of the message when quoted and authors and translators don't have to do "special things" with the spaces. It keeps those spaces from being a "special literal thing" and maybe being a "part" when formatted to parts. This is actually the intention of the developer and translator--a "languageX" translator can remove or change the number of spaces or remove them all without doing anything outside of normal localization. I think that's a feature that's worth more than the implicit spanking for our customer the developer.

(Tools should be encouraged to emit warnings about spaces inside the quotes, or in the case of Option 1, about all whitespace outside the quotes.)

Note that Option 2A or 2B works perfectly fine as a leading-space-free message:

#local $foo = {42 :number}
Hello {$foo} with no newline or spaces on the front
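To make the difference between the proposals concrete, here is a rough Python sketch of what each one treats as the pattern. The function names are hypothetical, placeholders are left unformatted, and the quoted case assumes the caller already knows the pattern is wrapped in braces, which is itself a parsing question.

```python
def all_space_meaningful(after_declarations: str) -> str:
    """Proposal 1: everything after the declarations is pattern text."""
    return after_declarations


def exterior_space_ignored(after_declarations: str) -> str:
    """Proposal 2: unquoted whitespace around the pattern is dropped."""
    return after_declarations.strip()


def quoted_pattern(after_declarations: str) -> str:
    """Option 2B: an outer {...} pair marks the exact extent of the pattern."""
    body = after_declarations.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a pattern wrapped in braces")
    return body[1:-1]


# The text that follows `#local $foo = {42 :number}` in the examples above,
# written here with real newline characters.
unquoted = "\nHello {$foo}"
quoted = "\n{\n  Hello {$foo}}"

print(repr(all_space_meaningful(unquoted)))
print(repr(exterior_space_ignored(unquoted)))
print(repr(quoted_pattern(quoted)))
```

Under proposal 1 the newline after the declaration survives into the pattern; under proposal 2 it only survives when it sits inside the quoting braces.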

Your second question was:

> 2. Can/should we give special treatment to leading/trailing whitespace compared to other localized content?

This is misleading as phrased? Leading/trailing whitespace that is part of the message is not special. They are just characters. The problem is identifying which characters are inside the pattern. In 2B, without quotes, there are never any whitespace characters in that position 🙈😈

@stasm
Collaborator

stasm commented Sep 14, 2023

> I think we made a choice (we can reconsider it if necessary, although I don't think we need to) that MF2 is not a resource format. I'm not sure if "templating language" is the right characterization, but let's go with it for now.

I'd go further: we’re neither resource format nor templating language. Instead, I tend to think we’re a storage format for variants. That’s smaller scope than an entire resource and smaller scope than an entire template.

We made that choice back when we decided to only allow top-level selection. Our logic happens "outside", while variants are "inside". The outside syntax should be friendly to developers and machines. The inside syntax should be friendly to translators. These are the two layers @eemeli mentions above—I think they are a feature.

Furthermore, we don't really even have logic. Messages are not imperative templates with foreach and if blocks. They are literally a flat map of variants.
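As an illustration of the "flat map of variants" idea, here is a small Python sketch. The selection order (exact value, then plural category, then `*`), the English-only plural helper, and the sample patterns are all assumptions for the sake of the example, not the behavior defined by the spec.

```python
def english_plural_category(count: int) -> str:
    """Hypothetical stand-in for real plural rules; English cardinals only."""
    return "one" if count == 1 else "other"


def select_variant(variants: dict[str, str], count: int) -> str:
    """Pick a pattern: exact value first, then plural category, then catch-all."""
    for key in (str(count), english_plural_category(count), "*"):
        if key in variants:
            return variants[key]
    raise KeyError("no catch-all '*' variant")


# The "inside": a flat map from keys to patterns, with no loops or branches.
variants = {
    "0": "Hello {$user}. You have no geese.",
    "one": "Hello {$user}. You have {$count} goose.",
    "*": "Hello {$user}. You have {$count} geese.",
}

for n in (0, 1, 5):
    print(n, "->", select_variant(variants, n))
```

All of the logic lives in `select_variant`, outside the map; the patterns themselves are plain text with placeholders.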

This is why I'm hesitant to draw too much inspiration from existing templating languages. When I see things like the following Jinja:

{% for item in navigation %}
    <li><a href="{{ item.href }}">{{ item.caption }}</a></li>
{% endfor %}

...I can see some parallels to what we're doing, but I'm also cautious of the differences. I'm rather happy with the current syntax in that it doesn't make code statements look like placeholders, because placeholders can be moved around, and text can move around them too, and we specifically don't want that for input, local, or match.

Templating languages also typically either don't worry too much about whitespace (and delegate the problem to HTML), or need to introduce extra syntax to control it. For example, Jinja uses + and - as prefixes or suffixes to tags. I'm sure that's useful and needed, but it also proves that whitespace is hard and requires special handling in templating languages.

I'm not opposed to introducing modality and starting in text, especially for single-variant messages without declarations. However, the more explicit we are about whitespace, the easier it will be, in my mind, to embed messages in code or container formats, because code and container formats most likely will already have opinions about whitespace. This is indeed similar to escaping.

aphillips and others added 5 commits September 22, 2023 13:28
... not expecting us to adopt it, but we need to make progress in deciding the specific issues here.
... which is perhaps indicative of an answer to one of the questions about double-bracketing `match`...
{match {$count :number integer=true}}
{when 0} Hello {$user}. Today is {$now} and you have no geese.
{when one} Hello {$user}. Today is {$now} and you have {$count} goose.
{when few} { Hello {$user}, this message has spaces on the front and end. }
Collaborator

How would this work, parsing-wise?

{when few} {  Hello {$user}, this message has spaces on the front and end.  }
           ^
           how does a parser know this isn't a placeholder's open brace?

I think it may be a good idea to consider using double braces for certain features (e.g. for placeholders, or as pattern delimiters). Alternatively, we may want to revisit the idea of using double sigils for different meanings, e.g. {%, {[, and {{. See #269.

Member

Doubling will make the syntax harder to use/write?

In the above, the whitespace is consumed until you see the { (or any text). The parser knows this isn't a placeholder's opening brace only by scanning ahead. In this message, the embedded {$user is what resolves it. It is possible that one could reach the closing bracket in some messages and that would be the resolution.

The question here is whether we favor this syntax for its usability more than efficiency in parsing.

In my opinion, the real problem here is the when clause. Not only is it visually hard to distinguish but it makes {/} ambiguous (the brackets can be any of three different things). Consider instead:

#match {$count :number integer=true}
[when 0] This has no spaces.
[when one]      This has no spaces.
[when few] {   This has spaces quoted   }
[when *] {|  |}  This has spaces quoted on the front only as placeholder literal.

And it writes single line as:

#match {$count :number integer=true}[when 0]This has no spaces.[when one]   No sp...
...aces   [when few]{ Quoted spaces  }[when *]{|  |} Quoted spaces

The above still has some forward looking ambiguity:

{   $user :foo}   // not resolved till you see $
{   foo :foo  }   // not resolved till you see the `:`
{   foo  {$bar}}  // not resolved till you see the 2nd `{`
{   foo foo   }   // not resolved till you see the closing `}`

So you're right: the parser is more of an adventure once the pattern can be unquoted.
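The lookahead in the four cases above can be sketched in Python as follows. This is a simplified illustration, not a real MF2 parser: the function name and the resolution rules are assumptions that only cover those four shapes.

```python
def resolve_open_brace(src: str) -> tuple[str, int]:
    """Scan past an opening '{' and report what it turned out to be.

    Returns the kind ("expression" or "pattern") and the index of the
    character that settled the question.
    """
    if not src.startswith("{"):
        raise ValueError("input must start with '{'")
    i = 1
    while i < len(src) and src[i].isspace():
        i += 1
    # A variable reference right after the whitespace means an expression.
    if i < len(src) and src[i] == "$":
        return ("expression", i)
    while i < len(src):
        ch = src[i]
        # An annotation such as ` :foo` after a literal means an expression.
        if ch == ":" and src[i - 1].isspace():
            return ("expression", i)
        # A nested '{' or the closing '}' means we were inside a quoted pattern.
        if ch in "{}":
            return ("pattern", i)
        i += 1
    raise ValueError("unterminated brace")


for sample in ["{   $user :foo}", "{   foo :foo  }",
               "{   foo  {$bar}}", "{   foo foo   }"]:
    print(sample, resolve_open_brace(sample))
```

The returned index shows how far a parser has to read before it can commit, which is the cost being weighed against usability here.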

One option we discussed is that the pattern must still be quoted once we are in code mode:

Hello world
Hello {$user}

{input $user}
{Hello {$user}}  // because code mode


{match {$count :number integer=true}}
{when 0} Hello {$user}. Today is {$now} and you have no geese.
{when one} Hello {$user}. Today is {$now} and you have {$count} goose.
Collaborator

@stasm stasm Sep 23, 2023

The {when one} Hello {$user} part makes me really nervous: it looks like two placeholders on both sides of Hello. Furthermore, I think the fact that the space left of Hello will be trimmed but the one on the right will not is a footgun. I think this is the main reason why I've been opposed to trimming (although I understand which problems it addresses).

Could we consider using a different set of brackets for statements? For example, if we made # special in patterns, too, we could consider something like the following:

#[one] Hello {$user}.

Does this look like less of a footgun now when it comes to the rules about which space will be trimmed and which one won't?


Interestingly, there's something about square brackets that makes me not actually mind the following:

#[one]Hello {$user}.

Member

See above 😸

I kind of like this proposal.

#match {$expr} {$expr}
#[key key] Hello {$user}
#[key *  ] {$user} hello
#[*   *  ] { Quoted pattern }

Or:

#match {$expr} {$expr}#[key key] Hello {$user}#[key *  ] {$user} hello#[*   *  ] { Quoted pattern }

Collaborator Author

Past feedback from CLDR in particular has been very critical about reserving any characters other than \{} in text mode. However, a two-character sequence like #[ would probably be much more acceptable, given how rare it is in actual message contents.

If we do have a real concern that, in an all-code-delimited solution, the "statement" parts may look too much like placeholders, that should be added to the doc.

In the continuing absence of any exemplars of non-i18n-buggy leading or trailing whitespace, that may end up as a sole reason to prefer quoting entire patterns.

Member

One quotes the pattern because one is doing something weird inside the pattern. I'm way way way more cautious about quoting sub-patterns, because those feel more like I18N bugs to me:

This is{| |}{$foo}{| |}an I18N bug waiting to happen.

{This is }{$bar}{ another I18N bug waiting to happen.}

{This} {is} {just} {silly} {😸}

Interestingly, this past week I was consulting with some folks and saw a string like "\n\n{$placeholder}", which was being used to make an on-screen list (in a for loop). That's not really an I18N bug, although it's not normal either. (And actually what it was was a hardcoded string "\n\n" followed by a + someVar that I made them move to externalized with a formatter...)

As I noted in Seville, there are many special-snowflake cases for exterior whitespace. They are not "compelling" use cases and yes one could code around them. If the proposal is to remove quoted patterns, I suppose I could get behind that...

Collaborator

> However, a two-character sequence like #[ would probably be much more acceptable, given how rare it is in actual message contents.

With the warning that # is a comment in some of the container formats (for example Java properties and gettext .po files).

Collaborator

Overall, I see potential in the #[...] syntax for code (statements). A lot of languages allow such extra directives for code, often called attributes: C++ uses [[foo]], C# uses [foo], Rust uses #[foo].

Collaborator Author

> Interestingly, this past week I was consulting with some folks and saw a string like "\n\n{$placeholder}", which was being used to make an on-screen list (in a for loop). That's not really an I18N bug, although it's not normal either. (And actually what it was was a hardcoded string "\n\n" followed by a + someVar that I made them move to externalized with a formatter...)

This appears to be a non-locale-dependent use case of leading whitespace, where the \n\n is effectively used as markup, yes? So if (for whatever reason) this message needed to be expressed entirely within MF2, would it make sense to expect this to be represented as {|\n\n|}, where the \n represent actual newline characters?

One comparable and valid locale-dependent case that I can imagine existing is sentence concatenation in a context that needs to account for both CJK and non-CJK scripts. Similarly to the \n\n, I could imagine a space after a period to be included as a leading or trailing space in a single-sentence message for the non-CJK scripts, rather than being handled in code depending on the locale's script.

Are there any other locale-dependent uses of leading or trailing whitespace that we ought to consider? And is the case I represent above actual, or purely hypothetical? As in, could someone here state that they do have messages like this in their corpus? And if they do, how do they communicate to their translators that in these particular cases, the whitespace should be removed in CJK locales, while something like the \n\n above should be left alone?

Member

> Are there any other locale-dependent uses of leading or trailing whitespace that we ought to consider? And is the case I represent above actual, or purely hypothetical?

There are lots of locale-dependent cases too. I cited a random example that came my way in the past week because I don't have the luxury of grepping an employer's vast collection of strings at the moment. I'm hopeful someone else can do that, but I don't expect to learn anything from it.

> how do they communicate to their translators that in these particular cases, the whitespace should be removed in CJK locales, while something like the \n\n above should be left alone?

"Carefully." I have seen comments in resource files, comments in translation kits, comments in tooling. I also note that most localization engineering shops maintain tools for checking that target language strings match source strings in terms of start/end spacing, punctuation, and placeables. Some languages (such as CJK) produce a lot of noise or the need for message suppression or tuning in these cases.
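A check of the kind such tools perform might look roughly like the following Python sketch. The function names and sample strings are hypothetical, and real tools also compare punctuation and placeables, which this sketch leaves out.

```python
def exterior_whitespace(text: str) -> tuple[str, str]:
    """Split off the leading and trailing whitespace of a string."""
    leading = text[: len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]
    return leading, trailing


def whitespace_mismatches(source: str, target: str) -> list[str]:
    """Report which ends of the target differ from the source in whitespace."""
    src_lead, src_trail = exterior_whitespace(source)
    tgt_lead, tgt_trail = exterior_whitespace(target)
    issues = []
    if src_lead != tgt_lead:
        issues.append("leading")
    if src_trail != tgt_trail:
        issues.append("trailing")
    return issues


# Markup-like whitespace that the translation kept: no report.
print(whitespace_mismatches("\n\n{$placeholder}", "\n\n{$placeholder}"))
# A trailing space dropped in a CJK target: flagged, although it may be correct.
print(whitespace_mismatches("Next sentence. ", "次の文。"))
```

The second call shows where the noise comes from: the checker cannot know that dropping the space is the right translation.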

I don't have a problem with users putting {| \n \t |} onto the front of a pattern as a quoted blob to preserve across translations. But you appear to be building towards the suggestion of only allowing that case. I think that format is inconvenient to write and runs up against languages that want other behavior--and also that this isn't really a problem for most translation processes (where the whitespace is already treated as meaningful)

Either way we have to describe how exterior whitespace is handled in patterns. So it doesn't really help me, individually, decide how to choose between auto-trimmed vs. non-trimmed unquoted patterns. I think the thing I'm trying to puzzle out for myself is: which is the most natural and least surprising representation of a pattern?

when [*]No whitespace is no whitespace.
when [*] Whitespace is trimmed.
when [*]
    All whitespace is trimmed.
when [*] Whitespace is meaningful, so there is a space before this string.
when [*]\n   All whitespace is meaningful. // read \n as newline
when [*]
   {
   Unquoted whitespace is trimmed, but this message has newline and spaces around it
   }
when [*]{\n   The same message normalized to a single line with\n   }

Collaborator Author

> There are lots of locale-dependent cases too. [...] I don't have the luxury of grepping an employer's vast collection of strings at the moment. I'm hopeful someone else can do that, but I don't expect to learn anything from it.

I really hope someone can do that. I have heard now on numerous occasions that there are many locale-dependent uses for leading or trailing whitespace, but this specific one -- how CJK scripts do not use spaces between clauses -- is literally the only one I am aware of.

If we end up making a specific accommodation for this in MF2, this argument needs to be really well made, and presented in this design doc. I am not the right person for doing that, and so I continue to ask others to help here.

> [...] Some languages (such as CJK) produce a lot of noise or the need for message suppression or tuning in these cases.

Could we improve that experience? Rather than perpetuating a sub-optimal CJK translation experience, could we somehow explicitly differentiate localizable leading/trailing whitespace from markup whitespace? If those spaces needed to be explicit, and we did adopt expression attributes, that could be done with

{| | @translate}

> I don't have a problem with users putting {| \n \t |} onto the front of a pattern as a quoted blob to preserve across translations. But you appear to be building towards the suggestion of only allowing that case. I think that format is inconvenient to write and runs up against languages that want other behavior--and also that this isn't really a problem for most translation processes (where the whitespace is already treated as meaningful)

My claim is that almost all leading & trailing whitespace is not really localizable content, and by default should not be in messages. I do still want to allow for the possibility of including such whitespace, and having a way of making it clear to both humans and tooling when such whitespace is markup, and when it is localizable.

We need to identify and enumerate the explicit use cases for leading/trailing whitespace, and to make our syntax choices based on that. Thus far, we have not done so; we've just accepted the assertion that leading/trailing whitespace must be accommodated for ergonomically. Once we do have such a list written down somewhere, we may use that to direct our choice, e.g. within the context of the #485 beauty contest.

And yes, until convinced otherwise, my current position is indeed that we should require leading/trailing whitespace to be explicitly quoted, because that's the only way to differentiate localizable and non-localizable whitespace. To make me change my mind, I can imagine at least the following categories of arguments that could be made:

  1. Locale-dependence. Show me that there is a great range of different types of localizable dynamic message strings where leading/trailing whitespace needs to be handled differently in different locales, such that my base assumption about most such cases being "markup" is invalid, and that either both are as common, or that it's more common for there to be a locale dependency.
  2. Better syntax. Show me syntax which will lead developers to leaving more non-localizable whitespace out of messages and/or communicating better to translators the localizability of whitespace.
  3. Numbers. Look into the data, i.e. the corpus of localizable dynamic message strings that you have access to, and tell me that such a large fraction of them include valid, localizable leading/trailing whitespace, that special accommodation must be made for them in the syntax. Extra credit if you can further say how common this is in multi-variant messages. I don't need to see your messages, I'm interested in the frequency, and a data-driven argument.

Member

> My claim is that almost all leading & trailing whitespace is not really localizable content, and by default should not be in messages. I do still want to allow for the possibility of including such whitespace, and having a way of making it clear to both humans and tooling when such whitespace is markup, and when it is localizable.

I think your claim is reasonable. But I would also observe that this is less true for desktop and CLI applications. I would also note that trailing whitespace is probably as important as leading whitespace.

Overall, I think my reaction on this thread is to look to overall syntax first and whitespace handling as a downstream consideration. A number of syntax options end up with pattern quoting that may make this discussion moot. And, as noted above, each of the whitespace handling options represents a different compromise, depending on who is looking and what the use case is.

@eemeli
Collaborator Author

eemeli commented Oct 24, 2023

Following yesterday's call, I've updated this PR. It's no longer considering any explicit syntax, but more precisely the "parse mode" question that this is nominally about, approaching this from two related axes:

  1. Whether to start in "code" or "text".
  2. For "text", whether/how to trim whitespace.

On the trimming, I've split out three choices from the previous design and propose one as a part of the solution here.

@aphillips With the specific syntax questions taken out, I think the remainder here could be used as a basis for our next-step discussions. In addition, of course, to whatever changes @echeran and @mihnita may be proposing to our pattern-exterior whitespace consensus.

@eemeli eemeli marked this pull request as ready for review October 24, 2023 13:11
@eemeli added the design and Agenda+ labels Oct 24, 2023
Member

@aphillips aphillips left a comment

I think a key insight that this version currently hides is this:

The whitespace handling is not about the message as a whole. It is about identifying the boundary between the pattern and code.

In a couple of places the options given here talk about, e.g., whitespace between declarations--which no one in this group would expect to be part of a pattern (I think?).

By recasting this as pattern/code boundary handling I think we could make it clearer. That would make options look more like:


(assumption: simple patterns are not trimmed or are a separate debate; quoted variant patterns are not trimmed)

  1. (old syntax) Start in code, all patterns are quoted.
  2. (syntax from #500, "Implement text-mode-first syntax") Start in text, code is quoted, all variant patterns are quoted.
  3. Start in text, code is quoted, unquoted variant patterns are trimmed.
  4. Same as 2 except uses @eemeli's "minimal trimming" of unquoted variant patterns
  5. Start in text, code is quoted, unquoted variant patterns are not trimmed.

exploration/text-vs-code.md Outdated

## Objective

Decide whether text patterns or code statements should be enclosed in MF2.
Member

I don't think this is that clear.

Suggested change
-Decide whether text patterns or code statements should be enclosed in MF2.
+Decide how to segregate and identify between _pattern_ text and code statements in MF2.
+This includes whether parsing a message expects to start with _pattern_ text or with code statements.

Collaborator Author

The "how" part would be an extension of what the design doc is currently doing. Is that intentional?

Member

You're right, but this isn't about "whether text patterns or code statements should be enclosed" but rather about which should be enclosed and when. And the design decision is really about the general syntax (text-vs-code and trimmed-vs-untrimmed-vs-quoted).

exploration/text-vs-code.md
Comment on lines +212 to +216
Expressing the trimming on patterns rather than statements
means that leading and trailing spaces are also trimmed from simple messages.
This option is not chosen due to this being somewhat surprising,
especially when messages are embedded in host formats that have predefined means
of escaping and/or trimming leading and trailing spaces from a value.
Member

Trimming simple patterns like this is a bridge too far for me. It should be a separate decision for the "trim XXX" options whether they are trimmed. I can make a plausible argument for why simple patterns should behave differently when trimmed than variant patterns do.

Comment on lines +174 to +175
This option is not chosen due to adding an excessive
quoting burden on all messages.
Member

I think we should not include these "This option is not chosen due..." paragraphs. I think it is okay to call out objective or subjective reasons for why we might not choose a given alternative, e.g.

Suggested change
-This option is not chosen due to adding an excessive
-quoting burden on all messages.
+- This option makes plain text strings invalid as messages.
+- This option requires additional quoting for simple messages.

Our choice section should deal with the logic of why a given option was chosen.

Collaborator Author

Could you clarify what you mean by "choice section"? I'm not sure that I understand what that is.

Co-authored-by: Addison Phillips <addison@unicode.org>
exploration/text-vs-code.md Outdated
Comment on lines +158 to +161
Trim whitespace between and around statements such as `input` and `when`,
but do not otherwise trim any leading or trailing whitespace from a message.
This allows for whitespace such as spaces and newlines to be used outside patterns
to make a message more readable.
Collaborator

I'm concerned about this resulting in surprising misbehavior... consider a developer making the following change, in which removing all inputs from a message that still starts and ends with a line feed would not be expected to affect whitespace:

logOutMessage = ```
-{%input username}
-
-Log out {$username}?
+Log out?
```;

Collaborator Author

Ah, fair point. The issue here is that the above edit would add a new line to the message's start, yes? One way to avoid this would be to require a message with statements to not have leading whitespace. That might be a reasonable restriction.
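
To make the surprise concrete, here is a rough sketch of the "trim around statements" rule applied to both versions of that message. It is a toy model, not the MF2 parser: the function name `patternAfterStatementTrim` is made up for illustration, and it assumes a statement is a leading `{%...}` block with no nested braces.

```javascript
// Toy model of "trim whitespace between and around statements, but do not
// otherwise trim the message". Hypothetical helper, not part of any MF2 library.
// Assumption: a statement is a leading `{%keyword ...}` block without nested braces.
function patternAfterStatementTrim(source) {
  const statement = /^\s*\{%[^{}]*\}\s*/;
  let rest = source;
  let sawStatement = false;
  // Strip each leading statement together with the whitespace around it.
  while (statement.test(rest)) {
    rest = rest.replace(statement, '');
    sawStatement = true;
  }
  // With no statements, nothing is trimmed: the whole source is the pattern.
  return sawStatement ? rest : source;
}

const before = '\n{%input username}\n\nLog out {$username}?\n';
const after = '\nLog out?\n';

console.log(JSON.stringify(patternAfterStatementTrim(before))); // "Log out {$username}?\n"
console.log(JSON.stringify(patternAfterStatementTrim(after)));  // "\nLog out?\n"
```

Under these assumptions the edited message gains a leading newline that the original did not have, even though the developer only removed the input statement and the placeholder.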

Member

@aphillips aphillips Oct 25, 2023

I agree that @gibson042's illustration is informative. I would have guessed (if I didn't know anything about MF2) that the "before" of logOutMessage included a newline. With some of the options here, the newline after {%input $username} is consumed, but not the blank line after that. In other options both newlines are consumed unless the user specifically quoted the pattern or whitespace.

Also, note that many formats would encode the example as:

// using our current syntax but without quoting the pattern
var logOutMessage = "{{{#input $username}\n\nLog out {$username}?}}"

@eemeli
Copy link
Collaborator Author

eemeli commented Oct 24, 2023

> I think a key insight that this version currently hides is this:
>
> The whitespace handling is not about the message as a whole. It is about identifying the boundary between the pattern and code.

The way I see it, it's about enabling the same sort of thing we do elsewhere in the syntax with e.g.

match {$foo} {$bar}

The spaces there are not required by the syntax, but they make that line much more readable.

> In a couple of places the options given here talk about, e.g., whitespace between declarations--which no one in this group would expect to be part of a pattern (I think?).

That's needed in the "trim minimally" option to clarify its corner cases. I do not think we should choose such an approach.

> (assumption: simple patterns are not trimmed or are a separate debate; quoted variant patterns are not trimmed)

I rather hope that we could establish that here, or at least explicitly put it down in writing.

> By recasting this as pattern/code boundary handling I think we could make it clearer.

I agree that there are multiple ways of looking at what we're doing. I think different choices on trimming lead to different points of view being more or less appropriate, such that no one viewpoint is optimal for all possible solutions. If you've specific suggestions for the alternatives presented here, those might be easier to assess individually.

Note also that I've tried to not say here how code ought to be "encapsulated"; that I think can be discussed as a separate dimension: Do we use wrapping {quotes} or a starting %sigil with an implicit end? Is the encapsulation around one or multiple statements? These questions hopefully don't need to be mingled into this discussion.

@aphillips
Member

Following a discussion with @eemeli @echeran @mihnita and @stasm, I am merging this PR without approvals so that @echeran and @mihnita can take the pen. They drew an action item in the 2023-10-23 call to produce a design document regarding whether we support unquoted variant patterns or not.

@aphillips aphillips merged commit 3daea5a into main Oct 25, 2023
1 check passed
@aphillips aphillips deleted the text-vs-code branch October 25, 2023 16:00