Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bidirectional isolation #30

Merged
merged 4 commits into from
Dec 4, 2023
Merged

Add bidirectional isolation #30

merged 4 commits into from
Dec 4, 2023

Conversation

eemeli
Copy link
Member

@eemeli eemeli commented Aug 18, 2023

This fulfills the spec requirements for Handling Bidirectional Text.

The ApplyBidiIsolation AO will in the spec text need more formal descriptions of how the "compatibility" and "none" strategies work, but I thought that would be a bit heavy for the README.

We could also define something like a "best fit" option to match what's done with locale negotiation, but I left that out as I wasn't sure exactly what it would look like.

CC @aphillips

Copy link
Contributor

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The locale can only be a hint.

In my view, the message as a whole should have a base paragraph direction. And each placeholder should have a base paragraph direction property.

When formatting to parts, one ends up with a sequence of parts that each have a queryable direction. When formatting to HTML this direction can be used to set the dir attribute on markup, which is how bidi isolation is done in HTML.

When formatting to a string, this algorithm is thus mostly correct in its behavior, since it places isolating controls around placeholders.

Some will note that a placeholder can, itself, contains "parts" that have direction. For example, a date format produces a sequence of "parts" that are date/time fields (such as year or hour or such) or literals. However, I note that these "interior parts" are the responsibility of the formatting function. The important part of bidi isolation in MF2 is that the placeholder "sticks together" (is spanned with isolates).

My concern with this proposal is that it does not start from parts or with having a directionality property. It tries to infer direction from locale (which is not always appropriate and thus shouldn't form the foundation of directionality support)

@eemeli
Copy link
Member Author

eemeli commented Aug 21, 2023

@aphillips Do I understand right that you think we should include dir as an explicit option for the formatter in addition to locale, and to include it for each of the formatted parts?

My presumption was that from the locale code we could get the script, from which we could determine the direction. But that's not always valid then?

@aphillips
Copy link
Contributor

aphillips commented Aug 21, 2023

@eemeli More like: every formatted part should have a dir property. This can be generated by a formatter (and might, indeed, be inferred from the locale if that is how the formatter figures out the base paragraph direction of e.g. a date string). However, not every placeholder is generated by a formatter. Having a dir property also allows assembly of messages whose character sequences would fool auto. Consider:

Input variables:

{  "item": {
    "id": "978-111887164-5",
    "title": {
        "value": "HTML و CSS: تصميم و إنشاء مواقع الويب!",
        "lang": "ar",
        "dir": "rtl"
    }
}

And message:

{Your purchased the item {$title.value @lang=$title.lang @dir=$title.dir}.}

In this case, you want the title bidi isolated (it's in Arabic) and you don't want a direction to be LTR (implied by HTML on the front). Only the metadata associated with the original data value conveys this.

We have a long document discussing this: Strings on the Web: Language and Direction Metadata. It includes a discussion of why language tags (and their extant or implied script subtag) is only a fallback.

Thus dir isn't really a formatter option. But it might be input about some operand. And it might be something that is overridden on occasion.

@eemeli
Copy link
Member Author

eemeli commented Sep 1, 2023

After discussion with @aphillips elsewhere, I've rebased and updated this PR to include dir as a supported option, and to use that rather than the locale to determine the isolation to use.

This also adds dir: 'ltr' | 'rtl' | 'auto' as a property of the formatted parts that also include locale.

Copy link
Contributor

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes. Some comments.

Comment on lines +156 to +160
By default the message's directionality is determined from
the script corresponding to the first locale,
but this may be overridden by `dir`.
Its `"auto"` value corresponds to messages with unknown directionality,
for which the direction is determined by the first strongly directional character.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would unpack this so that the hierarchy is clear.

Suggested change
By default the message's directionality is determined from
the script corresponding to the first locale,
but this may be overridden by `dir`.
Its `"auto"` value corresponds to messages with unknown directionality,
for which the direction is determined by the first strongly directional character.
The base paragraph direction of a message is set by the `dir` property, if it is present.
If there is no `dir` property, the direction is determined by the result of `getTextInfo()` for the `locale` of the message.
If the direction cannot be determined, the value of `dir` is `"auto"` and directionality of the
message will be determined at runtime using the directionality of characters in the message
(known as "first-strong" detection).

Custom user-defined message formatting and selection functions may defined by the `functions` option.
These allow for any data types to be handled by custom functions.
Such functions may be referenced within messages,
and then called with the resolved values of their arguments and options.

```ts
interface MessageFormatOptions {
bidiIsolation?: 'compatibility' | 'none';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compatibility sounds like it exists for legacy reasons rather than wanting to be the default options. Perhaps give it a less redolent name, like unicode or controls? Would formatToParts have a different option (for when the caller wants to use markup)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "compatibility" name is explicitly mentioned in the MF2 spec: https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md#handling-bidirectional-text

If it should be called something else, it would be best to re-name it there so that the same name will end up in other implementations as well.

The bidi isolation strategy is set in the constructor, so each needs to work for all formatting targets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should figure out the naming there.

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
1. Let `parts` be the result of calling `mv.toParts()`.
1. If the call fails or `parts` is not an array:
1. Set `parts` to be `[{ type: 'fallback', locale: 'und', source: mv.source }]`.
1. Set `parts` to be `[{ type: "fallback", source: mv.source }]`.
1. Set `dir` to be `"auto"`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

Comment on lines +249 to +256
1. Let `bidi` be the `{ start: string, end: string }` result of calling
`ApplyBidiIsolation(bidiIsolation, msgDir, dir)`.
1. If `bidi.start` is not an empty string:
1. Append `{ type: 'bidiIsolation', value: bidi.start }` to `res`.
1. For each `part` or `parts`:
1. Append `part` to `res`.
1. If `bidi.end` is not an empty string:
1. Append `{ type: 'bidiIsolation', value: bidi.end }` to `res`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the discussion we were having on the parts spec in MF2. This design makes it hard for developers who are using the parts to build HTML using HTML's dir attribute, because instead of each part just having a direction, the developer getting a bidiIsolation part has to peek ahead to figure out what the text is that it applies to. They then need to decide whether to generate separate bidi markup (such as a span element) or try to apply dir to a known inline element (which is why they got parts in the first place--to be able to decorate the string with em, strong, bdi 😉, or some CSS class.

This algorithm works for compatibility mode, since it just has the bidi control in it. Is your intention that none conveys what I just described above? If so, then the name might be wrong?

Perhaps make the names:

compatibility => controls
none => inline or maybe attributes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bidiIsolation parts are in addition to the dir that's on the formatted parts. So a developer building HTML could just ignore the bidiIsolation parts and build e.g. <span dir> elements as necessary from the dir attributes.

On the other hand, in this particular case the bidiIsolation parts also act as start/end markers in case the value formats to more than one top-level part (Note: this is different from e.g. number parts containing sub-parts), so that a single multi-part value could be completely encompassed in an appropriate <span dir> if necessary.

Regarding the "compatibility" name, see my comment here. Regarding "none", I still think it's right to say that using it no bidi isolation is provided. Given that the dir values are still included in the parts output, it is of course possible for a consumer to construct its own bidi isolation from the provided information, but Intl.MessageFormat isn't doing it in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that works for me, although I still want to think about the naming.

README.md Outdated
Comment on lines 583 to 584
locale: 'und';
dir: 'auto'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both locale and dir can be known from the call to MF. Any reason to suppress that information?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fallback, we can't know without introspection what the source direction is, and we should not presume that the contents match the locale. In other words, we do not know the locale or direction of a fallback part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, if source is externally provided and isn't the pattern string, I can agree about the direction. The locale, though, really is the locale of the formatter, I think. Otherwise the caller won't know if some locale defaulting happened. If the source is just an identifier, its actual language is probably "programmer-ese" and not any particular natural language. But the language of the words in the source don't have to be in the locale's language for that to be the correct locale for the message.

README.md Show resolved Hide resolved
Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
Copy link
Contributor

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. My comments about naming remain, but they are minor and I could live with what's here.

@eemeli
Copy link
Member Author

eemeli commented Dec 4, 2023

I've added an upstream issue about the "compatibility" name: unicode-org/message-format-wg#549

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants