Simplify source bidi isolation rules #781

eemeli · 2024-05-06T11:37:56Z

Drop the bidi rule, and allow name to be isolated with LRI/RLI/FSI.

Allow an LRI immediately after a non-content newline.

Relax expression & markup isolation to not require pairing on a syntactic level, as the LRI can also be terminated by a newline.

aphillips · 2024-05-06T13:49:03Z

I wish you'd added this as a separate alternative.

I don't like that the isolates are part of the name rule. I worked hard to keep the isolates outside the rules for important constructs (like name).

You removed unquoted literals from being amenable to bidi isolation, but they should still be isolatable, no?

eemeli · 2024-05-06T15:16:26Z

> I don't like that the isolates are part of the name rule. I worked hard to keep the isolates outside the rules for important constructs (like name).

Including the isolates in name doesn't change its parsed meaning, much like the | characters aren't part of the parsed meaning of a quoted literal. It's the same situation as with isolated expressions, markup and patterns.

> You removed unquoted literals from being amenable to bidi isolation, but they should still be isolatable, no?

They are, covered by the change to name:

unquoted       = name / number-literal

number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

aphillips · 2024-05-06T15:36:17Z

The problem with allowing isolates into name is that it makes name comparison harder. Shouldn't the following two names be equal?

\u2066name\u2069
name

> number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

Actually, numbers are complicated in bidi because digits are weakly directional. The minus sign can swing around onto the "wrong" side visually.
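To make the directionality point concrete, the Bidi_Class of the characters involved can be inspected with Python's standard `unicodedata` module. This is only an illustration: the `samples` table is made up for this note, and the snippet does not reproduce visual reordering itself.

```python
import unicodedata

# Hypothetical sample set: characters that come up in this thread.
samples = {
    "digit 1": "1",
    "hyphen-minus": "-",
    "Latin a": "a",
    "Hebrew alef": "\u05D0",
    "LRI": "\u2066",
    "PDI": "\u2069",
}

# Map each label to its Unicode Bidi_Class abbreviation.
bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}

for label, cls in bidi_classes.items():
    print(f"{label:14} {cls}")
```

European digits report `EN` and the hyphen-minus reports `ES`. Both are weak types whose resolved direction depends on the surrounding text, whereas letters are strong (`L` or `R`).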

The other reason I had unquoted and quoted together is that it simplifies what tools have to do. A tool can blindly isolate any literal separate from the decision to quote it and can blindly remove isolates from literals without looking at the contents.
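A rough sketch of the "blind" tool behaviour described above, assuming the isolates sit outside the literal's own syntax. The helper names `isolate` and `strip_isolates` are hypothetical and not taken from any implementation.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
OPEN_ISOLATES = (LRI, RLI, FSI)


def isolate(literal: str, opener: str = FSI) -> str:
    """Wrap an already serialized literal in an isolate pair.

    The literal's contents are never inspected, so the same call works
    for quoted and unquoted literals alike.
    """
    return f"{opener}{literal}{PDI}"


def strip_isolates(token: str) -> str:
    """Remove one enclosing isolate pair if present, else return the input."""
    if len(token) >= 2 and token[0] in OPEN_ISOLATES and token[-1] == PDI:
        return token[1:-1]
    return token
```

Neither function needs to know whether the literal was quoted, which is the simplification the comment is arguing for.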

eemeli · 2024-05-07T09:44:03Z

> The problem with allowing isolates into name is that it makes name comparison harder. Shouldn't the following two names be equal?
>
> \u2066name\u2069
> name

As proposed, both of those strings would match the name rule, but as \u2066 and \u2069 are not valid name-char characters, they would be parsed according to the open-isolate and close-isolate rules, with name-body matching the four-character "name" string in both cases.

So the parsed value of the name would be "name" for both of the above, and they would be considered equal.
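The parse described here can be sketched with a regular expression. This is a minimal illustration, not the parser mentioned later in the thread: `parse_name` is a hypothetical helper, each isolate is treated as optional, and the ASCII-only character classes are a stand-in for the much broader name-start and name-char rules.

```python
import re

NAME = re.compile(
    r"[\u2066-\u2068]?"                    # optional open-isolate (LRI/RLI/FSI)
    r"(?P<body>[A-Za-z_][A-Za-z0-9_.-]*)"  # name-body (simplified, ASCII only)
    r"\u2069?"                             # optional close-isolate (PDI)
)


def parse_name(source: str) -> str:
    """Return the parsed value of a name, excluding any isolate characters."""
    match = NAME.fullmatch(source)
    if match is None:
        raise ValueError(f"not a name: {source!r}")
    return match.group("body")


isolated = parse_name("\u2066name\u2069")
plain = parse_name("name")
print(isolated == plain)
```

Because only the `body` group is returned, the isolates are consumed by the rule without becoming part of the value that gets compared.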

> > number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.
>
> Actually, numbers are complicated in bidi because digits are weakly directional. The minus sign can swing around onto the "wrong" side visually.

But number-literal only shows up in "code", which is always LTR, yes?

> The other reason I had unquoted and quoted together is that it simplifies what tools have to do. A tool can blindly isolate any literal separate from the decision to quote it and can blindly remove isolates from literals without looking at the contents.

The proposed change doesn't change the number of constructs for which this can be done; it replaces "unquoted literals" with "names". Doing so removes the need to separately pick out the LRM/RLM/ALM from the productions that include name.

aphillips

Please change this PR to make your proposal an additional option, not overwriting the original design.

aphillips · 2024-05-09T13:30:39Z

exploration/bidi-usability.md

-               / (quoted / (unquoted [bidi]))
-quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
-               / ("{{" pattern "}}")
+name           = (open-isolate name-body close-isolate) / name-body


This is a problem because name is used to build a variety of other constructs (variable, reserved-keyword, identifier, etc.). This change puts the isolates inside these constructs, e.g. $\u2066name\u2069 rather than on the outside.

This will make it harder for implementations, since they can't take the parsed token and compare it immediately. They have to stop to remove isolates. My original design avoided this problem by making the isolates not parse into names/identifiers/tokens.

eemeli · 2024-05-13T09:30:31Z

As requested, refactored as an alternative to the proposed solution. Also addressed the concerns identified in #787 and #788, and added an example showing how name isolation avoids a spillover the current proposal cannot.

I have also validated this solution by implementing it in my parser.

aphillips

The requested changes are editorial. Otherwise I would approve this addition.

I don't think I agree with this option. The strongly directional marks are included in the proposed solution for a different reason than might be assumed (I mention this below) and I don't think putting isolates into name has been fully accounted for.

aphillips · 2024-05-13T14:13:56Z

exploration/bidi-usability.md

+2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM,
+   allow for its proper isolation.


I wouldn't call what we did above "patching". Allowing the strongly directional marks lets bidi users include them the way they normally would when editing text, so that the string looks okay in a normal text editor. The productions we used don't make these marks part of the token, so they don't affect processing.

Allowing isolation is a separate consideration.
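As a hedged illustration of marks that are accepted but kept out of the token, the sketch below allows one optional trailing LRM, RLM, or ALM after a name and returns only the name. The helper `read_name` is hypothetical, and the ASCII-only name pattern is a simplification of the real rules.

```python
import re

# LRM is U+200E, RLM is U+200F, ALM is U+061C.
NAME_WITH_MARK = re.compile(
    r"(?P<name>[A-Za-z_][A-Za-z0-9_.-]*)"  # the token itself (simplified)
    r"[\u200E\u200F\u061C]?"               # optional trailing mark, not captured
)


def read_name(source: str) -> str:
    """Return the name token; a trailing directional mark is consumed but dropped."""
    match = NAME_WITH_MARK.fullmatch(source)
    if match is None:
        raise ValueError(f"not a name: {source!r}")
    return match.group("name")


print(read_name("name\u200F") == read_name("name"))
```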

aphillips · 2024-05-13T14:15:41Z

exploration/bidi-usability.md

+
+Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI.
+For names and quoted literals, the isolate characters are outside the body of the token,
+but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.


"middle" could mean anywhere inside the pattern quotes.

Suggested change:

-but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
+but for quoted patterns, the isolates are in between the `{` and `}` in the `{{` and `}}` sequences.

aphillips · 2024-05-13T14:19:03Z

exploration/bidi-usability.md

+```abnf
+name           = [open-isolate] name-start *name-char [close-isolate]
+quoted         = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
+quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}"


This puts the isolate inside the {{ and }}? Asking to be sure I'm reading this right. The above text didn't seem to mean this, although now I see your intention.

eemeli · 2024-05-13

Yes, that's the intent: {\u2066{
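The placement confirmed here can be checked with a small matcher. This is a sketch under assumptions: `pattern_body` is a hypothetical helper, and the pattern body is simplified to arbitrary text rather than the full pattern grammar.

```python
import re

QUOTED_PATTERN = re.compile(
    r"\{[\u2066-\u2068]?\{"  # "{" [open-isolate] "{"
    r"(?P<pattern>.*?)"      # pattern body (simplified to any text)
    r"\}\u2069?\}",          # "}" [close-isolate] "}"
    re.DOTALL,
)


def pattern_body(source: str):
    """Return the pattern text, or None if source is not a quoted pattern."""
    match = QUOTED_PATTERN.fullmatch(source)
    return match.group("pattern") if match else None


inside = pattern_body("{\u2066{Hello}\u2069}")    # isolates between the braces
bare = pattern_body("{{Hello}}")                  # no isolates at all
outside = pattern_body("\u2066{{Hello}}\u2069")   # isolates outside the braces
```

With the isolates inside the braces, a quoted pattern always starts with `{`, so the parser can recognize it from the first character. Input with the isolates on the outside does not match this sketch at all.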

Co-authored-by: Addison Phillips <addison@unicode.org>

Simplify source bidi isolation rules

5156778

eemeli requested a review from aphillips May 6, 2024 11:37

aphillips added the syntax, design, and LDML46 labels May 6, 2024

aphillips requested changes May 9, 2024

This was referenced May 13, 2024

[FEEDBACK] Isolating quoted patterns on the outside adds a lookahead to the syntax #787

Closed

[FEEDBACK] Unpaired bidi isolates should not be a parse error #788

Closed

Refactor as added alternative

55a9f40

eemeli requested a review from aphillips May 13, 2024 09:30

aphillips requested changes May 13, 2024

Apply suggestions from code review

921ed56

Co-authored-by: Addison Phillips <addison@unicode.org>

aphillips merged commit 7cdea8e into main May 13, 2024
1 check passed

aphillips deleted the bidi-less branch May 13, 2024 16:41
