Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify source bidi isolation rules #781

Merged
merged 3 commits into from
May 13, 2024
Merged

Simplify source bidi isolation rules #781

merged 3 commits into from
May 13, 2024

Conversation

eemeli
Copy link
Collaborator

@eemeli eemeli commented May 6, 2024

Drop the bidi rule, and allow name to be LR/RL/FS -isolated.

Allow an LRI immediately after a non-content newline.

Relax expression & markup isolation to not require pairing on a syntactic level, as the LRI can also be terminated by a newline.

@eemeli eemeli requested a review from aphillips May 6, 2024 11:37
@aphillips
Copy link
Member

I wish you'd added this as a separate alternative.

I don't like that the isolates are part of the name rule---I worked hard to keep the isolates outside the rules for important constructs (like name)

You removed unquoted literals from being amenable to bidi isolation, but they should still be isolatable, no?

@aphillips aphillips added syntax Issues related with MF Syntax design Design principles, decisions LDML46 LDML46 Release (Tech Preview - October 2024) labels May 6, 2024
@eemeli
Copy link
Collaborator Author

eemeli commented May 6, 2024

I don't like that the isolates are part of the name rule---I worked hard to keep the isolates outside the rules for important constructs (like name)

Including the isolates in name doesn't change its parsed meaning, much like the | aren't a part of the parsed meaning of a quoted literal. It's the same situation as with isolated expressions, markup and patterns.

You removed unquoted literals from being amenable to bidi isolation, but they should still be isolatable, no?

They are, covered by the change to name:

unquoted       = name / number-literal

number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

@aphillips
Copy link
Member

The problem with allowing isolates into name is that it makes name comparison harder. Shouldn't the following two names be equal?

\u2066name\u2069
name

number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

Actually, numbers are complicated in bidi because digits are weakly directional. The minus sign can swing around onto the "wrong" side visually.

The other reason I had unquoted and quoted together is that it simplifies what tools have to do. A tool can blindly isolate any literal separate from the decision to quote it and can blindly remove isolates from literals without looking at the contents.

@eemeli
Copy link
Collaborator Author

eemeli commented May 7, 2024

The problem with allowing isolates into name is that it makes name comparison harder. Shouldn't the following two names be equal?

\u2066name\u2069
name

As proposed, both of those strings would match the name rule, but as \u2066 and \u2069 are not valid name-char characters, they would be parsed according to the open-isolate and close-isolate rules, with name-body matching the four-character "name" string in both cases.

So the parsed value of the name would be "name" for both of the above, and they would be considered equal.

number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

Actually, numbers are complicated in bidi because digits are weakly directional. The minus sign can swing around onto the "wrong" side visually.

But number-literal only shows up in "code", which is always LTR, yes?

The other reason I had unquoted and quoted together is that it simplifies what tools have to do. A tool can blindly isolate any literal separate from the decision to quote it and can blindly remove isolates from literals without looking at the contents.

The proposed change doesn't change the number of constructs for which this can be done; it replaces "unquoted literals" with "names". Doing so lets us remove needing to separately and additionally pick out the LRM/RLM/ALM from the productions that include name.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change this PR to make your proposal an additional option, not overwriting the original design.

/ (quoted / (unquoted [bidi]))
quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
/ ("{{" pattern "}}")
name = (open-isolate name-body close-isolate) / name-body
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problem because name is used to build a variety of other constructs (variable, reserved-keyword, identifier, etc.). This change puts the isolates inside these constructs, e.g. $\u2066name\u2069 rather than on the outside.

This will make it harder for implementations, since they can't take the parsed token and compare it immediately. They have to stop to remove isolates. My original design avoided this problem by making the isolates not parse into names/identifiers/tokens.

@eemeli
Copy link
Collaborator Author

eemeli commented May 13, 2024

As requested, refactored as an alternative to the proposed solution. Also addressed the concerns identified in #787 and #788, and added an example showing how name isolation avoids a spillover the current proposal cannot.

I have also validated this solution by implementing it in my parser.

@eemeli eemeli requested a review from aphillips May 13, 2024 09:30
Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requested changes are editorial. Otherwise I would approve this addition.

I don't think I agree with this option. The strongly directional marks are included in the proposed solution for a different reason than might be assumed (I mention this below) and I don't think putting isolates into name has been fully accounted for.

exploration/bidi-usability.md Outdated Show resolved Hide resolved
Comment on lines +359 to +360
2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM,
allow for its proper isolation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call what we did above "patching". What we allow above with the strongly directional marks is allow bidi users to include them (to make the string look okay in a normal text editor) the way they might normally do when editing text. The productions we used don't make these marks part of the token, so they don't affect processing.

Allowing isolation is a separate consideration.


Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI.
For names and quoted literals, the isolate characters are outside the body of the token,
but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"middle" could mean anywhere inside the pattern quotes.

Suggested change
but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
but for quoted patterns, the isolates are in between the `{` and `}` in the `{{` and `}}` sequences.

exploration/bidi-usability.md Outdated Show resolved Hide resolved
```abnf
name = [open-isolate] name-start *name-char [close-isolate]
quoted = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puts the isolate inside the {{ and }}? Asking to be sure I'm reading this right. The above text didn't seem to mean this, although now I see your intention.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the intent: {\u2066{

Co-authored-by: Addison Phillips <addison@unicode.org>
@aphillips aphillips merged commit 7cdea8e into main May 13, 2024
1 check passed
@aphillips aphillips deleted the bidi-less branch May 13, 2024 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Design principles, decisions LDML46 LDML46 Release (Tech Preview - October 2024) syntax Issues related with MF Syntax
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants