Consider fully supporting RTL and Bidi URLs #43

Open
mohsen1 opened this issue Dec 18, 2020 · 17 comments
Labels
decided i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.

Comments

@mohsen1

mohsen1 commented Dec 18, 2020

RTL text in URLs is not as uncommon as many think. This API must support RTL and bidi text in URLs and patterns. This matters because : is placed at the beginning of a path parameter name. In an RTL URL that would be a bit confusing unless it's fully specified.

@wanderview
Member

I won't be able to look at this closely until January, but can you provide some examples showing how these kinds of URLs would work with the current proposal and examples of what you think would be better? I'm not that familiar with how URLs are used with RTL languages. Thanks!

@wanderview
Member

Also note that currently URLPattern and path-to-regexp require group names to contain only ASCII alphanumerics and underscores. A group's matched value may include unicode characters that have been url encoded, but the group name cannot.
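
As a rough illustration of the restriction described in this comment (as it stood at the time), the sketch below checks a group name against an ASCII-only rule and shows that a matched value can still carry Unicode once percent-decoded. The helper name `isAsciiGroupName` is hypothetical and is not part of URLPattern or path-to-regexp.

```javascript
// Hypothetical helper: the ASCII-only group-name rule described above
// (ASCII alphanumerics and underscores only).
function isAsciiGroupName(name) {
  return /^[A-Za-z0-9_]+$/.test(name);
}

// A matched value may still contain Unicode, carried as percent-encoded UTF-8.
const matchedValue = "%D9%85%D8%B5%D8%B1"; // percent-encoded form of "مصر"
const decodedValue = decodeURIComponent(matchedValue);

console.log(isAsciiGroupName("name")); // ASCII name: accepted
console.log(isAsciiGroupName("مصر")); // Arabic name: rejected under this rule
console.log(decodedValue);
```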

@annevk
Member

annevk commented Dec 19, 2020

I somewhat doubt that would pass i18n review, especially given that the surrounding ecosystem does support Unicode identifiers.

@wanderview
Member

Sure, we can discuss unicode in group names. It will create some implementation difficulties because of layering, but it's probably solvable.

(Difficulties stem from wanting to use the browser's existing URL support for encoding characters to be consistent, but the parsing of the pattern is currently in a separate isolated library since it's a derived work of path-to-regexp. The isolated library doesn't have dependencies on the other bits of the browser in order to be portable. So it's difficult to move encoding into the parsing library, where it would be best to enforce something like "percent encode most characters, but not group names". Solvable, but maybe not in a super clean way.)

It would still be helpful for me to understand how URLs are used in RTL languages, though, and what kind of change is really being suggested.

The only idea I have had so far is to make named groups have a colon at both the beginning and end. So like /foo/:name:, but that would deviate from the convention folks are used to in a lot of other systems like ruby on rails, etc. We would be leaving the well lit path we were trying to follow by using path-to-regexp.

@blakeembrey, do you have any thoughts on this?

@wanderview
Member

Just a further thought on my implementation difficulties from above... I could maybe move encoding to be post-parse instead of pre-parse today. That would let us still keep browser URL specifics out of the isolated library. The main downside would then be that the library would need to operate on wide strings, which creates other portability issues; WTF::String vs nsString, etc.

@wanderview
Member

I opened #46 to tackle unicode characters in group names. I'd like to keep this issue for the RTL issue specifically.

I'm still looking for more guidance on the RTL use case and what these kinds of URLs look like. I'm very hesitant to deviate from the established convention of :name that goes back through many different systems; e.g. ruby on rails, etc.

@annevk
Member

annevk commented Feb 3, 2021

I would recommend reaching out to the i18n group at the W3C.

@wanderview
Member

At this point I don't plan to explore changes for RTL group names. Partly this is for reasons I expressed above. We are following existing systems that use :name style. This extends beyond just path-to-regexp back to ruby-on-rails, etc. Conforming to conventions is one of our goals here.

In addition, in #46 we adopted javascript identifier rules for what unicode characters to allow in a group name. The rules about what the first character can be are implicitly at the left of the string in this system. While it's unfortunate this makes it more difficult to adopt some RTL change, I think it is reasonable to adopt the semantics of the language this API will be used in.

For these reasons I don't have immediate plans to pursue a change for RTL group names.

@annevk
Member

annevk commented Mar 13, 2021

Why not get some external input?

@wanderview
Member

Well, I thought that was what I was doing here. And I had the impression i18n review would be of the spec once it was written.

For example, this says I should provide a link to the spec under discussion:

https://w3c.github.io/i18n-activity/guidelines/review-instructions.html#writeup-issue

Anyway, I tried filing an issue now even though there is no spec.

@r12a r12a added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Mar 24, 2021
@wanderview
Member

A draft spec (still needs some polish) is written and captures the current decision. I'm leaving this open since the i18n review request is still open AFAIK. (Note, we've gotten no other feedback from developers during dev trials indicating this is a problem.)

@aphillips

I18N review of core design concerns after the paint is dry on your spec is a recipe for hard conversations 😉. @r12a just brought this conversation to the top of our queue. Sadly, the above conversation is pretty old...

My initial personal observation is that there is a difference between logical order and visual order. Regular expressions, variable names, and other syntax are not processed visually. They are processed by machines that do not care about the visual ordering and it is not "LTR biased" to refer to code points at the start vs. end of an identifier or string, even if the examples, when presented in the Latin script, always show the starter (: here) on the left.

If name tokens are restricted to some subset of Unicode (such as ASCII letters and digits) that can lend to robustness of the solution. But it isn't very I18N friendly. If you choose to allow a larger range of Unicode into identifiers and expressions you will have to deal with the I18N implications of that choice. We have a whole document about some of the considerations you might encounter as a result. Worth a look if you're not aware of it.

If you allow RTL characters into identifiers, it probably will not change how you process the resulting expressions (as noted above), but it is a consideration for how users will interact with the syntax. Most plain text editors are aware of RTL characters and their visual representation. This can affect how the syntax appears to users, e.g. :مصر is really \u003a\u0645\u0635\u0631, but it appears to have the colon "on the right". [Note: it does not look that way in the github editor--I decorated it with a span to ensure I get the right effect] You may need to address how bidi will affect users (including so-called "Trojan Source" attacks that use bidi to make source visually appear to mean something different from what the processor understands).
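
To make the logical-versus-visual distinction concrete, here is a small sketch that lists the code points of the token from this comment in the order a processor sees them. The variable names are illustrative only.

```javascript
// The token from the comment above, written with escapes so that the
// logical (stored) order is unambiguous regardless of how an editor renders it.
const token = "\u003a\u0645\u0635\u0631"; // ":مصر"

// Array.from iterates a string by code point, in logical order.
const codePoints = Array.from(
  token,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints.join(" ")); // U+003A U+0645 U+0635 U+0631
console.log(token.startsWith(":")); // true: the colon is at the start, wherever it is drawn
```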

Ultimately it helps to not refer to the "left" and "right" of a string. These terms are (erm) "positionally biased" vs. logical terms such as "start" and "end" (which is how things actually work anyway). Using such logical terms helps eliminate distractions in conversations about syntactical elements.

Anyway, I've added this topic to W3C-I18N's agenda for this week's teleconference.

@wanderview
Member

In the time since I requested the review the spec has been written. There are also multiple implementations shipped.

The permitted characters are aligned with those permitted by javascript identifiers:

https://wicg.github.io/urlpattern/#is-a-valid-name-code-point

Note, javascript identifier encoding has LTR baked in due to how it treats IdentifierStart differently than IdentifierPart. Aligning with javascript seemed appropriate since this API integrates with javascript.

I'm sorry this review did not happen earlier, but unless a severe problem is found I doubt we will be able to make a change now.

@aphillips

Thanks for the update.

I doubt we'll find a severe problem, but I will again point out: identifier encoding doesn't have "LTR baked in" because it works in terms of code points in logical order (it is identifierStart, not identifierLeft after all...). Maintaining the distinction can be helpful to you and others when answering requests such as the above.

@wanderview
Member

I'm going to close this for now, but please re-open if there is new feedback.

@aphillips

aphillips commented Feb 8, 2024

I drew an action item to re-review this issue on behalf of I18N.

I'm going to reopen this issue because I don't think urlpattern has successfully addressed the problem of bidirectional characters in patterns or, perhaps more correctly, that urlpattern could do a better job of clarifying how bidi interacts with patterns.

> Note, javascript identifier encoding has LTR baked in due to how it treats IdentifierStart differently than IdentifierPart. Aligning with javascript seemed appropriate since this API integrates with javascript.

This doesn't fix anything. JavaScript IdentifierStart and IdentifierPart restrict the range of characters allowed in an identifier to try to prevent e.g. combining marks at the start of an identifier. That is, this works:

const مصر  = new Date();
console.log(مصر );
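
The start/part distinction can be sketched with Unicode property escapes. This is only an approximation of the ECMAScript IdentifierName grammar (it ignores `\u` escape sequences), and the helper name `isValidGroupName` is hypothetical rather than anything defined by the URLPattern spec.

```javascript
// Approximation of ECMAScript IdentifierName:
//   first code point: ID_Start, "$" or "_"
//   remaining:        ID_Continue, "$", ZWNJ (U+200C) or ZWJ (U+200D)
const IDENTIFIER_NAME = /^[\p{ID_Start}$_][\p{ID_Continue}$\u200C\u200D]*$/u;

// Hypothetical helper, not an API from the spec.
function isValidGroupName(name) {
  return IDENTIFIER_NAME.test(name);
}

console.log(isValidGroupName("id")); // Latin letters: allowed
console.log(isValidGroupName("\u0645\u0635\u0631")); // "مصر": RTL letters are allowed
console.log(isValidGroupName("\u0301abc")); // starts with a combining mark: rejected
```

Note that the rule is stated purely in terms of the first code point in logical order; nothing in it depends on which side of the screen that code point is drawn on.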

Bidi characters can cause the pattern to be displayed in a degraded way, even though a parser would have no difficulty processing the pattern correctly (but correcting the character sequence in a pattern to make it "look right" could result in a malformed or non-functional pattern--since this usually involves inserting control characters). This is especially a challenge in URL pattern because the characters used as delimiters are punctuation (neutral-direction) characters.

In the spec one example looks like this:

/products/{:id}?

If the names product and id are replaced by Arabic (or other RTL) tokens, the pattern might look like this:

/منتج/{:معرف}?

This is still a valid pattern (it is the exact same pattern as the original example). The code point sequence is displayed in what appears to be an invalid way, but the processor will handle it just fine.

More complex things can happen with patterns that result in confusion by users when looking at a pattern (where the effect of bidi will be less obvious to the casual viewer). I note that urlpattern does not include a security considerations section and it doesn't include any mention of bidi. I think some mention of UTR#55 is probably called for. Right-to-left characters are permitted in url pattern identifiers and could be used to create patterns that spoof different patterns.

(FWIW, I just recently had to write a security considerations section in Unicode's MFv2 project, which has extremely similar potential issues (we use the same delimiters for similar purposes). That version is here)

Happy to discuss.

@aphillips aphillips reopened this Feb 8, 2024
@aphillips aphillips added i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. and removed i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Feb 8, 2024
@jeremyroman
Collaborator

I spent some time today reading through UTS 55, UAX 31, and CHARMOD. I'm still not entirely clear on what your request/recommendation here is, so would appreciate more insight on that point.

We have a few things we'd like to be consistent with, including WHATWG URL, ECMA-262 regular expressions, and ECMAScript (JavaScript) -- the web technologies most likely to be used in combination with URL patterns.

The major thing you've raised, if I understand correctly, is the potential for RTL characters to lead to a misleading visual rendering of a URL pattern, though of course if you were to read the code point sequence directly there is no ambiguity.

As far as I can tell, URLs get the same treatment when not URL-encoded: https://example.com/%D9%85%D9%86%D8%AA%D8%AC/%D9%85%D8%B9%D8%B1%D9%81 renders as https://example.com/منتج/معرف, even though this places the two path components in opposite visual orders (URLs under the hood operate entirely in ASCII). At least, this doesn't seem to be a new issue for the platform. More importantly, I'm worried that some resolutions might end up being equally confusing by mismatching with how URLs work.
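
The round trip between the two forms of this example URL can be checked with the standard WHATWG `URL` class, as in the sketch below. Variable names are illustrative.

```javascript
// Parsing the human-readable form percent-encodes the non-ASCII path segments.
const url = new URL("https://example.com/منتج/معرف");
console.log(url.pathname);

// Decoding each segment recovers the original text in logical order:
// the first segment stays first, regardless of how a bidi renderer draws it.
const segments = url.pathname
  .split("/")
  .filter(Boolean)
  .map((segment) => decodeURIComponent(segment));
console.log(segments);
```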

One thing UAX 31 § 4.1.1 discusses is ignoring directional marks next to identifiers. That would allow a source file to be written so that it flows in a consistent direction, containing the bidi flow within a particular identifier or similar without intruding on direction-neutral syntax characters. I'm not sure whether that's an improvement, though: it doesn't prevent misleadingly rendered patterns from being typed, editors may not emit the marks, and if everything else is RTL it may depart further from expectations (but I don't know enough about RTL languages to know for sure what the expectations are). It's also not permitted in the similar case of named capture groups in ECMAScript regular expressions, as far as I can tell.

CHARMOD advises somewhat against allowing non-ASCII characters in "application internal identifiers", which these identifiers (but not literals that appear in the URL) are. But ECMAScript and ECMAScript regular expressions do permit them, and disallowing non-ASCII seems like it's also not a great experience for speakers of those languages, anyway.

Otherwise, we could certainly write a "security considerations" section that warns that it is possible to write URL patterns using non-ASCII characters which may have a misleading or confusing visual rendering. There's currently nothing that displays these to the user, so this wouldn't really have any normative effect -- just a warning to developers and a possible opportunity for tool authors.
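
As one possible shape for the tooling opportunity mentioned here, the sketch below flags patterns that contain explicit bidi control characters or letters from a few RTL scripts. It is a heuristic under stated assumptions, not anything the spec defines: the helper name `describeBidiRisk` is hypothetical, and the script list is a deliberately incomplete stand-in for a real Bidi_Class lookup, which JavaScript regular expressions do not expose.

```javascript
// Explicit bidi controls (e.g. U+202E RIGHT-TO-LEFT OVERRIDE), as used in
// "Trojan Source" style attacks.
const BIDI_CONTROL = /\p{Bidi_Control}/u;

// Assumption: a small, incomplete set of RTL scripts as a proxy for
// "contains strongly right-to-left characters".
const RTL_SCRIPT =
  /[\p{Script=Arabic}\p{Script=Hebrew}\p{Script=Syriac}\p{Script=Thaana}]/u;

// Hypothetical lint helper for tool authors.
function describeBidiRisk(pattern) {
  return {
    hasBidiControl: BIDI_CONTROL.test(pattern),
    hasRtlLetters: RTL_SCRIPT.test(pattern),
  };
}

console.log(describeBidiRisk("/products/{:id}?"));
console.log(describeBidiRisk("/منتج/{:معرف}?"));
console.log(describeBidiRisk("/products/\u202E{:id}?"));
```

A tool could use a result like this to show the pattern's code points in logical order next to its rendered form, rather than to reject the pattern outright.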

What did you have in mind?
