-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve handling of malformed unicode bidi control characters #10251
Comments
cc @whatwg/i18n |
I18N discussed this in our call of 2024-04-18 and I drew an action to respond. The initial reaction I18N has was that this sounds like a problem and is worth solving. It feels like the idea is that element-level isolation is independent of text-level isolation. I would probably not link it just to The code looks like it might be effective. |
@aphillips I'm not sure if that question was directed at me, but that makes enough sense to me. I don't have any strong opinions about specifically how this behavior is added, just a general problem statement. |
cc @whatwg/css @fantasai |
I wrote up some thoughts on this on Twitter, but I figure I'd write it up properly for this issue: The thing that probably should be fixed is this text:
What you basically want this to do is for this to have the effect of inserting these control codes such that the corresponding push/pop directives are correctly retained in the bidi algorithm. The bidi algorithm matches things up in BD8-BD11. A bidi implementation could potentially allow the caller to specify a priority as to how things get matched. The spec doesn't have any affordances for this, but it could (we could expand on HL3, for example). However I don't really see implementations moving to such a model easily; since the API for such a model is a mess. Another way of fixing this is just sanitizing the content: remove any unmatched override/embedding/isolate initiators or pop directives. Not too complicated. Kind of annoying. I think the best thing to do from the CSS spec standpoint is to spec these as basically creating a new bidi context such that their content is processed in a separate bidi algorithm run than their surroundings. In this case, both override and isolate would run the bidi algorithm on their content with strong directionality as the directionality of the override/isolate element, and their surrounding content treats them as opaque embedded content (similar to how inline images / other inline-block elements get handled today), with the override/embedding ones being treated as LRM/RLM in the surrounding content, and the isolates being treated exactly like other inline-block elements (as the spec says, use U+FFFC). I believe It's unclear to me exactly which spec (CSS2 visual formatting?) handles how inline-block elements in bidi text are handled so I'm not quite clear on how this is best written in spec text, but I think that's the best model to follow. cc @eggrobin |
How is any other improperly paired element handled? I don't see this as different. Well, some of the tags can also be disruptive than a simple bold. And what about tags that are not explicitly included in the text I copy, but as a user I still expect the result to look the same:
If I copy-paste the marked section of text, when I paste it I expect it to be bold and red. Are these already solved? |
This is handled in the (X|XHT|HT)ML parser: closing tags are inserted automatically where necessary. (Unpaired closing tags are just ignored). Unfortunately, handling bidi controls in the parser would be major scope creep. It should absolutely be possible to represent unpaired bidi controls in XML, and it even should be possible to represent that in HTML, so sanitizing in the parser seems right out. Here, we want the application to be able to control where they actually "escape". So this cannot be handled the same way, or at least cannot be handled in the same place. The algorithm used could be the same: above I mention that one solution is to sanitize content by "remov[ing] any unmatched override/embedding/isolate initiators or pop directives", but a different way to do it would be to insert matching pop directives for unmatched control characters and remove lone pop directives. It's not clear to me that sanitization is the best path forward. The main three paths I see are:
|
I tend to agree with Manish, that #3 would be more difficult to get the lower-level implementations to handle. As to either Sanitize or Isolate, I think either one could work. I think a goal for Sanitize should be to "produce the same results as Isolate. |
I would also add from a web developer perspective: including untrusted HTML is one of the most well-known security vulnerabilities as it leads to XSS. As a result, typically all HTML special characters In contrast, it's generally assumed that characters other than these are safe to include. The vast majority of web applications make no attempt to sanitize bidi control characters present in untrusted user input. This is both because many people are not aware that this is even a possible issue and because there is no clear, authoritative guidance on how to do this correctly. For example @jamiebuilds's code at the top of this issue is a valiant effort but I understand based on my reading of UAX #9 that it may fail to reject strings with directional formatting characters invalidly interleaved with directional isolate characters. I would look to the Unicode consortium to clearly document a correct algorithm for doing this type of sanitization and would look to HTML to ideally make it easy to include this, such as by changing some of the existing bidi CSS properties to have an effect as others have mentioned. I also mentioned on Twitter that I'm unsure if there are other similar Unicode "security" concerns to be aware of. For example I imagine I don't generally want to allow combining marks at the beginning of a user-provided string. What else? This is ideally something where simple guidance would exist somewhere so that all web content authors can easily get this right. |
In theory, I guess a control character such as U+206C INHIBIT ARABIC FORM SHAPING in user input could also have undesirable effects on the context, though in practice (AFAIK) such codes are not widely implemented in current systems. |
They are also deprecated: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bdeprecated%7D |
The Sanitize algorithm that works identically to Isolate would be one that drops unpaired closing characters, and inserts extra closing characters at the end of text as needed. It would not need to insert extra closing characters in the middle of the text, for example LRI RLO PDI has an unclosed RLO, however the algorithm will automatically close it anyway. Probably best to do it to be safe anyway? There may still be tricky ways to break this around the fact that the bidi algorithm has a stack size, so an unpaired control character that is fine may become not-fine because its surrounding context got ignored because of the stack. Unclear. |
I may be misunderstanding but my interpretation was that this is likely not an issue due to these notes in UAX# 9:
|
Right, but that's for the entire run, if we're following the model of "sanitize and then run the algorithm on the entire text" rather than "isolate the algorithm runs", then the matching PDIs/etc may get inserted in places that cross the isolation boundary we want. UAX #9 does not contain any concept of strongly associated isolation boundary as desired here. The sanitization algorithm I listed prevents things from crossing the isolation boundary but I haven't put thought into all of the cases where depth limit stuff may cause certain things to change which boundaries actually show up. I'm not sure if there's a problem here, to be clear, but there might be. |
Interestingly, I ran into this problem this morning. I'm working on a presentation about bidi. My input looks like this: <p>You bought <q lang="ar" dir="rtl">\u2067HTML \u0648 CSS: \u062a\u0635\u0645\u064a\u0645 \u0648
\u0625\u0646\u0634\u0627\u0621 \u0645\u0648\u0627\u0642\u0639 \u0627\u0644\u0648\u064a\u0628</q>
by <span lang="en" dir="ltr">Jon Duckett</span>. My demo page unescapes the Arabic (the The resulting display looks like: Whereas the expected display looks like this: The challenge here is that the extraneous U+2067 starts an isolate run, which interferes with the If a string started life as |
Very good example. To put in other terms, it sounds like the ...<q><RLI>...</q>...<span>...</span>... is, in Unicode terms, acting like: ...<RLI><RLI>...<PDI>...<LRI>...<PDI>... which effectively adds a termination to the end that matches the very first <RLI> (since that is the unmatched one). ...<RLI><RLI>...<PDI>...<LRI>...<PDI>...<PDI> Whereas Manish's proposal would pin within to the isolate it belongs to, thus ...<RLI><RLI>...<PDI><PDI>...<LRI>...<PDI>... |
What is the issue with the HTML Standard?
Discussed this with @fantasai
When accepting user input, users will often paste text from other programs that sometimes contains malformed (unclosed, unopened) unicode bidi control characters. When you use this user input in HTML, wrapping them with
<bdi>
,<bdo>
, and/ordir
is not enough to avoid it breaking out of the supposed 'isolation', and it can wreak havoc on the surrounding text.In the simplest case, the text is appears as expected:
But if we pop the directional isolate, you can break out of the
<bdi>
You could of course wrap it with another
<bdo>
and it will be fixed again:What's missing is a balancing of these control characters. In order to combat this we've written a function to close fix-up these strings before they end up in the HTML.
Source: https://github.com/signalapp/Signal-Desktop/blob/ce0fb220411b97722e1e080c14faa65d23165784/ts/util/unicodeBidi.ts#L124
See Code
By running text through this function before it ends up in HTML, we can at least prevent the malformed text from affecting its surrounding text which better aligns with the stated goal of
<bdi>
:@fantasai suggested that this could possibly grafted onto
<bdi>
,<bdo>
, and possibly all other uses ofdir
as wellThe text was updated successfully, but these errors were encountered: