BIDI algorithm not working properly? #76
Comments
More to the point, I wonder how, without explicitly selecting an Arabic font, you're getting any Arabic output at all. Could you run with the latest repository and the command line option |
Sorry, I have a typo above. Let me try to explain again with the code. The output from the code below is wrong. Each Urdu word is being typeset LTR but the paragraph is being set RTL.
The code below (note the additional
|
Oh, and the output that you posted is wrong as well. The Urdu words are being typeset LTR, like in my first example. |
OK, thanks; I'm afraid I don't read Arabic/Urdu very well so I can't see at a glance if it's doing the right thing or not! Now I understand the problem, I will try and work on it. |
Please try with that commit. I think it does the right thing but I'm not sure. |
This definitely fixes the common case, which is nice. Here is a file that exposes some edge cases. Whenever the number is separated by spaces on both sides, it is typeset correctly. But when it appears right after the ‘۔’ (Arabic Full Stop), or after an Arabic letter, it gets typeset RTL. I am not an expert on the BIDI algorithm, so I can’t say what the right thing to do is here. Just pointing out these cases.
|
Thanks, those are some good test cases. I know what's happening - SILE takes a shortcut and assumes that all characters in a shaping run are of the same bidi class, but that doesn't hold up in the cases you've found. I shall work on it... |
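To make the failing assumption concrete, here is a minimal Python sketch (not SILE code; SILE itself is not written in Python) that splits a single word-like token into runs of identical bidi class, using only the standard library's `unicodedata.bidirectional`. The helper name `bidi_class_runs` is made up for illustration.

```python
import unicodedata
from itertools import groupby

def bidi_class_runs(token):
    """Split a token into (bidi_class, substring) runs of identical bidi class."""
    return [
        (cls, "".join(chars))
        for cls, chars in groupby(token, key=unicodedata.bidirectional)
    ]

# An Arabic full stop directly followed by ASCII digits:
# one space-delimited "word", but two different bidi classes.
after_full_stop = bidi_class_runs("\u06d4" + "1947")

# A year range: the hyphen has its own class between the two numbers.
year_range = bidi_class_runs("1947-1955")

print(after_full_stop)
print(year_range)
```

A token that comes back with more than one run cannot safely be treated as having a single direction, which is the situation the test file above triggers.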
I think the proper way to do itemization is to first resolve the script property of each character, do bidi itemization, then further split the bidi runs into script runs. Resolving the script before breaking the bidi runs is important so that the script property of common characters across bidi runs is resolved properly. |
After doing a little bit more research and reading, it appears to me that associating a direction property with the font (like SILE does) is not entirely appropriate. Each Unicode character has an implicit bidirectional type. In addition, there is a default base direction of the document, page, frame (or whatever context you want to associate a base direction with). Associating a direction property with a font incorrectly conflates two unrelated things. A font is just a collection of glyphs. By itself, it does not have a direction. For example, there may be fonts which provide glyphs for both English and Arabic, in which case typesetting a paragraph in the same font, but containing English and Arabic characters, should obey the Unicode bidi algorithm. |
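As a rough illustration of keeping direction separate from the font, the sketch below derives a paragraph's base level either from an explicit context setting or, failing that, from the first strong character (the idea behind UAX #9 rules P2 and P3). It is a simplification: it ignores isolates and embeddings, and the function name and `explicit` parameter are hypothetical, not part of SILE.

```python
import unicodedata

def paragraph_level(text, explicit=None):
    """Return 0 (LTR) or 1 (RTL) for a paragraph.

    `explicit` models a base direction supplied by the surrounding
    context (document, page, frame). No font is consulted anywhere.
    """
    if explicit is not None:
        return explicit
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return 0
        if cls in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to LTR

latin_first = paragraph_level("abc \u0627\u0628")
arabic_first = paragraph_level("123 \u0627\u0628 abc")  # digits and spaces are not strong
forced_rtl = paragraph_level("abc", explicit=1)
```

The same text would get the same answer whichever font it is set in, which is the point being made above.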
I agree completely with the above comment. |
Thanks, both - looks like a lot of rethinking will need to be done. I will work on it, but it may be a couple of weeks before I get back to this. |
I don't know the internals of SILE, but here are some ideas:
|
@khaledhosny Thank you for your input. Do you have any ideas on how to deal with BiDi reordering when an entire paragraph is not immediately available to you as a bunch of characters? For example, how do you treat directives within a paragraph like italicizing or bolding of text, font changes etc.? Do all of these get stored in text nodes first, and acted upon later after the BiDi reordering has completed? Are there any code examples that one could look at to get a sense of how something like that could be done? From what I could figure out, SILE currently does shaping (using Harfbuzz) before it does BiDi processing. In fact, it tries to shape the characters as soon as it can, before moving on to the rest of the paragraph. For example, a paragraph like:
would be processed in the following sequence:
(This is obviously incorrect for at least a couple of reasons: shaping is happening before BiDi reordering, and even then, the BiDi reordering is buggy because it is not acting upon the entire sequence of characters, but on HBoxes of words.) My question is: how does one represent a paragraph like the one above as a series of text nodes that could be subjected to proper BiDi reordering (as laid out in UAX #9) and be shaped with the proper fonts as well? |
Oh, and just FYI I am doing some ad-hoc hacking in an experimental branch: https://github.com/deepakjois/sile/tree/exp It started off as an effort to understand SILE’s internals better. In that process, I was able to improve the typesetting of my Urdu document from this (typeset with latest SILE from master) to this (typeset with SILE from my experimental branch). The BiDi reordering is still buggy, but I managed to get the most common case to work correctly. I was hoping to tackle proper full paragraph BiDi reordering before shaping, once I wrap my head around UAX #9 fully. |
I think font, colour etc. should be stored as node properties (assuming TeX-like nodes, or whatever SILE uses internally) and after BiDi they should be used when deciding run boundaries as appropriate, e.g. a font change should end the current run, just like a script or language change, but a colour change shouldn’t and should instead be applied to the glyphs corresponding to the characters it was applied to (HarfBuzz’s cluster values should be used to map output glyphs to input characters). |
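A small sketch of that idea, under loud assumptions: the `Char` record and both helpers are hypothetical, the shaper is not actually called, and the cluster lists are hard-coded stand-ins for what a shaper such as HarfBuzz would report (one cluster value per output glyph, pointing back at an input character index). Script and language changes, which should also end a run, are left out to keep it short.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Char:
    text: str
    font: str    # a font change ends the current shaping run
    colour: str  # a colour change does not

def shaping_runs(chars):
    """Group consecutive characters by font only; colour never splits a run."""
    return [list(group) for _, group in groupby(chars, key=lambda c: c.font)]

def colour_glyphs(run, clusters):
    """Map each output glyph back to the colour of the input character
    its cluster value points at."""
    return [run[index].colour for index in clusters]

chars = [
    Char("a", "Regular", "black"),
    Char("b", "Regular", "red"),
    Char("c", "Bold", "black"),
]
runs = shaping_runs(chars)

# Pretend the shaper produced one glyph per character for the first run...
one_to_one = colour_glyphs(runs[0], [0, 1])
# ...or merged both characters into a single glyph whose cluster is 0.
ligature = colour_glyphs(runs[0], [0])
```

The ligature case shows why colour cannot be a run boundary without consequences: splitting there would prevent the two characters from ever being shaped together.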
One of the things I'm working on now is rewriting the shaping interface so that (a) Pango and Harfbuzz share common code (even if Pango is more-or-less deprecated this helps us to get the separation of concerns in the right places), and therefore (b) we can add character direction information as part of the shaping process. More in a few days. |
@simoncozens Would that mean that shaping would still happen before BIDI reordering? I just want to point out – it may not make a lot of difference for scripts like English, but Arabic script, for example, has contextual glyphs depending on where (beginning, middle or end) the character appears in the word. So determining the direction of the characters is necessary for shaping them correctly. Otherwise, reversing the direction of characters that have already been shaped would lead to incorrect results. |
Basically my reason for doing it (other than moving all the shaping bugs into the one file :-) ) is to provide more granularity in altering the shaping process and annotate notes before, during and after shaping. It won't completely fix the bidi issue---you will still need to fiddle around during |
Urgh, I'm confused about how my own code works. Ignore what I said about |
@simoncozens Makes sense. Will wait for your changes. As an aside, what do you think about @khaledhosny’s suggestion above of storing font, colour etc. as ‘node’ properties. I get that it is not how it works currently, and the whole process of building a paragraph is stateful. Any specific reason you chose to go with that sort of approach? |
OK, so I have refactored the shapers and I think it's helped me understand this issue. In my mind, SILE does store font, color as a node property; that's basically what the The problem comes that when SILE was implemented in pango/cairo, SILE's When I moved to Harfbuzz, I now realise that I skipped that itemizing step, and I think at the very least we need to add an itemize method to the shaper class. For Pango this can (I think) call @khaledhosny, does that make sense? |
Yes, |
…s for more sophisticated tokenization later. See #76.
…t / state and gets shaped later. See #76.
Well, I think I've fixed it, and made a bunch of other improvements to Arabic rendering as well. (Tashkil and other diacritics should now be positioned properly.) Now text is split into tokens of the same bidi group, then the bidi algorithm is run, and then the nodes are passed to Harfbuzz for shaping. Unfortunately I still can't read Arabic, so let me know if anything is not working properly. |
From some initial testing against a test document, it seems to work properly. Will do a more thorough investigation next week. |
Combining marks seem to be breaking the shaping (at least in the sample PDF you committed). |
OK. The code you highlighted in |
Wait, I think a light has come on: is |
Right, it is important that the paragraph is processed as a whole, because neutral and weak characters are influenced by their context, so if the context is lost the output of the algorithm can be substantially different. |
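A toy illustration of that context dependence, loosely modelled on UAX #9 rules N1 and N2. It is deliberately incomplete: every non-strong character is treated as a neutral (the real algorithm resolves weak types such as digits first), and explicit embeddings are ignored. The function name is made up.

```python
import unicodedata

# Strong bidi classes and the direction they impose.
STRONG = {"L": "L", "R": "R", "AL": "R"}

def resolve_neutrals(text, base="L"):
    """Return one direction ('L' or 'R') per character.

    A run of non-strong characters takes the direction of its strong
    neighbours when both sides agree, otherwise the base direction.
    The start and end of the text count as having the base direction.
    """
    dirs = [STRONG.get(unicodedata.bidirectional(ch)) for ch in text]
    i = 0
    while i < len(dirs):
        if dirs[i] is not None:
            i += 1
            continue
        j = i
        while j < len(dirs) and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base
        after = dirs[j] if j < len(dirs) else base
        fill = before if before == after else base
        dirs[i:j] = [fill] * (j - i)
        i = j
    return dirs

# The same space character, in an LTR paragraph:
in_context = resolve_neutrals("\u0627\u0628 \u062c\u062f")  # between two Arabic words
in_isolation = resolve_neutrals(" ")                          # fed to the algorithm alone
```

Between two Arabic words the space resolves to R, while the same space processed on its own falls back to the base direction, so feeding tokens to the algorithm one at a time changes the result.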
… to the bidi algorithm individually, then put back into tokens before shaping.
OK. I'm sorry I'm being so slow on this bug, but I really don't know what I'm doing with bidi and/or Arabic. I think it works now. (But I thought that several times before already...) Now I split the tokens up into individual Unicode characters before sending them to the bidi algorithm, which reorders them. After the bidi algorithm does its stuff, we have a list of (properly-ordered) Unicode characters. But we can't send those characters one-by-one to be shaped, because otherwise we get a load of isolated Arabic forms. And we can't send the whole paragraph to be shaped all at once, because we want fine control over turning spaces into glue nodes, and because there may be font changes etc. during the course of the paragraph. So we have to turn the list of characters into something else. The It seems to produce output that looks right to me, even for @deepakjois's 1947-1955 test cases above. But I don't know enough to tell whether neutral, weak and combining characters are being handled properly. |
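The regrouping step described here could look roughly like the following. This is only a sketch of the general idea, not SILE's actual code: the tuple layout, the `regroup` name and the embedding levels in the example are all assumptions made for illustration.

```python
from itertools import groupby

def regroup(chars):
    """Turn per-character (char, level, font) tuples back into tokens.

    Consecutive characters are merged while they agree on "is a space",
    embedding level and font. Spaces come out as separate 'glue' tokens;
    everything else becomes a 'text' token that can be shaped as a unit.
    """
    def key(item):
        ch, level, font = item
        return (ch.isspace(), level, font)

    tokens = []
    for (is_space, level, font), group in groupby(chars, key=key):
        text = "".join(ch for ch, _, _ in group)
        tokens.append(("glue" if is_space else "text", text, level, font))
    return tokens

# Two Arabic letters, a space, then two digits at a higher (assumed) level.
chars = [
    ("\u0627", 1, "F"), ("\u0628", 1, "F"),
    (" ", 1, "F"),
    ("1", 2, "F"), ("2", 2, "F"),
]
tokens = regroup(chars)
```

Grouping this way keeps the Arabic letters together so they can join properly when shaped, while the space stays available to become a glue node.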
It seems to work fine now. |
Not sure if this was the right thread (too lazy to re-read all the comments again), but here is something I tried with Scribus which seems to work fine so far:
This seems to give me proper line breaking, and works whether the base paragraph direction is LTR or RTL, and allows me to keep using whatever line breaking code I had unmodified. |
Incidentally, I have been looking into https://github.com/simoncozens/sile/blob/master/examples/arabic.pdf and it suffers from the same issue I had with Scribus initially: when a line break happens in LTR text in an RTL paragraph, the 1st part of the text goes to the 2nd line and vice versa. On the other hand, we have been using the method I outlined above in Scribus for a while now and it has been working well so far. |
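One way to see the line-breaking problem is through UAX #9 rule L2, which reverses runs by embedding level and is meant to be applied to each line after breaking. The sketch below is not the Scribus method (which is not reproduced in this thread) and not SILE code. Uppercase letters stand in for RTL characters, and the embedding levels are assumed rather than computed.

```python
def reorder_line(text, levels):
    """Rule L2 on a single line: from the highest level down to the lowest
    odd level, reverse every contiguous run at that level or higher."""
    items = list(zip(text, levels))
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return text
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(items):
            if items[i][1] < level:
                i += 1
                continue
            j = i
            while j < len(items) and items[j][1] >= level:
                j += 1
            items[i:j] = items[i:j][::-1]
            i = j
    return "".join(ch for ch, _ in items)

# RTL paragraph (level 1) containing the LTR phrase "hello world" (level 2).
para = "ABC hello world DEF"
levels = [1] * 4 + [2] * 11 + [1] * 4

# Break in logical order between "hello" and "world", then reorder each line.
line1 = reorder_line(para[:9], levels[:9])
line2 = reorder_line(para[10:], levels[10:])

# Reordering the whole paragraph first gives a single visual string.
whole = reorder_line(para, levels)
```

Breaking first keeps "hello" on line 1 and "world" on line 2. If the whole paragraph is reordered first and the visual string is then filled into lines starting from the right-hand end, "world" lands on the first line and "hello" on the second, which matches the symptom described above.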
Note that the issue initially reported and tested for here is currently working as expected, but the issue was actually re-opened on account of the implementation being thought wrong.
Here is a sample document. The numbers within the script should be formatted LTR, as per Unicode BIDI rules, but they are not.
I did notice that in the example docs (like showoff.sil) the numbers are explicitly formatted LTR using a \font directive to wrap them. But I believe that should not be required. Any idea what is causing this?