BIDI algorithm not working properly? #76

deepakjois · 2015-04-27T04:24:06Z

Here is a sample document. The numbers within the script should be formatted LTR, as per Unicode BIDI rules, but they are not.

I did notice that in the example docs (like showoff.sil) the numbers are explicitly formatted LTR using a \font directive to wrap them. But I believe that should not be required. Any idea what is causing this?

\begin[papersize=a4]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[size=18pt]
سعادت حسن منٹو اردو کے ایک جانے مانے افسانہ نگار ہیں۔ وہ پنجاب کے لدھانہ ضلع میں  1912 میں پیدا ہوئے۔ ان کے ماں
 باپ کشمیری مسلمان تھے۔ وہ علی گڑھ مسلم یونیورسٹی سے تعلیم حاصل کر نے کے بعد انجمن ترقی پسند مصنفین ہند کے
 ساتھ جڑے۔ 1936 میں وہ بمبئی شہر جا کر بس گئے جہاں انہوں نے اپنے کئی مشہور افسانے اور فلم کے سکرپٹ لکھے
۔1947 میں بٹوارے کے بعد وہ لاہور چلے گئے۔ وہاں ان کی ملاقات فیض احمد فیض، سید ناصر رضا كاظمی
 اور احمد ندیم قاسمی‎ سے ہوئی اور وہ اخباروں کے لئے مضمون لکھنے لگے۔ 1955 میں ان کا انتقال ہوا۔
\end{document}

The text was updated successfully, but these errors were encountered:

simoncozens · 2015-04-27T04:36:04Z

Not sure what you mean; here is what I see:

Although I set the font family explicitly. (\font[size=18pt,family=Scheherazade])

I think what might be happening is that SILE has a concept of font direction and frame direction. You've set the frame direction to RTL, which is fine, but if you're still using the default font, (Gentium, English) then the expected output is LTR. So things might be getting reversed because of that.

simoncozens · 2015-04-27T04:41:15Z

More to the point I wonder how, without explicitly selecting an Arabic font, you're getting any arabic output at all. Could you run with the latest repository and the command line option --debug=fonts?

deepakjois · 2015-04-27T04:50:48Z

Sorry, I have a typo above. Let me try to explain again with the code.

The output from the code below is wrong. Each Urdu word is being typeset LTR but the paragraph is being set RTL.

\begin[papersize=a5]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[family=Amiri,language=urd,script=Arab]
\font[size=18pt]
سعادت حسن منٹو اردو کے ایک جانے مانے افسانہ نگار ہیں۔ وہ پنجاب کے لدھانہ ضلع میں  1912 میں پیدا ہوئے۔ ان کے ماں
 باپ کشمیری مسلمان تھے۔ وہ علی گڑھ مسلم یونیورسٹی سے تعلیم حاصل کر نے کے بعد انجمن ترقی پسند مصنفین ہند کے
 ساتھ جڑے۔ 1936 میں وہ بمبئی شہر جا کر بس گئے جہاں انہوں نے اپنے کئی مشہور افسانے اور فلم کے سکرپٹ لکھے
۔1947 میں بٹوارے کے بعد وہ لاہور چلے گئے۔ وہاں ان کی ملاقات فیض احمد فیض، سید ناصر رضا كاظمی
 اور احمد ندیم قاسمی‎ سے ہوئی اور وہ اخباروں کے لئے مضمون لکھنے لگے۔ 1955 میں ان کا انتقال ہوا۔
\end{document}

The code below (note the additional direction=RTL in the font declaration) typesets the Urdu words correctly. It typesets the paragraph RTL as well. But the numbers are being typeset wrongly. Hope that makes sense.

\begin[papersize=a5]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[family=Amiri,language=urd,direction=RTL,script=Arab]
\font[size=18pt]
سعادت حسن منٹو اردو کے ایک جانے مانے افسانہ نگار ہیں۔ وہ پنجاب کے لدھانہ ضلع میں  1912 میں پیدا ہوئے۔ ان کے ماں
 باپ کشمیری مسلمان تھے۔ وہ علی گڑھ مسلم یونیورسٹی سے تعلیم حاصل کر نے کے بعد انجمن ترقی پسند مصنفین ہند کے
 ساتھ جڑے۔ 1936 میں وہ بمبئی شہر جا کر بس گئے جہاں انہوں نے اپنے کئی مشہور افسانے اور فلم کے سکرپٹ لکھے
۔1947 میں بٹوارے کے بعد وہ لاہور چلے گئے۔ وہاں ان کی ملاقات فیض احمد فیض، سید ناصر رضا كاظمی
 اور احمد ندیم قاسمی‎ سے ہوئی اور وہ اخباروں کے لئے مضمون لکھنے لگے۔ 1955 میں ان کا انتقال ہوا۔
\end{document}

deepakjois · 2015-04-27T04:52:08Z

Oh, and the output that you posted is wrong as well. The Urdu words are being typeset LTR, like in my first example.

simoncozens · 2015-04-27T05:46:41Z

OK, thanks; I'm afraid I don't read Arabic/Urdu very well so I can't see at a glance if it's doing the right thing or not! Now I understand the problem, I will try and work on it.

simoncozens · 2015-04-27T06:20:30Z

Please try with that commit. I think it does the right thing but I'm not sure.

deepakjois · 2015-04-27T06:38:15Z

This definitely fixes the common case, which is nice. Here is a file that exposes some edge cases. Whenever the number is separated by spaces on both sides, it is typeset correctly. But when it appear right after the ‘۔’ (Arabic Full Stop), or after an Arabic letter, it gets typeset RTL.

I am not an expert on the BIDI algorithm, so I can’t say what is the right thing to do here. Just pointing out these cases.

\begin[papersize=a5]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[family=Amiri,language=urd,direction=RTL,script=Arab]
\font[size=18pt]
1955 میں ان کا انتقال ہوا۔

1955 میں ان کا انتقال ہوا۔1947 میں وہ۔

1955 میں ان کا انتقال ہوا۔ 1947 میں وہ۔

میں 1955 میں آیا تھا

میں1955 میں آیا تھا
\end{document}

simoncozens · 2015-04-27T07:00:06Z

Thanks, those are some good test cases. I know what's happening - SILE takes a short cut and assumes that all characters in a shaping run are of the shame bidi class, but that doesn't hold up in the cases you've found. I shall work on it...

khaledhosny · 2015-04-27T19:27:50Z

I think the proper way to do itemization is to first resolve the script property of each character, do bidi itemization, then further split the bidi runs into script runs. Resolve the script before breaking the bidi runs is important so that script property of common characters across bidi runs are resolved properly.

deepakjois · 2015-04-28T15:31:38Z

After doing a little bit more research and reading, it appears to me that associating a direction property to the font (like SILE does) is not entirely appropriate. Each unicode character has an implicit bidirectional type. In addition there is a default base direction of the document, page, frame (or whatever context you want to associate a base direction with).

Associating a direction property with a font incorrectly conflates two unrelated things. A font is just a collection of glyphs. By itself, it does not have a direction. For example, there may be fonts which provide glyphs for both English and Arabic, in which case typesetting a paragraph in the same font, but containing English and Arabic characters should obey the Unicode bidi algorithm.

khaledhosny · 2015-04-28T17:24:29Z

I agree completely with the above comment.

simoncozens · 2015-04-28T23:36:17Z

Thanks, both - looks like a lot of rethinking will need to be done. I will work on it, but it may be a couple of weeks before I get back to this.

khaledhosny · 2015-05-03T15:36:54Z

I don't know the internals of SILE, but here are some ideas:

The document, frames and paragraphs should have a direction, and by default it inherits from its parent.
The paragraph direction is the fare direction when applying the BiDi algorithm.
Text nodes should have a script and language properties. The script defaults to the value from Unicode character properties, and the language to the document language. There should be a way to override text language.
Characters with common or inherit script should be resolved according the Unicode algorithm.
BiDi algorithm should then be applied to each paragraph, splitting it into BiDi runs.
BiDi runs should then be reordered, but without rendering the text of the RTL runs.
The runs should then be split further based on script and language, and then each run shaped with HarfBuzz using the script, language and direction. HarfBuzz will take care of reversing the glyphs in RTL runs.

deepakjois · 2015-05-03T17:39:48Z

@khaledhosny Thank you for your input. Do you have any ideas on how to deal with BiDi reordering when an entire paragraph is not immediately available to you as a bunch of characters. For example, how do you treat directives within a paragraph like italicizing or bolding of text, font changes etc. Do all of these get stored in text nodes first, and acted upon later after the BiDi reordering has completed? Are there any code examples that one could look at to get a sense of how something like that could be done.

From what I could figure out, SILE currently does shaping (using Harfbuzz) before it does BiDi processing. In fact, it tries to shape the characters, as soon as it can before moving on to the rest of the paragraph. For example, a paragraph like:

Neque porro \em{quisquam est} \font[size=24pt] dolorem ipsum quia dolor sit amet

would be processed in the following sequence:

shape Neque porro, a word at a time (which are then stored as TeX like Hboxes)
change some internal state to indicate that the italic font face must be chosen
shape quisquam est
change internal state again to indicate the font size is 24pt regular
shape dolorem ipsum quia dolor sit amet, a word at a time.
BiDi reorder the Hboxes

(This is obviously incorrect for at least a couple of reasons: shaping is happening before BiDi reordering, and even then, the BiDi reordering is buggy because it is not acting upon the entire sequence of characters, but on HBoxes of words)

My question is, how does one represent a paragraph like above as a series of text nodes that could be subjected to proper BiDi reordering (as laid out in UA9) and be shaped with the proper fonts as well.

deepakjois · 2015-05-03T17:49:35Z

Oh, and just FYI I am doing some ad-hoc hacking in an experimental branch: https://github.com/deepakjois/sile/tree/exp

It started off as an effort to understand SILE’s internals better. In that process, I was able to improve the typesetting of my Urdu document from this (typeset with latest SILE from master) to this (typeset with SILE from my experimental branch). The BiDi reordering is still buggy, but I managed to get the most common case work correctly. I was hoping to tackle proper full paragraph BiDi reordering before shaping, once I wrap my head around UAX #9 fully.

khaledhosny · 2015-05-03T19:34:21Z

I think font, colour etc. should be stored as node properties (assuming TeX-like nodes, or whatever SILE use internally) and after BiDi they should be used when deciding run boundaries as appropriate e.g. font change should end the current run, just like script or language change, but colour change shouldn’t and should instead be applying to the glyphs corresponding to characters it were applied to (HarfBuzz’s cluster values should be used to map output glyphs to input characters).

simoncozens · 2015-05-03T22:48:59Z

One of the things I'm working on now is rewriting the shaping interface so that (a) Pango and Harfbuzz share common code (even if Pango is more-or-less deprecated this helps us to get the separation of concerns in the right places), and therefore (b) we can add character direction information as part of the shaping process. More in a few days.

deepakjois · 2015-05-04T01:30:44Z

@simoncozens Would that mean that shaping would still happen before BIDI reordering? I just want to point out – it may not make a lot of difference for scripts like English, but Arabic script for example has contextual glyphs for when (beginning, middle or end) the character appears in the word. So determining the direction of the characters is necessary to shaping them correctly. Otherwise, reversing the direction of characters that have already been shaped would lead to incorrect results.

simoncozens · 2015-05-04T01:37:33Z

Basically my reason for doing it (other than moving all the shaping bugs into the one file :-) ) is to provide more granularity in altering the shaping process and annotate notes before, during and after shaping. It won't completely fix the bidi issue---you will still need to fiddle around during boxUpNodes; this will (still) be where you apply the bidi algorithm and do the reordering---but hopefully it will allow the bidi package to interact with shaping at the right level.

simoncozens · 2015-05-04T01:42:03Z

Urgh, I'm confused about how my own code works. Ignore what I set about boxUpNodes. We will have to do bidi stuff in SILE.typesetter.setPar.

deepakjois · 2015-05-04T01:55:34Z

@simoncozens Makes sense. Will wait for your changes.

As an aside, what do you think about @khaledhosny’s suggestion above of storing font, colour etc. as ‘node’ properties. I get that it is not how it works currently, and the whole process of building a paragraph is stateful. Any specific reason you chose to go with that sort of approach?

simoncozens · 2015-05-05T00:15:50Z

OK, so I have refactored the shapers and I think it's helped me understand this issue.

In my mind, SILE does store font, color as a node property; that's basically what the shape method does - it turns basic text plus the properties which are to be applied to that text into a node.

The problem comes that when SILE was implemented in pango/cairo, SILE's shape method first called pango_itemize to split the text into items (I guess that is the step where the checking the text script, splitting into items of similar properties/scripts, reordering etc. should be done), and then shaped each item into a hbox, and finally put the whole bunch of items into an nnode - an nnode represents a word/token/whatever that belongs together in the text but may contain multiple items: for instance "میں1912میں" will be three items.

When I moved to Harfbuzz, I now realise that I skipped that itemizing step, and SILE.shapers.harfbuzz.shape just sends the whole token to hb_shape. That is why shape now produces an nnode always containing one hbox, and why "میں1912میں" doesn't work properly.

I think at the very least we need is to add an itemize method to the shaper class. For Pango this can (I think) call pango_itemize as before; for Harfbuzz we (I think) need to do the itemization manually. I think this will give us what we need to then reorder the nodes.

@khaledhosny, does that make sense?

khaledhosny · 2015-05-05T03:56:04Z

Yes, psngo_itemize does something similar to what I describe above, and if it were fed the whole paragraph then what you describe should do the trick, otherwise we still want to itemize the paragraph as a whole.

…s for more sophisticated tokenization later. See #76.

…t / state and gets shaped later. See #76.

…o a fix of #76

simoncozens · 2015-05-14T02:52:29Z

Well, I think I've fixed it, and made a bunch of other improvements to Arabic rendering as well. (Tashkil and other diacritics should now be positioned properly.)

Now text is split into tokens of the same bidi group, then the bidi algorithm is run, and then the nodes are passed to Harfbuzz for shaping.

Unfortunately I still can't read, so let me know if anything is not working properly.

deepakjois · 2015-05-14T12:06:24Z

From some initial testing against a test document, it seems to work properly. Will do a more thorough investigation next week.

khaledhosny · 2015-05-15T15:15:09Z

Combining marks seem to be breaking the shaping (at least in the sample PDF you committed).

simoncozens · 2015-05-15T23:30:02Z

OK. The code you highlighted in bidi.lua takes an input token (usually a space-separated word but this is customizable by language code) and splits it into tokens suitable for passing to the bidi algorithm. Currently it does this just by looking at each character and starting a new token when it detects a change in bidi class. I could add a rule that says "don't start a new token if this is a combining mark", but is that going to be adequate or will there be other cases which break this?

simoncozens · 2015-05-15T23:42:16Z

Wait, I think a light has come on: is unicode-bidi-algorithm expecting to work on individual characters (not tokens at all)? Perhaps I should just split it up into UTF8 characters and send everything to it.

khaledhosny · 2015-05-16T18:33:13Z

Right, unicode-bidi-algorithm expects a full paragraph, all node_to_table do is to convert the node list to text characters (replacing non-character nodes with U+FFFC which is neutral character and thus does not affect the algorithm), it was does this way so that I can replace the implementation with, say, FriBiDi without much work, otherwise the code can be made to run on the node list directly.

It is important that the whole paragraph is processed as a whole, because neutral and weak characters are influenced by their context, so if the context is lost the output of the algorithm can be substantially different.

… to the bidi algorithm individually, then put back into tokens before shaping.

simoncozens · 2015-05-19T14:33:24Z

OK. I'm sorry I'm being so slow on this bug, but I really don't know what I'm doing with bidi and/or Arabic. I think it works now. (but I thought that several times before already...)

Now I split the tokens up into individual Unicode characters before sending them to the bidi algorithm, which reorders them. After the bidi algorithm does its stuff, we have a list of (properly-ordered) Unicode characters. But we can't send those characters one-by-one to be shaped, because otherwise we get a load of isolated Arabic forms. And we can't send the whole paragraph to be shaped all at once, because we want fine control over turning spaces into glue nodes, and because there may be font changes / etc. during the course of the paragraph. So we have to turn the list of characters into something else.

The bidi package now tries to reconstitute the characters returned into meaningful tokens to be handed off to the shaper, where a meaningful token is defined as being "not containing glue" and "containing a bunch of characters of the same font/size/language/script".

It seems to produce output that looks right to me, even for @deepakjois's 1947-1955 test cases above. But I don't know well enough to know if neutral, weak and combining characters are being handled properly.

khaledhosny · 2015-05-24T16:21:13Z

It seems to work fine now.

khaledhosny · 2016-03-25T16:09:48Z

Not sure if this was the right thread (too lazy to re-read all the comments again), but here is something I tried with Scribus which seems to work fine so far:

do bidi, script, font itemisation etc. and shape the text.
instead of reversing RTL runs (which I tried but gave me all sorts of funny bugs which mixed direction text and line breaks), store the index of all the shaped glyphs before doing any line breaking (that will be the visual index), then sort all the glyphs by their cluster values so now they are in logical order.
Do line breaking pretending the text is unidirectional LTR.
after line breaking, sort the glyphs in each line separately by their visual indices.

This seems to give me proper line breaking, and works whether the base paragraph direction in LTR or RTL and allows me to keep using whatever line breaking code I had unmodified.

khaledhosny · 2016-04-18T23:37:40Z

Incidentally I have been looking into https://github.com/simoncozens/sile/blob/master/examples/arabic.pdf and it suffers from the same issue I had with Scribus initially; when line breaks happens in LTR text in a RTL paragraph, the 1st part of the text goes to the 2nd line and vice versa. On the other hand we have been using the method I outlined above in Scribus for a while now and it has been working well so far.

Note that the issue initially reported and tested for here is currently working as expected, but the issue was actually re-opened on account of the implementation being thought wrong.

simoncozens closed this as completed in 28f3186 Apr 27, 2015

simoncozens reopened this Apr 27, 2015

simoncozens added a commit that referenced this issue May 6, 2015

Baby-steps refactoring to introduce the concept of an itemizer. Allow…

3131ad1

…s for more sophisticated tokenization later. See #76.

simoncozens added a commit that referenced this issue May 7, 2015

Refactor to add the concept of an unshaped node which just stores tex…

29ecc95

…t / state and gets shaped later. See #76.

simoncozens added a commit that referenced this issue May 13, 2015

Remove the autoshaping of unshaped nodes. This should get us closer t…

40e7b6b

…o a fix of #76

simoncozens closed this as completed in 2c65770 May 14, 2015

deepakjois mentioned this issue May 14, 2015

\setLTR and \setRTL support #78

Closed

simoncozens reopened this May 15, 2015

simoncozens added a commit that referenced this issue May 19, 2015

More hacking on issue #76. Send each Unicode character of a paragraph…

9054feb

… to the bidi algorithm individually, then put back into tokens before shaping.

simoncozens closed this as completed May 24, 2015

simoncozens reopened this Apr 20, 2016

alerque modified the milestone: v0.9.5 Aug 12, 2016

alerque modified the milestones: v0.9.5, v0.9.6 Jan 11, 2019

alerque modified the milestones: v0.11.0, v0.12.0 Sep 1, 2021

This was referenced Apr 16, 2022

Bable strikes back — chaos in language settings #1367

Open

Re-scope settings stack to typesetter, document, or frame as appropriate #1327

Open

alerque modified the milestones: v0.13.0, v0.13.x May 21, 2022

alerque modified the milestones: v0.13.x, v0.14.x Jun 24, 2022

Omikhleia added the bug Software bug issue label Apr 2, 2023

mahmadzaid mentioned this issue Jun 26, 2023

Hbox fixes for RTL languages #1822

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BIDI algorithm not working properly? #76

BIDI algorithm not working properly? #76

deepakjois commented Apr 27, 2015

simoncozens commented Apr 27, 2015

simoncozens commented Apr 27, 2015

deepakjois commented Apr 27, 2015

deepakjois commented Apr 27, 2015

simoncozens commented Apr 27, 2015

simoncozens commented Apr 27, 2015

deepakjois commented Apr 27, 2015

simoncozens commented Apr 27, 2015

khaledhosny commented Apr 27, 2015

deepakjois commented Apr 28, 2015

khaledhosny commented Apr 28, 2015

simoncozens commented Apr 28, 2015

khaledhosny commented May 3, 2015

deepakjois commented May 3, 2015

deepakjois commented May 3, 2015

khaledhosny commented May 3, 2015

simoncozens commented May 3, 2015

deepakjois commented May 4, 2015

simoncozens commented May 4, 2015

simoncozens commented May 4, 2015

deepakjois commented May 4, 2015

simoncozens commented May 5, 2015

khaledhosny commented May 5, 2015

simoncozens commented May 14, 2015

deepakjois commented May 14, 2015

khaledhosny commented May 15, 2015

simoncozens commented May 15, 2015

simoncozens commented May 15, 2015

khaledhosny commented May 16, 2015

simoncozens commented May 19, 2015

khaledhosny commented May 24, 2015

khaledhosny commented Mar 25, 2016

khaledhosny commented Apr 18, 2016

BIDI algorithm not working properly? #76

BIDI algorithm not working properly? #76

Comments

deepakjois commented Apr 27, 2015

simoncozens commented Apr 27, 2015

simoncozens commented Apr 27, 2015

deepakjois commented Apr 27, 2015

deepakjois commented Apr 27, 2015

simoncozens commented Apr 27, 2015

simoncozens commented Apr 27, 2015

deepakjois commented Apr 27, 2015

simoncozens commented Apr 27, 2015

khaledhosny commented Apr 27, 2015

deepakjois commented Apr 28, 2015

khaledhosny commented Apr 28, 2015

simoncozens commented Apr 28, 2015

khaledhosny commented May 3, 2015

deepakjois commented May 3, 2015

deepakjois commented May 3, 2015

khaledhosny commented May 3, 2015

simoncozens commented May 3, 2015

deepakjois commented May 4, 2015

simoncozens commented May 4, 2015

simoncozens commented May 4, 2015

deepakjois commented May 4, 2015

simoncozens commented May 5, 2015

khaledhosny commented May 5, 2015

simoncozens commented May 14, 2015

deepakjois commented May 14, 2015

khaledhosny commented May 15, 2015

simoncozens commented May 15, 2015

simoncozens commented May 15, 2015

khaledhosny commented May 16, 2015

simoncozens commented May 19, 2015

khaledhosny commented May 24, 2015

khaledhosny commented Mar 25, 2016

khaledhosny commented Apr 18, 2016