Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BIDI algorithm not working properly? #76

Open
deepakjois opened this issue Apr 27, 2015 · 33 comments
Open

BIDI algorithm not working properly? #76

deepakjois opened this issue Apr 27, 2015 · 33 comments
Labels
bug Software bug issue
Milestone

Comments

@deepakjois
Copy link
Contributor

Here is a sample document. The numbers within the script should be formatted LTR, as per Unicode BIDI rules, but they are not.

I did notice that in the example docs (like showoff.sil) the numbers are explicitly formatted LTR using a \font directive to wrap them. But I believe that should not be required. Any idea what is causing this?

\begin[papersize=a4]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[size=18pt]
سعادت حسن منٹو اردو کے ایک جانے مانے افسانہ نگار ہیں۔ وہ پنجاب کے لدھانہ ضلع میں  1912 میں پیدا ہوئے۔ ان کے ماں
 باپ کشمیری مسلمان تھے۔ وہ علی گڑھ مسلم یونیورسٹی سے تعلیم حاصل کر نے کے بعد انجمن ترقی پسند مصنفین ہند کے
 ساتھ جڑے۔ 1936 میں وہ بمبئی شہر جا کر بس گئے جہاں انہوں نے اپنے کئی مشہور افسانے اور فلم کے سکرپٹ لکھے
۔1947 میں بٹوارے کے بعد وہ لاہور چلے گئے۔ وہاں ان کی ملاقات فیض احمد فیض، سید ناصر رضا كاظمی
 اور احمد ندیم قاسمی‎ سے ہوئی اور وہ اخباروں کے لئے مضمون لکھنے لگے۔ 1955 میں ان کا انتقال ہوا۔
\end{document}
@simoncozens
Copy link
Member

Not sure what you mean; here is what I see:
bug-76

Although I set the font family explicitly. (\font[size=18pt,family=Scheherazade])

I think what might be happening is that SILE has a concept of font direction and frame direction. You've set the frame direction to RTL, which is fine, but if you're still using the default font, (Gentium, English) then the expected output is LTR. So things might be getting reversed because of that.

@simoncozens
Copy link
Member

More to the point I wonder how, without explicitly selecting an Arabic font, you're getting any arabic output at all. Could you run with the latest repository and the command line option --debug=fonts?

@deepakjois
Copy link
Contributor Author

Sorry, I have a typo above. Let me try to explain again with the code.

The output from the code below is wrong. Each Urdu word is being typeset LTR but the paragraph is being set RTL.

\begin[papersize=a5]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[family=Amiri,language=urd,script=Arab]
\font[size=18pt]
سعادت حسن منٹو اردو کے ایک جانے مانے افسانہ نگار ہیں۔ وہ پنجاب کے لدھانہ ضلع میں  1912 میں پیدا ہوئے۔ ان کے ماں
 باپ کشمیری مسلمان تھے۔ وہ علی گڑھ مسلم یونیورسٹی سے تعلیم حاصل کر نے کے بعد انجمن ترقی پسند مصنفین ہند کے
 ساتھ جڑے۔ 1936 میں وہ بمبئی شہر جا کر بس گئے جہاں انہوں نے اپنے کئی مشہور افسانے اور فلم کے سکرپٹ لکھے
۔1947 میں بٹوارے کے بعد وہ لاہور چلے گئے۔ وہاں ان کی ملاقات فیض احمد فیض، سید ناصر رضا كاظمی
 اور احمد ندیم قاسمی‎ سے ہوئی اور وہ اخباروں کے لئے مضمون لکھنے لگے۔ 1955 میں ان کا انتقال ہوا۔
\end{document}

screenshot 2015-04-27 10 10 31

The code below (note the additional direction=RTL in the font declaration) typesets the Urdu words correctly. It typesets the paragraph RTL as well. But the numbers are being typeset wrongly. Hope that makes sense.

\begin[papersize=a5]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[family=Amiri,language=urd,direction=RTL,script=Arab]
\font[size=18pt]
سعادت حسن منٹو اردو کے ایک جانے مانے افسانہ نگار ہیں۔ وہ پنجاب کے لدھانہ ضلع میں  1912 میں پیدا ہوئے۔ ان کے ماں
 باپ کشمیری مسلمان تھے۔ وہ علی گڑھ مسلم یونیورسٹی سے تعلیم حاصل کر نے کے بعد انجمن ترقی پسند مصنفین ہند کے
 ساتھ جڑے۔ 1936 میں وہ بمبئی شہر جا کر بس گئے جہاں انہوں نے اپنے کئی مشہور افسانے اور فلم کے سکرپٹ لکھے
۔1947 میں بٹوارے کے بعد وہ لاہور چلے گئے۔ وہاں ان کی ملاقات فیض احمد فیض، سید ناصر رضا كاظمی
 اور احمد ندیم قاسمی‎ سے ہوئی اور وہ اخباروں کے لئے مضمون لکھنے لگے۔ 1955 میں ان کا انتقال ہوا۔
\end{document}

screenshot 2015-04-27 10 10 06

@deepakjois
Copy link
Contributor Author

Oh, and the output that you posted is wrong as well. The Urdu words are being typeset LTR, like in my first example.

@simoncozens
Copy link
Member

OK, thanks; I'm afraid I don't read Arabic/Urdu very well so I can't see at a glance if it's doing the right thing or not! Now I understand the problem, I will try and work on it.

@simoncozens
Copy link
Member

Please try with that commit. I think it does the right thing but I'm not sure.

@deepakjois
Copy link
Contributor Author

This definitely fixes the common case, which is nice. Here is a file that exposes some edge cases. Whenever the number is separated by spaces on both sides, it is typeset correctly. But when it appear right after the ‘۔’ (Arabic Full Stop), or after an Arabic letter, it gets typeset RTL.

I am not an expert on the BIDI algorithm, so I can’t say what is the right thing to do here. Just pointing out these cases.

\begin[papersize=a5]{document}
\script[src=packages/bidi]
\thisframeRTL%
\font[family=Amiri,language=urd,direction=RTL,script=Arab]
\font[size=18pt]
1955 میں ان کا انتقال ہوا۔

1955 میں ان کا انتقال ہوا۔1947 میں وہ۔

1955 میں ان کا انتقال ہوا۔ 1947 میں وہ۔

میں 1955 میں آیا تھا

میں1955 میں آیا تھا
\end{document}

screenshot 2015-04-27 12 04 43

@simoncozens
Copy link
Member

Thanks, those are some good test cases. I know what's happening - SILE takes a short cut and assumes that all characters in a shaping run are of the shame bidi class, but that doesn't hold up in the cases you've found. I shall work on it...

@simoncozens simoncozens reopened this Apr 27, 2015
@khaledhosny
Copy link
Contributor

I think the proper way to do itemization is to first resolve the script property of each character, do bidi itemization, then further split the bidi runs into script runs. Resolve the script before breaking the bidi runs is important so that script property of common characters across bidi runs are resolved properly.

@deepakjois
Copy link
Contributor Author

After doing a little bit more research and reading, it appears to me that associating a direction property to the font (like SILE does) is not entirely appropriate. Each unicode character has an implicit bidirectional type. In addition there is a default base direction of the document, page, frame (or whatever context you want to associate a base direction with).

Associating a direction property with a font incorrectly conflates two unrelated things. A font is just a collection of glyphs. By itself, it does not have a direction. For example, there may be fonts which provide glyphs for both English and Arabic, in which case typesetting a paragraph in the same font, but containing English and Arabic characters should obey the Unicode bidi algorithm.

@khaledhosny
Copy link
Contributor

I agree completely with the above comment.

@simoncozens
Copy link
Member

Thanks, both - looks like a lot of rethinking will need to be done. I will work on it, but it may be a couple of weeks before I get back to this.

@khaledhosny
Copy link
Contributor

I don't know the internals of SILE, but here are some ideas:

  • The document, frames and paragraphs should have a direction, and by default it inherits from its parent.
  • The paragraph direction is the fare direction when applying the BiDi algorithm.
  • Text nodes should have a script and language properties. The script defaults to the value from Unicode character properties, and the language to the document language. There should be a way to override text language.
  • Characters with common or inherit script should be resolved according the Unicode algorithm.
  • BiDi algorithm should then be applied to each paragraph, splitting it into BiDi runs.
  • BiDi runs should then be reordered, but without rendering the text of the RTL runs.
  • The runs should then be split further based on script and language, and then each run shaped with HarfBuzz using the script, language and direction. HarfBuzz will take care of reversing the glyphs in RTL runs.

@deepakjois
Copy link
Contributor Author

@khaledhosny Thank you for your input. Do you have any ideas on how to deal with BiDi reordering when an entire paragraph is not immediately available to you as a bunch of characters. For example, how do you treat directives within a paragraph like italicizing or bolding of text, font changes etc. Do all of these get stored in text nodes first, and acted upon later after the BiDi reordering has completed? Are there any code examples that one could look at to get a sense of how something like that could be done.

From what I could figure out, SILE currently does shaping (using Harfbuzz) before it does BiDi processing. In fact, it tries to shape the characters, as soon as it can before moving on to the rest of the paragraph. For example, a paragraph like:

Neque porro \em{quisquam est} \font[size=24pt] dolorem ipsum quia dolor sit amet

would be processed in the following sequence:

  • shape Neque porro, a word at a time (which are then stored as TeX like Hboxes)
  • change some internal state to indicate that the italic font face must be chosen
  • shape quisquam est
  • change internal state again to indicate the font size is 24pt regular
  • shape dolorem ipsum quia dolor sit amet, a word at a time.
  • BiDi reorder the Hboxes

(This is obviously incorrect for at least a couple of reasons: shaping is happening before BiDi reordering, and even then, the BiDi reordering is buggy because it is not acting upon the entire sequence of characters, but on HBoxes of words)

My question is, how does one represent a paragraph like above as a series of text nodes that could be subjected to proper BiDi reordering (as laid out in UA9) and be shaped with the proper fonts as well.

@deepakjois
Copy link
Contributor Author

Oh, and just FYI I am doing some ad-hoc hacking in an experimental branch: https://github.com/deepakjois/sile/tree/exp

It started off as an effort to understand SILE’s internals better. In that process, I was able to improve the typesetting of my Urdu document from this (typeset with latest SILE from master) to this (typeset with SILE from my experimental branch). The BiDi reordering is still buggy, but I managed to get the most common case work correctly. I was hoping to tackle proper full paragraph BiDi reordering before shaping, once I wrap my head around UAX #9 fully.

@khaledhosny
Copy link
Contributor

I think font, colour etc. should be stored as node properties (assuming TeX-like nodes, or whatever SILE use internally) and after BiDi they should be used when deciding run boundaries as appropriate e.g. font change should end the current run, just like script or language change, but colour change shouldn’t and should instead be applying to the glyphs corresponding to characters it were applied to (HarfBuzz’s cluster values should be used to map output glyphs to input characters).

@simoncozens
Copy link
Member

One of the things I'm working on now is rewriting the shaping interface so that (a) Pango and Harfbuzz share common code (even if Pango is more-or-less deprecated this helps us to get the separation of concerns in the right places), and therefore (b) we can add character direction information as part of the shaping process. More in a few days.

@deepakjois
Copy link
Contributor Author

@simoncozens Would that mean that shaping would still happen before BIDI reordering? I just want to point out – it may not make a lot of difference for scripts like English, but Arabic script for example has contextual glyphs for when (beginning, middle or end) the character appears in the word. So determining the direction of the characters is necessary to shaping them correctly. Otherwise, reversing the direction of characters that have already been shaped would lead to incorrect results.

@simoncozens
Copy link
Member

Basically my reason for doing it (other than moving all the shaping bugs into the one file :-) ) is to provide more granularity in altering the shaping process and annotate notes before, during and after shaping. It won't completely fix the bidi issue---you will still need to fiddle around during boxUpNodes; this will (still) be where you apply the bidi algorithm and do the reordering---but hopefully it will allow the bidi package to interact with shaping at the right level.

@simoncozens
Copy link
Member

Urgh, I'm confused about how my own code works. Ignore what I set about boxUpNodes. We will have to do bidi stuff in SILE.typesetter.setPar.

@deepakjois
Copy link
Contributor Author

@simoncozens Makes sense. Will wait for your changes.

As an aside, what do you think about @khaledhosny’s suggestion above of storing font, colour etc. as ‘node’ properties. I get that it is not how it works currently, and the whole process of building a paragraph is stateful. Any specific reason you chose to go with that sort of approach?

@simoncozens
Copy link
Member

OK, so I have refactored the shapers and I think it's helped me understand this issue.

In my mind, SILE does store font, color as a node property; that's basically what the shape method does - it turns basic text plus the properties which are to be applied to that text into a node.

The problem comes that when SILE was implemented in pango/cairo, SILE's shape method first called pango_itemize to split the text into items (I guess that is the step where the checking the text script, splitting into items of similar properties/scripts, reordering etc. should be done), and then shaped each item into a hbox, and finally put the whole bunch of items into an nnode - an nnode represents a word/token/whatever that belongs together in the text but may contain multiple items: for instance "میں1912میں" will be three items.

When I moved to Harfbuzz, I now realise that I skipped that itemizing step, and SILE.shapers.harfbuzz.shape just sends the whole token to hb_shape. That is why shape now produces an nnode always containing one hbox, and why "میں1912میں" doesn't work properly.

I think at the very least we need is to add an itemize method to the shaper class. For Pango this can (I think) call pango_itemize as before; for Harfbuzz we (I think) need to do the itemization manually. I think this will give us what we need to then reorder the nodes.

@khaledhosny, does that make sense?

@khaledhosny
Copy link
Contributor

Yes, psngo_itemize does something similar to what I describe above, and if it were fed the whole paragraph then what you describe should do the trick, otherwise we still want to itemize the paragraph as a whole.

simoncozens added a commit that referenced this issue May 6, 2015
…s for more sophisticated tokenization later. See #76.
simoncozens added a commit that referenced this issue May 7, 2015
@simoncozens
Copy link
Member

Well, I think I've fixed it, and made a bunch of other improvements to Arabic rendering as well. (Tashkil and other diacritics should now be positioned properly.)

Now text is split into tokens of the same bidi group, then the bidi algorithm is run, and then the nodes are passed to Harfbuzz for shaping.

Unfortunately I still can't read, so let me know if anything is not working properly.

@deepakjois
Copy link
Contributor Author

From some initial testing against a test document, it seems to work properly. Will do a more thorough investigation next week.

@khaledhosny
Copy link
Contributor

Combining marks seem to be breaking the shaping (at least in the sample PDF you committed).

@simoncozens
Copy link
Member

OK. The code you highlighted in bidi.lua takes an input token (usually a space-separated word but this is customizable by language code) and splits it into tokens suitable for passing to the bidi algorithm. Currently it does this just by looking at each character and starting a new token when it detects a change in bidi class. I could add a rule that says "don't start a new token if this is a combining mark", but is that going to be adequate or will there be other cases which break this?

@simoncozens simoncozens reopened this May 15, 2015
@simoncozens
Copy link
Member

Wait, I think a light has come on: is unicode-bidi-algorithm expecting to work on individual characters (not tokens at all)? Perhaps I should just split it up into UTF8 characters and send everything to it.

@khaledhosny
Copy link
Contributor

Right, unicode-bidi-algorithm expects a full paragraph, all node_to_table do is to convert the node list to text characters (replacing non-character nodes with U+FFFC which is neutral character and thus does not affect the algorithm), it was does this way so that I can replace the implementation with, say, FriBiDi without much work, otherwise the code can be made to run on the node list directly.

It is important that the whole paragraph is processed as a whole, because neutral and weak characters are influenced by their context, so if the context is lost the output of the algorithm can be substantially different.

simoncozens added a commit that referenced this issue May 19, 2015
… to the bidi algorithm individually, then put back into tokens before shaping.
@simoncozens
Copy link
Member

OK. I'm sorry I'm being so slow on this bug, but I really don't know what I'm doing with bidi and/or Arabic. I think it works now. (but I thought that several times before already...)

Now I split the tokens up into individual Unicode characters before sending them to the bidi algorithm, which reorders them. After the bidi algorithm does its stuff, we have a list of (properly-ordered) Unicode characters. But we can't send those characters one-by-one to be shaped, because otherwise we get a load of isolated Arabic forms. And we can't send the whole paragraph to be shaped all at once, because we want fine control over turning spaces into glue nodes, and because there may be font changes / etc. during the course of the paragraph. So we have to turn the list of characters into something else.

The bidi package now tries to reconstitute the characters returned into meaningful tokens to be handed off to the shaper, where a meaningful token is defined as being "not containing glue" and "containing a bunch of characters of the same font/size/language/script".

It seems to produce output that looks right to me, even for @deepakjois's 1947-1955 test cases above. But I don't know well enough to know if neutral, weak and combining characters are being handled properly.

@khaledhosny
Copy link
Contributor

It seems to work fine now.

@khaledhosny
Copy link
Contributor

Not sure if this was the right thread (too lazy to re-read all the comments again), but here is something I tried with Scribus which seems to work fine so far:

  1. do bidi, script, font itemisation etc. and shape the text.
  2. instead of reversing RTL runs (which I tried but gave me all sorts of funny bugs which mixed direction text and line breaks), store the index of all the shaped glyphs before doing any line breaking (that will be the visual index), then sort all the glyphs by their cluster values so now they are in logical order.
  3. Do line breaking pretending the text is unidirectional LTR.
  4. after line breaking, sort the glyphs in each line separately by their visual indices.

This seems to give me proper line breaking, and works whether the base paragraph direction in LTR or RTL and allows me to keep using whatever line breaking code I had unmodified.

@khaledhosny
Copy link
Contributor

Incidentally I have been looking into https://github.com/simoncozens/sile/blob/master/examples/arabic.pdf and it suffers from the same issue I had with Scribus initially; when line breaks happens in LTR text in a RTL paragraph, the 1st part of the text goes to the 2nd line and vice versa. On the other hand we have been using the method I outlined above in Scribus for a while now and it has been working well so far.

@simoncozens simoncozens reopened this Apr 20, 2016
alerque added a commit to alerque/sile that referenced this issue May 30, 2016
Note that the issue initially reported and tested for here is currently
working as expected, but the issue was actually re-opened on account of
the implementation being thought wrong.
@alerque alerque modified the milestone: v0.9.5 Aug 12, 2016
@alerque alerque modified the milestones: v0.9.5, v0.9.6 Jan 11, 2019
@alerque alerque modified the milestones: v0.11.0, v0.12.0 Sep 1, 2021
@alerque alerque modified the milestones: v0.13.0, v0.13.x May 21, 2022
@alerque alerque modified the milestones: v0.13.x, v0.14.x Jun 24, 2022
@Omikhleia Omikhleia added the bug Software bug issue label Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Software bug issue
Projects
Development

No branches or pull requests

5 participants