Catalan hyphenation details poorly supported #34

Open
jmontane opened this issue Aug 18, 2023 · 0 comments
Labels
i:line_breaking Line breaking & hyphenation l:ca Catalan s:latn Latin (script)

Comments

@jmontane

Catalan needs hyphenation. Its hyphenation has a few complex rules (the Catalan L·L, compound words, prefixes, etc.).

Gecko doesn't provide break opportunities at the Catalan "L·L". E.g., "cancel·lar" can be hyphenated as "can-cel-lar": at a line break the "l·l" is split as "l-l", with the hyphen taking the place of the middle dot. Gecko doesn't do this.
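To make the expected L·L behaviour concrete, here is a minimal Python sketch. It is an illustration only, not Gecko code: the function names are hypothetical, and it assumes the middle dot is encoded as U+00B7.

```python
MIDDLE_DOT = "\u00b7"  # assumption: the "punt volat" is encoded as U+00B7 MIDDLE DOT


def ll_break_opportunities(word):
    """Return the index of every middle dot that sits between two l's."""
    return [
        i
        for i in range(1, len(word) - 1)
        if word[i] == MIDDLE_DOT and word[i - 1] in "lL" and word[i + 1] in "lL"
    ]


def break_at(word, dot_index):
    """Split at a middle dot: the dot is dropped and a hyphen ends the first line."""
    return word[:dot_index] + "-", word[dot_index + 1:]


word = "cancel" + MIDDLE_DOT + "lar"
for i in ll_break_opportunities(word):
    print(break_at(word, i))  # ('cancel-', 'lar')
```

The point is that the break is not a plain insertion of a hyphen: the middle dot disappears at the end of the line and comes back when the word is not broken.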

Another hyphenation issue is related to word boundaries and hyphenation rules. Some hyphenation rules apply only at the start of a word (such rules start with a dot, ".") or only at the end of a word (such rules end with a dot). These rules are useful in Catalan for handling compound words, prefixes, and inflected verb forms, but they cannot apply if the word has an article joined to it with an apostrophe, or if the word is a verb form with a pronoun attached by a hyphen. E.g.:

  • "inèdit" requires the special rule ".i4n3èdit" to get the proper breakpoints "in-è-dit". Good. But "inèdit" can be joined to an article, "l'inèdit", so if the hyphenation engine receives "l'inèdit" as a single word, it produces the wrong breakpoints "l'i-nè-dit". Of course, this can be patched by adding a twin rule ".l'i4n3èdit" to get the proper breakpoints "l'in-è-dit".
  • "conduint" requires the special rule "u1int." to get the proper breakpoints "con-du-int". Good. But "conduint" can have a pronoun attached with a hyphen, "conduint-ho", so if the hyphenation engine receives "conduint-ho" as a single word, it produces the wrong breakpoints "con-duint-ho". Of course, this can be patched by adding a twin rule "u1int-" to get the proper breakpoints "con-du-int-ho".
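The two cases above can be reproduced with a small Liang-style (TeX) pattern matcher. This is a sketch under stated assumptions, not the real Catalan pattern set: only ".i4n3èdit", "u1int." and the two twin rules come from this issue, while "1nè" and "1d" are toy patterns invented here to stand in for the general rules, and the minimum of 2 letters on each side of a break is also an assumption.

```python
def parse_pattern(pattern):
    """Split a TeX-style pattern such as '.i4n3èdit' into letters and gap values."""
    letters, values = "", [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)   # value for the gap before the next letter
        else:
            letters += ch
            values.append(0)
    return letters, values


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Overlay every matching pattern, keep the max per gap, break where it is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    work = "." + word + "."          # the dots mark the word boundaries
    points = [0] * (len(work) + 1)   # points[p] is the gap before work[p]
    for i in range(len(work)):
        for j in range(i + 1, len(work) + 1):
            values = table.get(work[i:j])
            if values:
                for k, v in enumerate(values):
                    points[i + k] = max(points[i + k], v)
    pieces, start = [], 0
    for q in range(left_min, len(word) - right_min + 1):
        if points[q + 1] % 2 == 1:   # gap before word[q]
            pieces.append(word[start:q])
            start = q
    pieces.append(word[start:])
    return "-".join(pieces)


# "1nè" and "1d" are hypothetical stand-ins for the general rules.
BASE = [".i4n3èdit", "u1int.", "1nè", "1d"]
TWINS = BASE + [".l'i4n3èdit", "u1int-"]

print(hyphenate("inèdit", BASE))        # in-è-dit
print(hyphenate("l'inèdit", BASE))      # l'i-nè-dit
print(hyphenate("l'inèdit", TWINS))     # l'in-è-dit
print(hyphenate("conduint", BASE))      # con-du-int
print(hyphenate("conduint-ho", BASE))   # con-duint-ho
print(hyphenate("conduint-ho", TWINS))  # con-du-int-ho
```

The boundary dot in a pattern only matches the edge of whatever string the engine is handed, so the same rule set gives different breakpoints depending on whether "l'" and "-ho" are included in that string.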

So these word segmentation cases can be fixed on the rule development side, but I record them here as documentation, to illustrate that the word segmentation applied before the hyphenation library runs can produce incorrect breakpoints.
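For comparison, the other side of the trade-off would be to segment the token at apostrophes and hyphens before it reaches the hyphenation library, so that start-of-word and end-of-word rules see the bare word. The sketch below is only an illustration of that idea: `segment`, `hyphenate_token` and the stub hyphenator are hypothetical names, and nothing here describes how Gecko actually segments text.

```python
import re

# One segment per run of letters, keeping a trailing apostrophe with the elided
# article or pronoun; hyphens and stray apostrophes become segments of their own.
_SEGMENT = re.compile(r"[^'’\-]+['’]?|[-'’]")


def segment(token):
    """Split a token such as "l'inèdit" or "conduint-ho" into hyphenation units."""
    return _SEGMENT.findall(token)


def hyphenate_token(token, hyphenate_word):
    """Hyphenate each bare word separately; clitic parts and hyphens pass through."""
    return "".join(
        seg if seg[-1] in "'’-" else hyphenate_word(seg)
        for seg in segment(token)
    )


# Stub standing in for a real pattern-based hyphenator (hypothetical).
KNOWN = {"inèdit": "in-è-dit", "conduint": "con-du-int"}


def stub(word):
    return KNOWN.get(word, word)


print(segment("l'inèdit"))                   # ["l'", 'inèdit']
print(hyphenate_token("l'inèdit", stub))     # l'in-è-dit
print(hyphenate_token("conduint-ho", stub))  # con-du-int-ho
```

With this kind of segmentation the twin rules would not be needed, at the cost of the minimum-length limits being applied per segment rather than per token.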

Tests & results:
Catalan hyphenation is supported by Gecko, but not by Blink or WebKit.
The hyphenation rules used by Gecko are the old ones, from TeX.
Better, updated hyphenation rules are used by LibreOffice. Upstream is here.

More systematic tests are needed to ascertain whether Gecko handles everything for the Catalan language (such as the L·L mentioned above, or words joined to an article).

@r12a r12a added i:line_breaking Line breaking & hyphenation l:ca Catalan s:latn Latin (script) labels Jun 28, 2024