Drop characters when hyphenating #355

alerque · 2016-08-08T15:06:04Z

So the feature added at my request in #265 to set up custom hyphenation patterns is great for words that are exceptions to the normal rules, but I'm finding it leaves me hanging when it comes to the actual rules.

I've recently discovered that there is more than one style guide for how to hyphenate words with internal apostrophes in Turkish. My personal preference is to place the hyphenation point after the apostrophe, but this is not what all publishing houses want to see. Many of them require the apostrophe to be dropped if the word gets hyphenated at that point. For example:

müjdesinin

might normally hyphenate as

müj-de-si-nin

However, when it is used as a proper noun, the suffixes (after the third person possessive) are set apart with an apostrophe:

Müjdesi’nin

Müj-de-si’-nin

But if that hyphenation point is actually used, the apostrophe should go away:

Mesih'in Müjdesi-
nin tek gerçek

The problem here is twofold.

1. There are a huge number of words where this could be an issue, so adding case by case rules is cumbersome. My current solution for getting hyphens after apostrophes rather than before is to search the full text of my books for all the apostrophized words and manually add them to an exception list with the pre-apostrophe hyphen points removed.
2. The current exception handler does not provide a mechanism for the input and output strings to be different. The input of the function is the hyphenated form, and the function assumes the word is that string with the hyphens removed. This won't work for setting up a publishing workflow that conforms to the expectation of apostrophes being dropped when they are used as hyphenation points.
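
To illustrate the second point, here is a minimal Python sketch of an exception table that behaves the way described above. It is not SILE's implementation; `register_exception` and the table layout are hypothetical and only model "the word is the hyphenated form with hyphens removed".

```python
def register_exception(table: dict, hyphenated: str) -> None:
    """Store an exception keyed by the hyphenated form with hyphens removed.

    Hypothetical model of the behaviour described in this issue,
    not SILE's actual exception handler.
    """
    table[hyphenated.replace("-", "")] = hyphenated.split("-")


exceptions = {}
register_exception(exceptions, "Müj-de-si’-nin")

pieces = exceptions["Müjdesi’nin"]
print(pieces)  # ['Müj', 'de', 'si’', 'nin']

# The pieces always concatenate back to the lookup key, so no break
# can ever make the apostrophe disappear from the output.
assert "".join(pieces) == "Müjdesi’nin"
```

Because the output pieces are derived from the same string as the lookup key, there is no place in this model to say "drop this character if the break is taken".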

simoncozens · 2016-08-10T00:12:06Z

I'll need to come back to this, but I think the answer will be to replace the apostrophe with a \discretionary command like so: \discretionary{-}{}{'}. i.e. produce an apostrophe if not hyphenated, and produce a hyphen with no apostrophe if hyphenated. Of course this means I will have to check that the nobreak part of the discretionary works. I can't remember if I implemented that yet.
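
As a rough illustration of that idea, the following Python sketch models a discretionary as three alternative strings. It is a toy model under my own naming (`Discretionary`, `render`), not SILE's node type or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discretionary:
    """Toy model of a discretionary break (hypothetical, not SILE's node)."""
    prebreak: str     # set at the end of the line if the break is taken
    postbreak: str    # set at the start of the next line if the break is taken
    replacement: str  # set in place if the break is not taken


def render(before: str, disc: Discretionary, after: str, broken: bool) -> list:
    """Return the output lines for `before + disc + after`."""
    if broken:
        return [before + disc.prebreak, disc.postbreak + after]
    return [before + disc.replacement + after]


apostrophe = Discretionary(prebreak="-", postbreak="", replacement="’")
print(render("Müjdesi", apostrophe, "nin", broken=False))  # ['Müjdesi’nin']
print(render("Müjdesi", apostrophe, "nin", broken=True))   # ['Müjdesi-', 'nin']
```

The apostrophe only exists in the unbroken rendering, which is the behaviour the stricter Turkish style guides ask for.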

alerque · 2016-12-20T11:11:26Z

At what stage in the processing would you envision this being implemented? An inputfilter()?

alerque · 2017-01-05T14:28:06Z

If I replace all the apostrophes in my input text with \discretionary commands, then none of the text that gets run through pushBack() has any apostrophes at all.

khaledhosny · 2017-01-05T15:09:13Z

The Hyphen hyphenation library supports what it calls “non-standard hyphenation”, where changes are applied to the text when it is hyphenated. It might be what you are looking for.

This was requested in #277, but SILE will either need to use Hyphen instead of its own code, or support the enhanced hyphenation patterns.

alerque · 2019-02-19T06:23:23Z

Small poke. I've got another book going to press next week and it's turning up a rather large number of these errors (hyphenation at apostrophes not removing the apostrophe). If you've had any ideas about where in the process this should be implemented I'd be all up for trying it in the next few days.

simoncozens · 2019-02-20T13:42:52Z

So, this works:

\begin[papersize=a6]{document}
\language[main=tr]
\font[size=30pt]
Müjdesi\discretionary[prebreak="-",replacement="'"]nin
Müjdesi\discretionary[prebreak="-",replacement="'"]nin
\end{document}

At the moment we don't have a way for hyphenation dictionaries to do that clever stuff. If it's a small number of obvious cases I would suggest either using an input filter or defining a command like \magicapostrophehyphenator{Müjdesi’nin}.

The bigger fix is obviously to support enhanced hyphenation patterns as Khaled suggests. I will try to implement that over the next day or two.

alerque · 2019-02-21T07:48:43Z

I don't know what to say. Either my testing two years ago was flawed (without an MWE that's entirely possible) or something else has changed in SILE since then, because \discretionary does in fact work. Not without some weird artifacts, however. For example, for some reason it botches centered chapter headlines, so I have to exclude those. With this workaround any apostrophes in chapter titles would show up after the end of the chapter title instead of in their normal place in the text.

Doing it by hand or for specific words would be a nightmare given the hundreds of words involved. I was able to get my current book looking better by preprocessing the Markdown source. This is a lot easier than trying to set it up in SILE using an input filter, simply by virtue of brute force access to the raw text.

perl -Mutf8 -CS -pne '/^#/ or s/(?<=\p{L})’(?=\p{L})/\\ah{}/g'

That skips all headings and matches apostrophes both preceded and succeeded by a letter character and replaces them with an inline SILE command (apostrophe hack) defined as follows:

SILE.registerCommand("ah", function ()
  SILE.call("discretionary", { prebreak = "-", replacement = "’" })
end)
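
For anyone who would rather not run Perl, here is a rough Python equivalent of the preprocessing step. It is a sketch, not what my build actually uses: `add_apostrophe_hack` is a name made up for this example, and `[^\W\d_]` is the usual `re` stand-in for Perl's `\p{L}`.

```python
import re

# U+2019 with a letter on each side. Python's `re` has no \p{L}, so
# [^\W\d_] ("word character that is not a digit or underscore") is used.
INTRA_WORD_APOSTROPHE = re.compile(r"(?<=[^\W\d_])’(?=[^\W\d_])")


def add_apostrophe_hack(line: str) -> str:
    """Swap intra-word apostrophes for the ah command, skipping headings."""
    if line.startswith("#"):
        return line
    # A function is used as the replacement so that the backslash in
    # the command name is inserted literally instead of being parsed
    # as an escape sequence by re.sub.
    return INTRA_WORD_APOSTROPHE.sub(lambda match: r"\ah{}", line)


print(add_apostrophe_hack("Mesih’in Müjdesi’nin tek gerçek"))
print(add_apostrophe_hack("# Mesih’in Müjdesi"))
```

The first call rewrites both apostrophes; the second leaves the heading line untouched.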

That will get me by for this book (and the output looks a lot better) but it also has some downsides. For whatever reason it completely changes the outcome of the line breaker, even in cases where apostrophes do not fall at a possible break point. Lines that have an apostrophized word even in the middle of the line are being broken at different points, and more emergency stretching is involved. I suspect the usual penalty weight imposed by hyphens is causing the math to work out differently, but I'm not sure exactly how. The change is certainly not for the better, so a proper fix to the hyphenation system to understand more advanced patterns is certainly still in order.

Thanks for the input.

alerque · 2019-02-21T08:28:58Z

Another unfortunate side effect of this hack is single character suffixes ending up on the next line from the word they are attached to!

Kutsal Kitap'ı shouldn't really be left with the ı hanging here, but the \discretionary hack described above makes the break point too attractive.

simoncozens · 2019-02-21T11:13:53Z

Well, if that’s not a hyphenation point, don’t put a discretionary there! Maybe your Perl regexp needs adjusting?
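
Something along these lines, sketched in Python rather than Perl: only treat the apostrophe as breakable when enough letters follow it. The two-letter minimum and the names `MIN_SUFFIX` and `mark_breakable_apostrophes` are assumptions for illustration, not a rule taken from any style guide or from SILE.

```python
import re

LETTER = r"[^\W\d_]"  # stand-in for Perl's \p{L}
MIN_SUFFIX = 2        # assumed minimum number of letters after the apostrophe

# Match U+2019 only when a letter precedes it and at least
# MIN_SUFFIX letters follow it.
BREAKABLE_APOSTROPHE = re.compile(
    rf"(?<={LETTER})’(?={LETTER}{{{MIN_SUFFIX}}})"
)


def mark_breakable_apostrophes(line: str) -> str:
    """Insert the ah command only where the suffix is long enough."""
    if line.startswith("#"):
        return line
    return BREAKABLE_APOSTROPHE.sub(lambda match: r"\ah{}", line)


print(mark_breakable_apostrophes("Kutsal Kitap’ı okudu"))  # unchanged
print(mark_breakable_apostrophes("Müjdesi’nin tek"))       # apostrophe replaced
```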

I am halfway through implementing libhyphen support, but since I can’t find extended hyphenation patterns for Turkish, I am not sure how useful it will actually be for you...

alerque · 2022-08-13T12:50:54Z

> Well, if that’s not a hyphenation point, don’t put a discretionary there! Maybe your Perl regexp needs adjusting?

Adding the hack isn't the problem. If I don't do anything at all, SILE makes the same mistake, only worse, because it also doesn't make the replacement.

In other news, the workaround I've been using for years no longer works in v0.14+. I'm not sure why yet, but introducing manual discretionary nodes now breaks alignment completely.

alerque · 2022-08-13T18:21:07Z

This is not going to be easy to solve. I started tinkering with the hyphenator and got something working, only to run into another problem more visible than the one I was trying to solve. If you set up correct patterns for hyphenation around an intra-word ', the next problem you get is that it throws off all the other cases. The hyphenator is designed to work on word tokens. Words that have apostrophes in them should be considered as whole words on each side of the apostrophe, i.e. this is bad:

\begin[papersize=a7]{document}
\nofolios
\neverindent
\set[parameter=document.lskip,value=9%pw]
\set[parameter=document.rskip,value=9%pw]
\language[main=tr]
\font[size=25pt]
Afyonkarahisar'danmış
\end{document}

That single letter should not have been allowed to break because it should have been considered the end of a word. My hack actually avoided this problem because it dropped a command in between two words so the tokenizer always treated them separately.
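
To pin down what "whole words on each side of the apostrophe" would mean, here is a small Python sketch that filters candidate break points using per-segment minimums. Everything in it is hypothetical: `filter_breaks`, the two-letter minimums and the candidate positions are invented for illustration and are not taken from SILE's hyphenator or its Turkish patterns.

```python
APOSTROPHES = "'’"


def letters_left(word: str, pos: int) -> int:
    """Count letters between `pos` and the previous apostrophe (or word start)."""
    i = pos
    if i > 0 and word[i - 1] in APOSTROPHES:
        i -= 1  # a break right after an apostrophe ends the segment before it
    count = 0
    while i > 0 and word[i - 1] not in APOSTROPHES:
        i -= 1
        count += 1
    return count


def letters_right(word: str, pos: int) -> int:
    """Count letters between `pos` and the next apostrophe (or word end)."""
    count = 0
    while pos + count < len(word) and word[pos + count] not in APOSTROPHES:
        count += 1
    return count


def filter_breaks(word: str, candidates: list, left_min: int = 2, right_min: int = 2) -> list:
    """Keep breaks that leave enough letters on both sides within a segment."""
    return [
        pos for pos in candidates
        if letters_left(word, pos) >= left_min
        and letters_right(word, pos) >= right_min
    ]


word = "Afyonkarahisar'danmış"
# Invented candidates: a break at index p splits word[:p] from word[p:].
print(filter_breaks(word, [2, 5, 7, 9, 11, 13, 15, 18]))
# [2, 5, 7, 9, 11, 15, 18]
```

Position 13 is dropped because it would strand a single letter in front of the apostrophe, while the break right after the apostrophe (15) survives. The same filter rejects the lone ı in Kitap'ı.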

Bah!

alerque modified the milestones: v0.9.5, v0.9.6 Aug 12, 2016

alerque self-assigned this Jun 19, 2018

alerque modified the milestones: v0.9.6, v0.9.7 Jan 11, 2019

alerque added a commit to sile-typesetter/casile that referenced this issue Feb 21, 2019

Add hack for Turkish apostrophe-hyphenation rule

3f454df

To be removed when [upstream issue][1] is fixed. [1]: sile-typesetter/sile#355

alerque added a commit to sile-typesetter/casile that referenced this issue Feb 21, 2019

Add hack for Turkish apostrophe-hyphenation rule

3bdd107

To be removed when [upstream issue][1] is fixed. [1]: sile-typesetter/sile#355

alerque mentioned this issue Apr 22, 2019

Discressionaries in \href aren't counted towards width #583

Closed

alerque mentioned this issue Jul 22, 2020

Fix #532 by sorting table keys when processing content #974

Merged

alerque mentioned this issue Sep 17, 2020

Shaper doesn't have any way to break long strings of digits into lines #1060

Open

alerque modified the milestones: v0.11.2, v0.11.x Sep 16, 2021

alerque mentioned this issue Dec 16, 2021

Hyphenation issue after apostrophe? #1297

Closed

alerque modified the milestones: v0.12.1, v0.12.x Jan 12, 2022

alerque modified the milestones: v0.12.3, v0.12.x Mar 2, 2022

alerque mentioned this issue Mar 26, 2022

Something in break.lua broken #1227

Open

alerque modified the milestones: v0.12.x, v0.13.0 Apr 18, 2022

alerque modified the milestones: v0.13.0, v0.13.x May 21, 2022

alerque modified the milestones: v0.13.x, v0.14.x Jun 24, 2022

alerque closed this as completed Aug 13, 2022

alerque reopened this Aug 13, 2022

This was referenced Aug 13, 2022

Ligatures and line justification issues #1362

Closed

Discretionary glyphs vs. hyphenation #1527

Merged

Pushback not completely fixed #368

Open

Discretionaries used for measurements incorrectly added to node queue #1528

Closed

alerque closed this as completed in #1527 Aug 16, 2022

alerque modified the milestones: v0.14.x, v0.14.3 Nov 18, 2022

alerque added the enhancement label Nov 18, 2022

alerque mentioned this issue Nov 3, 2023

Remove apostrophes when hyphenating Turkish typst/typst#2580

Open

alerque mentioned this issue Jan 23, 2024

Accommodate alternative Turkish hyphenation rules #1969

Merged
