Drop characters when hyphenating #355

Closed
alerque opened this issue Aug 8, 2016 · 11 comments · Fixed by #1527
Labels: enhancement (Software improvement or feature request)

Comments

@alerque
Member

alerque commented Aug 8, 2016

So the feature added at my request in #265 to set up custom hyphenation patterns is great for words that are exceptions to the normal rules, but I'm finding it leaves me hanging when it comes to the rules themselves.

I've recently discovered that there is more than one style guide for how to hyphenate words with internal apostrophes in Turkish. My personal preference is to place the hyphenation point after the apostrophe, but this is not what all publishing houses want to see. Many of them require the apostrophe to be dropped if the word gets hyphenated at that point. For example,

müjdesinin

might normally hyphenate as

müj-de-si-nin

However, when it is used as a proper noun, the suffixes (after the third person possessive) are set apart with an apostrophe:

Müjdesi’nin

Müj-de-si’-nin

But if the hyphenation point at the apostrophe is actually used for a line break, the apostrophe should go away:

Mesih'in Müjdesi-
nin tek gerçek

The problem here is twofold.

  1. There are a huge number of words where this could be an issue, so adding case-by-case rules is cumbersome. My current solution for getting hyphens after apostrophes rather than before is to search the full text of my books for all the apostrophized words and manually add them to an exception list with the pre-apostrophe hyphen points removed.
  2. The current exception handler does not provide a mechanism for the input and output strings to be different. The input of the function is the hyphenated form, and the function assumes the word is that string with the hyphens removed. This can't be used to set up a publishing workflow that conforms to the expectation of apostrophes being dropped when they are used as hyphenation points.
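
A minimal Python sketch of the limitation in point 2, assuming an exception handler that derives the word by stripping hyphens from the entry. This is not SILE's actual code; `parse_exception` and `break_word` are hypothetical names used only for illustration.

```python
def parse_exception(entry: str, hyphen: str = "-"):
    """Split a hyphenated exception entry into (word, break offsets)."""
    # The lookup key is simply the entry with its hyphens removed.
    word = entry.replace(hyphen, "")
    breaks, offset = [], 0
    for chunk in entry.split(hyphen)[:-1]:
        offset += len(chunk)
        breaks.append(offset)
    return word, breaks


def break_word(word: str, position: int, hyphen: str = "-") -> str:
    """Break the word at an offset; nothing but a hyphen can be added."""
    return word[:position] + hyphen + "\n" + word[position:]


word, breaks = parse_exception("Müj-de-si’-nin")
print(word)                          # the key still contains the apostrophe
print(break_word(word, breaks[-1]))  # so the broken form keeps it as well
```

Because the entry only encodes where breaks may go, there is no slot in which to say "drop this character if the break is taken".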
@simoncozens
Member

I'll need to come back to this, but I think the answer will be to replace the apostrophe with a \discretionary command like so: \discretionary{-}{}{'}. i.e. produce an apostrophe if not hyphenated, and produce a hyphen with no apostrophe if hyphenated. Of course this means I will have to check that the nobreak part of the discretionary works. I can't remember if I implemented that yet.
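
To make the three parts of a discretionary concrete, here is a small Python model of the idea. It is only a sketch: the field names `prebreak` and `replacement` follow the options used later in this thread, `postbreak` is assumed as the name for the middle (after-break) part, and `render` is a hypothetical helper, not a SILE function.

```python
from dataclasses import dataclass


@dataclass
class Discretionary:
    prebreak: str = ""     # set at the end of the line if the break is taken
    postbreak: str = ""    # set at the start of the next line if the break is taken
    replacement: str = ""  # set in place if the break is not taken


def render(items, break_at=None):
    """Join text and discretionaries, taking the break at index `break_at`."""
    out = []
    for index, item in enumerate(items):
        if isinstance(item, Discretionary):
            if index == break_at:
                out.append(item.prebreak + "\n" + item.postbreak)
            else:
                out.append(item.replacement)
        else:
            out.append(item)
    return "".join(out)


word = ["Müjdesi", Discretionary(prebreak="-", replacement="'"), "nin"]
print(render(word))              # unbroken: the apostrophe is shown
print(render(word, break_at=1))  # broken: hyphen, newline, no apostrophe
```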

@alerque alerque modified the milestones: v0.9.5, v0.9.6 Aug 12, 2016
@alerque
Member Author

alerque commented Dec 20, 2016

At what stage in the processing would you envision this being implemented? An inputfilter()?

@alerque
Member Author

alerque commented Jan 5, 2017

If I replace all the apostrophes in my input text with discretionaries, then none of the text that gets run through pushBack() has any apostrophes at all.

@khaledhosny
Contributor

The Hyphen hyphenation library supports what it calls “non-standard hyphenation”, where changes are applied to the text when it is hyphenated. It might be what you are looking for.

This was requested in #277, but SILE will either need to use Hyphen instead of its own code, or support the enhanced hyphenation patterns.

@alerque alerque self-assigned this Jun 19, 2018
@alerque alerque modified the milestones: v0.9.6, v0.9.7 Jan 11, 2019
@alerque
Member Author

alerque commented Feb 19, 2019

Small poke. I've got another book going to press next week and it's turning up a rather large number of these errors (hyphenation at apostrophes not removing the apostrophe). If you've had any ideas about where in the process this should be implemented, I'd be all up for trying it in the next few days.

@simoncozens
Member

So, this works:

\begin[papersize=a6]{document}
\language[main=tr]
\font[size=30pt]
Müjdesi\discretionary[prebreak="-",replacement="'"]nin
Müjdesi\discretionary[prebreak="-",replacement="'"]nin
\end{document}

[Screenshot, 2019-02-20 13:38:51]

At the moment we don't have a way for hyphenation dictionaries to do that clever stuff. If it's a small number of obvious cases I would suggest either using an input filter or defining a command like \magicapostrophehyphenator{Müjdesi’nin}.

The bigger fix is obviously to support enhanced hyphenation patterns as Khaled suggests. I will try to implement that over the next day or two.

alerque added a commit to sile-typesetter/casile that referenced this issue Feb 21, 2019
To be removed when [upstream issue][1] is fixed.

[1]: sile-typesetter/sile#355
@alerque
Member Author

alerque commented Feb 21, 2019

I don't know what to say. Either my testing two years ago was flawed (without an MWE that's entirely possible) or something else has changed in SILE since then, because \discretionary does in fact work. Not without some weird artifacts, however. For example, for some reason it botches centered chapter headlines, so I have to exclude those. With this workaround, any apostrophes in chapter titles would show up after the end of the chapter title instead of in their normal place in the text.

Doing it by hand or for specific words would be a nightmare given the hundreds of words involved. I was able to get my current book looking better by preprocessing the Markdown source. This is a lot easier than trying to set it up in SILE using an input filter, simply by virtue of brute-force access to the raw text.

perl -Mutf8 -CS -pne '/^#/ or s/(?<=\p{L})’(?=\p{L})/\\ah{}/g'

That skips all headings, matches apostrophes that are both preceded and followed by a letter character, and replaces them with an inline SILE command (apostrophe hack) defined as follows:

-- "ah" (apostrophe hack): an apostrophe normally, a hyphen if the line breaks here
SILE.registerCommand("ah", function ()
  SILE.call("discretionary", { prebreak = "-", replacement = "’" })
end)
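
For readers without Perl, the preprocessing step can be sketched in Python roughly as follows. This is an illustrative port, not the script used for the book: `mark_apostrophes` is a hypothetical name, and because Python's built-in `re` module has no `\p{L}`, the class `[^\W\d_]` is used as a stand-in for "any letter".

```python
import re

# Any Unicode letter: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"
# A typographic apostrophe with a letter on both sides.
APOSTROPHE = re.compile(rf"(?<={LETTER})’(?={LETTER})")


def mark_apostrophes(line: str) -> str:
    """Replace intra-word apostrophes with \\ah{}, leaving headings alone."""
    if line.startswith("#"):
        return line
    # A function is used as the replacement so that the backslash in
    # "\ah{}" is inserted literally rather than parsed as an escape.
    return APOSTROPHE.sub(lambda match: r"\ah{}", line)


print(mark_apostrophes("Mesih’in Müjdesi’nin tek gerçek"))
print(mark_apostrophes("# Mesih’in Müjdesi"))
```

As with the Perl one-liner, the function is meant to be applied line by line, so a closing quotation mark followed by a space or punctuation is left untouched.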

That will get me by for this book (and the output looks a lot better) but it also has some downsides. For whatever reason it completely changes the outcome of the line breaker, even in cases where the apostrophes do not fall at a possible break point. Lines that have apostrophized words even in the middle of the line are being broken at different points, and more emergency stretching is involved. I suspect the usual penalty weight imposed by hyphens is causing the math to work out differently, but I'm not sure exactly how. The change is certainly not for the better, so a proper fix that lets the hyphenation system understand more advanced patterns is certainly still in order.

Thanks for the input.
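
As background for the penalty suspicion above, this is how TeX's line breaker scores a candidate line. It is only a sketch of TeX's published demerits formula with TeX's default values; whether SILE's breaker uses the same constants, or whether this is what actually causes the changed breaks, is an assumption and not something established in this thread.

```python
def demerits(badness: int, penalty: int, line_penalty: int = 10) -> int:
    """Demerits of one line, following TeX's formula."""
    base = (line_penalty + badness) ** 2
    if penalty >= 0:
        return base + penalty ** 2
    if penalty > -10000:
        return base - penalty ** 2
    return base  # a forced break adds no penalty term


HYPHEN_PENALTY = 50  # TeX's default \hyphenpenalty; assumed here for illustration

print(demerits(badness=20, penalty=0))               # looser line ending at a space
print(demerits(badness=5, penalty=HYPHEN_PENALTY))   # tighter line ending at a discretionary
```

Each discretionary is a candidate breakpoint scored this way, so every inserted `\ah{}` gives the breaker one more option to weigh against breaking at a space.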

@alerque
Member Author

alerque commented Feb 21, 2019

Another unfortunate side effect of this hack is single-character suffixes ending up on the line after the word they are attached to!

image

Kutsal Kitap'ı shouldn't really be left with the ı hanging here, but the \discretionary hack described above makes the break point too attractive.

@simoncozens
Member

Well, if that’s not a hyphenation point, don’t put a discretionary there! Maybe your Perl regexp needs adjusting?

I am halfway through implementing libhyphen support, but since I can’t find extended hyphenation patterns for Turkish, I am not sure how useful it will actually be for you...
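
One way the regexp adjustment suggested above might look, sketched in Python: only insert the discretionary when the suffix after the apostrophe has at least a minimum number of letters. The threshold `MIN_SUFFIX` and the name `mark_long_suffixes` are hypothetical, and `[^\W\d_]` again stands in for `\p{L}`. This only keeps the hack from adding a break before a one-letter suffix; it says nothing about what the hyphenator itself does with the untouched word.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter
MIN_SUFFIX = 2        # hypothetical threshold: shortest suffix worth breaking before

# Apostrophe preceded by a letter and followed by at least MIN_SUFFIX letters.
LONG_SUFFIX = re.compile(rf"(?<={LETTER})’(?={LETTER}{{{MIN_SUFFIX}}})")


def mark_long_suffixes(line: str) -> str:
    """Insert \\ah{} only where the suffix is long enough to stand alone."""
    if line.startswith("#"):
        return line
    return LONG_SUFFIX.sub(lambda match: r"\ah{}", line)


print(mark_long_suffixes("Kutsal Kitap’ı okudu"))  # one-letter suffix: unchanged
print(mark_long_suffixes("Müjdesi’nin"))           # three-letter suffix: replaced
```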

@alerque alerque modified the milestones: v0.12.x, v0.13.0 Apr 18, 2022
@alerque alerque modified the milestones: v0.13.0, v0.13.x May 21, 2022
@alerque alerque modified the milestones: v0.13.x, v0.14.x Jun 24, 2022
@alerque
Member Author

alerque commented Aug 13, 2022

> Well, if that’s not a hyphenation point, don’t put a discretionary there! Maybe your Perl regexp needs adjusting?

Adding the hack isn't the problem. If I don't do anything at all, SILE makes the same mistake, only worse, because it also doesn't make the replacement.


In other news, the workaround I've been using for years no longer works in v0.14+. I'm not sure why yet, but introducing manual discretionary nodes now breaks alignment completely.

@alerque alerque closed this as completed Aug 13, 2022
@alerque alerque reopened this Aug 13, 2022
@alerque
Member Author

alerque commented Aug 13, 2022

This is not going to be easy to solve. I started tinkering with the hyphenator and got something working, only to run into another problem more visible than the one I was trying to solve. If you set up correct patterns for hyphenation around an intra-word ', the next problem is that it throws off all the other cases. The hyphenator is designed to work on word tokens. Words that have apostrophes in them should be treated as whole words on each side of the apostrophe, i.e. this is bad:

\begin[papersize=a7]{document}
\nofolios
\neverindent
\set[parameter=document.lskip,value=9%pw]
\set[parameter=document.rskip,value=9%pw]
\language[main=tr]
\font[size=25pt]
Afyonkarahisar'danmış
\end{document}

image

That single letter should not have been allowed to break off, because it should have been considered the end of a word. My hack actually avoided this problem because it dropped a command in between the two parts, so the tokenizer always treated them as separate words.

Bah!
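
The "whole word on each side of the apostrophe" idea can be sketched in Python as below. This is an illustration under stated assumptions, not the SILE hyphenator: `break_points` is a hypothetical name, `hyphenate` is any caller-supplied function returning candidate offsets inside a segment, and the minimum lengths of 2 on each side are assumed values.

```python
import re

APOSTROPHES = "'’"


def break_points(word, hyphenate, left_min=2, right_min=2):
    """Break offsets in `word`, treating each apostrophe-delimited part as its own word."""
    points = []
    offset = 0
    for segment in re.split(f"[{APOSTROPHES}]", word):
        for p in hyphenate(segment):
            # Enforce the minimum lengths within the segment, not the whole word.
            if left_min <= p <= len(segment) - right_min:
                points.append(offset + p)
        offset += len(segment) + 1  # step over the apostrophe
    return points


def every_position(segment):
    """Toy stand-in for real patterns: every interior offset is a candidate."""
    return range(1, len(segment))


print(break_points("Kitap'ı", every_position))
```

With the minima applied per segment, a one-letter suffix can never be split off, and no break is offered that would strand a single letter at either edge of a segment. Breaks at the apostrophe itself are deliberately left out of this sketch.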


3 participants