Creating hyphenation patterns for Church Slavonic
- Python installed (Python 3 is recommended as it is noticeably faster)
- pypatgen version 0.2.9 or better
To build the TeX file with hyphenation patterns and hyphenation exceptions, run:

    make clean
    make

Result should be a file
When one of the inputs changes, run make to rebuild the result.
To build the CTAN package (the ZIP file hyphen-churchslavonic.zip, ready to be uploaded to CTAN), do:
Explanations of (some) files in this directory (note that some of the mentioned files are created by
- Makefile - does all the dirty work
- hyph-cu.tex - the ultimate result
- hyph-cu.err - hyphenation exceptions in pattern format
- cv4.sh - scripts used to do 2-fold, 3-fold, and 4-fold cross-validation of trained syllable patterns
- combiner_patterns.txt - contains TeX patterns inhibiting hyphenation before a combining symbol (such as an accent). These patterns are generated with the make_pats.py script. If you want to change it, change the script, not
- root_patterns.txt - hand-crafted patterns for some common roots that raw patterns do not cover.
- single_patterns.txt - patterns that inhibit hyphenation before the last character and after the first character (note that, because of the accents, setting righthyphenmin=2 is not sufficient).
- specsX.py - specifications for training syllable patterns.
- syl_patt.txt - syllable patterns, learned from the syllable dictionary.
- syl_patt.err - errors that the syllable patterns make on the syllable dictionary.
- words-hyph.txt - hyphenation dictionary
- words-hyph-expanded.txt - expanded hyphenation dictionary
- make_hyph.py - script to generate the hyphenation dictionary from the syllable dictionary. The hyphenation dictionary differs from the syllable dictionary because hyphenation is not allowed before the last vowel (even if it forms a syllable), and generally (with some exceptions) not allowed after a single leading vowel.
- make_pats.py - script to generate
- expand.py - does recursive NFC<->NFD expansion of characters
- cu_hyph.tex - TeX hyphenation patterns (in NFC form, unexpanded)
- cu_hyph_expanded.tex - expanded patterns, which still contain some parasites
In Church Slavonic, words are hyphenated at syllable boundaries. Syllable boundaries are determined by complex morphological rules, which are captured in the syllable dictionary. No hyphenation is allowed before a single-letter syllable at the end of a word, and no hyphenation is allowed after a single-letter syllable at the beginning of a word, with some exceptions (see below). By a single-letter syllable we mean a syllable that has a single letter (always a vowel), optionally followed by one or more combining marks (accents); by spelling convention, any vowel at the beginning of a word always carries a breathing mark.
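The rule above can be sketched in a few lines of Python. This is only an illustration, not the actual make_hyph.py: the function names are hypothetical, the input is assumed to be a word with `-` at every syllable boundary, and a "single-letter syllable" is detected by counting characters that are not Unicode combining marks.

```python
import unicodedata

# Leading single letters after which a hyphen stays allowed: ot, uk, omega
# (taken from the list of exceptions in this document).
ALLOWED_LEADING = {"\u047f", "\u0479", "\u0461"}


def base_letters(syllable):
    """Return the characters of a syllable that are not combining marks."""
    return [ch for ch in syllable if not unicodedata.category(ch).startswith("M")]


def hyphenation_from_syllables(word, sep="-"):
    """Drop the syllable breaks that are not valid hyphenation points."""
    syllables = word.split(sep)
    if len(syllables) < 2:
        return word
    # keep[i] tells whether the break between syllable i and i+1 survives
    keep = [True] * (len(syllables) - 1)
    first = base_letters(syllables[0])
    if len(first) == 1 and first[0] not in ALLOWED_LEADING:
        keep[0] = False   # no hyphen after a leading single-letter syllable
    if len(base_letters(syllables[-1])) == 1:
        keep[-1] = False  # no hyphen before a trailing single-letter syllable
    result = syllables[0]
    for flag, syllable in zip(keep, syllables[1:]):
        result += (sep if flag else "") + syllable
    return result
```

For instance, an illustrative input whose first syllable is и plus a breathing mark loses its first break, while a word starting with ѡ plus a breathing mark keeps it.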
We create TeX hyphenation patterns in the following stages:
Create syllable patterns. We do this by training on the syllable dictionary words.txt and adding some hand-crafted patterns to cover common roots where the learned patterns fail. Specifically:
Use the hyphenation dictionary to automatically generate patterns. Since the dictionary uses Unicode Normalization Form C (NFC), all patterns will also be in this form.
Analyze errors that patterns make on the hyphenation dictionary and add (long) patterns that cover offending stems. This essentially allows one to move some errors out of the exception list into patterns, hoping that the new pattern will cover all word forms (not just forms seen within the training dictionary).
Add special patterns that ensure that no hyphenation happens before a combining character. Since Church Slavonic uses a rich set of diacritical marks, we do not rely on step 1 to find all of these places, and just add these rules explicitly.
This step is performed by
Add patterns to suppress hyphenation after the first letter and before the last letter. Note that we cannot rely here on the TeX mechanism of righthyphenmin because (i) TeX also counts accents and breathing marks as characters when counting "letters", and (ii) there are exceptional cases when hyphenation after a single letter at the beginning of a word is allowed. To achieve our goal we generate a list of all vowels and then list all vowels with all possible (and sometimes impossible) combining accents. From this list we create inhibiting patterns for word prefixes and suffixes.
Since Church Slavonic allows hyphenation after some vowels at the beginning of a word, we remove the corresponding inhibiting prefix rules:
- Hyphenation is allowed after a leading syllable ѿ (OT), ѹ (UK), and ѡ (OMEGA)
At this stage we also generate the TeX list of hyphenation exceptions.
Expand patterns and the exception list by replacing each character with its Normal Form D. Note that for robustness we create all combinations of D and C forms for every character that has these different forms. This is different from just converting each pattern and exception to Normal Form D. To the built-in Unicode combining rules we add the following two extra rules:
- U+0479 <-> U+1C82 U+0443 (Sharp-O decomposition of UK) - this will be included in the upcoming revision of Unicode, but current engines do not know this yet
- U+047D <-> U+A64D U+0486 U+0311 (omega with veliky apostrof) - this symbol is incorrectly marked as not decomposable in Unicode
Remove some patterns that do not do any good, but after pattern expansion actually do some harm (we call them "parasitic patterns"). An example is the pattern 1б2ве. Consider the word любвѐ, which should be hyphenated as люб-вѐ. When the last character is expanded to its NFD (decomposed) form, the pattern activates and kills the valid hyphenation. Removing this pattern does not affect the hyphenation of any other dictionary word. We have a total of 6 such parasites, listed in
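The parasite effect described above can be reproduced with a toy pattern matcher. The sketch below is a minimal reimplementation of Liang-style pattern application (odd levels allow a break, even levels forbid it, the highest level wins); it ignores lefthyphenmin and righthyphenmin. The pattern `б1в` is a hypothetical stand-in for whatever real patterns produce the break in люб-вѐ; only `1б2ве` comes from this document.

```python
import re
import unicodedata


def parse_pattern(pattern):
    """Split a TeX pattern into its letters and its inter-letter levels."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)   # level sits before letters[pos]
        else:
            pos += 1
    return letters, levels


def hyphenate(word, patterns):
    """Insert '-' wherever the winning level between two characters is odd."""
    text = "." + word + "."
    slots = [0] * (len(text) + 1)   # slots[i] is the gap before text[i]
    for letters, levels in map(parse_pattern, patterns):
        start = text.find(letters)
        while start != -1:
            for offset, value in enumerate(levels):
                slots[start + offset] = max(slots[start + offset], value)
            start = text.find(letters, start + 1)
    pieces = []
    for j, ch in enumerate(word):
        if j > 0 and slots[j + 1] % 2 == 1:
            pieces.append("-")
        pieces.append(ch)
    return "".join(pieces)


PATTERNS = ["б1в", "1б2ве"]         # "б1в" is hypothetical, see above
word_nfc = "любв\u0450"             # last letter precomposed (NFC)
word_nfd = unicodedata.normalize("NFD", word_nfc)

hyph_nfc = hyphenate(word_nfc, PATTERNS)   # parasite does not match
hyph_nfd = hyphenate(word_nfd, PATTERNS)   # parasite matches and wins
```

In NFC the last letter is the single code point U+0450, so `1б2ве` never matches and the break between б and в survives. In NFD the same letter becomes U+0435 U+0300, the parasite matches, and its level 2 overrides the odd level between б and в.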
Generating syllable patterns

    ./train.sh > train.log

The input here is the syllable dictionary words.txt. The output is a file containing raw syllable patterns and raw syllable exceptions.
Training parameters were chosen manually after several trial-and-error sessions, with the objective of achieving the best possible generalization performance.
For pattern generation we used pypatgen tool, version 0.2.9.
Training process is detailed in a separate document.
err_raw_patterns.txt that lists all syllable exceptions in the form of a full-word pattern. It is used in the next step to make rules more general and more compact.
Add "long" patterns from exceptions
In the exception list one can often see many forms of words that share the same root. It makes sense to make a "long" prefix pattern that covers the offending root and all its word forms. For example,
бо-лѣ́-зней бо-лѣ́-знемъ бо-лѣ́-знен-нѡ бо-лѣ́-знен-ны-ѧ бо-лѣ́-зни
It is much more efficient to replace all these exceptions with a single pattern:
The upside is that other forms with the same root will now have correct hyphenation in the root part, even though they were not provided in the dictionary.
Generally speaking, hyphenation of suffixes is more reliable than hyphenation of roots. The reason is that suffix hyphenation is learned from all words in the dictionary, while root hyphenation is learned only from words containing that root.
Note that the pattern generation step failed to build this "long" pattern because we limit the pattern length (for better generalization).
To assist in making such "long" prefix patterns, pypatgen can generate an error report in the form of suggested full-word patterns (use option -p of the
The result of this (manual) work is the file root_patterns.txt with "long" patterns.
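As a rough illustration of how a candidate "long" prefix pattern could be derived from a group of same-root exceptions, the sketch below takes the common prefix of the hyphenated forms quoted earlier and turns each `-` into an odd level. The function names and the choice of level 1 are assumptions; the result is only a candidate for manual review and is not necessarily the pattern stored in root_patterns.txt.

```python
import os
import re

# The five exception forms quoted above, with "-" marking allowed breaks.
EXCEPTIONS = [
    "бо-лѣ́-зней",
    "бо-лѣ́-знемъ",
    "бо-лѣ́-знен-нѡ",
    "бо-лѣ́-знен-ны-ѧ",
    "бо-лѣ́-зни",
]


def suggest_prefix_pattern(hyphenated_words, break_level=1):
    """Build a word-initial pattern from the common hyphenated prefix."""
    # commonprefix compares strings character by character
    common = os.path.commonprefix(hyphenated_words).rstrip("-")
    # "." anchors the pattern at the start of a word
    return "." + common.replace("-", str(break_level))


def covers(pattern, hyphenated_word):
    """Check that the pattern letters are a prefix of the unhyphenated word."""
    letters = re.sub(r"[.\d]", "", pattern)
    return hyphenated_word.replace("-", "").startswith(letters)


candidate = suggest_prefix_pattern(EXCEPTIONS)
```

Because the candidate is anchored at the word start, it also applies to forms of the same root that are absent from the dictionary, which is the generalization benefit described above.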
Adding special rules for combining symbols and digraphs
Do not split the digraph LOWERCASE UK with a hyphen:
U+0443
Also do not split the variant form
Do not allow a hyphen before the following symbols (combiners):
- combining grave:
- combining acute:
- combining inverted breve:
- combining Cyrillic psili pneumata:
- combining breve:
- combining vertical tilde:
- combining paerok:
- combining kavyka:
- combining dot above:
- combining diaeresis:
- combining double grave:
- combining Cyrillic titlo:
- combining pokrytie:
- combining Cyrillic letter A:
- combining Cyrillic letter Be:
- combining Cyrillic letter Ve:
- combining Cyrillic letter Ge:
- combining Cyrillic letter De:
- combining Cyrillic letter Ie:
- combining Cyrillic letter Zhe:
- combining Cyrillic letter Ze:
- combining Cyrillic letter I:
- combining Cyrillic letter Ka:
- combining Cyrillic letter El:
- combining Cyrillic letter Em:
- combining Cyrillic letter En:
- combining Cyrillic letter O:
- combining Cyrillic letter Pe:
- combining Cyrillic letter Er:
- combining Cyrillic letter Es:
- combining Cyrillic letter Te:
- combining Cyrillic letter Monograph Uk:
- combining Cyrillic letter Ef:
- combining Cyrillic letter Kha:
- combining Cyrillic letter Tse:
- combining Cyrillic letter Che:
- combining Cyrillic letter Sha:
- combining Cyrillic letter Shcha:
- combining Cyrillic letter Yat:
- combining Cyrillic letter Yu:
- combining Cyrillic letter Iotified A:
- combining Cyrillic letter Little Yus:
- combining Cyrillic letter Fita:
- combining Cyrillic letter Es-Te: U+2DF5 (Note: Unicode discourages use of this character)
- combining grave:
Note that other combining letters do not occur in Poluustav or Synodal Slavonic. See UTN 41 for details.
Do not hyphenate before:
- yer (ъ)
- tall yer
- soft sign
Do not hyphenate before or after these symbols:
- combining titlo left half:
- combining titlo right half:
- combining conjoining macron:
In the hand-crafted rules above, mark "(auto)" denotes patterns that were found automatically during step 1.
The result of this work is the file combiner_patterns.txt. Note that for convenience this file is actually generated programmatically, with the help of the utility
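A minimal sketch of how such inhibiting patterns could be generated is shown below. It is not the actual generator: the function names are hypothetical, only a handful of the combining marks listed above are included (looked up by their official Unicode names), and the level 8 is an assumed choice. What matters in Liang's scheme is only that the level is even, so that it forbids a break, and high enough to override learned patterns.

```python
import unicodedata

# A few of the combining marks from the list above, by Unicode name.
COMBINER_NAMES = [
    "COMBINING GRAVE ACCENT",
    "COMBINING ACUTE ACCENT",
    "COMBINING INVERTED BREVE",
    "COMBINING CYRILLIC PSILI PNEUMATA",
    "COMBINING BREVE",
    "COMBINING CYRILLIC TITLO",
]


def inhibit_before(chars, level=8):
    """Patterns that forbid a hyphen immediately before each character."""
    return [f"{level}{ch}" for ch in chars]


def inhibit_around(chars, level=8):
    """Patterns that forbid a hyphen both before and after each character."""
    return [f"{level}{ch}{level}" for ch in chars]


combiners = [unicodedata.lookup(name) for name in COMBINER_NAMES]
patterns = inhibit_before(combiners)
```

The second helper corresponds to the "do not hyphenate before or after" group of symbols; it would be applied to those characters instead of `inhibit_before`.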
Building hyphenation patterns from syllable patterns
To make hyphenation patterns we need to inhibit hyphenation after a leading single-letter syllable and before a trailing single-letter syllable.
To do this we programmatically generate the file single_patterns.txt, using the utility make_pats.py. These inhibiting patterns suppress unwanted hyphens, and make a special exception for those few cases when hyphenation after a single letter at the beginning of a word is allowed.
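The generation of prefix and suffix inhibitors can be sketched as follows. This is an illustration under stated assumptions rather than the real make_pats.py: the vowel list is a small sample, the set of accents is reduced to three marks, every ordering of up to two marks is generated (including impossible ones, as the text permits), and level 8 is an assumed inhibiting level.

```python
import itertools

PSILI, ACUTE, GRAVE = "\u0486", "\u0301", "\u0300"

# Sample vowels only (a, ie, i, omega); the real list is longer.
SAMPLE_VOWELS = ["\u0430", "\u0435", "\u0438", "\u0461"]

# Leading letters after which hyphenation stays allowed: ot, uk, omega.
ALLOWED_LEADING = {"\u047f", "\u0479", "\u0461"}


def accent_combinations(marks, max_marks=2):
    """All orderings of up to max_marks distinct marks, plus no mark at all."""
    combos = [""]
    for n in range(1, max_marks + 1):
        combos += ["".join(p) for p in itertools.permutations(marks, n)]
    return combos


def single_letter_patterns(vowels, marks, level=8):
    """Return (prefix_patterns, suffix_patterns) for single-letter syllables."""
    prefixes, suffixes = [], []
    for vowel, accents in itertools.product(vowels, accent_combinations(marks)):
        syllable = vowel + accents
        # "." matches a word boundary in TeX patterns
        suffixes.append(f"{level}{syllable}.")
        if vowel not in ALLOWED_LEADING:
            prefixes.append(f".{syllable}{level}")
    return prefixes, suffixes


prefixes, suffixes = single_letter_patterns(SAMPLE_VOWELS, [PSILI, ACUTE, GRAVE])
```

Skipping the allowed leading letters at generation time has the same effect as generating all prefix rules and removing the exceptional ones afterwards.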
We run the build.sh script to build syllable and hyphenation patterns and to generate the hyphenation TeX file:

    ./build.sh > build.log

err_patterns.txt. The latter lists hyphenation exceptions and is just a different representation of words within the \hyphenation clause in the
Expanding patterns and exceptions
The input here is cu_hyph.tex and the output is
The hyphenation dictionary contains only the following characters that have different NFD forms:

    TABLE = [
        ('\u0400', ['\u0415\u0300']),  # E grave
        ('\u0401', ['\u0415\u0308']),  # IO
        ('\u0403', ['\u0413\u0301']),  # GJE
        ('\u0407', ['\u0406\u0308']),  # YI
        ('\u040c', ['\u041a\u0301']),  # KJE
        ('\u040d', ['\u0418\u0300']),  # I grave
        ('\u040e', ['\u0423\u0306']),  # SHORT U
        ('\u0419', ['\u0418\u0306']),  # SHORT I
        ('\u0439', ['\u0438\u0306']),  # short i
        ('\u0450', ['\u0435\u0300']),  # e grave
        ('\u0451', ['\u0435\u0308']),  # io
        ('\u0453', ['\u0433\u0301']),  # ghe
        ('\u0457', ['\u0456\u0308']),  # yi
        ('\u045c', ['\u043a\u0301']),  # kje
        ('\u045d', ['\u0438\u0300']),  # i grave
        ('\u045e', ['\u0443\u0306']),  # short u
        ('\u0476', ['\u0474\u030f']),  # IZHITSA with double grave
        ('\u0477', ['\u0475\u030f']),  # izhitsa with double grave
        ('\u0479', ['\u1c82\u0443']),  # uk
        ('\u047d', ['\ua64d\u0486\u0311']),  # broad omega with veliky apostrof
    ]

and the following recursive function was used to expand each pattern and each exceptional word:
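
    import re
    import unicodedata

    def nfc(string):
        # Helper not shown in the original; assumed to be plain NFC normalization.
        return unicodedata.normalize('NFC', string)

    _EXPLODE_MAP = dict(TABLE)
    _EXPLODE_REX = re.compile('|'.join(re.escape(x) for x in _EXPLODE_MAP.keys()))

    def explode_nfd(string):
        '''
        Takes string and generates all possible equivalent representations
        by substituting each expandable character with all possible NFD expansions.
        '''
        string = nfc(string)
        mtc = _EXPLODE_REX.search(string)
        if mtc is None:
            # Nothing left to expand: the string is its only representation.
            yield string
            return
        prefix = string[:mtc.start()]
        explodable = mtc.group()
        rest = string[mtc.end():]
        # Combine every variant of the remainder with both the composed
        # character and each of its decomposed forms.
        for suffix in explode_nfd(rest):
            yield prefix + explodable + suffix
            for expansion in _EXPLODE_MAP[explodable]:
                yield prefix + expansion + suffix
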
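For comparison, the same set of variants can be produced without recursion. The sketch below is an alternative formulation, not the code used in expand.py: `SAMPLE_TABLE` is a three-entry excerpt of the table above and `explode_variants` is a hypothetical name. It builds, for every character, the list of its possible spellings and takes the Cartesian product.

```python
import itertools
import unicodedata

# Excerpt of the expansion table above (composed form -> decomposed forms).
SAMPLE_TABLE = {
    "\u0439": ["\u0438\u0306"],  # short i
    "\u0450": ["\u0435\u0300"],  # e grave
    "\u0479": ["\u1c82\u0443"],  # uk (extra rule, not a Unicode decomposition)
}


def explode_variants(string, table=SAMPLE_TABLE):
    """Return every mix of composed and decomposed spellings of string."""
    string = unicodedata.normalize("NFC", string)
    # Each character contributes itself plus any decomposed alternatives.
    choices = [[ch] + table.get(ch, []) for ch in string]
    return ["".join(parts) for parts in itertools.product(*choices)]


variants = explode_variants("\u0450\u0439")
```

A string with k expandable characters, each having one decomposition, yields 2^k variants, which is why expansion is applied to individual patterns and exception words rather than to larger units.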