# Phonologists Have an Alignment Problem

# 1 Introduction
## 1.1 The Problem
### 1.1.1 Syllables
Words can be thought of as being composed hierarchically, with phonemes as the smallest units. Phonemes combine to form syllables, which in turn are arranged to form words. Syllables are present in most, if not all, languages, and minimally contain a vowel nucleus, with an optional consonant onset and coda. Treating the onset and the coda each as a single optional slot, syllables therefore come in one of four basic shapes: vowel (V), consonant-vowel (CV), vowel-consonant (VC), and consonant-vowel-consonant (CVC). Some languages preferentially use a subset of these shapes; Japanese, for instance, almost exclusively constructs words from combinations of base CV syllables.
### 1.1.2 Syllabification
Syllabification is the task of identifying the syllable boundaries in a word-level phonological encoding. \[hæ.pi\] (happy), for instance, is a two-syllable word, with a full stop denoting the boundary between the first syllable \[hæ\] and the second syllable \[pi\]. When syllabifying phoneme sequences, the maximal onset principle states that where a consonant or consonant cluster could serve either as the coda of one syllable or as the onset of the next, it is assigned to the onset of the next syllable, so long as the resulting syllable is consistent with the phonotactic constraints of the language. For example, the \[p\] in \[hæpi\] is assigned to the second syllable (\[hæ.pi\]) rather than the first (\[hæp.i\]) because \[pi\] is a valid syllable within the phonotactic constraints of English. On the other hand, \[mæntə\] is syllabified \[mæn.tə\], with the consonant cluster \[nt\] divided between the first and second syllables, because \[nt\] is not a valid onset in English (so \[ntə\] is not a valid syllable), whereas \[tə\] is.
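The maximal onset principle can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the vowel inventory and the list of permitted onsets are small, hypothetical stand-ins for the real phonotactics of English, each vowel symbol is treated as its own nucleus, and the function name `syllabify` is ours.

```python
# Toy inventories (assumptions for illustration, not a full description of English).
VOWELS = {"æ", "i", "ə", "ʌ", "ɚ"}
VALID_ONSETS = {("h",), ("p",), ("t",), ("m",), ("n",), ("ð",), ("s", "t"), ("b", "ɹ")}


def syllabify(phonemes):
    """Split a list of phoneme symbols into syllables using maximal onset."""
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not nuclei:
        return [list(phonemes)]  # no vowel: nothing to split

    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = phonemes[left + 1:right]  # consonants between two nuclei
        # Try the longest candidate onset first, then shorter ones.
        # An empty onset (split == len(cluster)) is always allowed.
        for split in range(len(cluster) + 1):
            onset = tuple(cluster[split:])
            if not onset or onset in VALID_ONSETS:
                break
        boundaries.append(left + 1 + split)

    starts = [0] + boundaries
    ends = boundaries + [len(phonemes)]
    return [list(phonemes[s:e]) for s, e in zip(starts, ends)]


def to_string(syllables):
    """Join syllables with a full stop, as in the transcriptions above."""
    return ".".join("".join(s) for s in syllables)


print(to_string(syllabify(list("hæpi"))))   # hæ.pi
print(to_string(syllabify(list("mæntə"))))  # mæn.tə
```

Because the onset list is consulted longest-first, \[p\] is claimed by the second syllable of \[hæpi\], while \[nt\] is rejected as an onset and only \[t\] moves rightward in \[mæntə\].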
### 1.1.3 Orthographic Syllabification
In languages with alphabet-based orthographies, orthographic syllabification is the process of segmenting the orthographic representation of a word in alignment with the syllables in the phonologic representation of that word. For a straightforward example, brother would be syllabified bro-ther, in alignment with the syllables \[bɹʌ.ðə\].
Orthographic syllabification as a task is less trivial than phonological syllabification; orthography is not innate, it must be taught, and so too must the rules of orthographic syllabification. In a deep orthography like English, where an orthographic unit (grapheme) may not always map to the same phoneme (think 'gh' in the words "enough", "though", and "ugh"), and vice versa, deciding where to draw syllabic boundaries depends on both knowledge of the word's pronunciation and the goal of the syllabification.
## 1.2 Background
The goal of orthographic syllabification is not always the same; phonologists and teachers may be interested in syllabifying words in order to align text to sound for the purposes of study or teaching. Meanwhile, in the typesetting industry, syllabification is useful for breaking (hyphenating) words to keep line breaks clean. Hyphenation is less concerned with the phonological qualities of the words being broken; instead, it focuses on maintaining meaning and readability. Liang's TeX hyphenation algorithm is perhaps the most famous tool for machine hyphenation currently available; it hyphenates "brother" as "broth-er," which violates the maximal onset principle but may be easier to read.
Many modern orthographic syllabification tools use hyphenation algorithms, because hyphenation algorithms often do syllabify words properly. Line breaks are typically less jarring when words are split readably, but readability tends to favor full suffixes over partial suffixes that respect syllable structure. For instance, clarity is hyphenated as clar-ity, because -ity is a common suffix that carries little meaning in English when split into -i and -ty. Despite this discrepancy, hyphenation algorithms have been a go-to "good-enough" approximation for mass orthographic syllabification in English.
## 1.3 CELEX
CELEX illustrates how this issue affects research. It is a large lexical database for English, Dutch, and German, developed by the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. For English words, the lexicon includes orthographic syllable information alongside phonological syllable information, which in principle makes it ideal for research on the role of phonological information in orthographic representation. However, CELEX appears to mix hyphenation rules and phonological rules inconsistently in its syllabifications. Some words apply the maximal onset principle correctly, such as cer-tain (\[sɚ.tən\]), while other words are incorrectly syllabified, such as clam-our (\[klæ.məR\]). We manually analyzed a random sample of 3000 multi-syllabic words from CELEX, evaluating whether the syllabification of each word was correct at all and whether it followed the maximal onset principle. Our results show that 1884 words were correctly syllabified and 1116 words were incorrectly syllabified, so around 37% of the words in the sample were incorrectly syllabified.
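The reported proportion follows directly from the two counts. The short check below is a sketch for transparency only; the function name `error_rate` is ours, and the counts are the ones stated above.

```python
def error_rate(n_incorrect, n_total):
    """Fraction of sampled words judged incorrectly syllabified."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_incorrect / n_total


n_correct, n_incorrect = 1884, 1116   # counts from the manual analysis
n_total = n_correct + n_incorrect     # 3000 sampled words

rate = error_rate(n_incorrect, n_total)
print(f"{rate:.1%}")  # 37.2%
```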
## 1.4 Goal
We aim to develop a tool that can accurately syllabify orthography. The goal is a general tool for English word syllabification that generalizes not only to out-of-dictionary English words, but also to pseudowords. A persistent problem that prior algorithms have faced is their inability to properly handle edge cases. Sometimes the pronunciation, and therefore the syllable structure, of a word in English is impossible to know from the orthography alone. Worcestershire, for example, is commonly improperly syllabified Wor-ce-ster-shire, a four-syllable word. However, its true pronunciation \[wʊs.tɚ.ʃɚ\] is tri-syllabic, which is difficult to glean from the orthography. As such, the tool we develop must fulfill a certain set of requirements:
1. It must generate plausible syllabifications for any English word and for English pseudowords
2. It must accurately syllabify words that CELEX improperly syllabifies
3. It must outperform hyphenation algorithms

With these stipulations in mind, we chose a neural-network approach to the task of syllabification. Krantz et al. employed a neural-network approach to develop a language-agnostic model of phonological syllabification, and found success in treating syllabification as a binary classification task. The advantage of the neural-network approach is that it does not rely on a dictionary of seen words and substrings, allowing it to handle any possible string sequence.
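To make the binary classification framing concrete, the sketch below converts a hyphen-delimited orthographic syllabification into one label per letter and back. The labelling convention (1 means a syllable boundary follows the letter) and the function names `to_labels` and `from_labels` are our assumptions for illustration; they are not necessarily the exact scheme used by Krantz et al.

```python
def to_labels(syllabified, sep="-"):
    """Return (word, labels); labels[i] == 1 if a boundary follows letter i."""
    letters, labels = [], []
    for ch in syllabified:
        if ch == sep:
            if labels:            # ignore a leading separator
                labels[-1] = 1
        else:
            letters.append(ch)
            labels.append(0)
    return "".join(letters), labels


def from_labels(word, labels, sep="-"):
    """Rebuild a syllabified string from a word and its per-letter labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    out = []
    for ch, lab in zip(word, labels):
        out.append(ch)
        if lab:
            out.append(sep)
    return "".join(out).rstrip(sep)  # a boundary after the last letter is meaningless


word, labels = to_labels("bro-ther")
print(word, labels)                                   # brother [0, 0, 1, 0, 0, 0, 0]
print(from_labels("brother", [0, 0, 0, 0, 1, 0, 0]))  # broth-er
```

Under this encoding, a model only has to predict one binary label per letter, so any string, including a pseudoword, can be syllabified without a dictionary lookup.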

One disadvantage of the neural-network model is interpretability. It will not be immediately obvious under what conditions the model identifies syllable boundaries, especially in the absence of phonological information.

The greatest disadvantage we will have to overcome is the need for accurately labelled data. As stated before, CELEX, one of the largest banks of syllable data, is an unreliable source of clean data. We will need a human-curated dataset in order to train a good model.