Latin Vowel Length Classifier

Vowel length is phonemic in Classical Latin, so efforts to annotate it have so far relied on dictionary lookup, supplemented by a small number of rules of thumb. In this squib, I explore whether the length of Latin syllable nuclei can be predicted from phonological features available from the orthography alone, without recourse to a dictionary. I show that a simple machine learning classifier scores remarkably well on this task: Latin vowel length is in fact predictable for 95% of previously seen and 89% of unseen syllables. Further, by using interpretable machine learning architectures in the form of decision trees, random forests and (structured) perceptrons, we can extract the linguistic features judged most predictive for the task.
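To make "phonological features available from the orthography" more concrete, here is a minimal sketch of what a per-syllable feature dictionary could look like. It is illustrative only: the function name syllable_features and every feature name in it are assumptions made for this example and are not taken from features.py. It also treats every vowel letter as part of the nucleus, which is a simplification (for instance, the "u" in "qu" is not handled).

```python
VOWELS = set("aeiouy")


def syllable_features(syllables, index):
    """Describe one syllable of an already syllabified, macron-free word.

    Hypothetical feature set for illustration; the real features may differ.
    """
    syllable = syllables[index]
    vowel_positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
    if vowel_positions:
        first, last = vowel_positions[0], vowel_positions[-1]
        onset = syllable[:first]
        nucleus = syllable[first:last + 1]
        coda = syllable[last + 1:]
    else:
        # No vowel letter found: treat the whole string as onset.
        onset, nucleus, coda = syllable, "", ""
    return {
        "onset": onset,
        "nucleus": nucleus,
        "coda": coda,
        "is_closed": bool(coda),
        "is_initial": index == 0,
        "is_final": index == len(syllables) - 1,
        "syllables_from_end": len(syllables) - 1 - index,
    }


if __name__ == "__main__":
    # Features for the last syllable of "a-mi-cus".
    print(syllable_features(["a", "mi", "cus"], 2))
```

Dictionaries like this can be turned into numeric vectors by most machine learning toolkits, which is one reason the sketch uses plain key-value pairs.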

Special thanks to Kevin Ryan for sharing his compiled dataset of Latin text annotated with macrons.

How to run

This project uses Anaconda to manage dependencies. Clone this repository, then use

conda env create -f environment.yml

to install the dependencies. Activate your new conda environment with

conda activate latin-vowel-classifier

If the /data folder only contains Ryan_Latin_master.txt, run

python preprocess.py

to create the two corpus files Latin_words_preprocessed.txt and Latin_words_preprocessed_unique.txt.
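Since the source corpus marks long vowels with macrons, preprocessing has to separate the plain spelling from the length labels at some point. The following is a minimal sketch of that idea using only the standard library. It is not the code in preprocess.py: the function name split_macrons is hypothetical, and the sketch labels each vowel letter separately, ignoring diphthongs.

```python
import unicodedata

MACRON = "\u0304"  # combining macron
VOWELS = set("aeiouy")


def split_macrons(word):
    """Return (plain_word, labels), where labels is a list of
    (vowel_letter, is_long) pairs in order of appearance."""
    plain = []
    labels = []
    last_was_vowel = False
    # NFD splits a precomposed letter such as "ī" into "i" + combining macron.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Drop all combining marks; a macron marks the previous vowel long.
            if ch == MACRON and last_was_vowel:
                letter, _ = labels[-1]
                labels[-1] = (letter, True)
            continue
        plain.append(ch)
        last_was_vowel = ch.lower() in VOWELS
        if last_was_vowel:
            labels.append((ch, False))
    return "".join(plain), labels


if __name__ == "__main__":
    print(split_macrons("amīcus"))
    # ('amicus', [('a', False), ('i', True), ('u', False)])
```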

Then run model.py to train and evaluate a model:

python model.py

Several settings can be changed in the main body of model.py, including the type of classifier, its hyperparameters, which features to use, and whether to plot the tree (only applicable to a decision tree). Tree plotting options can be changed in plot_fitted_tree(). Logs are written both to the console and to a timestamped log file under /experiments.
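For readers who want to reproduce the logging behaviour in their own scripts, logging to the console and to a timestamped file can be set up with the standard logging module roughly as follows. This is a sketch under assumptions: the function name setup_logging, the logger name, and the timestamp format are invented for this example and may not match what model.py does.

```python
import logging
import sys
from datetime import datetime
from pathlib import Path


def setup_logging(log_dir="experiments"):
    """Create a logger that writes to stdout and to <log_dir>/<timestamp>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")  # assumed format
    log_path = Path(log_dir) / f"{timestamp}.log"

    logger = logging.getLogger("vowel_length_sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate output if called twice

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(log_path, encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, log_path


if __name__ == "__main__":
    logger, log_path = setup_logging()
    logger.info("Logging to %s", log_path)
```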

Unit tests

Unit tests for syllabification are contained in test_syllabify.py. The file test_syllabify_corpus.py contains an integration test which checks that the entire corpus can be syllabified without errors.
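To give a sense of what a syllabification test checks, here is a deliberately simplified syllabifier with a few assertions. It is a toy written for this example, not the algorithm in syllabify.py: it assumes macron-free input, treats only "ae", "au" and "oe" as diphthongs, always places a syllable boundary before the last consonant between two nuclei, and ignores cases such as "qu" and stop plus liquid clusters.

```python
VOWELS = set("aeiouy")
DIPHTHONGS = ("ae", "au", "oe")  # simplified inventory, an assumption


def find_nuclei(word):
    """Return (start, end) index pairs for each syllable nucleus."""
    nuclei = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIPHTHONGS:
            nuclei.append((i, i + 2))
            i += 2
        elif word[i] in VOWELS:
            nuclei.append((i, i + 1))
            i += 1
        else:
            i += 1
    return nuclei


def syllabify(word):
    """Split a word into syllables using the simplified rules above."""
    word = word.lower()
    nuclei = find_nuclei(word)
    if not nuclei:
        return [word]
    boundaries = [0]
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        if next_start > prev_end:
            # Consonants in between: the last one starts the next syllable.
            boundaries.append(next_start - 1)
        else:
            # Adjacent vowels in hiatus: split between them.
            boundaries.append(next_start)
    boundaries.append(len(word))
    return [word[a:b] for a, b in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    assert syllabify("amicus") == ["a", "mi", "cus"]
    assert syllabify("puella") == ["pu", "el", "la"]
    assert syllabify("rosae") == ["ro", "sae"]
    print("all checks passed")
```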

rossh2/latin-vowel-length
