This repository contains code and data for the paper "Morphological Segmentation for Seneca", to appear in AmericasNLP 2021
There are two data sources: a grammar book (Bardeau 2007), and words collected from transcribed informal recordings.
Transcriptions of informal recordings were performed by Robbie Jimerson (rcj2772@rit.edu)
To construct data sets for the grammar book, do:
python3 code/segmentation_data.py --input resources/all-forms-from-spreadsheet.txt --output OUTPUT_PATH --lang grammar
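For example, to write the generated splits to a local directory of your choosing (the output path below is purely illustrative, not a folder shipped with the repo):

python3 code/segmentation_data.py --input resources/all-forms-from-spreadsheet.txt --output data/grammar/ --lang grammar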
The data generated this way in our experiments is in 1/grammar/:
- baseline contains data used to train the Naive baseline
- tuning contains data used to train the Less naive baseline
- domain contains data for a series of cross-domain training experiments: (1) basic is for transfer learning between the two domains / data sources; (2) finetune is for fine-tuning the model from (1) with in-domain data; (3) self-training is for using additional words from the Bible (in the resources folder) for training; (4) multi-task is for multi-task learning; to get the data for this configuration in particular, do the following (a worked example with concrete paths follows this list):
python3 code/augmentation.py --input TARGET_TRAINING_FILE --output OUTPUT_PATH --method b --bible resources/Bible_select.txt
python3 code/prep_task.py --input INPUT_DEVELOPMENT_FILE --output OUTPUT_DEVELOPMENT_FILE
- crosslingual contains data for a series of cross-linguistic training experiments: (1) basic is for transfer learning between Seneca and four Mexican indigenous languages from Kann et al. (2018); (2) finetune is for fine-tuning the model from (1) with in-domain data; (3) multi-task is for multi-task learning with the four Mexican indigenous languages from Kann et al. (2018)
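For illustration, a concrete multi-task data preparation run with the two commands above might look like the following; the grammar/domain/multi-task/ paths are placeholders standing in for wherever your generated training and development files live, not files shipped with the repo:

python3 code/augmentation.py --input grammar/domain/multi-task/train.txt --output grammar/domain/multi-task/train_augmented.txt --method b --bible resources/Bible_select.txt

python3 code/prep_task.py --input grammar/domain/multi-task/dev.txt --output grammar/domain/multi-task/dev_prepped.txt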
Data set construction for the informal sources is similar, except that the command is:
python3 code/segmentation_data.py --input resources/all-forms-from-spreadsheet.txt --output OUTPUT_PATH --lang robbie
The output organization (e.g. folder names) is the same as that for the grammar book described above.
Data set construction for evaluating with a development domain is not available, per request from the community
Hooray.ipynb contains a run-through of training and applying one segmentation model; it is quite simple and straightforward.
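If you would rather work from the command line than the notebook: the onmt option of the evaluation script below suggests the segmentation models are sequence-to-sequence models trained with OpenNMT. Purely as a sketch (assuming OpenNMT-py 2.x; seg.yaml and all file names here are placeholders, and the settings actually used in our experiments are in the notebook and the folders above), a training-and-prediction run would look roughly like:

onmt_build_vocab -config seg.yaml -n_sample -1
onmt_train -config seg.yaml
onmt_translate -model run/model_step_10000.pt -src dev_src.txt -output pred.txt

where seg.yaml points at your generated source/target training and development files and sets the model hyperparameters.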
All our models trained under different configurations are within each of the specified folders described above.
To evaluate model predictions against a gold segmentation file, do:
python3 code/segmentation_eval.py --gold GOLD_FILE --pred PREDICTED_OUTPUT --ex onmt
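For instance, with placeholder file names (a gold segmentation file and the corresponding model predictions):

python3 code/segmentation_eval.py --gold grammar/domain/basic/test.txt --pred pred.txt --ex onmt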
All our evaluation results are within each of the specified folders described above, including results when evaluating using a development domain
The experiments/test folder contains all models and results during the final testing stage.