Seneca

This repository contains code and data for the paper "Morphological Segmentation for Seneca", to appear in AmericasNLP 2021.

Data set construction when evaluating with a development set

There are two data sources: a grammar book (Bardeau 2007) and words collected from transcribed informal recordings.

Transcriptions of the informal recordings were performed by Robbie Jimerson (rcj2772@rit.edu).

To construct data sets for the grammar book, run:

python3 code/segmentation_data.py --input resources/all-forms-from-spreadsheet.txt --output OUTPUT_PATH --lang grammar

The data generated this way in our experiments is in 1/grammar/:

  1. baseline contains data used to train the Naive baseline

  2. tuning contains data used to train the Less naive baseline

  3. domain contains data for a series of cross-domain training experiments:
     (1) basic is for transfer learning between the two domains / data sources
     (2) finetune is for fine-tuning the model from (1) with in-domain data
     (3) self-training is for using additional words from the Bible (in the resources folder) for training
     (4) multi-task is for multi-task learning; to get the data for this configuration in particular, run:

    python3 code/augmentation.py --input TARGET_TRAINING_FILE --output OUTPUT_PATH --method b --bible resources/Bible_select.txt

    python3 code/prep_task.py --input INPUT_DEVELOPMENT_FILE --output OUTPUT_DEVELOPMENT_FILE

  4. crosslingual contains data for a series of cross-linguistic training experiments:
     (1) basic is for transfer learning between Seneca and four Mexican indigenous languages from Kann et al. (2018)
     (2) finetune is for fine-tuning the model from (1) with in-domain data
     (3) multi-task is for multi-task learning with the four Mexican indigenous languages from Kann et al. (2018); one common way of combining data for such configurations is sketched below
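For the multi-task and cross-lingual configurations, a common way to combine data from several tasks or languages in one character-level seq2seq model is to prepend a task or language tag token to every source sequence. The sketch below illustrates only that general idea, assuming parallel source/target files of space-separated characters; the tags and file names are hypothetical, and this is not necessarily how code/prep_task.py or code/augmentation.py organizes the data.

# Hypothetical sketch: combine segmentation data from several sources for
# multi-task / cross-lingual training by prepending a tag token to each
# source sequence. File names and tags are made up for illustration.

def tag_corpus(src_path, tgt_path, tag):
    """Read parallel src/tgt files and yield (tagged_src, tgt) pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            # e.g. "a b c d" -> "<sen> a b c d"
            yield f"{tag} {s.strip()}", t.strip()

def combine(corpora, out_src, out_tgt):
    """Write all tagged corpora into one combined training file pair."""
    with open(out_src, "w", encoding="utf-8") as fs, open(out_tgt, "w", encoding="utf-8") as ft:
        for src_path, tgt_path, tag in corpora:
            for s, t in tag_corpus(src_path, tgt_path, tag):
                fs.write(s + "\n")
                ft.write(t + "\n")

if __name__ == "__main__":
    combine(
        [
            ("seneca.src", "seneca.tgt", "<sen>"),    # main task: Seneca segmentation
            ("nahuatl.src", "nahuatl.tgt", "<nah>"),  # auxiliary language/task
        ],
        "multitask.src",
        "multitask.tgt",
    )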

Data set construction for the informal recordings is similar, except that the command is:

python3 code/segmentation_data.py --input resources/all-forms-from-spreadsheet.txt --output OUTPUT_PATH --lang robbie

The output organization (e.g., folder names) is the same as that for the grammar book described above.
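For reference, morphological segmentation is often framed as character-level sequence-to-sequence transduction: the source side is the characters of the surface word separated by spaces, and the target side is the same characters with an extra symbol marking each morpheme boundary. The following is a minimal sketch of that conversion, assuming a hypothetical tab-separated word/segmentation input; the actual format produced by code/segmentation_data.py may differ.

# Minimal sketch of the character-level seq2seq format often used for
# morphological segmentation (assumed format; segmentation_data.py may differ).
# Input: tab-separated lines of "surface_word<TAB>morph1-morph2-..." (hypothetical).
# Output: parallel .src/.tgt files with space-separated characters,
# with "!" (an arbitrary choice here) marking morpheme boundaries on the target side.

import argparse

def to_seq2seq(word, segmentation, boundary="!"):
    src = " ".join(word)
    tgt = " ".join(boundary if ch == "-" else ch for ch in segmentation)
    return src, tgt

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True, help="prefix for the .src/.tgt files")
    args = parser.parse_args()

    with open(args.input, encoding="utf-8") as f, \
         open(args.output + ".src", "w", encoding="utf-8") as fs, \
         open(args.output + ".tgt", "w", encoding="utf-8") as ft:
        for line in f:
            if not line.strip():
                continue
            word, seg = line.rstrip("\n").split("\t")[:2]
            src, tgt = to_seq2seq(word, seg)
            fs.write(src + "\n")
            ft.write(tgt + "\n")

if __name__ == "__main__":
    main()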

Data set construction when evaluating with a development domain is not available, per request from the community.

Training/Applying a seq2seq morphological segmentation model

It is quite simple and straightforward: Hooray.ipynb contains a run-through of training and applying one segmentation model.
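As a rough illustration of what the notebook covers, the sketch below assumes the models are character-level seq2seq models trained with OpenNMT-py 1.x (the --ex onmt flag of the evaluation script suggests OpenNMT output); all file and model names are hypothetical.

# Rough sketch of training/applying a character-level seq2seq segmentation
# model with OpenNMT-py 1.x. All file and model names are hypothetical;
# see Hooray.ipynb for the actual run-through.

import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Build the OpenNMT data shards from parallel character-level files.
run(["onmt_preprocess",
     "-train_src", "train.src", "-train_tgt", "train.tgt",
     "-valid_src", "dev.src", "-valid_tgt", "dev.tgt",
     "-save_data", "data/seneca"])

# 2. Train a model (hyperparameters omitted; OpenNMT defaults used).
run(["onmt_train", "-data", "data/seneca", "-save_model", "models/seneca"])

# 3. Apply a trained checkpoint to unsegmented test words.
run(["onmt_translate",
     "-model", "models/seneca_step_100000.pt",
     "-src", "test.src", "-output", "pred.txt"])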

All of our models trained under the different configurations are in the corresponding folders described above.

Evaluating the output of a model

python3 code/segmentation_eval.py --gold GOLD_FILE --pred PREDICTED_OUTPUT --ex onmt

All of our evaluation results are in the corresponding folders described above, including results when evaluating with a development domain.
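For reference, segmentation output is commonly scored with precision, recall, and F1 over predicted morpheme boundaries. The snippet below is a minimal, self-contained illustration of that metric on hyphen-segmented forms; it is not a reimplementation of code/segmentation_eval.py.

# Minimal illustration of boundary precision/recall/F1 for segmentation,
# computed on hyphen-segmented forms such as "re-think-ing".
# This is not the project's evaluation script, just the standard metric.

def boundary_positions(segmented):
    """Return the character offsets (in the unsegmented word) of each boundary."""
    positions, offset = set(), 0
    for morph in segmented.split("-")[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def boundary_f1(gold_forms, pred_forms):
    tp = fp = fn = 0
    for gold, pred in zip(gold_forms, pred_forms):
        g, p = boundary_positions(gold), boundary_positions(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    print(boundary_f1(["re-think-ing"], ["re-thinking"]))  # (1.0, 0.5, 0.666...)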

Testing a model

The experiments/test folder contains all models and results from the final testing stage.
