surprisal-duration-tradeoff


This code accompanies the paper A surprisal--duration trade-off across and within the world's languages (Pimentel et al., EMNLP 2021). It studies the trade-off between surprisal and duration that would result from a hypothesized channel capacity in human language processing.
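Throughout, a phone's surprisal is its negative log-probability given the preceding context under a phone-level language model:

s(w_t) = -log p(w_t | w_{<t})

Under the channel-capacity hypothesis, higher-surprisal (less predictable) phones should tend to be produced with longer durations.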

Install

To install the dependencies, run:

$ conda env create -f environment.yml

Activate the created conda environment with the command:

$ source activate.sh

Finally, install the appropriate version of PyTorch:

$ conda install -y pytorch torchvision cudatoolkit=10.1 -c pytorch
$ # or, on machines without a CUDA GPU: conda install pytorch torchvision cpuonly -c pytorch
$ pip install transformers
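As an optional sanity check (the exact versions printed will depend on your machine), you can verify that the environment sees both PyTorch and transformers:

$ python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"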

For the R analysis, you will also need to install:

$ sudo apt-get install libnlopt-dev

Get Data

VoxClamantis

VoxClamantis information can be found at https://voxclamantisproject.github.io. First, download the alignment data for both epitran and unitran from the OSF repository and place it in the folder data/vox/alignments/. Second, the raw text in this dataset is copyrighted, so you will need to email us to request access. This text data should then be placed in data/vox/text/. You can now extract the VoxClamantis data with the command:

$ make extract_vox LANGUAGE=<language> DATASET=<dataset>

or preprocess it directly with the command:

$ make get_data LANGUAGE=<language> DATASET=<dataset>

Here <language> is any language in VoxClamantis, e.g. AZEBSA. There are three alignment methods available in VoxClamantis, selected via <dataset>; we use two of them in this project: unitran and epitran.
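For example, to download and preprocess the example language above with epitran alignments in one step:

$ make get_data LANGUAGE=AZEBSA DATASET=epitran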

To preprocess all languages in one of the analysed datasets, run:

$ source src/h01_data/process_epitran.sh
$ source src/h01_data/process_unitran.sh

Train and evaluate the models

You can train a model on each language with the command:

$ make train LANGUAGE=<language> DATASET=<dataset>

Similarly, you can then use your model to get surprisal values on that language by running:

$ make eval LANGUAGE=<language> DATASET=<dataset>

To train models on all languages from one of the datasets, run:

$ python src/h02_learn/train_all.py --dataset <dataset>
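For example, to train models for every epitran language in sequence:

$ python src/h02_learn/train_all.py --dataset epitran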

Analysis

After training and evaluating the models on all languages with the commands above, merge the results and filter them based on MCD scores:

$ make filter_mcd DATASET=<dataset>

With these filtered results, we can now run the paper's analyses. First, run the individual-language analysis:

$ make analysis_monoling DATASET=<dataset>

Second, run the full cross-linguistic analysis:

$ make analysis_crossling DATASET=<dataset>

Finally, run the "pure" cross-linguistic analysis with command:

$ make analysis_tradeoff DATASET=<dataset>
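Putting these steps together, a typical analysis run for one dataset looks like this (epitran shown as an example; the same targets apply to unitran):

$ make filter_mcd DATASET=epitran
$ make analysis_monoling DATASET=epitran
$ make analysis_crossling DATASET=epitran
$ make analysis_tradeoff DATASET=epitran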

Paper plots and results

Finally, to reproduce the plots in the paper, run the following commands. For Figure 1:

$ make plot_monoling_full DATASET=unitran

For Figure 2:

$ make plot_monoling_effects DATASET=epitran

For Figure 3:

$ make plot_crossling_effects DATASET=epitran

For the Appendix Table:

$ make print_controls DATASET=epitran

Further, to get the average unitran slope for the individual-language analysis, run:

$ make plot_monoling_effects DATASET=unitran

And for the cross-linguistic analysis:

$ make plot_crossling_effects DATASET=unitran

This result might not match the paper exactly, since the original models were deleted and we had to rerun this cross-linguistic unitran analysis. Finally, to get the pure cross-linguistic slopes, run:

$ make print_tradeoff DATASET=unitran
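For convenience, all of the paper plots and tables above can be regenerated in one go (a sketch collecting exactly the targets listed in this section):

$ make plot_monoling_full DATASET=unitran
$ make plot_monoling_effects DATASET=epitran
$ make plot_crossling_effects DATASET=epitran
$ make print_controls DATASET=epitran
$ make plot_monoling_effects DATASET=unitran
$ make plot_crossling_effects DATASET=unitran
$ make print_tradeoff DATASET=unitran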

Extra Information

Citation

If this code or the paper was useful to you, consider citing it:

@inproceedings{pimentel-etal-2021-surprisal,
    title = "A surprisal--duration trade-off across and within the world's languages",
    author = "Pimentel, Tiago and
    Meister, Clara and
    Salesky, Elizabeth and
    Teufel, Simone and
    Blasi, Damián and
    Cotterell, Ryan",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2109.15000",
}

Contact

To ask questions or report problems, please open an issue.
