
Analyses for "Word forms - not just their lengths - are optimized for efficient communication"


smeylan/pic-analysis


pic-analysis

Analyses for "Word forms - not just their lengths - are optimized for efficient communication"

To download, clean, and compute the frequency and in-context surprisal estimates yourself, see smeylan/ngrawk. To download the precomputed estimates instead, already in the appropriate directory structure:

wget cocosci.berkeley.edu/smeylan/pic/results.zip && unzip results.zip
wget cocosci.berkeley.edu/smeylan/pic/token_results.zip && unzip token_results.zip

The analysis requires other data sources including the Clearpond database, dates of first use from the Oxford English Dictionary, word lists from OPUS, and a list of plurals. To download these:

wget cocosci.berkeley.edu/smeylan/pic/data.zip && unzip data.zip

To limit the analysis to morphologically simple words, you will need a copy of the CELEX2 corpus. After decompressing it, use ln -s to add a symlink to it in the data/ directory.
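The symlink step could look something like the sketch below. The README does not say where CELEX2 lives after decompression or what the link should be called, so the source path argument, the link name celex2, and the helper name link_celex are all assumptions to adjust to whatever the analysis code expects.

```shell
# Hypothetical helper: link a decompressed CELEX2 directory into data/.
# The link name "celex2" is an assumption, not taken from the repository.
link_celex() {
    celex_src="$1"            # directory where CELEX2 was decompressed
    data_dir="${2:-data}"     # the repository's data/ directory
    if [ ! -d "$celex_src" ]; then
        echo "CELEX2 directory not found: $celex_src" >&2
        return 1
    fi
    # Resolve to an absolute path so the link works from any working directory.
    celex_src="$(cd "$celex_src" && pwd)"
    mkdir -p "$data_dir"
    # -s: symbolic link; -f: replace an existing link; -n: treat an existing link as a file
    ln -sfn "$celex_src" "$data_dir/celex2"
}
```

For example, link_celex ~/corpora/celex2 run from the repository root would create data/celex2, where ~/corpora/celex2 is again only a placeholder path.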

To check against the Piantadosi et al. (2011) results, download and unzip the publicly available data from the Colala website and place them in data/Google1T_Piantadosi. To filter the words, you will also need to request the OPUS wordlists from the Colala lab and place them in data/OPUS_Piantadosi.
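Before running the comparison, it may help to confirm that both directories are in place. The following is a minimal sketch: the two directory names come from the instructions above, while the helper name check_piantadosi_dirs is hypothetical and not part of the repository.

```shell
# Hypothetical helper: report which of the Piantadosi et al. (2011)
# comparison directories are missing under data/.
check_piantadosi_dirs() {
    data_dir="${1:-data}"
    missing=0
    for d in Google1T_Piantadosi OPUS_Piantadosi; do
        if [ ! -d "$data_dir/$d" ]; then
            echo "missing: $data_dir/$d" >&2
            missing=1
        fi
    done
    # Returns 0 when both directories exist, 1 otherwise.
    return "$missing"
}
```

Called as check_piantadosi_dirs from the repository root, it prints one line per missing directory and returns a non-zero status if either is absent. It only checks that the directories exist, not their contents.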
