Skip to content

seannyD/UniversalsInWHWords

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A test of universals in wh words

This is the repository for the following paper:

Slonimska, A., & Roberts, S. G. (2017). A case for systematic sound symbolism in pragmatics: Universals in wh-words. Journal of Pragmatics, 116, 1-20. Link to paper. Link to pdf

The full dataset contains 430,000 entries from 314 languages, taken from the IDS, the WOLD and Sprakbanken.

https://github.com/seannyD/UniversalsInWHWords/blob/master/Processing/CleanedAndSimplifiedData/Alldata_simple.csv

Warning: The data has been cleaned and processed with a specific hypothesis in mind. Researchers wishing to use the lexical data are encouraged to go to the original source, where all data is freely available.

There is also a script that restricts the data to non-creole, non-reconstructed, non-dialect-level distinctions, and only concepts that are well covered. e.g. there are many Nakh-Daghestanian languages documented, which is quite unbalanced considering the rest of the coverage. This dataset is 1000 concepts in 230 languages:

https://github.com/seannyD/UniversalsInWHWords/blob/master/Analysis/RestrictionsApplied.R

The following script will produce a matrix where rows are meanings and columns are languages:

(in directory Analysis)

source("PermutationTools.r")

alldata<-read.csv("../Processing/CleanedAndSimplifiedData/Alldata_simple.csv", stringsAsFactors=F)

d = data.frame.to.matrix(alldata)

Data format

  • word: Original transcription. Note that there can be multiple words per entry, separated by ";"
  • word.clean: original transcription with characters normalised
  • word.simple: simplified orthography, paying attention only to place and manner of articulation (no tones, no vowel length, no aspiration/nasality/palatalisation). We were mainly interested in consonants, so vowels are very simplified.
  • meaning: the original meaning (can differ for the same meaning ID)
  • meaning.id: the original meaning ID (see WOLD/IDS)
  • meaning.id.fixed: we fixed and normalised some meaning IDs.
  • domain: meaning domain.
  • analyzability: For WOLD data, whether the word can be analysed into sub-parts
  • Source: source of the data. Some languages were covered in more than one database, we prioritised WOLD since it has analysability data.

Workflow for producing the data:

Process raw WOLD data:

Collect_new_WOLD_data.R
addTranscriptions_new.R

Collect_new_IDS_data.R

Collect_Spraakbanken_data.R

Merge WOLD, IDS AND SB

List_merge.R

Simplify data (creates Processing/CleanedAndSimplifiedData/Alldata_simple.csv)

simplifyData.R    

End up with these files:

  • Alldata_simple.csv
  • RAW_data/Data_clean_up2.csv (manually created)
  • RAW_data/Grammars.csv (manually created)

Get languages in analysis by running:

Analysis/RestrictionsApplied.R

Processing/addGeoDataToLangList.R

Analysis

Select data:

Analysis/RestrictionsApplied.R
grammars.R
makeDataVariables.R

(these three are included in most RunAnalysis* files)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published