GitHub page for the RAW-C dataset: Relatedness of Ambiguous Words, in Context.
To cite:
Trott, S., Bergen, B. (2021). RAW-C: Relatedness of Ambiguous Words––in Context (A New Lexical Resource for English). Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).
There are several data files.
The most relevant is data/processed/raw-c.csv, the complete set of sentence pairs in the final RAW-C dataset. The most important columns (for most purposes) are:
- word
- sentence1 and sentence2: the sentence pair being contrasted
- same: whether the target word has the same or different meaning across the sentence pair
- ambiguity_type: whether the different-sense use is Polysemy or Homonymy
- mean_relatedness: the mean relatedness judgment across human participants
- distance_bert: the cosine distance between BERT's representations of the target word in the two sentences
- distance_elmo: the cosine distance between ELMo's representations of the target word in the two sentences
This file also contains the number of annotators who rated each sentence pair (count), as well as the standard deviation of those judgments (sd_relatedness).
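As a quick illustration of how these columns can be used, the sketch below averages mean_relatedness separately for each value of the same column, using only the Python standard library. The helper names (load_rows, mean_relatedness_by_group) are hypothetical and not part of this repository, and the sketch assumes the file is a comma-separated file with a header row; it makes no assumption about how the values of same are encoded and simply groups by whatever strings appear.

```python
import csv
from collections import defaultdict
from statistics import mean


def load_rows(path="data/processed/raw-c.csv"):
    """Read the CSV into a list of dicts keyed by column name (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def mean_relatedness_by_group(rows, group_col="same"):
    """Average the mean_relatedness column for each distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(float(row["mean_relatedness"]))
    return {key: mean(values) for key, values in groups.items()}


# Example usage, run from the repository root:
#   rows = load_rows()
#   print(mean_relatedness_by_group(rows))                    # same vs. different sense
#   print(mean_relatedness_by_group(rows, "ambiguity_type"))  # Polysemy vs. Homonymy
```

Passing a different group_col gives the same breakdown for any other categorical column.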
We also include another version of this file, which does not contain the human relatedness judgments but does have the BERT/ELMo distances (data/processed/stims_with_nlm_distances.csv). This can be used to run the nlm_analysis.Rmd file.
We also include data/processed/raw-c_with_dominance.csv, which contains all of the same columns as RAW-C, with several additions:
- dominance_sentence2: mean dominance of sentence2 relative to sentence1.
- sd_dominance_sentence2: standard deviation of the dominance judgments of sentence2 relative to sentence1.
Note that dominance judgments are only included for different-sense sentence pairs.
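Because dominance is only present for some rows, code that reads raw-c_with_dominance.csv needs to skip the rows without a value. The sketch below keeps only rows whose dominance_sentence2 entry parses as a number. How missing values are encoded in the file (empty string, NA, or something else) is an assumption here, so the hypothetical helper rows_with_dominance simply discards anything that is not a finite-looking number.

```python
import math


def rows_with_dominance(rows, col="dominance_sentence2"):
    """Keep rows (dicts, e.g. from csv.DictReader) whose `col` value is a real number."""
    kept = []
    for row in rows:
        try:
            value = float(row.get(col, ""))
        except (TypeError, ValueError):
            # Empty cells, "NA", or a missing column are treated as "no judgment".
            continue
        if not math.isnan(value):
            kept.append(row)
    return kept
```

The input is expected to be a list of dicts, such as the output of csv.DictReader applied to the dominance file.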
Finally, we include the original stimuli file (data/processed/stimuli.csv).
The file src/modeling/get_distances.py can be used to run each sentence pair through BERT and ELMo and extract the cosine distance between the contextualized representations of the target word:
    python src/modeling/get_distances.py
Note that this script requires the bert-embedding and allennlp packages.
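For readers unfamiliar with the measure, here is a minimal sketch of cosine distance between two vectors, assuming the common definition of one minus cosine similarity. It is not taken from get_distances.py, which may compute the value differently (for example, through a library call or over a particular model layer); the function name cosine_distance is only illustrative.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Under this definition, vectors pointing in the same direction have a distance near 0 and orthogonal vectors have a distance of 1, so larger values of distance_bert or distance_elmo indicate that the model represents the target word more differently across the two sentences.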
We include the analysis file for the original stimuli, using BERT and ELMo (data/src/analysis/nlm_analysis.html). This can be rerun by "knitting" the .Rmd file (data/src/analysis/nlm_analysis.Rmd).
We also include the analysis file for the individual trial data (data/src/analysis/norming_analysis.html). Note that this file also performs the analyses in Section 5.3 of the paper (the language model evaluations). The individual trial data is available upon request (sttrott@ucsd.edu).