This repository contains the code and supplementary materials for the paper:
Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods
Maria Ryskina, Ella Rabinovich, Taylor Berg-Kirkpatrick, David R. Mortensen, Yulia Tsvetkov
SCiL 2020
Please contact mryskina@cs.cmu.edu for any questions.
Embedding alignment code in projection.py
is based on Ryan Heuser's Gensim port of William Hamilton's alignment code in HistWords.
This code uses the COHA and COCA corpora in plain text format. The corpora need to be downloaded from https://www.corpusdata.org/.
Code to train the historical and modern embeddings using Gensim:
python train_w2v.py <coha_path> historical
python train_w2v.py <coca_path> modern
where <coha_path>
and <coca_path>
need to be replaced with paths to COHA and COCA top-level text directories respectively. Trained embedding models will be saved into the models
directory.
Pretrained embeddings will be available shortly.
Code to reproduce the main analysis:
python main.py <coha_path> <coca_path> [--seed <seed>] [--stable]
where:
--seed
is an optional argument specifiying a random seed used to randomize control set selection--stable
flag switches between stable and relaxed control sets
The MATLAB script for fitting the generalized linear model (GLM) can be found in glm.m
.
vocabulary.txt
contains a vocabulary of nouns extracted from Wikicorpusneologisms.txt
is a list of neologisms automatically extracted by our codefreq_growth.tsv
contains the frequency growth rates (Spearman's correlation coefficients and p-values) for all vocabulary wordspairs.{stable|relaxed}.tsv
is a list of neologism-control pairs for stable and relaxed control sets respectivelydensity.{stable|relaxed}.tsv
andgrowth.{stable|relaxed}.tsv
display neighborhood density and average frequency growth rate for a range of neighborhood sizes for each neologism and control wordglm.{stable|relaxed}.tsv
is a reformatting of the density and growth data to be used for GLM fittingSupplementary.xlsx
contains detailed results of the regression analysis and collinearity tests and nearest historical neighbors for all neologisms
Core dependencies:
- Python >= 3.6
- SciPy >= 1.0.1
- 3.7.0 <= Gensim < 4.0
- NLTK
- MATLAB (for GLM analysis only)
@article{ryskina2020where,
title={Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods},
author={Ryskina, Maria and Rabinovich, Ella and Berg-Kirkpatrick, Taylor and Mortensen, David R. and Tsvetkov, Yulia},
journal={Proceedings of the Society for Computation in Linguistics},
volume={3},
number={1},
pages={43--52},
year={2020}
}