Code for the paper "Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods" (SCiL 2020)

ryskina/neology

This repository contains the code and supplementary materials for the paper:

Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods
Maria Ryskina, Ella Rabinovich, Taylor Berg-Kirkpatrick, David R. Mortensen, Yulia Tsvetkov
SCiL 2020

Please contact mryskina@cs.cmu.edu with any questions.

Embedding alignment code in projection.py is based on Ryan Heuser's Gensim port of William Hamilton's alignment code in HistWords.

Data

This code uses the COHA and COCA corpora in plain text format. The corpora need to be downloaded from https://www.corpusdata.org/.

Usage

Code to train the historical and modern embeddings using Gensim:

python train_w2v.py <coha_path> historical
python train_w2v.py <coca_path> modern

where <coha_path> and <coca_path> must be replaced with the paths to the COHA and COCA top-level text directories, respectively. Trained embedding models are saved to the models directory.
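
For readers unfamiliar with how a directory of plain-text files is typically fed to Gensim, here is a minimal sketch. It is not the code in train_w2v.py: the class name DirectoryCorpus, the .txt filter, the lowercasing, and the whitespace tokenization are all assumptions made for illustration. Gensim's Word2Vec accepts any iterable of token lists, and the iterable should be restartable because training makes several passes over the data.

```python
import os


class DirectoryCorpus:
    """Restartable iterable over tokenized lines of every .txt file under a directory.

    Hypothetical helper: the real preprocessing in train_w2v.py may differ.
    """

    def __init__(self, root):
        self.root = root

    def __iter__(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            dirnames.sort()  # visit subdirectories in a deterministic order
            for name in sorted(filenames):
                if not name.endswith(".txt"):
                    continue  # assumption: corpus files use the .txt extension
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for line in handle:
                        # Assumption: lowercase and split on whitespace.
                        tokens = line.lower().split()
                        if tokens:
                            yield tokens
```

An object such as DirectoryCorpus("<coha_path>") could then be passed as the sentence source when constructing a Word2Vec model.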

Pretrained embeddings will be available shortly.

Code to reproduce the main analysis:

python main.py <coha_path> <coca_path> [--seed <seed>] [--stable]

where:

  • --seed is an optional argument specifying the random seed used to randomize control set selection
  • --stable flag switches between stable and relaxed control sets
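
The command-line interface described above could be declared with argparse roughly as follows. This is a sketch, not the parser in main.py: the integer type for --seed, the default of None, the help strings, and the example paths are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented interface (illustrative only)."""
    parser = argparse.ArgumentParser(description="Neologism neighborhood analysis")
    parser.add_argument("coha_path", help="path to the COHA top-level text directory")
    parser.add_argument("coca_path", help="path to the COCA top-level text directory")
    # Assumption: the seed is an integer and is unset by default.
    parser.add_argument("--seed", type=int, default=None,
                        help="random seed for control set selection")
    # A store_true flag is False unless --stable is passed on the command line.
    parser.add_argument("--stable", action="store_true",
                        help="switch between stable and relaxed control sets")
    return parser


# Hypothetical paths, used only to demonstrate parsing.
args = build_parser().parse_args(["/data/coha", "/data/coca", "--seed", "42", "--stable"])
print(args.seed, args.stable)  # prints "42 True"
```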

The MATLAB script for fitting the generalized linear model (GLM) can be found in glm.m.

Files

  • vocabulary.txt contains a vocabulary of nouns extracted from Wikicorpus
  • neologisms.txt is a list of neologisms automatically extracted by our code
  • freq_growth.tsv contains the frequency growth rates (Spearman's correlation coefficients and p-values) for all vocabulary words
  • pairs.{stable|relaxed}.tsv is a list of neologism-control pairs for stable and relaxed control sets respectively
  • density.{stable|relaxed}.tsv and growth.{stable|relaxed}.tsv display neighborhood density and average frequency growth rate for a range of neighborhood sizes for each neologism and control word
  • glm.{stable|relaxed}.tsv is a reformatting of the density and growth data to be used for GLM fitting
  • Supplementary.xlsx contains detailed results of the regression analysis and collinearity tests, as well as the nearest historical neighbors for all neologisms
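
To make the quantities in these files concrete, the sketch below computes a frequency growth rate as the Spearman correlation between time and frequency, then derives neighborhood density and average neighbor growth for one target vector. It is an illustration under stated assumptions, not the repository's implementation: a neighborhood is assumed to be the set of words within a cosine-distance radius, density is assumed to be the number of such words, and all words, vectors, and frequencies are toy data. In practice scipy.stats.spearmanr returns both the coefficient and the p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def neighborhood_stats(target, vectors, growth, radius):
    """Return (density, mean growth) over words within `radius` cosine distance."""
    neighbors = [w for w, vec in vectors.items() if 1 - cosine(target, vec) <= radius]
    density = len(neighbors)
    mean_growth = sum(growth[w] for w in neighbors) / density if density else float("nan")
    return density, mean_growth


# Toy data (hypothetical): per-decade frequencies and 2-d embeddings.
decades = [1, 2, 3, 4, 5]
frequencies = {"radio": [1, 3, 4, 8, 9], "telegraph": [9, 7, 6, 6, 2], "lamp": [5, 4, 6, 5, 7]}
growth = {word: spearman(decades, freq) for word, freq in frequencies.items()}
vectors = {"radio": [1.0, 0.0], "telegraph": [0.9, 0.1], "lamp": [0.0, 1.0]}

density, mean_growth = neighborhood_stats([1.0, 0.05], vectors, growth, radius=0.1)
print(density, round(mean_growth, 3))
```

Repeating the last call over several radius values would produce one density and one growth value per neighborhood size, which is the shape of data the density and growth files describe.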

Dependencies

Core dependencies:

  • Python >= 3.6
  • SciPy >= 1.0.1
  • Gensim >= 3.7.0, < 4.0
  • NLTK
  • MATLAB (for GLM analysis only)

Reference

@article{ryskina2020where,
 title={Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods},
 author={Ryskina, Maria and Rabinovich, Ella and Berg-Kirkpatrick, Taylor and Mortensen, David R. and Tsvetkov, Yulia},
 journal={Proceedings of the Society for Computation in Linguistics},
 volume={3},
 number={1},
 pages={43--52},
 year={2020}
}