Codebase for investigating the semantic capabilities of language models in formal language environments. Specifically, in these environments, the process generating the training data is idealized in terms of pragmatic theory and controllable via many hyperparameters. This enables testing hypotheses about how speakers' pragmatic concerns embed semantic information in raw text that language models can leverage.
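As a rough illustration of the kind of data-generating process involved (the function name and details below are hypothetical; generate.py defines the actual speaker models), an RSA-style informative speaker can be sketched as:

```python
import math

def informative_speaker(world, utterances, meanings, temp=5.0, cost=0.5):
    """Sketch of an RSA informative speaker: P(u | w) is proportional to
    exp(temp * (log L0(w | u) - cost)), where the literal listener L0 puts
    a uniform posterior over the worlds in which u is true. This is an
    illustrative reconstruction, not the repo's implementation."""
    scores = []
    for u in utterances:
        true_worlds = meanings[u]  # worlds where u is literally true
        if world in true_worlds:
            # Informativity: log of the literal listener's posterior on the world.
            informativity = -math.log(len(true_worlds))
            scores.append(math.exp(temp * (informativity - cost)))
        else:
            scores.append(0.0)  # the speaker never says something false
    z = sum(scores)
    return [s / z for s in scores]
```

With a higher temp, probability mass concentrates on the most informative true utterance; here cost penalizes every utterance uniformly, which is a simplifying assumption.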
Install the dependencies:

pip install allennlp allennlp_models
To generate training data in the powerset language:

python generate.py powerset --temp=5 --cost=.5 > documents.txt
Development and test sets can be generated by specifying a different random seed:
python generate.py powerset --seed=3 --temp=5 --cost=.5 > dev_documents.txt
The full set of commands for generating all the training data can be found in scripts/generate.s:
source scripts/generate.s
The following command shows how to train and save a language model on the synthetic data:
CUDA=0 TRAIN=documents.txt DEV=dev_documents.txt allennlp train training_config/bi_lm.jsonnet -s=rsa1_model
Note that this step can be done with whatever language modeling framework you want; I'm using AllenNLP.
I have also provided scripts to train a suite of models on the NYU Slurm cluster, assuming all the data has been set up with generate.s.
SPEAKER=literal source scripts/launch_train.sh
SPEAKER=informative source scripts/launch_train.sh
SPEAKER=independent source scripts/launch_train.sh
General evaluation (first set ROOT=$SCRATCH/synthetic-language-understanding):
python evaluate.py independent \
--model_dir=$ROOT/models/powerset-3/literal \
--eval_path=$ROOT/data/powerset-3/eval.tsv
Evaluation with cost set to its gold value:
ROOT=$SCRATCH/synthetic-language-understanding
python evaluate.py informative --cost=0.1 \
--model_dir=$ROOT/models/powerset-3/informative \
--eval_path=$ROOT/data/powerset-3/eval.tsv
We use the script generate_compositional_test_data.py to generate pairs of entailed and non-entailed texts. In addition to entailment labels, it also outputs the gold probability of each text according to the RSA model. Use the argument n_items to specify the number of worlds, and max_sent_len to specify the maximum number of lexical items in a premise or hypothesis (not including the stop token). For example:
cd ${PROJECT_ROOT}
for agent in vanilla dependent; do
python generate_compositional_test_data.py ${lang} \
--${agent} \
--n_items=3 \
--temp=5 \
--cost=0.1 \
--max_sent_len=5 \
--eval_dir=data/powerset/dependent
done
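For reference, entailment under this kind of conjunctive powerset semantics can be sketched as follows (a hypothetical reconstruction; the labels actually used come from generate_compositional_test_data.py): a text denotes the set of worlds consistent with all of its lexical items, and a premise entails a hypothesis when its denotation is a subset of the hypothesis's.

```python
def denotation(text, worlds, meanings):
    """Worlds in which every lexical item of the text is true (conjunctive
    semantics). meanings[item] is the set of worlds where item holds.
    A sketch only; the repo's generation script is authoritative."""
    ws = set(worlds)
    for item in text:
        ws &= meanings[item]
    return ws

def entails(premise, hypothesis, worlds, meanings):
    """Premise entails hypothesis iff every world satisfying the premise
    also satisfies the hypothesis."""
    return denotation(premise, worlds, meanings) <= denotation(hypothesis, worlds, meanings)
```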
The following command will reproduce our n-gram model evaluation.
cd ${PROJECT_ROOT}/src/evaluation
python evaluate_entailment.py \
--test_data=data/powerset/dependent/eval_entail-3_worlds-5_sents.tsv \
--distributional_model=ngram \
--lang=powerset \
--dependent \
--n_items=3 \
--cost=0.1 \
--training_dir=data/powerset/dependent \
--order=3 \
--size=100000000 \
--plot_type=line \
--complexity=length \
--n-increments=22 \
--auc
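The count-based n-gram baseline of order 3 can be sketched roughly as follows (the add-one smoothing here is an assumption; evaluate_entailment.py implements the actual model):

```python
import math
from collections import Counter

def train_ngram(corpus, order=3):
    """Count n-grams and their contexts over a tokenized corpus.
    A sketch of the order-3 baseline, not the repo's exact implementation."""
    counts, contexts = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] * (order - 1) + sent + ["</s>"]
        for i in range(order - 1, len(toks)):
            ctx = tuple(toks[i - order + 1 : i])
            counts[ctx + (toks[i],)] += 1
            contexts[ctx] += 1
    return counts, contexts

def log_prob(sent, counts, contexts, vocab_size, order=3):
    """Add-one-smoothed log probability of a sentence under the counts."""
    toks = ["<s>"] * (order - 1) + sent + ["</s>"]
    lp = 0.0
    for i in range(order - 1, len(toks)):
        ctx = tuple(toks[i - order + 1 : i])
        lp += math.log((counts[ctx + (toks[i],)] + 1) / (contexts[ctx] + vocab_size))
    return lp
```

Scores from such a model can then be compared across entailed and non-entailed pairs, e.g. to compute an AUC as with the --auc option.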