This repository contains the WordNet Language Model Probing (WNLaMPro) dataset. Each line of the dataset file (dataset/WNLaMPro.txt) has the following form (note that all columns are separated by tabs rather than spaces):
<ID> <SET_TYPE> <KEY_WORD> <RELATION> <TARGET_WORD1> <TARGET_WORD2> ...
The columns have the following meaning:
<ID>: A unique identifier for this dataset entry
<SET_TYPE>: Either test or dev, depending on whether this entry belongs to the test or development subset of the dataset
<KEY_WORD>: The key word in the <ANNOTATED_WORD> format (see below)
<RELATION>: The relation of this entry, either antonym, hypernym, cohyponym or corruption
<TARGET_WORDn>: The n-th target word for this dataset entry, in the <ANNOTATED_WORD> format (see below)
Each key and target word of the WNLaMPro dataset is represented as an <ANNOTATED_WORD> in the following form:
<ANNOTATED_WORD> := <WORD> (<POS>,<FREQ>,<COUNT>)
The components have the following meaning:
<WORD>: The actual word
<POS>: The part-of-speech tag for this word (either noun or adjective)
<FREQ>: The estimated Zipf frequency for this word, obtained using wordfreq
<COUNT>: The number of occurrences of this word in the Westbury Wikipedia corpus
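As an illustration of the two formats described above, the following is a minimal parsing sketch and is not part of the repository. The names AnnotatedWord, Entry, parse_annotated_word and parse_line are hypothetical. The sketch assumes that <FREQ> parses as a float and <COUNT> as an integer, and it keeps <POS> as a raw string because the exact tag spelling used in the file is not stated here.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches "<WORD> (<POS>,<FREQ>,<COUNT>)". The word itself may contain spaces.
_ANNOTATED_RE = re.compile(
    r"^(?P<word>.+) \((?P<pos>[^,()]+),(?P<freq>[^,()]+),(?P<count>[^,()]+)\)$"
)


@dataclass
class AnnotatedWord:
    word: str
    pos: str      # kept as the raw tag string found in the file
    freq: float   # assumption: Zipf frequency is stored as a decimal number
    count: int    # assumption: corpus count is stored as an integer


@dataclass
class Entry:
    entry_id: str
    set_type: str
    key_word: AnnotatedWord
    relation: str
    targets: List[AnnotatedWord]


def parse_annotated_word(text: str) -> AnnotatedWord:
    """Parse a single <ANNOTATED_WORD> string."""
    match = _ANNOTATED_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Not in <ANNOTATED_WORD> format: {text!r}")
    return AnnotatedWord(
        word=match["word"],
        pos=match["pos"].strip(),
        freq=float(match["freq"]),
        count=int(match["count"]),
    )


def parse_line(line: str) -> Entry:
    """Parse one tab-separated dataset line into an Entry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"Expected at least 5 tab-separated columns: {line!r}")
    entry_id, set_type, key_word, relation, *targets = fields
    return Entry(
        entry_id=entry_id,
        set_type=set_type,
        key_word=parse_annotated_word(key_word),
        relation=relation,
        targets=[parse_annotated_word(t) for t in targets],
    )
```

To read the whole dataset, one could call parse_line on every non-empty line of dataset/WNLaMPro.txt, for example to filter entries by set type or relation before evaluation.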
You can evaluate a pretrained language model on WNLaMPro as follows:
python3 eval-script/evaluate.py --root ROOT --predictions_file PREDICTIONS_FILE --output_file OUTPUT_FILE --model_cls MODEL_CLS --model_name MODEL_NAME [--embeddings EMBEDDINGS]
where
ROOT is the path to the directory where WNLaMPro.txt can be found;
PREDICTIONS_FILE is the name of the file in which predictions are to be stored (relative to ROOT);
OUTPUT_FILE is the name of the file in which the model's MRR is to be stored (relative to ROOT);
MODEL_CLS is either bert or roberta (the evaluation script currently does not support other pretrained language models);
MODEL_NAME is either the name of a pretrained model from the Hugging Face Transformers Library (e.g., bert-base-uncased) or the path to a finetuned model;
EMBEDDINGS (optional) is the path (relative to ROOT) of a file that contains embeddings which are used to overwrite the language model's original embeddings. Each line of this file has to be in the format <WORD> <EMBEDDING>, for example apple -0.12 3.45 0.23 ... 0.03.
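The embeddings file format can be produced with a few lines of Python. The following is a minimal sketch and is not part of the repository: format_embedding_line and write_embeddings are hypothetical helpers, and the sketch assumes that the word and the vector components are separated by single spaces, as the apple example suggests. It is presumably also necessary for each vector to have the same dimensionality as the model's original embeddings, which this sketch does not check.

```python
from typing import Mapping, Sequence


def format_embedding_line(word: str, vector: Sequence[float]) -> str:
    """Return one line in the '<WORD> <EMBEDDING>' format.

    Assumption: components are space-separated decimal numbers.
    """
    return word + " " + " ".join(f"{value:.6g}" for value in vector)


def write_embeddings(path: str, embeddings: Mapping[str, Sequence[float]]) -> None:
    """Write one '<WORD> <EMBEDDING>' line per word to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, vector in embeddings.items():
            handle.write(format_embedding_line(word, vector) + "\n")
```

The resulting file would then be placed under ROOT and its relative path passed via --embeddings.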
For additional parameters, check the content of eval-script/evaluate.py or run python3 eval-script/evaluate.py --help.
If you make use of the WNLaMPro dataset, please cite the following paper:
@inproceedings{schick2020rare,
title={Rare words: A major problem for contextualized representation and how to fix it by attentive mimicking},
author={Schick, Timo and Sch{\"u}tze, Hinrich},
url="https://arxiv.org/abs/1904.06707",
booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
year={2020}
}