This repository contains the WordNet Language Model Probing (WNLaMPro) dataset introduced in "Rare Words: A Major Problem for Contextualized Embeddings and How to Fix it by Attentive Mimicking".

timoschick/am-for-bert

WordNet Language Model Probing

This repository contains the WordNet Language Model Probing (WNLaMPro) dataset. Each line of the dataset file (dataset/WNLaMPro.txt) has the following form (note that all columns are separated by tabs rather than spaces):

<ID>  <SET_TYPE>  <KEY_WORD>  <RELATION>  <TARGET_WORD1>  <TARGET_WORD2>  ...

The columns have the following meaning:

  • <ID>: A unique identifier for this dataset entry
  • <SET_TYPE>: Either dev or test, depending on whether this entry belongs to the development or test subset of WNLaMPro
  • <KEY_WORD>: The key word in the <ANNOTATED_WORD> format (see below)
  • <RELATION>: The relation of this entry, either antonym, hypernym, cohyponym or corruption
  • <TARGET_WORDn>: The n-th target word for this dataset entry, in the <ANNOTATED_WORD> format (see below)
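As a minimal sketch of how one such line could be read, the following splits a line on tabs and keeps the key and target words as raw <ANNOTATED_WORD> strings. The names `Entry` and `parse_line` are hypothetical and not part of this repository, and the sample line is invented for illustration (including its ID, POS tag and numbers).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    entry_id: str            # <ID>, kept as a string since its exact form is not assumed
    set_type: str            # <SET_TYPE>: "dev" or "test"
    key_word: str            # <KEY_WORD> as a raw <ANNOTATED_WORD> string
    relation: str            # <RELATION>
    target_words: List[str]  # raw <ANNOTATED_WORD> strings


def parse_line(line: str) -> Entry:
    """Split one tab-separated dataset line into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated columns, got {len(fields)}")
    entry_id, set_type, key_word, relation, *targets = fields
    return Entry(entry_id, set_type, key_word, relation, targets)


# Invented sample line, not taken from the dataset.
sample = "42\ttest\tlime (noun,3.5,1200)\thypernym\tfruit (noun,4.9,50000)\tcitrus (noun,3.6,2000)"
entry = parse_line(sample)
print(entry.relation, len(entry.target_words))
```

Splitting on "\t" only, rather than on arbitrary whitespace, matters here because each annotated word itself contains a space.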

Annotated Words

Each key and target word of the WNLaMPro dataset is represented as an <ANNOTATED_WORD> in the following form:

<ANNOTATED_WORD> := <WORD> (<POS>,<FREQ>,<COUNT>)

The components have the following meaning:

  • <WORD>: The actual word
  • <POS>: The part-of-speech tag for this word (either noun or adjective)
  • <FREQ>: The estimated Zipf frequency for this word, obtained using wordfreq
  • <COUNT>: The number of occurrences of this word in the Westbury Wikipedia corpus
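The format above can be unpacked with a regular expression. This is a sketch under assumptions: `AnnotatedWord` and `parse_annotated_word` are hypothetical names, <FREQ> is assumed to parse as a float and <COUNT> as an integer, and the example string is invented rather than taken from the dataset.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedWord:
    word: str
    pos: str
    zipf_freq: float   # <FREQ>, assumed to be a decimal number
    corpus_count: int  # <COUNT>, assumed to be an integer


# <WORD> (<POS>,<FREQ>,<COUNT>); the word itself may contain spaces.
_ANNOTATED = re.compile(
    r"^(?P<word>.+) \((?P<pos>[^,()]+),(?P<freq>[^,()]+),(?P<count>[^,()]+)\)$"
)


def parse_annotated_word(text: str) -> AnnotatedWord:
    """Parse a string of the form '<WORD> (<POS>,<FREQ>,<COUNT>)'."""
    match = _ANNOTATED.match(text.strip())
    if match is None:
        raise ValueError(f"not an annotated word: {text!r}")
    return AnnotatedWord(
        word=match["word"],
        pos=match["pos"],
        zipf_freq=float(match["freq"]),
        corpus_count=int(match["count"]),
    )


# Invented example value.
parsed = parse_annotated_word("lime (noun,3.5,1200)")
print(parsed)
```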

Evaluation Script

You can evaluate a pretrained language model on WNLaMPro as follows:

python3 eval-script/evaluate.py --root ROOT --predictions_file PREDICTIONS_FILE --output_file OUTPUT_FILE --model_cls MODEL_CLS --model_name MODEL_NAME [--embeddings EMBEDDINGS]

where

  • ROOT is the path to the directory where WNLaMPro.txt can be found;
  • PREDICTIONS_FILE is the name of the file in which predictions are to be stored (relative to ROOT);
  • OUTPUT_FILE is the name of the file in which the model's mean reciprocal rank (MRR) is to be stored (relative to ROOT);
  • MODEL_CLS is either bert or roberta (the evaluation script currently does not support other pretrained language models);
  • MODEL_NAME is either the name of a pretrained model from the Hugging Face Transformers library (e.g., bert-base-uncased) or the path to a fine-tuned model;
  • EMBEDDINGS (optional) is the path (relative to ROOT) of a file that contains embeddings which are used to overwrite the language model's original embeddings. Each line of this file has to be in the format <WORD> <EMBEDDING>, for example apple -0.12 3.45 0.23 ... 0.03.
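An EMBEDDINGS file in the <WORD> <EMBEDDING> format could be written roughly as follows. This is a sketch, not code from this repository: `format_embedding_line` and `write_embeddings` are hypothetical helpers, a single space is assumed as the separator, and the vectors are presumably expected to match the embedding size of the chosen model.

```python
from typing import Dict, Sequence


def format_embedding_line(word: str, vector: Sequence[float]) -> str:
    """Render one line as '<WORD> <EMBEDDING>' with space-separated values."""
    return word + " " + " ".join(str(value) for value in vector)


def write_embeddings(path: str, embeddings: Dict[str, Sequence[float]]) -> None:
    """Write one '<WORD> <EMBEDDING>' line per word to the given file."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, vector in embeddings.items():
            handle.write(format_embedding_line(word, vector) + "\n")


# Toy three-dimensional vector; real embeddings are much longer.
line = format_embedding_line("apple", [-0.12, 3.45, 0.23])
print(line)
```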

For additional parameters, check the content of eval-script/evaluate.py or run python3 eval-script/evaluate.py --help.
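For readers unfamiliar with the metric written to OUTPUT_FILE, the sketch below shows how a mean reciprocal rank can be computed in general. It is not the code of eval-script/evaluate.py: the function names are hypothetical, and it assumes that an entry with several target words is scored by its best-ranked target, which may differ from what the script does.

```python
from typing import Iterable, List, Set, Tuple


def reciprocal_rank(ranked_predictions: List[str], targets: Set[str]) -> float:
    """Return 1/rank of the first prediction that is a target, or 0.0 if none is."""
    for rank, word in enumerate(ranked_predictions, start=1):
        if word in targets:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(entries: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Average the reciprocal ranks over all (predictions, targets) pairs."""
    scores = [reciprocal_rank(predictions, targets) for predictions, targets in entries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy predictions: first hit at rank 2, at rank 1, and no hit at all.
toy_entries = [
    (["a", "b", "c"], {"b"}),
    (["x", "y"], {"x"}),
    (["p"], {"q"}),
]
mrr = mean_reciprocal_rank(toy_entries)
print(mrr)
```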

Citation

If you make use of the WNLaMPro dataset, please cite the following paper:

@inproceedings{schick2020rare,
  title={Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking},
  author={Schick, Timo and Sch{\"u}tze, Hinrich},
  url="https://arxiv.org/abs/1904.06707",
  booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year={2020}
}
