pmb2tsv is a collection of scripts to convert data from the Parallel Meaning
Bank (PMB) into column-based (.tsv) files. It also
extracts the raw and tokenized data in the form of .tok.iob files and the
DRSs in the form of .drs.clf files.
Please download PMB 3.0.0 and/or 4.0.0 and extract the resulting
pmb-3.0.0 and/or pmb-4.0.0 directory into a data directory in the root
directory of this repository (or symlink it).
Scripts to convert the files are mostly found in this repository; however, the following software needs to be present on the system:
- Python 3 – the python3 executable should be on your $PATH.
- Produce – the produce executable should be on your $PATH.
- SWI-Prolog 7 or higher – the swipl executable should be on your $PATH. Tested with version 8.4.2.
- GNU Parallel – the parallel executable should be on your $PATH.
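Before running any conversion, it can help to confirm that the four executables listed above are reachable. The following is a minimal sketch and not part of this repository; the helper name `missing_executables` is hypothetical, and the check does not verify the SWI-Prolog version.

```python
import shutil

# Executables named in the requirements list above.
REQUIRED = ["python3", "produce", "swipl", "parallel"]


def missing_executables(names=REQUIRED):
    """Return the names that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Missing from $PATH: " + ", ".join(missing))
    else:
        print("All required executables found.")
```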
The data is extracted in the following formats:
- .tok.iob: Tokenization information in character-level BIO format
- .parse.tags: CCG derivations with tags in Prolog syntax
- .drs.clf: DRSs in clause format
Individual per-token annotation layers are extracted to files with these extensions:
- .toknum (token number)
- .tok (token form)
- .lemma (lemma)
- .sem (semantic tag)
- .wordnet (WordNet sense)
- .dep (dependency head)
- .frag (DRS fragment)
All of these are integrated into TSV files (the columns are in the above order):
.tsv
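As a rough illustration of how the TSV output might be consumed, the sketch below reads tab-separated rows into dictionaries keyed by the per-token layers in the order given above. It is an assumption-laden example, not code from this repository: the function name `read_tokens` is hypothetical, and treating blank lines as sentence separators is a guess about the file layout.

```python
import csv
import io

# Column names follow the order of the per-token layers listed above.
COLUMNS = ["toknum", "tok", "lemma", "sem", "wordnet", "dep", "frag"]


def read_tokens(handle):
    """Yield one dict per non-empty row, keyed by COLUMNS."""
    # QUOTE_NONE keeps quote characters inside tokens as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not any(row):  # skip blank lines (assumed sentence separators)
            continue
        # Pad short rows, since some columns may be empty or missing.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


# Placeholder rows for illustration only, not real PMB data.
sample = io.StringIO("1\ta\tb\tc\td\t0\te\n\n2\tf\n")
rows = list(read_tokens(sample))
```

With a real file, the same function could be called on an open file handle, for example one opened with `open(path, encoding="utf-8", newline="")`.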
The output file names before the extension contain the following information, indicating which data is in each file:
- PMB version (3.0.0 or 4.0.0)
- language (en, de, it, or nl)
- status (gold, silver, or bronze)
- part (p00, p01, ..., or p99) or portion (train, dev, test, or eval). The latter are concatenated together from the former according to the division in the PMB README (different for 3.0.0 and 4.0.0).
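The naming scheme can be made concrete with a small parser. This is a sketch under assumptions: the exact pattern (`pmb-VERSION-LANGUAGE-STATUS-PART.EXTENSION`) is inferred from the example target names in this README, and the function name `parse_output_name` and the field name `subset` are hypothetical.

```python
import re

# Pattern for names such as "pmb-4.0.0-en-gold-p00.drs.clf".
NAME_RE = re.compile(
    r"^pmb-(?P<version>3\.0\.0|4\.0\.0)"
    r"-(?P<language>en|de|it|nl)"
    r"-(?P<status>gold|silver|bronze)"
    r"-(?P<subset>p\d{2}|train|dev|test|eval)"
    r"\.(?P<extension>.+)$"
)


def parse_output_name(path):
    """Split an output file name into its components, or return None."""
    name = path.rsplit("/", 1)[-1]  # drop a leading directory such as "out/"
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


info = parse_output_name("out/pmb-4.0.0-en-gold-p00.drs.clf")
```

Names that do not fit the assumed pattern yield `None` rather than raising an error, so the function can be used to filter a directory listing.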
Use the produce command to produce the file(s) with the information you need.
For example, to extract the .drs.clf file for the PMB 4.0.0 English gold part
00:
produce out/pmb-4.0.0-en-gold-p00.drs.clf
To produce train, dev, and test data in TSV format for all languages:
produce out/pmb-4.0.0-{en,de,it,nl}-gold-{train,dev,test}.tsv
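The brace pattern in the command above is expanded by the shell, not by produce. When driving produce from a script instead, the same target names can be built explicitly. The sketch below is one way to do that; the function names `tsv_targets` and `run_produce` are hypothetical, and `run_produce` assumes that produce is on your $PATH and accepts several targets in one call, as the shell example implies.

```python
import itertools
import subprocess


def tsv_targets(version="4.0.0", languages=("en", "de", "it", "nl"),
                status="gold", portions=("train", "dev", "test")):
    """Build the target names that the brace pattern expands to."""
    return [
        f"out/pmb-{version}-{lang}-{status}-{portion}.tsv"
        for lang, portion in itertools.product(languages, portions)
    ]


def run_produce(targets):
    """Invoke produce once with all targets (requires produce on $PATH)."""
    subprocess.run(["produce", *targets], check=True)


targets = tsv_targets()
```

Calling `run_produce(targets)` would then request all twelve files in a single invocation.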
There are also shorthands to extract all data for a given version and status. For example:
produce gold-4.0.0
produce silver-4.0.0
produce bronze-4.0.0
For a small number of CCG derivations, especially some that are not fully corrected, dependency extraction will fail. The corresponding columns will be empty/missing. In extremely rare cases a dependency non-tree (a cyclic graph) may be extracted.
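Downstream code may want to detect such cyclic cases before treating the dependency column as a tree. The following is a minimal sketch, not taken from this repository. It assumes that heads are given as 1-based token numbers with 0 marking the root, which is a guess about the encoding of the .dep layer, and the function name `has_cycle` is hypothetical.

```python
def has_cycle(heads):
    """Return True if following head links from some token never reaches the root.

    heads[i] is the head of token i + 1; 0 marks the root (an assumption).
    Every non-zero head must be a valid token number.
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:  # visited this token before: a cycle
                return True
            seen.add(node)
            if not 1 <= node <= len(heads):
                raise ValueError(f"head {node} is not a valid token number")
            node = heads[node - 1]
    return False


tree = [2, 0, 2]   # tokens 1 and 3 attach to token 2, which is the root
cyclic = [2, 1, 0]  # tokens 1 and 2 point at each other
```

Rows with an empty dependency column would need to be filtered out or handled separately before calling this function.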
For details on the conversion from CCG derivations to dependency trees, see
Kilian Evang (2020): Configurable Dependency Tree Extraction from CCG
Derivations. Proceedings of the Universal Dependencies Workshop.
To reproduce the experiments from that paper, check out the
evang-2020-configurable tag and run:
produce pmb-3.0.0-{en,de,it,nl}-gold-{p00,p01}.eval