COLF-VID is a COrpus of Literal and Figurative readings of German Verbal IDioms in context. It comes in 34 files containing annotated instances (along with the sentences they occur in) of 34 different German verbal idiom (VID) types. The annotation consists of four labels: LITERAL (LIT), IDIOMATIC/FIGURATIVE (FIG), UNDECIDABLE (UND), and BOTH (BOTH). A more detailed description of the corpus can be found in the paper "Supervised Disambiguation of German Verbal Idioms with a BiLSTM Architecture".
At the moment, there exist three different versions of COLF-VID:
- COLF-VID_1.0: The version of the corpus that was used during the experiments described in the paper. It was lemmatized with GermaLemma and POS tagged with the TreeTagger.
- COLF-VID_1.1: The cleaned version of COLF-VID_1.0, with the duplicates that COLF-VID_1.0 contained removed. It does not contain lemmas or POS tags at the moment; we plan to add those, along with dependency information, using UDPipe in the near future.
- COLF-VID_2.0: Work in progress. We aim to add annotations for VID instances in the corpus that were not part of the pre-chosen set of the 34 VID types and thus were not annotated in the first run.
COLF-VID is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. (The paper erroneously states that it is made available under the Creative Commons Attribution-ShareAlike 4.0 International license.)
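
As a small illustration of the four-label annotation scheme described above, the following Python sketch maps the short labels to their full names and counts how often each one occurs. It deliberately makes no assumption about the file format of the corpus: the names `LABELS` and `label_distribution` and the example label list are hypothetical, and the labels would first have to be read from the corpus files by whatever means fits their actual layout.

```python
from collections import Counter

# The four annotation labels named in the corpus description,
# keyed by their short form.
LABELS = {
    "LIT": "LITERAL",
    "FIG": "IDIOMATIC/FIGURATIVE",
    "UND": "UNDECIDABLE",
    "BOTH": "BOTH",
}


def label_distribution(labels):
    """Count how often each short label occurs; reject unknown labels."""
    # Start every known label at zero so absent labels still show up.
    counts = Counter({short: 0 for short in LABELS})
    for label in labels:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[label] += 1
    return dict(counts)


# Hypothetical labels for a handful of instances (not real corpus data).
example = ["FIG", "FIG", "LIT", "UND", "FIG", "BOTH"]
distribution = label_distribution(example)
for short, count in distribution.items():
    print(f"{LABELS[short]} ({short}): {count}")
```

Rejecting unknown labels, rather than silently counting them, makes typos or format changes between corpus versions visible early.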