Chunking Historical German

This repository contains the manually annotated data sets and additional material for the paper:

Katrin Ortmann (2021). Chunking Historical German. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland (online), pages 190–199. PDF

Gold data

The manually annotated data sets can be found in the data folder.

The data sets were taken from a previous study on topological fields (Ortmann, 2020) and enriched with chunks as well as additional corrections of POS tags and sentence boundaries. More information on these data sets including a mapping of HIPKON POS tags to STTS can be found here.

The data is available in CoNLL-2000 format (cf. Tjong Kim Sang & Buchholz, 2000). Each line holds one token with its word form, STTS POS tag, and BIO chunk tag, separated by spaces.
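As an illustration of the format described above, the following minimal sketch reads three-column lines into sentences. It assumes that sentences are separated by blank lines, as is usual for CoNLL-2000 data. The function name `read_conll2000` and the sample tokens are made up for this example and are not taken from the repository.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word form, STTS POS tag, BIO chunk tag)


def read_conll2000(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of (form, pos, chunk) triples."""
    sentence: List[Token] = []
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        form, pos, chunk = line.split()
        sentence.append((form, pos, chunk))
    if sentence:
        # Emit the last sentence if the input does not end with a blank line.
        yield sentence


# Invented sample, not taken from the gold data.
sample = """Der ART B-NC
Hund NN I-NC
schläft VVFIN O

Er PPER B-NC
""".splitlines()

sentences = list(read_conll2000(sample))
```

To read a file instead of the inline sample, an open file handle can be passed as `lines`.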

The following chunk types are included:

  • NC (noun chunk)
  • PC (prepositional chunk)
  • ADVC (adverb chunk)
  • AC (adjective chunk)
  • sNC (stranded noun chunk)
  • sPC (stranded prepositional chunk)
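The BIO tags can be turned back into chunk spans with a few lines of code. The sketch below is one possible way to do this; the function name `bio_to_spans` is hypothetical, and treating an `I-` tag without a preceding `B-` tag as the start of a new chunk is an assumption of this example, not something the data format prescribes.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (chunk_type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    # The extra "O" acts as a sentinel that closes a chunk at the sentence end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, ctype = tag.partition("-")
        continues = prefix == "I" and ctype == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # Assumption: a stray I- tag opens a new chunk, like a B- tag.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, ctype
    return spans


example = ["B-PC", "I-PC", "I-PC", "O", "B-NC", "B-NC"]
spans = bio_to_spans(example)
```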

Additional material

Mercurius POS to STTS mapping

The following mapping rules are used to derive STTS tags (Schiller et al., 1999) from the POS tagset of the Mercurius corpus (Demske, 2005). Tags not listed here remain unchanged.

POS STTS
$! $.
$: $.
$; $.
$? $.
-- XY
KOMPE POS of following token
NNE NN
PROAV PAV
UNKNOWN XY
VVPG ADJD
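The table above could be applied to a tag sequence roughly as follows. This is a sketch, not the script used for the paper. The function name `map_mercurius` is hypothetical. Two details are assumptions of this example: a `KOMPE` token receives the already mapped STTS tag of the following token, and a `KOMPE` at the very end of a sequence is left unchanged because there is no following token.

```python
from typing import List

# Mapping rows copied from the table above; unlisted tags remain unchanged.
MERCURIUS_TO_STTS = {
    "$!": "$.",
    "$:": "$.",
    "$;": "$.",
    "$?": "$.",
    "--": "XY",
    "NNE": "NN",
    "PROAV": "PAV",
    "UNKNOWN": "XY",
    "VVPG": "ADJD",
}


def map_mercurius(tags: List[str]) -> List[str]:
    """Map a sequence of Mercurius POS tags to STTS tags."""
    mapped = [MERCURIUS_TO_STTS.get(tag, tag) for tag in tags]
    # Walk right to left so that consecutive KOMPE tokens all inherit
    # the tag of the first non-KOMPE token that follows them.
    for i in range(len(mapped) - 2, -1, -1):
        if mapped[i] == "KOMPE":
            mapped[i] = mapped[i + 1]
    return mapped


result = map_mercurius(["ART", "KOMPE", "NNE", "$!"])
```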

ReF.UP POS to STTS mapping

The following mapping rules are used to derive STTS tags (Schiller et al., 1999) from the POS tagset of the ReF.UP corpus, a subcorpus of the Reference Corpus of Early New High German (Wegera et al., 2021). Documentation (in German) can be found here.

POS STTS
-- XY
$! $.
$( $(
$, $,
$. $.
$: $.
$; $.
$? $.
$MK $,
$MSBI $.
$QL $(
$QR $(
ADJA ADJA
ADJD ADJD
ADJN ADJD
ADJS ADJA
ADJV ADJD
ADV ADV
APPO APPO
APPR APPR
APPRDARTB APPRART
APZR APZR
AVD ADV
AVNEG ADV
AVREL ADV
AVW PWAV
CARD CARD
DARTB ART
DARTU ART
DDEM PDAT
DINDEF PIAT
DPOS PPOSAT
DW PWAT
FM FM
ITJ ITJ
KOKOM KOKOM
KON KON
KOUI KOUI
KOUS KOUS
NA NN
NE NE
PAVAP ADV
PAVD ADV
PAVDAP PAV
PAVREL ADV
PAVRELAP PAV
PAVW PWAV
PAVWAP PWAV
PDEM PDS
PINDEF PIS
PPER PPER
PPOS PPOSS
PRELAT PRELAT
PRELS PRELS
PRF PRF
PTKA PTKA
PTKANT PTKANT
PTKNEG PTKNEG
PTKREL ADV
PTKVZ PTKVZ
PTKZU PTKZU
PW PWS
PWAV PWAV
SPELL XY
TRUNC TRUNC
VAFIN VAFIN
VAIMP VAIMP
VAINF VAINF
VAINFS VAINF
VAPP VAPP
VAPPA ADJA
VAPPD ADJD
VAPPN VAPP
VAPSA ADJA
VAPSD ADJD
VAPSN VAPP
VAPSS NN
VMFIN VMFIN
VMIMP VMIMP
VMINF VMINF
VMINFS NN
VMPP VMPP
VVFIN VVFIN
VVIMP VVIMP
VVINF VVINF
VVINFS NN
VVIZU VVIZU
VVPP VVPP
VVPPA ADJA
VVPPD ADJD
VVPPN VVPP
VVPPS NN
VVPS VVPP
VVPSA ADJA
VVPSD ADJD
VVPSN VVPP
VVPSS NN
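Because each row of the table consists of a source tag and a target tag separated by whitespace, the table text can be loaded into a lookup directly. The sketch below uses only a short excerpt of the rows above. The names `parse_mapping`, `REFUP_EXCERPT`, and `map_refup` are hypothetical, and raising an error for tags that are missing from the table is a choice made for this example.

```python
from typing import Dict, List


def parse_mapping(table: str) -> Dict[str, str]:
    """Parse 'SOURCE TARGET' rows into a dictionary."""
    mapping: Dict[str, str] = {}
    for line in table.strip().splitlines():
        source, target = line.split()
        mapping[source] = target
    return mapping


# Excerpt of the ReF.UP table above; the full table would be pasted here.
REFUP_EXCERPT = """
$MK $,
$MSBI $.
ADJN ADJD
DDEM PDAT
NA NN
VVPPS NN
"""

refup_to_stts = parse_mapping(REFUP_EXCERPT)


def map_refup(tags: List[str], mapping: Dict[str, str] = refup_to_stts) -> List[str]:
    """Map ReF.UP tags to STTS; unknown tags raise KeyError so gaps are noticed."""
    return [mapping[tag] for tag in tags]


mapped = map_refup(["DDEM", "NA", "$MSBI"])
```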

RegExp Rules

The following rules are used for chunk identification with the RegExp chunker:

PC:
{<KOKOM>*<APPR><(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|ADJD|ADV|PTKNEG|$,|$\(|KON|TRUNC)>*<(NN|NE)>+<APZR>*}
{<KOKOM>*<APPRART><(CARD|ADJA|ADJD|ADV|PTKNEG|$,|$\(|KON|TRUNC)>*<(NN|NE)>+<APZR>*}
{<KOKOM>*<(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|TRUNC)><(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|ADJD|ADV|PTKNEG|$,|$\(|KON|TRUNC)>*<(NN|NE)>+<APPO>+}
{<KOKOM>*<(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|TRUNC)>*<(NN|NE)>+<APPO>+}
{<KOKOM>*<APPR><ART><(PIS|PPOSS)><APZR>*}
{<KOKOM>*<ART><(PIS|PPOSS)><APPO>}
{<KOKOM>*<(APPR|APPRART)><(PIS|PDS|PWS|PPER|PPOSS|PRELS|PRF)><APZR>*}
{<KOKOM>*<(PIS|PDS|PWS|PPER|PPOSS|PRELS|PRF)><APPO>}
{<KOKOM>*<APPR>*<PAV>}
NC:
{<KOKOM>*<(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|TRUNC)><(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|ADJD|ADV|PTKNEG|$,|$\(|KON|TRUNC)>*<(NN|NE)>+}
{<KOKOM>*<(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|TRUNC)>*<(NN|NE)>+}
{<KOKOM>*<ART><(PIS|PPOSS)>}
{<KOKOM>*<(PIS|PDS|PWS|PPER|PPOSS|PRELS|PRF)>}
AC:
{<KOKOM>*<(ADJA|ADV|PTKNEG|PTKA)>*<ADJD>+}
ADVC:
{<KOKOM>*<(ADV|PTKNEG)>+}
NC:
{<KOKOM>*<CARD>+}
sPC:
{<KOKOM>*<(APPR|APPRART)><(ART|PPOSAT|PDAT|PIAT|PWAT|ADJA)>*}
sNC:
{<KOKOM>*<(ART|PPOSAT|PDAT|PIAT|PWAT|ADJA)>}
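The rule syntax resembles the tag patterns of NLTK's RegexpParser, but this README does not name the library, so the following is a standard-library approximation and not the chunker used in the paper. It translates a rule into a regular expression over a string of `<TAG>` units and applies rules in the listed order. Two points are assumptions of this sketch: tokens chunked by an earlier rule are hidden from later rules, and tags never contain `<` or `>`. The names `compile_rule` and `chunk` are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def compile_rule(rule: str) -> "re.Pattern":
    """Translate one {<TAG>...} rule into a regex over '<TAG><TAG>...' strings."""
    body = rule.strip().strip("{}")
    parts = []
    for inner, quant in re.findall(r"<([^>]+)>([*+?]?)", body):
        if inner.startswith("(") and inner.endswith(")"):
            inner = inner[1:-1]
        # Drop the rule's own backslashes, then escape each tag literally.
        alts = [re.escape(a.replace("\\", "")) for a in inner.split("|")]
        parts.append("(?:<(?:%s)>)%s" % ("|".join(alts), quant))
    return re.compile("".join(parts))


def chunk(tags: List[str], grammar: List[Tuple[str, str]]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, end exclusive, sorted by start."""
    labels: List[Optional[str]] = [None] * len(tags)
    spans = []
    for label, rule in grammar:
        pattern = compile_rule(rule)
        # Hide tokens that an earlier rule has already chunked.
        units = ["<%s>" % t if labels[i] is None else "<#>" for i, t in enumerate(tags)]
        # Map character offsets in the joined string back to token indices.
        offsets, pos = {}, 0
        for i, unit in enumerate(units):
            offsets[pos] = i
            pos += len(unit)
        offsets[pos] = len(units)
        for match in pattern.finditer("".join(units)):
            start, end = offsets[match.start()], offsets[match.end()]
            if start == end:
                continue  # ignore empty matches
            spans.append((label, start, end))
            for i in range(start, end):
                labels[i] = label
    return sorted(spans, key=lambda span: span[1])


# Three of the rules listed above, in their original order.
GRAMMAR = [
    ("PC", r"{<KOKOM>*<APPR><(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|ADJD|ADV|PTKNEG|$,|$\(|KON|TRUNC)>*<(NN|NE)>+<APZR>*}"),
    ("NC", r"{<KOKOM>*<(ART|PPOSAT|PDAT|PIAT|PWAT|CARD|ADJA|TRUNC)>*<(NN|NE)>+}"),
    ("ADVC", r"{<KOKOM>*<(ADV|PTKNEG)>+}"),
]

chunks = chunk(["APPR", "ART", "ADJA", "NN", "VVFIN", "ART", "NN", "ADV"], GRAMMAR)
```

Listing the PC rules before the NC rules matters in this sketch: the noun inside a prepositional chunk is consumed by the PC rule first and is therefore not chunked again as an NC.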

License

The DTA corpus is licensed under CC BY-SA 4.0 and the HIPKON corpus under CC BY 3.0. The modern data set is licensed under CC BY-SA 3.0, except for the TED talk sample, which is provided under CC BY-NC-ND 4.0.

References

BBAW. 2019. Deutsches Textarchiv. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache. Berlin-Brandenburgische Akademie der Wissenschaften; http://www.deutschestextarchiv.de/.

Marco Coniglio, Karin Donhauser, and Eva Schlachter. 2014. HIPKON: Historisches Predigtenkorpus zum Nachfeld (Version 1.0). Humboldt-Universität zu Berlin. SFB 632 Teilprojekt B4.

Ulrike Demske. 2005. Mercurius-Baumbank (Version 1.1). Universität Potsdam. LAUDATIO access

Katrin Ortmann, Adam Roussel, and Stefanie Dipper. 2019. Evaluating Off-the-Shelf NLP Tools for German. In Proceedings of the Conference on Natural Language Processing (KONVENS), pages 212–222.

Katrin Ortmann. 2020. Automatic Topological Field Identification in (Historical) German Texts. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL), pages 10–18.

Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). Retrieved from http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf.

Klaus-Peter Wegera, Hans-Joachim Solms, Ulrike Demske, and Stefanie Dipper. 2021. Referenzkorpus Frühneuhochdeutsch (Version 1.0). https://www.linguistics.rub.de/ref
