redewiedergabe/corpus

Corpus "Rᴇᴅᴇᴡɪᴇᴅᴇʀɢᴀʙᴇ"

License: CC BY-NC-SA 4.0 DOI

A historical German-language corpus (1840-1919) of fictional and non-fictional texts, annotated for speech, thought and writing representation (STWR).

The corpus was created by the DFG-funded project "Redewiedergabe - eine literatur- und sprachwissenschaftliche Korpusanalyse" (Leibniz Institute for the German Language / University of Würzburg). Homepage: www.redewiedergabe.de

The following publication complements this technical description with in-depth discussion of corpus design and annotation. Please cite it when using the corpus:

Brunner, Annelen / Engelberg, Stefan / Jannidis, Fotis / Tu, Ngoc Duyen Tanja / Weimer, Lukas (2020): Corpus REDEWIEDERGABE, Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, pp. 803‑812.

The detailed annotation guidelines developed by project REDEWIEDERGABE are available at redewiedergabe.de/richtlinien/richtlinien.html or at DOI (in German).

If you encounter any issues or have any questions, please use GitHub's issue tracker.

Project Redewiedergabe also provides automatic taggers for German STWR, trained (mostly) on this corpus.

License

The corpus REDEWIEDERGABE (and the additional material) is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

We ask you to mention project "Redewiedergabe" regarding the annotation, and project TextGrid, Deutsches Textarchiv, Leibniz-Institut für Deutsche Sprache and Staats- und Universitätsbibliothek Bremen regarding the texts.

Available Data

Core corpus

| Corpus | Samples | Tokens | STWR instances | Notes |
|---|---|---|---|---|
| Main corpus | 838 | 489,459 | 12,123 | DOI; Detailed statistical data |
| Main corpus (Beta release) | 619 | 360,974 | 9,451 | DOI; Detailed statistical data; Differences to the final release |
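
For orientation, the figures in the core corpus table can be turned into two derived measures: average sample length and STWR density. The short sketch below only does arithmetic on the numbers listed above; the dictionary keys and the `summarize` helper are illustrative names, not part of the corpus distribution.

```python
# Figures copied from the core corpus table above.
releases = {
    "Main corpus": {"samples": 838, "tokens": 489_459, "stwr": 12_123},
    "Main corpus (Beta release)": {"samples": 619, "tokens": 360_974, "stwr": 9_451},
}


def summarize(stats):
    """Derive average sample length and STWR density from raw counts."""
    return {
        "tokens_per_sample": stats["tokens"] / stats["samples"],
        "stwr_per_1000_tokens": 1000 * stats["stwr"] / stats["tokens"],
    }


summary = {name: summarize(stats) for name, stats in releases.items()}

for name, s in summary.items():
    print(
        f"{name}: {s['tokens_per_sample']:.1f} tokens per sample, "
        f"{s['stwr_per_1000_tokens']:.1f} STWR instances per 1,000 tokens"
    )
```

These values are derived averages over all samples and are not figures reported by the project; the linked detailed statistical data remains the authoritative source.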

Additional Material

This is a collection of several types of additional annotated material produced by project Redewiedergabe. The material generally follows the same annotation guidelines and is available in the same formats as the core corpus, but has some idiosyncrasies and less quality control. For additional information about the different corpus parts follow the links in the table.

| Corpus part | Files | Tokens | STWR instances | Notes |
|---|---|---|---|---|
| Single-annotated samples | 258 | 150,162 | 4,395 | Annotations only by a single annotator |
| Single-annotated full texts (fictional) | 18 | 235,493 | 6,232 | Annotations only by a single annotator; NOTE: Annotation guidelines differ slightly with respect to speaker |
| Single-annotated full texts (non-fictional) | 15 | 84,769 | 1,472 | Annotations only by a single annotator |
| Indirect full texts | 16 | 51,864 | 272 | Only instances of indirect STWR with a simplified annotation system |
| Free indirect full texts (fictional) | 142 | 2,647,924 | 2,136 | Only instances of free indirect STWR with a simplified annotation system; semi-automated annotation |
| Primary annotations of the core corpus | 1,704 | 989,384 | 27,297 | Collection of all individual annotations of the core corpus |
| KONVENS 2020 data | | | | Data splits used for the STWR taggers, as described in the KONVENS 2020 paper |

Format

The core corpus is available in three different formats:

NOTE: The XMI files are compatible with the free annotation tool ATHEN (developed by Markus Krug in the Kallimachos project) and its STWR view (developed by Tanja Tu).

Project

The core corpus REDEWIEDERGABE and the additional material were created by the DFG-funded project "Redewiedergabe. Eine literatur- und sprachwissenschaftliche Korpusanalyse" (2017-2020) in a cooperation between Leibniz-Institut für Deutsche Sprache, Mannheim (Abteilung Lexik) and Universität Würzburg (Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte).

Project members: Annelen Brunner (IDS Mannheim), Stefan Engelberg (IDS Mannheim), Fotis Jannidis (Universität Würzburg), Ngoc Duyen Tanja Tu (IDS Mannheim), Lukas Weimer (Universität Würzburg).

In addition, the following people participated in the annotation: Sarah Gorke, Anna Hartmann, Janne Lorenzen, Christoph Peterek, Laura Schäfer, Lisa Sergel and Theresa Valta.

Project homepage: www.redewiedergabe.de

A list of all publications can be found here.

Text sources

The core corpus REDEWIEDERGABE is a historical corpus of fictional and non-fictional texts. These texts were published between 1840 and 1919 and were compiled from the following three sources:

  • Narrative texts from the "Digitale Bibliothek", converted to TEI format by project TextGrid
  • Texts from the magazine "Die Grenzboten", digitized by Universitätsbibliothek Bremen (Source: Die Grenzboten: Zeitschrift für Politik, Literatur und Kunst. Berlin: Dt. Verl, 1841-1922. Staats- und Universitätsbibliothek Bremen, Ac 7155 Public Domain Mark 1.0), TEI structuring by Deutsches Textarchiv and OCR correction by project "Redewiedergabe".
  • Texts from the "Mannheimer Korpus Historischer Zeitungen und Zeitschriften" (Mannheim corpus of historical newspapers and magazines), collected by the Leibniz-Institut für Deutsche Sprache and converted by Deutsches Textarchiv.

The corpus does not consist of complete texts but of text samples. The sample length is at least 500 tokens for texts from the Digitale Bibliothek and at least 200 tokens for newspaper/magazine texts. The samples were drawn randomly from the available material, subject to the following additional rules: for the texts from the Digitale Bibliothek, material by each author was considered evenly within a decade. Similarly, for the texts from the Mannheimer Korpus Historischer Zeitungen und Zeitschriften (MKHZ), the different newspapers/magazines were considered evenly. This prevented authors or newspapers with little material from dropping out entirely during the sampling process.
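
The sampling idea described above can be sketched as a round-robin over groups (authors or newspapers) within decades, so that groups with little material are still visited. This is a minimal illustration under stated assumptions, not the project's actual sampling code: the function name `draw_samples`, the dictionary keys `group`, `decade` and `tokens`, and the fixed-length window are all hypothetical.

```python
import random
from collections import defaultdict


def draw_samples(texts, n_samples, min_len, seed=0):
    """Draw fixed-length token windows, visiting every (decade, group) in turn.

    `texts` is a list of dicts with the hypothetical keys 'group'
    (author or newspaper), 'decade' and 'tokens' (a list of strings).
    """
    rng = random.Random(seed)

    # Bucket all texts that are long enough by (decade, group).
    buckets = defaultdict(list)
    for text in texts:
        if len(text["tokens"]) >= min_len:
            buckets[(text["decade"], text["group"])].append(text)

    # Visit buckets in a shuffled but fixed order; each bucket gets one
    # draw per round, regardless of how much material it contains.
    keys = sorted(buckets)
    rng.shuffle(keys)

    samples = []
    i = 0
    while keys and len(samples) < n_samples:
        decade, group = keys[i % len(keys)]
        text = rng.choice(buckets[(decade, group)])
        # Simplification: cut exactly `min_len` tokens at a random offset.
        # Windows drawn from the same text may overlap in this sketch.
        start = rng.randint(0, len(text["tokens"]) - min_len)
        samples.append({
            "group": group,
            "decade": decade,
            "tokens": text["tokens"][start:start + min_len],
        })
        i += 1
    return samples


# Toy data: author A has ten texts, author B only one, in the same decade.
toy_texts = [
    {"group": "A", "decade": 1850, "tokens": ["tok"] * 2000} for _ in range(10)
] + [{"group": "B", "decade": 1850, "tokens": ["tok"] * 600}]

toy_samples = draw_samples(toy_texts, n_samples=4, min_len=500)
```

With purely random draws over texts, author B would rarely be selected; the round-robin gives both authors the same number of samples in the toy data.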

Each sample contains metadata with information about the publication time, text type and fictionality status. Author and title are provided if available (more information: Metadata).

Annotation

The core corpus contains detailed annotation of instances of speech, thought and writing representation (STWR). We distinguish four main types: direct STWR (Er sagte: "Ich bin hungrig."), indirect STWR (Er sagte, er sei hungrig.), free indirect STWR (Wo sollte er jetzt etwas zu essen herbekommen?) and reported STWR (Er sprach über Restaurants.), as well as the main media speech, thought and writing. In addition to that, we annotate attributes like embedding level, non-factual STWR, borderline cases, pragmatic and metaphoric use, as well as frames, introductory expressions and speakers.
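
The categories named above can be pictured as a small data model. The following sketch is only a reading aid: the class and field names are hypothetical and do not reflect the actual file formats (see Annotation structure for those), and the medium assigned to each example sentence is an interpretation, not taken from the corpus.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class StwrType(Enum):
    DIRECT = "direct"
    INDIRECT = "indirect"
    FREE_INDIRECT = "free indirect"
    REPORTED = "reported"


class Medium(Enum):
    SPEECH = "speech"
    THOUGHT = "thought"
    WRITING = "writing"


@dataclass
class StwrInstance:
    """One annotated instance; field names are illustrative only."""
    text: str
    stwr_type: StwrType
    medium: Medium
    non_factual: bool = False
    borderline: bool = False


# The four example sentences from the paragraph above.
examples = [
    StwrInstance('Er sagte: "Ich bin hungrig."', StwrType.DIRECT, Medium.SPEECH),
    StwrInstance("Er sagte, er sei hungrig.", StwrType.INDIRECT, Medium.SPEECH),
    StwrInstance(
        "Wo sollte er jetzt etwas zu essen herbekommen?",
        StwrType.FREE_INDIRECT,
        Medium.THOUGHT,
    ),
    StwrInstance("Er sprach über Restaurants.", StwrType.REPORTED, Medium.SPEECH),
]

# Count instances per main type, as one might do for a corpus overview.
type_counts = Counter(instance.stwr_type for instance in examples)
```

Further attributes mentioned above, such as embedding level, frames, introductory expressions and speakers, are left out here to keep the sketch short.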

Each sample of the main corpus was annotated independently by two different people. The final annotation was created by a third person on the basis of those annotations. The underlying first annotations are also available (see primary annotations).

The detailed annotation guidelines are available at redewiedergabe.de/richtlinien/richtlinien.html or via DOI (in German).

An overview of the structure of the annotations is available at Annotation structure.