A historical German-language corpus (1840-1920) of fictional and non-fictional texts, annotated for speech, thought and writing representation
The corpus was created by the DFG-funded project "Redewiedergabe - eine literatur- und sprachwissenschaftliche Korpusanalyse" (Leibniz Institute for the German Language / University of Würzburg). Homepage: www.redewiedergabe.de
If you use the corpus, please cite the following publication:
Brunner, Annelen / Engelberg, Stefan / Jannidis, Fotis / Tu, Ngoc Duyen Tanja / Weimer, Lukas (2020): Corpus REDEWIEDERGABE, Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, pp. 796‑805.
If you encounter any issues or have any questions, please use the GitHub issue tracker.
Project Redewiedergabe also provides automatic taggers for German STWR, trained (mostly) on this corpus.
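| Corpus part | Texts | Tokens | STWR instances | Additional information |
|---|---|---|---|---|
| Main corpus | 838 | 489,459 | 12,123 | Detailed statistical data |
| Main corpus (Beta release) | 619 | 360,974 | 9,451 | Detailed statistical data; Differences to the final release |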
This is a collection of several types of additional annotated material produced by project Redewiedergabe. The material generally follows the same annotation guidelines and is available in the same formats as the main corpus, but it has some idiosyncrasies and underwent less quality control. For additional information about the different corpus parts, follow the links in the table.
| Corpus part | Texts | Tokens | STWR instances | Additional information |
|---|---|---|---|---|
| Single-annotated samples | 258 | 150,162 | 4,395 | Annotations only by a single annotator |
| Single-annotated full texts (fictional) | 18 | 235,493 | 6,232 | Annotations only by a single annotator; Note: Annotation guidelines differ slightly with respect to speaker |
| Single-annotated full texts (non-fictional) | 15 | 84,769 | 1,472 | Annotations only by a single annotator |
| Indirect full texts | 16 | 51,864 | 272 | Only instances of indirect STWR with a simplified annotation system |
| Free indirect full texts (fictional) | 142 | 2,647,924 | 2,136 | Only instances of free indirect STWR with a simplified annotation system; semi-automated annotation |
| Primary annotations of the core corpus | 1,704 | 989,384 | 27,297 | Collection of all individual annotations of the core corpus |
| KONVENS 2020 data | | | | Data splits used for the STWR taggers, as described in the KONVENS 2020 paper |
The corpus is available in three different formats:
The "Redewiedergabe" corpus was created by the DFG-funded project "Redewiedergabe. Eine literatur- und sprachwissenschaftliche Korpusanalyse", a cooperation between Leibniz-Institut für Deutsche Sprache, Mannheim (Abteilung Lexik) and Universität Würzburg (Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte).
In addition, the following people participated in the annotation: Sarah Gorke, Anna Hartmann, Janne Lorenzen, Christoph Peterek, Laura Schäfer, Lisa Sergel and Theresa Valta.
Project homepage: www.redewiedergabe.de
Most recent publication: Brunner, Annelen/Engelberg, Stefan/Jannidis, Fotis/Tu, Ngoc Duyen Tanja/Weimer, Lukas (2020): Corpus REDEWIEDERGABE, Proceedings of The 12th Language Resources and Evaluation Conference, Marseille.
A complete list of all publications can be found here.
The "Redewiedergabe" corpus is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please credit project "Redewiedergabe" for the annotation, and project TextGrid, Deutsches Textarchiv, Leibniz-Institut für Deutsche Sprache and Universitätsbibliothek Bremen for the texts.
The "Redewiedergabe" corpus is a historical corpus of fictional and non-fictional texts. These texts were published between 1840 and 1920 and were compiled from the following three sources:
- Narrative texts from the "Digitale Bibliothek", converted to TEI format by project TextGrid
- Texts from the magazine "Die Grenzboten", digitized by Universitätsbibliothek Bremen (Source: Die Grenzboten: Zeitschrift für Politik, Literatur und Kunst. Berlin: Dt. Verl, 1841-1922. Staats- und Universitätsbibliothek Bremen, Ac 7155 Public Domain Mark 1.0), TEI structuring by Deutsches Textarchiv and OCR correction by project "Redewiedergabe".
- Texts from the "Mannheimer Korpus Historischer Zeitungen und Zeitschriften" (MKHZ, Mannheim corpus of historical newspapers and magazines), collected by the Leibniz-Institut für Deutsche Sprache and converted by Deutsches Textarchiv.
The corpus does not consist of complete texts but of text samples. The sample length is at least 500 tokens for texts from the Digitale Bibliothek and at least 200 tokens for newspaper/magazine texts. The samples were drawn randomly from the available material, subject to the following additional rules: for the texts from the Digitale Bibliothek, material by each author was considered evenly within a decade. Likewise, for the texts from the Mannheimer Korpus Historischer Zeitungen und Zeitschriften (MKHZ), the different newspapers/magazines were considered evenly. This prevented authors or newspapers with little material from dropping out entirely during the sampling process.
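The balancing rule described above can be illustrated with a small sketch. This is not the project's actual sampling code; it is one possible reading of "considered evenly", implemented as a round-robin over groups. The function name `draw_balanced`, its parameters and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def draw_balanced(items, group_key, n, seed=0):
    """Draw up to n items at random, cycling over groups (e.g. authors)
    so that small groups are not crowded out by large ones.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Bucket the items by group, e.g. by author or by newspaper.
    groups = defaultdict(list)
    for item in items:
        groups[group_key(item)].append(item)

    # Shuffle inside each group so the pick within a group is random.
    for members in groups.values():
        rng.shuffle(members)

    # Visit the groups in a random but fixed order.
    order = sorted(groups)
    rng.shuffle(order)

    drawn = []
    # Take one item per group per round until n items are drawn
    # or all groups are exhausted.
    while len(drawn) < n and any(groups[g] for g in order):
        for g in order:
            if groups[g] and len(drawn) < n:
                drawn.append(groups[g].pop())
    return drawn


# Toy data for one decade: author "A" has 50 texts, author "B" only one.
texts = [("A", f"a{i}") for i in range(50)] + [("B", "b0")]
sample = draw_balanced(texts, group_key=lambda t: t[0], n=4)
```

With purely random sampling, the single text by author "B" would rarely be drawn; with the round-robin it is always part of the sample. Applying the same function per decade, or with the newspaper as the group key, would mirror the two rules stated above.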
Each sample contains metadata with information about the publication time, text type, fictionality status and author and title if available (more information: Metadata).
The corpus contains detailed annotation of instances of speech, thought and writing representation (STWR). We distinguish four main types: direct STWR (Er sagte: "Ich bin hungrig."), indirect STWR (Er sagte, er sei hungrig.), free indirect STWR (Wo sollte er jetzt etwas zu Essen herbekommen?) and reported STWR (Er sprach über Restaurants.), as well as the main media: speech, thought and writing. In addition, we annotate attributes such as embedding level, non-factual STWR, borderline cases, pragmatic and metaphoric use, as well as frames, introductory expressions and speakers.
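As a rough orientation, the two main annotation dimensions (type and medium) could be modelled as follows. This is only a sketch: the class and field names are hypothetical and do not reflect the corpus file formats, the `text` field holds the whole example sentence rather than the exact annotated span, and the medium assigned to each example is our own reading of the sentences above.

```python
from dataclasses import dataclass
from enum import Enum


class StwrType(Enum):
    """The four main STWR types named above (hypothetical identifiers)."""
    DIRECT = "direct"
    INDIRECT = "indirect"
    FREE_INDIRECT = "free indirect"
    REPORTED = "reported"


class Medium(Enum):
    """The main media named above (hypothetical identifiers)."""
    SPEECH = "speech"
    THOUGHT = "thought"
    WRITING = "writing"


@dataclass(frozen=True)
class StwrInstance:
    """One annotated instance, reduced to type and medium for illustration."""
    text: str
    stwr_type: StwrType
    medium: Medium


# The example sentences from the paragraph above; the media are assumed.
examples = [
    StwrInstance('Er sagte: "Ich bin hungrig."', StwrType.DIRECT, Medium.SPEECH),
    StwrInstance("Er sagte, er sei hungrig.", StwrType.INDIRECT, Medium.SPEECH),
    StwrInstance("Wo sollte er jetzt etwas zu Essen herbekommen?",
                 StwrType.FREE_INDIRECT, Medium.THOUGHT),
    StwrInstance("Er sprach über Restaurants.", StwrType.REPORTED, Medium.SPEECH),
]
```

The further attributes (embedding level, non-factual STWR, frames, speakers and so on) are left out here; the annotation guidelines remain the authoritative description.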
Each sample of the main corpus was annotated independently by two different people. The final annotation was created by a third person on the basis of those annotations. The underlying first annotations are also available under "additional material".
The detailed annotation guidelines are available at redewiedergabe.de/richtlinien/richtlinien.html (in German).
An overview of the structure of the annotations is available at Annotation structure.