Corpus "Rᴇᴅᴇᴡɪᴇᴅᴇʀɢᴀʙᴇ"

License: CC BY-NC-SA 4.0 DOI

A historical German-language corpus (1840-1920) of fictional and non-fictional texts, annotated for speech, thought and writing representation

The corpus was created by the DFG-funded project "Redewiedergabe - eine literatur- und sprachwissenschaftliche Korpusanalyse" (Leibniz Institute for the German Language / University of Würzburg). Homepage: www.redewiedergabe.de

Please cite the following publication if you use the corpus:

Brunner, Annelen / Engelberg, Stefan / Jannidis, Fotis / Tu, Ngoc Duyen Tanja / Weimer, Lukas (2020): Corpus REDEWIEDERGABE, Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, pp. 796‑805.

If you encounter any issues or have any questions, please use GitHub's issue tracker.

Project Redewiedergabe also provides automatic taggers for German STWR, trained (mostly) on this corpus.

Available Data

Core corpus

| Corpus part | Samples | Tokens | STWR instances | Notes |
|---|---|---|---|---|
| Main corpus | 838 | 489,459 | 12,123 | DOI; Detailed statistical data |
| Main corpus (Beta release) | 619 | 360,974 | 9,451 | DOI; Detailed statistical data; Differences to the final release |

Additional Material

This is a collection of several types of additional annotated material produced by project Redewiedergabe. The material generally follows the same annotation guidelines and is available in the same formats as the main corpus, but has some idiosyncrasies and has undergone less quality control. For additional information about the different corpus parts, follow the links in the table.

| Corpus part | Files | Tokens | STWR instances | Notes |
|---|---|---|---|---|
| Single-annotated samples | 258 | 150,162 | 4,395 | Annotations only by a single annotator |
| Single-annotated full texts (fictional) | 18 | 235,493 | 6,232 | Annotations only by a single annotator; Note: Annotation guidelines differ slightly with respect to speaker |
| Single-annotated full texts (non-fictional) | 15 | 84,769 | 1,472 | Annotations only by a single annotator |
| Indirect full texts | 16 | 51,864 | 272 | Only instances of indirect STWR with a simplified annotation system |
| Free indirect full texts (fictional) | 142 | 2,647,924 | 2,136 | Only instances of free indirect STWR with a simplified annotation system; semi-automated annotation |
| Primary annotations of the core corpus | 1,704 | 989,384 | 27,297 | Collection of all individual annotations of the core corpus |
| KONVENS 2020 data | | | | Data splits used for the STWR taggers, as described in the KONVENS 2020 paper |

Format

The corpus is available in three different formats:

NOTE: The XMI files are compatible with the free annotation tool ATHEN (developed by Markus Krug in the Kallimachos project) and its STWR view (developed by Tanja Tu).

Project

The "Redewiedergabe" corpus was created by the DFG-funded project "Redewiedergabe. Eine literatur- und sprachwissenschaftliche Korpusanalyse", a cooperation between Leibniz-Institut für Deutsche Sprache, Mannheim (Abteilung Lexik) and Universität Würzburg (Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte).

Project members: Annelen Brunner (IDS Mannheim), Stefan Engelberg (IDS Mannheim), Fotis Jannidis (Universität Würzburg), Ngoc Duyen Tanja Tu (IDS Mannheim), Lukas Weimer (Universität Würzburg).

In addition, the following people participated in the annotation: Sarah Gorke, Anna Hartmann, Janne Lorenzen, Christoph Peterek, Laura Schäfer, Lisa Sergel and Theresa Valta.

Project homepage: www.redewiedergabe.de

Publications

Most recent publication: Brunner, Annelen/Engelberg, Stefan/Jannidis, Fotis/Tu, Ngoc Duyen Tanja/Weimer, Lukas (2020): Corpus REDEWIEDERGABE, Proceedings of The 12th Language Resources and Evaluation Conference, Marseille.

A complete list of all publications can be found here.

License

The "Redewiedergabe" corpus is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

We ask you to mention project "Redewiedergabe" regarding the annotation, and project TextGrid, Deutsches Textarchiv, Leibniz-Institut für Deutsche Sprache and Universitätsbibliothek Bremen regarding the texts.

Text sources

The "Redewiedergabe" corpus is a historical corpus of fictional and non-fictional texts. These texts were published between 1840 and 1920 and were compiled from the following three sources:

  • Narrative texts from the "Digitale Bibliothek", converted to TEI format by project TextGrid
  • Texts from the magazine "Die Grenzboten", digitized by Universitätsbibliothek Bremen (Source: Die Grenzboten: Zeitschrift für Politik, Literatur und Kunst. Berlin: Dt. Verl, 1841-1922. Staats- und Universitätsbibliothek Bremen, Ac 7155 Public Domain Mark 1.0), TEI structuring by Deutsches Textarchiv and OCR correction by project "Redewiedergabe".
  • Texts from the "Mannheimer Korpus Historischer Zeitungen und Zeitschriften" (MKHZ, Mannheim corpus of historical newspapers and magazines), collected by the Leibniz-Institut für Deutsche Sprache and converted by Deutsches Textarchiv.

The corpus does not consist of complete texts but of text samples. The sample length is at least 500 tokens for texts from the Digitale Bibliothek and at least 200 tokens for newspaper/magazine texts. The samples were drawn randomly from the available material, subject to the following additional rules: for the texts from the Digitale Bibliothek, material by each author was considered evenly within a decade. Likewise, for the texts from the Mannheimer Korpus Historischer Zeitungen und Zeitschriften (MKHZ), the different newspapers/magazines were considered evenly. This prevented authors or newspapers with little material from dropping out entirely during the sampling process.
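The sampling rule for the Digitale Bibliothek texts can be illustrated with a minimal Python sketch. This is not the project's actual sampling code: the function names, the dictionary keys (`author`, `year`, `tokens`), the round-robin strategy and the fixed window length are all assumptions made for illustration. It only shows one way to treat authors evenly within a decade, so that an author with a single text is not crowded out by a prolific one.

```python
import random
from collections import defaultdict


def decade_of(year):
    """Map a publication year to the start of its decade, e.g. 1857 -> 1850."""
    return year - year % 10


def cut_window(text, length, rng):
    """Cut a random contiguous window of exactly `length` tokens (a simplification:
    the corpus only states a minimum sample length)."""
    tokens = text["tokens"]
    start = rng.randrange(len(tokens) - length + 1)
    return {
        "author": text["author"],
        "year": text["year"],
        "tokens": tokens[start:start + length],
    }


def draw_samples(texts, per_decade, min_tokens=500, seed=0):
    """Draw `per_decade` samples per decade, cycling over authors in random order.

    `texts` is a list of dicts with the hypothetical keys 'author', 'year', 'tokens'.
    Texts shorter than `min_tokens` are ignored.
    """
    rng = random.Random(seed)
    by_decade = defaultdict(lambda: defaultdict(list))
    for text in texts:
        if len(text["tokens"]) >= min_tokens:
            by_decade[decade_of(text["year"])][text["author"]].append(text)

    samples = []
    for decade in sorted(by_decade):
        authors = by_decade[decade]
        order = sorted(authors)
        rng.shuffle(order)
        # Round-robin over authors, so each author is drawn (almost) equally often
        # regardless of how many texts they contributed.
        for i in range(per_decade):
            author = order[i % len(order)]
            text = rng.choice(authors[author])
            samples.append(cut_window(text, min_tokens, rng))
    return samples
```

With ten texts by one author and a single text by another in the same decade, drawing four samples yields two samples per author. A purely random draw over texts would favor the prolific author.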

Each sample contains metadata with information about the publication time, text type and fictionality status, as well as author and title if available (more information: Metadata).

Annotation

The corpus contains detailed annotation of instances of speech, thought and writing representation (STWR). We distinguish four main types: direct STWR (Er sagte: "Ich bin hungrig."), indirect STWR (Er sagte, er sei hungrig.), free indirect STWR (Wo sollte er jetzt etwas zu Essen herbekommen?) and reported STWR (Er sprach über Restaurants.), as well as three media: speech, thought and writing. In addition, we annotate attributes such as embedding level, non-factual STWR, borderline cases, pragmatic and metaphoric use, as well as frames, introductory expressions and speakers.
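To make the categories above concrete, the following Python sketch models an STWR instance as a small data structure. It is an illustration only: the class and field names are hypothetical and do not reflect the attribute names used in the corpus files, and the default for the embedding level is an assumption (see the annotation guidelines and the annotation structure overview for the actual scheme).

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class StwrType(Enum):
    DIRECT = "direct"
    INDIRECT = "indirect"
    FREE_INDIRECT = "free indirect"
    REPORTED = "reported"


class Medium(Enum):
    SPEECH = "speech"
    THOUGHT = "thought"
    WRITING = "writing"


@dataclass(frozen=True)
class StwrInstance:
    """One annotated instance; all field names are hypothetical."""
    text: str
    stwr_type: StwrType
    medium: Medium
    embedding_level: int = 1   # assumption: 1 means "not embedded in another instance"
    non_factual: bool = False
    borderline: bool = False


# The two "sagte" examples from the paragraph above, both representing speech.
examples = [
    StwrInstance('Er sagte: "Ich bin hungrig."', StwrType.DIRECT, Medium.SPEECH),
    StwrInstance("Er sagte, er sei hungrig.", StwrType.INDIRECT, Medium.SPEECH),
]

# Count instances per (type, medium) combination.
counts = Counter((e.stwr_type, e.medium) for e in examples)
```

Keeping type and medium as separate fields mirrors the description above, where any of the four types can in principle combine with any of the three media.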

Each sample of the main corpus was annotated independently by two different people. The final annotation was created by a third person on the basis of those annotations. The underlying first annotations are also available under "additional material".

The detailed annotation guidelines are available at redewiedergabe.de/richtlinien/richtlinien.html (in German). DOI

An overview of the structure of the annotations is available at Annotation structure.
