rstodden/DEPlain

DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification

To advance sentence simplification and document simplification in German, we present DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German ("plain DE" or in German: "Einfache Sprache").

More details can be found in our paper: Stodden, Momen, Kallmeyer (2023). "DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification." In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics.

Contributions

Our paper makes the following contributions. A more detailed description and the resources for each contribution can be found in the corresponding subdirectories:

  1. A web harvester to download parallel documents in standard German and plain German,
  2. Two document simplification datasets,
  3. Sentence-wise alignments (created manually with TS-ANNO and automatically with several alignment algorithms),
  4. A simplification plan per document, based on the manual sentence-wise alignments,
  5. Four sentence simplification datasets,
  6. Human annotations of the manually aligned sentence pairs,
  7. Automatic text simplification models for document simplification and sentence simplification.

The following figure shows how the contributions made in our paper are connected. The document-level corpora (B) and the sentence-level corpora (E) are used for training and evaluating the automatic text simplification models (F).

Figure: Process of building the DEplain corpus

Corpora Statistics

Metadata of the resulting subcorpora are shown in the table below:

|   | Name | License | # Doc. Pairs (train/dev/test) | # Original Sents. | # Simple Sents. | Alignment | # Sent. Pairs (train/dev/test) | Corpus Name Doc. | Corpus Name Sent. |
|---|------|---------|-------------------------------|-------------------|-----------------|-----------|--------------------------------|------------------|-------------------|
| 1 | DEplain-apa | upon request | 483 (387/48/48) | 25,607 | 26,471 | manual | 13,122 (10,660/1,231/1,231) | DEplain-APA-doc | DEplain-APA-sent |
| 2 | DEplain-web | open | 147 (-/-/147) | 6,138 | 6,402 | manual | 1,846 (-/-/1,846) | DEplain-web-doc-manual-open | DEplain-web-sent-manual-open |
| 3 | DEplain-web | open | 249 (199/50/-) | 7,087 | 7,760 | auto | 652 (514/138/-) | DEplain-web-doc-auto-open | DEplain-web-sent-auto-open |
| 4 | DEplain-web | closed | 360 (288/72/-) | 12,847 | 18,068 | auto | 942 (767/175/-) | DEplain-web-doc-auto-closed | DEplain-web-sent-auto-closed |
|   | In total | mixed | 1,239 (874/170/195) | 51,681 | 58,701 | mixed | 16,562 (11,941/1,544/3,077) | | |
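
As a quick sanity check on the train/dev/test figures, the per-split counts in the table above can be summed per subcorpus and across subcorpora. The following is a minimal sketch, not part of the repository: the numbers are copied by hand from the table, and the names `ROWS`, `parse_cell`, and `column_totals` are hypothetical helpers introduced only for this illustration.

```python
import re

# Cells copied from the table above: (document pairs, sentence pairs) per subcorpus.
# The dictionary keys are informal labels chosen for this sketch.
ROWS = {
    "DEplain-apa (manual)": ("483 (387/48/48)", "13,122 (10,660/1,231/1,231)"),
    "DEplain-web open (manual)": ("147 (-/-/147)", "1,846 (-/-/1,846)"),
    "DEplain-web open (auto)": ("249 (199/50/-)", "652 (514/138/-)"),
    "DEplain-web closed (auto)": ("360 (288/72/-)", "942 (767/175/-)"),
}


def parse_cell(cell):
    """Parse 'total (train/dev/test)' into (total, [train, dev, test]).

    A '-' entry means the split does not exist and is counted as 0.
    """
    match = re.fullmatch(r"([\d,]+) \(([^)]*)\)", cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    total = int(match.group(1).replace(",", ""))
    splits = [
        0 if part == "-" else int(part.replace(",", ""))
        for part in match.group(2).split("/")
    ]
    return total, splits


def column_totals(column):
    """Sum totals and splits over all rows; column 0 = documents, 1 = sentence pairs."""
    grand_total, grand_splits = 0, [0, 0, 0]
    for name, cells in ROWS.items():
        total, splits = parse_cell(cells[column])
        if sum(splits) != total:
            raise ValueError(f"splits of {name} do not add up to {total}")
        grand_total += total
        grand_splits = [a + b for a, b in zip(grand_splits, splits)]
    return grand_total, grand_splits


if __name__ == "__main__":
    print(column_totals(0))  # (1239, [874, 170, 195])
    print(column_totals(1))  # (16562, [11941, 1544, 3077])
```

Both results match the "In total" row for document pairs and sentence pairs.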

Data Availability

Document Simplification

Please check ./B__Document-level_Corpus for information on how to access our document simplification corpora (DEplain-APA-doc and DEplain-web-doc). For DEplain-APA, please request access via the DEplain-APA Zenodo repository. The documents of DEplain-web with open licenses are provided here; the documents with closed licenses can be downloaded using the web crawler.

Sentence Simplification

Please check ./E__Sentence-level_Corpus for information on how to access our sentence simplification corpora (DEplain-APA-sent and DEplain-web-sent). For DEplain-APA, please request access via the DEplain-APA Zenodo repository. The manually aligned sentence pairs of DEplain-web and the automatically aligned sentence pairs with an open license can be downloaded directly from the repository. If you have downloaded the documents of DEplain-web with a closed license, you can align them automatically using one of the provided alignment algorithms.

Reproduction of Results of the Paper

Reproduction of Automatic Sentence Alignment

For reproduction of our experiments regarding automatic sentence-wise alignment, please see ./C__Alignment Algorithms.
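
For readers unfamiliar with the task, the sketch below illustrates what sentence-wise alignment between an original and a simplified document means. It is a toy baseline based on token overlap (Jaccard similarity) and is not one of the alignment algorithms provided in this repository; the function names, the threshold value, and the example sentences are assumptions made for illustration only.

```python
def tokens(sentence):
    """Return the set of lowercased words, with surrounding punctuation stripped."""
    words = {word.strip(".,;:!?\"'()").lower() for word in sentence.split()}
    return words - {""}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(original_sents, simple_sents, threshold=0.3):
    """Greedy toy alignment: link each simple sentence to its most similar original.

    Returns a list of (original_index, simple_index, score) tuples. Simple
    sentences whose best score is below `threshold` stay unaligned.
    """
    original_tokens = [tokens(s) for s in original_sents]
    pairs = []
    for j, simple in enumerate(simple_sents):
        simple_tokens = tokens(simple)
        scores = [jaccard(t, simple_tokens) for t in original_tokens]
        if not scores:
            continue  # no original sentences to align to
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((best, j, scores[best]))
    return pairs


if __name__ == "__main__":
    original = [
        "Der Bundestag hat gestern ein neues Gesetz beschlossen.",
        "Die Opposition kritisierte das Vorhaben scharf.",
    ]
    simple = [
        "Der Bundestag hat ein neues Gesetz beschlossen.",
        "Das Wetter ist heute schön.",
    ]
    for i, j, score in align(original, simple):
        print(f"original[{i}] <-> simple[{j}] (score {score:.2f})")
```

In this example only the first simple sentence is linked to an original sentence; the second shares too few words with either original and remains unaligned. Real alignment methods also have to handle sentence splits, merges, and paraphrases, which a lexical overlap baseline like this one does not.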

Reproduction of Automatic Text Simplification

For reproduction of our experiments regarding automatic document simplification and sentence simplification, please see ./G__Automatic_Text_Simplification_Experiments.

License

Different parts of this work are released under different licenses. Please see the corresponding subdirectory for the license that applies to each contribution.

Citation

If you use part of this work, please cite our paper:

@inproceedings{stodden-etal-2023-deplain,
    title = "{DE}plain: A {G}erman Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification",
    author = "Stodden, Regina and Momen, Omar and Kallmeyer, Laura",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.908",
    doi = "10.18653/v1/2023.acl-long.908",
    pages = "16441--16463",
}

Contact

Feel free to contact Regina Stodden if you have any comments or problems with the provided materials.