DEPlain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification

To advance sentence simplification and document simplification in German, we present DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German ("plain DE" or in German: "Einfache Sprache").

More details can be found in our paper: Stodden, Momen, Kallmeyer (2023). "DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification." In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics.

Contributions

Our paper makes the following contributions. A more detailed description and the resources for each contribution can be found in the corresponding subdirectories:

A web harvester to download and harvest parallel documents with standard German and plain German,
Two document simplification datasets,
Sentence-wise alignments (manual, using TS-ANNO, and automatic, using several alignment algorithms),
A simplification plan per document based on the manual sentence-wise alignments,
Four sentence simplification datasets,
Human annotations on the manually aligned sentence pairs,
Automatic text simplification models for document simplification and sentence simplification.

The figure building_process_of_deplain.svg shows the connection between the contributions made in our paper. The document-level corpora (B) and the sentence-level corpora (E) are used for training and evaluating the automatic text simplification models (F).

Corpora Statistics

Metadata of the resulting subcorpora are shown in the table below:

	Name	License	# Doc. Pairs (train/dev/test)	# Original Sents.	# Simple Sents.	Alignment	# Sent. Pairs (train/dev/test)	Corpus Name Doc.	Corpus Name Sent.
1	DEplain-apa	upon request	483 (387/48/48)	25,607	26,471	manual	13,122 (10,660/1,231/1,231)	DEplain-APA-doc	DEplain-APA-sent
2	DEplain-web	open	147 (-/-/147)	6,138	6,402	manual	1,846 (-/-/1,846)	DEplain-web-doc-manual-open	DEplain-web-sent-manual-open
3	DEplain-web	open	249 (199/50/-)	7,087	7,760	auto	652 (514/138/-)	DEplain-web-doc-auto-open	DEplain-web-sent-auto-open
4	DEplain-web	closed	360 (288/72/-)	12,847	18,068	auto	942 (767/175/-)	DEplain-web-doc-auto-closed	DEplain-web-sent-auto-closed
	In total	mixed	1,239 (874/170/195)	51,681	58,701	mixed	16,562 (11,941/1,544/3,077)
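
The split sizes in the table can be cross-checked with a few lines of Python. This is only a sketch for readers who want to verify that their downloaded splits match the reported counts: the numbers are copied from the table above, a "-" (no such split) is written as 0, and the function names `check_splits` and `column_totals` are illustrative, not part of the repository.

```python
# Split sizes copied from the table above: name -> (total, train, dev, test).
# A "-" in the table (no such split) is written as 0 here.
DOC_PAIRS = {
    "DEplain-APA-doc": (483, 387, 48, 48),
    "DEplain-web-doc-manual-open": (147, 0, 0, 147),
    "DEplain-web-doc-auto-open": (249, 199, 50, 0),
    "DEplain-web-doc-auto-closed": (360, 288, 72, 0),
}
SENT_PAIRS = {
    "DEplain-APA-sent": (13122, 10660, 1231, 1231),
    "DEplain-web-sent-manual-open": (1846, 0, 0, 1846),
    "DEplain-web-sent-auto-open": (652, 514, 138, 0),
    "DEplain-web-sent-auto-closed": (942, 767, 175, 0),
}


def check_splits(rows):
    """Return True if train + dev + test equals the total in every row."""
    return all(
        total == train + dev + test
        for total, train, dev, test in rows.values()
    )


def column_totals(rows):
    """Sum each column (total, train, dev, test) over all subcorpora."""
    return tuple(sum(col) for col in zip(*rows.values()))


if __name__ == "__main__":
    print(check_splits(DOC_PAIRS), column_totals(DOC_PAIRS))
    print(check_splits(SENT_PAIRS), column_totals(SENT_PAIRS))
```

The column totals computed this way can be compared against the "In total" row for document pairs and sentence pairs.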

Data Availability

Document Simplification

Please check ./B__Document-level_Corpus for information on how to access our document simplification corpora (DEplain-APA-doc and DEplain-web-doc). For DEplain-APA, please request access via the DEplain-APA Zenodo repository. The documents of DEplain-web with open licenses are provided here; the documents with closed licenses can be downloaded using the web harvester.

Sentence Simplification

Please check ./E__Sentence-level_Corpus for information on how to access our sentence simplification corpora (DEplain-APA-sent and DEplain-web-sent). For DEplain-APA, please request access via the DEplain-APA Zenodo repository. The manually aligned sentence pairs of DEplain-web and the automatically aligned sentence pairs with an open license can be downloaded directly from the repository. If you have downloaded the documents of DEplain-web with a closed license, you can align them automatically using one of the provided alignment algorithms.
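
Once the sentence pairs are downloaded, they could be read roughly as follows. This is a minimal sketch, not the repository's own loader: it assumes the pairs are stored as CSV with a header row, and the column names `original` and `simplification` are assumptions that should be checked against the actual files in ./E__Sentence-level_Corpus. The sample sentences are made up for illustration.

```python
import csv
import io


def read_sentence_pairs(handle, src_col="original", tgt_col="simplification"):
    """Yield (standard German, plain German) pairs from a CSV file handle.

    src_col and tgt_col are assumed column names, not confirmed by this README.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        src, tgt = row[src_col].strip(), row[tgt_col].strip()
        if src and tgt:  # skip rows where one side is empty
            yield src, tgt


# Tiny in-memory stand-in for a downloaded file (made-up example sentences).
SAMPLE = io.StringIO(
    "original,simplification\n"
    "Die Veranstaltung findet aufgrund der Witterung nicht statt.,"
    "Die Veranstaltung fällt aus.\n"
    "Nur ein Satz ohne Gegenstück.,\n"
)
PAIRS = list(read_sentence_pairs(SAMPLE))
```

For a real file, the same function can be called with a handle from `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded file.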

Reproduction of Results of the Paper

Reproduction of Automatic Sentence Alignment

To reproduce our experiments on automatic sentence-wise alignment, please see ./C__Alignment_Algorithms.

Reproduction of Automatic Text Simplification

To reproduce our experiments on automatic document simplification and sentence simplification, please see ./G__Automatic_Text_Simplification_Experiments.

License

Different parts of this work are released under different licenses. Please see the corresponding subdirectory for the license of each contribution.

Citation

If you use part of this work, please cite our paper:

@inproceedings{stodden-etal-2023-deplain,
    title = "{DE}plain: A {G}erman Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification",
    author = "Stodden, Regina and Momen, Omar and Kallmeyer, Laura",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.908",
    doi = "10.18653/v1/2023.acl-long.908",
    pages = "16441--16463",
}

Contact

Feel free to contact Regina Stodden if you have any comments or problems with the provided materials.
