Repository for the CommonLit Ease of Readability Corpus
This repository contains the CommonLit Ease of Readability (CLEAR) corpus, which provides unique readability scores for ~5,000 text excerpts leveled for 3rd-12th grade readers, along with each excerpt’s year of publication, genre, and other metadata. The CLEAR corpus is meant to give researchers interested in discourse processing and reading a resource with which to develop and test readability metrics and to model text readability. It improves on previous readability corpora in three ways: size (N = ~5,000 reading excerpts); breadth, with excerpts covering over 250 years of writing in two genres; and a unique readability criterion for each text, based on teachers’ ratings of text difficulty for student readers.
Two papers on the corpus are listed below.
Crossley, S. A., Heintz, A., Choi, J., Batchelor, J., Karimi, M., & Malatinszky, A. (in press). A large-scaled corpus for assessing text readability. Behavior Research Methods.
Crossley2022_Article_ALarge-scaledCorpusForAssessin.pdf
Crossley, S. A., Heintz, A., Choi, J., Batchelor, J., & Karimi, M. (2021). The CommonLit Ease of Readability (CLEAR) Corpus. Proceedings of the 14th International Conference on Educational Data Mining (EDM). Paris, France.
The data is provided under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).