The English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) Corpus
The English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) Corpus is a freely available corpus of ~6,500 ELL writing samples that have been scored for overall holistic language proficiency and for analytic proficiency in cohesion, syntax, vocabulary, phraseology, grammar, and conventions. In addition, the ELLIPSE corpus provides individual and demographic information for the ELL writers in the corpus, including economic status, gender, grade level (8-12), and race/ethnicity. The corpus provides language proficiency scores for individual writers and was developed to advance research in corpus and NLP approaches to assessing both overall proficiency and more fine-grained features of proficiency.
This repository contains the corpus (including text and average scores for reliable texts) and the scoring rubric for the final ELLIPSE corpus. The corpus is split into two files:
ELLIPSE_Final_github_train.csv, which contains all the training data and metadata.
ELLIPSE_Final_github_test.zip, which is a password-protected zip file that contains all the test data and metadata.
NOTE The password for the zip file is ellipse_test. You may need specific software to decrypt the zip file, such as 7-Zip for Windows or Keka for Mac.
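The files above could be read programmatically along the following lines. This is a minimal sketch, not part of the repository: the helper names `extract_protected_zip` and `read_rows` are hypothetical, the UTF-8 encoding is an assumption, and the standard-library `zipfile` module can only decrypt legacy ZipCrypto archives. If the archive uses a different encryption scheme, extraction will likely fail and a tool such as 7-Zip or Keka is needed instead.

```python
import csv
import zipfile
from pathlib import Path


def extract_protected_zip(zip_path, password, out_dir="."):
    """Extract all members of a password-protected zip; return the extracted paths.

    Assumption: the archive uses legacy ZipCrypto encryption, which is the only
    scheme the standard-library zipfile module can decrypt.
    """
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile expects the password as bytes, not str.
        archive.extractall(path=out_dir, pwd=password.encode("utf-8"))
        return [out_dir / name for name in archive.namelist()]


def read_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by its header row.

    Assumption: the file is UTF-8 encoded ("utf-8-sig" also tolerates a BOM).
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


# Example usage (paths are relative to a local clone of the repository):
#   train_rows = read_rows("ELLIPSE_Final_github_train.csv")
#   print(len(train_rows), "training essays")
#   print(list(train_rows[0].keys()))  # inspect the actual column names
#   test_files = extract_protected_zip(
#       "ELLIPSE_Final_github_test.zip", "ellipse_test", "ellipse_test_data"
#   )
```

Printing the header keys first avoids hard-coding column names, which are not listed here.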
A second file contains the entire corpus (~9,000 essays) and the individual ratings for each essay. Many of these essays were not included in the final corpus because they were found to be unreliable at the text or rater level.
This file is ellipsis_raw_rater_scores_anon_all_essay.zip, which is also a password-protected zip file.
NOTE The password for the zip file is ellipse_raw_data. You may need specific software to decrypt the zip file, such as 7-Zip for Windows or Keka for Mac.
The paper was published in 2023. The citation for the work is:
Crossley, S. A., Tian, Y., Baffour, P., Franklin, A., Kim, Y., Morris, W., Benner, B., Picou, A., & Boser, U. (2023). Measuring second language proficiency using the English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) Corpus. International Journal of Learner Corpus Research, 9(2), 248-269.
A pre-print of the paper that is freely accessible to all can be found here.
The data is provided under a CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) license (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).