This repository contains the code for the paper From Automation to Augmentation: Large Language Models Elevating the Essay Scoring Landscape. The dataset and other resources will be released after the anonymity period.
In this study, we investigate the effectiveness of LLMs, specifically GPT-4 and fine-tuned GPT-3.5, as tools for Automated Essay Scoring (AES). Our comprehensive set of experiments, conducted on both public and private datasets, highlights the remarkable advantages of LLM-based AES systems: superior accuracy, consistency, generalizability, and interpretability, with fine-tuned GPT-3.5 surpassing traditional grading models.
Additionally, we undertake LLM-assisted human evaluation experiments involving both novice and expert graders. One pivotal discovery is that LLMs not only automate the grading process but also enhance the performance of human graders. Novice graders, when provided with LLM-generated feedback, achieve a level of accuracy on par with experts, while experts become more efficient and maintain greater consistency in their assessments.
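To illustrate the general shape of rubric-conditioned LLM grading, here is a minimal sketch of building a multi-dimensional grading prompt and parsing scores from a model reply. The rubric dimensions, prompt wording, and reply format below are hypothetical illustrations, not the paper's actual criteria or prompts, and no API call is made.

```python
import re

# Hypothetical rubric dimensions for illustration only -- the paper's
# actual multi-dimensional grading criteria are defined by expert educators.
RUBRIC = {
    "content": "Relevance and depth of ideas (0-10)",
    "organization": "Logical structure and coherence (0-10)",
    "language": "Grammar, vocabulary, and style (0-10)",
}

def build_grading_prompt(essay: str) -> str:
    """Assemble a rubric-conditioned grading prompt for a chat LLM."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    return (
        "You are an expert essay grader. Score the essay below on each "
        f"criterion, then give brief feedback.\n\nCriteria:\n{criteria}\n\n"
        "Reply with lines of the form 'criterion: score'.\n\n"
        f"Essay:\n{essay}"
    )

def parse_scores(reply: str) -> dict:
    """Extract 'criterion: score' pairs from a model reply."""
    scores = {}
    for name in RUBRIC:
        m = re.search(rf"{name}\s*:\s*(\d+)", reply, re.IGNORECASE)
        if m:
            scores[name] = int(m.group(1))
    return scores

# Parsing a mocked model reply (no LLM is actually queried here).
reply = "content: 8\norganization: 7\nlanguage: 9\nFeedback: strong thesis."
print(parse_scores(reply))  # {'content': 8, 'organization': 7, 'language': 9}
```

In practice the prompt string would be sent to a chat completion endpoint and the reply parsed as above; per-dimension feedback text could be passed on to human graders.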
Contributions:

- We pioneer the exploration of LLMs' capabilities as AES systems, especially in intricate scenarios with tailored grading criteria.
- We introduce a substantial essay-scoring dataset, comprising 6,559 essays written by Chinese high school students, along with multi-dimensional scores provided by expert educators.
- Our findings from the LLM-assisted human evaluation experiments underscore the potential of LLM-generated feedback to elevate individuals with limited domain knowledge to a level comparable to experts.