Source code for the EMNLP 2023 paper entitled "Doolittle: Benchmarks and Corpora for Academic Writing Formalization" by Shizhe Diao et al.
Improving the quality of academic writing is a meaningful but challenging task.
Conventional methods of language refinement focus on narrow, specific linguistic features within isolated sentences, such as grammatical errors and improper word use.
We propose a more general task, Academic Writing Formalization (AWF), to improve the overall quality of formal academic writing at the paragraph level.
We formulate this language refinement task as a formal text style transfer task that transfers informal-academic text to formal-academic text, and we contribute a large-scale non-parallel dataset, Doolittle, for this purpose.
Doolittle is a large-scale non-parallel dataset for the AWF task.
It contains 13.0K informal-academic and 55.6K formal-academic training paragraphs, plus 465 dev and 415 test paragraphs for each of the two domains.
Please request access to the Doolittle dataset by filling in this form; we will send you the download link via email.
Then place the full dataset under the AWF-dataset/ folder.
The files are organized as follows:
Description | File Name | #Paragraphs | Parallel |
---|---|---|---|
Informal-academic train set | paragraph_native_train.0 | 13.0K | No |
Formal-academic train set | paragraph_native_train.1 | 55.6K | No |
Informal-academic dev set | paragraph_native_dev.0 | 465 | Yes |
Formal-academic dev set | paragraph_native_dev.1 | 465 | Yes |
Informal-academic test set | paragraph_native_test.0 | 415 | Yes |
Formal-academic test set | paragraph_native_test.1 | 415 | Yes |
Informal-academic dev set for MORL Training | dev.0.csv | 465 | No |
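
The file names in the table suggest a simple way to load the data. The following is a minimal sketch, not part of the released code. It assumes that each file stores one paragraph per line and that, in the parallel dev and test splits, line i of the `.0` file corresponds to line i of the `.1` file. Neither assumption is confirmed by this README, and the helper names `load_paragraphs` and `load_parallel_split` are hypothetical.

```python
from pathlib import Path


def load_paragraphs(path):
    """Read paragraphs from a file, assuming one paragraph per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_parallel_split(data_dir, split):
    """Pair informal (.0) and formal (.1) paragraphs of a parallel split.

    `split` is "dev" or "test". Line-by-line alignment between the two
    files is an assumption, so a length mismatch raises an error.
    """
    data_dir = Path(data_dir)
    informal = load_paragraphs(data_dir / f"paragraph_native_{split}.0")
    formal = load_paragraphs(data_dir / f"paragraph_native_{split}.1")
    if len(informal) != len(formal):
        raise ValueError(
            f"{split}: {len(informal)} informal vs {len(formal)} formal paragraphs"
        )
    return list(zip(informal, formal))
```

For example, `load_parallel_split("AWF-dataset", "dev")` would return a list of (informal, formal) pairs. The train files are non-parallel, so they should be read separately with `load_paragraphs` rather than paired.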
To address our task with reduced cost and better performance, we propose a method called Metric-Oriented Reinforcement Learning (MORL).
This methodology, inspired by Reinforcement Learning from Human Feedback (RLHF), follows a three-step training process:
Step 1: Train a policy model (usually a PLM) that can meet the requirements of a task.
Step 2: Select metrics that can accurately evaluate how well the task has been performed. Build a reward model that scores a given policy model’s output with a scalar.
Step 3: Optimize the policy against the reward model using reinforcement learning with the proximal policy optimization (PPO) algorithm.
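
For reference, Step 3 can be written down with the standard clipped PPO surrogate objective. The first two lines below are the usual PPO definitions. The third line is the KL-penalized reward commonly used in RLHF-style setups; whether MORL uses exactly this penalty is an assumption here, not something this README states. In this notation, $\pi_\theta$ is the policy being tuned, $\pi_{\text{ref}}$ is the frozen Step 1 policy, $R_\phi$ is the Step 2 reward model, $\hat{A}_t$ is an advantage estimate, and $\epsilon$ and $\beta$ are hyperparameters.

```latex
\begin{aligned}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \\
L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right], \\
R(x, y) &= R_\phi(x, y) - \beta\,\log\frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}.
\end{aligned}
```

Here $x$ is an informal-academic input paragraph and $y$ is the rewrite generated by the policy.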
In our work, we chose Galactica-1.3B and BART-Large as the two backbone policy models for their inherent capability in solving academic-related grammatical error correction (GEC) tasks.
We then used MORL to tune these two models against four automatic metrics: Transfer Accuracy (ACC), Perplexity (PPL), Semantic Similarity (SIM), and BART Score (BARTS).
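
To make the idea of scoring an output with a scalar concrete, here is a minimal sketch of how four metric values could be folded into one reward. It is an illustration only: the function names, the equal default weights, the `ppl_cap` normalization, and the use of `exp` on the BART Score are all assumptions, not the combination used in the paper or in the notebooks. The perplexity helper uses the standard definition (the exponential of the negative mean token log-probability).

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Standard perplexity: exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def combine_metrics(acc, ppl, sim, barts,
                    weights=(1.0, 1.0, 1.0, 1.0), ppl_cap=100.0):
    """Hypothetical scalar reward in [0, 1] built from four metrics.

    acc   : transfer accuracy in [0, 1], higher is better
    ppl   : perplexity, lower is better (clipped at the assumed ppl_cap)
    sim   : semantic similarity in [0, 1], higher is better
    barts : BART Score as an average log-likelihood (<= 0), higher is better
    """
    # Map perplexity to a "higher is better" fluency score in [0, 1].
    fluency = 1.0 - min(ppl, ppl_cap) / ppl_cap
    # Map the log-likelihood style BART Score into (0, 1].
    barts_unit = math.exp(barts)
    w_acc, w_ppl, w_sim, w_barts = weights
    total_weight = w_acc + w_ppl + w_sim + w_barts
    weighted = (w_acc * acc + w_ppl * fluency
                + w_sim * sim + w_barts * barts_unit)
    return weighted / total_weight
```

A weighted average keeps the reward bounded, which tends to make PPO training easier to tune, but other combinations (for example a product or a learned reward model) are equally plausible.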
The code is implemented with reference to many other repositories, which are listed below:
Code Implementation | Link to Reference Repository |
---|---|
Automatic Metrics: PPL | https://github.com/huggingface/evaluate |
Automatic Metrics: SIM | https://github.com/martiansideofthemoon/style-transfer-paraphrase.git |
Automatic Metrics: BARTScore | https://github.com/neulab/BARTScore |
Reinforcement Learning Algorithm: PPO | https://github.com/huggingface/trl |
We provide two notebooks as MORL tuning examples: MORL-BARTLarge.ipynb tunes a BART-Large model, and MORL-Galactica.ipynb tunes a Galactica-1.3B model.
Python = 3.10
CUDA = 11.7
Ubuntu = 20.04
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
cd trl
pip install -e .
Galactica-1.3B and BART-Large are the two policy models used in our work. This repository provides neither the tuned policy models nor the scripts to fine-tune them; please refer to our paper and the Hugging Face Transformers tutorial to train your own policy models on the Doolittle dataset.
For help or issues using MORL, please submit a GitHub issue.
For personal communication related to the Doolittle dataset and MORL, please contact Shizhe Diao (sdiaoaa@connect.ust.hk) or Yongyu Lei (yleiah@connect.ust.hk).
If you use or extend our work, please cite the following paper:
@inproceedings{diaodoolittle,
title={Doolittle: Benchmarks and Corpora for Academic Writing Formalization},
author={Diao, Shizhe and Lei, Yongyu and Pan, Liangming and Fang, Tianqing and Zhou, Wangchunshu and Keh, Sedrick Scott and Kan, Min-Yen and Zhang, Tong},
booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023}
}