LM-Combiner

Implementation of the COLING 2024 paper "LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction".

All code and models have been released. Thank you for your patience!

Requirements

The model is implemented with the Hugging Face framework, and the required environment is as follows:

Python
torch
transformers
datasets
tqdm
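
A quick way to confirm that the packages above can be found before running the scripts is sketched below. The helper name `missing_packages` is hypothetical and not part of this repository; it only checks that each module can be located, not which version is installed.

```python
import importlib.util

# Package names taken from the requirements list above.
REQUIRED = ["torch", "transformers", "datasets", "tqdm"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```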

For evaluation, follow the environment configuration of ChERRANT.

Training Stage

Preprocessing

Baseline Model

First, we train a baseline model (Chinese-Bart-large) for LM-Combiner on the FCGEC dataset in Seq2Seq format.

sh ./script/run_bart_baseline.sh

Candidate Datasets

Candidate Sentence Generation

We use the baseline model to generate candidate sentences for the training and test sets.
On tasks where the model fits the training data well (spelling correction, etc.), we recommend generating the candidate sentences with the K-fold cross-inference described in the paper.

python ./src/predict_bl_tsv.py
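
The K-fold cross-inference mentioned above can be sketched as follows. This is an illustrative outline, not the repository's code: `train_fn` and `predict_fn` are hypothetical callables standing in for baseline training and inference, and the fold assignment (index modulo `k`) is an assumption.

```python
def k_fold_candidates(pairs, k, train_fn, predict_fn):
    """Generate one candidate per source with a model that never saw it.

    pairs      : list of (source, target) sentence pairs
    k          : number of folds
    train_fn   : hypothetical callable, list of pairs -> model
    predict_fn : hypothetical callable, (model, source) -> candidate sentence
    """
    candidates = [None] * len(pairs)
    for fold in range(k):
        # Hold out every k-th example and train on the remaining folds.
        held_out = [i for i in range(len(pairs)) if i % k == fold]
        train_part = [p for i, p in enumerate(pairs) if i % k != fold]
        model = train_fn(train_part)
        for i in held_out:
            candidates[i] = predict_fn(model, pairs[i][0])
    return candidates


# Toy stand-ins: the "model" memorises its training pairs and otherwise
# copies the input, which makes the effect of holding out data visible.
def toy_train(train_pairs):
    return dict(train_pairs)


def toy_predict(model, source):
    return model.get(source, source)


pairs = [("a1", "A1"), ("b2", "B2"), ("c3", "C3"), ("d4", "D4")]
print(k_fold_candidates(pairs, 2, toy_train, toy_predict))
```

In the toy run every source comes back unchanged, because each sentence is predicted by a model that was trained without it. The same property is what lets the real candidates reflect the baseline's behaviour on unseen text rather than memorised corrections.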

Golden Labels Merging

We use the ChERRANT tool to fully decouple the error correction task from the rewriting task by merging the golden labels.

python ./scorer_wapper/golden_label_merging.py
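
One way to picture the merging step: extract the edits the candidate makes to the source, keep only those that also appear among the gold edits, and apply them back to the source. The sketch below illustrates that reading under stated assumptions. It uses Python's `difflib` as a character-level stand-in for ChERRANT's edit extraction, `extract_edits` and `merge_golden_labels` are hypothetical names, and the repository's script may align and match edits differently.

```python
from difflib import SequenceMatcher


def extract_edits(source, target):
    """Return edits as (start, end, replacement) spans over `source`."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def merge_golden_labels(source, candidate, gold):
    """Keep only the candidate edits that the gold sentence also makes."""
    gold_edits = set(extract_edits(source, gold))
    kept = [e for e in extract_edits(source, candidate) if e in gold_edits]
    # Apply from right to left so earlier offsets stay valid.
    merged = source
    for start, end, replacement in sorted(kept, reverse=True):
        merged = merged[:start] + replacement + merged[end:]
    return merged


# Toy strings (assumption: real data would be Chinese sentences).
source = "abcdef"
gold = "abXdef"        # the only true error is c -> X
candidate = "abXdeY"   # fixes c -> X but also over-corrects f -> Y
print(merge_golden_labels(source, candidate, gold))  # abXdef
```

Under this reading, the rewriting target keeps the candidate's correct edits and reverts its over-corrections, so the rewriting model is not asked to find errors the baseline missed.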

LM-Combiner (gpt2)

Subsequently, we train LM-Combiner on the constructed candidate dataset.
In particular, we supplement the gpt2 vocab (mainly with double quotes) to better fit the FCGEC dataset; see ./pt_model/gpt2-base/vocab.txt for details.

sh ./script/run_lm_combiner.py
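
To see why a vocabulary might need supplementing, a coverage check along these lines can list characters in the data that the vocabulary lacks. This is a sketch on toy in-memory data; `uncovered_characters` is a hypothetical helper, and it assumes a character-level vocabulary with one token per entry.

```python
from collections import Counter


def uncovered_characters(sentences, vocab):
    """Count non-space characters in `sentences` that are absent from `vocab`."""
    known = set(vocab)
    return Counter(
        ch
        for sentence in sentences
        for ch in sentence
        if not ch.isspace() and ch not in known
    )


# Toy vocabulary that lacks the curly double quotes.
vocab = ["他", "说", "好", "，", "。"]
sentences = ["他说“好”。", "他说，“好”。"]

missing = uncovered_characters(sentences, vocab)
print(sorted(missing.items()))
```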

Evaluation

We use the official ChERRANT script to evaluate the model on FCGEC-dev.

sh ./script/compute_score.sh

method	Prec	Rec	F0.5
bart_baseline	28.88	38.95	30.46
+lm_combiner	52.15	37.41	48.34
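
For reference, F0.5 combines precision and recall while weighting precision more heavily. The sketch below shows the standard F-beta formula; `f_beta` is an illustrative helper, not part of ChERRANT, and it recomputes the score from rounded percentages, so the last digit can differ from a scorer that works from raw edit counts.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the +lm_combiner row above, in percent.
print(round(f_beta(52.15, 37.41), 2))  # 48.34
```

Because the score is a weighted harmonic mean, it always lies between the precision and the recall.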

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan  and
      Wang, Baoxin  and
      Liu, Yijun  and
      Wu, Dayong  and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
