WSPAlign

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction, published at the ACL 2023 main conference.

This repository contains the source code for the paper WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction. Part of the implementation is taken from word_align. The inference and evaluation code is at WSPAlign.InferEval.

Requirements

Run pip install -r requirements.txt to install all the required packages.

Model list

Model	Description
qiyuw/WSPAlign-xlm-base	Pre-trained on XLM-RoBERTa
qiyuw/WSPAlign-mbert-base	Pre-trained on mBERT
qiyuw/WSPAlign-ft-kftt	Finetuned on the English-Japanese KFTT dataset
qiyuw/WSPAlign-ft-deen	Finetuned on the German-English dataset
qiyuw/WSPAlign-ft-enfr	Finetuned on the English-French dataset
qiyuw/WSPAlign-ft-roen	Finetuned on the Romanian-English dataset

Use our model checkpoints with Hugging Face

Note: For Japanese, Chinese, and other Asian languages, we recommend using mBERT-based models such as qiyuw/WSPAlign-mbert-base for better performance, as discussed in the original paper.
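
As a rough illustration of loading one of the checkpoints above, the sketch below treats word alignment as span prediction: one word in the source sentence is marked, and the model is asked for the matching span in the target sentence. Several things here are assumptions rather than documented behavior: that the checkpoints load with the `transformers` question-answering pipeline, that the marked source sentence goes in as the question and the target sentence as the context, and that `¶` is the marker symbol. The helper names `mark_word` and `align_word` are hypothetical; check the model cards and WSPAlign.InferEval for the exact input format.

```python
MARKER = "¶"  # assumption: the symbol used to mark the word to be aligned


def mark_word(tokens, index, marker=MARKER):
    """Return the sentence with tokens[index] wrapped in marker symbols."""
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the token list")
    marked = tokens[:index] + [marker, tokens[index], marker] + tokens[index + 1:]
    return " ".join(marked)


def align_word(src_tokens, index, tgt_sentence, model_name="qiyuw/WSPAlign-ft-kftt"):
    """Predict the target span aligned to src_tokens[index].

    Assumption: the checkpoint exposes a span-prediction (question-answering)
    head. The model is downloaded from the Hugging Face Hub on first use.
    """
    from transformers import pipeline  # imported lazily; needs `transformers`

    qa = pipeline("question-answering", model=model_name)
    # The pipeline returns a dict with "answer", "score", "start", and "end".
    return qa(question=mark_word(src_tokens, index), context=tgt_sentence)
```

Calling `align_word("I like cats".split(), 2, "私は猫が好きです")` would then return the predicted target span for "cats" together with its character offsets and score. For many sentence pairs, build the pipeline once and reuse it instead of reloading it per call.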

Data preparation

Dataset list	Description
qiyuw/wspalign_pt_data	Pre-training dataset
qiyuw/wspalign_ft_data	Finetuning dataset
qiyuw/wspalign_few_ft_data	Few-shot finetuning dataset
qiyuw/wspalign_test_data	Test dataset for evaluation

The construction of the finetuning and test datasets is described at word_align.

Run download_dataset.sh to download all the above datasets.
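
If you prefer to fetch the data from Python, the sketch below is one possible alternative. It is not the content of download_dataset.sh. It assumes the four names in the table are dataset repositories on the Hugging Face Hub and uses `snapshot_download` from `huggingface_hub`; the `data` output directory and the helper names are hypothetical.

```python
from pathlib import Path

# Dataset names copied from the table above.
DATASET_REPOS = [
    "qiyuw/wspalign_pt_data",
    "qiyuw/wspalign_ft_data",
    "qiyuw/wspalign_few_ft_data",
    "qiyuw/wspalign_test_data",
]


def local_dir_for(repo_id, root="data"):
    """Map 'owner/name' to '<root>/name' (hypothetical local layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root="data"):
    """Download every dataset repository into its own folder under root."""
    from huggingface_hub import snapshot_download  # needs `huggingface_hub`

    for repo_id in DATASET_REPOS:
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",  # assumption: these are Hub dataset repos
            local_dir=local_dir_for(repo_id, root),
        )


if __name__ == "__main__":
    download_all()
```

The training scripts may expect a different directory layout, so check the paths used in pretrain.sh and finetune.sh before relying on this.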

Pre-train and finetune

You can pre-train, finetune, and evaluate by running the following scripts.

Pre-train

See pretrain.sh for details.

You can also use a pre-trained model directly for zero-shot word alignment; see zeroshot.sh for details.

Finetune

See finetune.sh and fewshot.sh for details.

Evaluate

Refer to WSPAlign Inference for details.
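
For orientation, word alignment quality is commonly reported as precision, recall, F1, and alignment error rate (AER). The sketch below is not the evaluation code from WSPAlign Inference. It only illustrates the standard definitions, assuming alignments are sets of (source index, target index) pairs and the gold data distinguishes sure links from possible links. The function name `alignment_scores` is hypothetical.

```python
def alignment_scores(predicted, sure, possible=None):
    """Compute precision, recall, F1, and AER for one set of alignments.

    predicted: iterable of (src_idx, tgt_idx) pairs produced by the aligner.
    sure:      gold links annotated as certain.
    possible:  gold links annotated as acceptable; sure links always count
               as possible, so they are merged in here.
    """
    predicted, sure = set(predicted), set(sure)
    possible = sure | set(possible or ())
    if not predicted or not sure:
        raise ValueError("predicted and sure must both be non-empty")

    hit_sure = len(predicted & sure)
    hit_possible = len(predicted & possible)

    precision = hit_possible / len(predicted)
    recall = hit_sure / len(sure)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    aer = 1.0 - (hit_sure + hit_possible) / (len(predicted) + len(sure))
    return {"precision": precision, "recall": recall, "f1": f1, "aer": aer}


scores = alignment_scores(
    predicted={(0, 0), (1, 1), (2, 2)},
    sure={(0, 0), (1, 1)},
    possible={(2, 1)},
)
print(scores)
```

When the gold data has no separate possible links, precision and recall are both measured against the sure links, and AER equals 1 minus F1.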

Citation

If you use our code or model, please cite our paper:

@inproceedings{wu-etal-2023-wspalign,
    title = "{WSPA}lign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction",
    author = "Wu, Qiyu  and Nagata, Masaaki  and Tsuruoka, Yoshimasa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.621",
    pages = "11084--11099",
}

License

This software is released under the CC-BY-NC-SA-4.0 License; see LICENSE.txt.
