[ICCV'23] Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining

This repo is the official implementation of "Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining" as well as its follow-ups. It currently includes code and models for the following tasks:

Gloss-free Sign Language Translation: Included in this repo.

Visual-Language Pre-training for SLT Tasks: Included in this repo.

NEWS

2023/12/26

  1. One can use the official 12-layer MBart decoder for text decoding by setting --decoder-type LLMD. Note that this requires pre-training with train_vlp.py.

Installation

conda create -n gfslt python=3.8
conda activate gfslt

# Please install PyTorch according to your CUDA version.
pip install -r requirements.txt
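
After installing PyTorch, a quick sanity check confirms that the installed build sees your GPU (a minimal sketch; pick whichever PyTorch/CUDA combination matches your system):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # True if a compatible GPU and driver are found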

Getting Started

Preparation

Please refer to pretrain_models to prepare the MBart weight files and the GFSLT model.
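
If you prefer to fetch the MBart weights programmatically, the sketch below uses Hugging Face transformers; the facebook/mbart-large-cc25 checkpoint and the ./pretrain_models/ target directory are assumptions, so adjust them to match the pretrain_models instructions.

from transformers import MBartForConditionalGeneration, MBartTokenizer

# Download the MBart checkpoint and tokenizer (assumed to be mbart-large-cc25)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

# Save them where the training scripts expect the weight files (hypothetical path)
model.save_pretrained("./pretrain_models/mbart-large-cc25")
tokenizer.save_pretrained("./pretrain_models/mbart-large-cc25")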

VLP Pretrain

If the goal is to pre-train only the visual encoder, the best option is the basic visual-language alignment strategy, which is a simple and efficient way to pre-train for SLT.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=1236 --use_env train_vlp.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp  
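
The alignment objective is essentially a CLIP-style symmetric contrastive loss between paired video and text embeddings. The sketch below is an illustrative simplification, not the repository's exact implementation; the function name, temperature value, and embedding shapes are assumptions.

import torch
import torch.nn.functional as F

def clip_style_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (batch, dim) tensors from the visual encoder and
    text encoder; matching rows are positive pairs.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = video_emb @ text_emb.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_v2t = F.cross_entropy(logits, targets)        # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)    # text -> video direction
    return (loss_v2t + loss_t2v) / 2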

VLP Pretrain V2

To pre-train the visual encoder and text decoder jointly, use the VLP-V2 version. It combines Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning, creating pre-training tasks that bridge the semantic gap between visual and textual representations while restoring masked sentences, so the visual encoder and text decoder are pre-trained together.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=1236 --use_env train_vlp_v2.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp_v2 --training-refurbish True --noise-rate 0.15 --noise-type omit_last --random-shuffle False  
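
The --training-refurbish flags control how target sentences are corrupted before the decoder restores them. The sketch below shows one plausible reading of the omit_last noise type at a 0.15 noise rate; the function name and the exact masking rule are assumptions rather than the repository's code.

def omit_last_noise(tokens, noise_rate=0.15, mask_token="<mask>"):
    """Corrupt a tokenized sentence by masking a fraction of its trailing tokens.

    tokens: list of word/subword strings for one target sentence.
    Returns a noised copy; the decoder is trained to reconstruct the original.
    """
    n_mask = max(1, int(round(len(tokens) * noise_rate)))
    noised = list(tokens)
    for i in range(len(tokens) - n_mask, len(tokens)):
        noised[i] = mask_token
    return noised

# Example: mask roughly the last 15% of a target sentence
print(omit_last_noise("am montag regnet es im norden".split()))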

GFSLT-VLP

Whichever pre-training strategy you choose, the downstream sign language translation task only needs the --finetune argument to point at the corresponding pre-trained model.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=1236 --use_env train_slt.py --batch-size 2 --epochs 200 --opt sgd --lr 0.01 --output_dir out/Gloss-Free \
--finetune ./out/vlp/checkpoint.pth 
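
Conceptually, --finetune just initialises the downstream model from the pre-trained checkpoint. The sketch below shows how such a partial load is typically done; the "model" key and the strict=False behaviour are assumptions about the checkpoint format, not the script's exact code.

import torch

def load_pretrained(model, ckpt_path="./out/vlp/checkpoint.pth"):
    """Initialise a downstream SLT model from a VLP checkpoint."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("model", checkpoint)   # assumed "model" key; fall back to the raw dict
    # strict=False keeps newly added heads randomly initialised instead of failing
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model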

Evaluation

To obtain more accurate evaluation results and multiple metrics, including BLEU-n, METEOR, ROUGE_L, and CIDEr, it is strongly recommended to perform a systematic evaluation with the following command on a single GPU.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port=1236 --use_env train_slt.py --batch-size 2 --epochs 200 --opt sgd --lr 0.01 --output_dir out/Gloss-Free --resume out/Gloss-Free/best_checkpoint.pth --eval 

However, this requires the nlgeval package; refer to the nlgeval README.md for instructions on how to install it.
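
Once nlgeval is installed, the corpus-level metrics can also be computed directly from plain-text files; the sketch below is a minimal example (the file names are placeholders, and the skip-thought/GloVe metrics are disabled to keep it lightweight).

from nlgeval import NLGEval

# Load references and hypotheses, one sentence per line (placeholder file names)
with open("references.txt") as f:
    references = [line.strip() for line in f]
with open("hypotheses.txt") as f:
    hypotheses = [line.strip() for line in f]

# compute_metrics expects a list of reference lists (one list per reference set)
evaluator = NLGEval(no_skipthoughts=True, no_glove=True)
metrics = evaluator.compute_metrics([references], hypotheses)
print(metrics)  # BLEU-1..4, METEOR, ROUGE_L, CIDEr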

Citation

If you find our work useful for your project, please consider citing the paper:

@InProceedings{Zhou_2023_ICCV,
    author    = {Zhou, Benjia and Chen, Zhigang and Clap\'es, Albert and Wan, Jun and Liang, Yanyan and Escalera, Sergio and Lei, Zhen and Zhang, Du},
    title     = {Gloss-Free Sign Language Translation: Improving from Visual-Language Pretraining},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {20871-20881}
}

LICENSE

The code is released under the MIT license.
