GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

This work proposes a generative paradigm for translation tasks that leverages LLMs to generate higher-quality translations from the N-best hypotheses decoded by a foundation model (e.g., SeamlessM4T-Large-V2). We also release the HypoTranslate dataset to support LLM finetuning, which contains over 592K pairs of N-best hypotheses and ground-truth translations in 11 languages. Experiments show that GenTranslate significantly outperforms the state-of-the-art SeamlessM4T-Large-V2 on various speech and machine translation benchmarks.
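To make the N-best idea concrete, the following minimal sketch shows one way an N-best list could be packed into a single prompt for an LLM. The function name `build_nbest_prompt` and the wording of the template are assumptions made for illustration; they are not taken from the GenTranslate code or the HypoTranslate dataset.

```python
from typing import List


def build_nbest_prompt(hypotheses: List[str], target_language: str) -> str:
    """Format an N-best list as one instruction prompt.

    The wording of this template is hypothetical; it is not taken from
    the GenTranslate code.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    # Number the hypotheses in their original ranking order, best first.
    numbered = "\n".join(
        f"{rank}. {text.strip()}" for rank, text in enumerate(hypotheses, start=1)
    )
    return (
        f"Below are {len(hypotheses)} candidate {target_language} translations "
        "of the same input, ranked by the foundation model.\n"
        f"{numbered}\n"
        f"Write the single best {target_language} translation:"
    )


if __name__ == "__main__":
    # Toy hypotheses, invented for this example.
    nbest = [
        "the cat sat on the mat",
        "the cat sits on the mat",
        "a cat sat on a mat",
    ]
    print(build_nbest_prompt(nbest, "English"))
```

The LLM sees all candidates at once, so it can combine evidence across hypotheses rather than simply picking the top-ranked one.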

TIP: At this time (before publication), we provide only the inference script, test data, and a subset of the trained models, for inference use. The full resources of this paper, including the training script, the entire HypoTranslate dataset, and all models, will be open-sourced upon publication to benefit the community.

Conda Environment Configuration

Our code is built on lit-gpt. Please refer to the official tutorial to build the conda environment, then install the required packages with the following command:

pip install -r requirements.txt

Code

Model code: lit_gpt/gentrans.py;
Inference script: infer.sh;

Models

For LLMs, please refer to the tutorial for configuration steps; it supports many mainstream LLMs such as LLaMA-2.
For trained adapter checkpoints, please refer to our HuggingFace repo.

Dataset

We have released our HypoTranslate dataset at HuggingFace.

Inference Usage

We provide two trained models and their corresponding test sets for inference, covering the FLEURS Fr-En and En-Fr speech translation (ST) tasks. Before running inference, please complete the preparation steps below:

Go to infer.sh:
- Specify your conda environment <your-conda-env>;
- Specify the source-target language pair; we provide two example pairs, fr-en and en-fr;
- Specify the LLM size: 7b for fr-en, 13b for en-fr;
Download and convert the LLaMA-2 pre-trained checkpoints:
- Please refer to the official tutorial to configure Llama-2-7b-hf and Llama-2-13b-hf;
Go to inference/gentrans.py:
- Specify the experiment directory exp_dir: the root path of this README.md file;
- Specify the data directory data_dir: the absolute path of the test data (.pt file);
- Specify the LLM directory llm_dir: the absolute path of your downloaded LLaMA-2 checkpoint;
- Specify the adapter directory adapter_dir: the absolute path of our released adapter checkpoint.
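Because the settings above are spread across two files, a quick pre-flight check can catch a wrong path before the LLM is loaded. The sketch below is a standalone helper, not part of the repository: the function name `find_setup_problems` and the placeholder paths are hypothetical, while the setting names and the language-pair-to-checkpoint pairing come from the steps above.

```python
from pathlib import Path
from typing import Dict, List

# Pairing taken from the steps above: 7b for fr-en, 13b for en-fr.
LLM_FOR_PAIR = {"fr-en": "Llama-2-7b-hf", "en-fr": "Llama-2-13b-hf"}

# The steps above ask for absolute paths for these three settings.
MUST_BE_ABSOLUTE = {"data_dir", "llm_dir", "adapter_dir"}


def find_setup_problems(lang_pair: str, paths: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the setup looks fine."""
    problems = []
    if lang_pair not in LLM_FOR_PAIR:
        problems.append(f"unsupported language pair: {lang_pair}")
    for name, value in paths.items():
        path = Path(value)
        if name in MUST_BE_ABSOLUTE and not path.is_absolute():
            problems.append(f"{name} is not an absolute path: {value}")
        elif not path.exists():
            problems.append(f"{name} does not exist: {value}")
    return problems


if __name__ == "__main__":
    # Placeholder paths for illustration only; replace them with your own.
    settings = {
        "exp_dir": ".",
        "data_dir": "/path/to/test_data.pt",
        "llm_dir": "/path/to/Llama-2-7b-hf",
        "adapter_dir": "/path/to/adapter_checkpoint",
    }
    for problem in find_setup_problems("fr-en", settings):
        print(problem)
```

Running the check once per language direction confirms that the four paths in inference/gentrans.py point at existing locations before any model is loaded.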

You can now run inference for your specified language direction with:

bash infer.sh

You will see the BLEU results of GenTranslate on your specified test set.

References

@article{hu2024gentranslate,
  title={GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators},
  author={Hu, Yuchen and Chen, Chen and Yang, Chao-Han Huck and Li, Ruizhe and Zhang, Dong and Chen, Zhehuai and Chng, Eng Siong},
  journal={arXiv preprint arXiv:2402.06894},
  year={2024}
}
