This guide explains how to use the code in this repository. Before using the code, please make sure you have installed the necessary dependencies as described in the INSTALL.md file.
The following instructions are described for any [src]-[tgt] pair and include optional instructions if one wishes to train models in both directions using the same dataset.
The following instructions are meant to explain using MTCue on new data. To use the data from the paper, follow the instructions in INSTALL.md.
- Put your desired dataset in the
data/
folder with the following hierarchy, wheresrc
,tgt
are the source and target language respectively:
- Source/target files in
data/[src]-[tgt]/{train,dev,test}.{src,tgt}
- Context files in
data/[src]-[tgt]/context/{train,dev,test}.[context_name].{cxt,[src],[tgt]}
, wherecxt
indicates an "other" metadata type, [src] is source context and [tgt] is target context. Source context is used when training forward models and target context is used when training backward ([tgt] -> [src]) models. None of the contexts is required per se, but this needs specifying when making embeddings/training.
- To create embeddings for the forward model, run
bash cxt2vec/main.py --path data/[src]-[tgt] --dest_path data/[src]-[tgt]/embeddings_[src] --suffixes [src],cxt --model minilm
- To create embeddings for the backward model, run
bash cxt2vec/main.py --path data/[src]-[tgt] --dest_path data/[src]-[tgt]/embeddings_[tgt] --suffixes [tgt],cxt --model minilm
- Once embeddings are created, run
bash scripts/mtcue/prepare.sh --src [src] --tgt [tgt]
for the forward model (swap [src] and [tgt] for backward model)
To train, run bash scripts/mtcue/finetune.sh --src [src] --tgt [tgt] --arch [mtcue/mtcue-big]
.
Note that the repository offers the possiblity to train an MTCue-big model which is ~4x bigger and offers slightly better translation performance (+1-2 BLEU) and similar performance on the contextual tasks. Hyperparameters are pre-selected within the file but can be modified.
To evaluate, run bash scripts/mtcue/evaluate.sh --src [src] --tgt [tgt]
.
To use the model for inference on new data:
- Put your data in the
data/evaluation/my_data
repository, with the same hierarchy as training data:
data/evaluation/my_data/test.{[src],[tgt]}
for source/target files anddata/evaluation/my_data/context/test.[context_name].{cxt,[src],[tgt]}
for context files.
- Run
bash scripts/mtcue/inference_prepare.sh --src [src] --tgt [tgt] --path data/evaluation/[my_data]
. The binary files will be saved underdata-bin/[my_data]
. - Run
bash scripts/mtcue/inference.sh --data-bin [my_data] --model [model_name]/[checkpoint_name].pt
for inference. Hypotheses will be saved indata/evaluation/my_data/system.[tgt]
.
You can specify the input data and other inference parameters using command-line arguments or a configuration file.
We provide pre-trained models for the experiments described in the paper. You can download the models from [Insert link to the pre-trained models here]. To use a pre-trained model, follow the instructions in the README.md file.
If you use this code in your research, please cite our paper:
@misc{vincent2023mtcue,
title={MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation},
author={Sebastian Vincent and Robert Flynn and Carolina Scarton},
year={2023},
eprint={2305.15904},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
(An official citation for ACL will be inserted here when the camera-ready version is out.)
If you have any questions or issues with the code, please contact [Insert contact information here].