scripts for training a transliterator using a list of transliteration pairs.



an example configuration file, ruen-config.tape, is provided. The following variables are mandatory:

  • ducttape_output output directory
  • transliterator_home root of the transliterator's repository
  • all_oovs source-language words that need to be transliterated (e.g. a test set)
  • char_lm KenLM-compiled character-level language model for the target language; an English character language model is provided
  • transliteration_pairs src-tgt transliterations, one per line, formatted as SOURCE LANGUAGE ||| CEURSE LAUNJE
  • m2m_maxX maximum length of a source-language character sequence that can correspond to one character in the target language
  • m2m_maxY maximum length of a target-language character sequence that can correspond to one character in the source language
  • nprocs number of processors to use for training
  • wammar_utils_dir root of this repository
  • m2m_aligner path to m2m aligner
  • cdec_dir path to cdec decoder
  • DelX if set to yes, some characters in the source language may be deleted
  • DelY if set to yes, some characters in the target language may be deleted
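
before training, it can help to check that the transliteration_pairs file follows the SOURCE ||| TARGET layout described above. The following is a minimal sketch in Python and is not part of this repository: the function name parse_pairs is hypothetical, and it assumes each line holds exactly one "|||" separator with non-empty text on both sides. Blank lines are reported as malformed under that assumption.

```python
from typing import Iterable, List, Tuple

# Field delimiter taken from the transliteration_pairs description above.
SEPARATOR = "|||"


def parse_pairs(lines: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[int]]:
    """Split each line into (source, target).

    Returns the well-formed pairs and the 1-based numbers of malformed lines.
    Hypothetical helper for illustration; not provided by the repository.
    """
    pairs: List[Tuple[str, str]] = []
    bad_lines: List[int] = []
    for number, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.rstrip("\n").split(SEPARATOR)]
        # Assumption: a well-formed line has exactly two non-empty fields.
        if len(fields) == 2 and all(fields):
            pairs.append((fields[0], fields[1]))
        else:
            bad_lines.append(number)
    return pairs, bad_lines
```

the function accepts any iterable of strings, so an open file handle (for example open(path, encoding="utf-8"), where path is your pairs file) can be passed directly; the returned line numbers point at entries to fix before running ducttape.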

example usage:

ducttape translit.tape -C ruen-config.tape -p Full -y


todo:

  • use mpi_adagrad_optimize instead of mpi_flex_optimize
  • rewrite


scripts are still under development and may be unstable. please do contact me if anything does not work.

if you use this software, consider citing our ACL 2012 workshop paper: