ruen guessed-names.ru-en Jan 27, 2014
README.md Update README.md Feb 5, 2014
acc.py add all scripts Mar 13, 2013
accBackend.py add all scripts Mar 13, 2013
analyze-transliterations.py add all scripts Mar 13, 2013
aren-config.tape add configurations for arabic and hindi Feb 5, 2014
augment-parallel-names-with-russian-inflections.py augment-parallel-names-with-russian-inflections.py Jan 27, 2014
convert-alignments-to-cdec-format.py bug fix Feb 5, 2014
convert-alignments-to-testset.py add all scripts Mar 13, 2013
convert-bars-format-to-m2m-format.py add all scripts Mar 13, 2013
convert-cdec-kbest-output-to-xml.py add all scripts Mar 13, 2013
convert-test-xml-format-to-cdec-input-format.py add all scripts Mar 13, 2013
convert-xml-format-to-m2m-format.py add configurations for arabic and hindi Feb 5, 2014
convert-xml-format-to-wordpair-format.py add configurations for arabic and hindi Feb 5, 2014
create-kbest-grammar.py we only used to print the first word of a hindi oov phrase. now fixed. Feb 20, 2014
filter-rules.py add all scripts Mar 13, 2013
hien-config.tape add configurations for arabic and hindi Feb 5, 2014
mono.en.char.lm fix the ducttape config files Feb 5, 2014
prob-ylen-given-xlen.py add all scripts Mar 13, 2013
remove-long-examples.py add all scripts Mar 13, 2013
rerank.py add all scripts Mar 13, 2013
rerankBackend.py add all scripts Mar 13, 2013
ruen-config.tape fix the ducttape config files Feb 5, 2014
split-alignments-into-train-test.py add all scripts Mar 13, 2013
string-to-cdec-input.py add all scripts Mar 13, 2013
test-conditional-length-model.py add all scripts Mar 13, 2013
test.py add all scripts Mar 13, 2013
train-conditional-length-model.py add all scripts Mar 13, 2013
translit-oovs.py poor cdec integration Feb 19, 2014
translit.tape poor cdec integration Feb 19, 2014
tuneRerankWeights.py add all scripts Mar 13, 2013
word-to-char.py poor cdec integration Feb 19, 2014

README.md

scripts for training a transliterator using a list of transliteration pairs.

dependencies:

configurations:

an example configuration file, ruen-config.tape, is provided. The following variables are mandatory:

  • ducttape_output output directory
  • transliterator_home root of the transliterator's repository
  • all_oovs source-language words that need to be transliterated (e.g. a test set)
  • char_lm kenlm-compiled language model of target-language characters. An English character language model is provided (mono.en.char.lm)
  • transliteration_pairs src-tgt transliterations, one per line, formatted as SOURCE LANGUAGE ||| CEURSE LAUNJE
  • m2m_maxX maximum length of a source-language character sequence that corresponds to one character in the target language
  • m2m_maxY maximum length of a target-language character sequence that corresponds to one character in the source language
  • nprocs number of processors to use for training
  • wammar_utils_dir root of this repository
  • m2m_aligner path to m2m aligner
  • cdec_dir path to cdec decoder
  • DelX yes means that some characters in the source language may be deleted
  • DelY yes means that some characters in the target language may be deleted
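
the transliteration_pairs format described above can be sanity-checked before training. the following is only a minimal sketch, assuming each non-blank line holds exactly one `|||` separator with a non-empty string on each side. the name parse_pairs is hypothetical and is not one of the scripts in this repository:

```python
def parse_pairs(lines):
    """Yield (source, target) tuples from 'SOURCE ||| TARGET' lines.

    Blank lines are skipped. A line that does not split into exactly two
    non-empty fields raises ValueError that names the offending line number.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        # Split on the separator and trim whitespace around each field.
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(
                "line %d: expected 'SOURCE ||| TARGET', got %r" % (lineno, line)
            )
        yield fields[0], fields[1]
```

for example, `list(parse_pairs(open(path, encoding="utf-8")))` would load a whole pairs file (where `path` is your own file), and a malformed line is reported with its line number instead of failing later in the pipeline.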

example usage:

ducttape translit.tape -C ruen-config.tape -p Full -y
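
before launching, it can help to check that the configuration file defines every mandatory variable listed above. the sketch below assumes variables appear in the .tape file as `name=value` assignments at the start of a line; that is an assumption about the config layout, not something this README specifies. the helper names missing_variables and ducttape_command are hypothetical:

```python
import re

# Mandatory variable names, copied from the list in this README.
MANDATORY = [
    "ducttape_output", "transliterator_home", "all_oovs", "char_lm",
    "transliteration_pairs", "m2m_maxX", "m2m_maxY", "nprocs",
    "wammar_utils_dir", "m2m_aligner", "cdec_dir", "DelX", "DelY",
]

# Assumption: an assignment looks like `name=value`, optionally indented.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_]\w*)\s*=")


def missing_variables(config_text):
    """Return the mandatory variable names that are never assigned."""
    defined = set()
    for line in config_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            defined.add(match.group(1))
    return [name for name in MANDATORY if name not in defined]


def ducttape_command(config_path, workflow="translit.tape", plan="Full"):
    """Build the argument list for the example invocation shown above."""
    return ["ducttape", workflow, "-C", config_path, "-p", plan, "-y"]
```

if `missing_variables` returns an empty list, the command from `ducttape_command("ruen-config.tape")` can be passed to `subprocess.check_call`.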

todos:

  • use mpi_adagrad_optimize instead of mpi_flex_optimize
  • rewrite convert-alignments-to-cdec-format.py

disclaimer:

scripts are still under development and may be unstable. please do contact me if anything does not work.

if you use this software, consider citing our ACL 2012 workshop paper: http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf