ruen
README.md
acc.py
accBackend.py
analyze-transliterations.py
aren-config.tape
augment-parallel-names-with-russian-inflections.py
convert-alignments-to-cdec-format.py
convert-alignments-to-testset.py
convert-bars-format-to-m2m-format.py
convert-cdec-kbest-output-to-xml.py
convert-test-xml-format-to-cdec-input-format.py
convert-xml-format-to-m2m-format.py
convert-xml-format-to-wordpair-format.py
create-kbest-grammar.py
filter-rules.py
hien-config.tape
mono.en.char.lm
prob-ylen-given-xlen.py
remove-long-examples.py
rerank.py
rerankBackend.py
ruen-config.tape
split-alignments-into-train-test.py
string-to-cdec-input.py
test-conditional-length-model.py
test.py
train-conditional-length-model.py
translit-oovs.py
translit.tape
tuneRerankWeights.py
word-to-char.py

README.md

scripts for training a transliterator using a list of transliteration pairs.

dependencies:

configurations:

an example configuration file is provided in ruen-config.tape. the following variables are mandatory:

  • ducttape_output output directory
  • transliterator_home root of the transliterator's repository
  • all_oovs source-language words which need to be transliterated (e.g. a test set)
  • char_lm kenlm-compiled language model of target-language characters. an English character language model is provided (mono.en.char.lm)
  • transliteration_pairs src-tgt transliterations, one per line, formatted as SOURCE LANGUAGE ||| CEURSE LAUNJE
  • m2m_maxX maximum length of a source-language character sequence which corresponds to one character in the target language
  • m2m_maxY maximum length of a target-language character sequence which corresponds to one character in the source language
  • nprocs number of processors to use for training
  • wammar_utils_dir root of this repository
  • m2m_aligner path to m2m aligner
  • cdec_dir path to cdec decoder
  • DelX: yes means that some characters in the source language may be deleted
  • DelY: yes means that some characters in the target language may be deleted
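
the transliteration_pairs format described above can be checked before training. the following is a minimal sketch in Python, not part of this repository: the function names parse_pair and read_pairs are hypothetical, and it assumes each line holds exactly one source string and one target string separated by `|||`, with surrounding whitespace ignored.

```python
def parse_pair(line, sep="|||"):
    """Split one 'SOURCE ||| TARGET' line into a (source, target) tuple.

    Assumption: exactly one separator per line and both sides non-empty.
    """
    parts = [part.strip() for part in line.split(sep)]
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected 'SOURCE ||| TARGET', got: %r" % line)
    return parts[0], parts[1]


def read_pairs(lines):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_pair(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # The example pair is the one given in the variable description.
    print(parse_pair("SOURCE LANGUAGE ||| CEURSE LAUNJE"))
```

running a check like this over the pairs file surfaces malformed lines early, before the aligner and decoder steps are started.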

example usage:

ducttape translit.tape -C ruen-config.tape -p Full -y
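
to run the same workflow against another configuration (for instance hien-config.tape or aren-config.tape from this repository), the command above can be assembled programmatically. this is a rough sketch, not part of the repository: build_ducttape_command and run_ducttape are hypothetical helper names, and the flags are copied verbatim from the example usage line rather than from the ducttape documentation.

```python
import shutil
import subprocess


def build_ducttape_command(workflow="translit.tape",
                           config="ruen-config.tape",
                           plan="Full",
                           assume_yes=True):
    """Return the argument list for the ducttape call shown above."""
    cmd = ["ducttape", workflow, "-C", config, "-p", plan]
    if assume_yes:
        # -y is taken from the example usage line.
        cmd.append("-y")
    return cmd


def run_ducttape(cmd):
    """Run the command, failing early if ducttape is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("ducttape executable not found on PATH")
    return subprocess.call(cmd)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_ducttape_command(config="hien-config.tape")))
```

passing the arguments as a list avoids shell quoting problems when paths contain spaces.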

todos:

  • use mpi_adagrad_optimize instead of mpi_flex_optimize
  • rewrite convert-alignments-to-cdec-format.py

disclaimer:

scripts are still under development and may be unstable. please do contact me if anything does not work.

if you use this software, consider citing our ACL 2012 workshop paper: http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf