wammar-utils

This repository is designed to be included as a submodule in other repositories.

Description of utilities:

create-vocab.py

A Python script that extracts the types (distinct tokens) in a text file and gives each of them an integer ID.
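A minimal sketch of the idea (not the script's actual interface; the tab-separated id/type output format below is an assumption):

```python
# Sketch: assign an integer id to every distinct token (type) in a text file.
import sys

def create_vocab(text_path, vocab_path):
    vocab = {}
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                if token not in vocab:
                    vocab[token] = len(vocab)  # ids are assigned in order of first occurrence
    with open(vocab_path, "w", encoding="utf-8") as out:
        for token, token_id in vocab.items():
            out.write(f"{token_id}\t{token}\n")

if __name__ == "__main__":
    create_vocab(sys.argv[1], sys.argv[2])
```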

encode-corpus.py

A Python script that replaces each type in the input file with a unique integer ID in the target file. A second output file contains the id:type mappings.
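A minimal sketch of the behavior described above; the colon-separated mapping-file format and the command-line arguments are assumptions, not the script's actual interface:

```python
# Sketch: replace each token with an integer id and write the id:type mapping alongside.
import sys

def encode_corpus(in_path, out_path, vocab_path):
    vocab = {}
    with open(in_path, encoding="utf-8") as f, open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            # setdefault assigns the next free id the first time a type is seen
            ids = [vocab.setdefault(tok, len(vocab)) for tok in line.split()]
            out.write(" ".join(map(str, ids)) + "\n")
    with open(vocab_path, "w", encoding="utf-8") as v:
        for tok, tok_id in vocab.items():
            v.write(f"{tok_id}:{tok}\n")

if __name__ == "__main__":
    encode_corpus(sys.argv[1], sys.argv[2], sys.argv[3])
```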

decode-corpus.py

The inverse of encode-corpus.py: uses the id:type mappings to restore the original types from their integer IDs.
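A minimal sketch, assuming the colon-separated id:type mapping format from the previous sketch:

```python
# Sketch: read the id:type mapping, then map each id in the encoded corpus back to its type.
import sys

def decode_corpus(encoded_path, vocab_path, out_path):
    id2type = {}
    with open(vocab_path, encoding="utf-8") as v:
        for line in v:
            tok_id, tok = line.rstrip("\n").split(":", 1)
            id2type[tok_id] = tok
    with open(encoded_path, encoding="utf-8") as f, open(out_path, "w", encoding="utf-8") as out:
        for line in f:
            out.write(" ".join(id2type[i] for i in line.split()) + "\n")

if __name__ == "__main__":
    decode_corpus(sys.argv[1], sys.argv[2], sys.argv[3])
```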

filter-long-sent-pairs.py

A Python script that filters out parallel sentence pairs whose number of tokens exceeds a given limit.
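A minimal sketch, assuming the two sides of the corpus are two line-aligned files and the length limit applies to both sides; the actual script's arguments and criteria may differ:

```python
# Sketch: keep only sentence pairs where both sides have at most max_len tokens.
import sys

def filter_long_pairs(src_path, tgt_path, max_len, out_src_path, out_tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt, \
         open(out_src_path, "w", encoding="utf-8") as out_src, \
         open(out_tgt_path, "w", encoding="utf-8") as out_tgt:
        for s, t in zip(src, tgt):
            if len(s.split()) <= max_len and len(t.split()) <= max_len:
                out_src.write(s)
                out_tgt.write(t)

if __name__ == "__main__":
    filter_long_pairs(sys.argv[1], sys.argv[2], int(sys.argv[3]), sys.argv[4], sys.argv[5])
```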

split-parallel-corpus.py

A Python script that splits a parallel corpus into train/dev/test sets.
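A minimal sketch, assuming line-aligned source/target files and a simple contiguous split; the actual script may shuffle lines or take different arguments:

```python
# Sketch: split a line-aligned parallel corpus into train/dev/test portions.
import sys

def split_parallel_corpus(src_path, tgt_path, dev_size, test_size, prefix):
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    assert len(src_lines) == len(tgt_lines), "corpus sides must be line-aligned"
    train_end = len(src_lines) - dev_size - test_size
    splits = {"train": slice(0, train_end),
              "dev": slice(train_end, train_end + dev_size),
              "test": slice(train_end + dev_size, None)}
    for split_name, sl in splits.items():
        for side, lines in (("src", src_lines), ("tgt", tgt_lines)):
            with open(f"{prefix}.{split_name}.{side}", "w", encoding="utf-8") as out:
                out.writelines(lines[sl])

if __name__ == "__main__":
    split_parallel_corpus(sys.argv[1], sys.argv[2], int(sys.argv[3]), int(sys.argv[4]), sys.argv[5])
```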

american_english.txt, british_english.txt

American vs. British English vocabulary mappings collected from http://www.tysto.com/uk-us-spelling-list.html.
