to be included as a submodule in other projects
SimulatedAnnealing
europarl-tools
AlignmentErrorRate.h
ClustersComparer.h
DXDsvd.m
FstUtils.cc
FstUtils.h
LbfgsUtils.h
README.md
Samplers.h
StringUtils.h
WikiExtractor.py
add-prefix-to-conll-column.py
add-word2vec-header.py
align-europarl.sh
alignments-to-dictionary.py
american_english.txt
append-conll-columns.py
arabic-morphanalyzer.py
attach-left-parser.py
attach-right-parser.py
british_english.txt
buckwalter2unicode.py
check-readme.disk.py
check-token-count.py
clear-conll-column-k.py
clip-file-after-kth-match.py
combine-token-label-in-mrg-file.py
combine-token-label-in-one-file.py
common_typos.py
compose-parallel-corpora.py
compute-average-embedding.py
compute-embeddings-coverage.py
conllx-eval.v1_8.pl
convert-bitext-many2many-to-one2many.py
convert-bitext-many2many-to-one2one.py
convert-bitext-one2one-to-one2many.py
convert-closure-embeddings-to-word-embeddings.py
convert-conll-format-to-sent-per-line.py
convert-conll-to-text.py
convert-conllu-to-conll06.py
convert-flat-parses-to-pos-tags.py
convert-libsvm-format-to-arff-format.py
convert-libsvm-sparse-format-to-arff-sparse-format.py
convert-parallel-sent-per-line-to-one-sent-per-line.py
convert-sent-per-line-to-conll-format.py
convert-to-one-target-per-line.py
copy-conll-columns-to-another-file.py
copy-conll-columns.py
create-vocab.py
decode-corpus.py
diff-vocab.py
download-and-process-wikipedia-text.sh
encode-corpus.py
eval-labels.py
extract-wiktionary-translations.py
filter-irrelevant-params.py
filter-long-sent-pairs.py
filter-unused-embeddings.py
filter-word-alignment-parameters.py
flip-parallel-corpus.py
frequently_used_commands.sh
gardner-et-al-emnlp-2015.py
generate-crosslingual-punctuation-mappings.py
head-conll.py
horizontal-split-parallel-corpus.py
induce-code-switches-in-conll.py
induce-typos-in-conll.py
intersect-vocabs.py
iso_639_1_codes.py
letters-to-clusters.py
literal-translate.py
log-miner.py
lowercase-initial.py
lowercase.py
map-words-to-transitive-closures.py
normalize-arabic.py
normalize-brown-cluster-emission-probs.py
normalize-embeddings.py
paste.py
prefix_lines.py
prefix_tokens.py
prepare-data-for-multilingual-embeddings.sh
preprocess-czech.py
preprocess-english.py
preprocess-iris-dataset.py
preprocess-mnist-dataset.py
print-corpus-statistics.py
print-letters-of-unique-words.py
print-unique-words.py
process-typology-features.py
project_foreign_words_to_english_brown_clusters.py
prune-long-lines.py
remove-conll-column-k.py
remove-empty-lines.py
remove-non-latin-words.py
remove-sequences-of-different-length.py
replace-words-in-conll-corpus.py
replace-words-in-monolingual-corpus.py
resize-embeddings.py
score-classes.py
score-vm.py
shuffle-lines.py
split-berg-kirkpatrick-pos-output-into-gold-vs-pred.py
split-corpus-files.py
sswl_features.csv
strip_lines.py
swap-conll-columns.py
symmetrize-word-alignment-parameters.py
task_names.txt
tokenize-on.py
tokenize-parallel.py
train-multilingual-embeddings.sh
trie-encode-corpus.py
tuple.h
uk2us.py
union-vocab.py
unordered_map_serialization.hpp
vertical-split-corpus.py
vertical-split-parallel-corpus.py
wals_features.csv
wiktionary-multilingual-to-bilingual-dictionaries.py
word-alignments-to-dependency-parses.py

README.md

wammar-utils

this repository is designed to be included as a submodule in other repositories

description of utilities:

create-vocab.py

a Python script that extracts the types (distinct tokens) in a text file and assigns each one an integer id.
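The idea can be sketched as follows. This is a minimal illustration, not the script itself: the function name `create_vocab`, whitespace tokenization, and ids assigned from 0 in order of first appearance are all assumptions.

```python
def create_vocab(lines):
    """Map each distinct token (type) to an integer id.

    Assumptions for this sketch: tokens are whitespace-separated and
    ids start at 0, assigned in order of first appearance.
    """
    vocab = {}
    for line in lines:
        for token in line.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


vocab = create_vocab(["the cat sat", "the dog sat"])
for token, idx in vocab.items():
    print(idx, token)
```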

encode-corpus.py

a Python script that replaces each type in the input file with a unique integer id and writes the result to the target file. a second output file contains the id:type mappings.
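A rough sketch of the encoding step, working on in-memory lists of lines rather than files. The function name `encode_corpus`, whitespace tokenization, and 0-based ids are assumptions made for illustration; the actual script's arguments and file formats are not shown here.

```python
def encode_corpus(lines):
    """Replace each token with an integer id; also return the id -> type map.

    Assumptions for this sketch: whitespace tokenization, ids start at 0
    and are assigned in order of first appearance.
    """
    type_to_id = {}
    encoded = []
    for line in lines:
        ids = []
        for token in line.split():
            # Assign the next free id the first time a type is seen.
            if token not in type_to_id:
                type_to_id[token] = len(type_to_id)
            ids.append(str(type_to_id[token]))
        encoded.append(" ".join(ids))
    id_to_type = {idx: tok for tok, idx in type_to_id.items()}
    return encoded, id_to_type


encoded, id_to_type = encode_corpus(["the cat sat", "the dog sat"])
```

Here `encoded` plays the role of the target file and `id_to_type` the role of the mappings file.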

decode-corpus.py

the inverse of encode-corpus.py: maps the integer ids back to their types.
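Decoding can be sketched as a lookup in the id:type mapping. The function name `decode_corpus` and the dictionary form of the mapping are hypothetical; the sketch assumes each encoded line is a whitespace-separated sequence of integer ids.

```python
def decode_corpus(encoded_lines, id_to_type):
    """Map each integer id in each line back to its type (token)."""
    decoded = []
    for line in encoded_lines:
        tokens = [id_to_type[int(idx)] for idx in line.split()]
        decoded.append(" ".join(tokens))
    return decoded


# Hypothetical mapping, as an encoding step might have produced it.
mapping = {0: "the", 1: "cat", 2: "sat", 3: "dog"}
decoded = decode_corpus(["0 1 2", "0 3 2"], mapping)
```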

filter-long-sent-pairs.py

a Python script that filters out long parallel sentence pairs, based on their number of tokens.
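A minimal sketch of this kind of filter. The function name `filter_long_pairs`, the `max_tokens` parameter, and the rule that a pair is dropped when either side exceeds the limit are assumptions, since the exact criterion is not stated here.

```python
def filter_long_pairs(src_lines, tgt_lines, max_tokens):
    """Keep only sentence pairs where both sides have at most max_tokens tokens.

    Assumptions for this sketch: whitespace tokenization, and a pair is
    dropped as soon as either side is longer than max_tokens.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            kept.append((src, tgt))
    return kept


src = ["a b c", "a b c d e f", "x"]
tgt = ["A B", "A B C", "X Y Z W V"]
kept = filter_long_pairs(src, tgt, max_tokens=4)
```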

split-parallel-corpus.py

a Python script that splits a parallel corpus into train/dev/test sets.
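One way such a split could work is sketched below. The function name `split_parallel`, the `dev_size`, `test_size` and `seed` parameters, and the shuffle-then-slice strategy are assumptions; the script may select the sets differently.

```python
import random


def split_parallel(pairs, dev_size, test_size, seed=0):
    """Split (source, target) pairs into train/dev/test lists.

    Assumption for this sketch: pairs are shuffled with a fixed seed,
    then the first dev_size go to dev, the next test_size to test,
    and the remainder to train. Each pair stays aligned.
    """
    if dev_size + test_size > len(pairs):
        raise ValueError("dev_size + test_size exceeds corpus size")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


corpus = [("src %d" % i, "tgt %d" % i) for i in range(10)]
train, dev, test = split_parallel(corpus, dev_size=2, test_size=3)
```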

american_english.txt, british_english.txt

American vs. British English vocabulary collected from http://www.tysto.com/uk-us-spelling-list.html