Efficient Multilingual Gibbs Aligner
This is an implementation of an improved version of Östling (2014), using a partially collapsed sampler for better parallelization. The current code uses OpenMP for parallelizing sampling on a multi-core system.
Preliminary experiments showed that mixing is too slow for this sampler to be practical with more than a few dozen languages, so work has been abandoned. The code is published here in case anyone wants to experiment with it.
Parts of the codebase have been turned into the efficient bitext aligner eflomal, which can be used to perform pairwise word alignment -- in many cases a better solution than using the Östling (2014) multilingual alignment method.
Authors: Robert Östling and Claire De Maricourt.
You need a multi-parallel corpus without gaps, i.e. translations of each of `n_sentences` sentences into each of `n_languages` languages.
The text of each language is described by a file of this format:

```
language_id n_sentences vocabulary_size
sent1_length token1 token2 token3 [...]
sent2_length token1 token2 token3 [...]
[...]
```

where all of the items are decimal integers.
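As an illustration, a file in this format could be produced from tokenized text roughly as follows. The helper name, variable names, and the choice of 0-based token IDs are all assumptions made for this sketch, not part of the codebase:

```python
def encode_language(language_id, sentences, out_path):
    """Write tokenized sentences in the integer format sketched above.

    Hypothetical helper: `sentences` is a list of token lists, and
    token IDs are assigned in order of first occurrence, starting
    from 0 (whether IDs start at 0 or 1 is an assumption here).
    """
    vocab = {}
    encoded = []
    for tokens in sentences:
        # setdefault assigns the next free ID to each unseen token
        encoded.append([vocab.setdefault(tok, len(vocab)) for tok in tokens])
    with open(out_path, 'w') as f:
        # Header line: language_id n_sentences vocabulary_size
        f.write('%d %d %d\n' % (language_id, len(encoded), len(vocab)))
        for ids in encoded:
            # Sentence line: length followed by the token IDs
            f.write(' '.join(str(x) for x in [len(ids)] + ids) + '\n')

encode_language(0, [['a', 'rose', 'is', 'a', 'rose'], ['a', 'rose']],
                'language_1.file')
```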
These should be summarized in a corpus description file of the following format:

```
n_languages
language_1.file
language_2.file
[...]
language_n.file
```
where `n_languages` is a decimal integer, and the filenames correspond to the encoded texts for each language as described above.
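For example, a corpus with three languages could be described by a file like this (saved as, say, `corpus.txt`; all filenames here are made up):

```
3
english.file
german.file
french.file
```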
Next, call the main program:
```
./efmugal corpus-description-file concept-initializations-file n_iterations
```
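With the hypothetical files from the example above and 1,000 sampling iterations, this might look like the following (using the English text as the concept initialization, which is permitted as explained below):

```
./efmugal corpus.txt english.file 1000
```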
The `concept-initializations-file` is of the same format as (and can indeed be one of) the encoded text files. It could also be randomly initialized, but this likely leads to suboptimal results. There is a script `random_like.py` for producing random texts in the correct format.
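For reference, a randomly initialized file of the right shape could be generated along these lines. This is an illustrative sketch under the same 0-based token ID assumption as above, not the actual `random_like.py` script, whose behaviour may differ:

```python
import random

def random_like(in_path, out_path, seed=0):
    """Write a randomly re-tokenized copy of an encoded text file.

    Keeps the header and sentence lengths of `in_path`, but replaces
    every token ID with a uniformly random ID from the same vocabulary.
    A hypothetical sketch; the bundled random_like.py may differ.
    """
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, 'w') as fout:
        language_id, n_sentences, vocab_size = map(int, fin.readline().split())
        fout.write('%d %d %d\n' % (language_id, n_sentences, vocab_size))
        for line in fin:
            # Keep the original sentence length, randomize the tokens
            length = int(line.split()[0])
            tokens = [rng.randrange(vocab_size) for _ in range(length)]
            fout.write(' '.join(str(x) for x in [length] + tokens) + '\n')

random_like('english.file', 'random-concepts.file')
```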