Hieralign: Symmetric Word Alignment using the Hierarchical Sub-sentential Alignment Method with Various IBM1 Models

Hieralign is yet another symmetric word alignment tool: simple, fast, and unsupervised. It is multi-threaded in all stages, which makes it run as fast as fast_align.

The source code in this repository is provided under the terms of the Apache License, Version 2.0.

Cite paper

Hao Wang and Yves Lepage. Hierarchical Sub-sentential Alignment with IBM Models for Statistical Phrase-based Machine Translation. Journal of Natural Language Processing. Vol.24 No.4 2017.9.

Features:

Pure C++ implementation using C++11 features; no additional libraries required.
Multi-threaded in all stages (using OpenMP, as fast_align does) and well optimized.
Cross-platform CMake file for easy, one-step compilation and installation.
Beam search: the 1-best version of hierarchical sub-sentential alignment has been replaced with a beam-search version.
Fast: less than 2 minutes for 300,000 lines.
Well-formatted code and a simple class structure.

Input format

The input to Hieralign must be tokenized and aligned into parallel sentences. Each line is a source-language sentence and its target-language translation, separated by a triple pipe symbol with leading and trailing whitespace (\t). An example two-sentence English-Japanese parallel corpus is:

  I have a book . 私 は 本 を 持って いた 。
  I do not know the answer    私 は その 答え を 知ら ない
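
The line format described above can be sanity-checked before training. The following is a minimal sketch, not part of Hieralign: the helper name `split_pair` and the `SEP` constant are hypothetical, and the delimiter is an assumption that should be set to whatever separator the corpus file actually uses.

```python
# Assumption: set SEP to the delimiter used in your corpus file.
SEP = "\t"


def split_pair(line, sep=SEP):
    """Split one corpus line into (source_tokens, target_tokens).

    Raises ValueError if the line does not contain exactly one separator
    or if either side has no tokens.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 2:
        raise ValueError(
            f"expected exactly one separator {sep!r}, found {len(parts) - 1}"
        )
    source, target = (side.split() for side in parts)
    if not source or not target:
        raise ValueError("empty source or target sentence")
    return source, target


if __name__ == "__main__":
    src, tgt = split_pair("I have a book ." + SEP + "私 は 本 を 持って いた 。\n")
    print(len(src), len(tgt))
```

Running such a check over every line of the training file catches missing separators and empty sentences early, before the aligner is started.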

Compiling and Installation

Building Hieralign requires a modern C++ compiler and the CMake build system. OpenMP is also required for multi-threaded performance. Please make sure your compiler version (gcc/clang) supports OpenMP before compiling.

To compile, run the following:

cmake . 
make

Usage

Run Hieralign -h to see a list of command-line options. There are three ways to use Hieralign:

As a normal word aligner: Hieralign generates symmetric source-target (i.e., left language-right language) alignments. The command is:

./Hieralign --train train.txt --model vbibm1 > aligned.hier

Training the model in advance: the usual and recommended way to train and store a model is to add the --model_file option:

./Hieralign --train train.txt --model vbibm1 --model_file path_to_store_model

As an online aligner: given a trained model, Hieralign can be run as an online aligner:

./Hieralign --input test.txt --model vbibm1 --model_file path_to_store_model > aligned.hier
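
When Hieralign is driven from a script, the three invocations above can be assembled programmatically. This is a sketch under stated assumptions: the function names are hypothetical, the binary is assumed to live at `./Hieralign`, and only the options shown above (`--train`, `--input`, `--model`, `--model_file`) are used.

```python
import subprocess


def train_command(corpus, model="vbibm1", model_file=None, binary="./Hieralign"):
    """Build the argument list for training (and optionally storing) a model."""
    cmd = [binary, "--train", corpus, "--model", model]
    if model_file is not None:
        cmd += ["--model_file", model_file]
    return cmd


def align_command(test_file, model_file, model="vbibm1", binary="./Hieralign"):
    """Build the argument list for aligning new text with a stored model."""
    return [binary, "--input", test_file, "--model", model,
            "--model_file", model_file]


def run_to_file(cmd, out_path):
    """Run the command and redirect its standard output to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Prints the command without running it.
    print(" ".join(train_command("train.txt", model_file="path_to_store_model")))
```

Calling `run_to_file(train_command("train.txt"), "aligned.hier")` would correspond to the first command above, with the shell redirection replaced by the `stdout` argument.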

Output

Hieralign writes its output in the widely used i-j "Pharaoh format", where a pair i-j indicates that the i-th word (zero-indexed) of the left (source) sentence is aligned to the j-th word of the right (target) sentence. The alignments are already symmetrized. For example, good alignments for the two sentences above would be:

  0-0 0-1 1-4 1-5 3-2 4-6
  0-0 1-6 2-6 3-5 4-2 5-3
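
To read the output back, each i-j pair can be mapped onto the tokens of the corresponding sentence pair. The sketch below is illustrative only; `parse_pharaoh` and `aligned_words` are hypothetical helper names, not part of Hieralign.

```python
def parse_pharaoh(line):
    """Parse one Pharaoh-format line into a list of (i, j) index pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(source, target, alignment):
    """Return (source_word, target_word) for every i-j link in the alignment."""
    src, tgt = source.split(), target.split()
    return [(src[i], tgt[j]) for i, j in parse_pharaoh(alignment)]


if __name__ == "__main__":
    links = aligned_words(
        "I have a book .",
        "私 は 本 を 持って いた 。",
        "0-0 0-1 1-4 1-5 3-2 4-6",
    )
    for s, t in links:
        print(s, t)
```

For the first example sentence this links "I" to 私 and は, "have" to 持って and いた, "book" to 本, and the final period to 。, while "a" and を remain unaligned.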
