This project implements a Lucene-based [1] translation memory with BLEU rescoring, as described in "Multi-Domain Neural Machine Translation through Unsupervised Adaptation" [2].
A Java JDK installation is required. The project is tested with JDK 8.
To build the project, run:
./gradlew installDist
This should produce the build target:
./build/install/tm
You can run it by passing arguments to the generated bat script:
./build/install/tm/bin/tm.bat --port 8080 --bleu-rescoring-threshold 0.05 --index-dir my_index
Alternatively, you can run it directly from Gradle:
./gradlew run --args="--port 8080 --bleu-rescoring-threshold 0.05 --index-dir my_index"
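In our case, we used the jar file from Python 3 via JPype:
import jpype
import jpype.imports
from jpype.types import *

# Put the project jar on the classpath before starting the JVM
jpype.addClassPath('build/libs/tm.jar')
jpype.startJVM(convertStrings=False)

# Java packages can be imported only once the JVM is running
import java.lang
from java.lang import System
from com import LuceneSentenceSearch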
To save a translation pair over HTTP:
curl \
--header "Content-Type: application/json" \
--request POST \
--data '{"source":"Hello World !", "target": "Sveika pasaule!", "meta": {"uid": "Artūrs", "srclang": "en"}}' \
http://localhost:8080/save
Response:
{
"errorMessage": null,
"status": "OK"
}
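The same save request can also be prepared from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names build_save_request and send are hypothetical, and the base URL assumes the server was started with --port 8080 on the local machine.

```python
import json
import urllib.request


def build_save_request(source, target, uid, srclang,
                       base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /save endpoint.

    The payload layout is copied from the curl example; base_url is an
    assumption about where the server is listening.
    """
    payload = {
        "source": source,
        "target": target,
        "meta": {"uid": uid, "srclang": srclang},
    }
    return urllib.request.Request(
        url=base_url + "/save",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_save_request("Hello World !", "Sveika pasaule!", "Artūrs", "en")
# send(request) would contact the server; it is not called here so that the
# snippet also runs when no server is available.
```

Encoding the body as UTF-8 explicitly keeps non-ASCII values such as the uid intact.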
To query the translation memory for similar sentences:
curl \
--header "Content-Type: application/json" \
--request POST \
--data '{"input":"Hello World !", "meta": {"uid": "Artūrs", "srclang": "en"}}' \
http://localhost:8080/get
Response:
{
"sourceContext" : [ "Hello World !", "Hello Worlds !" ],
"targetContext" : [ "Sveika pasaule!", "Sveiki pasaules!" ],
"status" : "OK",
"errorMessage" : null
}
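A client will usually want each retrieved source sentence together with its translation. The sketch below parses a reply shaped like the one above; it assumes that sourceContext and targetContext are aligned by position, which the example suggests but which is not stated explicitly. The function name pair_matches is hypothetical.

```python
import json


def pair_matches(response_text):
    """Return (source, target) pairs from a /get reply.

    Assumes the two context lists are aligned by index. Raises if the
    reply reports a non-OK status or the lists differ in length.
    """
    reply = json.loads(response_text)
    if reply.get("status") != "OK":
        raise RuntimeError(reply.get("errorMessage") or "request failed")
    sources = reply.get("sourceContext") or []
    targets = reply.get("targetContext") or []
    if len(sources) != len(targets):
        raise ValueError("sourceContext and targetContext differ in length")
    return list(zip(sources, targets))


# The reply from the example above, used as sample input.
sample_reply = """
{
  "sourceContext": ["Hello World !", "Hello Worlds !"],
  "targetContext": ["Sveika pasaule!", "Sveiki pasaules!"],
  "status": "OK",
  "errorMessage": null
}
"""

pairs = pair_matches(sample_reply)
for source, target in pairs:
    print(source, "->", target)
```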
To delete the entries stored for a given uid:
curl \
--header "Content-Type: application/json" \
--request POST \
--data '{"uid": "Artūrs"}' \
http://localhost:8080/delete
Response:
{
"errorMessage": null,
"status": "OK"
}
createIndexInDir("/tmp", "lv")
- initializes a translation memory with Latvian as the source language, stored in /tmp
addFileToIndex(srcFile, trgFile, "IT")
- loads the contents of two parallel files into the translation memory for the domain IT
queryTM(String query_sentence, String domain, boolean skipBleuRescorer, int numberOfCandidates)
- retrieves at most numberOfCandidates sentences from the TM that are similar to the query with respect to TF-IDF over stemmed tokens; if skipBleuRescorer is false, BLEU rescoring is also applied to refine the results further
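To illustrate what BLEU rescoring of retrieved candidates can look like, here is a minimal Python sketch. It is not the project's Java implementation: the smoothing scheme (add-one for bigrams and longer n-grams), whitespace tokenization, and the reading of the threshold as a minimum score are all assumptions made for the example, and the names sentence_bleu and rescore are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU of candidate against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            precision = overlap / total
        else:
            # Add-one smoothing so short sentences do not score zero.
            precision = (overlap + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precision)


def rescore(query, candidates, threshold=0.05, limit=10):
    """Rank (source, target) pairs by BLEU against the query.

    Pairs scoring below the threshold are dropped (assumed semantics).
    """
    scored = [(sentence_bleu(src, query), src, trg) for src, trg in candidates]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:limit]


# Illustrative candidates, as if returned by the TF-IDF retrieval step.
candidates = [
    ("Good morning", "Labrīt"),
    ("Hello Worlds !", "Sveiki pasaules!"),
    ("Hello World !", "Sveika pasaule!"),
]
ranked = rescore("Hello World !", candidates)
for score, source, target in ranked:
    print(round(score, 3), source, "->", target)
```

The exact match ranks first, the near match second, and the unrelated sentence is filtered out because it shares no words with the query.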
[1] McCandless, Michael, et al. Lucene in action. Vol. 2. Greenwich: Manning, 2010.
[2] Farajian, M. Amin, et al. "Multi-domain neural machine translation through unsupervised adaptation." Proceedings of the Second Conference on Machine Translation. 2017.