
Word level QE reproduction #40

Closed
zouharvi opened this issue Sep 13, 2019 · 3 comments
Labels: bug (Something isn't working)

Comments

@zouharvi
Contributor

As mentioned in the email we sent to all of the paper's authors, we (@zouharvi, @obo) trained the predictor and then the estimator on our custom data, but the results were almost random.

Describe the bug
While trying to find the problem, we attempted to reproduce the WMT result using your pre-trained models, as described in the documentation. There must be some systematic mistake we're making, because the pre-trained estimator produces almost random results.

To Reproduce
Run the following in an empty directory. The script downloads the model and then tries to estimate the quality of the first sentence from the WMT18 training dataset.

# Download and unpack the pre-trained en-de NMT models (release 0.1.1)
wget https://github.com/Unbabel/OpenKiwi/releases/download/0.1.1/en_de.nmt_models.zip
unzip -n en_de.nmt_models.zip

# Prepare a one-sentence test set: source and its machine translation
mkdir output input
echo "the part of the regular expression within the forward slashes defines the pattern ." > ./input/test.src
echo "der Teil des regulären Ausdrucks innerhalb der umgekehrten Schrägstrich definiert das Muster ." > ./input/test.trg

# Run the pre-trained target estimator on the CPU (--gpu-id -1)
kiwi predict \
    --config ./en_de.nmt_models/estimator/target_1/predict.yaml \
    --load-model ./en_de.nmt_models/estimator/target_1/model.torch \
    --experiment-name "Single line test" \
    --output-dir output \
    --gpu-id -1 \
    --test-source ./input/test.src \
    --test-target ./input/test.trg

# Inspect the predicted word-level tags
cat output/tags

Expected result

OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK OK OK BAD OK OK OK OK OK OK OK OK OK

Of course, the gold annotation contains the extra gap tags, but even disregarding those, most of the sentence is annotated as OK, which seems contrary to the model output (lots of values close to zero). For clarity, stripping the gaps can be done as sketched below.
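This is a rough sketch of how we drop the gap tags, assuming the usual WMT18 layout where the tag sequence alternates gap and word tags (gap word gap ... word gap, i.e. 2N+1 tags for N target words):

# Sketch: strip WMT18 gap tags from the gold annotation above,
# assuming alternating gap/word layout (word tags at odd indices).
gold = ("OK OK OK OK OK OK OK OK OK OK OK OK OK BAD "
        "OK OK OK BAD OK OK OK OK OK OK OK OK OK").split()
word_tags = gold[1::2]       # 13 word tags for the 13 target tokens
print(" ".join(word_tags))   # OK OK OK OK OK OK BAD OK BAD OK OK OK OK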

Actual result

0.04104529693722725 0.013736072927713394 0.011828877963125706 0.014644734561443329 0.022598857060074806 0.10979203879833221 0.8875276446342468 0.711827278137207 0.9585599303245544 0.20660772919654846 0.22217749059200287 0.1782749891281128 0.012791415676474571

Environment (please complete the following information):

  • OS: Fedora 30, Ubuntu 18.04
  • OpenKiwi version 0.1.2
  • Python version 3.7.4
@zouharvi added the bug (Something isn't working) label on Sep 13, 2019
@captainvera
Contributor

Hey @zouharvi, thanks for your interest in OpenKiwi and the detailed issue!

I believe it is not an error that you're making but a misinterpretation of the results.
What we model is the probability of a word being BAD and not the probability of a word being OK. With that in mind, the results you're getting are completely expected 🙂

See below:

Gold tags (gaps removed)

OK OK OK OK OK OK BAD OK BAD OK OK OK OK

Your results

OK OK OK OK OK OK BAD BAD BAD OK OK OK OK

The model is actually only getting one tag wrong (12 of 13 correct), so it's achieving around 92% accuracy. Not too bad!
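In code, the mapping from the probabilities you got to tags looks roughly like this (a quick sketch; the 0.5 threshold is just for illustration, OpenKiwi itself outputs the raw BAD probabilities):

# Sketch only: convert per-word P(BAD) into OK/BAD tags.
# Probabilities are those from output/tags above, rounded to 4 decimals;
# the 0.5 threshold is for illustration, not an OpenKiwi setting.
probs = [0.0410, 0.0137, 0.0118, 0.0146, 0.0226, 0.1098, 0.8875,
         0.7118, 0.9586, 0.2066, 0.2222, 0.1783, 0.0128]
pred = ["BAD" if p > 0.5 else "OK" for p in probs]
gold = "OK OK OK OK OK OK BAD OK BAD OK OK OK OK".split()
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(" ".join(pred))     # OK OK OK OK OK OK BAD BAD BAD OK OK OK OK
print(f"{accuracy:.0%}")  # 92%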

@zouharvi
Contributor Author

Thank you for your reply,

What we model is the probability of a word being BAD

This is somewhat counterintuitive given what we expected from QuEst++ and deepQuest, but we're glad it has been resolved. The data now makes sense. 🙂

We'll rerun our experiments on WMT17 en_de and on our custom cs_de data and let you know the results (cs_de didn't work out last time).

@captainvera
Contributor

Great, glad I could help you guys.

I'll close the issue for now, but feel free to reopen it if you have any doubts about your results with the cs_de data!
