
Word level QE reproduction #40

Closed
zouharvi opened this issue Sep 13, 2019 · 3 comments
Labels: bug (Something isn't working)

Comments

@zouharvi
Contributor

As mentioned in the email we sent to all of the paper's authors, we (@zouharvi, @obo) trained the predictor and then the estimator on our custom data, but the results were almost random.

Describe the bug
While trying to find the problem, we attempted to reproduce the WMT result using your pre-trained models, as described in the documentation. There must be some systematic mistake we're making, because the pre-trained estimator produces almost random results.

To Reproduce
Run the following in an empty directory. The script downloads the model and then tries to estimate the quality of the first sentence from the WMT18 training dataset.

# Download and unpack the pre-trained en-de NMT models (release 0.1.1)
wget https://github.com/Unbabel/OpenKiwi/releases/download/0.1.1/en_de.nmt_models.zip
unzip -n en_de.nmt_models.zip

# Prepare a one-sentence test set: source and its machine translation
mkdir output input
echo "the part of the regular expression within the forward slashes defines the pattern ." > ./input/test.src
echo "der Teil des regulären Ausdrucks innerhalb der umgekehrten Schrägstrich definiert das Muster ." > ./input/test.trg

# Run the pre-trained target estimator on the CPU (--gpu-id -1)
kiwi predict \
    --config ./en_de.nmt_models/estimator/target_1/predict.yaml \
    --load-model ./en_de.nmt_models/estimator/target_1/model.torch \
    --experiment-name "Single line test" \
    --output-dir output \
    --gpu-id -1 \
    --test-source ./input/test.src \
    --test-target ./input/test.trg

# Inspect the predicted word-level tags
cat output/tags

Expected result

OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK OK OK BAD OK OK OK OK OK OK OK OK OK

Of course, the gold annotation contains the extra gap tags, but even disregarding those, most of the sentence is annotated as OK, which seems contrary to the model output (lots of values close to zero). For clarity, stripping the gaps can be done as sketched below.
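This is a rough sketch of how we drop the gap tags, assuming the usual WMT18 layout where the tag sequence alternates gap and word tags (gap word gap ... word gap, i.e. 2N+1 tags for N target words):

# Sketch: strip WMT18 gap tags from the gold annotation above,
# assuming alternating gap/word layout (word tags at odd indices).
gold = ("OK OK OK OK OK OK OK OK OK OK OK OK OK BAD "
        "OK OK OK BAD OK OK OK OK OK OK OK OK OK").split()
word_tags = gold[1::2]       # 13 word tags for the 13 target tokens
print(" ".join(word_tags))   # OK OK OK OK OK OK BAD OK BAD OK OK OK OK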

Actual result

0.04104529693722725 0.013736072927713394 0.011828877963125706 0.014644734561443329 0.022598857060074806 0.10979203879833221 0.8875276446342468 0.711827278137207 0.9585599303245544 0.20660772919654846 0.22217749059200287 0.1782749891281128 0.012791415676474571

Environment (please complete the following information):

  • OS: Fedora 30, Ubuntu 18.04
  • OpenKiwi version 0.1.2
  • Python version 3.7.4
@zouharvi added the bug (Something isn't working) label on Sep 13, 2019
@captainvera
Contributor

Hey @zouharvi, thanks for your interest in OpenKiwi and the detailed issue!

I believe it is not an error that you're making but a misinterpretation of the results.
What we model is the probability of a word being BAD and not the probability of a word being OK. With that in mind, the results you're getting are completely expected 🙂

See below:

Gold tags (gaps removed)

OK OK OK OK OK OK BAD OK BAD OK OK OK OK

Your results

OK OK OK OK OK OK BAD BAD BAD OK OK OK OK

The model is actually only getting one tag wrong (12 of 13 correct), so it's achieving around 92% accuracy. Not too bad!
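In code, the mapping from the probabilities you got to tags looks roughly like this (a quick sketch; the 0.5 threshold is just for illustration, OpenKiwi itself outputs the raw BAD probabilities):

# Sketch only: convert per-word P(BAD) into OK/BAD tags.
# Probabilities are those from output/tags above, rounded to 4 decimals;
# the 0.5 threshold is for illustration, not an OpenKiwi setting.
probs = [0.0410, 0.0137, 0.0118, 0.0146, 0.0226, 0.1098, 0.8875,
         0.7118, 0.9586, 0.2066, 0.2222, 0.1783, 0.0128]
pred = ["BAD" if p > 0.5 else "OK" for p in probs]
gold = "OK OK OK OK OK OK BAD OK BAD OK OK OK OK".split()
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(" ".join(pred))     # OK OK OK OK OK OK BAD BAD BAD OK OK OK OK
print(f"{accuracy:.0%}")  # 92%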

@zouharvi
Contributor Author

Thank you for your reply,

What we model is the probability of a word being BAD

This is somewhat counterintuitive given what we expected from QuEst++ and deepQuest, but we're glad it has been resolved. The data now makes sense. 🙂

We'll rerun our experiments on WMT17 en_de and on our custom cs_de data and let you know the results (cs_de didn't work out last time).

@captainvera
Contributor

Great, glad I could help you guys.

I'll close the issue for now, but feel free to reopen it if you have any doubts about your results with the cs_de data!
