
How to calculate all BLEU scores during evaluation #37

Closed
HarryDresden1 opened this issue Jan 20, 2019 · 9 comments

Comments

@HarryDresden1

Hi,

Thanks for the well-documented code and tutorial. I trained my model from scratch using your code. Now that I want to evaluate it, I'm not sure how to get all the BLEU scores, not just BLEU-4 as currently computed in eval.py.

@kmario23
Contributor

kmario23 commented Jan 21, 2019

The current codebase uses NLTK to calculate BLEU-4 scores. However, BLEU-1 to BLEU-n can easily be implemented if you want to do that yourself. If you'd rather not, you can simply use NLTK, which provides a nice interface for this (see the code below).

Here is the explanation of how BLEU score computation is defined:

BLEU-n is just the geometric average of the n-gram precisions.

(Precisely, it's string matching at different n-gram levels between references and hypotheses; that's why this metric has drawn much criticism. But people still use it anyway because it has stuck with the community for ages.)

For example, BLEU-1 is simply the unigram precision, BLEU-2 is the geometric average of unigram and bigram precision, BLEU-3 is the geometric average of unigram, bigram, and trigram precision and so on.
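To make the geometric-average definition concrete, here is a tiny hand computation (the precision values are made up for illustration, and the brevity penalty is ignored):

```python
import math

# Hypothetical n-gram precisions, not from any real corpus.
p1 = 0.8  # unigram precision
p2 = 0.5  # bigram precision

# BLEU-2 (without the brevity penalty) is the geometric mean of p1 and p2,
# i.e. exp of the average of the log-precisions.
bleu2 = math.exp((math.log(p1) + math.log(p2)) / 2)
assert abs(bleu2 - math.sqrt(p1 * p2)) < 1e-12  # same thing for n = 2
print(round(bleu2, 4))  # 0.6325
```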


Having said that, if you want to compute specific n-gram BLEU scores, you have to pass a weights parameter when you call corpus_bleu. Note that if you don't pass this weights parameter, BLEU-4 scores are returned by default, which is what happens in the evaluation here.

To compute BLEU-1, you can call corpus_bleu with weights as

weights = (1.0/1.0,)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-2, you can call corpus_bleu with weights as

weights = (1.0/2.0, 1.0/2.0)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-3, you can call corpus_bleu with weights as

weights = (1.0/3.0, 1.0/3.0, 1.0/3.0)
corpus_bleu(references, hypotheses, weights)

To compute BLEU-5, you can call corpus_bleu with weights as

weights = (1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0)
corpus_bleu(references, hypotheses, weights)
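The calls above can be wrapped in one short runnable loop (toy sentences here, not the repo's actual references/hypotheses; assumes NLTK is installed):

```python
# Compute BLEU-1 through BLEU-4 on a toy corpus by varying the weights.
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference, both tokenized (toy data).
references = [[['the', 'cat', 'sat', 'on', 'the', 'mat']]]
hypotheses = [['the', 'cat', 'is', 'on', 'the', 'mat']]

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses, weights)
    # Scores shrink as n grows; BLEU-4 is 0 here because no 4-gram matches
    # (NLTK emits a UserWarning in that case).
    print(f'BLEU-{n}: {score:.4f}')
```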

Here is a demonstration using a toy example adapted from the NLTK webpage:

[Screenshot: "bleu-n-grams" — BLEU scores on a toy example for increasing n]

Note how the BLEU score keeps decreasing as we increase n in the n-grams via the weights parameter. Also, note how omitting the weights parameter yields the same score as passing the quadrigram weights, because that's the default weight NLTK uses if we don't pass one.
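That default-weights claim can be checked directly (toy sentence adapted from the NLTK documentation; assumes NLTK is installed):

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy example adapted from the NLTK documentation.
references = [[['it', 'is', 'a', 'guide', 'to', 'action', 'that',
                'ensures', 'that', 'the', 'military', 'will', 'forever',
                'heed', 'party', 'commands']]]
hypotheses = [['it', 'is', 'a', 'guide', 'to', 'action', 'which',
               'ensures', 'that', 'the', 'military', 'always',
               'obeys', 'the', 'commands', 'of', 'the', 'party']]

# No weights argument: NLTK defaults to (0.25, 0.25, 0.25, 0.25), i.e. BLEU-4.
default_score = corpus_bleu(references, hypotheses)
bleu4_score = corpus_bleu(references, hypotheses,
                          weights=(0.25, 0.25, 0.25, 0.25))
assert default_score == bleu4_score
print(f'BLEU-4 (default weights): {default_score:.4f}')
```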


Refer to this page for more information on the NLTK BLEU score implementation.

@HarryDresden1
Author

Thank you very much for your explanation

@kmario23
Contributor

@sgrvinod Maybe it's a good idea to incorporate this into the documentation somewhere? Someone might want such a comprehensive evaluation to report.

Do you think the Remarks section would be an apt place?

@kmario23
Contributor

@sgrvinod ping!

@sgrvinod
Owner

sgrvinod commented Mar 16, 2019

Oops, didn't see this. Yes, it's a good idea, I'll add it tomorrow with credit to you, thanks!

I think the entire detailed explanation is too long for the Remarks section. I'll either link to your post here from the Remarks section, or add a question to the FAQ with your answer (and crediting you), or both. You could also submit a pull request if you wish, and I'll make minor edits to it if needed.

@kmario23
Contributor

@sgrvinod done!

@sgrvinod
Owner

Merged #52.

@forence

forence commented Mar 22, 2019

@kmario23 Thanks for your brilliant explanation; I got the procedure to calculate BLEU with NLTK. But I'm still confused: if I have 3 references and only 1 hypothesis, does the tool calculate the <ref, hyp> pairs one by one, and then take the mean of them, or the maximum?

@kmario23
Contributor

kmario23 commented Mar 22, 2019

Hello @forence, thanks! Contrary to our intuition, that's not how the BLEU score is computed. Luckily, the paper that proposed BLEU is very well written (and easy to understand). Please have a look at Section 2 of BLEU: a Method for Automatic Evaluation of Machine Translation for how they compute a modified unigram precision, which is better than simple precision.
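For intuition, modified (clipped) unigram precision from Section 2 of the paper can be sketched in a few lines, using the paper's own example where a degenerate hypothesis repeats one word (this is a sketch of the idea, not NLTK's internal code):

```python
from collections import Counter

def modified_unigram_precision(hypothesis, references):
    """Clipped unigram precision, as in Section 2 of the BLEU paper."""
    hyp_counts = Counter(hypothesis)
    # For each word, the clip ceiling is its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

hyp = 'the the the the the the the'.split()
refs = ['the cat is on the mat'.split(),
        'there is a cat on the mat'.split()]
# 'the' appears at most twice in a single reference, so only 2 of the
# 7 hypothesis words are counted: 2/7.
print(modified_unigram_precision(hyp, refs))  # 0.2857142857142857
```

Note that the clipping is done against each single reference, not against the pool of all references merged together; that is exactly why a hypothesis can't game the metric by repeating a common word.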
