my idea of metric on diversity. #6

Closed
bojone opened this issue Jun 7, 2019 · 6 comments

bojone commented Jun 7, 2019

In your article, you use the whole test set as the reference and then calculate the BLEU score of each generated sentence. The average of these scores can serve as a metric of generation reality.

Conversely, why not use the whole generated set (with the same number of sentences as the test set) as the reference and then calculate the BLEU score of each test sentence? The average of these scores can serve as a metric of generation diversity.

weilinie (Owner) commented Jun 7, 2019

If I understand correctly, you are referring to self-BLEU. I actually opened issue #27 on Texygen about the self-BLEU metric.
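(For context: self-BLEU scores each generated sentence against all the other generated sentences, so a high value indicates low diversity. A minimal sketch, assuming tokenized sentences and NLTK's sentence_bleu; this is an illustration, not the exact Texygen implementation:)

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(sentences, weights=(0.5, 0.5)):
    # Each sentence is the hypothesis; all *other* sentences are the references.
    smooth = SmoothingFunction().method1
    return np.mean([
        sentence_bleu(sentences[:i] + sentences[i + 1:], hyp,
                      weights=weights, smoothing_function=smooth)
        for i, hyp in enumerate(sentences)
    ])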

bojone (Author) commented Jun 7, 2019

No, it is not self-BLEU.

The BLEU in your work is computed roughly like this:

# pseudocode: bleu() scores one hypothesis sentence against a set of references
np.mean([
    bleu(references=the_whole_test_data, hypothesis=s)
    for s in the_whole_generated_data
])

It can serve as a metric of generation reality.

My idea is to calculate

np.mean([
    bleu(references=the_whole_generated_data, hypothesis=s)
    for s in the_whole_test_data
])

as a metric of generation diversity, where a high score means that everything in the_whole_test_data can be found in the_whole_generated_data.
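For concreteness, here is a minimal runnable sketch of both directions. It assumes tokenized sentences and uses NLTK's sentence_bleu in place of the bleu() above; the data is toy placeholder data:

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def avg_bleu(references, hypotheses, weights=(0.5, 0.5)):
    # Mean BLEU-2 of each hypothesis sentence against the whole reference set.
    return np.mean([
        sentence_bleu(references, h, weights=weights, smoothing_function=smooth)
        for h in hypotheses
    ])

# Toy placeholder data: lists of token lists.
the_whole_test_data = [["a", "cat", "sits"], ["a", "dog", "runs"]]
the_whole_generated_data = [["a", "cat", "sits"], ["a", "cat", "runs"]]

reality = avg_bleu(the_whole_test_data, the_whole_generated_data)    # original metric
diversity = avg_bleu(the_whole_generated_data, the_whole_test_data)  # proposed metric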

weilinie (Owner) commented Jun 7, 2019

Thanks for the explanation; now I see your point. I guess what you have proposed is basically the same as our BLEU metric, since the function bleu() in our case actually calculates the mean of all the BLEU scores between each reference and each hypothesis, so you are just swapping the order of the two for loops.

bojone (Author) commented Jun 8, 2019

Approximately, the original metric checks whether the_whole_generated_data is a subset of the_whole_test_data, while my idea checks whether the_whole_test_data is a subset of the_whole_generated_data.

If both scores are high, it means the_whole_generated_data ⊆ the_whole_test_data and the_whole_test_data ⊆ the_whole_generated_data, indicating the_whole_test_data = the_whole_generated_data.

chenwq95 commented

I have computed Self-BLEU while ensuring that the evaluated data and the reference data are the same set. I think the Texygen issue #27 does not affect me, because I do not reuse the saved "references" in the SelfBleu class.

For COCO, I saved 1,000 sentences and computed Self-BLEU-2 at each epoch. After pretraining, Self-BLEU-2 was around 0.76. After adversarial training for about 10 epochs (3,130 iterations), Self-BLEU-2 rose to about 0.85.

weilinie (Owner) commented

Hmm, this is interesting. Could you please share your code to calculate the self-BLEU score? Thanks!

weilinie closed this as completed Sep 8, 2020