my idea of metric on diversity. #6

Closed
bojone opened this issue Jun 7, 2019 · 6 comments

bojone commented Jun 7, 2019

In your article, you use the whole test set as the reference and then calculate the BLEU score of each generated sentence. The average of these scores can serve as a metric of generation reality.

Conversely, why not use the whole generated set (with the same number of sentences as the test set) as the reference and then calculate the BLEU score of each test sentence? The average of these scores can serve as a metric of generation diversity.

weilinie (Owner) commented Jun 7, 2019

If I understand correctly, you are referring to self-BLEU. I actually opened issue #27 on Texygen about the self-BLEU metric.
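(For context: self-BLEU scores each generated sentence against all the other generated sentences, so a high value indicates low diversity. A minimal sketch, assuming tokenized sentences and NLTK's sentence_bleu; this is an illustration, not the exact Texygen implementation:)

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(sentences, weights=(0.5, 0.5)):
    # Each sentence is the hypothesis; all *other* sentences are the references.
    smooth = SmoothingFunction().method1
    return np.mean([
        sentence_bleu(sentences[:i] + sentences[i + 1:], hyp,
                      weights=weights, smoothing_function=smooth)
        for i, hyp in enumerate(sentences)
    ])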

bojone (Author) commented Jun 7, 2019

No, it is not self-BLEU.

The BLEU in your work is computed roughly like this:

# pseudocode: bleu() scores one hypothesis sentence against a set of references
np.mean([
    bleu(references=the_whole_test_data, hypothesis=s)
    for s in the_whole_generated_data
])

It can serve as a metric of generation reality.

My idea is to calculate

np.mean([
    bleu(references=the_whole_generated_data, hypothesis=s)
    for s in the_whole_test_data
])

as a metric of generation diversity, where a high score means that everything in the_whole_test_data can be found in the_whole_generated_data.
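For concreteness, here is a minimal runnable sketch of both directions. It assumes tokenized sentences and uses NLTK's sentence_bleu in place of the bleu() above; the data is toy placeholder data:

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def avg_bleu(references, hypotheses, weights=(0.5, 0.5)):
    # Mean BLEU-2 of each hypothesis sentence against the whole reference set.
    return np.mean([
        sentence_bleu(references, h, weights=weights, smoothing_function=smooth)
        for h in hypotheses
    ])

# Toy placeholder data: lists of token lists.
the_whole_test_data = [["a", "cat", "sits"], ["a", "dog", "runs"]]
the_whole_generated_data = [["a", "cat", "sits"], ["a", "cat", "runs"]]

reality = avg_bleu(the_whole_test_data, the_whole_generated_data)    # original metric
diversity = avg_bleu(the_whole_generated_data, the_whole_test_data)  # proposed metric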

weilinie (Owner) commented Jun 7, 2019

Thanks for the explanation; now I see your point. I guess what you have proposed is basically the same as our BLEU metric, since the function bleu() in our case actually calculates the mean of all the BLEU scores between each reference and each hypothesis, so you are just swapping the order of the two for loops.

bojone (Author) commented Jun 8, 2019

Approximately, the original metric checks whether the_whole_generated_data is a subset of the_whole_test_data, while my idea checks whether the_whole_test_data is a subset of the_whole_generated_data.

If both scores are high, it means the_whole_generated_data ⊆ the_whole_test_data and the_whole_test_data ⊆ the_whole_generated_data, indicating the_whole_test_data = the_whole_generated_data.

chenwq95 commented

I have computed Self-BLEU while ensuring that the evaluated data and the reference data are the same set. I think the Texygen issue #27 does not affect me, because I do not reuse the saved "references" in the SelfBleu class.

For COCO, I saved 1,000 sentences and computed Self-BLEU-2 at each epoch. After pretraining, Self-BLEU-2 was around 0.76. After adversarial training for about 10 epochs (3,130 iterations), Self-BLEU-2 rose to about 0.85.

weilinie (Owner) commented

Hmm, this is interesting. Could you please share your code to calculate the self-BLEU score? Thanks!

weilinie closed this as completed Sep 8, 2020