About the fatal weakness of the Embedding-based metric #23

Closed
g32M7fT6b8Y opened this issue Nov 24, 2019 · 2 comments

Comments


g32M7fT6b8Y commented Nov 24, 2019

Hi, thank you for your wonderful repo.
In my view, BERTScore is a kind of embedding-based metric for measuring response quality, similar to Embedding-Average and Greedy Matching.
After trying Embedding-Average, Greedy Matching, Vector Extrema, and BERTScore, I found that the average scores of these embedding-based metrics are very high (0.817 on average over the DailyDialog and Cornell datasets). In this case, any response, even a very bad one, can achieve a "good" score, and the difference between "good" and "bad" responses is very small.
I attribute this to the "fuzzy" nature of word-embedding representations, so I think embedding-based metrics are not very appropriate for measuring the performance of generative models such as dialogue systems and NMT.
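For concreteness, here is a minimal sketch of the kind of Embedding-Average metric I am referring to; the `embeddings` lookup (e.g., pre-trained GloVe vectors) is just an assumption for illustration, not the exact setup I used:

```python
import numpy as np

def embedding_average_score(candidate, reference, embeddings):
    """Cosine similarity between the mean word vectors of two sentences.

    `embeddings` is assumed to be a dict mapping a token to a 1-D numpy
    vector (e.g., loaded from pre-trained GloVe); out-of-vocabulary
    tokens are skipped.
    """
    dim = len(next(iter(embeddings.values())))

    def sentence_vector(sentence):
        vecs = [embeddings[t] for t in sentence.lower().split() if t in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    c, r = sentence_vector(candidate), sentence_vector(reference)
    denom = np.linalg.norm(c) * np.linalg.norm(r)
    return float(np.dot(c, r) / denom) if denom else 0.0
```

Because averaging washes out word order and most sentence vectors point in broadly similar directions, even unrelated pairs tend to get a high cosine similarity, which is the score inflation I describe above.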

What do you think about this issue? And how could it be alleviated?

I hope to hear from you. Thanks.


Tiiiger (Owner) commented Nov 29, 2019

Hi @g32M7fT6b8Y,

We believe this is mostly an issue of usage rather than a weakness of the method itself. Indeed, we have found that BERTScore computed with deep contextual embedding models can sometimes have a small numerical range (also pointed out in #20). However, this does not mean that BERTScore cannot distinguish bad candidates (bad responses in your case) from good candidates: if we rank the candidates, the good candidates score higher than the bad ones. On this note, we also refer you to the correlation studies in our paper.



At the same time, we don’t want to simply ignore this “numerical range” problem, because it hinders the readability of the scores. After several rounds of consideration, here is what we propose:


We take a large monolingual corpus and randomly pair up sentences as candidate-reference pairs. When we evaluate these pairs with BERTScore, the average output score should serve as a lower bound, because each candidate and reference are irrelevant to each other. We propose to use this lower bound as a baseline to rescale BERTScore: we subtract the lower bound from a raw BERTScore and divide the difference by (1 - lower bound).
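As a rough sketch of this rescaling (the function name and variables here are only illustrative, not the final API):

```python
def rescale_bertscore(raw_score, baseline):
    """Linearly rescale a raw BERTScore against an empirical lower bound.

    `baseline` is the average BERTScore of randomly paired (and hence
    unrelated) candidate-reference sentences from a large monolingual
    corpus. After rescaling, unrelated pairs sit around 0, a perfect
    match stays at 1, and pairs scoring below the baseline can go
    slightly negative.
    """
    return (raw_score - baseline) / (1.0 - baseline)

# e.g. with the RoBERTa-Large baseline of 0.83 mentioned below,
# a raw score of 0.93 rescales to (0.93 - 0.83) / 0.17, about 0.59
```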

Some numbers: on the WMT17 news crawl English corpus, a lower bound for BERTScore computed with RoBERTa-Large is 0.83. With this rescaling, the average BERTScore on the WMT18 De-En translation evaluation dataset drops from 0.9311 to 0.5758. For a concrete example, consider the case mentioned in #20. Before rescaling, the score distribution looks like this:

[figure: score distribution before rescaling]

After rescaling, the distribution looks like this:

[figure: score distribution after rescaling]



Note that this modification only changes the range of BERTScore; because it is a monotonic linear transformation, it won’t affect BERTScore’s correlation with human judgment. We are currently adding software support for this in the repo. Stay tuned, and we’ll push this change in a new version soon.
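Until that support is released, one way to approximate the rescaling on top of the existing `bert_score.score` API would be something along these lines (the parameter names follow the current bert-score release, and the baseline value is just the rough RoBERTa-Large number quoted above; it would need to be recomputed for whichever model and language you use):

```python
from bert_score import score

cands = ["the cat sat on the mat", "a completely unrelated reply"]
refs = ["there is a cat on the mat", "there is a cat on the mat"]

# Raw precision / recall / F1 tensors from the released package.
P, R, F1 = score(cands, refs, lang="en")

# Manual rescaling with an assumed empirical baseline (see above).
baseline = 0.83
F1_rescaled = (F1 - baseline) / (1.0 - baseline)
print(F1_rescaled.tolist())
```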



I am closing this issue, but feel free to continue the thread here.

@gmftbyGMFTBY

Thank you for your response.
I think this may be an appropriate way to alleviate the issue.
I can't wait to try the new version of BERTScore.
