
[QUESTION] Interpretation and comparison of COMET scores across several languages #110

Closed
clairehua1 opened this issue Feb 7, 2023 · 5 comments
Labels
question Further information is requested

Comments

@clairehua1

❓ Questions and Help

Before asking:

  1. Search for similar issues.
  2. Search the docs.

What is your question?

  1. Is there a way to interpret the COMET score other than using it as a ranking system?
  2. For example, we have a French dataset of 500 sentences from TED Talks and a Spanish dataset of 500 sentences from Parliament sessions. We are comparing the COMET scores for French and Spanish, denoted comet_fr and comet_es. If comet_fr > comet_es, does that mean the machine-translation quality for French is better than for Spanish? Is the COMET score comparable across languages? Or is this comparison invalid because the source data is not the same?

Code

What have you tried?

What's your environment?

  • OS: [e.g. iOS, Linux, Win]
  • Packaging [e.g. pip, conda]
  • Version [e.g. 0.5.2.1]
@ricardorei
Collaborator

Hi @clairehua1,

You should avoid comparing scores between languages and even between domains. This applies not just to COMET but to any MT metric.

For example, BLEU, even though it is lexical, depends heavily on the underlying tokenizer, so the results vary a lot between languages.
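A quick illustration of this (a sketch using sacrebleu with made-up sentences; the exact numbers don't matter, only that they move with the tokenizer):

```python
# Sketch: the same hypothesis/reference pair gets different BLEU scores
# depending on the tokenizer, so BLEU is not comparable across tokenizations
# (and by extension across languages that need different tokenizers).
import sacrebleu

hyps = ["Das ist ein kleiner Test."]      # made-up system output
refs = [["Dies ist ein kleiner Test."]]   # one reference stream

for tok in ("13a", "intl", "char"):
    bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok)
    print(f"tokenize={tok}: BLEU = {bleu.score:.1f}")
```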

PS: Even human annotation has a lot of variability between languages and domains. If we want reliable and comparable results, we need to make sure the test conditions are the same (same data, same annotators).

Cheers,
Ricardo

@clairehua1
Author

Thanks for the answer, Ricardo! Is there a way to interpret the COMET score other than using it as a ranking system?

@ricardorei
Collaborator

@clairehua1 For a specific setting (language pair and domain), you could plot the distribution of scores and analyse it by looking at quantiles. The scores usually follow a normal distribution.
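For example, a minimal sketch (assuming the unbabel-comet Python API; the model name and the exact shape of predict()'s output differ a bit between versions):

```python
# Sketch: score a test set with COMET and inspect the distribution of
# segment-level scores (histogram + quantiles).
import numpy as np
import matplotlib.pyplot as plt
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))

# Made-up data; in practice fill src/mt/ref from your own test set.
data = [
    {"src": "Bonjour le monde.", "mt": "Hello world.", "ref": "Hello, world."},
    {"src": "Merci beaucoup.",   "mt": "Thanks much.", "ref": "Thank you very much."},
]

# comet>=2.0 returns an object with .scores; older versions return
# a (segment_scores, system_score) tuple.
scores = np.array(model.predict(data, batch_size=8, gpus=0).scores)

print("quartiles:", np.quantile(scores, [0.25, 0.5, 0.75]))
plt.hist(scores, bins=30)
plt.xlabel("COMET segment score")
plt.ylabel("count")
plt.show()
```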

To give a bit more context, most models are trained to predict a z-normalized direct assessment (a z-score). Z-scores have a mean of 0 and follow a normal distribution, which means that ideally a score of 0 should represent an average translation.

In practice, the distribution of scores (for the default model, wmt20-comet-da) is slightly skewed towards positive values, which means that an average translation is usually assigned a score of 0.5. I have an explanation here.
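To make the z-score idea concrete, a toy sketch (the real WMT pipeline normalizes raw direct assessments per annotator; this just shows the arithmetic):

```python
# Sketch: z-normalizing raw 0-100 direct-assessment (DA) scores.
import numpy as np

raw_da = np.array([78.0, 85.0, 60.0, 92.0, 70.0])  # made-up DA scores
z = (raw_da - raw_da.mean()) / raw_da.std()         # mean 0, unit variance
print(z)  # a z-score of 0 corresponds to an average translation in this set
```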

@ricardorei
Collaborator

ricardorei commented Feb 19, 2023

In the plots above you can see how different the scores are between English-German and English-Hausa. You can also see that the "peak" for German is a bit higher than for Hausa.

Nonetheless, this is expected, since German translations tend to have better quality than Hausa ones.

@ricardorei
Collaborator

[Screenshot attached, 2023-02-19 18:39]
