t-test is two-tailed instead of one-tailed #104

erip · 2022-12-13T21:39:42Z

🐛 Bug

The code that performs the paired t-test is two-tailed instead of one-tailed. The alternative hypothesis that users typically care about is that the baseline mean score is less than sys1's mean score, but the test does not reflect that.

To Reproduce

See here :-)

Expected behaviour

The test should probably use alternative="less" or otherwise be configurable.
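To illustrate the difference, here is a minimal sketch using `scipy.stats.ttest_rel`, which accepts an `alternative` argument (SciPy 1.6 or later). The scores are made-up illustration data, and the variable names are assumptions rather than anything taken from the project's code.

```python
from scipy import stats

# Hypothetical segment-level scores for the same eight segments.
baseline = [0.70, 0.65, 0.72, 0.68, 0.74, 0.66, 0.71, 0.69]
sys1 = [0.73, 0.69, 0.71, 0.72, 0.78, 0.70, 0.74, 0.70]

# Two-sided: H1 is "the means differ", in either direction.
two_sided = stats.ttest_rel(baseline, sys1, alternative="two-sided")

# One-sided: H1 is "the baseline mean is less than the sys1 mean".
less = stats.ttest_rel(baseline, sys1, alternative="less")

print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"less p      = {less.pvalue:.4f}")
```

Because the t statistic is negative here (the baseline scores lower on average), the one-sided p-value is half the two-sided one, so the same data can cross a significance threshold under one alternative and not the other.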


Environment

OS: [e.g. iOS, Linux, Win] N/A
Packaging [e.g. pip, conda] N/A
Version [e.g. 0.5.2.1] N/A

Additional context


ricardorei · 2022-12-19T12:15:53Z

I am adding a flag for t_test alternative and setting it by default to "less".

ricardorei · 2022-12-19T12:16:10Z

This will be available on the next release

erip · 2022-12-19T13:06:33Z

Perhaps for the sake of not breaking the API, setting the default to "two-sided" might be a better option until the next major release? I'd hate to be the cause of people's papers comparing apples and oranges. I think with the ability to configure it, this should cover the issue well.
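A configurable option that keeps the old behaviour by default could look roughly like the sketch below. The flag name `--t_test_alternative` and the `build_parser` helper are hypothetical and only for illustration; the three choices mirror the values accepted by SciPy's `alternative` argument.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical flag for the t-test alternative."""
    parser = argparse.ArgumentParser(description="Compare two systems.")
    parser.add_argument(
        "--t_test_alternative",  # hypothetical flag name
        choices=["two-sided", "less", "greater"],
        default="two-sided",  # keeps existing behaviour unless the user opts in
        help="Alternative hypothesis for the paired t-test.",
    )
    return parser


# No flag given: behaviour is unchanged (two-sided).
default_args = build_parser().parse_args([])

# Explicit opt-in to the one-sided test.
less_args = build_parser().parse_args(["--t_test_alternative", "less"])

print(default_args.t_test_alternative)
print(less_args.t_test_alternative)
```

Keeping "two-sided" as the default means existing scripts produce the same p-values as before, while anyone who wants the one-sided test has to ask for it explicitly.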

ricardorei · 2022-12-19T13:17:02Z

@erip That's a good point, thanks!

The next release will be 2.0 and we are also going to replace the default models with new ones.

We will make it clear that scores (with default settings) won't be directly comparable to the previous version (1.1.3).

There will be backward compatibility, but default options will probably change for all 3 commands: comet-score, comet-mbr and comet-compare.

ricardorei · 2022-12-19T13:20:53Z

I agree with you that people typically want to know whether the baseline mean score is less than sys1's mean score. I think it's a good call to change the t_test to "less". What do you think?

At the moment this is only updated in the fix-multigpu branch and not merged into master.

erip · 2022-12-19T13:31:34Z

It seems very reasonable to me. I'm hoping there's not some nuance that I've overlooked here. I can look at sacrebleu to see what their alternative hypothesis is in their tests (for the sake of consistency more than correctness).

ricardorei · 2022-12-19T13:35:04Z

Perfect! Thanks!

erip · 2022-12-19T13:37:40Z

Unless I'm misreading their code, it seems like they're testing using a two-sided alternative hypothesis due to the absolute value.
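To show why taking an absolute value makes a test two-sided, here is a generic exact sign-flip randomization test on paired differences. This is not sacrebleu's implementation; the function name and the integer differences are made up for illustration.

```python
from itertools import product


def sign_flip_p_values(diffs):
    """Exact sign-flip randomization test on paired differences.

    Returns (p_two_sided, p_less) for the sum of the differences.
    """
    observed = sum(diffs)
    n_total = 0
    n_two_sided = 0
    n_less = 0
    # Under H0 each difference is equally likely to have either sign,
    # so enumerate every possible sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = sum(s * d for s, d in zip(signs, diffs))
        n_total += 1
        # With abs(): extreme outcomes in either direction count.
        if abs(stat) >= abs(observed):
            n_two_sided += 1
        # Without abs(): only outcomes at least as negative count.
        if stat <= observed:
            n_less += 1
    return n_two_sided / n_total, n_less / n_total


# Hypothetical (baseline - sys1) differences, scaled to integers.
diffs = [-3, -1, -4, 2, -5, -2, 1, -6]
p_two_sided, p_less = sign_flip_p_values(diffs)
print(p_two_sided, p_less)
```

Since the sign-flip distribution is symmetric around zero and the observed sum is negative, the two-sided p-value is exactly twice the one-sided one.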

ricardorei · 2023-02-22T09:39:38Z

@erip I looked a bit more into this and indeed a two-sided t_test is more common, and the results made more sense in my tests. Nonetheless, I am keeping the option to change that from the command line. I am going to merge v2.0 into master.

The release was delayed, but at least master will contain the new changes.

erip added the bug label Dec 13, 2022

ricardorei self-assigned this Dec 16, 2022

ricardorei pushed a commit that referenced this issue Dec 19, 2022

MultiGPU inference + t_test alternative flag (#101, #104)

c709d63

ricardorei closed this as completed Jan 8, 2023
