
[P1] MNLI has two validation set, how do you report the score #62

Closed
BaohaoLiao opened this issue Apr 22, 2024 · 3 comments
Comments

BaohaoLiao commented Apr 22, 2024

Hi,

I have a question about the GLUE task MNLI. As you know, MNLI has matched and mismatched validation sets. How do you partition the validation set and report the score?

It would be great if you could share the reproduction script for the MNLI task.

@frankaging frankaging self-assigned this Apr 22, 2024
@frankaging frankaging added the question Further information is requested label Apr 22, 2024
@frankaging frankaging changed the title MNLI has two validation set, how do you report the score [P1] MNLI has two validation set, how do you report the score Apr 22, 2024
frankaging (Collaborator) commented Apr 22, 2024

@BaohaoLiao Hey, thanks for your question! For the MNLI dataset, we use the validation_matched split for both validation and testing. (I will make this clear in the next revision. The RED paper was not clear about this either, so I figured it out by emailing the authors! I may also summarize what the RED paper's appendix says in the ReFT paper itself, so that it is self-contained about the validation setup and the evaluation metric, i.e. whether accuracy, correlation, etc. is used.)
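
For concreteness, here is a minimal sketch of loading that split and its metric with the Hugging Face datasets/evaluate packages (an illustration only, not our training code):

# Illustration only (not the repo's loader): MNLI evaluation uses the
# validation_matched split, and the GLUE metric for MNLI is accuracy.
from datasets import load_dataset
import evaluate

mnli = load_dataset("glue", "mnli")
eval_split = mnli["validation_matched"]  # used for both model selection and testing
metric = evaluate.load("glue", "mnli")   # computes accuracy for MNLI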

To reproduce, here is an example script for RoBERTa-base. For RoBERTa-large, you can copy the hyperparameters from our appendix:

python train.py -task glue \
-train_dataset mnli \
-model FacebookAI/roberta-base \
-seed 42 -l all -r 1 -p f1 -e 40 -lr 6e-4 \
-type LoreftIntervention \
-gradient_accumulation_steps 1 \
-batch_size 32 \
-eval_batch_size 32 \
-test_split validation_matched \
-max_length 256 \
--metric_for_best_model accuracy \
--dropout 0.05 \
--weight_decay 0.0000 \
--warmup_ratio 0.00 \
--logging_steps 20 \
--allow_cls_grad

Use the seeds {42, 43, 44, 45, 46}. For the validation set partition, please refer to our code for details. In short, we randomly partition a subset from the validation set (based on the seed) for selecting the best model, and report the final accuracy on the held-out remainder.
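
To make the partition concrete, here is a rough sketch of that protocol (an illustration only, not our actual code; the 50/50 split ratio below is an assumption, please check the repo for the exact partition):

# Rough sketch of the seeded validation partition (illustration only; the
# 50/50 ratio is an assumption, not necessarily what the repo uses).
from datasets import load_dataset

seed = 42  # one of {42, 43, 44, 45, 46}
val = load_dataset("glue", "mnli")["validation_matched"]
parts = val.train_test_split(test_size=0.5, seed=seed)
dev_split, holdout_split = parts["train"], parts["test"]
# Select the best checkpoint on dev_split, then report accuracy once on holdout_split.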

Please let me know if you have other questions, and feel free to close the ticket once your question is addressed.

Thanks for your interest!

frankaging (Collaborator) commented Apr 22, 2024

Also attaching the GLUE benchmark description that will be added to the Appendix to provide more details. Please also see Appendix A.1 of the RED paper for the original implementation (I basically paraphrased their setup description, so credit goes to them).

[Screenshot: GLUE benchmark description to be added to the Appendix]

BaohaoLiao (Author) commented

Thank you very much for your timely help.
