Replicating SuperGLUE benchmark #19

Closed
Shamdan17 opened this issue Dec 17, 2020 · 8 comments

@Shamdan17

Hello! I'm trying to replicate your work, and I'm currently comparing the performance of my replication to your implementation. Just to be sure, could you please provide me with the exact commands you used for training on the SuperGLUE tasks? I would be very grateful. Thank you!

@timoschick
Owner

Hi @Shamdan17, sure! Here's the exact command for training ALBERT with iPET on RTE:

python3 cli.py \
--method ipet \
--pattern_ids 0 1 2 3 \
--data_dir ${PATH_TO_YOUR_DATA_DIR} \
--model_type albert \
--model_name_or_path albert-xxlarge-v2 \
--task_name rte \
--output_dir ${PATH_TO_YOUR_OUTPUT_DIR} \
--do_train \
--do_eval \
--pet_per_gpu_eval_batch_size 8 \
--pet_per_gpu_train_batch_size 2 \
--pet_gradient_accumulation_steps 8 \
--pet_max_steps 250 \
--pet_max_seq_length 256 \
--sc_per_gpu_train_batch_size 2 \
--sc_per_gpu_unlabeled_batch_size 2 \
--sc_gradient_accumulation_steps 8 \
--sc_max_steps 5000 \
--sc_max_seq_length 256
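
For reference, the effective batch size is the per-GPU batch size times the number of gradient accumulation steps, i.e. 2 × 8 = 16 on a single GPU, for both the PET stage and the final sequence classifier (sc) stage; the per-task adjustments below keep this product constant.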

If you want to train a model with PET rather than iPET, simply replace --method ipet with --method pet. For other tasks, the following modifications are required:

  • WiC: Change --task_name to wic and --pattern_ids to 0 1 2
  • CB: Change --task_name to cb
  • BoolQ: Change --task_name to boolq and --pattern_ids to 0 1 2 3 4 5
  • MultiRC: Change --task_name to multirc and --pattern_ids to 0 1 2. MultiRC has many contexts that require more than 256 tokens, so we additionally set --pet_max_seq_length 512 and --sc_max_seq_length 512, which requires us to reduce the batch size. To keep the same effective batch size, we increase the number of gradient accumulation steps (see the full example command after this list):
--pet_per_gpu_train_batch_size 1 \
--pet_gradient_accumulation_steps 16 \
--sc_per_gpu_train_batch_size 1 \
--sc_per_gpu_unlabeled_batch_size 1 \
--sc_gradient_accumulation_steps 16 \
  • WSC: Change --method to pet, --task_name to wsc and --pattern_ids to 0 1 2. Also, change the maximum sequence length (--pet_max_seq_length and --sc_max_seq_length) to 128. This allows you to increase the batch size and reduce the gradient accumulation steps for the same effective batch size:
--pet_per_gpu_train_batch_size 4 \
--pet_gradient_accumulation_steps 4 \
--sc_per_gpu_train_batch_size 4 \
--sc_per_gpu_unlabeled_batch_size 4 \
--sc_gradient_accumulation_steps 4 \
  • COPA: Change --method to pet, --task_name to copa and --pattern_ids to 0 1. Apart from that, perform the same changes as for WSC (but a smaller maximum sequence length of --pet_max_seq_length 96 is sufficient).

  • ReCoRD: Change --task_name to record, --method to pet and --pattern_ids to 0. Additionally, add the --no_distillation flag so that no knowledge distillation is performed. Similar to MultiRC, increase the maximum sequence length to 512 and change the batch size and gradient accumulation steps accordingly.
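
For example, putting the MultiRC modifications together gives the following full command (same placeholder paths as above; depending on GPU memory, you may also need to lower --pet_per_gpu_eval_batch_size for the longer sequences):

python3 cli.py \
--method ipet \
--pattern_ids 0 1 2 \
--data_dir ${PATH_TO_YOUR_DATA_DIR} \
--model_type albert \
--model_name_or_path albert-xxlarge-v2 \
--task_name multirc \
--output_dir ${PATH_TO_YOUR_OUTPUT_DIR} \
--do_train \
--do_eval \
--pet_per_gpu_eval_batch_size 8 \
--pet_per_gpu_train_batch_size 1 \
--pet_gradient_accumulation_steps 16 \
--pet_max_steps 250 \
--pet_max_seq_length 512 \
--sc_per_gpu_train_batch_size 1 \
--sc_per_gpu_unlabeled_batch_size 1 \
--sc_gradient_accumulation_steps 16 \
--sc_max_steps 5000 \
--sc_max_seq_length 512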

timoschick self-assigned this on Dec 17, 2020
@Shamdan17
Author

Thank you very much! I greatly appreciate it. Looking forward to your future work :)

@Shamdan17
Author

Another question: do you not use the auxiliary LM loss for SuperGLUE?

@timoschick
Owner

No, we did not use the auxiliary LM loss. With the 1:3 ratio of labeled to unlabeled examples from the original PET paper, this would have required a batch size of at least 1 + 3 = 4, which was not possible on a single GPU.

@Shamdan17
Author

I see, that makes sense, thanks again. Just one last question (I hope), to make sure I'm not doing something wrong: is the --sc_per_gpu_train_batch_size flag necessary in this case? From what I saw in the code, once you use the use_logits flag for distillation, you only work with the unlabeled dataloader and discard the original training dataloader. Is that correct, or is there another place where you need this flag? Thanks a lot again for your time :)

@timoschick
Owner

You are absolutely correct! I used the same script for both regular training and PET/iPET, which is why I always updated both --sc_per_gpu_train_batch_size and --sc_per_gpu_unlabeled_batch_size. But in your case (i.e., if you only want to train PET/iPET), --sc_per_gpu_train_batch_size is not necessary.
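
In other words (assuming the defaults are fine for any flag you drop), the sc-related batch-size flags in the RTE command above could be reduced to just:

--sc_per_gpu_unlabeled_batch_size 2 \
--sc_gradient_accumulation_steps 8 \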

@dorost1234

dorost1234 commented Nov 6, 2021

Hi, as far as I can see, RTE has 5 patterns (0, 1, 2, 3, 4) in pvp.py. Is it intentional that the command you mentioned above only uses 0 1 2 3? Similarly, MultiRC has patterns 0, 1, 2, 3 and not just 0, 1, 2. Thanks!

@timoschick
Owner

Hi @dorost1234, for these tasks, the last pattern is always the one used by GPT-3. We originally did not include these patterns, so if you want to reproduce our main results in Table 1, you should not use them. However, if you want to reproduce the pcomb results in Table 2 (where we use a combination of our patterns and the GPT-3 pattern, which leads to better performance), you should include it.
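
For example (based on the pattern counts mentioned above), reproducing the pcomb setting would mean passing --pattern_ids 0 1 2 3 4 for RTE and --pattern_ids 0 1 2 3 for MultiRC.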
