Unable to reproduce the PT-2 results of RTE in Table 1 #9
Comments
Thanks for your interest in our work! Our newly released code may be helpful for your problem.
Hi Xiao, I found your newly released code does help reproduce the PT-2 results. However, I am only able to reproduce them with "--prefix" instead of "--prompt". Can you clarify the main difference between them? Also, can you clarify which one is the method described in your paper? Many thanks!
Hi @CSerxy,
Hi Xiao, thanks for the quick response! I am a little confused by the --prefix part of the code. Can you help me with the question below? In your paper (P-tuning v2) you describe a model that inserts a prompt in front of each layer. For example, assume a prompt of length 5: the tunable parameter count would then be 5 * 1024 * 24, assuming a hidden size of 1024 and 24 layers. That is what you describe in the paper, right? However, when I look at your implementation, the number of parameters is doubled: the tunable parameter count is 5 * 1024 * 24 * 2 when I set --pre_seq_len=5 in model/prefix_encoder.py. I found that the doubled parameters are used to generate a length-5 key_layer and a length-5 value_layer, which differs from the model described in the paper. Do I understand correctly? I sincerely appreciate your help and look forward to your reply!
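For concreteness, here is the arithmetic behind the two counts being compared (a plain illustration using the numbers from the question above, assuming prompt length 5, hidden size 1024, and 24 layers):

```python
pre_seq_len, hidden_size, num_layers = 5, 1024, 24

# Count expected from the paper's description:
# one prompt vector per position, per layer.
paper_count = pre_seq_len * hidden_size * num_layers          # 122880

# Count observed with --pre_seq_len=5 in model/prefix_encoder.py:
# separate key and value vectors per position, per layer.
observed_count = pre_seq_len * hidden_size * num_layers * 2   # 245760

print(paper_count, observed_count)
```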
This is an implementation trick we inherit from prefix-tuning: since we do not want to change the original BERT code, we have to work around its attention interface. Originally, the keys and values of prefix tokens would be computed from their hidden states using the projection matrices K and V in each attention head. Here we directly pass learned keys and values into attention without computing them from hidden states via K and V (as prefix-tuning does), which in fact doubles the parameters of the prefix embeddings. In practice, we find it performs almost the same as the original formulation.
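The doubled-parameter layout can be sketched roughly as follows (a minimal illustration of the trick, not the repository's actual model/prefix_encoder.py; the class and argument names are assumptions):

```python
import torch

class PrefixEncoder(torch.nn.Module):
    """Sketch of a prefix encoder in the style of prefix-tuning.

    Instead of learning prefix hidden states and projecting them through
    each attention head's K and V matrices, the keys and values are
    learned directly -- hence the factor of 2 in the embedding width.
    """

    def __init__(self, pre_seq_len=5, num_layers=24, hidden_size=1024):
        super().__init__()
        # One row per prefix position; each row stores a key vector and a
        # value vector for every layer: 2 * num_layers * hidden_size numbers.
        self.embedding = torch.nn.Embedding(
            pre_seq_len, 2 * num_layers * hidden_size)

    def forward(self, prefix_ids):
        # prefix_ids: (batch, pre_seq_len)
        # returns: (batch, pre_seq_len, 2 * num_layers * hidden_size),
        # later reshaped and split into per-layer keys and values.
        return self.embedding(prefix_ids)

encoder = PrefixEncoder()
n_params = sum(p.numel() for p in encoder.parameters())
print(n_params)  # 5 * 1024 * 24 * 2 = 245760
```

The learned output is typically reshaped to (num_layers, 2, pre_seq_len, num_heads, head_dim) and fed to the frozen model as past key/value states, so the backbone's own weights never change.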
I have some questions about rebuilding the PT-2 results of RTE in Table 1.
My base model is RoBERTa-large. I trained the model for 10 epochs with the recommended hyperparameters (prompt length = 4, learning rate = 1e-2, as suggested in a previous issue).
However, I can only get roughly 58% accuracy on the RTE dev set.
I am not sure whether the factor below could cause this; I hope the authors can give me some hints. Many thanks!