I wonder if the model works fine when the batch size is not 1? #2
This link may be helpful. In RoBERTa, if attention_mask is None, the attention mask for all tokens is set to 1 by default.
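For reference, a minimal sketch of that default behavior (the roberta-base checkpoint and the input string are just placeholders here):

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

input_ids = tokenizer("hello world", return_tensors="pt").input_ids

# No attention_mask is passed, so RobertaModel falls back to an
# all-ones mask internally, i.e. every position is attended to.
output = model(input_ids=input_ids)
```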
If the attention mask for all tokens is 1, wouldn't there be a problem when dealing with multiple sequences? (Since the padding input_ids won't be ignored when computing attention scores.)
Let me explain with an example. If batch = 2 and sample instance 2 is [u1; u2; u3; u4], then a shorter sample instance 1 has to be padded to the same length before both can go into one input tensor, and with the default all-ones attention mask those pad positions still receive attention. (See the sketch below.)
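A runnable sketch of this situation (the utterance strings and the roberta-base checkpoint are stand-ins for illustration):

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Two instances of different lengths; the shorter one gets padded.
batch = ["u1 u2", "u1 u2 u3 u4"]
enc = tokenizer(batch, padding=True, return_tensors="pt")
print(enc.attention_mask)  # 0s mark the pad positions of the shorter instance

# Dropping the mask: RoBERTa attends to the pad tokens of instance 1 as well.
no_mask = model(input_ids=enc.input_ids)
with_mask = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask)

# The shorter instance's representations differ between the two runs.
print(torch.allclose(no_mask.last_hidden_state[0], with_mask.last_hidden_state[0]))
```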
As you were concerned, we do not set the attention mask for pad tokens to 0. However, when we train the model, the batch_size is set to 1, so no padding is needed.
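A small sketch of why this is safe, assuming the standard transformers tokenizer (the input string is made up):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# A batch of size 1 contains a single sequence, so the tokenizer never
# inserts <pad> tokens and the default all-ones mask is exactly correct.
enc = tokenizer(["u1 u2 u3 u4"], padding=True, return_tensors="pt")
print(enc.attention_mask)  # all ones; there are no pad positions to mask out
```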
Thank you for your detailed explanation. Can I conclude that if I manually set the batch_size to, say, 16 during training, this will not have a negative impact on training due to attention-mask issues? Or am I mistaken and did I misread your comment?
There may be a negative impact, but it is expected to be small. You got it right.
I don't see any handling of attention_mask in the code. Does that mean the RoBERTa model sets the attention_mask to all 1s?
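Yes, that is the default when no mask is passed. If anyone does want batches larger than 1, one conventional workaround (assuming the standard transformers API, with roberta-base as a placeholder checkpoint) is to pass the tokenizer's mask through explicitly:

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

enc = tokenizer(["a short sequence", "a noticeably longer example sequence"],
                padding=True, return_tensors="pt")

# **enc forwards both input_ids and attention_mask, so pad tokens are
# excluded from the attention scores instead of defaulting to 1.
output = model(**enc)
```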