
I wonder if the model works fine when batch is not 1? #2

Closed
qftie opened this issue Jun 10, 2022 · 5 comments

Comments

@qftie

qftie commented Jun 10, 2022

I don't see any operations on attention_mask. Does that mean the RoBERTa model sets the attention_mask for all tokens to 1?

@rungjoo
Owner

rungjoo commented Jun 13, 2022

This link may be helpful.
https://github.com/huggingface/transformers/blob/v4.19.3/src/transformers/models/roberta/modeling_roberta.py#L807

In RoBERTa, if attention_mask is None, the attention mask for all tokens defaults to 1.
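
A minimal sketch of that default, assuming the standard Hugging Face transformers API and the stock roberta-base checkpoint (not this repository's trained model): for a single unpadded sequence, leaving attention_mask unset gives the same output as an explicit all-ones mask.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")  # loaded in eval mode, so outputs are deterministic

inputs = tokenizer("hello world", return_tensors="pt")

# attention_mask is omitted (None), so RoBERTa builds an all-ones mask internally.
out_default = model(input_ids=inputs["input_ids"])

# Passing an explicit all-ones mask gives the same result.
out_ones = model(input_ids=inputs["input_ids"],
                 attention_mask=torch.ones_like(inputs["input_ids"]))

assert torch.allclose(out_default.last_hidden_state, out_ones.last_hidden_state)
```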

@qftie
Author

qftie commented Jun 16, 2022

If the attention mask for all tokens is 1, wouldn't there be a problem when handling multiple sequences? (The padded input_ids won't be ignored when the attention scores are calculated.)

@rungjoo
Owner

rungjoo commented Jun 20, 2022

Let me explain with an example.

If batch_size = 2:
sample instance1: [u1; u2; u3]

  • predict u3's emotions

sample instance2: [u1; u2; u3; u4]

  • predict u4's emotions

input

  • Batch input is padded by the length difference between instance1 and instance2.

As you were concerned, we do not set the attention mask of the pad tokens ourselves, so it defaults to 1 even for padding tokens when batch_size is greater than 1.
We missed this because we trained with batch_size = 1, so the problem did not occur.
Even when batch_size is greater than 1 and the pad tokens keep a mask of 1, the model is expected to learn to ignore the padding during training.
However, your suggestion should make the model train more effectively.
Thanks.
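
A minimal sketch of the batched case described above, assuming the standard Hugging Face tokenizer/model API; the "u1 … u4" strings are placeholders for the concatenated utterances, not this repository's actual preprocessing. With padding=True the tokenizer already returns an attention_mask that is 0 on pad positions, so forwarding it is enough to make the model ignore the padding.

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

batch = tokenizer(
    ["u1 u2 u3",       # instance1: predict u3's emotion
     "u1 u2 u3 u4"],   # instance2: predict u4's emotion (longer, so instance1 gets padded)
    padding=True,
    return_tensors="pt",
)

outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],  # 1 for real tokens, 0 for pad tokens
)
```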

@rungjoo rungjoo closed this as completed Jul 2, 2022
@ThomasDeCleen

Thank you for your detailed explanation. Can I conclude that when I manually set the batch_size to 16, for example, this will not have a negative impact on training due to attention-mask issues? Or did I misread your comment?

@rungjoo
Owner

rungjoo commented Apr 21, 2023

There may be a negative impact, but it is expected to be small.
To remove the effect entirely, the attention mask corresponding to the padding tokens must be set to 0.

You got it right.
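
A hedged sketch of that fix, again using the standard transformers API rather than this repository's training code: the mask can also be built directly from pad_token_id, which sets the attention of every padding position to 0.

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# A padded batch (placeholder utterances, as in the sketch above).
batch = tokenizer(["u1 u2 u3", "u1 u2 u3 u4"], padding=True, return_tensors="pt")
input_ids = batch["input_ids"]

# 1 where the token is real, 0 where it is <pad>.
attention_mask = (input_ids != tokenizer.pad_token_id).long()
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```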
