
I wonder if the model works fine when batch is not 1? #2

Closed
qftie opened this issue Jun 10, 2022 · 5 comments

Comments

@qftie

qftie commented Jun 10, 2022

I don't see any operations on attention_mask. Does that mean the RoBERTa model sets the attention_mask for all tokens to 1?

@rungjoo
Owner

rungjoo commented Jun 13, 2022

This link may be helpful.
https://github.com/huggingface/transformers/blob/v4.19.3/src/transformers/models/roberta/modeling_roberta.py#L807

In RoBERTa, if attention_mask is None, the attention mask for all tokens defaults to 1.
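
A minimal sketch of that default, assuming the standard Hugging Face transformers API and the stock roberta-base checkpoint (not this repository's trained model): for a single unpadded sequence, leaving attention_mask unset gives the same output as an explicit all-ones mask.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")  # loaded in eval mode, so outputs are deterministic

inputs = tokenizer("hello world", return_tensors="pt")

# attention_mask is omitted (None), so RoBERTa builds an all-ones mask internally.
out_default = model(input_ids=inputs["input_ids"])

# Passing an explicit all-ones mask gives the same result.
out_ones = model(input_ids=inputs["input_ids"],
                 attention_mask=torch.ones_like(inputs["input_ids"]))

assert torch.allclose(out_default.last_hidden_state, out_ones.last_hidden_state)
```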

@qftie
Author

qftie commented Jun 16, 2022

If the attention mask for all tokens is 1, wouldn't there be a problem when handling multiple sequences? (The padded input_ids won't be ignored when the attention scores are calculated.)

@rungjoo
Owner

rungjoo commented Jun 20, 2022

Let me explain with an example.

If batch_size = 2:
sample instance1: [u1; u2; u3]

  • predict u3's emotions

sample instance2: [u1; u2; u3; u4]

  • predict u4's emotions

input

  • Batch input is padded by the length difference between instance1 and instance2.

As you were concerned, we do not set the attention mask of the pad tokens ourselves, so it defaults to 1 even for padding tokens when batch_size is greater than 1.
We missed this because we trained with batch_size = 1, so the problem did not occur.
Even when batch_size is greater than 1 and the pad tokens keep a mask of 1, the model is expected to learn to ignore the padding during training.
However, your suggestion should make the model train more effectively.
Thanks.
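
A minimal sketch of the batched case described above, assuming the standard Hugging Face tokenizer/model API; the "u1 … u4" strings are placeholders for the concatenated utterances, not this repository's actual preprocessing. With padding=True the tokenizer already returns an attention_mask that is 0 on pad positions, so forwarding it is enough to make the model ignore the padding.

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

batch = tokenizer(
    ["u1 u2 u3",       # instance1: predict u3's emotion
     "u1 u2 u3 u4"],   # instance2: predict u4's emotion (longer, so instance1 gets padded)
    padding=True,
    return_tensors="pt",
)

outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],  # 1 for real tokens, 0 for pad tokens
)
```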

@rungjoo rungjoo closed this as completed Jul 2, 2022
@ThomasDeCleen

Thank you for your detailed explanation. Can I conclude that when I manually set the batch_size to 16, for example, this will not have a negative impact on training due to attention-mask issues? Or did I misread your comment?

@rungjoo
Owner

rungjoo commented Apr 21, 2023

There may be a negative impact, but it is expected to be small.
To remove the effect entirely, the attention mask corresponding to the padding tokens must be set to 0.

You got it right.
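
A hedged sketch of that fix, again using the standard transformers API rather than this repository's training code: the mask can also be built directly from pad_token_id, which sets the attention of every padding position to 0.

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# A padded batch (placeholder utterances, as in the sketch above).
batch = tokenizer(["u1 u2 u3", "u1 u2 u3 u4"], padding=True, return_tensors="pt")
input_ids = batch["input_ids"]

# 1 where the token is real, 0 where it is <pad>.
attention_mask = (input_ids != tokenizer.pad_token_id).long()
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```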
