Hi authors, great work! I have a question about the attention mask:
Suppose I perform TMix over two sequences of different lengths. I can pass their respective attention masks to BertEncoder4Mix, but after the mixup at mix_layer, which attention_mask should I pass along to the remaining BERT layers? My intuition is that the mixed hidden states should use the attention mask of the longer sequence (otherwise some real text tokens would be masked out). May I check whether you agree?
After briefly reading through the code in MixText.py, it seems that the attention mask is simply ignored and the padding tokens are never masked out. Would this hurt model performance compared to applying the masking properly?
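For concreteness, here is a minimal sketch of what I have in mind. The helper name `mix_hidden_states` and the element-wise max over the two masks are my own suggestion, not code from the repo; the union of the masks reduces to the longer sequence's mask when both inputs are right-padded:

```python
import torch

def mix_hidden_states(h1, h2, mask1, mask2, lam):
    """Mix two batches of hidden states at mix_layer (hypothetical helper).

    h1, h2:       (batch, seq_len, hidden) hidden states at the mix layer
    mask1, mask2: (batch, seq_len) attention masks (1 = real token, 0 = pad)
    lam:          mixup coefficient, e.g. sampled from Beta(alpha, alpha)
    """
    mixed = lam * h1 + (1 - lam) * h2
    # Keep a position visible if it holds a real token in *either* sequence.
    # For right-padded inputs this equals the longer sequence's mask,
    # matching the intuition above.
    mixed_mask = torch.max(mask1, mask2)
    return mixed, mixed_mask
```

The combined mask would of course still need to be converted to the extended additive form (shape `(batch, 1, 1, seq_len)` with large negative values at padded positions) before being fed to the remaining BERT layers, as the standard encoder expects.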