Details about Attention Mask #3

Closed
NoviScl opened this issue Jul 12, 2020 · 1 comment
NoviScl commented Jul 12, 2020

Hi authors, great work! I have a question about the attention mask:
Suppose I perform TMix over two sequences of different lengths. I can pass their respective attention masks to BertEncoder4Mix, but after the mixup at mix_layer, which attention_mask should I pass along to the remaining BERT layers? My intuition is that the mixed hidden states should use the attention mask of the longer sequence (otherwise some text tokens may be masked out); may I check whether you agree with this?

After briefly reading through the code in MixText.py, it seems that you ignored the attention mask and did not mask out the padding tokens. Would this affect the model performance compared to adding the masking properly?


jiaaoc commented Jul 12, 2020

Yes, if you use attention masks, you need to adopt the longer one after mixing (e.g., by taking a simple 'or' of the two masks).

I did not use attention masks for text classification, but I think adding them might further boost the performance.
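For reference, here is a minimal sketch of the 'or' combination described above, assuming the masks are 0/1 tensors of shape (batch_size, seq_len) as passed into BertEncoder4Mix; the variable names and values are illustrative, not taken from MixText.py:

```python
import torch

# Hypothetical 0/1 attention masks for two sequences of different lengths
# in the same batch position (1 = real token, 0 = padding).
attention_mask_a = torch.tensor([[1, 1, 1, 0, 0]])
attention_mask_b = torch.tensor([[1, 1, 1, 1, 0]])

# Element-wise 'or': a position stays visible if it is a real token in
# either sequence, so the mask of the longer sequence effectively wins.
mixed_attention_mask = (attention_mask_a.bool() | attention_mask_b.bool()).long()
print(mixed_attention_mask)  # tensor([[1, 1, 1, 1, 0]])
```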

@jiaaoc jiaaoc closed this as completed Jul 12, 2020