Hi authors, great work! I have a question about the attention mask:
Suppose I perform TMix over two sequences of different lengths. I can pass their respective attention masks to BertEncoder4Mix, but after the mixup at mix_layer, which attention_mask should I pass along to the remaining BERT layers? My intuition is that the mixed hidden states should use the attention mask of the longer sequence (otherwise some real text tokens would be masked out). May I check whether you agree?
After briefly reading through the code in MixText.py, it seems that the attention mask is simply ignored and the padding tokens are never masked out. Would this hurt model performance compared to applying the masking properly?
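For concreteness, here is a minimal sketch of what I have in mind. The helper name `mix_hidden_states` and the element-wise max over the two masks are my own suggestion, not code from the repo; the union of the masks reduces to the longer sequence's mask when both inputs are right-padded:

```python
import torch

def mix_hidden_states(h1, h2, mask1, mask2, lam):
    """Mix two batches of hidden states at mix_layer (hypothetical helper).

    h1, h2:       (batch, seq_len, hidden) hidden states at the mix layer
    mask1, mask2: (batch, seq_len) attention masks (1 = real token, 0 = pad)
    lam:          mixup coefficient, e.g. sampled from Beta(alpha, alpha)
    """
    mixed = lam * h1 + (1 - lam) * h2
    # Keep a position visible if it holds a real token in *either* sequence.
    # For right-padded inputs this equals the longer sequence's mask,
    # matching the intuition above.
    mixed_mask = torch.max(mask1, mask2)
    return mixed, mixed_mask
```

The combined mask would of course still need to be converted to the extended additive form (shape `(batch, 1, 1, seq_len)` with large negative values at padded positions) before being fed to the remaining BERT layers, as the standard encoder expects.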