
Long Sequence in SQuAD #41

Open
ecchochan opened this issue Jun 24, 2019 · 3 comments


ecchochan commented Jun 24, 2019

Case: SQuAD task, sequence length > 512

Does your script utilize cached memory/extended context across segments, such that predictions are inferred from sequences longer than 512 tokens?

If yes, where is the code that achieves this?

If not, how would you suggest using cached memory to perform the QA task?

Thank you for such great work!

@kimiyoung
Collaborator

We are not using cached memory for finetuning yet. Cached memory was used during pretraining to improve the modeling of long sequences. Once training is done, the model is better at long-sequence modeling even with the memory removed.

Including cached memory for finetuning is also an option, but it is not included for now. I would suggest using the same mechanism as in pretraining, but backpropagating the gradients across segments.
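To make that suggestion concrete, here is a minimal sketch (not the actual XLNet code; `TinyRecurrentEncoder`, the toy dimensions, and the dummy loss are all invented for illustration) of a Transformer-XL-style cache that is reused across segments and left un-detached, so gradients backpropagate across segment boundaries:

```python
# Minimal sketch, assuming a PyTorch-style model; not the XLNet implementation.
import torch
import torch.nn as nn

class TinyRecurrentEncoder(nn.Module):
    """Toy stand-in for a Transformer-XL/XLNet layer stack with a memory cache."""
    def __init__(self, d_model=64, n_head=4, mem_len=256):
        super().__init__()
        self.mem_len = mem_len
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))

    def forward(self, x, mem=None):
        # Prepend cached hidden states so attention can look back beyond this segment.
        context = x if mem is None else torch.cat([mem, x], dim=1)
        h, _ = self.attn(x, context, context)
        h = self.ff(h + x)
        # Keep the last mem_len positions as the cache for the next segment.
        # In pretraining the cache is typically detached; the suggestion here is
        # NOT to detach during finetuning, so gradients flow across segments.
        new_mem = h[:, -self.mem_len:]
        return h, new_mem

# Usage: a 1024-token document processed as two 512-token segments.
model = TinyRecurrentEncoder(mem_len=256)
doc = torch.randn(1, 1024, 64)           # (batch, tokens, d_model); dummy embeddings
mem = None
outputs = []
for start in range(0, 1024, 512):
    h, mem = model(doc[:, start:start + 512], mem)
    outputs.append(h)
loss = torch.cat(outputs, dim=1).mean()  # dummy stand-in for a QA loss
loss.backward()                          # gradients reach segment 1 through the cache
```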


ecchochan commented Jun 25, 2019

Thank you!

I am trying to understand how previous segments are aligned when fed into the model.

Input Sentence Length: 1024
max_seq_length: 512
mem_len: 256

Am I correct in understanding that there are 3 segments: [0:512], [256:768], [512:1024]?
And that the data needs to be passed into the model 3 times, like sliding windows?
But this would make it unable to infer an answer from context that comes after the window.

Can you advise on how long sequences can be processed at inference time?

Sorry for asking what might be foundational knowledge; I am relatively new to the architecture. By the way, do you plan on releasing code for using cached memory during finetuning?
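For reference, a tiny sketch (a hypothetical helper, not code from this repo) that reproduces the sliding-window layout proposed in the question, where consecutive windows overlap by `mem_len` tokens:

```python
def sliding_windows(total_len, max_seq_length, mem_len):
    # Windows advance by (max_seq_length - mem_len), so consecutive
    # windows overlap by mem_len tokens.
    stride = max_seq_length - mem_len
    return [(start, start + max_seq_length)
            for start in range(0, total_len - max_seq_length + 1, stride)]

print(sliding_windows(1024, 512, 256))  # [(0, 512), (256, 768), (512, 1024)]
```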


SivilTaram commented Jun 25, 2019

@ecchochan The memory cache actually originates from the Transformer-XL paper; I think you could dig into the details there. If I get it right, there are only two segments, [0:512] and [512:1024], and the cached memory for the second segment is [256:512]. As for the question:

"But this would make it unable to infer an answer from context that comes after the window."

The cached memory aims to capture long-range dependency. That problem remains whether or not there is cached memory (just like the BERT architecture, which has a fixed length of 512).
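Put as plain arithmetic (an illustration using the numbers from the question, not code from this repo):

```python
# Two-segment layout with a memory cache, using the question's numbers.
total_len, max_seq_length, mem_len = 1024, 512, 256

segment_1 = (0, max_seq_length)          # tokens [0:512]
segment_2 = (max_seq_length, total_len)  # tokens [512:1024]
# While processing segment 2, the model also attends to the cached hidden
# states of the last mem_len tokens of segment 1:
cached_for_segment_2 = (max_seq_length - mem_len, max_seq_length)  # [256:512]

print(segment_1, segment_2, cached_for_segment_2)  # (0, 512) (512, 1024) (256, 512)
```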
