
Long Sequence in SQuAD #41

Open
ecchochan opened this issue Jun 24, 2019 · 3 comments


ecchochan commented Jun 24, 2019

Case: SQuAD task, sequence length > 512

Does your script utilize cached memory/extended context across segments, such that predictions are inferred from sequences longer than 512 tokens?

If yes, where is the code that achieves this?

If not, how would you suggest using cached memory to perform the QA task?

Thank you for such great work!

@kimiyoung
Collaborator

We are not using cached memory for finetuning yet. Cached memory was used during pretraining to improve the modeling of long sequences. Once training is done, the model is better at long-sequence modeling even with the memory removed.

Including cached memory for finetuning is also an option, but it is not included for now. I would suggest using the same mechanism as in pretraining, but backpropagating the gradients across segments.
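To make that suggestion concrete, here is a minimal sketch (not the actual XLNet code; `TinyRecurrentEncoder`, the toy dimensions, and the dummy loss are all invented for illustration) of a Transformer-XL-style cache that is reused across segments and left un-detached, so gradients backpropagate across segment boundaries:

```python
# Minimal sketch, assuming a PyTorch-style model; not the XLNet implementation.
import torch
import torch.nn as nn

class TinyRecurrentEncoder(nn.Module):
    """Toy stand-in for a Transformer-XL/XLNet layer stack with a memory cache."""
    def __init__(self, d_model=64, n_head=4, mem_len=256):
        super().__init__()
        self.mem_len = mem_len
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))

    def forward(self, x, mem=None):
        # Prepend cached hidden states so attention can look back beyond this segment.
        context = x if mem is None else torch.cat([mem, x], dim=1)
        h, _ = self.attn(x, context, context)
        h = self.ff(h + x)
        # Keep the last mem_len positions as the cache for the next segment.
        # In pretraining the cache is typically detached; the suggestion here is
        # NOT to detach during finetuning, so gradients flow across segments.
        new_mem = h[:, -self.mem_len:]
        return h, new_mem

# Usage: a 1024-token document processed as two 512-token segments.
model = TinyRecurrentEncoder(mem_len=256)
doc = torch.randn(1, 1024, 64)           # (batch, tokens, d_model); dummy embeddings
mem = None
outputs = []
for start in range(0, 1024, 512):
    h, mem = model(doc[:, start:start + 512], mem)
    outputs.append(h)
loss = torch.cat(outputs, dim=1).mean()  # dummy stand-in for a QA loss
loss.backward()                          # gradients reach segment 1 through the cache
```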


ecchochan commented Jun 25, 2019

Thank you!

I am trying to understand how previous segments are aligned when fed into the model.

Input Sentence Length: 1024
max_seq_length: 512
mem_len: 256

Am I correct in understanding that there are 3 segments: [0:512], [256:768], [512:1024]?
And that the data needs to be passed into the model 3 times, like sliding windows?
But this would make it unable to infer an answer from context that comes after the window.

Can you advise on how long sequences can be processed at inference time?

Sorry for asking what might be foundational knowledge; I am relatively new to the architecture. By the way, do you plan on releasing code for using cached memory during finetuning?
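For reference, a tiny sketch (a hypothetical helper, not code from this repo) that reproduces the sliding-window layout proposed in the question, where consecutive windows overlap by `mem_len` tokens:

```python
def sliding_windows(total_len, max_seq_length, mem_len):
    # Windows advance by (max_seq_length - mem_len), so consecutive
    # windows overlap by mem_len tokens.
    stride = max_seq_length - mem_len
    return [(start, start + max_seq_length)
            for start in range(0, total_len - max_seq_length + 1, stride)]

print(sliding_windows(1024, 512, 256))  # [(0, 512), (256, 768), (512, 1024)]
```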


SivilTaram commented Jun 25, 2019

@ecchochan The memory cache actually originates from the Transformer-XL paper; I think you could dig into the details there. If I get it right, there are only two segments, [0:512] and [512:1024], and the cached memory for the second segment is [256:512]. As for the question:

"But this would make it unable to infer an answer from context that comes after the window."

The cached memory aims to capture long-range dependency. That problem remains whether or not there is cached memory (just like the BERT architecture, which has a fixed length of 512).
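Put as plain arithmetic (an illustration using the numbers from the question, not code from this repo):

```python
# Two-segment layout with a memory cache, using the question's numbers.
total_len, max_seq_length, mem_len = 1024, 512, 256

segment_1 = (0, max_seq_length)          # tokens [0:512]
segment_2 = (max_seq_length, total_len)  # tokens [512:1024]
# While processing segment 2, the model also attends to the cached hidden
# states of the last mem_len tokens of segment 1:
cached_for_segment_2 = (max_seq_length - mem_len, max_seq_length)  # [256:512]

print(segment_1, segment_2, cached_for_segment_2)  # (0, 512) (512, 1024) (256, 512)
```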
