Truncating the prefix of a sequence rather than the suffix #12909
Comments
There's a
@NielsRogge Perhaps I'm missing something, but it doesn't seem to implement this functionality. The documentation says that it truncates the first sequence, not the first tokens of the sequence, right?
I'm not sure what you mean by truncating the prefix of a sequence. For question answering, one typically provides What do you mean by prefix/suffix?
We had a misunderstanding. If I use T5/GPT for question answering, the model will receive a single sequence as input. This input might look as follows: Are my intentions clear now? If we think about implementation, perhaps we can add a flag that signals which part of the sequence we wish to truncate: the prefix or the suffix?
Additionally, in many tasks even BERT will receive a single input. A good example is intent detection in an ongoing dialog. It seems unnatural to divide a dialog made up of multiple turns into two sequences, yet for intent detection the most important part of the sequence is often the last turns. Thus, cutting the start of the sequence (the prefix) rather than the end (the suffix) is probably preferable.
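The difference between the two truncation directions discussed above can be sketched in a few lines of plain Python. This operates on a bare list of token ids rather than a real tokenizer, and the function name and `side` values are purely illustrative, not part of the transformers API:

```python
# Minimal sketch of the two truncation directions on a plain list of
# token ids. `truncate` and side="prefix"/"suffix" are illustrative
# names, not the transformers API.

def truncate(ids, max_length, side="suffix"):
    """Keep at most max_length tokens.

    side="suffix": drop tokens from the end (the current default behavior).
    side="prefix": drop tokens from the start, so the model sees the
                   end of the sequence (the behavior requested here).
    """
    if len(ids) <= max_length:
        return ids
    if side == "suffix":
        return ids[:max_length]
    return ids[-max_length:]

ids = [101, 7592, 2088, 2003, 1037, 2204, 2154, 102]  # 8 tokens
print(truncate(ids, 5, side="suffix"))  # keeps the first 5 tokens
print(truncate(ids, 5, side="prefix"))  # keeps the last 5 tokens
```

For a multi-turn dialog, the `"prefix"` variant is the one that preserves the most recent turns.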
OK, I get it. Perhaps this could be defined as an additional argument called
Perfect! Thanks for simplifying it :)
I think if a
Let me implement this :)
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🚀 Feature request
Hi, tokenizers get `truncation` as an argument. When set to `True`, the tokenizer will truncate the suffix of a sequence so it does not surpass the specified `max_length`. I'd like to have functionality that truncates the prefix of the sequence instead, so the model will see the suffix of the sequence.
Motivation
In many applications (e.g. Dialog, and QA) the most important part of the sequence is the suffix (e.g. the question after the context, or the last response of the dialog).
Your contribution
Perhaps I'll submit a PR, but it might take me some time, as I'm close to some deadlines of mine :(