
Truncating the prefix of a sequence rather than the suffix #12909

Open
yuvalkirstain opened this issue Jul 27, 2021 · 11 comments
Labels: WIP (work in progress)

@yuvalkirstain
Contributor

🚀 Feature request

Hi, tokenizers accept `truncation` as an argument. When set to `True`, the tokenizer truncates the suffix of a sequence so that it does not exceed the specified `max_length`. I'd like functionality that truncates the prefix of the sequence instead, so that the model sees the suffix of the sequence.

Motivation

In many applications (e.g. dialog and QA), the most important part of the sequence is the suffix (e.g. the question that follows the context, or the last response in the dialog).

Your contribution

Perhaps I'll submit a PR, but it might take me some time as I'm close to some deadlines of mine :(

@NielsRogge
Contributor

There's a TruncationStrategy called `"only_first"` that implements this; see the TruncationStrategy documentation for all possible truncation strategies.

@yuvalkirstain
Contributor Author

@NielsRogge Perhaps I'm missing something, but it doesn't seem to implement this functionality. The documentation says that it truncates the first sequence, not the first tokens of the sequence, right?

> `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
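For reference, a minimal sketch of what `"only_first"` actually does with a pair of sequences (the checkpoint and lengths here are arbitrary examples, not anything prescribed by the thread):

```python
from transformers import AutoTokenizer

# Illustrative sketch: "only_first" removes tokens from the END of the
# first sequence of a *pair*; it never removes the leading tokens of a
# single sequence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

context = "A very long passage about the sky. " * 20   # first sequence
question = "What color is the sky?"                     # second sequence

encoded = tokenizer(
    context,
    question,
    truncation="only_first",  # truncates the context, keeps the question
    max_length=48,
)
print(tokenizer.decode(encoded["input_ids"]))
# -> the END of the context is cut off; the question remains intact
```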

@NielsRogge
Contributor · commented Jul 27, 2021

I'm not sure what you mean by truncating the prefix of a sequence.

For question answering, one typically provides [CLS] question [SEP] context [SEP] to the model (the question being the first sequence, the context being the second). People are usually interested in truncating the tokens of either the question or the context.

What do you mean by prefix/suffix?

@yuvalkirstain
Contributor Author

We had a misunderstanding. If I use T5/GPT for question answering, the model receives a single sequence as input. This input might look as follows:
`Background: <first sentence in the context> ... <last sentence in the context>\nQuestion: <question>\nAnswer:`
Now, if I truncate the suffix of the input, it might end up as:
`Background: <first sentence in the context> ... <last sentence in the context>`
Thus, I would prefer to truncate the prefix of the input, so the model gets:
`<third sentence in the context> ... <last sentence in the context>\nQuestion: <question>\nAnswer:`

Are my intentions clear now?

If we think about implementation, perhaps we can add flags that signal which part of the sequence we wish to truncate - prefix, or suffix?
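In the meantime, a minimal sketch of a manual workaround, assuming a GPT-2 tokenizer (the text and `max_length` are illustrative): encode without truncation, then keep only the last `max_length` tokens so the `Question: ...\nAnswer:` suffix of the prompt survives.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = (
    "Background: " + "Some context sentence. " * 300
    + "\nQuestion: What color is the sky?\nAnswer:"
)
max_length = 512

# Encode without truncation, then slice off the prefix manually.
input_ids = tokenizer(text, truncation=False)["input_ids"]
input_ids = input_ids[-max_length:]  # drop the prefix, keep the suffix

print(tokenizer.decode(input_ids)[-60:])  # the question and "Answer:" are kept
```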

@yuvalkirstain
Contributor Author

Additionally, in many tasks even BERT will receive a single input. A good example might be intent detection for an ongoing dialog. I think it is unnatural to divide a dialog made up of multiple turns into two sequences. However, for intent detection, the most important part of the sequence might be the last turns. Thus, cutting the start of the sequence (the prefix) rather than the end (the suffix) is probably preferable.

@NielsRogge
Contributor · commented Jul 27, 2021

Ok I get it. Perhaps this could be defined as an additional argument called truncation_side (similar to padding_side), which can be either "left" or "right".

padding_side is already implemented, as can be seen here (it is one of the attributes set when initializing a tokenizer).
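For illustration, a sketch of how the proposed argument would mirror `padding_side`. Note that `truncation_side` did not exist in `transformers` when this thread was opened; it was later added as a tokenizer attribute, so this reflects the intended usage rather than the API available at the time.

```python
from transformers import AutoTokenizer

# padding_side already exists as a tokenizer attribute:
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")

# The proposed analogue: drop tokens from the start of the sequence
# instead of the end when truncating.
tokenizer.truncation_side = "left"

encoded = tokenizer(
    "a very long input ... " * 100,
    truncation=True,
    max_length=128,
)
# With truncation_side="left", only the last 128 tokens are kept.
```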

@yuvalkirstain
Contributor Author

Perfect! Thanks for simplifying it :)

@NielsRogge
Contributor

I think that if a truncation_side is defined, it should be used in the truncate_sequences function defined here; it could then be shared by all the different truncation strategies.
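A simplified sketch (not the actual `transformers` source; `truncate_one_side` is a hypothetical helper) of how such a flag could be threaded through the truncation logic:

```python
from typing import List

def truncate_one_side(ids: List[int], num_tokens_to_remove: int,
                      truncation_side: str = "right") -> List[int]:
    """Remove num_tokens_to_remove tokens from one side of a token list."""
    if num_tokens_to_remove <= 0:
        return ids
    if truncation_side == "left":
        return ids[num_tokens_to_remove:]   # drop the prefix
    return ids[:-num_tokens_to_remove]      # drop the suffix (current behavior)
```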

@NielsRogge
Contributor

Let me implement this :)

@yuvalkirstain
Contributor Author

Thank you!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Sep 3, 2021
@LysandreJik LysandreJik reopened this Sep 3, 2021
@LysandreJik LysandreJik reopened this Sep 13, 2021
@LysandreJik LysandreJik reopened this Sep 22, 2021
@LysandreJik LysandreJik added the WIP label Sep 22, 2021