Add tests for tokenization truncation. Does it truncate to the left or to the right? #5

Open
thejaminator opened this issue Feb 26, 2023 · 0 comments
Labels: good first issue (Good for newcomers)

Comments

Owner

thejaminator commented Feb 26, 2023

Our function batch_tokenize_gpt2 truncates the text if it's too long.
I don't remember whether the default Hugging Face behavior truncates tokens on the left or on the right.

IMO we should always truncate on the right, i.e. drop tokens from the end of the text.
That way, a text that hits the context limit does not end with an EOS token, since the EOS token is cut off along with the rest of the overflow.

For example, suppose you have a context length of 3 words and you are tokenizing "The quick brown fox".
It should get tokenized to

The quick brown

rather than

quick brown EOS

And if you tokenize "Hello there", which fits within the context length,
it doesn't get truncated and becomes

Hello there EOS

We should write some tests to enforce this behavior for batch_tokenize_gpt2.
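
As a starting point, here is a rough sketch of what those tests could look like. It does not call the real batch_tokenize_gpt2, whose signature isn't shown in this issue. Instead it uses a hypothetical stand-in, batch_tokenize_toy, built on a toy whitespace tokenizer with a made-up "<EOS>" marker, just to pin down the expected right-truncation behavior from the examples above. The function names, the max_length parameter, and the EOS string are all assumptions for illustration.

```python
from typing import List

# Hypothetical stand-in for GPT-2's end-of-text token.
EOS = "<EOS>"


def tokenize_toy(text: str, max_length: int) -> List[str]:
    """Toy whitespace tokenizer: append EOS, then keep only the first
    max_length tokens, so any overflow (including EOS) is cut from the right."""
    tokens = text.split() + [EOS]
    return tokens[:max_length]


def batch_tokenize_toy(texts: List[str], max_length: int) -> List[List[str]]:
    """Hypothetical stand-in for batch_tokenize_gpt2 (real signature unknown)."""
    return [tokenize_toy(text, max_length) for text in texts]


def test_long_text_is_truncated_on_the_right() -> None:
    # Four words plus EOS do not fit in 3 slots: keep the first three words.
    result = batch_tokenize_toy(["The quick brown fox"], max_length=3)
    assert result == [["The", "quick", "brown"]]


def test_short_text_keeps_its_eos() -> None:
    # Two words plus EOS fit exactly, so nothing is dropped.
    result = batch_tokenize_toy(["Hello there"], max_length=3)
    assert result == [["Hello", "there", EOS]]


if __name__ == "__main__":
    test_long_text_is_truncated_on_the_right()
    test_short_text_keeps_its_eos()
```

The real tests would swap batch_tokenize_toy for batch_tokenize_gpt2 and compare token ids rather than words. If batch_tokenize_gpt2 wraps a Hugging Face tokenizer (an assumption), the tokenizer's truncation_side attribute is the setting these tests would effectively be checking.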

thejaminator added the "good first issue" label Feb 26, 2023