Add tests for tokenization truncation. Does it truncate to the left or to the right? #5

thejaminator · 2023-02-26T11:43:55Z

In our function batch_tokenize_gpt2, the text is truncated if it's too long.
I forget whether the default Hugging Face behavior truncates the left tokens or the right tokens.

IMO we should always truncate the right tokens.
This way, a text that gets cut off at the context limit does not end with an EOS token.

For example, suppose the context length is 3 words and you are tokenizing "The quick brown fox".
It should get tokenized to

The quick brown

rather than

quick brown EOS

And if you tokenize "Hello there",
it isn't truncated and becomes

Hello there EOS

We should write some tests to enforce this behavior for batch_tokenize_gpt2.
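
A rough sketch of what such tests could look like, using the two examples above. It does not call batch_tokenize_gpt2, since its signature isn't shown here; toy_tokenize is a hypothetical whitespace-based stand-in and the literal "EOS" string stands in for the real end-of-sequence token. The real tests would make the same assertions against batch_tokenize_gpt2's output.

```python
from typing import List

EOS = "EOS"  # placeholder for the real end-of-sequence token (assumption)


def toy_tokenize(text: str, context_length: int) -> List[str]:
    """Hypothetical stand-in: split on whitespace, append EOS,
    then keep only the first `context_length` tokens (right truncation)."""
    tokens = text.split() + [EOS]
    return tokens[:context_length]


def test_long_text_is_truncated_on_the_right():
    # The tail (including EOS) is dropped, the head is kept.
    assert toy_tokenize("The quick brown fox", context_length=3) == [
        "The",
        "quick",
        "brown",
    ]


def test_truncated_text_does_not_end_with_eos():
    assert EOS not in toy_tokenize("The quick brown fox", context_length=3)


def test_short_text_keeps_its_eos():
    # Two words plus EOS fit exactly in a context of 3, so nothing is cut.
    assert toy_tokenize("Hello there", context_length=3) == ["Hello", "there", EOS]
```

As far as I know, Hugging Face tokenizers expose a truncation_side setting that defaults to "right", but it is worth asserting the behavior explicitly rather than relying on the default.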


thejaminator assigned thejaminator and unassigned thejaminator Feb 26, 2023

thejaminator added the good first issue Good for newcomers label Feb 26, 2023
