Our function `batch_tokenize_gpt2` truncates the text if it's too long.
I forget whether the default Hugging Face behavior truncates tokens on the left or on the right.
IMO we should always truncate on the right.
That way, not every text that hits the context limit ends with an EOS token.
e.g. suppose you have a context length of 3 words and you are tokenizing `The quick brown fox`.
It should get tokenized to
`The quick brown`
rather than
`quick brown EOS`
And if you tokenize `Hello there`, it doesn't get truncated and becomes
`Hello there EOS`
We should write some tests to enforce this behavior for `batch_tokenize_gpt2`.
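A minimal sketch of the intended rule, to pin down what the tests should assert. This uses whitespace splitting as a stand-in for GPT-2 BPE and a hypothetical `truncate_right_then_eos` helper, not the real `batch_tokenize_gpt2` signature; the real implementation would set the Hugging Face tokenizer's `truncation_side` to `"right"` (worth double-checking the default) and operate on token ids:

```python
# Sentinel EOS string standing in for GPT-2's <|endoftext|> token.
EOS = "<|endoftext|>"

def truncate_right_then_eos(text: str, max_len: int) -> list[str]:
    """Illustrative helper (not the real batch_tokenize_gpt2).

    Keeps the leftmost tokens (right-side truncation) and appends EOS
    only when the text plus EOS fits within max_len, so truncated
    texts do NOT end with an EOS token.
    """
    tokens = text.split()  # stand-in for GPT-2 BPE tokenization
    if len(tokens) + 1 > max_len:
        return tokens[:max_len]  # right-truncated, no EOS
    return tokens + [EOS]        # fits: terminate with EOS

# With a 3-token context, the left side of the text survives:
assert truncate_right_then_eos("The quick brown fox", 3) == ["The", "quick", "brown"]
# A short text is untouched and gets an EOS:
assert truncate_right_then_eos("Hello there", 3) == ["Hello", "there", EOS]
```

The asserts here are essentially the test cases from the example above, so they could be lifted directly into the test file once wired up to the real tokenizer.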