I am using ELECTRADataProcessor to tokenize my corpus for pretraining (as your example shows).
I am getting the following message:
Token indices sequence length is longer than the specified maximum sequence length for this model (642 > 512). Running this sequence through the model will result in indexing errors
My question: can this warning be ignored because the tokenizer cuts off the text, or will it cause a crash during training? How can it be avoided?
Thanks again
Philip
This is because Hugging Face's tokenizer internally records the maximum sequence length that its corresponding model (not the tokenizer itself) can handle, and checks the length whenever it tokenizes. So the warning should be safe to ignore.
As for how to avoid it, you can ask on Hugging Face's forum.
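For reference, here is a minimal sketch of where the warning comes from at the tokenizer level and how explicit truncation silences it. This assumes a plain `transformers` tokenizer call; the checkpoint name and text are placeholders, and ELECTRADataProcessor's internal calls may differ:

```python
from transformers import ElectraTokenizerFast

# Placeholder checkpoint; substitute whichever ELECTRA tokenizer you use.
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

long_text = "a sentence repeated many times " * 200  # well over 512 tokens

# Encoding without truncation triggers the warning: the tokenizer can encode
# any length, but it records model_max_length (512 here) on the model's behalf.
ids = tokenizer.encode(long_text)
print(len(ids), ">", tokenizer.model_max_length)

# Explicit truncation silences the warning and guarantees a model-safe length.
ids = tokenizer.encode(long_text, truncation=True,
                       max_length=tokenizer.model_max_length)
print(len(ids))  # <= 512
```

Whether you can pass truncation options through ELECTRADataProcessor depends on how it calls the tokenizer internally; the sketch only illustrates the tokenizer-level behavior the warning refers to.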