Sequence length too long for ELECTRADataProcessor. #22

Closed
PhilipMay opened this issue Apr 12, 2021 · 3 comments

@PhilipMay

I am using ELECTRADataProcessor to tokenize my corpus for pretraining (like your example shows).

I am getting the following message:

Token indices sequence length is longer than the specified maximum sequence length for this model (642 > 512). Running this sequence through the model will result in indexing errors

My question: can this be ignored because the tokenizer cuts off the text, or will it cause a crash during training? How can it be avoided?

Thanks again
Philip
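
For reference, the warning can be reproduced with the HuggingFace tokenizer alone, independent of ELECTRADataProcessor; a minimal sketch (the checkpoint name is chosen only for illustration):

```python
from transformers import ElectraTokenizerFast

# Checkpoint chosen for illustration; any ELECTRA tokenizer behaves the same way.
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

long_text = "this sentence repeats over and over " * 200  # well beyond 512 tokens

# Tokenizing without truncation triggers the warning as soon as the output
# exceeds tokenizer.model_max_length (512 for this checkpoint).
input_ids = tokenizer(long_text)["input_ids"]
print(len(input_ids), tokenizer.model_max_length)
```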

@richarddwang
Owner

This is because HuggingFace's tokenizer internally records the maximum length that its corresponding model (not the tokenizer itself) can handle, and checks the length when tokenizing. So this can safely be ignored.

As for how to avoid it, you can ask on HuggingFace's forum.
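
For completeness, when calling a HuggingFace tokenizer directly (outside ELECTRADataProcessor), the warning can be avoided in the usual ways; a sketch, not specific to this repo:

```python
from transformers import ElectraTokenizerFast, logging

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
long_text = "this sentence repeats over and over " * 200

# Option 1: truncate explicitly so the output never exceeds the model's limit.
input_ids = tokenizer(long_text, truncation=True, max_length=512)["input_ids"]
assert len(input_ids) <= 512  # no warning is emitted

# Option 2: lower transformers' log verbosity (coarse: this hides other warnings too).
logging.set_verbosity_error()
```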

@PhilipMay
Author

So this can safely be ignored.

So the batches you create are not more than 512 tokens in length, right?

@richarddwang
Owner

Yes, no longer than the max length, which is 128 at the small scale.
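
If you want to double-check, the processed examples can be inspected directly; a minimal sketch, assuming the dataset produced by ELECTRADataProcessor exposes an input_ids column (the variable and column names here are assumptions, not the repo's API):

```python
# processed_dset: the dataset returned by ELECTRADataProcessor (name assumed for illustration).
longest = max(len(example["input_ids"]) for example in processed_dset)
print(longest)  # expected to be <= 128 at the small scale
```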
