As far as I can see, you do not sentence-split your input data for pretraining. Is that correct?
You have one document per "row" and just cut it when the sequence length of the model is reached.
But how do you continue after that for the next "sentence"? With the rest of the cut sentence?
Thanks
Philip
Every "row" in the original (raw) huggingface dataset contains a column 'text', which is a document (a very long python string).
ELECTRADataProcessor reads that long string, splits it by \n into sentences, and clears empty sentences by default.
It then sequentially concatenates sentences from the same document into many "sample"s, in the same way the official implementation does.
In conclusion, ELECTRADataProcessor.map takes a raw huggingface dataset with one document per row and outputs a preprocessed huggingface dataset with one sample per "row", where each row has the columns "input_ids", "attention_mask", ....
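To illustrate the idea, here is a minimal sketch of the split-and-pack logic described above. The function names and the whitespace "tokenization" are purely illustrative assumptions, not the actual ELECTRADataProcessor code, which works on real tokenizer output:

```python
# Hypothetical sketch of the preprocessing described above.
# Assumptions: whitespace tokenization stands in for the real tokenizer,
# and `max_length` counts tokens per sample.

def split_into_sentences(document: str) -> list[str]:
    # Split one document (a long string) on newlines and drop empty lines.
    return [s.strip() for s in document.split("\n") if s.strip()]

def build_samples(sentences: list[str], max_length: int) -> list[list[str]]:
    # Greedily pack consecutive sentences from the same document into
    # samples of at most `max_length` tokens; a sentence that would
    # overflow the current sample starts the next one, so nothing is lost.
    samples, current = [], []
    for sent in sentences:
        tokens = sent.split()
        if current and len(current) + len(tokens) > max_length:
            samples.append(current)
            current = []
        current.extend(tokens[:max_length])  # truncate overly long sentences
    if current:
        samples.append(current)
    return samples

doc = "First sentence here.\n\nSecond sentence is a bit longer.\nThird."
samples = build_samples(split_into_sentences(doc), max_length=8)
```

In this sketch a sentence is never cut mid-way (unless it alone exceeds `max_length`); it simply moves whole into the next sample, which matches the question about how the "rest" of a document continues.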
Feel free to tag me if you still have some questions.