Is it right that your input data is not sentence split? #15

PhilipMay · 2021-04-09T07:47:56Z

As far as I can see you do not sentence split your input data for pretraining. Is that correct?

You have one document per "row" and just cut it when the sequence length of the model is reached.
But how do you continue after that for the next "sentence"? With the rest of the cut sentence?

Thanks
Philip

richarddwang · 2021-04-09T08:14:20Z

Hi @PhilipMay

electra_pytorch/_utils/utils.py

Lines 137 to 140 in f4940c7

    for line in re.split(self.lines_delimiter, text):  # for every paragraph
      if re.fullmatch(r'\s*', line): continue  # empty string or string with all space characters
      if self.apply_cleaning and self.filter_out(line): continue

Every "row" in the original (raw) huggingface dataset contains a column 'text', which holds one document (a very long Python string).
ELECTRADataProcessor reads that long string, splits it on \n into sentences, and drops empty sentences by default.
ELECTRADataProcessor then sequentially concatenates all sentences from the same document into many "samples", in the same way the official implementation does.
In conclusion, ELECTRADataProcessor.map takes a raw huggingface dataset with one document per row and outputs a preprocessed huggingface dataset with one sample per row, where each row has the columns "input_ids", "attention_mask", ....
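
To make the split-then-concatenate idea concrete, here is a minimal sketch. It is not the code from this repository: `split_document`, `pack_document`, the whitespace tokenizer, and the choice to truncate a sample at `max_length` and start the next sample from the following line are all assumptions made for illustration, so check the behaviour at the cut point against `ELECTRADataProcessor` itself.

```python
import re


def split_document(text, lines_delimiter=r"\n"):
    """Split one document into lines, skipping empty or whitespace-only ones."""
    return [
        line
        for line in re.split(lines_delimiter, text)
        if not re.fullmatch(r"\s*", line)
    ]


def pack_document(text, max_length, tokenize=str.split):
    """Concatenate the lines of one document into samples (hypothetical helper).

    Assumption: once a sample reaches max_length tokens it is truncated to
    max_length and emitted, and the next sample starts with the next line.
    """
    samples, current = [], []
    for line in split_document(text):
        current.extend(tokenize(line))
        if len(current) >= max_length:
            samples.append(current[:max_length])
            current = []
    if current:  # leftover lines at the end of the document
        samples.append(current)
    return samples


doc = "a b c\n\n   \nd e\nf g h i\nj"
samples = pack_document(doc, max_length=4)
print(samples)
```

With this toy document, the empty and whitespace-only lines are skipped, and the first two lines are merged into one sample before being cut at four tokens. A real pipeline would use the model's tokenizer instead of `str.split` and would also add special tokens and build "attention_mask".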

Feel free to tag me if you still have some questions.

PhilipMay · 2021-04-09T08:20:41Z

Ahh I see. Many thanks.

richarddwang closed this as completed Apr 9, 2021
