Is it right that your input data is not sentence splitted? #15

Closed
PhilipMay opened this issue Apr 9, 2021 · 2 comments

@PhilipMay

As far as I can see, you do not sentence-split your input data for pretraining. Is that correct?

You have one document per "row" and just cut it when the sequence length of the model is reached.
But how do you continue after that for the next "sentence"? With the rest of the cut sentence?

Thanks
Philip

@richarddwang
Owner

Hi @PhilipMay

for line in re.split(self.lines_delimiter, text): # for every paragraph
    if re.fullmatch(r'\s*', line): continue # empty string or string with all space characters
    if self.apply_cleaning and self.filter_out(line): continue

  • Every "row" in the original (raw) huggingface dataset contains a column 'text', which holds one document (a very long Python string).
  • ELECTRADataProcessor reads that long string, splits it by \n into sentences, and drops empty sentences by default.
  • ELECTRADataProcessor sequentially concatenates all sentences from the same document into many "sample"s, in the same way the official implementation does.
  • In conclusion, ELECTRADataProcessor.map takes a raw huggingface dataset with one document per row and outputs a preprocessed huggingface dataset with one sample per "row", where each row has the columns "input_ids", "attention_mask", ....
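
The steps above can be illustrated with a minimal sketch. This is not the code of ELECTRADataProcessor: the function name `pack_documents`, the whitespace tokenization, and the choice to drop tokens that overflow `max_length` are all assumptions made for illustration. It only shows the general idea of splitting each document into lines, skipping empty ones, and concatenating consecutive lines of the same document into samples.

```python
import re


def pack_documents(documents, max_length=6, lines_delimiter=r"\n"):
    """Hypothetical helper: split each document into lines and pack them into samples."""
    samples = []
    for text in documents:  # one document per raw "row"
        current = []  # tokens collected for the sample being built
        for line in re.split(lines_delimiter, text):
            if re.fullmatch(r"\s*", line):
                continue  # skip empty or whitespace-only lines
            # Whitespace splitting stands in for a real subword tokenizer (assumption).
            current.extend(line.split())
            if len(current) >= max_length:
                # Assumption of this sketch: tokens beyond max_length are dropped.
                samples.append(current[:max_length])
                current = []
        if current:
            # Flush leftovers at the document boundary so that
            # no sample mixes lines from two different documents.
            samples.append(current)
    return samples


docs = ["a b c\n\nd e f g\nh i j k l m", "n o\np"]
for sample in pack_documents(docs, max_length=6):
    print(sample)
```

In this sketch a sample is emitted as soon as enough tokens have accumulated, and the next sample starts with the following line of the same document. How the real processor treats the cut-off remainder should be checked in its source.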

Feel free to tag me if you still have some questions.

@PhilipMay
Author

Ahh I see. Many thanks.
