Description to use "just text files". #14

PhilipMay · 2021-04-09T07:21:44Z

Hey @richarddwang
would it be possible to provide a description of how to use "just text files" for pretraining?
I have a large sentence-split file with a blank line between documents and would like to domain-adapt
my ELECTRA model to my domain-specific corpus.

Your examples use these hugdatafast arrow datasets. How do I inject my own texts?

Many thanks
Philip

PhilipMay · 2021-04-09T07:53:12Z

Well - I think this is the solution: https://huggingface.co/docs/datasets/loading_datasets.html#text-files

richarddwang · 2021-04-09T07:59:26Z

Hi PhilipMay

This is a repo for my personal research, and there is no plan to add a feature for training or fine-tuning on text files alone.
And yes, the link you pasted is currently the only solution.

PhilipMay · 2021-04-09T08:26:26Z

"there is no plan of adding feature to train or finetune on only text files."

I would like to understand exactly what you mean by this. As far as I understand, I could use the solution above to load a text file as data and continue pretraining from a stored checkpoint. Is that right?

richarddwang · 2021-04-09T08:28:09Z

Yes, I just mean I won't add a feature to train directly on text files without huggingface/datasets.
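
For readers landing here, a minimal sketch of the approach discussed above, not code from this repo. It assumes the file layout described in the first post (one sentence per line, blank lines between documents). The names `iter_documents`, `build_dataset`, and `corpus.txt` are hypothetical, and the `"text"` column name is an assumption; check what the preprocessing step you plug this into actually expects. As far as I know, the text loader from the linked page treats each line as a separate example by default, so this sketch groups sentences into documents first and then builds the dataset with `Dataset.from_dict`.

```python
from pathlib import Path
from typing import Iterator, List


def iter_documents(path: str) -> Iterator[List[str]]:
    """Yield one document at a time as a list of sentences.

    Assumes one sentence per line and one or more blank lines
    between documents.
    """
    sentences: List[str] = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                sentences.append(line)
            elif sentences:
                # A blank line closes the current document.
                yield sentences
                sentences = []
    # Emit the last document if the file does not end with a blank line.
    if sentences:
        yield sentences


def build_dataset(path: str):
    """Build a huggingface/datasets Dataset with one row per document."""
    # Imported lazily; requires `pip install datasets`.
    from datasets import Dataset

    texts = [" ".join(doc) for doc in iter_documents(path)]
    # "text" is an assumed column name, not something this repo prescribes.
    return Dataset.from_dict({"text": texts})
```

Calling `build_dataset("corpus.txt")` should then give a dataset object that can stand in for the downloaded corpora in the examples; loading the stored checkpoint before training is a separate step that depends on the training script.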

richarddwang closed this as completed Apr 9, 2021
