Description to use "just text files". #14
Comments
Well, I think this is the solution: https://huggingface.co/docs/datasets/loading_datasets.html#text-files
Hi PhilipMay, this repo is for my personal research, and I have no plans to add a feature for training or fine-tuning directly on text files.
I would like to understand exactly what you mean by this. As far as I understand, I could use the solution above to load a text file as data and continue pretraining from a stored checkpoint. Is that right?
Yes. I just mean that I won't add a feature to train directly on text files without huggingface/datasets.
Hey @richarddwang
would it be possible to provide a description of how to use "just text files" for pretraining?
I have a large sentence-split file with a blank line between documents and would like to domain-adapt
my ELECTRA model to my domain-specific corpus.
Your examples use these hugdatafast Arrow datasets. How do I inject my own texts?
Many thanks
Philip
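For the file layout described in this issue (one sentence per line, a blank line between documents), a minimal sketch of the grouping step might look like the following. It uses only the standard library. The helper names `iter_documents` and `build_columns` and the `"text"` column name are hypothetical, and joining each document into a single row is an assumption rather than something this repo prescribes.

```python
from typing import Dict, Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one document at a time as a list of its sentences.

    Assumes one sentence per line and one or more blank lines
    between documents.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            sentences.append(line)
        elif sentences:
            # A blank line closes the current document.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by a blank line.
        yield sentences


def build_columns(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect documents into a column dict with a single "text" column."""
    return {"text": [" ".join(doc) for doc in iter_documents(lines)]}


# Tiny in-memory stand-in for a real corpus file.
sample = [
    "First sentence of document one.\n",
    "Second sentence of document one.\n",
    "\n",
    "Only sentence of document two.\n",
]
columns = build_columns(sample)
print(columns["text"])
```

With a real corpus, an open file handle (for example `open(path, encoding="utf-8")`) can be passed in place of `sample`, and the resulting dict can be handed to `datasets.Dataset.from_dict(columns)`. Whether the pretraining code expects one row per document or one row per sentence should be checked against the repo's own preprocessing.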