Description to use "just text files". #14

Closed
PhilipMay opened this issue Apr 9, 2021 · 4 comments

Comments

@PhilipMay

Hey @richarddwang
would it be possible to provide a description of how to use "just text files" for pretraining?
I have a large sentence-split file with a blank line between documents and would like to domain-adapt
my ELECTRA model to my domain-specific corpus.

Your examples use these hugdatafast arrow datasets. How do I inject my own texts?

Many thanks
Philip

@PhilipMay
Author

Well - I think this is the solution: https://huggingface.co/docs/datasets/loading_datasets.html#text-files
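For anyone with the same file layout, here is a minimal sketch of that approach. It assumes one sentence per line and blank lines between documents, as described above. The names `read_documents`, `build_datasets`, and `corpus.txt` are hypothetical and not part of this repo or of hugdatafast. The `load_dataset("text", ...)` call is the one from the linked docs and, as far as I know, yields one row per line, so the helper also builds a one-row-per-document variant in case document boundaries matter.

```python
import tempfile
from pathlib import Path


def read_documents(path):
    """Yield one string per document from a sentence-per-line text file.

    Assumes documents are separated by one or more blank lines.
    """
    sentences = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                sentences.append(line)
            elif sentences:
                # A blank line closes the current document.
                yield " ".join(sentences)
                sentences = []
    if sentences:
        # The last document may not be followed by a blank line.
        yield " ".join(sentences)


def build_datasets(path):
    """Return (per_document, per_line) datasets.

    Requires the huggingface/datasets package (`pip install datasets`),
    which is imported lazily so the rest of the sketch runs without it.
    """
    from datasets import Dataset, load_dataset

    # One row per document, built from the blank-line grouping above.
    per_document = Dataset.from_dict({"text": list(read_documents(path))})
    # The loader from the linked docs, reading the file line by line.
    per_line = load_dataset("text", data_files=str(path))["train"]
    return per_document, per_line


# Tiny stand-in corpus (made-up content) so the sketch runs as-is.
corpus_path = Path(tempfile.mkdtemp()) / "corpus.txt"
corpus_path.write_text(
    "First sentence of doc one.\n"
    "Second sentence of doc one.\n"
    "\n"
    "Only sentence of doc two.\n",
    encoding="utf-8",
)

documents = list(read_documents(corpus_path))
print(documents)
```

With `datasets` installed, `per_document, per_line = build_datasets(corpus_path)` should give two datasets with a single `text` column each. How they then need to be tokenized and chunked to match what the pretraining script expects is not covered here.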

@richarddwang
Owner

Hi PhilipMay

This is a repo for my personal research, and there is no plan to add a feature for training or fine-tuning on text files only.
And yes, the link you pasted is currently the only solution.

@PhilipMay
Author

> there is no plan to add a feature for training or fine-tuning on text files only.

I would like to understand exactly what you mean by this. As far as I understand, I could use the solution above to load a text file as data and continue pretraining from a stored checkpoint. Is that right?

@richarddwang
Owner

Yes, I just mean I won't add a feature to train directly on text files without huggingface/datasets.
