easy HF dataset doremi? #10

Open
brando90 opened this issue Aug 21, 2023 · 2 comments

Comments

@brando90

Is there a Hugging Face-compatible dataset I could use? Currently I load C4 like this:

dataset = load_dataset("c4", "en", streaming=True, split="train").with_format("torch")
remove_columns = ["text", "timestamp", "url"]

Could I instead do something like this:

dataset = load_dataset("doremi", "en", streaming=True, split="train").with_format("torch")
remove_columns = ["text", "timestamp", "url"]

and thus automatically use the DoReMi weights?

@brando90
Author

@sangmichaelxie
Owner

We don't currently have such a dataset on Hugging Face, but we will let you know if we decide to add one! One issue is that the weights are at the chunk level, meaning that we weight the sampling probability of the tokenized examples (not the raw documents).
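
To illustrate what chunk-level weighting means, here is a minimal sketch in plain Python. It is not the DoReMi implementation: the helper names `chunk_stream` and `sample_chunks`, the domain names, and the weight values are all hypothetical and chosen only for illustration. Each domain is assumed to be already tokenized into one long list of token ids, which is cut into fixed-length chunks, and the domain weights are then applied when drawing chunks rather than raw documents.

```python
import random
from typing import Dict, Iterator, List, Tuple


def chunk_stream(token_ids: List[int], chunk_len: int) -> Iterator[List[int]]:
    """Yield fixed-length chunks of a tokenized domain, dropping any remainder."""
    for start in range(0, len(token_ids) - chunk_len + 1, chunk_len):
        yield token_ids[start:start + chunk_len]


def sample_chunks(
    domain_chunks: Dict[str, List[List[int]]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, List[int]]]:
    """Draw n chunks: pick a domain by weight, then a chunk uniformly within it."""
    rng = random.Random(seed)
    names = list(domain_chunks)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(domain_chunks[name])))
    return batch


# Hypothetical toy data: integers stand in for token ids.
tokenized = {"web": list(range(100)), "code": list(range(1000, 1050))}
chunks = {name: list(chunk_stream(ids, 8)) for name, ids in tokenized.items()}

# Hypothetical weights, not the published DoReMi values.
weights = {"web": 0.7, "code": 0.3}
batch = sample_chunks(chunks, weights, n=5)
```

Because a domain is chosen per chunk, a long document contributes more chunks to its domain's pool, but the share of each domain in the sampled stream is still set by its weight. For raw, untokenized streaming datasets, `datasets.interleave_datasets` accepts a `probabilities` argument, but that mixes at the example (document) level, so it would only approximate chunk-level weights.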
