
num_proc not specified in .map functions #47

Open
sandstromviktor opened this issue Dec 14, 2023 · 3 comments · May be fixed by #50

@sandstromviktor

In trainer.py, there are three .map calls where num_proc is not specified.
It should be possible to set this, since multiprocessing speeds up tokenization, spreading, and normalization by a significant amount.

Example from trainer.py, line 227:

        with tokenizer.entity_tracker(split=dataset_name):
            dataset = dataset.map(
                tokenizer,
                batched=True,
                remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
                desc=f"Tokenizing the {dataset_name} dataset",
                fn_kwargs={"return_num_words": is_evaluate},
                num_proc=4,  # Added this line; the value should be user-configurable
            )

This sped up tokenization by about 4 times.
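
For reference, here is a minimal self-contained illustration of what num_proc does on datasets.Dataset.map. The toy data and the whitespace-splitting tokenize function below are placeholders for this sketch, not the project's actual tokenizer:

    from datasets import Dataset

    def tokenize(batch):
        # Stand-in for the real tokenizer: whitespace splitting only.
        return {"tokens": [text.split() for text in batch["text"]]}

    if __name__ == "__main__":
        dataset = Dataset.from_dict({"text": ["an example sentence"] * 100_000})

        # Single process: the current behaviour when num_proc is left unset.
        dataset.map(tokenize, batched=True)

        # Four worker processes, as in the snippet above.
        dataset.map(tokenize, batched=True, num_proc=4)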

@tomaarsen (Owner)

Woah, good call! I'd love to look into this.

tomaarsen linked a pull request on Jan 9, 2024 that will close this issue
@sandstromviktor (Author)

I think this can be set using the Hugging Face TrainingArguments dataloader_num_workers argument. If not, please tell me why :)
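
A minimal sketch of how that could look, assuming the trainer keeps its TrainingArguments on self.args the way transformers.Trainer does. One caveat: dataloader_num_workers defaults to 0, while datasets' map() only accepts None or a positive integer for num_proc, so the falsy 0 is converted to None below:

    with tokenizer.entity_tracker(split=dataset_name):
        dataset = dataset.map(
            tokenizer,
            batched=True,
            remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
            desc=f"Tokenizing the {dataset_name} dataset",
            fn_kwargs={"return_num_words": is_evaluate},
            # 0 workers means "no multiprocessing" in TrainingArguments,
            # but map() expects None for that case.
            num_proc=self.args.dataloader_num_workers or None,
        )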

@tomaarsen (Owner)

That makes sense, I should've thought of that!
