Add support for seed in DataCollatorForLanguageModeling
#36497
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the
@Rocketknight1 this won't pass a few data collator tests, but I submitted a PR to fix these (#36457)
@Rocketknight1, this PR should be good to go given the fix above now
On reading, this looks good to me - it adds a modest amount of code, but it's well-structured and everything is there for a reason.
The main thing I wanted to check was that there wouldn't be any regression if `self.seed = None`, but I believe that in this case `self.generator` will always be `None` as well, and therefore behaviour will be unchanged from before. The only changes occur when `self.seed` is set, and previously this was unsupported at all, which means this PR is a strict improvement.
Still, this is big enough that we probably need core maintainer approval, so cc @ArthurZucker @Cyrilvallez, but since I own the data collator code you can just leave it to me to finalize the review and merge this if you want!
Yep @Rocketknight1, this is exactly the case! In the post_init method, `self.generator` is initialized to `None`, and there wouldn't be any changes compared to before. About the core maintainer review - since you're most familiar with the code, I'll leave it up to you! I wouldn't mind their inputs as well.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
hi @capemox, it's up to the core maintainers to decide if they want to review this themselves or not!
Whoops nvm, thought it was directed at me :D
My bad lol, it did look like I was asking you to decide
Alright, LGTM! About the tests, it looks like it's mostly copy/paste between frameworks, but as it seems to be the way other tests are as well, I'll let you judge @Rocketknight1! Feel free to merge when you're happy!
Force-pushed from a846302 to 6e5481e
I'm happy with it at this point! Doing a rebase to hopefully clear the CI and then we can merge
…te tests for verifying behaviour.
Force-pushed from 6e5481e to ef9a03b
Merged, and thank you for the PR!
My pleasure!
What does this PR do?
This PR adds support for setting a seed in the `DataCollatorForLanguageModeling` class. This helps with reproducibility when generating masks for masked language modeling (MLM). The issue was approved by @Rocketknight1 (#36357).

Currently, reproducibility can be ensured by calling `transformers.set_seed()`. However, that function sets the seed of the global RNG for PyTorch, NumPy, etc. This means that setting a global seed can affect pseudo-random functions outside the scope of the collator, such as model parameter initialization, and conversely that changes in the script outside the collator can alter the masking. Instead, it is preferable to create generator objects that can be passed around to different functions; this is also considered good practice. What my PR does is:
- Adds a `seed` parameter to `DataCollatorForLanguageModeling`
- Creates a generator object based on the `return_tensors` parameter

The generator object is scoped to the collator class, so it won't affect pseudo-random functions outside the class, and vice-versa.
One important factor to consider is the use of multiple workers in the collator function, as PyTorch's `DataLoader` does. PyTorch has documentation regarding this, whereby we set a different seed for each worker, given by `shared_seed + worker_id`. Otherwise the workers' RNG state would be cloned, and each worker would mask the input in exactly the same manner, which is undesirable. A critical part of PyTorch's `DataLoader` is that, from within a worker, it is possible to access the worker's `id` (important for setting the worker seed). Because of this constraint, this PR only supports multi-worker scenarios with PyTorch's `DataLoader`. With the `seed` set, if the code detects that the collator is running in a multi-processing scenario and the worker information is unavailable, an error is thrown.

The algorithm for creating the generator object is:
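The listing itself is not reproduced here, but the seed-derivation behaviour described above can be sketched as follows. The helper name is hypothetical and NumPy stands in for the framework-specific generators; in the real collator, `torch.utils.data.get_worker_info()` supplies the worker object (and returning `None` in a multi-processing context triggers the error mentioned above):

```python
import numpy as np

def make_generator(seed, worker_info=None):
    """Sketch of the per-worker seed derivation (hypothetical helper).

    In the main process the generator is seeded with `seed` directly;
    inside a DataLoader worker, each worker gets `seed + worker_id` so
    the workers do not all produce identical masks.
    """
    if worker_info is not None:
        # Derive a distinct, reproducible seed per worker.
        return np.random.default_rng(seed + worker_info.id)
    return np.random.default_rng(seed)
```

With this scheme, re-running the same script with the same seed and the same number of workers reproduces the masking exactly, while workers within one run still diverge from each other.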
Tests have also been written to verify this behaviour, in `tests/trainer/test_data_collator.py`.

These changes were developed on Python 3.12.8. The dependencies were installed with `pip install -e ".[dev]"`, along with:
Fixes #36357
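The kind of reproducibility the new tests check can be illustrated with a simplified, self-contained MLM masking routine. This is an assumption-laden sketch, not the collator's actual code: the function name, token ids, and vocabulary size are made up, and NumPy stands in for the framework tensors, but the 80% `[MASK]` / 10% random / 10% keep split mirrors the standard MLM rule:

```python
import numpy as np

def mask_tokens(ids, rng, mask_token_id=103, vocab_size=1000, mlm_prob=0.15):
    """Simplified MLM masking driven entirely by the passed-in generator."""
    ids = ids.copy()
    masked = rng.random(ids.shape) < mlm_prob                       # positions to predict
    replace = masked & (rng.random(ids.shape) < 0.8)                # 80% -> [MASK]
    randomize = masked & ~replace & (rng.random(ids.shape) < 0.5)   # 10% -> random token
    ids[replace] = mask_token_id
    ids[randomize] = rng.integers(0, vocab_size, ids.shape)[randomize]
    return ids, masked

# Same seed -> identical masks, which is the property the tests verify.
tokens = np.arange(200)
out1, m1 = mask_tokens(tokens, np.random.default_rng(42))
out2, m2 = mask_tokens(tokens, np.random.default_rng(42))
assert (out1 == out2).all() and (m1 == m2).all()
```

Because all randomness flows through the generator argument, two collators built with the same seed produce byte-identical masking, regardless of any global RNG activity elsewhere in the script.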
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Rocketknight1 should be the right person to review this.