Add support for seed in DataCollatorForLanguageModeling
#36497
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the
@Rocketknight1 this won't pass a few data collator tests, but I submitted a PR to fix these (#36457)
@Rocketknight1, this PR should be good to go given the fix above now
On reading, this looks good to me - it adds a modest amount of code, but it's well-structured and everything is there for a reason.
The main thing I wanted to check was that there wouldn't be any regression if `self.seed = None`, but I believe that in this case `self.generator` will always be `None` as well, and therefore behaviour will be unchanged from before. The only changes occur when `self.seed` is set, and previously this was unsupported at all, which means this PR is a strict improvement.
Still, this is big enough that we probably need core maintainer approval, so cc @ArthurZucker @Cyrilvallez, but since I own the data collator code you can just leave it to me to finalize the review and merge this if you want!
Yep @Rocketknight1, this is exactly the case! In the post_init method, `self.generator` is initialized to `None`, and there wouldn't be any changes compared to before. About the core maintainer review - since you're most familiar with the code, I'll leave it up to you! I wouldn't mind their inputs as well.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
hi @capemox, it's up to the core maintainers to decide if they want to review this themselves or not!
Whoops nvm, thought it was directed at me :D
My bad lol, it did look like I was asking you to decide
Alright, LGTM! About the tests, it looks like it's mostly copy/paste between frameworks, but as it seems to be the way other tests are as well, I'll let you judge @Rocketknight1! Feel free to merge when you're happy!
Force-pushed from a846302 to 6e5481e
I'm happy with it at this point! Doing a rebase to hopefully clear the CI and then we can merge
…te tests for verifying behaviour.
Force-pushed from 6e5481e to ef9a03b
Merged, and thank you for the PR!
My pleasure!
What does this PR do?
This PR adds support for setting a seed in the `DataCollatorForLanguageModeling` class. This helps with reproducibility when generating masks for masked language modeling (MLM). The issue was approved by @Rocketknight1 (#36357).

Currently, reproducibility can be ensured by calling `transformers.set_seed()`. However, that function sets the seed of the global RNG for PyTorch, NumPy, etc. This means that setting a global seed can affect pseudo-random functions outside the scope of the collator, such as model parameter initialization, and conversely that changes in the script outside the collator can alter the masking. Instead, it is preferable to create generator objects that can be passed around to different functions; this is also considered good practice. What my PR does is:
- Adds a `seed` parameter to `DataCollatorForLanguageModeling`
- Creates a generator object based on the `return_tensors` parameter

The generator object is scoped to the collator class, so it won't affect pseudo-random functions outside the class, and vice-versa.
One important factor to consider is the use of multiple workers in the collator function, as PyTorch's `DataLoader` does. PyTorch has documentation regarding this, whereby we set a different seed for each worker, given by `shared_seed + worker_id`. Otherwise the workers' RNG state would be cloned, and each worker would mask the input in exactly the same manner, which is undesirable. A critical part of PyTorch's `DataLoader` is that, from within a worker, it is possible to access the worker's `id` (important for setting the worker seed). Because of this constraint, this PR only supports multi-worker scenarios with PyTorch's `DataLoader`. With the `seed` set, if the code detects that the collator is running in a multi-processing scenario and the worker information is unavailable, an error is thrown.

The algorithm for creating the generator object is:
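The listing itself is not reproduced here, but the seed-derivation behaviour described above can be sketched as follows. The helper name is hypothetical and NumPy stands in for the framework-specific generators; in the real collator, `torch.utils.data.get_worker_info()` supplies the worker object (and returning `None` in a multi-processing context triggers the error mentioned above):

```python
import numpy as np

def make_generator(seed, worker_info=None):
    """Sketch of the per-worker seed derivation (hypothetical helper).

    In the main process the generator is seeded with `seed` directly;
    inside a DataLoader worker, each worker gets `seed + worker_id` so
    the workers do not all produce identical masks.
    """
    if worker_info is not None:
        # Derive a distinct, reproducible seed per worker.
        return np.random.default_rng(seed + worker_info.id)
    return np.random.default_rng(seed)
```

With this scheme, re-running the same script with the same seed and the same number of workers reproduces the masking exactly, while workers within one run still diverge from each other.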
Tests have also been written to verify this behaviour, in `tests/trainer/test_data_collator.py`.

These changes were developed on Python 3.12.8. The dependencies were installed with `pip install -e ".[dev]"`, along with:
Fixes #36357
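The kind of reproducibility the new tests check can be illustrated with a simplified, self-contained MLM masking routine. This is an assumption-laden sketch, not the collator's actual code: the function name, token ids, and vocabulary size are made up, and NumPy stands in for the framework tensors, but the 80% `[MASK]` / 10% random / 10% keep split mirrors the standard MLM rule:

```python
import numpy as np

def mask_tokens(ids, rng, mask_token_id=103, vocab_size=1000, mlm_prob=0.15):
    """Simplified MLM masking driven entirely by the passed-in generator."""
    ids = ids.copy()
    masked = rng.random(ids.shape) < mlm_prob                       # positions to predict
    replace = masked & (rng.random(ids.shape) < 0.8)                # 80% -> [MASK]
    randomize = masked & ~replace & (rng.random(ids.shape) < 0.5)   # 10% -> random token
    ids[replace] = mask_token_id
    ids[randomize] = rng.integers(0, vocab_size, ids.shape)[randomize]
    return ids, masked

# Same seed -> identical masks, which is the property the tests verify.
tokens = np.arange(200)
out1, m1 = mask_tokens(tokens, np.random.default_rng(42))
out2, m2 = mask_tokens(tokens, np.random.default_rng(42))
assert (out1 == out2).all() and (m1 == m2).all()
```

Because all randomness flows through the generator argument, two collators built with the same seed produce byte-identical masking, regardless of any global RNG activity elsewhere in the script.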
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Rocketknight1 should be the right person to review this.