reseed all Generators in Dataloader's _worker_loop() #107034
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107034
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures as of commit f73cc30.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# We also need to create a generator_seed that depends on the current generator state, otherwise
# all Generator instances within a given worker would have the same RNG.
generator_seed = torch.empty((), dtype=torch.int64).random_(generator=generator).item() + seed
generator.manual_seed(generator_seed)
Technically this is a change of behaviour: in `main`, the `Generator` instances have the same RNG across workers; now they don't.
I don't think users rely on, or should rely on, this behaviour anyway. The only use I can think of where users would want the same RNG across workers is using the Generator to shuffle the dataset: when shuffling, you want all workers to shuffle in the same way. But for those who need that, I think it's fair to say they should rely on `worker_init_fn` anyway.
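For readers who do need identical shuffling across workers, a minimal sketch of the `worker_init_fn` route (the dataset class, function name, attribute, and seed value below are illustrative assumptions, not part of this PR):

```python
import torch
from torch.utils.data import DataLoader, Dataset

SHARED_SHUFFLE_SEED = 0  # hypothetical seed shared by every worker


class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx


def seed_shuffle_generator(worker_id):
    # Give every worker an identically-seeded Generator, so any shuffling
    # done with it inside the dataset is the same across workers,
    # independent of the per-worker re-seeding this PR introduces.
    info = torch.utils.data.get_worker_info()
    info.dataset.shuffle_generator = torch.Generator()
    info.dataset.shuffle_generator.manual_seed(SHARED_SHUFFLE_SEED)


loader = DataLoader(ToyDataset(), num_workers=2, worker_init_fn=seed_shuffle_generator)
```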
This is technically BC-breaking, so we should properly document it (please add a small paragraph in the description for the release notes). But I think it's ok to do, yes.
Small things but the approach sounds good to me.
Will let @ezyang give his opinion as well
for from_global, from_g1, from_g2 in dl:
    # Assert RNG of all Generators are different within a given worker (each "batch" comes from a single worker)
    assert len(set([from_global, from_g1, from_g2])) == 3
Can you please remove all the plain `assert`s from the tests? You can use `self.assertEqual()`, `self.assertNotEqual()`, etc. as appropriate.
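A minimal sketch of what the requested change could look like for the loop shown above, assuming the code sits inside a `unittest.TestCase` method (same variables as the diff):

```python
for from_global, from_g1, from_g2 in dl:
    # Each batch comes from a single worker, so all three values should differ.
    self.assertEqual(len({from_global, from_g1, from_g2}), 3)
```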
self->weakreflist = NULL;

static py::handle _generator_registry = py::module::import("torch").attr("random").attr("_generator_registry");
_generator_registry.attr("add")(py::cast<py::object>((PyObject*)self.get()));
Move the `add` above: you don't want to do the `"add"` lookup dynamically every time, right?
}
self->weakreflist = NULL;

static py::handle _generator_registry = py::module::import("torch").attr("random").attr("_generator_registry");
You should have a `release()` at the end here to make sure to leak the reference (i.e., intentionally keep it alive forever instead of letting the temporary `py::object` decref it).
for device, device_rng_state in zip(devices, device_rng_states):
    device_mod.set_rng_state(device_rng_state, device)

# We keep track of all Generator instances (except the default one) via a registry of weak references.
Oh, why not the default one? Because that is the global RNG and so it is already handled?
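For context, a Python-only sketch of how a weak-reference registry behaves. The real registry is populated from C++, and `torch.Generator` only becomes weakref-able with this PR's C++ change, so a stand-in class is used here; the default Generator stays out of the registry because, as the comment says, the global RNG is already re-seeded directly:

```python
import weakref


class FakeGenerator:
    """Stand-in for torch.Generator; the PR makes the real Generator
    weakref-able on the C++ side so it can live in a registry like this."""


_generator_registry = weakref.WeakSet()

g = FakeGenerator()
_generator_registry.add(g)       # in the PR this happens at construction, from C++

print(len(_generator_registry))  # 1 while g is alive
del g                            # on CPython, refcounting collects g immediately
print(len(_generator_registry))  # 0: the weak reference vanished with the object
```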
# would be the same as the global one.
# We also need to create a generator_seed that depends on the current generator state, otherwise
# all Generator instances within a given worker would have the same RNG.
generator_seed = torch.empty((), dtype=torch.int64).random_(generator=generator).item() + seed
Why not use `generator.initial_seed()`?
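One observable difference, which may or may not be the deciding factor here: `initial_seed()` only reports the last seed that was set, not the generator's current state. A small sketch (not code from the PR):

```python
import torch

g = torch.Generator()
g.manual_seed(42)

print(g.initial_seed())        # 42
torch.rand(1000, generator=g)  # advance the generator's state
print(g.initial_seed())        # still 42: initial_seed() ignores how far g has advanced

# Sampling from the generator, as the diff does, depends on the current state:
print(torch.empty((), dtype=torch.int64).random_(generator=g).item())
```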
As we discussed earlier, I am begrudgingly ok with this approach. There are some implementation details I'll talk about tomorrow; one in particular is having the registry in C++ and thread-safe.
EDIT: Actually, we cannot easily do this because you use the Python weakref functionality 🤔
seed = base_seed + worker_id
random.seed(seed)
torch.manual_seed(seed)
for generator in torch.random._generator_registry:
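Combining this hunk with the loop body shown earlier in the diff, the re-seeding presumably reads roughly as below. This is a reconstruction from the snippets in this PR, not a verbatim copy of the patch; `base_seed` and `worker_id` are placeholders for the arguments `_worker_loop` receives, and `torch.random._generator_registry` only exists with this PR applied:

```python
import random

import torch

base_seed, worker_id = 12345, 0  # placeholders; in _worker_loop these come from the main process

seed = base_seed + worker_id
random.seed(seed)
torch.manual_seed(seed)
for generator in torch.random._generator_registry:
    # Derive a per-generator seed from both the worker seed and the generator's
    # current state, so Generators differ across workers and within a worker.
    generator_seed = torch.empty((), dtype=torch.int64).random_(generator=generator).item() + seed
    generator.manual_seed(generator_seed)
```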
It occurs to me, technically, you don't even need the generator registry; we could just gc.get_objects() and traverse the entire live heap to look for generators 😆
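A minimal sketch of that gc-based alternative (roughly what #107131 ended up doing; the helper below is an illustration with a made-up name, not the merged code):

```python
import gc

import torch


def reseed_live_generators(seed):
    """Hypothetical helper: walk every GC-tracked object and re-seed the
    torch.Generator instances found (requires Generator to be GC-tracked)."""
    for obj in gc.get_objects():
        if isinstance(obj, torch.Generator):
            new_seed = torch.empty((), dtype=torch.int64).random_(generator=obj).item() + seed
            obj.manual_seed(new_seed)
```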
Actually, legit question, why don't we do this? The patch becomes super simple then.
I gave it a try in #107131, LMK what you think
Oh yeah, that definitely makes it a lot less invasive if we're ok with traversing all the live objects on process creation. Which I think we are.
Alternative to #107034, implements @ezyang's suggestion from #107034 (comment). This PR addresses https://fb.workplace.com/groups/pytorch.oss.dev/posts/1699944830430051 and does a bunch of stacked changes:
- Make the `Generator` class support GC; this makes all `Generator` instances tracked and accessible through Python's GC.
- Use the GC to retrieve all existing `Generator` instances in the Dataloader's `_worker_loop` and re-seed them: this extends what is already applied to the global/default `Generator`, which is already re-seeded.

~TODO: a bit of docs and justification, which I'll do if this PR is mergeable.~ -- Done

CC @albanD @ezyang as previously discussed

BC-Breaking Note
-------------------
We now re-seed all `Generator` instances within the `Dataloader` workers' loop to ensure that their RNG differs across workers. Previously, the RNG of user-defined `Generator`s was the same across workers, which could lead to wrong training procedures. This only affects user-defined `Generator`s, not the default `Generator` (which was already re-seeded).

Pull Request resolved: #107131
Approved by: https://github.com/ezyang
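To make the BC-breaking note concrete, a sketch of the kind of user code whose behaviour changes; the dataset class and generator name are made up for illustration:

```python
import torch
from torch.utils.data import DataLoader, Dataset

noise_gen = torch.Generator()  # user-defined Generator; name is illustrative
noise_gen.manual_seed(0)


class NoisyDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        # Before this change: noise_gen kept the same state in every worker, so
        # different workers could draw identical noise. After: each worker's copy
        # of noise_gen is re-seeded differently in _worker_loop.
        return idx + torch.randn(1, generator=noise_gen)


loader = DataLoader(NoisyDataset(), num_workers=2)
```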
Superseded by #107131
This PR addresses https://fb.workplace.com/groups/pytorch.oss.dev/posts/1699944830430051 and does a bunch of stacked changes:
- Make `Generator` weakref-able (C++ part)
- Keep a registry of all `Generator` objects (via weakrefs)
- Modify `Dataloader`'s `_worker_loop` to re-seed all existing `Generator` instances: this extends what is already applied to the global `Generator`, which is already re-seeded.

TODO: a bit of docs and justification, which I'll do if this PR is mergeable.

CC @albanD as previously discussed

cc @ezyang @gchanan @ssnl @VitalyFedyunin @ejguan @dzhulgakov @pbelevich