fix set_determinism on single gpu #1983
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -145,7 +145,9 @@ def set_determinism(` |
| | | `  # and choose a unique seed for each rank on the PP mesh.` |
| | | `  # We support multiple distinct dimensions by adding each distinct dimension's local rank to the seed.` |
| | | `  distinct_dims_in_mesh = [` |
| | | `-     dim for dim in distinct_seed_mesh_dims if dim in world_mesh.mesh_dim_names` |
| | | `+     dim` |
| | | `+     for dim in distinct_seed_mesh_dims` |
| | | `+     if world_mesh.mesh_dim_names and dim in world_mesh.mesh_dim_names` |
| | | `  ]` |
| | | `  if c10d.get_world_size() > 1 and distinct_dims_in_mesh:` |

Contributor

@fegin It seems if NGPU=1, Does this sound right to you? I somehow feel we should have default I'm OK with this change to unblock.

Contributor

We can land this PR to unblock. My new DeviceMesh PR should address this problem. I will also ensure that the newly added unit test passes in my PR.
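The failure mode this diff guards against can be sketched without a real device mesh. The snippet below is only an illustration under stated assumptions: `world_mesh` is a hypothetical stand-in object (not a real `DeviceMesh`), and `["pp"]` is an assumed value for `distinct_seed_mesh_dims`. It shows that a membership test against `None` raises `TypeError`, while the guarded comprehension from the PR short-circuits and yields an empty list.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a single-GPU mesh whose mesh_dim_names is None.
# This is NOT a real DeviceMesh; it only mimics the one attribute used here.
world_mesh = SimpleNamespace(mesh_dim_names=None)
distinct_seed_mesh_dims = ["pp"]  # assumed value, for illustration only


def unguarded(mesh, dims):
    # Pre-fix form: `dim in None` raises TypeError.
    return [dim for dim in dims if dim in mesh.mesh_dim_names]


def guarded(mesh, dims):
    # Post-fix form: the truthiness check short-circuits when the names are None.
    return [
        dim
        for dim in dims
        if mesh.mesh_dim_names and dim in mesh.mesh_dim_names
    ]


try:
    unguarded(world_mesh, distinct_seed_mesh_dims)
    raised = False
except TypeError:
    raised = True

single_gpu_dims = guarded(world_mesh, distinct_seed_mesh_dims)
multi_dim_mesh = SimpleNamespace(mesh_dim_names=("dp", "pp"))
multi_gpu_dims = guarded(multi_dim_mesh, ["pp", "tp"])
```

With the guard in place, `distinct_dims_in_mesh` is simply empty on a single GPU, so the subsequent `if c10d.get_world_size() > 1 and distinct_dims_in_mesh:` branch is skipped.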
Thanks for catching this! I have a n00b question: will `world_mesh.mesh_dim_names` be empty or an empty list (https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/parallel_dims.py#L159) if we initialize the device mesh with `mesh = init_device_mesh(device_type, dims=[], mesh_dim_names=[])`?
`world_mesh.mesh_dim_names` is empty, with type `NoneType`.
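Because the guard in this PR relies on truthiness, it would behave the same whether the attribute were `None` or an empty list. One alternative way to express that, shown here only as a sketch, is to normalize the value to an empty tuple before the membership test. `dims_present` is a hypothetical helper introduced for illustration and is not part of torchtitan.

```python
from typing import List, Optional, Sequence


def dims_present(
    requested: Sequence[str],
    mesh_dim_names: Optional[Sequence[str]],
) -> List[str]:
    """Return the requested dims that appear in mesh_dim_names.

    Hypothetical helper: treats None and an empty sequence identically
    by normalizing both to an empty tuple.
    """
    names = mesh_dim_names or ()
    return [dim for dim in requested if dim in names]


from_none = dims_present(["pp"], None)
from_empty = dims_present(["pp"], [])
from_named = dims_present(["pp", "tp"], ("dp", "pp"))
```

Normalizing once keeps the comprehension itself free of the `None` check, at the cost of an extra local variable.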