-
Notifications
You must be signed in to change notification settings - Fork 617
fix set_determinism on single gpu #1983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4ec5729 to
89efc19
Compare
| dim for dim in distinct_seed_mesh_dims if dim in world_mesh.mesh_dim_names | ||
| dim | ||
| for dim in distinct_seed_mesh_dims | ||
| if world_mesh.mesh_dim_names and dim in world_mesh.mesh_dim_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fegin It seems if NGPU=1, world_mesh is not None but mesh_dim_names is None, due to this code https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/parallel_dims.py#L154-L156.
Does this sound right to you? I somehow feel we should have default mesh_dim_names, but I can't find a perfect option for it.
I'm OK with this change to unblock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can land this PR to unblock. My new DeviceMesh PR should address this problem. I will also ensure that the newly added unittest pass in my PR.
| # For PP + SPMD cases, we want to separate the world into the SPMD mesh and the PP mesh, | ||
| # and choose a unique seed for each rank on the PP mesh. | ||
| # We support multiple distinct dimensions by adding each distinct dimension's local rank to the seed. | ||
| distinct_dims_in_mesh = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! I have a n00b question, will world_mesh.mesh_dim_names be empty or empty list: https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/parallel_dims.py#L159, if we init_device_mesh with mesh = init_device_mesh(device_type, dims=[], mesh_dim_names=[])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
world_mesh.mesh_dim_names is empty with type NoneType
fegin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that DeviceMesh doesn't enforce names while init_device_mesh does. But this case is that the dimensions are empty. So even though we call init_device_mesh, it seems that we still fall back to the no dimension name default case with DeviceMesh.
cc., @fduwjj
| dim for dim in distinct_seed_mesh_dims if dim in world_mesh.mesh_dim_names | ||
| dim | ||
| for dim in distinct_seed_mesh_dims | ||
| if world_mesh.mesh_dim_names and dim in world_mesh.mesh_dim_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can land this PR to unblock. My new DeviceMesh PR should address this problem. I will also ensure that the newly added unittest pass in my PR.
**Summary**
Currently, running
`CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml"
NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_train.sh` returns
```
dim for dim in distinct_seed_mesh_dims if dim in world_mesh.mesh_dim_names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
```
This PR fixes the case for a single GPU or when
world_mesh.mesh_dim_names is None
**Testing**
Added unit test to `tests/unit_tests/test_set_determinism.py`
Summary
Currently, running
CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_train.shreturnsThis PR fixes the case for a single GPU or when world_mesh.mesh_dim_names is None
Testing
Added unit test to
tests/unit_tests/test_set_determinism.py