
Conversation

@liangel-02 (Contributor)

Summary

Currently, running `CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_train.sh` fails with

 dim for dim in distinct_seed_mesh_dims if dim in world_mesh.mesh_dim_names
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  TypeError: argument of type 'NoneType' is not iterable

This PR fixes the single-GPU case, or more generally the case where `world_mesh.mesh_dim_names` is None.
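
For illustration, a plain-Python sketch of the failure mode and of the guard follows; it mirrors the shape of the changed comprehension but is not the actual torchtitan code.

```python
# Plain-Python illustration of the failure and the guard (no torch needed;
# not the actual torchtitan code, just the same shape of logic).
mesh_dim_names = None                 # what an unnamed single-GPU world mesh reports
distinct_seed_mesh_dims = ["pp"]

try:
    [d for d in distinct_seed_mesh_dims if d in mesh_dim_names]
except TypeError as err:
    print(err)                        # argument of type 'NoneType' is not iterable

# The guard short-circuits before the membership test, so None (or an empty
# tuple) simply yields no distinct dims instead of raising:
print([d for d in distinct_seed_mesh_dims
       if mesh_dim_names and d in mesh_dim_names])    # -> []
```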

Testing

Added a unit test to `tests/unit_tests/test_set_determinism.py`.
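
For context, below is a minimal sketch of the kind of check such a test makes. The helper `_distinct_dims_in` is hypothetical (in torchtitan the guard lives inline in `set_determinism`), so the actual test may look different.

```python
# Hypothetical sketch of a determinism-guard test. The helper mirrors the
# guarded comprehension from this PR; it is not the real torchtitan code.
def _distinct_dims_in(mesh_dim_names, distinct_seed_mesh_dims):
    return [
        dim
        for dim in distinct_seed_mesh_dims
        if mesh_dim_names and dim in mesh_dim_names
    ]


def test_none_mesh_dim_names_does_not_raise():
    # NGPU=1: mesh_dim_names is None -> old code raised TypeError, new code returns [].
    assert _distinct_dims_in(None, ["pp"]) == []


def test_named_dims_are_still_picked_up():
    assert _distinct_dims_in(("pp", "dp_shard"), ["pp"]) == ["pp"]
```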

@meta-cla bot added the CLA Signed label on Nov 4, 2025
@liangel-02 requested a review from drisspg on November 4, 2025 at 01:37
Review thread on the changed comprehension in `set_determinism`:

-    dim for dim in distinct_seed_mesh_dims if dim in world_mesh.mesh_dim_names
+    dim
+    for dim in distinct_seed_mesh_dims
+    if world_mesh.mesh_dim_names and dim in world_mesh.mesh_dim_names

Contributor:

@fegin It seems that if NGPU=1, world_mesh is not None but mesh_dim_names is None, due to this code: https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/parallel_dims.py#L154-L156

Does this sound right to you? I somehow feel we should have a default mesh_dim_names, but I can't find a perfect option for it.

I'm OK with this change to unblock.

Contributor:

We can land this PR to unblock. My new DeviceMesh PR should address this problem. I will also make sure the newly added unit test passes in my PR.

Review thread on the surrounding context:

    # For PP + SPMD cases, we want to separate the world into the SPMD mesh and the PP mesh,
    # and choose a unique seed for each rank on the PP mesh.
    # We support multiple distinct dimensions by adding each distinct dimension's local rank to the seed.
    distinct_dims_in_mesh = [

Contributor:

Thanks for catching this! I have a n00b question: will world_mesh.mesh_dim_names be None or an empty list (https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/parallel_dims.py#L159) if we init_device_mesh with `mesh = init_device_mesh(device_type, dims=[], mesh_dim_names=[])`?

Contributor Author:

world_mesh.mesh_dim_names comes back as None (not an empty list).
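
As an aside, an equivalent way to write the same guard, if the pattern comes up again elsewhere, is to normalize the value once; this is just a sketch, not what this PR does.

```python
# Sketch of an alternative guard: normalize once, then test membership.
# (Not what this PR does; just illustrating the `or ()` idiom.)
mesh_dim_names = None                      # or ("pp", "dp_shard") on multi-dim meshes
names = mesh_dim_names or ()               # () when the mesh is unnamed
print([d for d in ["pp"] if d in names])   # -> []
```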

@fegin (Contributor) left a comment:

The issue is that DeviceMesh doesn't enforce names, while init_device_mesh does. But in this case the dimensions are empty, so even though we call init_device_mesh, it seems we still fall back to DeviceMesh's unnamed, no-dimension-name default.

cc @fduwjj

@liangel-02 merged commit bb308da into main on Nov 4, 2025
9 checks passed
@fegin deleted the test_varlen branch on November 4, 2025 at 19:25
jquesnelle pushed a commit to NousResearch/torchtitan that referenced this pull request on Nov 10, 2025