[HSDP] Fix Node 1 being unable to receive parameters from Node 0

When FSDP runs in hybrid_shard mode, state.process_group covers only the GPUs of a single node (GPUs 0 through 7 on node 0), so the GPUs on node 1 cannot receive parameters from node 0. Passing the default (global) process group to _sync_module_params_and_buffers instead fixes this issue.
pytorch · Sep 7, 2023 · 0f99fd7
1 parent 121cfb6
Showing 1 changed file with 2 additions and 1 deletion.
diff --git a/torch/distributed/fsdp/_init_utils.py b/torch/distributed/fsdp/_init_utils.py
@@ -569,8 +569,9 @@ def _init_param_handle_from_module(
 
     managed_params = list(_get_orig_params(fully_sharded_module, state._ignored_params))
     if sync_module_states:
+        default_group = _get_default_group()
         _sync_module_params_and_buffers(
-            fully_sharded_module, managed_params, state.process_group
+            fully_sharded_module, managed_params, default_group
         )
     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
     return state
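
To illustrate the behavior described in the commit message, here is a minimal pure-Python simulation of the two cases. It makes no torch.distributed calls. The 2-node, 8-GPU layout comes from the commit message; the helper names (intra_node_groups, broadcast_from_first_rank, sync) are hypothetical, and the sketch assumes the sync step broadcasts from the first rank of whichever group it is given.

```python
# Pure-Python simulation of the sync step; nothing here is a PyTorch API.
NUM_NODES = 2
GPUS_PER_NODE = 8
WORLD_SIZE = NUM_NODES * GPUS_PER_NODE


def intra_node_groups(num_nodes, gpus_per_node):
    """Return one list of global ranks per node (the per-node sharding groups)."""
    return [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]


def broadcast_from_first_rank(values, group):
    """Copy the value held by the group's first rank to every rank in the group.

    Assumption: this mirrors a broadcast whose source is the group's first rank.
    """
    src = group[0]
    for rank in group:
        values[rank] = values[src]


def sync(groups, world_size):
    """Run one broadcast per group and return what each global rank ends up with."""
    # Each rank starts with its own weights, tagged by its global rank.
    values = {rank: f"weights_from_rank_{rank}" for rank in range(world_size)}
    for group in groups:
        broadcast_from_first_rank(values, group)
    return values


# Before the change: the sync only sees the intra-node group of each rank.
before_fix = sync(intra_node_groups(NUM_NODES, GPUS_PER_NODE), WORLD_SIZE)

# After the change: the sync uses a single group spanning every rank.
after_fix = sync([list(range(WORLD_SIZE))], WORLD_SIZE)

print("node 1 before:", before_fix[8])
print("node 1 after: ", after_fix[8])
```

In this sketch the ranks on node 1 never see the weights held by global rank 0 when each broadcast is confined to one node, whereas a single global group leaves every rank with the same copy. That is the situation the change to default_group is meant to address.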