[HSDP] Fix Node 1 unable to receive parameters from Node 0
When FSDP runs in hybrid_shard mode, state.process_group covers only GPUs 0 through 7 on node 0, so the GPUs on node 1 cannot receive parameters. Passing default_group (the global group) instead fixes this issue.
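As a rough illustration of the problem described above, the toy model below (plain Python, no torch.distributed calls) treats a broadcast as reaching only the members of the group it is issued on. The 2-node, 8-GPU-per-node layout follows the commit message; the helper names `intra_node_groups` and `ranks_reached` are hypothetical and not part of PyTorch.

```python
from typing import List, Set


def intra_node_groups(num_nodes: int, gpus_per_node: int) -> List[List[int]]:
    """Hypothetical helper: global ranks grouped by node (one group per node)."""
    return [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]


def ranks_reached(src: int, group: List[int]) -> Set[int]:
    """Toy model: a broadcast from `src` reaches only members of `group`."""
    return set(group) if src in group else set()


# Layout assumed from the commit message: 2 nodes with 8 GPUs each.
num_nodes, gpus_per_node = 2, 8
world = list(range(num_nodes * gpus_per_node))  # stands in for the global group
node0_group = intra_node_groups(num_nodes, gpus_per_node)[0]

# Before the fix: parameters are broadcast on the intra-node group of node 0.
reached_before = ranks_reached(0, node0_group)
missed_before = sorted(set(world) - reached_before)

# After the fix: parameters are broadcast on the global (default) group.
reached_after = ranks_reached(0, world)
missed_after = sorted(set(world) - reached_after)

print("missed with node-0 group:", missed_before)
print("missed with global group:", missed_after)
```

Under this simplified model, broadcasting on the node-0 group leaves ranks 8 through 15 without rank 0's parameters, while broadcasting on the global group reaches every rank.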
lxg2015 authored and pytorchmergebot committed Sep 7, 2023
1 parent 121cfb6 commit 0f99fd7
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion torch/distributed/fsdp/_init_utils.py
@@ -569,8 +569,9 @@ def _init_param_handle_from_module(

     managed_params = list(_get_orig_params(fully_sharded_module, state._ignored_params))
     if sync_module_states:
+        default_group = _get_default_group()
         _sync_module_params_and_buffers(
-            fully_sharded_module, managed_params, state.process_group
+            fully_sharded_module, managed_params, default_group
         )
     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
     return state