
Support eager mode for multi-process training #7327

Merged
JackCaoG merged 4 commits into master from JackCaoG/multi_process_eager_2 on Jun 24, 2024

Conversation

JackCaoG
Collaborator

In-place all_reduce is used in optimizer_step for data-parallel, multi-process training. The HLO for

  ordinal_tensor_1 = torch.tensor([index], dtype=torch.float).to(device)
  ordinal_tensor_2 = torch.tensor([index], dtype=torch.int32).to(device)

  xm.all_reduce(xm.REDUCE_SUM, [ordinal_tensor_1, ordinal_tensor_2])

looks like

ENTRY %IrToHlo.27 (p0.1: f32[], p1.2: s32[1], p2.3: f32[1]) -> (f32[1], s32[1]) {
.......
  %all-reduce.12 = (s32[1]{0}, s32[]) all-reduce(s32[1]{0} %get-tuple-element.6, s32[] %get-tuple-element.7), replica_groups={}, constrain_layout=true, to_apply=%AddComputation.8, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/workspaces/dk3/pytorch/xla/torch_xla/core/xla_model.py" source_line=501}
.....
  %get-tuple-element.24 = f32[1]{0} get-tuple-element((f32[1]{0}, f32[]) %all-reduce.23), index=0, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/workspaces/dk3/pytorch/xla/torch_xla/core/xla_model.py" source_line=501}
  %get-tuple-element.13 = s32[1]{0} get-tuple-element((s32[1]{0}, s32[]) %all-reduce.12), index=0, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/workspaces/dk3/pytorch/xla/torch_xla/core/xla_model.py" source_line=501}
  ROOT %tuple.26 = (f32[1]{0}, s32[1]{0}) tuple(f32[1]{0} %get-tuple-element.24, s32[1]{0} %get-tuple-element.13)
}

Note that in the above HLO we have two outputs but only one all-reduce. Without this change we would eagerly evaluate each output, which results in the all_reduce being compiled and executed twice, which is not ideal. For ops like all_reduce, where one op has multiple outputs, it is better to group the execution and execute only once.
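A minimal multi-process sketch of the scenario this PR targets (not taken from the PR's own test): two tensors of different dtypes are reduced by a single in-place all_reduce, which lowers to the one multi-output all-reduce shown above. It assumes the experimental torch_xla.experimental.eager_mode(True) entry point and the standard xmp.spawn launcher; the helper name _mp_fn is illustrative.

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Assumed eager-mode switch from the experimental eager docs of this era.
    torch_xla.experimental.eager_mode(True)
    device = xm.xla_device()

    ordinal_tensor_1 = torch.tensor([index], dtype=torch.float).to(device)
    ordinal_tensor_2 = torch.tensor([index], dtype=torch.int32).to(device)

    # One in-place all_reduce over both tensors. With this change the two
    # outputs of the resulting all-reduce HLO are executed as a single group
    # instead of compiling/executing the collective once per output.
    xm.all_reduce(xm.REDUCE_SUM, [ordinal_tensor_1, ordinal_tensor_2])

    expected = sum(range(xm.xrt_world_size()))
    assert ordinal_tensor_1.item() == float(expected)
    assert ordinal_tensor_2.item() == expected


if __name__ == '__main__':
    xmp.spawn(_mp_fn)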

@JackCaoG JackCaoG added the tpuci label Jun 21, 2024
@JackCaoG JackCaoG added the eager and usability labels Jun 21, 2024
@JackCaoG JackCaoG marked this pull request as ready for review June 21, 2024 23:29
@JackCaoG
Collaborator Author

This PR is ready for review.

@JackCaoG JackCaoG merged commit 222bbd8 into master Jun 24, 2024
21 of 22 checks passed
@JackCaoG JackCaoG deleted the JackCaoG/multi_process_eager_2 branch June 24, 2024 23:57