Conversation

@yeounoh (Contributor) commented on Nov 22, 2022

This fixes a couple of bugs in AssignIrValue and ExecuteReplicated for sharding, to enable mark_step() with SPMD. Note that this doesn't address sharding propagation through views, which will be handled later.
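For context, here is a minimal sketch of the end-to-end flow this change enables: annotate an XLA tensor with a sharding spec, mutate it in place, and call mark_step(). The xs.Mesh and xs.mark_sharding names, their signatures, and the device count are assumptions about the experimental sharding API and may differ between releases; mark_step() and _get_xla_sharding_spec() are the calls discussed in this thread.

# Sketch only: assumes the experimental xla_sharding module with a
# mark_sharding(tensor, mesh, partition_spec) entry point; the exact module
# path and signatures may differ between torch_xla releases.
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs

num_devices = 8  # assumed device count, for illustration only
mesh = xs.Mesh(list(range(num_devices)), (1, num_devices))

xt = torch.randn(2, num_devices).to(xm.xla_device())
xs.mark_sharding(xt, mesh, (0, 1))  # shard dim 1 across the device mesh

xt.add_(1)      # in-place update replaces the tensor's IR value
xm.mark_step()  # with this fix, the sharding spec survives the step

print(torch_xla._XLAC._get_xla_sharding_spec(xt))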

@yeounoh yeounoh added the distributed SPMD and other distributed things. label Nov 22, 2022
@yeounoh yeounoh self-assigned this Nov 22, 2022
if (!ir_value) {
  // The caller passed an empty value (e.g. AssignIrValue(torch::lazy::Value()));
  // re-create an IR value from the current XLA data so the existing sharding
  // spec can be propagated to the new IR.
  ir_value = CreateTensorNode(CurrentXlaData(), /*read_only=*/false);
}
XLA_CHECK(ir_value.node != nullptr) << "Trying to access a null cursor";
Collaborator
Hmm, so this line is to force the sharding on the old IR before it gets replaced? I'm confused because we did not clear the sharding after the new ir_value is assigned.

Contributor Author
The intent is to force the same sharding on the new IR (input), for a common call like AssignIrValue(torch::lazy::Value()). Did I get it reversed?

Collaborator
Oh OK, I read it backwards; you're trying to populate the sharding spec onto the new IR being assigned.

@yeounoh yeounoh force-pushed the spmd_mark_step branch 2 times, most recently from b083799 to 63773a9 on November 23, 2022 at 22:07
self.assertEqual(sharding_spec, torch_xla._XLAC._get_xla_sharding_spec(xt))

xt.add_(1) # inplace update
xm.mark_step() # resets IR value
Collaborator
I don't think you need the mark_step here? After the xt.add_(1), if you print the IR and HLO by

print(torch_xla._XLAC._get_xla_tensors_text([xt]))
print(torch_xla._XLAC._get_xla_tensors_hlo([xt]))

you should see the sharding spec on the output?

Contributor Author
It works either way, before or after the mark_step, and the test checks for the sharding annotation.
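To make "either way" concrete, a small sketch using only the inspection helpers quoted above; xt is assumed to be an XLA tensor that already carries a sharding annotation, as in the test snippet, and the non-empty-string check on _get_xla_sharding_spec is an assumption about its return value.

xt.add_(1)  # in-place update creates a new IR value for xt

# Before mark_step: the pending IR/HLO should show the sharding annotation.
print(torch_xla._XLAC._get_xla_tensors_hlo([xt]))
assert torch_xla._XLAC._get_xla_sharding_spec(xt) != ''

xm.mark_step()  # resets the IR value

# After mark_step: the sharding spec is still attached to the tensor.
assert torch_xla._XLAC._get_xla_sharding_spec(xt) != ''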

@JackCaoG (Collaborator) left a comment
Mostly LGTM; approving to unblock the merge once tests are green. You might need to rebase.
