Fix the error where mark_step does not materialize tensors on SPMD:0 #5281
```diff
@@ -81,6 +81,7 @@ auto XLAGraphExecutor::DeviceContextArena::Get() -> DeviceContextArena* {
 std::vector<XLATensorPtr> XLAGraphExecutor::DeviceContextArena::GetLiveTensors(
     const torch::lazy::BackendDevice* device) {
   std::vector<XLATensorPtr> tensors;
+  torch::lazy::BackendDevice virtual_device = GetVirtualDevice();
   auto fn = [&](DeviceContext* devctx) {
     std::lock_guard<std::mutex> lock(devctx->lock);
     for (auto& uid_wptr : devctx->tensors_data) {
```
```diff
@@ -92,6 +93,8 @@ std::vector<XLATensorPtr> XLAGraphExecutor::DeviceContextArena::GetLiveTensors(
     }
   };
   ForAllDeviceContexts(fn, device);
+  // TODO(JackCaoG): all tensors should be on spmd:0 in SPMD mode.
+  ForAllDeviceContexts(fn, &virtual_device);
   return tensors;
 }
```
```diff
@@ -502,7 +505,10 @@ XLAGraphExecutor::SyncTensorCollection XLAGraphExecutor::CollectSyncTensors(
       tsl::profiler::TraceMeLevel::kInfo);
   runtime::util::Unique<torch::lazy::BackendDevice> unique_device;
   for (size_t i = 0; i < tensors.size(); ++i) {
-    unique_device.set(tensors[i]->GetDevice());
+    // TODO(JackCaoG): all tensors should be on spmd:0 in SPMD mode.
+    if (tensors[i]->GetDevice().toString() != "SPMD:0") {
+      unique_device.set(tensors[i]->GetDevice());
+    }
   }
   SyncTensorCollection coll;
   if (!unique_device) {
```

> **Reviewer:** So currently we allow SPMD:0 and TPU:0 to coexist? I wonder what's the deal with the code later on that deals with `unique_device`?
>
> **JackCaoG:** I will move it so all tensors will be on SPMD:0.
>
> **Reviewer:** Right, but we have code later in this method that uses `unique_device`, and it looks like it's important. And then if SPMD:0 and TPU:0 really coexist in this method, shouldn't …
>
> **JackCaoG:** Oh, so here is where things become messy. What happened is that we will pass … https://github.com/pytorch/xla/blob/master/torch_xla/csrc/xla_graph_executor.cpp#L999-L1000 It is a very broken design now. Different functions are checking different things. I want to unify all of those later.
>
> **Reviewer:** Err... whatever, I will approve this and then wait for your redesign.
> **Reviewer:** What's this `t1` about?
>
> **JackCaoG:** So... here is another weird bug: if both `xt1` and `xt2` are on `spmd:0`, then `mark_step` will see that there is no tensor on the `TPU:0` device and it will skip the execution. I think it is easier for me to rewrite the whole virtual device instead of fixing these bugs to suit both `spmd:0` and `TPU:0` existing at the same time.
>
> **Reviewer:** Good luck... I wonder how I was able to do GPT-2 experiments all this time...
>
> **JackCaoG:** I am surprised that code runs at all lol.