
Conversation

@JackCaoG (Collaborator) commented Nov 22, 2022

This is to fix #4189. Even if a tensor has xla_data, we might still have to update it if there is an active view associated with the tensor.

FYI @Liyang90 @alanwaketan
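
For context, a minimal sketch of the scenario this change addresses, assuming an XLA device is available; the tensor name, shape, and values are illustrative, not taken from the actual test:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# A tensor that outlives a single mark_step(), e.g. a running-average buffer.
t1 = torch.zeros(50, device=device)
xm.mark_step()   # t1 now has backing xla_data on the device

# The in-place slice update goes through a view on top of t1.
t1[10] = 0.5
xm.mark_step()   # with this fix, the pending update is executed here even
                 # though t1 already had xla_data before the step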

@JackCaoG (Collaborator Author)

@wonjoolee95 let's revert this change once the functionalization pass migration is completed; we need to make sure the newly added test still works, though.

@wonjoo-wj (Collaborator)

Thanks for the heads up. Are the CI test failures expected?

@JackCaoG (Collaborator Author)

@wonjoolee95 lol I need to rebase, let me do that.

@JackCaoG force-pushed the jackcao/fix_view_not_update branch from 75451ef to 119155e on November 22, 2022 18:40
// A tensor's xla_data might not be up to date if there is a view
// associated with it. Make sure to sync those tensors here too.
(tensors[i]->CurrentXlaData() == nullptr ||
tensors[i]->data()->view != nullptr)) {
Contributor

Awesome, @JackCaoG

xm.mark_step()
t1[12] = 1123
xm.mark_step()
self.assertNotIn('update_slice',
Contributor

Wouldn't the graph be trimmed anyway after mark_step()? I just tried on my branch without this patch, and it doesn't have the update_slice after. But maybe I am not following / understanding 🥟

Nit: it might be helpful to add a brief comment explaining what this assertion is about -- especially for future readers who haven't been following the original issue.

Collaborator Author

Hmm, without this change, the test failed on my end.

Collaborator Author

The bug is that the second mark_step won't execute the IR for t1. Can you verify?

Contributor

Ugh, it failed -- sorry for the false alarm. I can verify that the graph wasn't trimmed after the second mark_step.
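
For reference, a sketch of the kind of check the new test performs, assuming the internal debug helper torch_xla._XLAC._get_xla_tensors_text is available to dump the pending IR; the exact assertion in the test is truncated in the snippet above, so this is illustrative only:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t1 = torch.randn(128, device=device)
xm.mark_step()

t1[12] = 1123
xm.mark_step()

# After the second mark_step() the in-place update should already have been
# executed, so the pending IR for t1 should no longer contain update_slice.
ir_text = torch_xla._XLAC._get_xla_tensors_text([t1])
assert 'update_slice' not in ir_text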

@yeounoh (Contributor) left a comment

One nit comment

@JackCaoG (Collaborator Author)

Weird that the test is still failing, trying to repro on my end.

@yeounoh (Contributor) left a comment

LGTM -- with a nit comment. Feel free to merge once you resolve the test failures.

@JackCaoG (Collaborator Author)

OK, the error is real and it is a regression; I will look into it a bit.

@JackCaoG (Collaborator Author)

Seems like I need to make this check smarter:

[ScheduleSyncTensorsGraph]
TensorsGraphInfo:
  mark_step (/pytorch/xla/torch_xla/core/xla_model.py:953)
  optimized_mod (/pytorch/torch/_dynamo/optimizations/torchxla_integration.py:186)
  _fn (/pytorch/torch/_dynamo/eval_frame.py:194)
  run_model_with_dynamo (test/dynamo/test_dynamo.py:25)
  _fn (/pytorch/torch/_dynamo/eval_frame.py:194)
  test_resnet18 (test/dynamo/test_dynamo.py:62)
  run (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/case.py:628)
  __call__ (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/case.py:676)
  run (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/suite.py:122)
  __call__ (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/suite.py:84)
  run (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/suite.py:122)
  __call__ (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/suite.py:84)
  run (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/runner.py:176)
  runTests (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/main.py:271)
  __init__ (/root/anaconda3/envs/pytorch/lib/python3.7/unittest/main.py:101)
  <module> (test/dynamo/test_dynamo.py:76)

Hashes: (91e5b939b215944892e246b99ed1d826)

## BEGIN_GRAPH
IR {
  %0 = f32[1000,512]{1,0} xla::device_data(), location=forward@linear.py:114, device=CPU:0, ROOT=0
}

Currently it will execute a graph that only contains a single device_data node.

@JackCaoG merged commit ba9f0df into master on Nov 23, 2022
@miladm (Collaborator) commented Dec 13, 2022

This is an interesting bug/fix.
@JackCaoG did we run any tests to determine the perf impact of this change?

@JackCaoG (Collaborator Author)

Not really, this should be a rare case since most tensors should not outlive the scope of a single mark_step. This happened in Lightning because they used a tensor list to record some running averages.

Linked issue: Behavior of in-place updated xla tensor at mark_step()