Conversation

@JackCaoG (Collaborator) commented Feb 11, 2023

Currently we use xla_sync_multi to force an execution that warms up the cache for future dynamo runs. This is not ideal because:

  1. Execution can be time-consuming.
  2. xla_sync_multi is treated as a mark_step-like sync and will trigger buffer aliasing. However, we actually save some XLA data when we cache the dynamo graph. Those buffers might be aliased during the xla_sync_multi and cause random crashes when we actually execute the dynamo graph.

Confirmed that this change does not regress speed on TPU when running torchbench.

FYI @wconstab @will-cromar
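
For context, here is a rough sketch of the bridge entry point this PR changes. It assumes extract_compiled_graph lives in torch_xla.core.dynamo_bridge and takes an fx module plus its XLA args; treat the exact path and signature as illustrative, not authoritative:

# Rough, hypothetical sketch of the path this PR changes.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.core.dynamo_bridge as bridge  # assumed module path

def f(a, b):
    return a * b + b

device = xm.xla_device()
args = [torch.randn(4, device=device), torch.randn(4, device=device)]
fx_mod = torch.fx.symbolic_trace(f)

# extract_compiled_graph traces the fx module on XLA and returns a callable
# that replays the cached XLA computation. Before this PR the warm-up ran
# xla_sync_multi (a mark_step-like full execution, with aliasing); after
# this PR it only compiles, without executing.
optimized = bridge.extract_compiled_graph(fx_mod, args)
out = optimized(*args)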

@vanbasten23 (Collaborator):

Where can I read up on and learn about buffer aliasing?

@JackCaoG (Collaborator, Author):

Ah, one thing I realized is that we actually need to materialize some of the tensors, at least the ones we save for future execution. Let me tweak it a bit.

@JackCaoG (Collaborator, Author):

> Where can I read up on and learn about buffer aliasing?

If you search for aliasing in XLA you should find some very brief docs. I mostly just read the code.

@JackCaoG (Collaborator, Author):

@shunting314 @alanwaketan This PR is ready for review.

@shunting314 (Collaborator):

> xla_sync_multi is treated as a mark_step-like sync and will trigger buffer aliasing. However, we actually save some XLA data when we cache the dynamo graph. Those buffers might be aliased during the xla_sync_multi and cause random crashes when we actually execute the dynamo graph.

Is it easy to add a unit test that shows the buffer-aliasing issue?

@JackCaoG (Collaborator, Author):

Yeah, actually PJRT=CPU plus the current resnet18 test crashes (CI is currently using XRT; we are migrating tests to the PJRT runtime, which is how I found this issue). Do you want me to add a smaller unit test?
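
A smaller repro along these lines might look like the sketch below. It is purely illustrative: the backend name and entry point are assumptions, and the real failing case was resnet18 run under the PJRT CPU runtime.

# Hypothetical minimal repro sketch; run with PJRT_DEVICE=CPU.
import torch
import torch_xla.core.xla_model as xm

def step(x):
    return torch.sin(x) + x

device = xm.xla_device()
x = torch.randn(8, device=device)

compiled = torch.compile(step, backend="torchxla_trace_once")  # assumed backend name
first = compiled(x)   # warm-up: traces and caches the XLA graph
second = compiled(x)  # replays the cached graph; pre-fix, this could crash
                      # because the warm-up had donated the saved input buffers
torch.testing.assert_close(first.cpu(), second.cpu())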

@shunting314 (Collaborator):

TBH, I don't quite understand what the "buffer aliasing" issue fixed by this PR is, or more specifically, how it causes crashes. A simple unit test should help.

But it's not blocking; I think we can merge this PR first.

@alanwaketan (Collaborator) left a comment:

LGTM.

@JackCaoG (Collaborator, Author):

@shunting314 I should be more specific. Buffer aliasing is only enabled during a mark_step or equivalent (sync_multi). The code is in:

// Aliasing is only set up when the sync both flushes pending IR and
// forces data materialization, and the computation is not sharded.
if (enable_aliasing && coll.config.sync_ltc_data &&
    coll.config.force_ltc_data && !is_sharded) {

In this PR I set config.force_ltc_data to False during the cache warm-up, hence it skips the aliasing. Regarding what aliasing does, we have a simple test:

# Two aliases already counted here, presumably from earlier in-place
# updates on t1 and t2 in the test.
self.assertEqual(met.metric_data("InputOutputAliasCount")[1], 2.0)
# Check in-place op aliasing: two more in-place updates add two aliases.
t3 = t1 + t2
t1 *= 2.0
t2 += 2.0
xm.mark_step()
self.assertEqual(met.metric_data("InputOutputAliasCount")[1], 4.0)

You can see that we perform two in-place operations on t1 and t2. We will have an HLO similar to:

hlo(input1, input2) -> (output1, output2, output3)

The inputs in this case are t1 and t2; the outputs are t3 and the new t1 and t2. What aliasing does is, instead of allocating new buffers for output1 and output2, reuse the input buffers, since pytorch/xla can tell the compiler these are in-place operations and we don't need t1's and t2's original buffers after this computation.

The bug we had prior to this PR is that we actually save some inputs for each dynamo graph. However, when we did xla_sync_multi to warm up the cache, we incorrectly triggered the aliasing. This means the buffers we saved were donated to the outputs of the xla_sync_multi, which crashes when we try to use the saved inputs for the real dynamo run.
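
To make that failure mode concrete, here is a hypothetical sketch; the variable names are made up, and the dynamo bridge's actual bookkeeping is more involved:

# Hypothetical illustration of the pre-fix failure mode.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t1 = torch.ones(4, device=device)

# Dynamo bridge (conceptually): remember t1's XLA buffer so the cached
# graph can be replayed against it later.
saved_inputs = [t1]

t1 *= 2.0  # pending in-place op, so t1's buffer is an aliasing candidate

# Pre-fix warm-up behaved like mark_step: aliasing donated t1's old buffer
# to the computation's output, invalidating the saved handle...
xm.mark_step()

# ...so a later "real" dynamo run that reads the saved, now-donated
# buffer could crash.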

@JackCaoG JackCaoG merged commit b7707fa into master Feb 14, 2023
JackCaoG added a commit that referenced this pull request Feb 16, 2023
* Fix HLO dumping (#4619)

* Update TF pin to 2/13 (#4615)

* Update TF pin to 2/13

* Fix pinned commit

* Add patch to revert TF 3e24055

* Add comment to new patch

* Fix patch command in TPU CI (#4623)

* Skip execution for extract_compiled_graph (#4612)

* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests

* Disable failing `test_operations.py` tests on TPU (#4622)

* Disable `test_operations.py` tests failing on TPU

* Add to TPU CI

* Bazel (#4528)

* Replace tensorflow with a bazel external repository

* Basic migration to bazel for xla_client.

* Revert to blob

* Add vscode config.

* Update newlines

* Merge with pjrt client test build changes.

* Migrate tests to new build

* Format test and plugin

* Order imports

* Conditionally apply tf patches; apply pt patches always.

* Format python

* configure formatters

* Mirror TF pin update and fixes in bazel.

* Support local and sandboxed build based on flags

* Add cloud cache URLs for llvm.

* Merge with upstream

* Update TF pin

* Fix patching regression

* Revert "Bazel (#4528)" (#4631)

This reverts commit 3a90f5a.

---------

Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
JackCaoG added a commit that referenced this pull request Feb 16, 2023
* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests
chandrasekhard2 pushed a commit that referenced this pull request Feb 22, 2023
* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests
mateuszlewko pushed a commit that referenced this pull request Mar 15, 2023
* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests