Conversation

@JackCaoG (Collaborator) commented Feb 11, 2023

Currently we use xla_sync_multi to force an execution that warms up the cache for future dynamo runs. This is not ideal because:

  1. Execution can be time-consuming.
  2. xla_sync_multi is treated as a mark_step-like sync and will trigger buffer aliasing. However, we actually save some XLA data when we cache the dynamo graph. Those buffers might be aliased during the xla_sync_multi and cause random crashes when we actually execute the dynamo graph.

Confirmed that this change does not regress speed on TPU when running torchbench.

FYI @wconstab @will-cromar
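
For context, here is a rough sketch of the bridge entry point this PR changes. It assumes extract_compiled_graph lives in torch_xla.core.dynamo_bridge and takes an fx module plus its XLA args; treat the exact path and signature as illustrative, not authoritative:

# Rough, hypothetical sketch of the path this PR changes.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.core.dynamo_bridge as bridge  # assumed module path

def f(a, b):
    return a * b + b

device = xm.xla_device()
args = [torch.randn(4, device=device), torch.randn(4, device=device)]
fx_mod = torch.fx.symbolic_trace(f)

# extract_compiled_graph traces the fx module on XLA and returns a callable
# that replays the cached XLA computation. Before this PR the warm-up ran
# xla_sync_multi (a mark_step-like full execution, with aliasing); after
# this PR it only compiles, without executing.
optimized = bridge.extract_compiled_graph(fx_mod, args)
out = optimized(*args)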

@vanbasten23 (Collaborator):

Where can I read up on and learn about buffer aliasing?

@JackCaoG (Collaborator, Author):

Ah, one thing I realized is that we actually need to materialize some of the tensors, at least the ones we save for future execution. Let me tweak it a bit.

@JackCaoG (Collaborator, Author):

> Where can I read up on and learn about buffer aliasing?

If you search for aliasing in XLA you should find some very brief docs. I mostly just read the code.

@JackCaoG (Collaborator, Author):

@shunting314 @alanwaketan This PR is ready for review.

@shunting314 (Collaborator):

> xla_sync_multi is treated as a mark_step-like sync and will trigger buffer aliasing. However, we actually save some XLA data when we cache the dynamo graph. Those buffers might be aliased during the xla_sync_multi and cause random crashes when we actually execute the dynamo graph.

Is it easy to add a unit test that shows the buffer-aliasing issue?

@JackCaoG (Collaborator, Author):

Yeah, actually PJRT=CPU plus the current resnet18 test crashes (CI is currently using XRT; we are migrating tests to the PJRT runtime, which is how I found this issue). Do you want me to add a smaller unit test?
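
A smaller repro along these lines might look like the sketch below. It is purely illustrative: the backend name and entry point are assumptions, and the real failing case was resnet18 run under the PJRT CPU runtime.

# Hypothetical minimal repro sketch; run with PJRT_DEVICE=CPU.
import torch
import torch_xla.core.xla_model as xm

def step(x):
    return torch.sin(x) + x

device = xm.xla_device()
x = torch.randn(8, device=device)

compiled = torch.compile(step, backend="torchxla_trace_once")  # assumed backend name
first = compiled(x)   # warm-up: traces and caches the XLA graph
second = compiled(x)  # replays the cached graph; pre-fix, this could crash
                      # because the warm-up had donated the saved input buffers
torch.testing.assert_close(first.cpu(), second.cpu())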

@shunting314 (Collaborator):

TBH, I don't quite understand what the "buffer aliasing" issue fixed by this PR is, or more specifically, how it causes crashes. A simple unit test should help.

But it's not blocking; I think we can merge this PR first.

@alanwaketan (Collaborator) left a comment:

LGTM.

@JackCaoG (Collaborator, Author):

@shunting314 I should be more specific. Buffer aliasing is only enabled during a mark_step or equivalent (sync_multi). The code is in:

// Aliasing is only set up when the sync both flushes pending IR and
// forces data materialization, and the computation is not sharded.
if (enable_aliasing && coll.config.sync_ltc_data &&
    coll.config.force_ltc_data && !is_sharded) {

In this PR I set config.force_ltc_data to False during the cache warm-up, hence it skips the aliasing. Regarding what aliasing does, we have a simple test:

# Two aliases already counted here, presumably from earlier in-place
# updates on t1 and t2 in the test.
self.assertEqual(met.metric_data("InputOutputAliasCount")[1], 2.0)
# Check in-place op aliasing: two more in-place updates add two aliases.
t3 = t1 + t2
t1 *= 2.0
t2 += 2.0
xm.mark_step()
self.assertEqual(met.metric_data("InputOutputAliasCount")[1], 4.0)

You can see that we perform two in-place operations on t1 and t2. We will have an HLO similar to:

hlo(input1, input2) -> (output1, output2, output3)

The inputs in this case are t1 and t2; the outputs are t3 and the new t1 and t2. What aliasing does is, instead of allocating new buffers for output1 and output2, reuse the input buffers, since pytorch/xla can tell the compiler these are in-place operations and we don't need t1's and t2's original buffers after this computation.

The bug we had prior to this PR is that we actually save some inputs for each dynamo graph. However, when we did xla_sync_multi to warm up the cache, we incorrectly triggered the aliasing. This means the buffers we saved were donated to the outputs of the xla_sync_multi, which crashes when we try to use the saved inputs for the real dynamo run.
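
To make that failure mode concrete, here is a hypothetical sketch; the variable names are made up, and the dynamo bridge's actual bookkeeping is more involved:

# Hypothetical illustration of the pre-fix failure mode.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t1 = torch.ones(4, device=device)

# Dynamo bridge (conceptually): remember t1's XLA buffer so the cached
# graph can be replayed against it later.
saved_inputs = [t1]

t1 *= 2.0  # pending in-place op, so t1's buffer is an aliasing candidate

# Pre-fix warm-up behaved like mark_step: aliasing donated t1's old buffer
# to the computation's output, invalidating the saved handle...
xm.mark_step()

# ...so a later "real" dynamo run that reads the saved, now-donated
# buffer could crash.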

@JackCaoG JackCaoG merged commit b7707fa into master Feb 14, 2023
JackCaoG added a commit that referenced this pull request Feb 16, 2023
* Fix HLO dumping (#4619)

* Update TF pin to 2/13 (#4615)

* Update TF pin to 2/13

* Fix pinned commit

* Add patch to revert TF 3e24055

* Add comment to new patch

* Fix patch command in TPU CI (#4623)

* Skip execution for extract_compiled_graph (#4612)

* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests

* Disable failing `test_operations.py` tests on TPU (#4622)

* Disable `test_operations.py` tests failing on TPU

* Add to TPU CI

* Bazel (#4528)

* Replace tensorflow with a bazel external repository

* Basic migration to bazel for xla_client.

* Revert to blob

* Add vscode config.

* Update newlines

* Merge with pjrt client test build changes.

* Migrate tests to new build

* Format test and plugin

* Order imports

* Conditionally apply tf patches; apply pt patches always.

* Format python

* configure formatters

* Mirror TF pin update and fixes in bazel.

* Support local and sandboxed build based on flags

* Add cloud cache URLs for llvm.

* Merge with upstream

* Update TF pin

* Fix patching regression

* Revert "Bazel (#4528)" (#4631)

This reverts commit 3a90f5a.

---------

Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
JackCaoG added a commit that referenced this pull request Feb 16, 2023
* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests
chandrasekhard2 pushed a commit that referenced this pull request Feb 22, 2023
* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests
mateuszlewko pushed a commit that referenced this pull request Mar 15, 2023
* Only warm up cache for dynamo extract_graph step

* Add missing config

* Make sure warm up run does not cause place holder to be created

* Fix tests