
training support for dynamo+torchxla integration #88449

Closed

Conversation

shunting314
Contributor

@shunting314 shunting314 commented Nov 3, 2022

We've already shown some promising perf results by integrating dynamo with torchxla for inference. To provide a consistent UX for both training and inference, this PR tries to enable training for dynamo/torchxla.

Training is trickier than inference, and we may not expect much perf gain since:

  1. In the training case, torchxla only generates a single combined graph for fwd/bwd/optimizer, while with the torchxla_trace_once bridge we added in dynamo, due to how AOTAutograd works, we generate 3 graphs: one for the forward pass, one for the backward pass, and one for the optimizer (see the sketch after this list). XLA favors larger graphs because they give it more room to optimize.
  2. In the training case, tracing overhead can be overlapped with computation, so tracing overhead is not as big a deal for training as it is for inference. After all, training cares more about throughput while inference cares more about latency.
  3. In the training case, people can increase the batch size to 'mitigate' the tracing overhead. Increasing the batch size does not change the tracing overhead, so the tracing overhead 'per example' shrinks.
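
As a rough illustration of point 1 (this uses the public functorch aot_function API purely as an example of the forward/backward split; it is not the dynamo/torchxla bridge code, and the printing "compiler" is a stand-in): AOTAutograd hands the forward and backward graphs to separate compiler callbacks, and the optimizer step is traced as yet another graph, so the XLA backend sees several smaller graphs instead of one combined one.

```python
import torch
from functorch.compile import aot_function

def make_printer(name):
    # A trivial "compiler": print the FX graph it receives and run it as-is.
    def compiler(fx_module, example_inputs):
        print(f"--- {name} graph ---")
        print(fx_module.code)
        return fx_module
    return compiler

def f(x, w):
    return (x @ w).relu().sum()

compiled = aot_function(f, fw_compiler=make_printer("forward"),
                        bw_compiler=make_printer("backward"))

x = torch.randn(4, 4)
w = torch.randn(4, 4, requires_grad=True)
compiled(x, w).backward()  # prints two separate graphs: forward and backward
```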

But we still want to add training support to dynamo/torchxla to make the work complete.

We added an '--iterations-per-run' argument to control how many iterations we run per measurement/device sync. This is to understand the impact of item 2 above.
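
A minimal sketch of what the flag controls (this is not the actual benchmark harness; model_iter_fn is a hypothetical helper that runs one fwd/bwd/optimizer step, and only the standard torch_xla sync calls are assumed): we run N iterations per device sync/measurement, so the tracing of one iteration can overlap with the device execution of the previous one.

```python
import time
import torch_xla.core.xla_model as xm

def measure(model_iter_fn, model, inputs, iterations_per_run=1):
    start = time.perf_counter()
    for _ in range(iterations_per_run):
        model_iter_fn(model, inputs)  # one fwd/bwd/optimizer step (hypothetical)
    xm.mark_step()        # flush pending lazy work to the device
    xm.wait_device_ops()  # block until the device finishes, then stop the clock
    return (time.perf_counter() - start) / iterations_per_run
```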

Results:

With '--iterations-per-run' set to 1, here are the perf numbers:

+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |             0.91   |                0.959    |
+-------------------------+--------------------+-------------------------+
| resnet50                |             0.917  |                0.932    |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |             0.912  |                0.905    |
+-------------------------+--------------------+-------------------------+
| alexnet                 |             1.038  |                0.974    |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |             0.881  |                0.835    |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |             0.903  |                0.931    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |             0.914  |                0.967    |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |             1.359  |                0.84     |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |             1.288  |                0.893    |
+-------------------------+--------------------+-------------------------+
| geomean                 |             1.0006 |                0.913794 |
+-------------------------+--------------------+-------------------------+

Overall it looks like graph breaks indeed cause a perf loss, but for BERT_pytorch and timm_vision_transformer we still see a perf gain. We need to do more experiments with a larger '--iterations-per-run'.

NOTE:
In torchbench.py I added the following code as a few workarounds:

from myscripts import workaround # TODO will remove this line before landing

Here is the content of workaround.py:

import os

import torch

# Monkey-patch: replace torch.nn.MaxPool2d with torch.nn.AvgPool2d when the
# REPLACE_MAXPOOL env var is set, to work around the MaxPool2d training issue
# on XLA devices (see item 1 below).
if os.environ.get("REPLACE_MAXPOOL", "0") == "1":
    torch.nn.MaxPool2d = torch.nn.AvgPool2d

It works around a few issues we found:

  1. MaxPool2d does not work for training in dynamo/torchxla: Dynamo can not optimize a model with MaxPool2d on XLA devices (torchdynamo#1837). WIP fixes from Brian in codegen fixes to fix tracing XLA autograd ops #90226 and https://github.com/pytorch/xla/pull/4276/files.
  2. A recent change in op decomposition (Rectify native_batch_norm schema by splitting it into two legit schemas #88697) causes batch_norm ops to fall back in torchxla. Fix from Jack in Lower _native_batch_norm_legit xla#4282 (comment). Confirmed the fix after adding a Deduper to handle duplicated returns from the fx graph generated by AOTAutograd (see the sketch after this list).
  3. We have an issue handling dropout because the random seed goes out of sync. Here is the fix: api to grab base seed as device data xla#4293 (confirmed the fix).
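
As a rough sketch of the Deduper idea mentioned in item 2 (the names and structure below are made up for illustration; this is not the actual bridge code): when the fx graph produced by AOTAutograd returns the same tensor multiple times, keep only the first occurrence and remember how to expand the deduplicated results back to the original arity.

```python
def dedup_outputs(outputs):
    # Map each distinct output object to its position in the deduped list.
    seen = {}
    deduped, index_map = [], []
    for out in outputs:
        key = id(out)
        if key not in seen:
            seen[key] = len(deduped)
            deduped.append(out)
        index_map.append(seen[key])
    return deduped, index_map

def expand_results(deduped_results, index_map):
    # Rebuild the original output arity from the deduplicated results.
    return [deduped_results[i] for i in index_map]
```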

Example command:

REPLACE_MAXPOOL=1 USE_FAKE_TENSOR=0 GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only vgg16

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

@pytorch-bot

pytorch-bot bot commented Nov 3, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88449

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit 82b8827:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@JackCaoG
Collaborator

JackCaoG commented Nov 5, 2022

I think the DeviceData that is missing the info is the random seed. The random seed does not have a tensor associated with it, hence it will never have a tensor_id or info. I guess we can work around this special case in the bridge.
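
A minimal sketch of the kind of special-casing described here (the try/except fallback and the None placeholder are assumptions, not the actual bridge code): when collecting tensor ids for graph inputs, tolerate entries such as the seed DeviceData that have no backing XLA tensor instead of failing.

```python
import torch_xla

def collect_tensor_ids(xla_args):
    ids = []
    for xla_arg in xla_args:
        try:
            ids.append(torch_xla._XLAC._xla_get_tensor_id(xla_arg))
        except RuntimeError:
            # e.g. the random-seed DeviceData, which has no tensor_id or info
            ids.append(None)
    return ids
```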

@shunting314
Contributor Author

shunting314 commented Nov 9, 2022

After some discussion, we found it is too much of a stretch to do the following two things at the same time:

  1. Instead of moving data across devices in the dynamo compiler function, do that once at the top level before calling the model.
  2. Enable AOTAutograd for training.

There are quite a few weird issues we need to resolve. Here are some of them

We feel it would be a good strategy to separate the concerns and handle these two items separately. We'll do item 2 first and follow up on item 1 after that. I'll put this PR on hold and work on a new PR that specifically handles item 2.

cc @JackCaoG @wconstab @jansel

@shunting314
Contributor Author

shunting314 commented Nov 14, 2022

@JackCaoG there is one more issue, about FakeTensor. AOTAutograd uses FakeTensors during compilation. It looks like FakeTensor on XLA has some issues, and I get the following error stack:

Traceback (most recent call last):
  File "/pytorch/torch/_dynamo/optimizations/backends.py", line 53, in inner
    return fn(model, **kwargs)
  File "/pytorch/torch/_dynamo/optimizations/backends.py", line 800, in torchxla_trace_once
    return integration.extract_compiled_graph(model, example_inputs)
  File "/pytorch/torch/_dynamo/optimizations/torchxla_integration.py", line 172, in extract_compiled_graph
    torch_xla._XLAC._xla_get_tensor_id(xla_arg) for xla_arg in xla_args
  File "/pytorch/torch/_dynamo/optimizations/torchxla_integration.py", line 172, in <listcomp>
    torch_xla._XLAC._xla_get_tensor_id(xla_arg) for xla_arg in xla_args
RuntimeError: /pytorch/xla/torch_xla/csrc/aten_xla_bridge.cpp:73 : Check failed: xtensor != nullptr

I can unblock myself for now by forcing AOTAutograd to use eager tensors, but this is something we should eventually look into (FakeTensor is mandated if we want to support dynamic shapes).

Here is the command to reproduce:

USE_FAKE_TENSOR=1 override_model=linear GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --only resnet18 --backend=aot_torchxla_trace_once

cc @Chillee as we discussed a bit about the FakeTensor usage in AOTAutograd.

@shunting314 shunting314 force-pushed the dynamo_torchxla_training branch 2 times, most recently from 21c191e to 8036941 on November 16, 2022 00:25
pytorchmergebot pushed a commit that referenced this pull request Nov 22, 2022
In #87741 we added inference support for the dynamo/torchxla integration. Later on, in #88449, we attempted to add training support. That attempt was not smooth because:
- we tried 2 things together:
   1. let dynamo trace the model on XLA rather than eager
   2. enable training
- it turns out neither of these two tasks is trivial.

Furthermore, item 2 (enabling training) depends on item 1 (tracing on XLA). We enable training via AOTAutograd, and AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to XLA devices, which hurts performance a lot. Having a cache that maps eager parameters to XLA parameters does not solve the problem, since an update to either side will not sync automatically to the other; they will easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step toward enabling training.
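
A rough sketch of what "tracing on XLA rather than eager" means on the caller side (plain torch_xla device placement, not the bridge internals): the model and example inputs are placed on the XLA device up front, so the parameters/buffers that AOTAutograd lifts as graph inputs are already XLA tensors and nothing needs to be copied per step.

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)             # params/buffers live on XLA
example_inputs = [torch.randn(4, 8, device=device)]  # inputs start on XLA too
```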

Also, tracing on XLA makes the data movement more efficient: we see a 1.5x geomean speedup compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: #88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
@shunting314
Contributor Author

shunting314 commented Dec 3, 2022

The code is cleaned up. The 10 models we tested for inference can mostly be run now (one fails because of a fallback). But I just found an issue with Dropout which causes incorrect results: #90097 . cc @JackCaoG

@shunting314
Contributor Author

To work around the dropout issue, I force torch.nn.Dropout to be a NopModule (a no-op). Now the correctness checks for:
resnet18
resnet50
resnext50_32x4d
BERT_pytorch
timm_vision_transformer
mnasnet1_0
mobilenet_v2

all pass. But we still fail the correctness check for vgg16 and alexnet.

@shunting314
Contributor Author

shunting314 commented Dec 6, 2022

@JackCaoG and I figured out why we fail the correctness check for vgg16 and alexnet. It's because of the different sets of optimizations XLA applies to different graphs.

Here are the things we did to verify this:

  1. In the timed method we print the sum of the prediction tensor before and after the mark_step call as follows:
            # result[0] is the prediction
            print(f"result sum {result[0].sum()}")
            xm.mark_step()
            print(f"result after ms sum {result[0].sum()}")

and found that XLA gives different results:

result sum -0.019893646240234375
result after ms sum -0.019596099853515625

mark_step causes a larger graph to be compiled and optimized more aggressively. The optimizations XLA applies may cause numerical instability.

  2. Considering that the baseline compiles a single large graph, while the 'trace_once' optimization compiles 1 graph for the forward pass, 1 graph for the backward pass, and 1 graph for the optimizer, we can manually add a mark_step in forward_and_backward_pass right after the forward pass (see the sketch after this list). This way, we force the baseline to do similar graph breaks as trace_once. We verified that both vgg16 and alexnet then pass the correctness check.

  3. Jack also suggested using XLA_GET_TENSORS_OPBYOP=1 XLA_SYNC_TENSORS_OPBYOP=1 to disable all XLA optimizations. But unfortunately trace_once fails because the compiled graph is not found. Maybe the flags cause some change in the graph hash computation.
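
A hedged sketch of experiment 2 (the forward_and_backward_pass body below, including the dummy loss, is illustrative rather than the actual torchbench helper): insert an extra mark_step right after the forward pass so the baseline breaks its graph at roughly the same point as trace_once.

```python
import torch
import torch_xla.core.xla_model as xm

def forward_and_backward_pass(model, example_inputs, optimizer):
    optimizer.zero_grad(set_to_none=True)
    pred = model(*example_inputs)
    xm.mark_step()     # extra graph break after forward, mirroring trace_once
    loss = pred.sum()  # dummy loss, just for this illustration
    loss.backward()
    optimizer.step()
    xm.mark_step()
    return pred
```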

But I think 1 and 2 are strong enough to convince me that the root cause is the difference in graph size.

Jack suggested using 1e-3 as the tolerance for the correctness check.
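
For concreteness, a relaxed comparison along the lines Jack suggested might look like the sketch below (the helper name is made up; the real benchmark harness has its own accuracy check):

```python
import torch

def check_correctness(actual, expected, tol=1e-3):
    # Compare trace_once results against the baseline with a looser tolerance,
    # since XLA applies different optimizations to differently sized graphs.
    return torch.allclose(actual, expected, rtol=tol, atol=tol)
```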

@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

This PR needs to be approved by an authorized maintainer before merge.

Contributor

@qihqi qihqi left a comment


stamp

@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

This PR needs to be approved by an authorized maintainer before merge.

@malfet
Contributor

malfet commented Jan 5, 2023

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

This PR needs to be approved by an authorized maintainer before merge.

@malfet
Contributor

malfet commented Jan 5, 2023

@pytorchbot help

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label', 'drci')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.

@malfet
Contributor

malfet commented Jan 5, 2023

@pytorchbot help

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label', 'drci')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.

@shunting314
Contributor Author

Trying to dismiss the review from Jason to merge the PR, since Jason is on PTO.

@shunting314 shunting314 dismissed jansel’s stale review January 5, 2023 19:50

To merge the PR, since there is some issue with merging it.

@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

This PR needs to be approved by an authorized maintainer before merge.

@huydhn huydhn removed the request for review from jansel January 5, 2023 19:52
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 5, 2023
@huydhn
Contributor

huydhn commented Jan 5, 2023

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Jan 5, 2023

This PR needs to be approved by an authorized maintainer before merge.

@malfet
Contributor

malfet commented Jan 5, 2023

@pytorchbot merge -f "Apologies for the inconvenience"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jan 25, 2023
This is a follow-up to the previous PR (#88449), moving the dynamo/TorchXLA bridge from the pytorch repo to the xla repo.

Overall the dynamo/TorchXLA integration has the following four layers of code:
- pybind layer: the bottom layer containing various pybind APIs as the foundation. This part resides in the xla repo.
- bridge layer: built upon the pybind layer to implement the trace-once functionality. This layer and its corresponding unit tests were previously in the pytorch repo. This PR (and the corresponding xla PR pytorch/xla#4476) moves them to the xla repo.
- dynamo backend registration: a thin layer that registers 4 dynamo backends (training/inference/trace_once/trace_everytime); see the sketch after this list. It remains in the pytorch repo.
- benchmark script: the torchbench.py script in dynamo is adapted so it can be used for the dynamo/TorchXLA integration. It remains in the pytorch repo.
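
As a small illustration of how thin the registration layer can be (this is a generic dynamo backend sketch, not the actual torchxla registration code; the bridge call is a placeholder): a dynamo backend is just a callable that takes an fx.GraphModule plus example inputs and returns a callable.

```python
import torch
import torch._dynamo as dynamo

def my_torchxla_backend(gm, example_inputs):
    # In the real integration this would hand the graph to the TorchXLA bridge;
    # here we simply run the captured graph eagerly.
    return gm.forward

@dynamo.optimize(my_torchxla_backend)
def fn(x):
    return torch.relu(x) + 1

print(fn(torch.randn(4)))
```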

We think the new code organization is cleaner.

I'll wait for the xla PR to land first before trying to merge this one.

Tests
1. run the unit tests moved to the xla repo
2. Test for inference:  `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --backend=torchxla_trace_once --only resnet18`
3. Test for training: `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only resnet18 --collect-outputs`

Pull Request resolved: #92601
Approved by: https://github.com/wconstab
@github-actions github-actions bot deleted the dynamo_torchxla_training branch July 6, 2024 01:52