training support for dynamo+torchxla integration #88449
Conversation
Force-pushed from 00b55ae to 32c9f7b.
With some discussion, we found it's too much of a stretch to do the following two things at the same time:
There are quite a few weird issues we need to resolve. Here are some of them:
We feel it would be a good strategy to separate the concerns and handle these two items separately. We'll do item 2 first and follow up on item 1 after that. I'll suspend this PR and work on a new PR to specifically handle item 2.
Force-pushed from 32c9f7b to 01f81c5.
@JackCaoG there is one more issue, about FakeTensor. AOTAutograd uses FakeTensors during compilation. It looks like FakeTensor on XLA has some issues, and I get the following error stacks:
I can unblock myself for now by forcing AOTAutograd to use eager tensors, but this is something we should eventually look into (FakeTensor is required if we want to support dynamic shapes). Here is the command to reproduce:
cc @Chillee as we discussed a bit about the FakeTensor usage in AOTAutograd.
Force-pushed from 21c191e to 8036941.
In #87741 we added the inference support for dynamo/torchxla integration. Later on, in #88449 we attempted to add the training support. That attempt was not smooth because we tried 2 things together:
1. let dynamo trace the model on xla rather than eager
2. enable training

It turns out neither of these two tasks is trivial. Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd. AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to the xla device. That hurts performance a lot. Having a cache that maps an eager parameter to an XLA parameter does not solve the problem, since an update to either one is not automatically synced to the other; they easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step for enabling training. Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x.

```
+-------------------------+--------------------+-------------------------+
| Model                   | XLA (trace once)   | XLA (trace everytime)   |
+=========================+====================+=========================+
| resnet18                | 1.38               | 1.008                   |
+-------------------------+--------------------+-------------------------+
| resnet50                | 1.227              | 0.998                   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         | 1.544              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| alexnet                 | 1.085              | 1.045                   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            | 2.028              | 1.013                   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              | 1.516              | 0.995                   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           | 0.868              | 1.01                    |
+-------------------------+--------------------+-------------------------+
| vgg16                   | 1.099              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            | 3.26               | 1.027                   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 2.182              | 1.015                   |
+-------------------------+--------------------+-------------------------+
| geomean                 | 1.50389            | 1.01261                 |
+-------------------------+--------------------+-------------------------+
```

Example command:
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: #88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
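For context, a minimal sketch of what "tracing on XLA rather than eager" looks like from the user side, assuming torch_xla is installed and using the `torchxla_trace_once` backend named above (the model choice and input shapes here are illustrative, not part of this PR):

```python
import torch
import torch._dynamo as dynamo
import torch_xla.core.xla_model as xm
import torchvision

# Move both the model and its inputs to the XLA device *before* compiling,
# so dynamo traces XLA tensors directly and no eager->XLA copies of
# parameters/buffers are needed at run time.
device = xm.xla_device()
model = torchvision.models.resnet18().to(device).eval()
example_input = torch.randn(4, 3, 224, 224, device=device)

compiled = dynamo.optimize("torchxla_trace_once")(model)
with torch.no_grad():
    out = compiled(example_input)
```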
Force-pushed from 8036941 to a90f46e.
Force-pushed from a90f46e to 8180347.
To work around the dropout issue, I force torch.nn.Dropout to be a NopModule. Now the correctness checks for the remaining models all pass, but we still fail the correctness check for vgg16 and alexnet.
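A rough sketch of how such a workaround might look (a minimal sketch; the NopModule body shown here is illustrative, not the exact code used in this PR):

```python
import torch.nn as nn

class NopModule(nn.Module):
    """Drop-in no-op replacement for nn.Dropout.

    Dropout is random, so eager and XLA runs cannot match exactly;
    disabling it removes that source of noise from the correctness check.
    """
    def __init__(self, *args, **kwargs):
        # Accept and ignore Dropout's arguments (p, inplace).
        super().__init__()

    def forward(self, x):
        return x

# Monkey-patch so models constructed afterwards pick up the no-op version.
nn.Dropout = NopModule
```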
Force-pushed from 8180347 to 56d40ac.
@JackCaoG and I figured out why we fail the correctness check for vgg16 and alexnet. It's because XLA applies different sets of optimizations to different graphs. Here are the things we did to verify this point:
and found that XLA gives different results.
But I think points 1 and 2 are strong enough to convince me that the root cause is the different graph sizes. Jack suggested using 1e-3 as the tolerance for the correctness check.
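A minimal sketch of how a correctness check with that looser tolerance might look (the function and tensor names here are assumptions for illustration):

```python
import torch

def outputs_match(eager_out: torch.Tensor, xla_out: torch.Tensor) -> bool:
    # Compare eager and XLA results with a 1e-3 tolerance to absorb the
    # numeric drift caused by XLA applying different optimizations to
    # graphs of different sizes.
    return torch.allclose(eager_out, xla_out.cpu(), rtol=1e-3, atol=1e-3)
```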
Force-pushed from 56d40ac to f5315a6.
Force-pushed from 797fbbe to 82b8827.
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
stamp
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
@pytorchbot help
❌ 🤖 pytorchbot command failed.
@pytorchbot help
❌ 🤖 pytorchbot command failed.
Trying to dismiss the review from Jason to merge the PR, since Jason is on PTO.
Dismissing to merge the PR, since there is some issue blocking the merge.
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
@pytorchbot merge -f "Apologies for the inconvenience"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This is a follow-up to the previous PR #88449, to move the dynamo/TorchXLA bridge from the pytorch repo to the xla repo. Overall, the dynamo/TorchXLA integration has the following four layers of code:

- pybind layer: the bottom layer, containing various pybind APIs as the foundation. This part resides in the xla repo.
- bridge layer: built upon the pybind layer to implement the trace-once functionality. This layer and its corresponding unit tests were previously in the pytorch repo. This PR (and the corresponding xla PR pytorch/xla#4476) moves them to the xla repo.
- dynamo backend registration: a thin layer that registers 4 dynamo backends (training/inference/trace_once/trace_everytime). It remains in the pytorch repo.
- benchmark script: the torchbench.py script in dynamo is adapted so it can be used in the dynamo/TorchXLA integration. This one remains in the pytorch repo.

We think the new code organization is cleaner. I'll wait for the xla PR to land first before trying to merge this one.

Tests:
1. Run the unit tests moved to the xla repo.
2. Test for inference: `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --backend=torchxla_trace_once --only resnet18`
3. Test for training: `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only resnet18 --collect-outputs`

Pull Request resolved: #92601
Approved by: https://github.com/wconstab
We've already shown some promising perf results by integrating dynamo with torchxla for inference. To provide a consistent UX for training and inference, in this PR we try to enable training for dynamo/torchxla.
Training is trickier than inference, and we may not expect much perf gain since, even using the `torchxla_trace_once` bridge we added in dynamo, due to how AOT_Autograd works we will generate 3 graphs: one for the forward pass, one for the backward pass, and one for the optimizer. XLA favors larger graphs where it can do more optimizations. But we still want to add training support to dynamo/torchxla to make the work complete.
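To illustrate the forward/backward split, here is a minimal sketch of AOT_Autograd handing separate forward and backward graphs to compiler callbacks (the toy function and `print_compiler` are assumptions for illustration, not this PR's code; the optimizer graph is captured separately by dynamo):

```python
import torch
from functorch.compile import aot_function

def toy_fn(x, w):
    return torch.relu(x @ w).sum()

def print_compiler(gm, example_inputs):
    # AOT_Autograd calls this once for the forward FX graph and once for
    # the backward FX graph; each graph would be lowered (e.g. to XLA) on
    # its own, which is why training ends up with multiple smaller graphs.
    print(gm.graph)
    return gm.forward

compiled_fn = aot_function(toy_fn, fw_compiler=print_compiler, bw_compiler=print_compiler)
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)
compiled_fn(x, w).backward()  # prints two graphs: forward and backward
```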
We added a '--iterations-per-run' argument to control how many iterations we do per measure/device sync. This is to understand the impact of item 2 above.
Results:
With '--iterations-per-run' equal to 1, here are the perf numbers:
Overall it looks like graph breaks indeed cause a perf loss. But for BERT_pytorch and timm_vision_transformer we still see a perf gain. We need to do more experiments with larger '--iterations-per-run' values.
NOTE:
In torchbench.py I added the following code to apply a few workarounds:
Here are the contents of workaround.py:
It works around a few issues we found:
- The change to the `native_batch_norm` schema (splitting it into two legit schemas, #88697) in op decomposition causes batch_norm ops to fall back in torchxla. Fix from Jack in "Lower _native_batch_norm_legit" xla#4282 (comment). (Confirmed the fix after adding a Deduper to handle duplicated returns from the fx graph generated by AOTAutograd.)

Example command:
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire