
Conversation

yaochengji
Collaborator

Note this PR should be used together with the pytorch PR.

@yaochengji
Collaborator Author

@davidel @JackCaoG Could you help review?

@JackCaoG
Collaborator

JackCaoG commented Dec 1, 2020

@yaochengji Sorry for the delay, will take a look today.

@yaochengji
Collaborator Author

Thanks, @JackCaoG. Note the torch_xla/amp folder is almost the same as the amp folder in pytorch.

@ailzhang
Contributor

ailzhang commented Dec 1, 2020

@yaochengji Thanks for contributing to pytorch/xla! Would you mind adding a bit more context here, e.g. what's your main use case for pytorch/xla + amp, and with this PR have you seen meaningful perf gains on some common models? Thanks a lot!

@yaochengji
Collaborator Author

yaochengji commented Dec 1, 2020

Hi @ailzhang, I mainly use torch/xla to accelerate PyTorch training on GPUs, and for almost all CV models it is better to enable AMP.

With the tensorflow fix, torch/xla AMP training of resnet-50 on a V100 can reach 1150 images/s, which is roughly a 50% speedup compared to pytorch-amp (refer here), and only a little slower than tf amp (refer here).

@JackCaoG
Collaborator

JackCaoG commented Dec 1, 2020

Wow, that performance improvement with xla:gpu is amazing. Copying files from pytorch doesn't sound like a very good idea, though. If the amp logic is device agnostic, is there a way to make xla share the same code instead of copying it?

@yaochengji
Collaborator Author

Hi @JackCaoG, I made some modifications in pytorch so that I could simplify the code change here.

@davidel
Collaborator

davidel commented Dec 3, 2020

Sorry for the delay, I need a bit of time to take a look at this one.

@ailzhang ailzhang self-requested a review December 4, 2020 22:52
@yaochengji yaochengji force-pushed the add-amp branch 3 times, most recently from 6305eaa to 3d3662c Compare December 11, 2020 06:46
@ailzhang ailzhang requested a review from JackCaoG December 16, 2020 22:35
@JackCaoG
Collaborator

Sorry for the delay, I will try to take a look soon

@yaochengji
Collaborator Author

Thanks, @JackCaoG.

Collaborator

@JackCaoG JackCaoG left a comment


@yaochengji Thanks for contributing! I have not finished reviewing the whole thing, will circle back after vacation.

Collaborator

@JackCaoG JackCaoG left a comment


Mostly LGTM, some nits

scatter_dnums);
}

std::vector<xla::XlaOp> BuildAmpForachNonFiniteCheckAndUnscale(
Collaborator


I think you mentioned "And my XlaOp could only handle the scaler one." Could you point me to where that limitation is coming from?

Collaborator


Sorry, could you give a bit more detail? Why does the ReduceAll return type matter here?

Collaborator Author


The result of xla::ReduceAll is a scalar, which cannot be used together with an XlaOp of shape (1,).
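
(For illustration only, not the code in this PR: a minimal C++ sketch of the shape issue being discussed, assuming the XLA client API of that era — xla::ReduceAll collapses its input to a rank-0 scalar, so matching it against a shape-(1,) XlaOp takes an explicit Reshape. The helper name is hypothetical.)

```cpp
#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
#include "tensorflow/compiler/xla/client/xla_builder.h"

// Hypothetical helper: fold a per-tensor finiteness check into a shape-(1,)
// found_inf flag.
xla::XlaOp MergeIntoFoundInf(xla::XlaBuilder* b, xla::XlaOp check,
                             xla::XlaOp found_inf /* shape (1,) */) {
  // ReduceAll returns a rank-0 scalar (shape ()), not shape (1,).
  xla::XlaOp scalar = xla::ReduceAll(
      check, xla::ConstantR0<float>(b, 0.0f),
      xla::CreateScalarMaxComputation(xla::F32, b));
  // Reshape the scalar to (1,) so both operands of Max have the same shape.
  xla::XlaOp as_1d = xla::Reshape(scalar, /*new_sizes=*/{1});
  return xla::Max(as_1d, found_inf);
}
```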



BuildAmpForachNonFiniteCheckAndUnscale — is "Forach" a typo for "foreach"?

Collaborator Author


Good catch! Fixed.

Collaborator

@JackCaoG JackCaoG left a comment


Thanks for making the change, I think we are pretty close!


@yaochengji yaochengji force-pushed the add-amp branch 3 times, most recently from eb9c01d to efd803d Compare January 8, 2021 06:37
@JackCaoG
Collaborator

JackCaoG commented Jan 9, 2021

@yaochengji Can you rebase this branch? I think you need #2685

@yaochengji
Collaborator Author

@JackCaoG rebased, but CI cannot pass until the corresponding pytorch PR is merged.

@JackCaoG
Collaborator

@yaochengji You can add a torch_patches/.torch_pin file, which will make CI use the specified branch of pytorch. https://github.com/pytorch/xla/pull/2718/files is an example of using a pinned version of pytorch.

@JackCaoG
Collaborator

gpu test failed with message

/var/lib/jenkins/.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:143.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
1654784it [00:00, 2972434.46it/s]                          
8192it [00:00, 33651.25it/s]/opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "test/test_train_mp_mnist_amp.py", line 194, in <module>
    xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=FLAGS.num_cores)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/tmp/pytorch/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/tmp/pytorch/torch/multiprocessing/spawn.py", line 136, in join
    signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS

Exited with code exit status 1

@yaochengji Could you take a look? Thanks!

@yaochengji
Collaborator Author

> gpu test failed with message ... torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS ... @yaochengji Could you take a look? Thanks!

I'm looking into it. It is because I changed inv_scale back to shape (1,) on the pytorch side, which then hits a broadcast error when multiplying inv_scale with a non-scalar tensor. Do you know how to correct this with XLA primitives? I tried applying xla::Broadcast manually and using xla::ReduceAll, but neither seemed to work.
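
(An illustrative sketch of the approach described in the next comment, not the exact code that landed: reduce the shape-(1,) inv_scale to a rank-0 scalar with xla::ReduceAll; a scalar operand can then multiply a tensor of any shape via XLA's implicit scalar broadcast. The helper name is hypothetical and assumes the XLA client headers of that era.)

```cpp
#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
#include "tensorflow/compiler/xla/client/xla_builder.h"

// Hypothetical helper: unscale one gradient tensor by a shape-(1,) inv_scale.
xla::XlaOp UnscaleGrad(xla::XlaBuilder* b, xla::XlaOp grad,
                       xla::XlaOp inv_scale /* shape (1,) */) {
  // Summing the single element yields a rank-0 scalar holding the same value.
  xla::XlaOp inv_scale_scalar = xla::ReduceAll(
      inv_scale, xla::ConstantR0<float>(b, 0.0f),
      xla::CreateScalarAddComputation(xla::F32, b));
  // A scalar operand broadcasts implicitly against grad's shape in xla::Mul.
  return xla::Mul(grad, inv_scale_scalar);
}
```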

@yaochengji
Collaborator Author

I could fix the inv_scale shape (1,) issue by using xla::ReduceAll to reduce inv_scale to a scalar first, but I still hit another problem when testing _amp_update_scale:

Invalid argument: Run-time shape mismatch for XRTExecute argument[0] (3804975852334223). Expected element_type: F32
layout {
  format: DENSE
}
; got element_type: F32
dimensions: 1
layout {
  minor_to_major: 0
  format: DENSE
}
is_dynamic_dimension: false

         [[{{node XRTExecute}}]]

@yaochengji
Collaborator Author

yaochengji commented Feb 18, 2021

@JackCaoG, I have already fixed the shape (1,) issue.

And when I ran GPU_NUM_DEVICES=2 python3 test/test_train_mp_mnist.py on the master branch, the process 0 terminated with signal SIGBUS error still occurred. Could you help check it on your side?

BTW, GPU_NUM_DEVICES=2 python3 test/test_train_mp_mnist_amp.py --fake_data on the add-amp branch runs successfully.

@JackCaoG
Collaborator

@yaochengji thanks for the update, I will try to take a look later today. So the SIGBUS error is from the master? From the circleCI run, it seems like test_train_mnist.py --tidy passed but it failed on test_train_mp_mnist_amp.

@yaochengji
Collaborator Author

> So the SIGBUS error is from the master? From the circleCI run, it seems like test_train_mnist.py --tidy passed but it failed on test_train_mp_mnist_amp.

Yes, it seems to come from the master branch. I'm double-checking on another machine in case of an environment issue.

@JackCaoG
Collaborator

Thanks @yaochengji! BTW, did you ever try a 4 GPU setup? Did you ever encounter the error described in #2758?

@yaochengji
Collaborator Author

yaochengji commented Feb 18, 2021

> #2758

I just reproduced the error on another clean machine on the master branch.

The errors in the 2 GPU and 4 GPU setups are the same: process 0 terminated with signal SIGBUS.

The 1 GPU setup passes, which is why test_train_mnist.py --tidy succeeded.

@yaochengji
Collaborator Author

To summarize, on the master branch:

  1. GPU_NUM_DEVICES=2 python3 test/test_train_mp_mnist.py failed
  2. GPU_NUM_DEVICES=1 python3 test/test_train_mp_mnist.py passed
  3. GPU_NUM_DEVICES=2 python3 test/test_train_mp_mnist.py --fake_data passed

@JackCaoG
Collaborator

Thanks for the confirmation. The team is pretty busy with the upcoming 1.8 release and the tpuvm public preview work, but I will try to take a look before tomorrow. Since this error is not introduced by this PR, I think we can work around it for now using --fake_data. @ailzhang wdyt?

Contributor

@ailzhang ailzhang left a comment


We can merge this to master first, I will take a look at the multi GPU test failure tmr.
Thanks @yaochengji for contributing!

@JackCaoG
Collaborator

Great work @yaochengji, thanks for the contribution! My hands are a bit tied with tpuvm stuff right now, but I will try to review and import the XRT PR when I am a bit more free.

@JackCaoG JackCaoG merged commit e2aedb7 into pytorch:master Feb 19, 2021
@yaochengji
Collaborator Author

@JackCaoG @ailzhang Thanks for your time reviewing the PR. I'm experimenting with torch/xla at my company and am willing to contribute more if I can.


Labels: lowering (ATen Operation lowering), REMOVE_TORCH_PIN