Foreach gradient clipping #91846

Closed
wants to merge 19 commits

Conversation

@milesial (Contributor) commented Jan 7, 2023

Faster gradient clipping using the foreach functions

[------------------------ (tensors, scalar) -------------------------]
                                   |  without foreach  |  with foreach |    apex 
1 threads: ----------------------------------------------------------------------
      10 tensors of size 4         |         120.5     |       61.1    |     50.3
      100 tensors of size 4        |         946.2     |      239.5    |    136.3
      1000 tensors of size 4       |        9808.5     |     2151.1    |   1006.9
      10000 tensors of size 4      |       96871.2     |    22637.4    |  10119.1
      10 tensors of size 16        |         121.0     |       64.1    |     52.5
      100 tensors of size 16       |         993.4     |      252.6    |    136.7
      1000 tensors of size 16      |        9427.7     |     2151.2    |   1049.5
      10000 tensors of size 16     |       97437.1     |    22203.1    |  10340.0
      10 tensors of size 256       |         118.9     |       62.3    |     51.5
      100 tensors of size 256      |         955.2     |      243.1    |    134.2
      1000 tensors of size 256     |        9374.9     |     2140.7    |   1009.6
      10000 tensors of size 256    |       95302.5     |    21849.4    |  10215.5
      10 tensors of size 65536     |         118.5     |       62.4    |     51.1
      100 tensors of size 65536    |        1740.7     |      243.3    |    225.3
      1000 tensors of size 65536   |       17364.1     |     2228.7    |   2004.5
      10000 tensors of size 65536  |      177510.1     |    25410.4    |  20678.2
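
For readers unfamiliar with the foreach path: the idea is to replace the per-parameter Python loop with fused `torch._foreach_*` calls, so the per-tensor launch overhead no longer dominates. A minimal sketch of the approach (not the exact code in this PR; it assumes all gradients share one device and dtype, and uses a `.item()` sync for simplicity):

```python
import torch

def clip_grad_norm_foreach_sketch(parameters, max_norm, norm_type=2.0):
    # Sketch only: assumes every gradient lives on one device with one dtype;
    # the PR groups grads by (device, dtype) before issuing the foreach calls.
    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]
    grads = [p.grad for p in parameters if p.grad is not None]
    if len(grads) == 0:
        return torch.tensor(0.0)
    # One fused call computes the norm of every gradient tensor at once...
    norms = torch._foreach_norm(grads, norm_type)
    # ...and the per-tensor norms reduce to a single total norm.
    total_norm = torch.linalg.vector_norm(torch.stack(norms), norm_type)
    # One fused in-place multiply rescales all gradients together.
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0).item()
    torch._foreach_mul_(grads, clip_coef)
    return total_norm
```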

@pytorch-bot (bot) commented Jan 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91846

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d00dafa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@milesial marked this pull request as ready for review January 9, 2023 19:32
@albanD (Collaborator) left a comment

I'll let @janeyx99 take a look at this.
She is building a generic tool for doing this per_device_and_dtype_grads collection that will simplify this code.

@janeyx99 self-requested a review January 11, 2023 16:45
@janeyx99 (Contributor) left a comment

Hey! Can you add a test near test_clip_grad_norm in test/test_nn.py to ensure the calculations are the same?
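
Something along these lines, perhaps (a rough sketch only, with a hypothetical test name; `foreach=` is the keyword argument this PR adds to `clip_grad_norm_`):

```python
import torch
import torch.nn as nn

def test_clip_grad_norm_foreach_matches_default():
    for norm_type in (1.0, 2.0, float("inf")):
        # Two identical models, so each clipping path mutates its own grads.
        torch.manual_seed(0)
        ref = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
        fast = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
        fast.load_state_dict(ref.state_dict())

        x = torch.randn(16, 8)
        ref(x).sum().backward()
        fast(x).sum().backward()

        n_ref = nn.utils.clip_grad_norm_(ref.parameters(), 0.5, norm_type, foreach=False)
        n_fast = nn.utils.clip_grad_norm_(fast.parameters(), 0.5, norm_type, foreach=True)

        # Both the returned total norm and the clipped gradients should match.
        torch.testing.assert_close(n_ref, n_fast)
        for p_ref, p_fast in zip(ref.parameters(), fast.parameters()):
            torch.testing.assert_close(p_ref.grad, p_fast.grad)
```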

@janeyx99 (Contributor)

Regarding work on consolidating a util for creating this dictionary: I'm currently landing #92014, which has a version of this grouping function. It would be best if the functionality used across this PR could be abstracted to a common util in that file too!

@milesial (Contributor, Author)

I added tests and used _group_tensors_by_device_and_dtype, let's see what CI says
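
For context, the grouping exists because foreach kernels want all of their inputs on a single device with a single dtype. A simplified stand-in for the util (not its actual signature, just the idea) looks roughly like this:

```python
from collections import defaultdict
from typing import Dict, List, Tuple
import torch

def group_by_device_and_dtype(
    grads: List[torch.Tensor],
) -> Dict[Tuple[torch.device, torch.dtype], List[torch.Tensor]]:
    # Mixed setups (e.g. fp16 + fp32 params, or CPU + CUDA params) get
    # bucketed so each bucket can go through one fused foreach call.
    grouped: Dict[Tuple[torch.device, torch.dtype], List[torch.Tensor]] = defaultdict(list)
    for g in grads:
        grouped[(g.device, g.dtype)].append(g)
    return grouped

# Each bucket then gets its own fused norm computation, e.g.:
#   for (device, dtype), bucket in group_by_device_and_dtype(grads).items():
#       norms.extend(torch._foreach_norm(bucket, norm_type))
```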

@drisspg added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 12, 2023
@milesial (Contributor, Author)

@janeyx99 CI is green. I tweaked the import in your util file to avoid import race issues.

@janeyx99 (Contributor) left a comment

Thanks for the fast turnaround--looks awesome overall!

I had some nits and noob questions.

@janeyx99 (Contributor)

I confirm the tests passed!

@@ -11486,7 +11408,8 @@ def run_test_case(norm_type, error_if_nonfinite, scalar, grad_only_one_elem, pre

     @onlyCUDA
     @deviceCountAtLeast(2)
-    def test_clip_grad_norm_multi_device(self, devices):
+    @parametrize_test('foreach', (False, True))
+    def test_clip_grad_norm_multi_device(self, devices, foreach):
Contributor comment on the diff:

Not a concern with your PR, but I am realizing we never run this in CI because we only have one CI config where there is more than one GPU and we don't run this test in that config. 🤔 Filed #92173

Contributor reply:

And then you would also need to add the ciflow/periodic label to get the multigpu tests to trigger.

@janeyx99 (Contributor) left a comment

Looks good to me! Thank you very much for the perf speedup :D

(sorry about the merge conflict--that's my bad)

@milesial (Contributor, Author) commented Jan 18, 2023

@janeyx99 I rebased, but the signature change breaks torch XLA, since the patching there expects the old signature:

 Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/xla/test/../../test/test_view_ops.py", line 15, in <module>
    from torch.testing._internal.common_device_type import \
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 603, in <module>
    mod = runpy.run_path(path, init_globals=globals())  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/conda/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/var/lib/jenkins/workspace/xla/test/pytorch_test_base.py", line 8, in <module>
    import torch_xla
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-2.0.0-py3.7-linux-x86_64.egg/torch_xla/__init__.py", line 140, in <module>
    _apply_patches()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-2.0.0-py3.7-linux-x86_64.egg/torch_xla/_patched_functions.py", line 54, in _apply_patches
    nn.utils.clip_grad_norm_ = _patch(nn.utils.clip_grad_norm_, clip_grad_norm_)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-2.0.0-py3.7-linux-x86_64.egg/torch_xla/_patched_functions.py", line 16, in _patch
    fn, xfingerprint, fingerprint))
RuntimeError: Unable to patch <function clip_grad_norm_ at 0x7fc8b282d950>, signature mismatch: (parameters: Union[torch.Tensor, Iterable[torch.Tensor]], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False, foreach: bool = None) -> torch.Tensor vs (parameters: Union[torch.Tensor, Iterable[torch.Tensor]], max_norm: float, norm_type: float = 2.0, error_if_nonfinite: bool = False) -> torch.Tensor

from https://github.com/pytorch/xla/blob/d636e7774b63cc070d7ebbfeec950e4892efa713/torch_xla/_patched_functions.py#L21-L54

How should we approach that? Should we first merge a fix to torch XLA that accepts both the old and new signatures, and then merge this PR?
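
For reference, the XLA-side fix is essentially just widening the patched override's signature so that torch_xla's fingerprint check passes again; a hypothetical sketch (the real override in torch_xla/_patched_functions.py keeps its XLA-specific body):

```python
from typing import Iterable, Union
import torch

def clip_grad_norm_(
    parameters: Union[torch.Tensor, Iterable[torch.Tensor]],
    max_norm: float,
    norm_type: float = 2.0,
    error_if_nonfinite: bool = False,
    foreach: bool = None,
) -> torch.Tensor:
    # XLA-specific clipping body unchanged; `foreach` only has to be accepted
    # (it can simply be ignored) for the signature fingerprints to match.
    ...
```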

@janeyx99 (Contributor)

Yes, we'd want to sync the lands. If we're able to fix xla without breaking pytorch, then go for it and land a patch there first. In general, we follow these steps, but the force merging may be unnecessary if xla can be green the whole time:

(1) Make a pytorch/pytorch PR and a pytorch/xla patch
(2) update xla.txt on the pytorch/pytorch PR to point to your patch
(3) once pytorch/pytorch CI is fully green, rebase on tip-of-master again to minimize the chance of merge conflicts / last minute problems
(4) once you get fully green CI on the newly rebased pytorch/pytorch PR, merge the pytorch/xla PR (XLA CI will start failing)
(5) update the pytorch/pytorch PR to the new tip-of-master XLA commit hash (no other changes should be required), and immediately force-merge. You’re counting on the fact that CI was green ~3 hours ago.

The force merge is because we’re betting on the fact that nothing should have changed from the last run to the next, and we don’t want to keep XLA CI red for an unnecessary 3 hours. And the rebasing beforehand is because at least so far, merge conflicts have been a frequent source of “the pytorch/xla PR merged, but the pytorch/pytorch PR is no longer ready”

An example can be found by following https://github.com/pytorch/xla/blob/d636e7774b63cc070d7ebbfeec950e4892efa713/.circleci/README.md?plain=1#L10

@milesial requested a review from a team as a code owner January 18, 2023 22:31
@milesial (Contributor, Author) commented Jan 19, 2023

I don't think I can do step (2), since I don't have write access to the xla repo; I only have a fork, and I don't think xla.txt can point to a fork. I tried setting it to pull/4471/head, but that didn't work.

pytorch/xla#4471

@milesial (Contributor, Author)

@wonjoolee95 I rebased this PR. Once it's green you can merge the XLA PR. Then I'll update the pin on this PR and force merge.

@milesial (Contributor, Author)

@wonjoolee95 CI passed, can you merge the XLA PR?

@wonjoolee95 (Collaborator)

> @wonjoolee95 CI passed, can you merge the XLA PR?

@milesial, just merged to master. The new pin should be eac4e547138ab22a9b41c6f96208613fd7dd19d5.

@milesial (Contributor, Author)

Thanks, let's hope it goes smoothly

@pytorchbot merge -f "coordinating merge with XLA, CI passed"

@pytorch-bot (bot) commented Jan 20, 2023

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

@JackCaoG (Collaborator)

FYI, next time we can merge this PR first (the pin can point to an unmerged branch, and that's by design), and then merge the xla one. Otherwise xla head will be red until this PR is merged.

@milesial (Contributor, Author)

Haha I can't force merge, @janeyx99 can you help?

@wonjoolee95 (Collaborator)

Okay since we can't force merge this right now, I'm going to revert the XLA's PR lol.

@JackCaoG (Collaborator)

@milesial don't worry about our revert; as long as pytorch still pins to the correct pytorch/xla commit, pytorch CI will be fine.

@wonjoolee95 (Collaborator)

You can update the XLA pin in this PR to 8dcab83819368f468dadbe6e81b064d268830df2 and merge -g. I'll merge the XLA's companion PR once this merges.

@milesial (Contributor, Author)

I think keeping eac4e is fine no? No need to switch to 8dca right?

@pytorchbot merge -g

@pytorch-bot (bot) added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jan 20, 2023
@JackCaoG (Collaborator)

yea, you can keep the old pin

@janeyx99 (Contributor)

Oh I can force merge :D

@janeyx99 (Contributor)

@pytorchbot merge -f "coordinating with xla, prev ci was all green!"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@milesial (Contributor, Author)

Nice, I guess the XLA side can re-revert, sorry for the commit mess haha. And thanks for the help, that was a fun force-push-to-prod Friday

@janeyx99 (Contributor)

don't speak too soon 🙃

@ngimel (Collaborator) commented Feb 1, 2023

@milesial @crcrpar can you check whether a debug build using this op errors out? We have reports of debug builds erroring out with:

** RuntimeError: t.storage().use_count() == 1 INTERNAL ASSERT FAILED at "caffe2/torch/csrc/autograd/autograd_not_implemented_fallback.cpp":189, please report a bug to PyTorch.

Edit: the issue is most likely not with this PR, which is just the Python enablement, but with the _foreach_norm implementation itself.

@milesial (Contributor, Author) commented Feb 2, 2023

I'll check.
By debug build you mean building with DEBUG=1 right?
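
A minimal check under such a build might look like this (hypothetical repro; it assumes the device supports the foreach path, and the second call is only a guess at a more direct trigger):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Exercise the new foreach clipping path end to end.
params = [torch.randn(64, device=device, requires_grad=True) for _ in range(8)]
loss = sum((p ** 2).sum() for p in params)
loss.backward()
print(torch.nn.utils.clip_grad_norm_(params, max_norm=1.0, foreach=True))

# The assert message points at the autograd-not-implemented fallback, so also
# hit _foreach_norm directly on a tensor that requires grad.
print(torch._foreach_norm([torch.randn(64, device=device, requires_grad=True)], 2.0))
```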

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

9 participants