jon-chuang
Collaborator

@jon-chuang jon-chuang commented Oct 1, 2023

In order to avoid running into #110342, we replace the complicated cat + max logic with a simple torch.maximum (see also the fairseq repo, which performs the same operation with a different, less explicit syntax).

Like #110339, we also ensure that step is on device.

As noted in #107006 (comment), it is likely that num_kernels=2 is optimal, unless one can move the numel=1, shape=(0,) scalar computations entirely into CPU.

That being said, in the foreach case, a parallel foreach scalar computation on GPU doesn't seem too bad.
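
The cat + max replacement described above can be sketched as follows. This is a minimal, out-of-place illustration rather than the code in this PR: the helper names `exp_inf_via_cat_max` and `exp_inf_via_maximum` and the values of `beta2` and `eps` are assumptions chosen for the example.

```python
import torch


def exp_inf_via_cat_max(exp_inf, grad, beta2, eps):
    # Stack the two candidates along a new leading dim, then reduce over it.
    norm_buf = torch.cat(
        [exp_inf.mul(beta2).unsqueeze(0), grad.abs().add(eps).unsqueeze(0)],
        dim=0,
    )
    return torch.amax(norm_buf, dim=0)


def exp_inf_via_maximum(exp_inf, grad, beta2, eps):
    # A single elementwise op; no intermediate stacked buffer.
    return torch.maximum(exp_inf.mul(beta2), grad.abs().add(eps))


torch.manual_seed(0)
exp_inf = torch.rand(10)
grad = torch.randn(10)
beta2, eps = 0.999, 1e-8  # illustrative values, not taken from the PR

old = exp_inf_via_cat_max(exp_inf, grad, beta2, eps)
new = exp_inf_via_maximum(exp_inf, grad, beta2, eps)
print(torch.equal(old, new))
```

Both helpers compute the elementwise maximum of the decayed `exp_inf` and `|grad| + eps`; the second one avoids building the concatenated temporary.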

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@pytorch-bot

pytorch-bot bot commented Oct 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110345

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 61c685b with merge base 4e73eee (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jon-chuang jon-chuang marked this pull request as ready for review October 1, 2023 02:06
@jon-chuang jon-chuang changed the title feat(inductor): Improve Adamax implementation to be better fused by Inductor feat(inductor): Improve Adamax to be better fused by Inductor and enable Oct 1, 2023
@jon-chuang jon-chuang changed the title feat(inductor): Improve Adamax to be better fused by Inductor and enable feat(inductor): Improve Adamax to be better fused by Inductor and enable it Oct 1, 2023
out=exp_inf,
)
else:
norm_buf = torch.cat(
Collaborator Author

@jon-chuang jon-chuang Oct 1, 2023

torch.maximum didn't work for the differentiable case, so we fall back to the previous implementation there.

Error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [10]], which is output 0 of torch::autograd::CopyBackwards, is at version 2; expected version 1 instead.
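
For readers unfamiliar with this class of error, here is a minimal sketch that triggers the same kind of autograd failure. It is not the Adamax code path: the tensors and the choice of `exp` are assumptions made only to keep the reproduction small.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()   # autograd saves exp's output to compute its backward
y.mul_(2)     # the in-place update bumps the saved tensor's version counter

try:
    y.sum().backward()
    message = None
except RuntimeError as err:
    # Autograd detects that the saved tensor changed after it was recorded.
    message = str(err)

print(message)
```

Replacing the in-place `mul_` with an out-of-place `mul` lets the backward pass run in this toy case, which is why differentiable code paths tend to avoid in-place writes.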

Contributor

@vadimkantorov vadimkantorov Oct 1, 2023

Related issue on maximum: #54216

@jon-chuang jon-chuang changed the title feat(inductor): Improve Adamax to be better fused by Inductor and enable it feat(inductor): Improve Adamax to be better fused by Inductor and enable it Oct 1, 2023
Contributor

@janeyx99 janeyx99 left a comment

Do you have benchmark numbers on perf for eager foreach (switching differentiable=T/F) showing that this change is not regressing that?

Also, could you move the stylistic changes into another PR?

}

disabled_multi_tensor_opt_modules = {
adamax,
Contributor

Nice! Thanks for taking this on in general!

@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 3, 2023
pytorchmergebot pushed a commit that referenced this pull request Oct 5, 2023
… against list comprehensions (e.g. complex conversion) (#110613)

Fully fixes: #110506

Depends: #110607
Potential merge conflicts:
- #110339
- #110345
- #110454

Related:
- #110606 (we can apply the improvements here orthogonally to the complex support)

### Results

Benchmark: 100 params.

Breakdowns (float32, dynamo):
```
Adagrad: this PR: 4.4s, main: 8.8s
Adam: this PR: 2.1s, main: 9.8s
AdamW: this PR: 2.5s, main: 8.2s
ASGD: this PR: 3.1s, main: 8.5s
RMSprop: this PR: 1.3s, main: 4.2s
Rprop: this PR: 6.7s, main: 14.9s
```

Notes:
1. Adagrad is still slow due to the `_get_value` list comprehension. This can be fixed in https://github.com/pytorch/pytorch/pull/110339/files by utilizing the capturable path.
2. Adamax is not actually compiled (it is currently disabled).
3. Inductor compile time is quite variable. We calculate the dynamo time by subtracting the `call_user_compiler` timing from the `compile_inner` timing.
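
The subtraction in note 3 amounts to the following. This is a rough sketch: the `timings` dictionary and its numbers are hypothetical, standing in for per-phase timings read off the compile logs; only the phase names `compile_inner` and `call_user_compiler` come from the note.

```python
def dynamo_seconds(timings):
    # Dynamo time = total time in compile_inner minus the time spent in the
    # backend compiler (call_user_compiler), summed over all recorded calls.
    return sum(timings["compile_inner"]) - sum(timings["call_user_compiler"])


# Hypothetical per-call timings in seconds, for illustration only.
timings = {
    "compile_inner": [12.0, 8.0],
    "call_user_compiler": [10.5, 7.0],
}
print(dynamo_seconds(timings))
```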

<details>

This PR:
```
Adagrad (torch.float32): 28.47496461868286s
Adagrad (torch.complex64): 29.379547357559204s
Adam (torch.float32): 17.334211587905884s
Adam (torch.complex64): 29.637500524520874s
Adamax (torch.float32): 2.4749321937561035s
Adamax (torch.complex64): 3.1997995376586914s
AdamW (torch.float32): 18.06532859802246s
AdamW (torch.complex64): 28.25661015510559s
ASGD (torch.float32): 23.70255398750305s
ASGD (torch.complex64): 25.33756995201111s
RMSprop (torch.float32): 7.964028596878052s
RMSprop (torch.complex64): 12.909599781036377s
Rprop (torch.float32): 30.512362003326416s
Rprop (torch.complex64): 44.74405765533447s
```

Main:
```
Adagrad (torch.float32): 26.919506072998047s
Adagrad (torch.complex64): 35.190622091293335s
Adam (torch.float32): 25.715000867843628s
Adam (torch.complex64): 24.17716670036316s
Adamax (torch.float32): 2.4404726028442383s
Adamax (torch.complex64): 3.3538928031921387s
AdamW (torch.float32): 25.2022807598114s
AdamW (torch.complex64): 28.915700912475586s
ASGD (torch.float32): 24.108731985092163s
ASGD (torch.complex64): 26.589075088500977s
RMSprop (torch.float32): 10.781344175338745s
RMSprop (torch.complex64): 15.136352777481079s
Rprop (torch.float32): 42.46482181549072s
Rprop (torch.complex64): 48.28277635574341s
```

It seems that this doesn't help the complex case by much (but that's not the majority case). The torch.float32 results are generally positive; where they show no drastic improvement, or regress, it is due to Inductor variance (determined by manually inspecting the logs).

</details>

### Benchmark Script
```python
import torch
import time
from torch.optim import Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop

OPTIMS = [Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop]
DTYPES = [torch.float, torch.cfloat]

NUM_PARAMS = 100
kwargs = { "lr": 0.01, "foreach": True }
summary = []

for optim_cls in OPTIMS:
    for dtype in DTYPES:
        torch._dynamo.reset()
        # torch._inductor.metrics.reset()
        input = torch.ones([10, 10], dtype=dtype, device="cuda:0")
        model = torch.nn.Sequential(
            *[torch.nn.Linear(10, 10, dtype=dtype, device="cuda:0") for _ in range(NUM_PARAMS)]
        )

        model(input).sum().abs().backward()
        opt_compiled = optim_cls(model.parameters(), **kwargs)
        compiled_step = torch.compile(opt_compiled.step)

        with torch.set_grad_enabled(False):
            start_time = time.time()
            compiled_step()
            summary.append(f"{optim_cls.__name__} ({dtype}): {time.time() - start_time}s")

        print(optim_cls, kwargs, dtype, torch._dynamo.utils.compile_times())

for s in summary:
    print(s)
```

CC: @janeyx99 @mlazos
Pull Request resolved: #110613
Approved by: https://github.com/janeyx99
@mlazos
Contributor

mlazos commented Nov 6, 2023

@jon-chuang what's the status on this?

@jon-chuang
Collaborator Author

@mlazos will be tryna get this PR and a few other optim ones in shape this week.

@mlazos
Contributor

mlazos commented Nov 7, 2023

> @mlazos will be tryna get this PR and a few other optim ones in shape this week.

Cool, let me know if you need help. It will be great to add another optimizer to our collection of compiled optimizers.

Contributor

@janeyx99 janeyx99 left a comment

Hey @jon-chuang, sorry for the late review here. This PR is close, but could you add testing for the capturable component in https://github.com/pytorch/pytorch/blob/main/test/test_cuda.py#L3109 and in the OptimizerInfos that I introduced in common_optimizers.py? This review applies to adagrad as well :)

@mlazos
Contributor

mlazos commented Jan 19, 2024

closing in favor of #117835

@mlazos mlazos closed this Jan 19, 2024
pytorchmergebot pushed a commit that referenced this pull request Jan 20, 2024

Labels

ciflow/inductor module: dynamo module: inductor open source release notes: optim triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

6 participants