
Conversation

janeyx99 (Contributor) commented Jul 31, 2024

This PR adds a foreach implementation for Adafactor, with the understanding that there are still many ways to improve its runtime performance today (by adding more foreach support). After this PR:

  • We have a foreach flag for Adafactor (see the usage sketch below).
  • It is NOT the default. Why not? It is only slightly faster and uses O(n) more memory, where n is the number of params in your largest param group, and people tend to use Adafactor for memory efficiency.
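A minimal usage sketch (hedged: this assumes the optimizer is exposed as `torch.optim.Adafactor` and accepts the new `foreach` flag; the model and shapes are illustrative):

```python
import torch

# Opt into the foreach implementation explicitly, since it is not the
# default for Adafactor (see the memory tradeoff above).
model = torch.nn.Linear(128, 64)
optimizer = torch.optim.Adafactor(model.parameters(), foreach=True)

loss = model(torch.randn(8, 128)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```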

Next steps:

  • make torch.compile work on it
  • make it faster (by adding more foreach APIs)

Stack from ghstack (oldest at bottom):


pytorch-bot bot commented Jul 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132336

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 97797b5 with merge base 61625a1:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

janeyx99 added a commit that referenced this pull request Jul 31, 2024
ghstack-source-id: 1888a78
Pull Request resolved: #132336
janeyx99 marked this pull request as ready for review August 7, 2024 22:11
janeyx99 requested a review from albanD as a code owner August 7, 2024 22:11
janeyx99 added a commit that referenced this pull request Aug 7, 2024
ghstack-source-id: b8ee838
Pull Request resolved: #132336
Comment on lines 532 to 534
torch._foreach_mul_(device_row_vars, beta2_ts) # type: ignore[arg-type]
torch._foreach_mul_(row_means, one_minus_beta2_ts)
torch._foreach_add_(device_row_vars, row_means) # type: ignore[arg-type]
janeyx99 (Contributor Author)

In the future this would be a single

torch._foreach_lerp_(device_row_vars, row_means, one_minus_beta2_ts)

if we had ScalarList support for _foreach_lerp_'s 3rd arg.
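For reference, a minimal sketch of the identity this relies on (tensor names and shapes are illustrative; a single Python float stands in for one entry of the ScalarList):

```python
import torch

# lerp(a, b, w) == a * (1 - w) + b * w, so with w = 1 - beta2 the
# mul_/mul_/add_ sequence above collapses to one lerp per tensor.
row_var = torch.rand(4, 1)    # stands in for one entry of device_row_vars
row_mean = torch.rand(4, 1)   # stands in for the matching row_means entry
beta2 = 0.99
w = 1.0 - beta2

three_op = row_var * beta2 + row_mean * w  # what the foreach mul/mul/add computes
one_op = torch.lerp(row_var, row_mean, w)  # what a ScalarList lerp would compute
torch.testing.assert_close(three_op, one_op)
```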

albanD (Collaborator)

Do you expect to do these before merging the PR?
Should these improvements be recapped in an issue?

Comment on lines 544 to 546
torch._foreach_mul_(device_col_vars, beta2_ts) # type: ignore[arg-type]
torch._foreach_mul_(col_means, one_minus_beta2_ts)
torch._foreach_add_(device_col_vars, col_means) # type: ignore[arg-type]
janeyx99 (Contributor Author)

torch._foreach_lerp_(device_col_vars, col_means, one_minus_beta2_ts)

Comment on lines 566 to 568
torch._foreach_mul_(device_variances, beta2_ts) # type: ignore[arg-type]
torch._foreach_mul_(grads_squared, one_minus_beta2_ts)
torch._foreach_add_(device_variances, grads_squared) # type: ignore[arg-type]
janeyx99 (Contributor Author)

torch._foreach_lerp_(device_variances, grads_squared, one_minus_beta2_ts)

), "row_var and col_var should be defined when grad is multidimensional"
# same as (g * g).mean(dim=-1) without materializing an intermediate the size of g
row_means = [
torch.norm(grad, dim=-1, keepdim=True) for grad in device_grads
janeyx99 (Contributor Author)

There's no foreach norm support for this type of norm.
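A hedged sketch of the trick the inline comment refers to (illustrative; the square and divide presumably happen on lines elided from this snippet): the squared 2-norm along the last dim, divided by that dim's size, equals `(g * g).mean(dim=-1)` without ever materializing `g * g`:

```python
import torch

g = torch.rand(3, 5)  # stands in for one grad
via_norm = torch.norm(g, dim=-1, keepdim=True).square() / g.size(-1)
via_mean = (g * g).mean(dim=-1, keepdim=True)
torch.testing.assert_close(via_norm, via_mean)
```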

for row_var, col_var in zip(device_row_vars, device_col_vars)
]
row_var_means = [
row_var.mean(dim=-2, keepdim=True) for row_var in device_row_vars # type: ignore[union-attr]
janeyx99 (Contributor Author)

There's no foreach mean.
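A sketch of the fallback pattern (tensors illustrative): with no `torch._foreach_mean`, the reduction runs as one kernel launch per tensor via a list comprehension.

```python
import torch

device_row_vars = [torch.rand(4, 1), torch.rand(7, 1)]

# One .mean() kernel per tensor today; a fused foreach mean does not exist.
row_var_means = [rv.mean(dim=-2, keepdim=True) for rv in device_row_vars]
```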

del col_means

var_estimates = [
row_var @ col_var # type: ignore[operator]
janeyx99 (Contributor Author)

no foreach mm lol, probably the bulk of the work
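A hedged sketch of the shape of this step (sizes illustrative): each `row_var` is (r, 1) and each `col_var` is (1, c), so the per-tensor matmul is an outer product, one kernel launch per parameter, since there is no `torch._foreach_mm` to batch them:

```python
import torch

device_row_vars = [torch.rand(4, 1), torch.rand(7, 1)]
device_col_vars = [torch.rand(1, 5), torch.rand(1, 3)]

var_estimates = [
    row_var @ col_var  # (r, 1) @ (1, c) -> (r, c), one launch each
    for row_var, col_var in zip(device_row_vars, device_col_vars)
]
assert var_estimates[0].shape == (4, 5)
```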

Comment on lines +574 to +575
torch._foreach_sqrt_(var_estimates)
torch._foreach_reciprocal_(var_estimates)
janeyx99 (Contributor Author)

This would benefit from a foreach_rsqrt.
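A hedged sketch of the two-pass pattern (tensors illustrative): sqrt then reciprocal each walk the tensor list once; a fused `torch._foreach_rsqrt_` (hypothetical today) would halve the passes:

```python
import torch

var_estimates = [torch.rand(4, 5) + 0.1, torch.rand(3, 3) + 0.1]
expected = [v.rsqrt() for v in var_estimates]  # per-tensor rsqrt does exist

torch._foreach_sqrt_(var_estimates)        # pass 1
torch._foreach_reciprocal_(var_estimates)  # pass 2
torch.testing.assert_close(var_estimates, expected)
```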

for a, update in zip(alphas, updates)
]
torch._foreach_mul_(updates, alphas)
torch._foreach_add_(device_params, updates) # type: ignore[arg-type]
janeyx99 (Contributor Author)

It would be nice to have a foreach_add where the alphas could be a ScalarList.
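A hedged sketch of the wish (names illustrative): today the step needs a `_foreach_mul_` by the per-tensor `alphas` followed by a `_foreach_add_`; a ScalarList-valued `alpha=` (hypothetical today) would fuse it:

```python
import torch

device_params = [torch.rand(3), torch.rand(2)]
updates = [torch.rand(3), torch.rand(2)]
alphas = [-0.01, -0.02]  # illustrative per-tensor step sizes

# Today: two foreach passes, and the mul clobbers `updates`.
torch._foreach_mul_(updates, alphas)
torch._foreach_add_(device_params, updates)

# Hypothetical fused form, if alpha accepted a ScalarList:
# torch._foreach_add_(device_params, updates, alpha=alphas)
```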

albanD (Collaborator) left a comment

Sounds pretty good!
Curious what's the plan for all the future improvements?

device_state_steps, torch.tensor(1.0, device="cpu"), alpha=1.0 # type: ignore[arg-type]
)
else:
torch._foreach_add_(device_state_steps, 1) # type: ignore[arg-type]
albanD (Collaborator)

Suggested change:
- torch._foreach_add_(device_state_steps, 1) # type: ignore[arg-type]
+ torch._foreach_add_(device_state_steps, 1.) # type: ignore[arg-type]

?


janeyx99 (Contributor Author)

@albanD I'm planning to encapsulate all the action items in an issue before landing this PR, including perf wins, compile support, etc.

janeyx99 (Contributor Author)

Perf tracker with all issues: #133367

janeyx99 added a commit that referenced this pull request Aug 14, 2024
ghstack-source-id: b1c1eed
Pull Request resolved: #132336
janeyx99 added the ciflow/trunk and topic: performance labels Aug 14, 2024
albanD (Collaborator) left a comment

Sounds good!

janeyx99 (Contributor Author)

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
