[docs] Warn that GradScaler can scale under 1 #101569
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101569
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit a9ed77c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
docs/source/amp.rst
Outdated
AMP/fp16 may not be for every model! For example, most bf16-pretrained models cannot operate in
the fp16 numerical range of max 65k and will cause gradients to overflow instead of underflow. In
this case, the scale factor may decrease under 1 as an attempt to bring gradients to a number
representable in fp16. While one may expect the scale to always be above 1, our GradScaler does
NOT make this guarantee to maintain performance. If you encounter NaNs in your loss or gradients
when running with AMP or fp16, verify your model is compatible.
Suggested change:
- AMP/fp16 may not be for every model! For example, most bf16-pretrained models cannot operate in
- the fp16 numerical range of max 65k and will cause gradients to overflow instead of underflow. In
- this case, the scale factor may decrease under 1 as an attempt to bring gradients to a number
- representable in fp16. While one may expect the scale to always be above 1, our GradScaler does
- NOT make this guarantee to maintain performance. If you encounter NaNs in your loss or gradients
- when running with AMP or fp16, verify your model is compatible.
+ AMP/fp16 may not work for every model! For example, most bf16-pretrained models cannot operate in
+ the fp16 numerical range of max 64k and will cause gradients to overflow instead of underflow. In
+ this case, the scale factor may decrease under 1 as an attempt to bring gradients to a number
+ representable in fp16 dynamic range. While one may expect the scale to always be above 1, our GradScaler does
+ NOT make this guarantee to maintain performance. If you encounter NaNs in your loss or gradients
+ when running with AMP/fp16, verify your model is compatible.
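For context, here is a minimal sketch of the usage pattern this warning is about (assumes a CUDA device; the model and data are toy placeholders, not part of the PR). The scale is shrunk by `update()` whenever infs/NaNs are found, and nothing stops it from dropping below 1:

```python
import torch

# Minimal AMP/fp16 training-loop sketch (assumes a CUDA device is available);
# the model and data here are toy placeholders.
device = "cuda"
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randint(0, 4, (8,), device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # step is skipped if infs/NaNs are found
    scaler.update()                # may shrink the scale, possibly below 1
```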
Thanks for the comments! I thought the max for fp16 is 65.5k or something, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The maximum representable value is (2 − 2^−10) × 2^15 = 65504
which is 64K (65504/1024)
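A quick sanity check (illustrative snippet, not part of the PR) reproduces that number and matches what PyTorch reports for float16:

```python
import torch

# float16 has 10 explicit mantissa bits and a maximum exponent of 15,
# so the largest finite value is (2 - 2**-10) * 2**15 = 65504.
print((2 - 2**-10) * 2**15)            # 65504.0
print(torch.finfo(torch.float16).max)  # 65504.0
```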
Ah, I see the distinction was regarding the K vs k. I'm just going to use the actual number for maximal clarity.
torch/cuda/amp/grad_scaler.py
Outdated
For performance reasons, the scale factor is not guaranteed to be above 1. If the
scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something
is likely wrong. For example, bf16-pretrained models are often incompatible with
AMP/fp16 due to differing numerical ranges.
Suggested change:
- For performance reasons, the scale factor is not guaranteed to be above 1. If the
- scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something
- is likely wrong. For example, bf16-pretrained models are often incompatible with
- AMP/fp16 due to differing numerical ranges.
+ For performance reasons, the scale factor is not guaranteed to be above 1. If the
+ scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something
+ is likely wrong. For example, bf16-pretrained models are often incompatible with
+ AMP/fp16 due to differing dynamic ranges.
why is this a performance reason? I'd call it "numerical stability reasons", no?
Because adding a check would incur a device sync per step call, and device syncs are expensive
This is the current reason why we don't, but I intentionally did not mark this PR as one that would "fix" the issue, since I'd like to leave that issue open for more thoughts from the community if it comes up more often
oh, I understand what you meant now.
but if this is not synced, how will a user know that it fell below 1? I'm not suggesting we add the overhead, just trying to understand the explanation.
Oh, like if they check the scale directly as a part of debugging, though I think people would notice the NaNs first
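For instance, something like the following debugging check (illustrative only, not part of the PR) would surface a scale below 1, at the cost of the sync discussed above:

```python
import torch

# Debugging sketch: inspect the scale after scaler.update() in the training
# loop. GradScaler.get_scale() copies the scale back to the CPU, which forces
# a device sync -- acceptable for occasional debugging, costly every step.
scaler = torch.cuda.amp.GradScaler()
# ... scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update() ...
if scaler.get_scale() < 1.0:
    print("Scale fell below 1: gradients are likely overflowing in fp16.")
```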
Completes action item 1 in #99640 [ghstack-poisoned]
torch/cuda/amp/grad_scaler.py
Outdated
been invoked for all optimizers used this iteration.
.. warning::
    For performance reasons, the scale factor is not guaranteed to be above 1. If the
Add here that, for performance reasons, we do not check the scale factor's value (to avoid a synchronization), so it is not guaranteed to be above 1.
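To illustrate the point (a hypothetical sketch, not GradScaler's actual code): the scale lives on the device, so validating it on the host requires reading it back, which blocks until the GPU catches up.

```python
import torch

# Hypothetical check, not actual GradScaler code (assumes a CUDA device).
# The scale is a GPU tensor; .item() copies it to the CPU and therefore
# blocks until all queued GPU work finishes -- a per-step device sync.
scale = torch.full((), 2.0 ** 16, device="cuda")
if scale.item() < 1.0:  # .item() forces the CPU-GPU synchronization
    raise RuntimeError("scale fell below 1")
```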
Completes action item 1 in #99640 [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Completes action item 1 in #99640
Pull Request resolved: #101569
Approved by: https://github.com/ngimel
Completes action item 1 in #99640
Stack from ghstack (oldest at bottom):