[docs] Warn that GradScaler can scale under 1 #101569
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101569
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit a9ed77c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
docs/source/amp.rst
Outdated
AMP/fp16 may not be for every model! For example, most bf16-pretrained models cannot operate in
the fp16 numerical range of max 65k and will cause gradients to overflow instead of underflow. In
this case, the scale factor may decrease under 1 as an attempt to bring gradients to a number
representable in fp16. While one may expect the scale to always be above 1, our GradScaler does
NOT make this guarantee to maintain performance. If you encounter NaNs in your loss or gradients
when running with AMP or fp16, verify your model is compatible.
Suggested change:
- AMP/fp16 may not be for every model! For example, most bf16-pretrained models cannot operate in
- the fp16 numerical range of max 65k and will cause gradients to overflow instead of underflow. In
- this case, the scale factor may decrease under 1 as an attempt to bring gradients to a number
- representable in fp16. While one may expect the scale to always be above 1, our GradScaler does
- NOT make this guarantee to maintain performance. If you encounter NaNs in your loss or gradients
- when running with AMP or fp16, verify your model is compatible.
+ AMP/fp16 may not work for every model! For example, most bf16-pretrained models cannot operate in
+ the fp16 numerical range of max 64k and will cause gradients to overflow instead of underflow. In
+ this case, the scale factor may decrease under 1 as an attempt to bring gradients to a number
+ representable in fp16 dynamic range. While one may expect the scale to always be above 1, our GradScaler does
+ NOT make this guarantee to maintain performance. If you encounter NaNs in your loss or gradients
+ when running with AMP/fp16, verify your model is compatible.
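For context, here is a minimal sketch of the usage pattern this warning is about (assumes a CUDA device; the model and data are toy placeholders, not part of the PR). The scale is shrunk by `update()` whenever infs/NaNs are found, and nothing stops it from dropping below 1:

```python
import torch

# Minimal AMP/fp16 training-loop sketch (assumes a CUDA device is available);
# the model and data here are toy placeholders.
device = "cuda"
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randint(0, 4, (8,), device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # step is skipped if infs/NaNs are found
    scaler.update()                # may shrink the scale, possibly below 1
```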
Thanks for the comments! I thought the max for fp16 is 65.5k or something, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The maximum representable value is (2 − 2^−10) × 2^15 = 65504
which is 64K (65504/1024)
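A quick sanity check (illustrative snippet, not part of the PR) reproduces that number and matches what PyTorch reports for float16:

```python
import torch

# float16 has 10 explicit mantissa bits and a maximum exponent of 15,
# so the largest finite value is (2 - 2**-10) * 2**15 = 65504.
print((2 - 2**-10) * 2**15)            # 65504.0
print(torch.finfo(torch.float16).max)  # 65504.0
```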
Ah, I see the distinction was regarding the K vs k. I'm just going to use the actual number for maximal clarity.
torch/cuda/amp/grad_scaler.py
Outdated
For performance reasons, the scale factor is not guaranteed to be above 1. If the
scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something
is likely wrong. For example, bf16-pretrained models are often incompatible with
AMP/fp16 due to differing numerical ranges.
Suggested change:
- For performance reasons, the scale factor is not guaranteed to be above 1. If the
- scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something
- is likely wrong. For example, bf16-pretrained models are often incompatible with
- AMP/fp16 due to differing numerical ranges.
+ For performance reasons, the scale factor is not guaranteed to be above 1. If the
+ scale falls below 1 and/or you are seeing NaNs in your gradients or loss, something
+ is likely wrong. For example, bf16-pretrained models are often incompatible with
+ AMP/fp16 due to differing dynamic ranges.
why is this a performance reason? I'd call it "numerical stability reasons", no?
Because adding a check would incur a device sync per step call, and device syncs are expensive
This is the current reason why we don't, but I intentionally did not mark this PR as one that would "fix" the issue, since I'd like to leave that issue open for more thoughts from the community if it comes up more often
oh, I understand what you meant now.
but if this is not synced, how will a user know that it fell below 1? I'm not suggesting we add the overhead, just trying to understand the explanation.
Oh, like if they check the scale directly as a part of debugging, though I think people would notice the NaNs first
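For instance, something like the following debugging check (illustrative only, not part of the PR) would surface a scale below 1, at the cost of the sync discussed above:

```python
import torch

# Debugging sketch: inspect the scale after scaler.update() in the training
# loop. GradScaler.get_scale() copies the scale back to the CPU, which forces
# a device sync -- acceptable for occasional debugging, costly every step.
scaler = torch.cuda.amp.GradScaler()
# ... scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update() ...
if scaler.get_scale() < 1.0:
    print("Scale fell below 1: gradients are likely overflowing in fp16.")
```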
Completes action item 1 in #99640 [ghstack-poisoned]
torch/cuda/amp/grad_scaler.py
Outdated
been invoked for all optimizers used this iteration.
.. warning::
    For performance reasons, the scale factor is not guaranteed to be above 1. If the
Add here that, for performance reasons, we do not check the scale factor's value (to avoid a synchronization), so it is not guaranteed to be above 1.
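To illustrate the point (a hypothetical sketch, not GradScaler's actual code): the scale lives on the device, so validating it on the host requires reading it back, which blocks until the GPU catches up.

```python
import torch

# Hypothetical check, not actual GradScaler code (assumes a CUDA device).
# The scale is a GPU tensor; .item() copies it to the CPU and therefore
# blocks until all queued GPU work finishes -- a per-step device sync.
scale = torch.full((), 2.0 ** 16, device="cuda")
if scale.item() < 1.0:  # .item() forces the CPU-GPU synchronization
    raise RuntimeError("scale fell below 1")
```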
Completes action item 1 in #99640 [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Completes action item 1 in #99640
Pull Request resolved: #101569
Approved by: https://github.com/ngimel
Completes action item 1 in #99640
Stack from ghstack (oldest at bottom):