[SyncBatchNorm] Support running with low precision parameters #98332
Conversation
thanks for the fix!
…ded" This PR fixes #96203. **Details** When using `nn.SyncBatchNorm` with the model converted to FP16, there is a dtype discrepancy in the `SyncBatchNorm.forward()` causing an error like: ``` File "/.../pytorch/torch/nn/modules/_functions.py", line 91, in forward mean, invstd = torch.batch_norm_gather_stats_with_counts( RuntimeError: Expected counts to have type Half but got Float ``` [`torch.batch_norm_gather_stats_with_counts()`](https://github.com/pytorch/pytorch/blob/fe9da29842a07a1f44d6b8c2a4c75053da9e84d0/torch/nn/modules/_functions.py#L88-L97) requires the `running_mean`, `running_var`, and `counts` to have the same dtype. However, when the model has been converted to FP16, only `running_mean` and `running_var` use FP16, while the `counts` are in FP32 due to [`mean` being in FP32](https://github.com/pytorch/pytorch/blob/fe9da29842a07a1f44d6b8c2a4c75053da9e84d0/torch/nn/modules/_functions.py#L25-L30). This PR resolves this by casting `counts` from FP32 to FP16 instead of the alternative to cast `mean` and `invstd` from FP32 to FP16. Moreover, for the backward, this PR casts `weight` from FP16 to FP32 to match the dtype of `mean` and `invstd` as required by `torch.batch_norm_backward_elemt()` instead of the alternative to cast `mean` and `invstd` from FP32 to FP16. **Test Plan** I dug up this run command from 2021: ``` WORLD_SIZE=2 BACKEND=nccl python -m pytest test/distributed/test_distributed_spawn.py -k test_DistributedDataParallel_SyncBatchNorm_half -vs ``` [ghstack-poisoned]
ghstack-source-id: e8a775d674c7d1c0f17b9b2348c9ca6184a59d54 Pull Request resolved: #98332
running_mean.dtype
if needed…ers" This PR fixes #96203. **Details** When using `nn.SyncBatchNorm` with the model converted to FP16, there is a dtype discrepancy in the `SyncBatchNorm.forward()` causing an error like: ``` File "/.../pytorch/torch/nn/modules/_functions.py", line 91, in forward mean, invstd = torch.batch_norm_gather_stats_with_counts( RuntimeError: Expected counts to have type Half but got Float ``` [`torch.batch_norm_gather_stats_with_counts()`](https://github.com/pytorch/pytorch/blob/fe9da29842a07a1f44d6b8c2a4c75053da9e84d0/torch/nn/modules/_functions.py#L88-L97) requires the `running_mean`, `running_var`, and `counts` to have the same dtype. However, when the model has been converted to FP16, only `running_mean` and `running_var` use FP16, while the `counts` are in FP32 due to [`mean` being in FP32](https://github.com/pytorch/pytorch/blob/fe9da29842a07a1f44d6b8c2a4c75053da9e84d0/torch/nn/modules/_functions.py#L25-L30). This PR resolves this by casting `counts` from FP32 to FP16 instead of the alternative to cast `mean` and `invstd` from FP32 to FP16. Moreover, for the backward, this PR casts `weight` from FP16 to FP32 to match the dtype of `mean` and `invstd` as required by `torch.batch_norm_backward_elemt()` instead of the alternative to cast `mean` and `invstd` from FP32 to FP16. **Test Plan** I dug up this run commands from 2021: For `world_size` in `{1,2}` and `backend` in `{nccl, gloo}`: ``` WORLD_SIZE=world_size BACKEND=backend python -m pytest test/distributed/test_distributed_spawn.py -k test_DistributedDataParallel_SyncBatchNorm_half -vs ``` [ghstack-poisoned]
ghstack-source-id: a804c303e0ebe15f4fb0680737c25992d2c44da8 Pull Request resolved: #98332
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR fixes #96203.

**Details**

When using `nn.SyncBatchNorm` with the model converted to FP16, there is a dtype discrepancy in `SyncBatchNorm.forward()` that causes an error like:

```
File "/.../pytorch/torch/nn/modules/_functions.py", line 91, in forward
    mean, invstd = torch.batch_norm_gather_stats_with_counts(
RuntimeError: Expected counts to have type Half but got Float
```

[`torch.batch_norm_gather_stats_with_counts()`](https://github.com/pytorch/pytorch/blob/fe9da29842a07a1f44d6b8c2a4c75053da9e84d0/torch/nn/modules/_functions.py#L88-L97) requires `running_mean`, `running_var`, and `counts` to have the same dtype. However, when the model has been converted to FP16, only `running_mean` and `running_var` are FP16, while `counts` is FP32 because [`mean` is computed in FP32](https://github.com/pytorch/pytorch/blob/fe9da29842a07a1f44d6b8c2a4c75053da9e84d0/torch/nn/modules/_functions.py#L25-L30). This PR resolves the mismatch by casting `counts` from FP32 to FP16, rather than casting `mean` and `invstd` from FP32 to FP16.

Moreover, for the backward pass, this PR casts `weight` from FP16 to FP32 to match the dtype of `mean` and `invstd`, as required by `torch.batch_norm_backward_elemt()`, rather than casting `mean` and `invstd` from FP32 to FP16.

**Test Plan**

I dug up this run command from 2021. For `world_size` in `{1, 2}` and `backend` in `{nccl, gloo}`:

```
WORLD_SIZE=world_size BACKEND=backend python -m pytest test/distributed/test_distributed_spawn.py -k test_DistributedDataParallel_SyncBatchNorm_half -vs
```

Pull Request resolved: #98332
Approved by: https://github.com/rohan-varma
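To make the dtype handling concrete, here is a minimal illustrative sketch (not the actual patch) of the casts described above, using stand-in tensors; the real logic lives in `torch/nn/modules/_functions.py`, and the shapes and values here are hypothetical.

```python
# Illustrative sketch only, not the actual patch: the dtype alignment described
# above, shown with stand-in tensors (shapes and values are hypothetical).
import torch

# After model.half(), the BN parameters/buffers are FP16 ...
running_mean = torch.zeros(8, dtype=torch.float16)
weight = torch.ones(8, dtype=torch.float16)

# ... but the locally computed statistics and per-rank counts stay FP32.
mean = torch.zeros(8, dtype=torch.float32)
invstd = torch.ones(8, dtype=torch.float32)
count_all = torch.full((2,), 32.0, dtype=torch.float32)  # one count per rank

# Forward: cast counts to the running-stat dtype so that
# torch.batch_norm_gather_stats_with_counts() sees matching dtypes.
counts = count_all.to(running_mean.dtype)  # FP32 -> FP16 when params are FP16

# Backward: cast weight up to FP32 to match mean/invstd, as required by
# torch.batch_norm_backward_elemt().
weight_fp32 = weight.to(torch.float32)
```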
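For completeness, a standalone repro sketch of the failing configuration from #96203 (FP16 `nn.SyncBatchNorm` in a multi-process group). This is an assumption-laden sketch, not part of the PR's test plan: it assumes a machine with at least two CUDA GPUs and NCCL, and the layer size, input shape, and port are arbitrary choices.

```python
# Hedged repro sketch: FP16 SyncBatchNorm under a 2-rank NCCL group
# (assumes >= 2 CUDA GPUs; layer size, input shape, and port are arbitrary).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    bn = nn.SyncBatchNorm(8).cuda().half()  # FP16 parameters and buffers
    x = torch.randn(4, 8, 16, 16, device="cuda", dtype=torch.float16, requires_grad=True)
    out = bn(x)           # before this fix: "Expected counts to have type Half but got Float"
    out.sum().backward()  # backward needs weight cast to FP32 to match mean/invstd

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```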
Stack from ghstack (oldest at bottom):

- `requires_grad_mask` #98299
- `_use_sharded_views()` for `SHARD_GRAD_OP` #98250
- `requires_grad` for `use_orig_params=True` #98221