
Fix SyncBatchNorm usage without stats tracking #50126

Conversation

@malfet (Contributor) commented Jan 6, 2021

- In `batch_norm_gather_stats_with_counts_cuda`, use `input.scalar_type()` if `running_mean` is not defined
- In `SyncBatchNorm`'s forward function, create the count tensor with `torch.float32` type if `running_mean` is `None`
- Fix a few typos

Test Plan:

python -c "import torch; print(torch.batch_norm_gather_stats_with_counts(torch.randn(1, 3, 3, 3, device='cuda'), mean=torch.ones(2, 3, device='cuda'), invstd=torch.ones(2, 3, device='cuda'), running_mean=None, running_var=None, momentum=0.1, eps=1e-5, counts=torch.ones(2, device='cuda')))"

Fixes #49730
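As a rough illustration (not the PyTorch source), the dtype rule this PR applies to the `counts` tensor can be sketched as follows; the function names here are hypothetical, chosen only to mirror the two code paths the description mentions:

```python
# Hedged sketch of the counts-dtype rule described in this PR.
# Neither function exists in PyTorch; they model the two fallbacks.

def counts_dtype_cuda_kernel(running_mean_dtype, input_dtype):
    """Models batch_norm_gather_stats_with_counts_cuda: use
    running_mean's dtype, or fall back to input.scalar_type()
    when running_mean is not defined."""
    return running_mean_dtype if running_mean_dtype is not None else input_dtype

def counts_dtype_python_wrapper(running_mean_dtype):
    """Models SyncBatchNorm's forward: create the count tensor
    as torch.float32 when running_mean is None."""
    return running_mean_dtype if running_mean_dtype is not None else "Float"

# With stats tracking, both paths agree on running_mean's dtype:
assert counts_dtype_cuda_kernel("Float", "Half") == "Float"
# Without stats tracking and a float32 input (the Test Plan case),
# the two fallbacks also agree:
assert counts_dtype_python_wrapper(None) == counts_dtype_cuda_kernel(None, "Float")
```

Note that the Test Plan above exercises exactly the `running_mean=None` path with a float32 input, where the two fallbacks coincide.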

@facebook-github-bot (Contributor) commented Jan 6, 2021

💊 CI failures summary and remediations

As of commit 4e5308c (more details on the Dr. CI page):


  • 1/2 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 1/2 broken upstream at merge base c517e15 on Jan 06 from 7:16am to 1:33pm

1 job timed out:

  • pytorch_linux_bionic_py3_8_gcc9_coverage_test1

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.



@facebook-github-bot (Contributor) left a comment

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

aten/src/ATen/native/cuda/Normalization.cu (review thread resolved)
torch/nn/modules/_functions.py (outdated; review thread resolved)
@ngimel (Collaborator) commented Jan 6, 2021

cc @jjsjann123 FYI. Is it true that the counts type should always match the mean/running_mean type, and that the mean and running_mean types should be the same whenever running_mean is defined?

@malfet force-pushed the malfet/fix-SyncBatchNorm-without-stats-tracking branch from 908e570 to 2d29fd1 on January 6, 2021 at 15:55
@facebook-github-bot (Contributor) left a comment

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@codecov bot commented Jan 6, 2021

Codecov Report

Merging #50126 (2d29fd1) into master (2ac180a) will increase coverage by 10.43%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master   #50126       +/-   ##
===========================================
+ Coverage   70.25%   80.68%   +10.43%     
===========================================
  Files        1900     1900               
  Lines      206246   206246               
===========================================
+ Hits       144894   166408    +21514     
+ Misses      61352    39838    -21514     

@facebook-github-bot (Contributor) left a comment

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet merged this pull request in bf4fcab.

hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 14, 2021
Summary:
In `batch_norm_gather_stats_with_counts_cuda` use `input.scalar_type()` if `running_mean` is not defined
In `SyncBatchNorm` forward function create count tensor with `torch.float32` type if `running_mean` is None
Fix a few typos

Pull Request resolved: pytorch#50126

Test Plan:
```
python -c "import torch; print(torch.batch_norm_gather_stats_with_counts(torch.randn(1, 3, 3, 3, device='cuda'), mean=torch.ones(2, 3, device='cuda'), invstd=torch.ones(2, 3, device='cuda'), running_mean=None, running_var=None, momentum=0.1, eps=1e-5, counts=torch.ones(2, device='cuda')))"
```

Fixes pytorch#49730

Reviewed By: ngimel

Differential Revision: D25797930

Pulled By: malfet

fbshipit-source-id: 22a91e3969b5e9bbb7969d9cc70b45013a42fe83
@rangwani-harsh commented:

Hi @malfet, @ngimel: it seems this still fails with track_running_stats=False when doing mixed-precision training (with DistributedDataParallel).

Version details: torch==1.10.1, CUDA 11.3

With track_running_stats=False, I get the following stack trace:

  File "/home/auk/miniconda3/envs/torch1.10/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/auk/Naman/PyTorch-StudioGAN/src/utils/model_ops.py", line 130, in forward
    out = self.bn(x)
  File "/home/auk/miniconda3/envs/torch1.10/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/auk/miniconda3/envs/torch1.10/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 749, in forward
    return sync_batch_norm.apply(
  File "/home/auk/miniconda3/envs/torch1.10/lib/python3.9/site-packages/torch/nn/modules/_functions.py", line 59, in forward
    mean, invstd = torch.batch_norm_gather_stats_with_counts(
RuntimeError: Expected counts to have type Half but got Float

Can you please take a look (or should I create a new issue)?
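For what it's worth, the two fallbacks this PR describes (`input.scalar_type()` in the CUDA kernel, `torch.float32` in the Python wrapper) disagree when the input is half precision, which would produce exactly the error in the trace above. A torch-free sketch of that mismatch, with hypothetical function names that merely model the check, not PyTorch internals:

```python
# Hedged sketch of how a Half/Float counts mismatch could arise under
# mixed precision with track_running_stats=False. Function names are
# illustrative only.

def expected_counts_dtype(running_mean_dtype, input_dtype):
    # Kernel side: counts must match running_mean's dtype, or the
    # input's dtype when running_mean is not defined.
    return running_mean_dtype if running_mean_dtype is not None else input_dtype

def check_counts(running_mean_dtype, input_dtype, counts_dtype):
    expected = expected_counts_dtype(running_mean_dtype, input_dtype)
    if counts_dtype != expected:
        raise RuntimeError(
            f"Expected counts to have type {expected} but got {counts_dtype}")

# track_running_stats=False under autocast: the input (and hence mean)
# is Half, but the wrapper created counts as float32.
try:
    check_counts(None, "Half", "Float")
except RuntimeError as e:
    print(e)  # prints: Expected counts to have type Half but got Float
```

Under this reading, the fp32 fallback in the Python wrapper and the input-dtype fallback in the kernel only agree for fp32 inputs, so a fresh issue for the mixed-precision case seems warranted.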

Successfully merging this pull request may close these issues:

Program throws exception when using SyncBatchNorm with track_running_stats = False
5 participants