[sync BN] #14267
Conversation
Summary: Added synchronized batch normalization, which allows synchronization of stats across mini-batches between processes within a process group. The current implementation uses a mixture of extended ATen native functions (C++/CUDA extension) and torch.nn.modules (c10d Python API). This is a WIP: 1. only supports GPU; 2. only supports a single GPU per process.
Master NCCL is broken. This PR requires #14244 to function.
This is the first phase of this PR; we want to have sync BN support in place first. My monkey Python tests run multiple processes and communications, and I used them for functional testing. As I cannot find an official upstream module doing similar things, feedback or hints would be greatly appreciated.
Pinging @ssnl for visibility.
Can I ask, at a very high level, what strategy is being used here to implement sync BN?
@weiyangfb If I understand the question correctly, the implementation here is to: 1. calculate stats (mean/var) for the local mini-batch; 2. all-reduce the stats across all processes to calculate the global mean/var; 3. apply element-wise normalization. The backward pass follows identical logic with slightly different arithmetic.
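At a very high level, a minimal sketch of those three steps (illustrative only, not the PR's ATen/CUDA kernels, which gather Welford statistics as discussed further down; the function name and the equal-local-batch-size assumption are mine):

import torch
import torch.distributed as dist

def sync_bn_forward_sketch(x, weight, bias, process_group, eps=1e-5):
    # 1. local per-channel statistics for this process's mini-batch (NCHW input)
    mean = x.mean(dim=[0, 2, 3])
    meansq = (x * x).mean(dim=[0, 2, 3])

    # 2. average the stats across the process group
    #    (exact only if every process has the same local batch size)
    world_size = dist.get_world_size(process_group)
    for stat in (mean, meansq):
        dist.all_reduce(stat, op=dist.ReduceOp.SUM, group=process_group)
        stat.div_(world_size)
    var = meansq - mean * mean

    # 3. element-wise normalization with the global mean/var
    x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)
    return x_hat * weight[None, :, None, None] + bias[None, :, None, None]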
fixing backwards path with process group; fixing linter issue
Many failed tests. Seems like I got a lemon commit in master. Will merge again later.
Thank you! Structure looks good to me. Kernels seem good too, although I didn't try to understand the details (e.g., math). I have some general questions, and this definitely needs some tests.
mean_l = [mean_all.narrow(0, i, 1) for i in range(world_size)]
invstd_l = [invstd_all.narrow(0, i, 1) for i in range(world_size)]
# using all_gather instead of all reduce so we can calculate mean/var in one go
torch.distributed.all_gather(mean_l, mean, process_group)
Would it be more efficient to coalesce the gathered tensors first and make only one all_gather call? Or is it because they may be of different precisions?
Fewer NCCL calls, so technically yes. Realistically speaking, the actual communication is tiny in the overall timeline. The synchronization might have some impact here, but since we have two consecutive all_gather/all_reduce calls, combining the two doesn't really help.
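For reference, coalescing would look roughly like the following (an illustrative sketch, not this PR's code; the helper name is hypothetical): pack the per-channel mean and invstd into one buffer so a single all_gather moves both.

import torch
import torch.distributed as dist

def gather_stats_coalesced(mean, invstd, process_group, world_size):
    packed = torch.stack([mean, invstd])                    # (2, C)
    out = [torch.empty_like(packed) for _ in range(world_size)]
    dist.all_gather(out, packed, group=process_group)       # one collective
    mean_all = torch.stack([t[0] for t in out])             # (world_size, C)
    invstd_all = torch.stack([t[1] for t in out])           # (world_size, C)
    return mean_all, invstd_all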
Yep. Looks like we are doing similar things here with batch_norm_update_stats (exposing batch_norm_update_stats in at::native). I need this for synchronization; IIUC, @apaszke did this to fuse the second-step batchnorm point-wise kernels with following point-wise ops like ReLU, etc. We'll need to resolve the conflicts before merging.
The global mean & variance can be computed with just one all_reduce, by all-reducing the per-process sums of x and x² and deriving the mean and variance from them.
This is the strategy other existing implementations (in mxnet, tensorflow, caffe2, MegDet) are using, and using AllReduce is supposed to be more communication-efficient than using AllGather.
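A sketch of that single-all_reduce variant (illustrative, not any of those implementations verbatim; the function name is hypothetical): packing sum(x), sum(x²) and the element count into one buffer keeps it to a single collective and stays exact even with unequal per-process batch sizes.

import torch
import torch.distributed as dist

def global_mean_var_one_allreduce(x, process_group):
    C = x.size(1)
    count = torch.tensor([x.numel() / C], device=x.device, dtype=x.dtype)
    # pack sum(x), sum(x^2) and the local element count into one buffer
    buf = torch.cat([x.sum(dim=[0, 2, 3]), (x * x).sum(dim=[0, 2, 3]), count])
    dist.all_reduce(buf, op=dist.ReduceOp.SUM, group=process_group)
    total_sum, total_sqsum, total_count = torch.split(buf, [C, C, 1])
    mean = total_sum / total_count
    var = total_sqsum / total_count - mean * mean    # E[x^2] - E[x]^2
    return mean, var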
I use Welford to calculate mean/var in a single pass as well. Welford has better numerical characteristics, which is desired. The all_gather called here only gathers one set of intermediate mean/m2n values per process; for a reasonable cluster size there shouldn't be much difference between all_gather and all_reduce.
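For context, merging the gathered per-process Welford accumulators uses the standard parallel (Chan et al.) combine step, roughly as follows (a sketch under my own naming, not the PR's CUDA kernel):

# Merge two Welford accumulators: mean, m2 (sum of squared deviations), n (count).
# Folding the gathered per-process triples together pairwise like this gives the
# same global mean/var as a single pass, with better numerical behaviour than the
# naive sum(x)/sum(x^2) formulation.
def welford_merge(mean_a, m2_a, n_a, mean_b, m2_b, n_b):
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return mean, m2, n

# the global (biased) variance is then m2 / n, and invstd = 1 / sqrt(var + eps)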
1. fallback to batch_norm when sync is not required; 2. in-place operators to save memory; 3. switch from narrow to unbind.
Timed out in the c10d test. Saw a similar test failure on another PR: #15540. Anything I should be concerned about?
@apaszke I'm repeating Carilli's question regarding the last review comments.
1. added a check so that SyncBatchNorm is only supported for single-GPU-per-process runs with DistributedDataParallel; 2. added a utility function to convert BatchNorm layers in a module to SyncBatchNorm layers
@apaszke Added a check during DDP initialization so that SyncBatchNorm is only supported with a single GPU per process under a DDP launch.
cc: @mrshenli @teng-li and @pietern for review.
Launching with
There is a potential for deadlock with the current calls to allreduce, as they execute in parallel with autograd, and therefore with the primary DDP reduction hooks.
mean_dy.div_(world_size)
torch.distributed.all_reduce(
    mean_dy_xmu, torch.distributed.ReduceOp.SUM, process_group)
mean_dy_xmu.div_(world_size)
These allreduce calls can interfere with the ones kicked off by DDP itself.
If autograd runs single-threaded with deterministic ordering you'll be fine, but as soon as it doesn't (e.g. there are multiple branches of the forward graph where the backward functions can be called in parallel), you'll run into deadlocks. This can be avoided by creating a new process group from the main one with new_group and using that throughout. Note that having multiple of these sync batch norm layers running backward in parallel can still deadlock, or worse, result in mixed-up data, so for guaranteed correctness you'll have to use a separate process group per sync batch norm layer. This is not ideal and we may need to find a different solution for this.
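Concretely, that workaround could look roughly like this (a sketch under assumed names; num_sync_bn_layers and the construction pattern are not this PR's code):

import torch.distributed as dist

# (assumes torch.distributed is already initialized)
# new_group() is itself a collective over the default group, so every rank must
# execute these calls, in the same order, before the model is built.
num_sync_bn_layers = 4   # assumed: how many SyncBatchNorm layers the model has
world = list(range(dist.get_world_size()))
sync_bn_groups = [dist.new_group(ranks=world) for _ in range(num_sync_bn_layers)]

# each sync batch norm layer then gets its own dedicated group, e.g.
# layer_i = SyncBatchNorm(num_features, process_group=sync_bn_groups[i]),
# so its collectives can never interleave with DDP's gradient allreduces.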
This is a gotcha for me. Thanks a lot for pointing out the issue. I don't fully understand how the allreduce call would cause trouble with branching while DDP handles that fine. Maybe I'll ask for more details about this in a private channel.
Just to reiterate: for the time being, it's a safe workaround as long as I have a separate process group per sync batch norm layer and use it for both the forward and backward pass.
I'll copy/create a process group inside the initializer of SyncBatchNorm.
Realized that duplicating process groups in the initializer of SyncBatchNorm would not work: new_group must be called by all processes in the main group, but inside the initializer or converter function each process only sees the process_group it was given for that layer.
I have this exact problem now!
My model has many SyncBNs in parallel (I also have an atypical module where I call autograd.grad() repeatedly on two tensors whose graph has SyncBNs). In short, the backward() has many all_reduce calls and some run in parallel. With DDP the code hangs without error, and with NCCL logs I can see that it is because all_reduce hangs.
I have tried torch native SyncBN, apex SyncBN, and @ppwwyyxx's NaiveSyncBN. This happens in both torch DDP and apex DDP (even with delay_allreduce=True).
Is there something I can do to fix it? @jjsjann123 @pietern
1. adding async_op for all_reduce calls 2. renaming variables 3. removing redundant code
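Presumably the async_op change lets the two backward allreduces overlap, along these lines (a sketch based on the snippet reviewed above, not the exact diff; the function name is mine):

import torch.distributed as dist

def reduce_backward_stats(mean_dy, mean_dy_xmu, process_group, world_size):
    # launch both collectives without blocking so they can overlap ...
    work1 = dist.all_reduce(mean_dy, dist.ReduceOp.SUM, process_group, async_op=True)
    work2 = dist.all_reduce(mean_dy_xmu, dist.ReduceOp.SUM, process_group, async_op=True)
    # ... and wait for both before the results are consumed
    work1.wait()
    work2.wait()
    mean_dy.div_(world_size)
    mean_dy_xmu.div_(world_size)
    return mean_dy, mean_dy_xmu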
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
We've decided we're going to go ahead and land this, and keep an eye on it for any problems that may occur later.
@pytorchbot rebase this please
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Failures look scary, should I be concerned?
Ah, the PR bitrotted. Just a moment please.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
- Added synchronized batch normalization; allows synchronization of stats across mini-batches between processes within a process group. The current implementation uses a mixture of extended ATen native functions (C++/CUDA extension) and torch.nn.modules (c10d Python API).
- User-facing API:
  1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)
  2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, process_group=None)
- Supported use case: DistributedDataParallel with single-GPU, multi-process.
  a. The user creates a model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:
     1. Use the layers directly: torch.nn.SyncBatchNorm(...), with a similar API to torch.nn.BatchNormXd(...) plus the added argument `process_group`, which limits the scope of synchronization to each process group. The default value is None, which implies synchronization across all GPUs.
     2. Use torch.nn.utils.convert_sync_batchnorm(modules, process_group) to recursively convert all `torch.nn.BatchNormXd` layers into `torch.nn.SyncBatchNorm`, preserving the values of parameters/buffers. The utility function also allows the user to set a process_group value for all converted layers.
  b. The user wraps their model with `torch.nn.parallel.DistributedDataParallel`; from this point on, the user should follow the general DDP usage guidelines.
- Error checking. For unsupported use cases, we error out:
  1. Application launched without DDP:
     > import torch
     > sbn = torch.nn.SyncBatchNorm(10).cuda()
     > inp = torch.randn(5, 10, 3, 3).cuda()
     > sbn(inp) --> Error!
     > AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
  2. Application launched using DDP with multiple GPUs per process:
     > ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
     > ValueError: SyncBatchNorm is only supported for DDP with single GPU per process

Pull Request resolved: pytorch/pytorch#14267
Differential Revision: D14270035
Pulled By: ezyang
fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d
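For illustration, a minimal end-to-end sketch of the supported use case above (single GPU per process). The launch assumptions, the toy model, and the environment-variable-based init are mine, and the converter is referenced under the name given in the summary:

import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    # one process per GPU, e.g. launched via `python -m torch.distributed.launch`,
    # which sets the env vars that init_process_group reads
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # ordinary model with BatchNorm layers ...
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
    # ... converted so the BN stats are synchronized across the (default) process group
    model = torch.nn.utils.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)

    # single GPU per process is the supported configuration
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    inp = torch.randn(8, 3, 32, 32).cuda(local_rank)
    model(inp).sum().backward()

if __name__ == "__main__":
    main()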
How does torch 1.1's SyncBN feature compare to the Nvidia apex library?
It's similar in functionality, but likely has better perf because it integrates directly with nn.BatchNorm.