Batch Normalization for Multi-GPU / Data Parallelism #7439
How does Torch handle multi-GPU batch normalization? Batch normalization on a multi-GPU batch incurs an extra performance penalty because statistics need to be communicated across all GPUs, so there are some performance questions to consider. You can aggregate statistics on the CPU, aggregate them by going around in a ring (along the lines of NVIDIA's NCCL all-reduce), or aggregate them with a tree reduction. You can also do a "pseudo-batch normalization": use the existing batch norm layer to normalize GPU-sized batches, and then add the batches together into a single "multi-GPU batch". I suspect there are easier ways to handle normalization of huge batches that don't introduce the performance hit you would see with batch normalization, like weight normalization -- https://arxiv.org/pdf/1602.07868.pdf |
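Since weight normalization comes up here as an alternative that avoids cross-GPU statistics entirely, a minimal sketch of the reparameterization from the linked paper (the helper name and initializers below are my own choices, not from this thread):

```python
import tensorflow as tf

# Weight normalization (arXiv:1602.07868): reparameterize w = g * v / ||v||,
# so there are no batch statistics that would need to cross GPU boundaries.
def weight_normed_dense(x, out_dim, name='wn_dense'):
    in_dim = int(x.get_shape()[-1])
    v = tf.get_variable(name + '_v', [in_dim, out_dim],
                        initializer=tf.random_normal_initializer(stddev=0.05))
    g = tf.get_variable(name + '_g', [out_dim],
                        initializer=tf.ones_initializer())
    # Normalize each column of v, then scale by the learned magnitude g.
    w = g * tf.nn.l2_normalize(v, axis=0)
    return tf.matmul(x, w)
```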
@kvrd18 As pointed out above, there are just too many ways to implement batch norm across GPUs. TensorFlow currently doesn't seem to provide a "default" implementation. @yaroslavvb My understanding is that most frameworks (including Caffe & Torch) don't aggregate statistics across GPUs at all. Different GPUs maintain statistics independently, and statistics from only one GPU are used at test time. The official InceptionV3 example in tensorflow/models also does something similar. |
I have the same issue, too. I believe distributed batch normalization is very important for some problems like action recognition. It would be very useful if TensorFlow could provide an implementation. Here is my trick for training BN on action recognition, which just rearranges the sample order so that every GPU gets a reasonable mean and variance. |
@ppwwyyxx @yaroslavvb As far as I know, shuffling the training data is not enough for tasks like action recognition, because one GPU can only handle 2 videos (each video has 16 frames), which is too small a batch size for calculating the mean and variance. That's the reason I shuffle the batch itself at each step. After going through the CNN, the batch is rearranged back to its original order to be fed into the LSTM. However, my trick hurts speed severely. |
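A rough sketch of what I understand this trick to be (the function and variable names are hypothetical, not from the actual code): shuffle the frames so each GPU's sub-batch mixes many videos, run the CNN, then invert the permutation before the LSTM:

```python
import tensorflow as tf

def shuffled_cnn_pass(frames, cnn_fn, num_gpus):
    n = tf.shape(frames)[0]
    perm = tf.random_shuffle(tf.range(n))
    inv_perm = tf.invert_permutation(perm)
    mixed = tf.gather(frames, perm)
    # Each GPU now sees frames from many videos, so its local BN
    # statistics are closer to the global ones.
    feats = []
    for i, split in enumerate(tf.split(mixed, num_gpus, axis=0)):
        with tf.device('/gpu:%d' % i):
            feats.append(cnn_fn(split))
    feats = tf.concat(feats, axis=0)
    # Restore the original frame order before feeding the LSTM.
    return tf.gather(feats, inv_perm)
```

The extra gathers and the serialization this forces are presumably where the speed penalty mentioned above comes from.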
@yaroslavvb, in Torch, the weight updates for each module in a replica are accumulated and summed together on the first replica. Owing to Torch's modular code base, this happens transparently with `nn.DataParallelTable`. To define a model in Torch, you would do this:

```lua
function makeConvNet()
    model = nn.Sequential()
    model:add(nn.SpatialConvolution(1,32,3,3))
    model:add(nn.SpatialBatchNormalization(32))
    model:add(nn.View(-1):setNumInputDims(3))
    return model
end
```

Here, to run it on multiple GPUs, you wrap it in `nn.DataParallelTable`:

```lua
-- CONSTRUCT MODEL:
conv_net = makeConvNet() -- i.e. create nn.Sequential() and fill it
net = nn.DataParallelTable(1) -- Split along first (batch) dimension
net:add(conv_net, {1, 2}) -- Use GPUs 1 and 2
-- TRAINING:
for i = 1, num_epochs do
    local output = net:forward(input)
    local err = criterion:forward(output, target)
    net:zeroGradParameters()
    local gradOutput = criterion:backward(output, target)
    local gradInput = net:backward(input, gradOutput)
    net:updateParameters(lr)
end
```
|
@kvrd18 What you described is the general case for most modules in Torch. However, for batch normalization, my best understanding is that Torch by default doesn't synchronize the mean/variance among GPUs; it synchronizes only the other two parameters (scaling and shifting). |
That makes it painful to train fully convolutional networks on multiple GPUs: they cannot afford the huge batch sizes that would alleviate the problems arising from not synchronizing the mean and variance among GPUs. |
It would be a good experiment to run -- compare the Torch approach vs. keeping the variance on GPU0 vs. keeping the variance on the CPU. I suspect that when your GPUs are P2P-connected, keeping the variables on GPU0 will be better (e.g., I found the CIFAR multi-GPU example runs 15% faster when weights are pinned to GPU0). |
@kvrd18 It could be an improvement to aggregate the statistics (before the actual normalization), instead of normalizing by each GPU's own statistics. This avoids the potential problem that the statistics of a small batch are too unstable. You can do this in TensorFlow, but it is going to be very expensive. Maybe Batch Renormalization is a better option in this case; it shows better performance on small batches. |
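For reference, a bare-bones sketch of the Batch Renormalization correction (arXiv:1702.03275); I've omitted the paper's clipping of r and d, and the helper below is mine, not something from TensorFlow:

```python
import tensorflow as tf

def batch_renorm(x, batch_mean, batch_var, moving_mean, moving_var, eps=1e-3):
    sigma_b = tf.sqrt(batch_var + eps)
    sigma = tf.sqrt(moving_var + eps)
    # r and d correct the batch statistics toward the moving statistics;
    # gradients do not flow through them (the paper also clips both).
    r = tf.stop_gradient(sigma_b / sigma)
    d = tf.stop_gradient((batch_mean - moving_mean) / sigma)
    x_hat = (x - batch_mean) / sigma_b
    return x_hat * r + d
```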
This question might be better asked on StackOverflow since it is not a clear feature request yet. However, if we define this as a feature for simple, easy-to-use multi-GPU batch normalization, it would be a great contribution. Marking this as contributions welcome as a result. Thanks! |
I've built a batch normalization layer for multi-GPU. It predicts well on the validation set only if it is run in training mode:

```python
def _variable_on_cpu(name, shape, initializer, trainable=True):
    # Keep the BN variables on the CPU so all GPU towers share them.
    with tf.device('/cpu:0'):
        dtype = tf.float32
        var = tf.get_variable(name, shape, initializer=initializer,
                              dtype=dtype, trainable=trainable)
    return var

def BatchNorm(inputs, is_training, decay=0.9, epsilon=1e-3):
    scale = _variable_on_cpu('scale', inputs.get_shape()[-1], tf.constant_initializer(1.0))
    beta = _variable_on_cpu('beta', inputs.get_shape()[-1], tf.constant_initializer(0.0))
    pop_mean = _variable_on_cpu('mean', inputs.get_shape()[-1], tf.constant_initializer(0.0), trainable=False)
    pop_var = _variable_on_cpu('variance', inputs.get_shape()[-1], tf.constant_initializer(1.0), trainable=False)
    axis = list(range(len(inputs.get_shape()) - 1))

    def Train(inputs, pop_mean, pop_var, scale, beta):
        # Normalize with the *local* batch statistics; only the moving
        # averages (pop_mean/pop_var) are shared across towers.
        batch_mean, batch_var = tf.nn.moments(inputs, axis)
        train_mean = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
        train_var = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
        with tf.control_dependencies([train_mean, train_var]):
            return tf.nn.batch_normalization(inputs, batch_mean, batch_var, beta, scale, epsilon)

    def Eval(inputs, pop_mean, pop_var, scale, beta):
        return tf.nn.batch_normalization(inputs, pop_mean, pop_var, beta, scale, epsilon)

    return tf.cond(is_training, lambda: Train(inputs, pop_mean, pop_var, scale, beta),
                   lambda: Eval(inputs, pop_mean, pop_var, scale, beta))
```

This is working well on multi-GPU / data parallelism as long as the module is in training mode. |
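A hypothetical usage sketch for the layer above inside a tower loop (`num_gpus`, `tower_inputs`, and the scope name are placeholders): variable reuse makes all towers share the CPU-resident `scale`/`beta` and moving statistics:

```python
import tensorflow as tf

is_training = tf.placeholder(tf.bool, [])
outputs = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        # reuse=True on all but the first tower shares the variables.
        with tf.variable_scope('bn_layer', reuse=(i > 0)):
            outputs.append(BatchNorm(tower_inputs[i], is_training))
```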
@kvrd18 For a multi-GPU batch norm, we have to sync the batch_mean and batch_var, not just the moving_mean and moving_var, so that every GPU will get batch_mean and batch_var which are close to the global mean and variance. |
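One way to do that syncing in TF1 is via `tf.contrib.nccl`; the sketch below is a guess at the wiring (the helper name and tower plumbing are mine): each tower contributes its local first and second moments and gets back the global ones:

```python
import tensorflow as tf
from tensorflow.contrib import nccl

def synced_moments(tower_inputs, axes):
    # Local first and second moments, one tensor per GPU.
    means = [tf.reduce_mean(x, axes) for x in tower_inputs]
    sq_means = [tf.reduce_mean(tf.square(x), axes) for x in tower_inputs]
    n = float(len(tower_inputs))
    # all_sum returns, on each device, the sum over all devices.
    global_means = [m / n for m in nccl.all_sum(means)]
    global_sqs = [s / n for s in nccl.all_sum(sq_means)]
    # Var[x] = E[x^2] - E[x]^2, now identical (up to fp error) on every GPU.
    return [(m, s - tf.square(m))
            for m, s in zip(global_means, global_sqs)]
```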
@shiyemin Each layer on each GPU will have its own `batch_mean` and `batch_var`. |
@shiyemin If I understood you right, I'd have to compute the `batch_mean` and `batch_var` across all GPUs myself. |
The inception example as pointed out in #7439 (comment) is a good enough solution for me now. Thanks, @ppwwyyxx. |
Hello, I am training the BatchNorm layer on multiple GPUs. However, I found that the function is used in several different ways:

1. Inside the per-GPU loop (as in cifar10_main)
2. Outside the per-GPU loop (as in cifar10_multi_gpu)
3. Both inside and outside the per-GPU loop (as in inception v3, cifar10)

What is the right way? In my opinion, it may be the third way. |
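Assuming the question is about `tf.layers.batch_normalization` and where to run its `tf.GraphKeys.UPDATE_OPS` (my reading of it), a common pattern is to collect the update ops from a single tower and attach them to the train op; `optimizer` and `grads` below are placeholders:

```python
import tensorflow as tf

# Moving-average updates from one tower are enough, since every tower
# sees similarly distributed data.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope='tower_0')
with tf.control_dependencies(update_ops):
    train_op = optimizer.apply_gradients(grads)
```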
I've found a simple way to implement distributed batch normalization in pure TensorFlow, which I would like to share with you: batch norm across GPUs. This may be interesting for video/action recognition, image segmentation, and other domains where the batch size in a single GPU is very limited. |
@ppwwyyxx Hi Yuxin, is this the way to go for distributed batchnorm? |
@MrWanter The code you linked to has nothing to do with batchnorm. |
For batch norm with multi-GPU statistics, the link given by @holyseven looks like a correct implementation. Tensorpack also has such a feature that works with its multi-GPU trainers. |
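For the record, Tensorpack exposes this through a `sync_statistics` argument on its `BatchNorm` layer; the call below is a sketch from memory (`x` is a placeholder tensor), so check the Tensorpack docs for the exact signature in your version:

```python
from tensorpack.models import BatchNorm

# 'nccl' syncs statistics across GPUs; 'horovod' syncs across workers.
out = BatchNorm('bn', x, sync_statistics='nccl')
```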
@holyseven's implementation seems to split the batch and construct each layer on each GPU one by one, but in distributed training with multiple workers, wouldn't that be hard to integrate into existing frameworks like Horovod? |
It averages the trainable parameters and the moving averages among workers. It does not perform cross-gpu or cross-machine batch norm.
It's very far from that.
With a working horovod training code, adding an option to sync the statistics would not be too hard. |
Thanks for the response. How about using your implementation of sync batch norm with nccl? |
I meant there might be room for horovod, not much for nccl. |
I see. Have you tried using nccl for synchronous distributed batchnorm? The communication would be heavier, since batch statistics need to be aggregated per layer across workers over the network. |
nccl does not support it. |
It seems nccl2 supports inter-node all_reduce operations. |
It seems there is no way to use it from TensorFlow. |
What is the difference between @holyseven's implementation and the inception example? |
Here is a custom Keras layer which implements train-phase cross-replica batch normalization under a MirroredStrategy. If anyone finds this useful or wants to submit a PR, I could use some help implementing the prediction phase (i.e. moving mean/variance).

```python
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

from tensorflow.python.keras.engine.base_layer import InputSpec, Layer
import tensorflow as tf


class SyncBatchNorm(Layer):
    """Cross-replica batch normalization layer"""

    def __init__(
            self,
            center=True,
            scale=False,
            trainable=True,
            name=None,
            **kwargs
    ):
        super(SyncBatchNorm, self).__init__(
            name=name, trainable=trainable, **kwargs)
        self.axis = -1
        self.center = center
        self.scale = scale
        self.supports_masking = True
        self.epsilon = 1e-3

    def build(self, input_shape):
        dim = input_shape[self.axis]
        if dim is None:
            raise ValueError(
                'Axis ' + str(self.axis) + ' of '
                'input tensor should have a defined dimension '
                'but the layer received an input with shape ' +
                str(input_shape) + '.'
            )
        self.input_spec = InputSpec(
            ndim=len(input_shape),
            axes={self.axis: dim}
        )
        shape = (dim,)
        if self.scale:
            self.gamma = self.add_weight(
                shape=shape,
                name='gamma',
                initializer='ones',
            )
        else:
            self.gamma = None
        if self.center:
            self.beta = self.add_weight(
                shape=shape,
                name='beta',
                initializer='zeros',
            )
        else:
            self.beta = None
        self.built = True

    def call(self, x, training=None):
        ctx = tf.distribute.get_replica_context()
        n = ctx.num_replicas_in_sync
        # Each replica contributes mean(x)/n and mean(x^2)/n; the SUM
        # all-reduce then yields the global first and second moments.
        mean, mean_sq = ctx.all_reduce(
            tf.distribute.ReduceOp.SUM,
            [tf.reduce_mean(x, axis=0) / n,
             tf.reduce_mean(x**2, axis=0) / n]
        )
        variance = mean_sq - mean ** 2
        return tf.nn.batch_normalization(
            x,
            mean,
            variance,
            self.beta,
            self.gamma,
            self.epsilon)

    def compute_output_shape(self, input_shape):
        return input_shape

    def get_config(self):
        return {
            'axis': self.axis,
            'epsilon': self.epsilon,
            'center': self.center,
            'scale': self.scale,
        }
```
|
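A hypothetical usage sketch (the model and shapes are placeholders): under `MirroredStrategy`, each replica's `call` runs in a replica context, so the `all_reduce` above sees all replicas:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        SyncBatchNorm(),  # normalizes with cross-replica statistics
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```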
Have you found the right way? Thanks. |
I never used TensorFlow again because it is hard to do research with. I gave up without finding the answer. I guess I'll move to PyTorch. |
Does running under horovod average batch norm statistics across GPUs? |
The answer is no AFAIK, at least for Horovod 0.19.0. FYI, TF just added Sync BN: adf7690 |
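For anyone landing here later: the commit referenced above corresponds to the Keras sync batch norm layer (available as of TF 2.2; the exact module path may move between versions):

```python
import tensorflow as tf

# Synchronizes batch statistics across all replicas at each train step.
sync_bn = tf.keras.layers.experimental.SyncBatchNormalization()
```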
Where is the batch normalization implementation for multi-GPU scenarios? How does one keep track of `mean`, `variance`, `offset` and `scale` in the context of the multi-GPU example as given in the CIFAR-10 tutorial? Why has the question on StackOverflow been left unanswered for so long?
For all the beauty that it brings with TensorBoard etc., it's kind of appalling to see TensorFlow so far behind Torch in terms of its modeling capability. I'd be really glad if someone took up responsibility and came up with a decent batch normalization implementation for all cases. Even if it is already there, could anyone care enough to write good documentation for it?
There are so many issues pertaining to batch normalization in TensorFlow. It's important that you guys straighten this out, as batch normalization enables super-fast convergence for very deep networks and is REALLY important for modern-day deep learning research.
PS: Please spare my outburst. I've been a Torch user for more than a year and I had very high hopes for TensorFlow.