
Add support for cudnn's group convolution. #25818

Merged
merged 4 commits into tensorflow:master from ppwwyyxx:master Apr 2, 2019

Conversation

@ppwwyyxx
Contributor

ppwwyyxx commented Feb 17, 2019

This PR enables group convolution in cudnn, a feature that has been highly desired for many years (#3332, #12052 (comment), #11662, #10482).

With this PR, it is now possible to call tf.nn.conv2d(inputs, filters) where the depth of inputs is not necessarily equal to filters.shape[2], but only needs to be a multiple of filters.shape[2].
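
For example, a minimal sketch of the new behavior (illustrative shapes, not code from the PR):

import tensorflow as tf

# 32 groups: the input has 256 channels, while the filter's in-channel
# dimension is 256 // 32 = 8. Before this PR the two had to match exactly.
x = tf.random_normal([1, 64, 64, 256])                        # NHWC input
w = tf.random_normal([3, 3, 8, 256])                          # [kH, kW, C_in // groups, C_out]
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')  # grouped conv, handled by cudnn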

The core of this PR is only two lines of code (#3332 (comment)) which remove the shape check. I then added some extra checks and tests.

This benchmark script:

import tensorflow as tf
import time
import os

N = 64
C = 256
G = 32
H, W = 64, 64
print("N, C, H, W:", [N, C, H, W])


def benchmark_all(use_loop, format):
    shape4d = [N, C, H, W] if format == 'NCHW' else [N, H, W, C]

    tf.reset_default_graph()
    input = tf.get_variable('input', shape=shape4d, dtype=tf.float32)
    filter = tf.get_variable('filter', shape=[3, 3, C // G, C], dtype=tf.float32)

    if use_loop:
        inputs = tf.split(input, G, axis=1 if format == 'NCHW' else 3)
        filters = tf.split(filter, G, axis=3)
        output = tf.concat(
            [tf.nn.conv2d(i, f,
                strides=[1,1,1,1],
                padding='SAME',
                data_format=format) for i, f in zip(inputs, filters)], axis=1 if format == 'NCHW' else 3)
    else:
        output = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='SAME', data_format=format)


    forward_op = output.op
    cost = tf.reduce_sum(output)
    backward_op = tf.train.GradientDescentOptimizer(0.1).minimize(cost)

    def benchmark(op, nr_iter=200, nr_warmup=10):
        for k in range(nr_warmup):
            op.run()
        start = time.perf_counter()
        for k in range(nr_iter):
            op.run()
        end = time.perf_counter()
        itr_per_sec = nr_iter * 1. / (end - start)
        return itr_per_sec

    sess = tf.Session()
    with sess.as_default():
        sess.run(tf.global_variables_initializer())

        spd_forward = benchmark(forward_op)
        print("Loop={}, Format={}, Forward: {} itr/s".format(use_loop, format, spd_forward))
        spd_backward = benchmark(backward_op)
        print("Loop={}, Format={}, Backward: {} itr/s".format(use_loop, format, spd_backward))


formats = ['NHWC', 'NCHW']
for format in formats:
    for use_loop in [True, False]:
        benchmark_all(use_loop, format)

Executed on a V100 with CUDA 10 and cuDNN 7.4.2, it prints:

N, C, H, W: [64, 256, 64, 64]
Loop=True, Format=NHWC, Forward: 65.49446747235214 itr/s
Loop=True, Format=NHWC, Backward: 32.26484275606916 itr/s
Loop=False, Format=NHWC, Forward: 117.40288830454352 itr/s
Loop=False, Format=NHWC, Backward: 50.051492362319074 itr/s
Loop=True, Format=NCHW, Forward: 98.8428390274372 itr/s
Loop=True, Format=NCHW, Backward: 35.672312085388455 itr/s
Loop=False, Format=NCHW, Forward: 152.24726060851506 itr/s
Loop=False, Format=NCHW, Backward: 56.21414524041962 itr/s

which shows around a 50-80% speedup over the naive loop-based implementation.

@googlebot googlebot added the cla: yes label Feb 17, 2019
@rthadur rthadur self-assigned this Feb 19, 2019
@rthadur rthadur requested a review from penpornk Feb 19, 2019
@rthadur rthadur added this to Assigned Reviewer in PR Queue via automation Feb 19, 2019
@penpornk penpornk requested review from chsigg and removed request for penpornk Feb 22, 2019
@ppwwyyxx

Contributor Author

ppwwyyxx commented Feb 26, 2019

I found that this op does not actually compute the correct gradient w.r.t. the filters. Please hold the PR for now.

@ppwwyyxx

Contributor Author

ppwwyyxx commented Feb 27, 2019

The backward kernels appeared to support group conv, but in fact the implementation used the wrong shapes and was never tested.
The backward pass is now correct and unit tests have been added.
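
For reference, a minimal sketch (not the actual unit test added in this PR) of how the grouped forward and backward results can be checked against the loop-based reference on a GPU:

import numpy as np
import tensorflow as tf

G, C = 4, 16
x = tf.constant(np.random.rand(2, 8, 8, C), dtype=tf.float32)
w = tf.constant(np.random.rand(3, 3, C // G, C), dtype=tf.float32)

# Grouped conv in one call vs. the split-and-concat reference.
grouped = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
reference = tf.concat(
    [tf.nn.conv2d(xi, wi, strides=[1, 1, 1, 1], padding='SAME')
     for xi, wi in zip(tf.split(x, G, axis=3), tf.split(w, G, axis=3))],
    axis=3)

# Filter gradients should also agree between the two implementations.
grad_grouped, = tf.gradients(tf.reduce_sum(grouped), [w])
grad_reference, = tf.gradients(tf.reduce_sum(reference), [w])

with tf.Session() as sess:
    out = sess.run([grouped, reference, grad_grouped, grad_reference])
    np.testing.assert_allclose(out[0], out[1], rtol=1e-4)
    np.testing.assert_allclose(out[2], out[3], rtol=1e-4)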

@chsigg

Contributor

chsigg commented Feb 28, 2019

Hi Yuxin, thanks a lot for your contribution. I'm trying to merge your CL, but I'm hitting some test failures. It might take a few more days before this lands. Thanks for your patience!

@chsigg

Contributor

chsigg commented Mar 14, 2019

Hi Yuxin, I've tried to merge this CL but there are failing XLA convolution tests.

You can enable XLA here:

xla_enable_strict_auto_jit = False,

and then run:
bazel test --config=cuda --config=xla //tensorflow/python/kernel_tests:conv_ops_test

Thanks again for your contribution!

@ppwwyyxx

Contributor Author

ppwwyyxx commented Mar 14, 2019

I noticed that the code in compiler/tf2xla/kernels/conv_op_helpers.cc does not properly translate group convolution to XLA ops, although the XLA ops seem to have support for it already.
However, I was unable to reproduce test failures.
I rebased this PR on top of cf3c25b (latest master), rebuilt after bazel clean, and saw no test failures:

$bazel test --jobs 1 --cache_test_results=no  --config=cuda --config=xla //tensorflow/python/kernel_tests:conv_ops_test
$TEST_TMPDIR defined: output root default is '/tmp/bazel/' and max_idle_secs default is '15'.
WARNING: The following configs were expanded more than once: [cuda, xla]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
WARNING: /XX/tensorflow/tensorflow/python/BUILD:3239:1: in py_library rule //tensorflow/python:standard_ops: target '//tensorflow/python:standard_ops' depends on deprecated target '//tensorflow/python/ops/distributions:distributions': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.distributions will not receive new
features, and will be removed by early 2019. You should update all usage of `tf.distributions` to `tfp.distributions`.
WARNING: /XX/tensorflow/tensorflow/contrib/metrics/BUILD:16:1: in py_library rule //tensorflow/contrib/metrics:metrics_py: target '//tensorflow/contrib/metrics:metrics_py' depends on deprecated target '//tensorflow/python/ops/distributions:distributions': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.distributions
will not receive new features, and will be removed by early 2019. You should update all usage of `tf.distributions` to `tfp.distributions`.
INFO: Analysed target //tensorflow/python/kernel_tests:conv_ops_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //tensorflow/python/kernel_tests:conv_ops_test up-to-date:
  bazel-bin/tensorflow/python/kernel_tests/conv_ops_test
INFO: Elapsed time: 221.848s, Critical Path: 116.95s, Remote (0.00% of the time): [queue: 0.00%, setup: 0.00%, process: 0.00%]
INFO: 5 processes: 5 local.
INFO: Build completed successfully, 6 total actions
//tensorflow/python/kernel_tests:conv_ops_test                           PASSED in 93.2s
  Stats over 4 runs: max = 93.2s, min = 28.3s, avg = 49.1s, dev = 25.7s

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 6 total actions
$git diff
diff --git i/tensorflow/tensorflow.bzl w/tensorflow/tensorflow.bzl
index 28c189e953..df0b5a64c7 100644
--- i/tensorflow/tensorflow.bzl
+++ w/tensorflow/tensorflow.bzl
@@ -2106,7 +2106,7 @@ def gpu_py_tests(
         shard_count = shard_count,
         tags = test_tags,
         xla_enabled = xla_enabled,
-        xla_enable_strict_auto_jit = False,
+        xla_enable_strict_auto_jit = True,
     )

 # terminology changes: saving cuda_* definition for compatibility

@chsigg could you help me figure out how to properly run the XLA tests?

@chsigg

Contributor

chsigg commented Mar 15, 2019

Sorry, I think I got the wrong line to change. Can you try this one:

xla_enable_strict_auto_jit = False,

@ppwwyyxx ppwwyyxx force-pushed the ppwwyyxx:master branch from b974ce4 to bc3c5dd Mar 15, 2019
@ppwwyyxx

Contributor Author

ppwwyyxx commented Mar 15, 2019

Thanks! Now I can reproduce the failures. The code was updated to include group conv support in tf2xla.

@CyFeng16


CyFeng16 commented Mar 26, 2019

Group convolution is not supported even in TF 2.0 (alpha) 👎
It's hard to implement many of the newest efficient models using TensorFlow :(

@ppwwyyxx

Contributor Author

ppwwyyxx commented Mar 28, 2019

Hi @chsigg , is there an update on this CL?

@chsigg

Contributor

chsigg commented Mar 28, 2019

There are a few test failures still, I'm looking into them and will merge ASAP. Thanks for your patience.

@tensorflow-copybara tensorflow-copybara merged commit bc3c5dd into tensorflow:master Apr 2, 2019
2 checks passed
cla/google All necessary CLAs are signed
import/copybara Change imported to the internal review system
PR Queue automation moved this from Assigned Reviewer to Merged Apr 2, 2019
@veqtor


veqtor commented Jun 5, 2019

Can we expect this in 1.14?
Also, this is only for 2D or?

@CyFeng16


CyFeng16 commented Jun 5, 2019

I'm using bool_mask and sequence_mask to simulate dense group convolution in eager execution. It achieves similar functionality, except for the wasted time. Any other solutions are welcome.

@ppwwyyxx

Contributor Author

ppwwyyxx commented Jun 5, 2019

This is already in 1.14 and the usage is above.

@andravin


andravin commented Jun 27, 2019

Is group convolution supported on TPU?

@CyFeng16


CyFeng16 commented Jun 27, 2019

@andravin

Unfortunately, it is currently not supported.
You can try PyTorch on TPU if you are familiar with it, as PyTorch natively supports group convolution.

More info as follows:

tensorflow/tpu#415

@ppwwyyxx

Contributor Author

ppwwyyxx commented Jun 27, 2019

Hi Andrew,
I have not used a TPU, and if I'm not mistaken the TPU compiler stack running on the TPU server is closed-source so I cannot tell what it does. My best guess is that group conv is supported, since the XLA stack is able to lower the group conv operations (implemented in this PR), and some official TPU examples contain depthwise conv.

Unrelated, but Pytorch apparently does not support group conv on TPU: https://github.com/pytorch/xla/blob/master/torch_xla/csrc/aten_xla_type.cpp#L814-L818

@andravin


andravin commented Jun 27, 2019

1.14 is not yet supported on cloud TPU: https://cloud.google.com/tpu/docs/supported-versions

@amitsabne1

Contributor

amitsabne1 commented Jun 28, 2019

Grouped convolutions are supported on TPUs.

@CyFeng16


CyFeng16 commented Jun 28, 2019

@andravin
Cloud TPU supports 1.14.1-dev20190508 and 1.14.1-dev20190518.
You can also find a nightly version of TensorFlow; please ignore the incomplete documentation.

@amitsabne1
Would you please tell me which version supports the group convolution ops?
Thanks.

@amitsabne1

Contributor

amitsabne1 commented Jun 28, 2019

AFAIK, grouped convolutions have been around on TPUs for a while; definitely since 1.12. Please let me know if you find otherwise.

@amitsabne1

Contributor

amitsabne1 commented Jul 1, 2019

The XLA test for grouped convolutions
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tests/grouped_convolution_test.cc
works on TPUs. The examples there may help you map back to TF programs.

@andravin


andravin commented Jul 2, 2019

@amitsabne1 Thanks, but realistically I am not going to spend any more time on this.

@KyotoSunshine

Contributor

KyotoSunshine commented Jul 23, 2019

Grouped convolutions are still unsupported on CPUs btw...

#29005

@x10000year


x10000year commented Dec 2, 2019

Why is group convolution for the half type extremely slow? It can take more than 100 ms while a regular loop-based implementation only takes several ms. @ppwwyyxx

@AlexWang1900


AlexWang1900 commented Feb 12, 2020

Hi, I implemented a group convolution layer in Keras according to the code example above, both the loop-based way and the native tf.nn.conv2d group convolution.

I found a big problem here: the two ways have different parameter sizes and different meanings.
The loop way has 32 small 3x3 filters, while the tf.nn.conv2d way has only one 3x3 filter. In a ResNet-20 three-stage configuration, the tf.nn.conv2d way has roughly 0.5 million fewer parameters, and the model is much worse in accuracy.

According to the original paper, "Aggregated Residual Transformations for Deep Neural Networks",

Section 3.3, Aggregated Transformations:

F(x) = sum_{i=1}^{C} T_i(x)

"We set the individual transformation T_i to be the bottleneck-shaped architecture [14], as illustrated in Fig. 1 (right). In this case, the first 1x1 layer in each T_i produces the low-dimensional embedding."

and the figure shows that each small filter should be different, in order to have an aggregated effect.

Attaching my code for reference:

# Assumed imports for this sketch (TF 1.x paths; adjust for your version):
import tensorflow as tf
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     InputSpec, concatenate)
from tensorflow.keras.regularizers import l2
from tensorflow.python.framework import tensor_shape
from tensorflow.python.keras.utils import conv_utils
from tensorflow.python.ops import nn_ops


class GroupConv2D(Conv2D):
    def __init__(self, filters,
                 kernel_size,
                 strides=(1, 1),
                 padding='same',
                 data_format=None,
                 dilation_rate=(1, 1),
                 num_group=1,
                 activation=None,
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        super(GroupConv2D, self).__init__(
            filters=filters,
            kernel_size=kernel_size,
            strides=strides,
            padding=padding,
            data_format=data_format,
            dilation_rate=dilation_rate,
            activation=activation,
            use_bias=use_bias,
            kernel_initializer=kernel_initializer,
            bias_initializer=bias_initializer,
            kernel_regularizer=kernel_regularizer,
            bias_regularizer=bias_regularizer,
            activity_regularizer=activity_regularizer,
            kernel_constraint=kernel_constraint,
            bias_constraint=bias_constraint,
            **kwargs)
        self.num_group = num_group
        if self.filters % self.num_group != 0:
            raise ValueError("filters must be divisible by num_group with no remainder!")
        # self.input_spec = InputSpec(ndim=4)

    def build(self, input_shape):
        input_shape = tensor_shape.TensorShape(input_shape)
        input_channel = self._get_input_channel(input_shape)

        # The only difference from Conv2D.build: the kernel's in-channel
        # dimension is input_channel // num_group, so cudnn runs a grouped conv.
        kernel_shape = self.kernel_size + (input_channel // self.num_group, self.filters)

        self.kernel = self.add_weight(
            name='kernel',
            shape=kernel_shape,
            initializer=self.kernel_initializer,
            regularizer=self.kernel_regularizer,
            constraint=self.kernel_constraint,
            trainable=True,
            dtype=self.dtype)
        if self.use_bias:
            self.bias = self.add_weight(
                name='bias',
                shape=(self.filters,),
                initializer=self.bias_initializer,
                regularizer=self.bias_regularizer,
                constraint=self.bias_constraint,
                trainable=True,
                dtype=self.dtype)
        else:
            self.bias = None

        channel_axis = self._get_channel_axis()
        self.input_spec = InputSpec(ndim=self.rank + 2,
                                    axes={channel_axis: input_channel})

        self._build_conv_op_input_shape = input_shape
        self._build_input_channel = input_channel
        self._padding_op = self._get_padding_op()
        self._conv_op_data_format = conv_utils.convert_data_format(
            self.data_format, self.rank + 2)
        self._convolution_op = nn_ops.Convolution(
            input_shape,
            filter_shape=self.kernel.shape,
            dilation_rate=self.dilation_rate,
            strides=self.strides,
            padding=self._padding_op,
            data_format=self._conv_op_data_format)
        self.built = True

    def get_config(self):
        config = super(Conv2D, self).get_config()
        config.pop('rank')
        config["num_group"] = self.num_group
        return config



def grouped_convolution_block(x, grouped_channels, cardinality, strides, cudnn=True, weight_decay=1e-4):
    if cardinality == 1:
        # With cardinality 1, it is a standard convolution.
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x = Conv2D(grouped_channels, (3, 3), padding='same', strides=(strides, strides), use_bias=True,
                   kernel_initializer='he_normal', kernel_regularizer=l2(weight_decay))(x)
        return x

    x = BatchNormalization()(x)
    x = Activation('relu')(x)

    if cudnn:
        group_merge = GroupConv2D(grouped_channels, (3, 3), padding='same',
                                  use_bias=True, kernel_initializer='he_normal', strides=(strides, strides),
                                  kernel_regularizer=l2(weight_decay))(x)
    else:
        input_list = tf.split(x, cardinality, axis=-1)
        output_list = []
        for conv_idx, input_tensor in enumerate(input_list):
            tmp = Conv2D(grouped_channels, (3, 3), padding='same', use_bias=True, strides=(strides, strides),
                         kernel_initializer='he_normal', kernel_regularizer=l2(weight_decay))(input_tensor)
            output_list.append(tmp)
        group_merge = concatenate(output_list, axis=-1)

    return group_merge


And the performance of the two approaches:

loop:
Epoch 6/60
353/353 [==============================] - 83s 235ms/step - loss: 2.6220 - dense_1_loss: 1.4720 - dense_1_accuracy: 0.5991 - val_dense_1_accuracy: 0.5416

cudnn:
Epoch 6/60
353/353 [==============================] - 53s 150ms/step - loss: 4.3920 - dense_1_loss: 2.8750 - dense_1_accuracy: 0.2673 - val_dense_1_accuracy: 0.2203

@ppwwyyxx

Contributor Author

ppwwyyxx commented Feb 12, 2020

The code in the PR above uses the same filter for both branches, so apparently they have the same size.
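
As a quick sanity check (a sketch using the shapes from the benchmark above, not code from the PR), both paths consume the same [3, 3, C // G, C] filter, just split differently:

C, G = 256, 32
fused_params = 3 * 3 * (C // G) * C                 # one [3, 3, 8, 256] kernel
loop_params = G * (3 * 3 * (C // G) * (C // G))     # 32 kernels of shape [3, 3, 8, 8]
assert fused_params == loop_params == 18432         # identical parameter count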

@AlexWang1900


AlexWang1900 commented Feb 12, 2020

@ppwwyyxx

I see, the code above uses the same filter for both branches,
but it doesn't solve the problem of implementing ResNeXt.
Apparently, ResNeXt's group convolution needs 32 different filters, not one filter for all 32 groups.

@ppwwyyxx

Contributor Author

ppwwyyxx commented Feb 12, 2020

You probably have a misunderstanding of ResNeXt.

@AlexWang1900


AlexWang1900 commented Feb 12, 2020

At least in my experiment, a ResNeXt constructed with the group convolution code here is not on par with a ResNet of the same parameter size,
while with the loop-based group convolution code, ResNeXt performs better than ResNet.

@ppwwyyxx

Contributor Author

ppwwyyxx commented Feb 12, 2020

@AlexWang1900


AlexWang1900 commented Feb 16, 2020

My experiments using this PR worked (https://github.com/tensorpack/tensorpack/tree/master/examples/ResNet).

I think I made a mistake in my code: I set the number of groups to 1 by default and forgot to set it to 4.
Thanks for your patience.

@AlexWang1900


AlexWang1900 commented Feb 16, 2020

And in
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_ops.py,
in class Convolution(object), the following check should be deleted too:

if not input_channels_dim.is_compatible_with(
    filter_shape[num_spatial_dims]):
  raise ValueError(
      "number of input channels does not match corresponding dimension of "
      "filter, {} != {}".format(input_channels_dim,
                                filter_shape[num_spatial_dims]))
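
For context, this is the check that trips the Keras path above: nn_ops.Convolution validates that the input channel count equals filter_shape[num_spatial_dims], so a grouped kernel of shape [3, 3, C // G, C] is rejected even though tf.nn.conv2d itself now accepts it. A minimal sketch of the symptom (constructor signature as in TF 1.x; exact error text may differ across versions):

import tensorflow as tf
from tensorflow.python.ops import nn_ops

C, G = 16, 4
try:
    nn_ops.Convolution(
        input_shape=tf.TensorShape([None, 8, 8, C]),
        filter_shape=tf.TensorShape([3, 3, C // G, C]),
        padding='SAME')
except ValueError as e:
    print(e)  # "number of input channels does not match ..."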
