Match conv2 weights stddev with cuda-convnet's layer def. #2374

Closed. Wants to merge 1 commit into master.

keiji (Contributor) commented May 15, 2016

conv2 initW is 0.01 in cuda-convnet.

https://code.google.com/p/cuda-convnet/source/browse/trunk/example-layers/layers-conv-local-11pct.cfg

I'm working through the tutorial's EXERCISE.

I changed the layer type from fully connected to locally connected (just convolutional for now), and a vanishing-gradient problem occurred at conv2.

Is this intended?

tensorflow-jenkins (Collaborator) commented May 15, 2016

Can one of the admins verify this patch?

googlebot added the cla: yes label May 15, 2016

caisq (Contributor) commented May 15, 2016

@tensorflow-jenkins test this please

caisq (Contributor) commented May 15, 2016

girving (Contributor) commented May 16, 2016

I don't follow your message. Are you saying that your new value helps after you change the network to a different topology? What are the evaluation accuracies before and after your change?

keiji (Contributor) commented May 16, 2016

Hi @girving,

conv2's stddev parameter is different from the original cuda-convnet value.

With the current value (1e-4), a vanishing gradient occurs when I try the EXERCISE.


The original cuda-convnet's conv2 initW is 0.01.

[conv2]
type=conv
inputs=rnorm1
filters=64
padding=2
stride=1
filterSize=5
channels=64
neuron=relu
initW=0.01
partialSum=8
sharedBiases=1

And cifar10.py's conv2 stddev is 1e-4.

  with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64],
                                         stddev=1e-4, wd=0.0)

I think this may be a typo, because 1e-4 is the same value as the conv1 layer.

The TensorFlow tutorial poses this exercise:

EXERCISE: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected and not fully connected. Try editing the architecture to exactly reproduce the locally connected architecture in the top layer.

I'm working on this exercise (unfortunately I cannot find a model answer on the web).
I tried replacing local3 and local4 with convolutional layers, and with conv2's stddev at 1e-4 a vanishing gradient occurred at the conv2 layer.
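
As a rough sanity check (my own back-of-the-envelope numbers, assuming roughly zero-mean, unit-scale inputs, not measured from the model): the conv2 kernel has fan-in 5*5*64 = 1600, so each such layer scales its input signal by roughly stddev * sqrt(fan_in).

import numpy as np

# Rough per-layer gain of a 5x5x64 -> 64 conv for the two initializations.
fan_in = 5 * 5 * 64
for stddev in (1e-4, 1e-2):
  gain = stddev * np.sqrt(fan_in)
  print('stddev=%g -> per-layer gain ~ %.3f' % (stddev, gain))
# stddev=0.0001 -> per-layer gain ~ 0.004
# stddev=0.01 -> per-layer gain ~ 0.400

With 1e-4 the activations shrink by a factor of roughly 250 at conv2, which fits the vanishing gradient I observe; with 1e-2 the scale is roughly preserved (gain ~0.4).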

keiji (Contributor) commented May 16, 2016

Here's my current inference code.

def inference(images):
  """Build the CIFAR-10 model.

  Args:
    images: Images returned from distorted_inputs() or inputs().

  Returns:
    Logits.
  """
  # We instantiate all variables using tf.get_variable() instead of
  # tf.Variable() in order to share variables across multiple GPU training runs.
  # If we only ran this model on a single GPU, we could simplify this function
  # by replacing all instances of tf.get_variable() with tf.Variable().
  #
  # conv1
  with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64],
                                         stddev=1e-4, wd=0.0)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    bias = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)

  # pool1
  pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                         padding='SAME', name='pool1')

  # norm1
  norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm1')

  # conv2
  with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64],
                                         stddev=1e-2, wd=0.0)
    conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
    bias = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv2)

  # norm2
  norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm2')
  # pool2
  pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1],
                         strides=[1, 2, 2, 1], padding='SAME', name='pool2')

  # local3
  with tf.variable_scope('local3') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[3, 3, 64, 64],
                                         stddev=0.04, wd=0.04)
    conv = tf.nn.conv2d(pool2, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.1, shape=[64]))
    bias = tf.nn.bias_add(conv, biases)
    local3 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(local3)

  # local4
  with tf.variable_scope('local4') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[3, 3, 64, 32],
                                         stddev=0.04, wd=0.04)
    conv = tf.nn.conv2d(local3, kernel, [1, 1, 1, 1], padding='SAME')
    biases = tf.Variable(tf.constant(0.1, shape=[32]))
    bias = tf.nn.bias_add(conv, biases)
    local4 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(local4)

  # softmax, i.e. softmax(WX + b)
  with tf.variable_scope('softmax_linear') as scope:
    # Move everything into depth so we can perform a single matrix multiply.
    reshape = tf.reshape(local4, [FLAGS.batch_size, -1])
    dim = reshape.get_shape()[1].value

    weights = _variable_with_weight_decay('weights', [dim, NUM_CLASSES],
                                          stddev=1 / float(NUM_CLASSES), wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.nn.softmax(tf.nn.bias_add(tf.matmul(reshape, weights), biases), name=scope.name)

    _activation_summary(softmax_linear)

  return softmax_linear

With the weight stddev at 1e-4 for conv2, the evaluation accuracy is just 10% (chance level).

2016-05-17 08:29:43.386243: precision @ 1 = 0.100

And TensorBoard shows a vanishing gradient at the conv2 layer.

[TensorBoard screenshot]

girving (Contributor) commented May 16, 2016

@vincentvanhoucke: Can I get your eye here? His change seems reasonable to me (I would think 1e-2 is a better std for both conv layers).

vincentvanhoucke (Member) commented May 17, 2016

@keiji your solution doesn't exactly replicate Alex's model. You use convolutional layers for local3 and local4, whereas Alex used locally-connected layers that do not share their parameters across patches. My guess is that's one reason why you see such a strong effect from the initialization.
Come to think of it, I don't know how one would do that in TensorFlow. @girving do you know who wrote the tutorial and whether they had a specific trick in mind for locally connected, non-convolutional layers?

girving (Contributor) commented May 17, 2016

@vincentvanhoucke: To make sure I understand: you mean a convolutional layer but using separate filters for each output point? You could do it with batch matmul and a bunch of reshaping / tiling logic, but it would be quite slow. I think we'd need a custom op to do it fast. And actually even the reshaping / tiling logic may be out of reach in a performant way.

vincentvanhoucke (Member) commented May 17, 2016

@girving yes, that's what the original Cifar10 model uses. We should remove that exercise if we don't know ourselves how to do it well at this point, unless the author of the tutorial has something up their sleeve I don't know.

girving (Contributor) commented May 17, 2016

@shlens: Looks like you wrote parts of the deep_cnn tutorial. Do you know a trick for untied convolutions?

shlens (Member) commented May 17, 2016

There is no trick. We envisioned the exercise to be difficult because a user would need to invoke several tiling, reshape and matmul operations. Most of the challenge would be in the construction of the tiling operation across the local3 and local4 layers.
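
For what it's worth, here is a minimal sketch of that construction using only ops available at the time (tf.slice, tf.concat(dim, values), tf.batch_matmul). The helper name, the stride-1 / 'VALID' assumptions, and the initializer values are illustrative rather than an official recipe, and the Python loop over output positions will be slow, as noted above.

# Sketch of an untied ("locally connected") layer, TF 0.x-era API.
# Assumes `import tensorflow as tf`; stride 1, 'VALID' padding, and a
# statically known input shape [batch, height, width, in_channels].
def _locally_connected(x, filter_size, out_channels, stddev, name):
  with tf.variable_scope(name) as scope:
    batch, height, width, in_ch = x.get_shape().as_list()
    out_h = height - filter_size + 1
    out_w = width - filter_size + 1
    positions = out_h * out_w
    patch_dim = filter_size * filter_size * in_ch

    # One filter (and bias) per output position -- this is what "untied" means.
    weights = tf.get_variable(
        'weights', [positions, patch_dim, out_channels],
        initializer=tf.truncated_normal_initializer(stddev=stddev))
    biases = tf.get_variable(
        'biases', [positions, 1, out_channels],
        initializer=tf.constant_initializer(0.1))

    # Gather the filter_size x filter_size patch feeding each output position.
    patches = []
    for i in range(out_h):
      for j in range(out_w):
        patch = tf.slice(x, [0, i, j, 0], [-1, filter_size, filter_size, -1])
        patches.append(tf.reshape(patch, [batch, 1, patch_dim]))
    patches = tf.concat(1, patches)             # [batch, positions, patch_dim]
    patches = tf.transpose(patches, [1, 0, 2])  # [positions, batch, patch_dim]

    # Untied matmul: every output position multiplies by its own weight matrix.
    out = tf.batch_matmul(patches, weights) + biases  # [positions, batch, out_channels]
    out = tf.transpose(out, [1, 0, 2])                # [batch, positions, out_channels]
    out = tf.reshape(out, [batch, out_h, out_w, out_channels])
    return tf.nn.relu(out, name=scope.name)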

keiji (Contributor) commented May 19, 2016

As far as I understand, TensorFlow doesn't have a convenience op for building a locally-connected layer; we have to implement the locally-connected logic manually.

I'll retry the exercise with this approach.
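
For example, inside inference() the local3 block could be rewritten along these lines, assuming a helper like the _locally_connected sketch above (the stddev value here is a guess, not tuned):

  # local3 as an untied layer instead of a fully connected one (illustrative only).
  local3 = _locally_connected(pool2, filter_size=3, out_channels=64,
                              stddev=0.04, name='local3')
  _activation_summary(local3)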

vincentvanhoucke (Member) commented May 21, 2016

I'm going to close this PR. Please reopen if you feel strongly that the defaults should match cuda-convnet. As it stands, I'm not convinced it's an issue.
