Is there a way to turn off variable reuse a number of scopes down? #537

Closed
cinjon opened this issue Dec 17, 2015 · 15 comments

@cinjon commented Dec 17, 2015

I'm running into an under-sharing error with scope reuse. A specific example I can point to using TF repo code: suppose we build an architecture with parallel attention modules.

Say we did something in the attention_decoder in rnn/seq2seq.py like this:

        # module can be any of ['a', 'b', 'c'...]
        with tf.variable_scope(module, reuse=None):
          k = tf.get_variable('AttnW', [1, 1, attn_size, attention_vec_size])
          hidden_features.append(tf.nn.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
          v.append(tf.get_variable('AttnV', [attention_vec_size]))

I then build the model by running:

for module in modules:
  _outputs, _losses = seq2seq.model_with_buckets(..., 
                  lambda x, y: seq2seq.embedding_attention_seq2seq(x, y, module, False), 
                  ...)
  ...

This works just fine for one module, i.e. the first loop goes off without a hitch. When I get to the second module though, I get the following error:

ValueError: Under-sharing: Variable embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/b/AttnW_0 does not exist, disallowed. Did you mean to set reuse=None in VarScope?

I realize that I can build the variables up front and then pass them through functions to where they're needed. However, that seems undesirable, because it leaves free-floating variables created at the start, far outside the program scope where they're used. Is there a better way?
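For concreteness, a rough sketch of that up-front approach (names here are just placeholders, not actual code I'm running):

    # Sketch only: build the per-module attention variables once, up front,
    # then thread them down to where attention_decoder needs them.
    attn_vars = {}
    for module in modules:
      with tf.variable_scope(module):
        attn_vars[module] = (
            tf.get_variable('AttnW', [1, 1, attn_size, attention_vec_size]),
            tf.get_variable('AttnV', [attention_vec_size]))
    # ... later, pass attn_vars[module] through embedding_attention_seq2seq
    # down into attention_decoder instead of calling get_variable there ...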

@lukaszkaiser (Contributor)

Thanks for your report, cinjon. I think it's a bug in seq2seq.model_with_buckets. In particular, I think the problem is in these lines:

    if j > 0:
      vs.get_variable_scope().reuse_variables()

This just sets variable sharing on the current scope, which is not the right thing to do: it should create a separate scope (reusing or not, depending on j). As it stands, the setting leaks outside of the function and compromises your second module.

We'll work on a fix; I'll also test it with a bunch of other models. In the meantime, can you try replacing these two lines in model_with_buckets with something like this:

    with tf.variable_scope("model_with_buckets", reuse=True if j > 0 else None):
      ... and shift the body to be in this scope ...

I think that should help and be better than managing your own variables.
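Roughly, the shape of what I mean (a sketch only; the surrounding names follow the seq2seq.py of that era and may differ in detail):

    # Sketch: wrap each bucket's body in its own scope so the reuse flag
    # never leaks back into the caller's scope.
    for j in xrange(len(buckets)):
      with tf.variable_scope("model_with_buckets",
                             reuse=True if j > 0 else None):
        bucket_outputs, _ = seq2seq(encoder_inputs[:buckets[j][0]],
                                    decoder_inputs[:buckets[j][1]])
        outputs.append(bucket_outputs)
        # ... the losses for this bucket are built inside the same scope ...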

Thanks for catching this problem!

@cinjon commented Dec 18, 2015

Awesome, thanks for looking into this Lukasz. I'll try to implement your fix in the meantime and will report back.

@cinjon commented Dec 20, 2015

Hey Lukasz, I just had some time to try this out and I'm still getting the Under-Sharing error. This makes sense because the scope change you're suggesting takes place in a scope enclosing the attention module. So when reuse=True is set and the module introduces a new variable name (via the with tf.variable_scope(module, reuse=None)), it throws the error.

If I instead turn off the reuse=True in model_with_buckets, then we get an Over-Sharing error in the embedding wrapper. This also makes sense because of the vs.get_variable_scope().reuse_variables() in ops/rnn.py.

It seems like there isn't a way to break the reuse contract in a sub-scope. Is that right? If so, is there a way to satisfy this without passing the variables through to the right function / making a separate attention module?

@cinjon commented Dec 20, 2015

I tried a few things to solve this, including changing the reuse_variables() in rnn.rnn to similarly be a with block. They didn't work, so I moved towards making the attention module modular.

That's complicated by the fact that the attention_states aren't calculated until embedding_attention_seq2seq. I moved it to its own function like this:

def attention_module(attention_states, module=None, num_heads=1):
    ...
    # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
    ...
    for a in xrange(num_heads):
      k_scope = 'AttnW_%d' % a
      v_scope = 'AttnV_%d' % a
      if module:
        k_scope += '_%s' % module
        v_scope += '_%s' % module

      k = vs.get_variable(k_scope, [1, 1, attn_size, attention_vec_size])
      hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
      v.append(vs.get_variable(v_scope, [attention_vec_size]))

    def attention(query):
      """Put attention masks on hidden using hidden_features and query."""
      attn_scope = 'Attention'
      if module:
        attn_scope += '_%s' % module

      for a in xrange(num_heads):
        with vs.variable_scope("%s_%d" % (attn_scope, a)):
          ...
      return ds

    return attention

And then instantiated it in embedding_attention_seq2seq:

...
top_states = [array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs]
attention_states = array_ops.concat(1, top_states)
attention_func = attention_module(attention_states, module, num_heads) 
...

But there is still this Under-Sharing problem. This time, it's on AttnW_0_b. It happens because, to instantiate the modules with the overall model, I am doing this:

for num, module in enumerate(modules):
  with vs.variable_scope('modules', reuse=True if num > 0 else None):
    _outputs, _losses = seq2seq.model_with_buckets(..., 
                    lambda x, y: seq2seq.embedding_attention_seq2seq(x, y, module, False), 
                    ...)
  ...

If I remove the reuse=True on that, then the error is again over-sharing, this time because modules/embedding_attention_seq2seq/RNN/cell_output/EmbeddingWrapper/embedding already exists, which I'm pretty sure is a side-effect of rnn.py.

Is there something else I could do to instantiate the graph like I'm describing? Thanks.

@lukaszkaiser (Contributor)

Hi cinjon. I think I corrected the reuse leaking from model_with_buckets in a recent commit, so now you're hitting a different problem, this one:

If I remove the reuse=True on that, then the error is again over-sharing, this time because
modules/embedding_attention_seq2seq/RNN/cell_output/EmbeddingWrapper/embedding already exists.

I think removing reuse=True in your case is the right thing to do (you don't want to reuse across modules, right?). But then: do you want to share the encoder embedding across modules? If you do, then you should create a variable just for the embedding and pass it to EmbeddingWrapper, I think. If you don't, then maybe just put each module in its own scope (e.g., with variable_scope("module" + str(num))) -- that will make all variables unique.

Does that help? I'm not fully sure how exactly you want to share variables -- if every module is separate, then I think the best way is to have a separate scope for each of them.
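A rough sketch of what I mean (sketch only; embedding_size is a placeholder, and whether EmbeddingWrapper can take a pre-built embedding depends on local modifications, since the stock one builds its own):

    # Sketch: one unique scope per module; if the encoder embedding should be
    # shared, build it once outside the per-module scopes and pass it down.
    with tf.variable_scope("shared"):
      embedding = tf.get_variable("embedding", [vocab_size, embedding_size])

    for num, module in enumerate(modules):
      with tf.variable_scope("module" + str(num)):
        _outputs, _losses = seq2seq.model_with_buckets(...)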

@cinjon commented Dec 21, 2015

I haven't been able to code up a solution where the modules share the encoder embedding yet, but yes, that's the idea. The only thing I want to be unique is the attention component; the encoder should be identical (shared) across the modules. I'll report back on getting that working.

@cinjon commented Dec 22, 2015

@lukaszkaiser , I got this to work by declaring all of the attention modules on every run so there is no under-sharing. The gradients are then the sum of the gradients of each module. At step time, I only feed the losses for the current module into the output. This is inefficient, but it seems to work. I'm going to dig into the graph that TensorFlow made just to make sure that it's right and then work on making it more efficient.
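Roughly what this looks like (a sketch; bucket_id and the active-module bookkeeping are elided):

    # Sketch of the workaround: build every module's graph each time so
    # nothing is under-shared; train on the summed loss, but only fetch the
    # active module's loss at step time.
    all_losses = []
    for num, module in enumerate(modules):
      with tf.variable_scope("modules", reuse=True if num > 0 else None):
        _outputs, losses = seq2seq.model_with_buckets(...)
      all_losses.append(losses)

    total_loss = tf.add_n([l[bucket_id] for l in all_losses])
    gradients = tf.gradients(total_loss, tf.trainable_variables())
    # At step time, only all_losses[active_module][bucket_id] goes into the
    # session outputs.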

It does seem though that it would be a lot easier to build this graph if there was a way to turn off reuse in a sub-scope.

@cinjon commented Dec 26, 2015

Hi again. I needed to make this more efficient, but my attempts haven't been working. I also realized that the previous solution I had built failed when there was more than one bucket (I had made it single-bucket for testing). TF would throw an error about the encoder and decoder placeholders needing values. If I used a bucket of (10, 12) and had set it up to also have a bucket of (20, 25), then the error would be for encoders 10 through 20 and decoders 12 through 25.

This was confusing because it seemed like I was building the graph the same way. What I have now is that I create the encoder_cell and the embedding in the Seq2Seq init:

    ...
    # Create the internal multi-layer cell for our RNN.
    single_cell = rnn_cell.GRUCell(layer_size)
    if use_lstm:
      single_cell = rnn_cell.BasicLSTMCell(layer_size)
    cell = single_cell
    if num_layers > 1:
      cell = rnn_cell.MultiRNNCell([single_cell] * num_layers)

    encoder_cell = rnn_cell.EmbeddingWrapper(cell, total_vocab_size)
    with vs.variable_scope('embedding_decoder_top_level'):
        with ops.device("/cpu:0"):
            embedding = vs.get_variable("embedding",
                                        [vocab_size, cell.input_size])

    # The seq2seq function: we use embedding for the input and attention.
    def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
      return seq2seq.embedding_attention_seq2seq(
        encoder_inputs, decoder_inputs, cell, vocab_size_in,
        vocab_size_out, output_projection=output_projection,
        feed_previous=do_decode, encoder_cell=encoder_cell, 
        embedding=embedding)

I'm then passing them through into the embedding_attention_seq2seq:

    self.module_graph = {}

    for num, module in enumerate(modules):
        scope_name = 'module_%s' % module
        with vs.variable_scope(scope_name):
          if forward_only:
            outputs, losses = seq2seq.model_with_buckets(...)
          else:
            outputs, losses = seq2seq.model_with_buckets(...)

        self.module_graph[module] = [outputs, losses]
        params = [param for param in tf.trainable_variables()
                  if param.name.startswith(scope_name)]

        if not forward_only:
            gradient_norms = []
            updates = []
            opt = tf.train.GradientDescentOptimizer(self.learning_rate)
            for b in xrange(len(buckets)):
                gradients = tf.gradients(losses[b], params)
                clipped_gradients, norm = tf.clip_by_global_norm(
                    gradients, max_gradient_norm)
                gradient_norms.append(norm)
                updates.append(opt.apply_gradients(
                    zip(clipped_gradients, params), global_step=self.global_step))

            self.module_graph[module].extend([gradient_norms, updates])

This throws a confusing error even before I can start training it:

  File ".../tensorflow/python/ops/nn.py", line 835, in sampled_softmax_loss
    name=name)
  File ".../tensorflow/python/ops/nn.py", line 654, in _compute_sampled_logits
    true_logits += true_b
  File ".../tensorflow/python/ops/math_ops.py", line 425, in binary_op_wrapper
    y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
  File ".../tensorflow/python/framework/ops.py", line 528, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File ".../tensorflow/python/framework/ops.py", line 472, in _TensorTensorConversionFunction
    % (dtype.name, t.dtype.name, str(t)))

ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int32: 'Tensor("module_a/model_with_buckets/module_a/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/Reshape_5:0", shape=(?, 1), dtype=int32, device=/cpu:0)'

I also printed out the params (see below) to see if there was anything peculiar, and I noticed the first value is 'module_a/RNN/cell_output/EmbeddingWrapper/embedding:0' (cell_output is my own name for the scope in rnn/rnn.py). This is the encoder_cell that's passed to the embedding_attention_seq2seq, but it's under module_a's scope. As far as I can tell, this means that it won't share training with module_b. Is that right?

[u'module_a/RNN/cell_output/EmbeddingWrapper/embedding:0', 
u'module_a/RNN/cell_output/GRUCell/Gates/Linear/Matrix:0', 
u'module_a/RNN/cell_output/GRUCell/Gates/Linear/Bias:0', 
u'module_a/RNN/cell_output/GRUCell/Candidate/Linear/Matrix:0', 
u'module_a/RNN/cell_output/GRUCell/Candidate/Linear/Bias:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnW_0:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Linear/Matrix:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Linear/Bias:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/GRUCell/Gates/Linear/Matrix:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/GRUCell/Gates/Linear/Bias:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/GRUCell/Candidate/Linear/Matrix:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/GRUCell/Candidate/Linear/Bias:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/Linear/Matrix:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/Linear/Bias:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnOutputProjection/Linear/Matrix:0', 
u'module_a/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnOutputProjection/Linear/Bias:0']

I realize this might be beyond the scope of a GitHub issue now. Let me know if you'd rather take this offline. Thanks @lukaszkaiser

@cinjon commented Dec 26, 2015

Some more info on the above error:

  File ".../tensorflow/python/ops/nn.py", line 835, in sampled_softmax_loss
    name=name)
  File ".../tensorflow/python/ops/nn.py", line 654, in _compute_sampled_logits
    true_logits += true_b
  File ".../tensorflow/python/ops/math_ops.py", line 425, in binary_op_wrapper
    y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
  File ".../tensorflow/python/framework/ops.py", line 528, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File ".../tensorflow/python/framework/ops.py", line 472, in _TensorTensorConversionFunction
    % (dtype.name, t.dtype.name, str(t)))

ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int32: 'Tensor("module_a/model_with_buckets/module_a/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/Reshape_5:0", shape=(?, 1), dtype=int32, device=/cpu:0)'

This happens on the second module. The loop builds the first one without a hitch and then goes on to build the second one, where it finds that the softmax Reshapes are now int32s.

Printout of true_b after true_b = array_ops.reshape(true_b, [-1, num_true]) (~ L671) in ops/nn.py:

...
Tensor("modules/model_with_buckets/modules/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/Reshape_5:0", shape=(?, 1), dtype=float32, device=/cpu:0)
...

And then on the second module, just before the error is thrown:

Tensor("modules_1/model_with_buckets/modules/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/Reshape_5:0", shape=(?, 1), dtype=int32, device=/cpu:0)

@cinjon commented Dec 26, 2015

It looks like if I change _compute_sampled_logits in ops/nn.py to cast all_b into a float32, then I get past this error and can build the graph, but later run into another error at step time where the params are not 1-dimensional.

...
all_b = embedding_ops.embedding_lookup(biases, all_ids)
all_b = math_ops.cast(all_b, dtypes.float32)
...
[[Node: modules_1/model_with_buckets/modules/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/embedding_lookup_1 = Gather[Tindices=DT_INT64, Tparams=DT_INT32, 
_device="/job:localhost/replica:0/task:0/cpu:0"](modules_1/model_with_buckets/modules/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/embedding_lookup_1/params_0, 
modules_1/model_with_buckets/modules/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/concat)]]

  File "../tensorflow/python/ops/nn.py", line 840, in sampled_softmax_loss
    name=name)
  File "../tensorflow/python/ops/nn.py", line 633, in _compute_sampled_logits
    all_b = embedding_ops.embedding_lookup(biases, all_ids)
  File ".../tensorflow/python/ops/embedding_ops.py", line 82, in embedding_lookup
    return array_ops.gather(params[0], ids, name=name)
  File ".../tensorflow/python/ops/gen_array_ops.py", line 301, in gather
    name=name)
  File ".../tensorflow/python/ops/op_def_library.py", line 664, in apply_op
    op_def=op_def)
  File ".../tensorflow/python/framework/ops.py", line 1850, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File ".../tensorflow/python/framework/ops.py", line 1049, in __init__
    self._traceback = _extract_stack()

EDIT: I'm going to dig more into this tomorrow, but in the meantime I forgot to mention an important part: the above error comes after we use control_flow_ops.cond to choose a different attention weight/bias (via the attention_module defined above) based on which module we're using. This happens in the attention_decoder, where instead of attns = attention(new_state) we have a recursive build-up: attns = control_flow_ops.cond(module == m, lambda: attention_func(new_state), lambda: attns). A sketch of that build-up follows.
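(Sketch only; module is assumed to be an int32 scalar holding the active module's index, and attention_funcs / initial_attns are placeholder names.)

    # Sketch: fold over the modules so that the active one's attention
    # function is the one applied to new_state.
    attns = initial_attns
    for m, attention_func in enumerate(attention_funcs):
      attns = control_flow_ops.cond(
          math_ops.equal(module, m),
          lambda f=attention_func: f(new_state),  # bind f now, not at call time
          lambda a=attns: a)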

@cinjon commented Dec 27, 2015

I couldn't leave it alone and tried one more thing, which was to use a control_flow_ops.cond on the models themselves. What this looks like is that I have a cond set up to branch to a different model_with_buckets output / loss based on the module. This didn't work because I run into a problem with branching in control_flow_ops's BuildCondBranch (~ L568: elif v.name not in self._values:): because the function's output is a list of lists (one inner list per bucket), it chokes on getting the name. I can't just flatten the list, though, because then the branching gets really awkward.
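For reference, the shape of what I tried (sketch only; module is again an int32 placeholder and the ... stand in for the usual model_with_buckets arguments):

    # Sketch of the attempted branching: select a whole model's
    # (outputs, losses) with cond. This is where BuildCondBranch chokes,
    # because each branch returns a list of lists (one inner list per bucket).
    outputs, losses = control_flow_ops.cond(
        math_ops.equal(module, 0),
        lambda: seq2seq.model_with_buckets(...),  # module 'a'
        lambda: seq2seq.model_with_buckets(...))  # module 'b'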

At this point, I'm unsure what else to try to get this to work. I feel like I am missing some key understanding that should make this easy. :(

@cinjon commented Jan 4, 2016

@lukaszkaiser Thoughts?

@lukaszkaiser (Contributor)

I was offline for a while, cinjon, and I'm not sure I understand now what you were trying to accomplish. If you still have this problem, let's take it offline and, if needed, open another issue (just so as not to prolong this one; it's becoming unreadable). Thanks!

@cinjon commented Jan 20, 2016

Hey Lukasz, thanks for the reply.
I've set this aside for a bit while working on other nets. I appreciate you coming back around to it, and I'll let you know when I look into this again.

@lukaszkaiser (Contributor)

As per my last comment -- it's not clear from this thread what the issue is any more. If you're having some issue, feel free to open a new one and please explain precisely. Also note that variable reuse is inherited on purpose -- the other design would break compositionality across modules. If you have a variable in the same module or function, why not reuse it by reference?
