
A bug on embedding_attention_seq2seq when output_projection is None? #4938

Closed
jihunchoi opened this issue Oct 13, 2016 · 6 comments

@jihunchoi
Contributor

jihunchoi commented Oct 13, 2016

When I use embedding_attention_seq2seq without passing an output_projection argument, the program crashes with a memory allocation error, even though the same model runs fine in other libraries and in my own implementation of attention seq2seq.
I don't encounter this problem when I pass an output_projection argument to the function explicitly (a sketch of this workaround follows).
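
For reference, here is a minimal sketch of that workaround, written against the TF 0.x-era tf.nn.seq2seq API; the vocabulary sizes, cell size, sequence length, and variable names are illustrative assumptions, not values from any real model.

```python
import tensorflow as tf

cell_size = 512
num_encoder_symbols = 30000
num_decoder_symbols = 40000
seq_len = 10

# One int32 placeholder per time step, as in the list-based seq2seq API.
encoder_inputs = [tf.placeholder(tf.int32, shape=[None]) for _ in range(seq_len)]
decoder_inputs = [tf.placeholder(tf.int32, shape=[None]) for _ in range(seq_len)]

cell = tf.nn.rnn_cell.GRUCell(cell_size)

# Passing output_projection explicitly keeps the cell unwrapped, so the
# attention mechanism operates on cell_size-dimensional outputs.
w = tf.get_variable("proj_w", [cell_size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])

outputs, state = tf.nn.seq2seq.embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=num_encoder_symbols,
    num_decoder_symbols=num_decoder_symbols,
    embedding_size=256,
    output_projection=(w, b))

# The projection is applied afterwards to obtain logits over the vocabulary.
logits = [tf.matmul(o, w) + b for o in outputs]
```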

I suspect the cause is the following: when output_projection is None, embedding_attention_seq2seq wraps the cell with OutputProjectionWrapper, and this wrapped cell emits outputs whose dimension equals the number of decoder symbols.
The wrapped cell is then passed to embedding_attention_decoder, and on to attention_decoder.
It is inside attention_decoder that the awkward memory allocation happens:

  1. Since the output_size argument is None, it is set to cell.output_size, which equals the number of decoder symbols (line 576).
  2. Consequently, cell_output at line 650 has a dimension proportional to the number of decoder symbols.
  3. Thus, at line 660, a very large matrix is created whose size is proportional to the square of the number of decoder symbols (see the back-of-the-envelope sketch after this list).
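
To make the scale of the problem concrete, here is a back-of-the-envelope estimate in plain Python; the vocabulary size, cell size, attention size, and the assumed shape of the linear layer at line 660 are illustrative assumptions.

```python
# Rough parameter count of the linear layer that mixes cell_output with the
# attention reads; the shapes below are assumptions based on the steps above.
num_decoder_symbols = 40000   # vocabulary size of the decoder
cell_size = 512               # hidden size of the underlying RNN cell
attn_size = 512               # size of one attention read

# With OutputProjectionWrapper, output_size == num_decoder_symbols, so the
# layer's weight matrix is roughly
# [num_decoder_symbols + attn_size, num_decoder_symbols].
wrapped = (num_decoder_symbols + attn_size) * num_decoder_symbols

# Without the wrapper, both dimensions stay proportional to the cell size.
unwrapped = (cell_size + attn_size) * cell_size

bytes_per_float = 4
print("wrapped:   ~%.1f GB of float32 weights" % (wrapped * bytes_per_float / 1e9))    # ~6.5 GB
print("unwrapped: ~%.1f MB of float32 weights" % (unwrapped * bytes_per_float / 1e6))  # ~2.1 MB
```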

According to the paper the implementation references, the attention mechanism should not depend on the number of decoder symbols.
So I think the implementation is wrong and should be corrected by passing the cell to attention_decoder without wrapping it in an OutputProjectionWrapper, and performing the projection afterwards.

If this is truly a bug and no one is working on it, I will submit a pull request, since it can be easily fixed, IMO.

@jart
Contributor

jart commented Oct 14, 2016

Thanks for reporting this issue and taking the time to really dig into what the code is doing. @lukaszkaiser would be the best person to comment on this.

jart assigned jart and lukaszkaiser and unassigned jart on Oct 14, 2016
@lukaszkaiser
Contributor

lukaszkaiser commented Oct 14, 2016

This is indeed a bug; it's exactly as you say and it shouldn't be like this. The problem is that there doesn't seem to be a backward-compatible way of correcting it. If we change the default, we'll break every model trained with the current function without output_projection (the shapes of the attention variables will change and old checkpoints won't load any more).

That's why, unless you have something in mind to make it backwards-compatible, I'd rather avoid correcting this. The current seq2seq module will be deprecated anyway, because it does static graph construction (we have while_loop now) and we're also moving away from the list-based API to a single-tensor one (with time as the first or second dimension). So it looks to me like correcting this is more trouble than it's worth.

By the way, the work on the new, dynamic seq2seq is happening in contrib.seq2seq, and it's happening on GitHub, led by alrojo -- see, for example, issue #4686. There is no attention decoder there yet, though ethancaballero has asked for one in #4761. So maybe you could sync and work with them to make a new, bug-free attention decoder for the new contrib.seq2seq?

Let me know what you think, thanks for catching this!

@jart
Contributor

jart commented Oct 14, 2016

@lukaszkaiser maybe we could write a one-line PR adding a comment to the code that links to this issue, so if anyone gets confused by this behavior in the future, they'll know it's been addressed?

@lukaszkaiser
Contributor

lukaszkaiser commented Oct 14, 2016

Doing now.

@jart
Contributor

jart commented Oct 14, 2016

The documentation change is now approved internally. It will be synced to GitHub soon, at which point this issue will be updated. It's always sad when bugs have to become features. I'm reminded of how the version number for TeX will become π when Knuth dies. But at least future users will be able to avoid this problem.

@jihunchoi
Contributor Author

jihunchoi commented Oct 15, 2016

Thanks for the clear answer!
It is very good news to me that TF will support variable-length sequences in the future seq2seq module.
I will look into it to see whether there are parts I can contribute.
