
A bug on embedding_attention_seq2seq when output_projection is None? #4938

Closed
jihunchoi opened this issue Oct 13, 2016 · 6 comments

@jihunchoi
Contributor

jihunchoi commented Oct 13, 2016

When I use embedding_attention_seq2seq without passing an output_projection argument, the program crashes with a memory allocation error, even though the same model runs fine in other libraries and in my own implementation of attention seq2seq.
I don't encounter this problem when I pass an output_projection argument to the function explicitly (a sketch of this workaround follows).
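
For reference, here is a minimal sketch of that workaround, written against the TF 0.x-era tf.nn.seq2seq API; the vocabulary sizes, cell size, sequence length, and variable names are illustrative assumptions, not values from any real model.

```python
import tensorflow as tf

cell_size = 512
num_encoder_symbols = 30000
num_decoder_symbols = 40000
seq_len = 10

# One int32 placeholder per time step, as in the list-based seq2seq API.
encoder_inputs = [tf.placeholder(tf.int32, shape=[None]) for _ in range(seq_len)]
decoder_inputs = [tf.placeholder(tf.int32, shape=[None]) for _ in range(seq_len)]

cell = tf.nn.rnn_cell.GRUCell(cell_size)

# Passing output_projection explicitly keeps the cell unwrapped, so the
# attention mechanism operates on cell_size-dimensional outputs.
w = tf.get_variable("proj_w", [cell_size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])

outputs, state = tf.nn.seq2seq.embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=num_encoder_symbols,
    num_decoder_symbols=num_decoder_symbols,
    embedding_size=256,
    output_projection=(w, b))

# The projection is applied afterwards to obtain logits over the vocabulary.
logits = [tf.matmul(o, w) + b for o in outputs]
```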

I suspect the cause is the following: when output_projection is None, embedding_attention_seq2seq wraps the cell with OutputProjectionWrapper, and this wrapped cell emits outputs whose dimension equals the number of decoder symbols.
The wrapped cell is then passed to embedding_attention_decoder, and on to attention_decoder.
It is inside attention_decoder that the awkward memory allocation happens:

  1. Since the output_size argument is None, it is set to cell.output_size, which equals the number of decoder symbols (line 576).
  2. Consequently, cell_output at line 650 has a dimension proportional to the number of decoder symbols.
  3. Thus, at line 660, a very large matrix is created whose size is proportional to the square of the number of decoder symbols (see the back-of-the-envelope sketch after this list).
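
To make the scale of the problem concrete, here is a back-of-the-envelope estimate in plain Python; the vocabulary size, cell size, attention size, and the assumed shape of the linear layer at line 660 are illustrative assumptions.

```python
# Rough parameter count of the linear layer that mixes cell_output with the
# attention reads; the shapes below are assumptions based on the steps above.
num_decoder_symbols = 40000   # vocabulary size of the decoder
cell_size = 512               # hidden size of the underlying RNN cell
attn_size = 512               # size of one attention read

# With OutputProjectionWrapper, output_size == num_decoder_symbols, so the
# layer's weight matrix is roughly
# [num_decoder_symbols + attn_size, num_decoder_symbols].
wrapped = (num_decoder_symbols + attn_size) * num_decoder_symbols

# Without the wrapper, both dimensions stay proportional to the cell size.
unwrapped = (cell_size + attn_size) * cell_size

bytes_per_float = 4
print("wrapped:   ~%.1f GB of float32 weights" % (wrapped * bytes_per_float / 1e9))    # ~6.5 GB
print("unwrapped: ~%.1f MB of float32 weights" % (unwrapped * bytes_per_float / 1e6))  # ~2.1 MB
```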

According to the paper the implementation references, the attention mechanism should not depend on the number of decoder symbols.
So I think the implementation is wrong and should be corrected by passing the cell to attention_decoder without wrapping it in an OutputProjectionWrapper, and performing the projection afterwards.

If this is truly a bug and no one is working on it, I will submit a pull request, since it can be easily fixed, IMO.

@jart
Contributor

jart commented Oct 14, 2016

Thanks for reporting this issue and taking the time to really dig into what the code is doing. @lukaszkaiser would be the best person to comment on this.

jart assigned jart and lukaszkaiser and unassigned jart on Oct 14, 2016
@lukaszkaiser
Contributor

lukaszkaiser commented Oct 14, 2016

This is indeed a bug; it's exactly as you say and it shouldn't be like this. The problem is that there doesn't seem to be a backward-compatible way of correcting it. If we change the default, we'll break every model trained with the current function without output_projection (the shapes of the attention variables will change and old checkpoints won't load any more).

That's why, unless you have something in mind to make it backwards-compatible, I'd rather avoid correcting this. The current seq2seq module will be deprecated anyway, because it does static graph construction (we have while_loop now) and we're also moving away from the list-based API to a single-tensor one (with time as the first or second dimension). So it looks to me like correcting this is more trouble than it's worth.

By the way, the work on the new, dynamic seq2seq is happening in contrib.seq2seq, and it's happening on GitHub, led by alrojo -- see, for example, issue #4686. There is no attention decoder there yet, though ethancaballero has asked for one in #4761. So maybe you could sync and work with them to make a new, bug-free attention decoder for the new contrib.seq2seq?

Let me know what you think, thanks for catching this!

@jart
Contributor

jart commented Oct 14, 2016

@lukaszkaiser maybe we could write a one-line PR adding a comment to the code that links to this issue, so if anyone gets confused by this behavior in the future, they'll know it's been addressed?

@lukaszkaiser
Contributor

lukaszkaiser commented Oct 14, 2016

Doing now.

@jart
Contributor

jart commented Oct 14, 2016

The documentation change is now approved internally. It will be synced to GitHub soon, at which point this issue will be updated. It's always sad when bugs have to become features. I'm reminded of how the version number for TeX will become π when Knuth dies. But at least future users will be able to avoid this problem.

@jihunchoi
Contributor Author

jihunchoi commented Oct 15, 2016

Thanks for the clear answer!
It is very good news to me that TF will support variable-length sequences in the future seq2seq module.
I will look into it to see whether there are parts I can contribute.
