A bug on embedding_attention_seq2seq when output_projection is None? #4938
Comments
Thanks for reporting this issue and taking the time to really dig into what the code is doing. @lukaszkaiser would be the best person to comment on this.
This is indeed a bug; it's exactly as you say and it shouldn't be like this. The problem is that there doesn't seem to be a backward-compatible way of correcting it. If we change the default, we'll break every model trained with the current function without output_projection (because the shapes of the attention variables will change and old checkpoints won't load any more). That's why, unless you have something in mind to make it backwards-compatible, I'd rather avoid correcting this.

The current seq2seq module will be deprecated anyway, because it does static graph construction (we have while_loop now), and we're also moving away from the list-based API to a single-tensor one (with time being the first or second dimension). So it looks to me like correcting this is more trouble than it's worth.

By the way, the work on the new, dynamic seq2seq is happening in contrib.seq2seq and it's happening on GitHub, led by alrojo -- see, for example, issue #4686. There is no attention decoder there yet, though ethancaballero has asked for one in #4761. So maybe you could sync and work with them to make a new, bug-free attention decoder for the new contrib.seq2seq? Let me know what you think, and thanks for catching this!
@lukaszkaiser maybe we could write a one-line PR adding a comment to the code that links to this issue, so if anyone gets confused by this behavior in the future, they'll know it's been addressed?
Doing now. |
The documentation change is now approved internally. It will be synced to GitHub soon, at which point this issue will be updated. It's always sad when bugs have to become features. I'm reminded of how the version number for TeX will become π when Knuth dies. But at least future users will be able to avoid this problem. |
Thanks for the clear answer! |
When I use embedding_attention_seq2seq without giving an output_projection argument, the program crashes with a memory allocation error, even though the same model ran fine in other libraries and in my own implementation of attention seq2seq. I don't suffer this problem when I give an output_projection argument to the function explicitly.

I suspect it is caused by the following: when output_projection is None, embedding_attention_seq2seq wraps the cell with an OutputProjectionWrapper. This wrapped cell emits outputs whose dimension matches the number of decoder symbols. The wrapped cell is then passed to embedding_attention_decoder, and on to attention_decoder.

Here, in the attention_decoder function, the awkward memory allocation happens. When the output_size argument is None, it is set to cell.output_size (line 576), which in this case is identical to the number of decoder symbols, so the cell_output of line 650 has dimensions proportional to the number of decoder symbols. According to the paper the implementation is referencing, the attention mechanism should not depend on the number of decoder symbols.

So I think the implementation is somewhat wrong and should be corrected by passing the cell without wrapping it in an OutputProjectionWrapper and performing the projection afterwards. If this is truly a bug and no one is working on it, I will submit a pull request, since it can be easily fixed, IMO.
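For readers hitting the same memory blow-up, here is a minimal sketch of the workaround described above: passing output_projection explicitly so the cell is not wrapped. It is written against the TF 0.x-era tf.nn.seq2seq API that this thread discusses; the vocabulary size, cell size, sequence length, and placeholder setup are illustrative assumptions, not values from the original report.

```python
import tensorflow as tf

# Illustrative sizes only (not from the original report).
num_encoder_symbols = 40000
num_decoder_symbols = 40000
embedding_size = 512
seq_len = 10

cell = tf.nn.rnn_cell.GRUCell(512)

# Token-id inputs as lists of int32 tensors (the list-based API discussed above).
encoder_inputs = [tf.placeholder(tf.int32, shape=[None]) for _ in range(seq_len)]
decoder_inputs = [tf.placeholder(tf.int32, shape=[None]) for _ in range(seq_len)]

# Explicit projection variables: passing them as output_projection keeps the
# decoder cell unwrapped, so attention is computed over 512-dim cell outputs
# instead of num_decoder_symbols-dim ones.
w = tf.get_variable("proj_w", [cell.output_size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])

outputs, state = tf.nn.seq2seq.embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=num_encoder_symbols,
    num_decoder_symbols=num_decoder_symbols,
    embedding_size=embedding_size,
    output_projection=(w, b))

# The projection to vocabulary size is applied after the decoder, which is
# what the report suggests the library itself should do internally.
logits = [tf.nn.xw_plus_b(o, w, b) for o in outputs]
```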