
What does the comment mean? #1

Open · ghost opened this issue Aug 10, 2015 · 7 comments

Comments

@ghost commented Aug 10, 2015

The comment on pseudo_cost() says: "This cost should have the same gradient but hopefully theano will use a more stable implementation of it." What does this actually mean? Is the current implementation not stable?

@skaae (Owner) commented Aug 10, 2015

Not entirely sure. You'll have to ask the original author mentioned in the docs.

@mpezeshki

As far as I remember, the only difference is normalization. So one can get the gradients from pseudo_cost and use cost for monitoring purposes.

@kshmelkov

Honestly, I can't understand the purpose of that function. I couldn't find any equivalent in the original code. As far as I understand, pseudo_cost computes the gradients manually instead of relying on Theano's autodiff. It seems that in the example from @mohammadpz's repo, only cost is used for both training and monitoring.

@pbrakel commented Aug 17, 2015

Hey, sorry for the unclear comment. I think I wrote that more as a note to myself; it refers to the fact that I feared the gradient might still be unstable without using the skip_softmax option (which turned out to be true). The pseudo_cost function computes the gradient manually first and then combines it with the input values to obtain a score that fools Theano into retrieving that gradient again.

As an example, the gradient of the categorical cross entropy with respect to the softmax output is something like $-t/y$, and after multiplying it with the softmax derivative it becomes $y - t$, where $y$ is the softmax output and $t$ is your desired label in one-hot coding. The first of these two gradients is numerically quite risky due to the possible division by zero, so it would be nice if we could skip it and get to $y - t$ directly. Knowing this, we can simply compute $y - t$ by hand, but we still need to give Theano some cost to differentiate, so that it will apply the chain rule and multiply that gradient with the other derivatives it computes. By substituting $y - t$ with some matrix/vector $a$ that we treat as constant (i.e., we don't try to propagate gradients through it), we can write $L = \mathrm{sum}(a \cdot o)$, where $o$ is the output before it goes into the softmax. Theano will conclude that the gradient of this cost with respect to $o$ is $a = y - t$, even though $L$ will most likely be very different from the actual cross entropy and doesn't need to be positive.
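To make that concrete, here is a minimal sketch of that substitution in plain Theano (the variable names are just for illustration; they are not the ones used in pseudo_cost):

```python
import theano
import theano.tensor as T

o = T.matrix('o')  # pre-softmax outputs, shape (batch, classes)
t = T.matrix('t')  # one-hot targets, same shape

y = T.nnet.softmax(o)

# Hand-computed, stable gradient of the cross entropy w.r.t. o, wrapped so
# Theano treats it as a constant and does not differentiate through it.
a = theano.gradient.disconnected_grad(y - t)

# Surrogate cost: its value is meaningless, but its gradient w.r.t. o is a = y - t.
surrogate = T.sum(a * o)

grad_o = theano.grad(surrogate, o)  # equals y - t, with no division by y anywhere
```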

This is what pseudo_cost tries to do for CTC, because the original cost was numerically unstable. If my reasoning is wrong, please let me know, but so far we've gotten decent results with this CTC implementation. I fully admit it's not the most beautiful solution; it would probably be nicer to write a Theano op that does this, but I haven't found the time for that yet.

@kshmelkov

It makes much more sense now, thank you. However, I don't see how it is specific to the CTC cost. If it is related only to the softmax/cross-entropy, it would be a problem for almost any convnet implementation. Are you suggesting that Theano's backpropagation of the categorical cross-entropy is numerically unstable in general?

@pbrakel commented Aug 17, 2015

I remember some implementations of it being more reliable than others. The one taking integer indices seems more stable than the one that expects one-hot coding, if I remember correctly. The problem is also that our batch version of CTC needs to propagate zeros in the log domain, which leads to computations like inf - inf or inf * 0.
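For instance, a tiny NumPy illustration (not the actual CTC code) of how log-domain zeros turn into NaNs:

```python
import numpy as np

p = np.array([0.5, 0.0])      # a path probability that is exactly zero
log_p = np.log(p)             # [-0.693..., -inf]

print(log_p * np.array([1.0, 0.0]))  # -inf * 0 -> nan
print(log_p[1] - log_p[1])           # -inf - (-inf), i.e. inf - inf -> nan
```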

@kshmelkov

Well, I have done some experiments on my tasks. I agree that pseudo_cost behaves somewhat more stably, but I couldn't find a pattern (i.e. the effect is inconsistent). For my tasks, rmsprop and adadelta are stable enough even when using cost.

Anyway, I suggest that this should be solved at the Theano level. As I said, log(softmax(.)) is a very common function, so it has to be handled correctly. I have done some googling; this problem has been noticed and reported in Theano upstream a few times already: Theano/Theano#2944, Theano/Theano#2781, mila-iqia/blocks#654. It also seems that Theano contains a related optimization, but I don't understand its semantics (it is buried in cuDNN). Somebody mentioned very different stability depending on mode=FAST_RUN or FAST_COMPILE (which makes sense if it is just an optimization).

What I took away from these discussions is that Theano can optimize log(softmax(.)) (on the CPU as well), but sometimes doesn't, presumably because a scan sits between the two operators. @pbrakel, might that be the case in CTC?
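For reference, the usual manual workaround is to compute log-softmax directly with the max-shift / log-sum-exp identity instead of writing log(softmax(.)) and hoping the optimization fires (a sketch assuming a 2D (batch, classes) input):

```python
import theano.tensor as T

def log_softmax(x):
    # x: (batch, classes) pre-softmax activations
    xdev = x - x.max(axis=1, keepdims=True)
    return xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))

# Naive, potentially unstable version for comparison:
# T.log(T.nnet.softmax(x)) can underflow to log(0) = -inf.
```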
