
What does the comment mean? #1

Open · ghost opened this issue Aug 10, 2015 · 7 comments

Comments

@ghost commented Aug 10, 2015

The comment on pseudo_cost() says: "This cost should have the same gradient but hopefully theano will use a more stable implementation of it." What does this actually mean? Is the current implementation not stable?

@skaae (Owner) commented Aug 10, 2015

Not entirely sure. You'll have to ask the original author mentioned in the docs.

@mpezeshki

As far as I remember, the only difference is normalization. So one can get the gradients from pseudo_cost and use cost for monitoring purposes.

@kshmelkov

Honestly, I can't understand the purpose of that function. I couldn't find any equivalent in the original code. As far as I understand, pseudo_cost computes the gradients manually instead of relying on Theano's autodiff. It seems that in the example from @mohammadpz's repo, only cost is used for both training and monitoring.

@pbrakel commented Aug 17, 2015

Hey, sorry for the unclear comment. I think I wrote that more as a note to myself; it refers to the fact that I feared the gradient might still be unstable without using the skip_softmax option (which turned out to be true). The pseudo_cost function computes the gradient manually first and then combines it with the input values to obtain a score that fools Theano into retrieving that gradient again.

As an example, the gradient of the categorical cross entropy with respect to the softmax output is something like $-t/y$, and after multiplying it with the softmax derivative it becomes $y - t$, where $y$ is the softmax output and $t$ is your desired label in one-hot coding. The first of these two gradients is numerically quite risky due to the possible division by zero, so it would be nice if we could skip it and get to $y - t$ directly. Knowing this, we can simply compute $y - t$ by hand, but we still need to give Theano some cost to differentiate, so that it will apply the chain rule and multiply that gradient with the other derivatives it computes. By substituting $y - t$ with some matrix/vector $a$ that we treat as constant (i.e., we don't try to propagate gradients through it), we can write $L = \mathrm{sum}(a \cdot o)$, where $o$ is the output before it goes into the softmax. Theano will conclude that the gradient of this cost with respect to $o$ is $a = y - t$, even though $L$ will most likely be very different from the actual cross entropy and doesn't need to be positive.
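To make that concrete, here is a minimal sketch of that substitution in plain Theano (the variable names are just for illustration; they are not the ones used in pseudo_cost):

```python
import theano
import theano.tensor as T

o = T.matrix('o')  # pre-softmax outputs, shape (batch, classes)
t = T.matrix('t')  # one-hot targets, same shape

y = T.nnet.softmax(o)

# Hand-computed, stable gradient of the cross entropy w.r.t. o, wrapped so
# Theano treats it as a constant and does not differentiate through it.
a = theano.gradient.disconnected_grad(y - t)

# Surrogate cost: its value is meaningless, but its gradient w.r.t. o is a = y - t.
surrogate = T.sum(a * o)

grad_o = theano.grad(surrogate, o)  # equals y - t, with no division by y anywhere
```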

This is what pseudo_cost tries to do for CTC, because the original cost was numerically unstable. If my reasoning is wrong, please let me know, but so far we've gotten decent results with this CTC implementation. I fully admit it's not the most beautiful solution; it would probably be nicer to write a Theano op that does this, but I haven't found the time for that yet.

@kshmelkov

It makes much more sense now, thank you. However, I don't see how it is specific to the CTC cost. If it is related only to the softmax/cross-entropy, it would be a problem for almost any convnet implementation. Are you suggesting that Theano's backpropagation of the categorical cross-entropy is numerically unstable in general?

@pbrakel commented Aug 17, 2015

I remember some implementations of it being more reliable than others. The one taking integer indices seems more stable than the one that expects one-hot coding, if I remember correctly. The problem is also that our batch version of CTC needs to propagate zeros in the log domain, which leads to computations like inf - inf or inf * 0.
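For instance, a tiny NumPy illustration (not the actual CTC code) of how log-domain zeros turn into NaNs:

```python
import numpy as np

p = np.array([0.5, 0.0])      # a path probability that is exactly zero
log_p = np.log(p)             # [-0.693..., -inf]

print(log_p * np.array([1.0, 0.0]))  # -inf * 0 -> nan
print(log_p[1] - log_p[1])           # -inf - (-inf), i.e. inf - inf -> nan
```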

@kshmelkov

Well, I have done some experiments on my tasks. I agree that pseudo_cost behaves somewhat more stably, but I couldn't find a pattern (i.e. the effect is inconsistent). For my tasks, rmsprop and adadelta are stable enough even when using cost.

Anyway, I suggest that this should be solved at the Theano level. As I said, log(softmax(.)) is a very common function, so it has to be handled correctly. I have done some googling; this problem has been noticed and reported in Theano upstream a few times already: Theano/Theano#2944, Theano/Theano#2781, mila-iqia/blocks#654. It also seems that Theano contains a related optimization, but I don't understand its semantics (it is buried in cuDNN). Somebody mentioned very different stability depending on mode=FAST_RUN or FAST_COMPILE (which makes sense if it is just an optimization).

What I took away from these discussions is that Theano can optimize log(softmax(.)) (on the CPU as well), but sometimes doesn't, presumably because a scan sits between the two operators. @pbrakel, might that be the case in CTC?
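For reference, the usual manual workaround is to compute log-softmax directly with the max-shift / log-sum-exp identity instead of writing log(softmax(.)) and hoping the optimization fires (a sketch assuming a 2D (batch, classes) input):

```python
import theano.tensor as T

def log_softmax(x):
    # x: (batch, classes) pre-softmax activations
    xdev = x - x.max(axis=1, keepdims=True)
    return xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))

# Naive, potentially unstable version for comparison:
# T.log(T.nnet.softmax(x)) can underflow to log(0) = -inf.
```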
