
Generator converging to wrong output #12

Open

sebastianGehrmann opened this issue Jun 20, 2017 · 1 comment

@sebastianGehrmann

Hey Tao,

I am trying to implement your rationale model in PyTorch right now, and I keep running into the problem that after a couple of iterations z becomes all ones. This obviously makes the encoder quite strong, but it does not do what I want.

The generator's loss function is cost(x, y, z) * log p(z|x). While the first term is large, log p(z|x) becomes zero (since the model learns to predict all ones with probability 1, and log(1) = 0). Therefore, the overall loss (and its gradient) for the generator becomes zero, leading to the all-ones phenomenon.
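
To make the failure concrete, here is a minimal self-contained PyTorch sketch of what I mean (the shapes and the constant cost are placeholders, not my actual model):

```python
import torch

batch, seq_len = 4, 10
logits = torch.full((batch, seq_len), 8.0, requires_grad=True)  # generator outputs
probs = torch.sigmoid(logits)                                   # P(z_i = 1 | x) ~ 0.9997

z = torch.bernoulli(probs).detach()          # sampled rationale mask (all ones w.h.p.)
logpz = (z * torch.log(probs) + (1 - z) * torch.log1p(-probs)).sum(dim=1)

cost = torch.full((batch,), 5.0)             # stand-in for cost(x, y, z), held constant
gen_loss = (cost * logpz).mean()             # REINFORCE-style generator objective

gen_loss.backward()
print(logpz.detach())                        # ~0, because log(1) = 0
print(logits.grad.abs().max())               # ~0: d logpz / d logit = (z - p) -> 0,
                                             # so the generator gets almost no signal
```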

How did you address this in your code?

@taolei87 (Owner) commented Jun 20, 2017

Hi @sebastianGehrmann

The issue is probably due to one of the following reasons:

  1. The cost function includes a "sparsity" regularization that penalizes selecting too many ones. In my current implementation, the weight of this regularization (--sparsity) has to be carefully tuned so that the model selects neither all ones nor all zeros (a rough sketch follows after this list).

    I monitored the % of ones on the training and dev sets, and found a reasonable value range for the beer review dataset.

  2. Since the learning procedure estimates the gradient via REINFORCE, the variance of the gradient is high and the model sometimes suddenly "jumps" to the bad optimum of selecting all ones or all zeros. I saw this more often for the dependent-selection version. To alleviate this, I used a larger batch size of 256 and a smaller initial learning rate. The code also monitors the cost value on the train and dev sets: if the cost jumps after one epoch, I revert the parameter updates of that epoch and halve the learning rate. See this.

    For general REINFORCE and reinforcement learning, there are more principled ways of reducing the gradient variance. One is the "baseline trick"; see Jiwei's follow-up paper (page 8). The sketch after this list includes a simple version of it.
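
A rough PyTorch sketch of both points, assuming hypothetical names for the cost, the sampled mask z, and log p(z|x) (the actual repo code is in Theano and differs in the details and values):

```python
import torch

def generator_loss(cost, z, logpz, baseline, sparsity_weight=3e-4):
    """cost:     (batch,) encoder cost per example
       z:        (batch, seq_len) sampled 0/1 rationale mask
       logpz:    (batch,) log p(z|x) under the generator
       baseline: running scalar estimate of the expected cost"""
    # 1. Sparsity regularization: penalize the number of selected tokens so
    #    that selecting all ones is no longer the cheapest solution.
    zsum = z.sum(dim=1)
    full_cost = cost + sparsity_weight * zsum

    # Monitor the fraction of ones (on train and dev) to tune --sparsity.
    selection_rate = z.float().mean().item()

    # 2. Baseline trick: subtract a baseline from the cost before multiplying
    #    with log p(z|x); the expected gradient is unchanged but its variance
    #    is reduced.
    advantage = (full_cost - baseline).detach()
    loss = (advantage * logpz).mean()

    # Keep the baseline as an exponential moving average of the observed cost.
    new_baseline = 0.9 * baseline + 0.1 * full_cost.mean().item()
    return loss, new_baseline, selection_rate
```

The learning-rate halving and the "redo the epoch if the dev cost jumps" logic would live in the training loop, outside a function like this.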

Hope this can help!
