Conjugate Gradient Optimization sometimes fails (with NaN parameters) #24

Closed · alexbeloi opened this issue Jun 30, 2016 · 7 comments

@alexbeloi
Contributor

In some of my experiments I sometimes get NaN parameters when training with the TRPO and TNPG algorithms. I traced this to the file containing ConjugateGradientOptimizer, where under some circumstances the value passed to np.sqrt at lines 168-170 is negative (specifically, descent_direction.dot(Hx(descent_direction)) is negative). That value defines initial_step_size, which then becomes NaN.

Is there any citation available for this initial step size?

The variable naming for the terms in descent_direction.dot(Hx(descent_direction)) suggests that this is an inner product with respect to a Hessian (which would be positive semi-definite), but I'm not sure that's the case.
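To illustrate the failure mode, here is a toy sketch (the matrix and numbers are made up; only the names mirror the optimizer code above):

```python
import numpy as np

max_constraint_val = 0.01  # the KL trust-region radius (delta)

def Hx(v):
    # Stand-in for the Hessian-vector product. A genuinely positive semi-definite
    # Hx would never yield a negative quadratic form; this one is deliberately
    # indefinite to reproduce the symptom.
    H = np.array([[2.0, 0.0],
                  [0.0, -3.0]])
    return H.dot(v)

descent_direction = np.array([0.1, 1.0])
sHs = descent_direction.dot(Hx(descent_direction))           # negative here
initial_step_size = np.sqrt(2.0 * max_constraint_val / sHs)  # sqrt of a negative -> nan
print(sHs, initial_step_size)                                # the step size is nan
```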

@dementrock
Member

Hi @alexbeloi, the step size is computed according to the TRPO paper: https://arxiv.org/pdf/1502.05477v4.pdf. You can find the formula in Appendix C.
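For reference, the maximal step length given there (with s the search direction, A the Hessian of the average KL used for the Hx products, and delta the KL constraint) is:

```latex
\beta = \sqrt{\frac{2\delta}{s^\top A s}}
```

The expression under the square root is negative exactly when the quadratic form s^T A s is, which is the symptom reported above.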

How negative is the computed value of descent_direction.dot(Hx(descent_direction)), and could you describe your setup in more detail? This can happen if there is a bug such that the mean KL is nonzero (or not sufficiently close to zero) before taking the step. We've also observed it occasionally with recurrent networks, although adjusting the nonlinearity seems to have solved it.

@alexbeloi
Contributor Author

Hi @dementrock, it appears that the mean KL is nonzero before taking the step, due to something I'm doing. This issue came up while debugging the ISSampler with TRPO.

What I'm doing is taking (off-policy) stored paths, computing the agent_infos for those paths with respect to the current policy using _, agent_infos = policy.get_action(observations), and then passing those agent_infos to old_dist_info_vars_list in the optimizer.

I expected the agent_infos computed this way to be identical to dist_info_vars = policy.dist_info_sym(obs_var, state_info_vars) as evaluated by the optimizer before taking the step, so that kl = dist.kl_sym(old_dist_info_vars, dist_info_vars) would be zero before the step, but this isn't the case.

Is there a difference between the agent_infos computed from _, agent_infos = policy.get_action(observations) and the evaluation of dist_info_vars = policy.dist_info_sym(obs_var, state_info_vars) with obs_var set to observations?
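Here is the kind of comparison I mean, as a sketch only (the cartpole setup is purely illustrative, and the compiled Theano function just mimics how the optimizer evaluates dist_info_sym; none of this is code from the repo):

```python
import numpy as np
import theano

from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy

env = normalize(CartpoleEnv())
policy = GaussianMLPPolicy(env_spec=env.spec)
observations = np.array([env.reset() for _ in range(8)])

# Numeric distribution infos, the way they get stored in the sampled paths.
_, agent_infos = policy.get_actions(observations)

# Symbolic distribution infos, evaluated the way the optimizer sees them.
obs_var = env.observation_space.new_tensor_variable('obs', extra_dims=1)
keys = policy.distribution.dist_info_keys  # e.g. ["mean", "log_std"]
dist_info_sym = policy.dist_info_sym(obs_var, dict())
f_dist = theano.function([obs_var], [dist_info_sym[k] for k in keys],
                         allow_input_downcast=True)

# For the keys the distribution actually uses, the two should agree elementwise,
# so the mean KL fed to the optimizer is ~0 before the step is taken.
for key, value in zip(keys, f_dist(observations)):
    print(key, np.max(np.abs(np.asarray(agent_infos[key]) - value)))
```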

@alexbeloi
Contributor Author

I feel there is some confusion on my part. Where does the NPO algorithm get values for old_dist_info_vars and dist_info_vars from?

@alexbeloi
Contributor Author

Oh wow, super silly bug on my part. The last line of is_sampler.py should be return samples, not return paths. This was the root of the issue.
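That is, paraphrasing the end of is_sampler.py (surrounding code omitted):

```diff
-    return paths
+    return samples
```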

@dementrock
Member

@alexbeloi Re difference between agent_infos and evaluating dist_info_vars: agent_infos may contain more entries than dist_info_vars, but for the common keys their values should be the same. Otherwise there is a bug somewhere.

Does replacing return paths with return samples solve the NaN issue?

@alexbeloi
Contributor Author

@dementrock yes, that one-line fix solves the NaN issue. I made a pull request with the patch and a (now working) example of TRPO with the ISSampler.

@dementrock
Member

Awesome, thanks!
