
Conversation


@CatherineSue CatherineSue commented Oct 29, 2018

This commit fixes policy entropy. It now supports the following three
options:
• maximum entropy (augmented reward)
• with a logli estimator (-logli(action|policy))
• with the gradient turned off (tf.stop_gradient(entropy))

See: #281
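A minimal sketch of how these three options could fit together, assuming a diagonal Gaussian policy. This is not the actual npo.py code; the flags use_logli_entropy, use_softplus_entropy, and stop_entropy_gradient simply mirror the constructor arguments discussed in the review below.

```python
import numpy as np
import tensorflow as tf


def entropy_term(mean, log_std, action,
                 use_logli_entropy=True,
                 use_softplus_entropy=True,
                 stop_entropy_gradient=True):
    """Per-timestep entropy bonus for a diagonal Gaussian policy (sketch)."""
    if use_logli_entropy:
        # Single-sample estimator: -log pi(a|s), whose expectation under the
        # policy equals the true entropy.
        z = (action - mean) / tf.exp(log_std)
        log_li = -0.5 * tf.reduce_sum(
            tf.square(z) + 2.0 * log_std + np.log(2.0 * np.pi), axis=-1)
        entropy = -log_li
    else:
        # Analytic entropy of a diagonal Gaussian.
        entropy = tf.reduce_sum(
            log_std + 0.5 * np.log(2.0 * np.pi * np.e), axis=-1)
    if use_softplus_entropy:
        # Keep the bonus non-negative.
        entropy = tf.nn.softplus(entropy)
    if stop_entropy_gradient:
        # Use the bonus purely as reward shaping; no gradient flows through it.
        entropy = tf.stop_gradient(entropy)
    return entropy


# Maximum-entropy option: augment the per-step rewards with the bonus, e.g.
# rewards = rewards + policy_ent_coeff * entropy_term(mean, log_std, action)
```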

@CatherineSue CatherineSue requested a review from a team as a code owner October 29, 2018 17:18

codecov bot commented Oct 29, 2018

Codecov Report

Merging #368 into master will increase coverage by 0.1%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #368     +/-   ##
=========================================
+ Coverage   64.62%   64.72%   +0.1%     
=========================================
  Files         212      212             
  Lines       13977    13985      +8     
=========================================
+ Hits         9032     9052     +20     
+ Misses       4945     4933     -12
Impacted Files Coverage Δ
garage/tf/algos/npo.py 95.02% <91.66%> (-0.32%) ⬇️
garage/tf/distributions/diagonal_gaussian.py 66.66% <0%> (-5%) ⬇️
garage/envs/mujoco/gather/gather_env.py 57.57% <0%> (+0.67%) ⬆️
.../theano/optimizers/conjugate_gradient_optimizer.py 74.05% <0%> (+1.26%) ⬆️
garage/envs/mujoco/maze/maze_env.py 82.21% <0%> (+1.44%) ⬆️
garage/theano/optimizers/first_order_optimizer.py 85.5% <0%> (+1.44%) ⬆️
garage/tf/distributions/categorical.py 66.66% <0%> (+1.51%) ⬆️
garage/tf/distributions/recurrent_categorical.py 79.16% <0%> (+4.16%) ⬆️
garage/envs/mujoco/point_env.py 77.77% <0%> (+11.11%) ⬆️

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e89ce7a...3589a48.


@ryanjulian ryanjulian left a comment


Did this fix the behavior where the entropy term has no effect?

-    policy_entropy = tf.reduce_mean(policy_entropy * i.valid_var)
+    policy_entropy = policy_entropy * i.valid_var

if self._stop_entropy_gradients:

stop_entropy_gradient
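To illustrate the change being reviewed above, here is a standalone sketch with made-up tensors (not the npo.py variables themselves): keeping the entropy per time step and masking it with the valid flags, instead of collapsing it with tf.reduce_mean, lets the bonus be added to each step's reward, and tf.stop_gradient turns it into pure reward shaping.

```python
import tensorflow as tf

# Hypothetical per-step entropies and rewards for two padded trajectories.
entropy = tf.constant([[0.5, 0.7, 0.9],
                       [0.4, 0.6, 0.8]])   # shape [batch, max_path_length]
valid = tf.constant([[1., 1., 0.],
                     [1., 1., 1.]])        # 1 = real step, 0 = padding
rewards = tf.constant([[1.0, 0.0, 0.0],
                       [0.5, 0.5, 0.0]])

policy_entropy = entropy * valid                    # per-step, padding zeroed
policy_entropy = tf.stop_gradient(policy_entropy)   # optional: no gradient
                                                    # flows through the bonus

policy_ent_coeff = 1e-2
augmented_rewards = rewards + policy_ent_coeff * policy_entropy
```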

policy=None,
policy_ent_coeff=1e-2,
use_softplus_entropy=True,
log_estimate_entropy=True,

use_logli_entropy

name="NPO",
policy=None,
policy_ent_coeff=1e-2,
use_softplus_entropy=True,

is this still needed?

Member Author

Yes


coveralls commented Oct 30, 2018

Coverage Status

Coverage increased (+0.09%) to 65.171% when pulling 3589a48 on fix_pol_ent into e89ce7a on master.

@ryanjulian ryanjulian requested a review from zhanpenghe October 30, 2018 18:06

@zhanpenghe zhanpenghe left a comment


Looks good to me.


@zhanpenghe zhanpenghe left a comment


Can you please check the example files and the benchmarking comments? I believe there should be some changes to the hyperparameters.

When not using the negative log likelihood as the entropy term, since you take out the mean now, the scale of the coefficient might be different too.
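A back-of-the-envelope illustration of this scale argument, with purely hypothetical numbers: once the mean is no longer taken, the bonus enters the return at every valid time step instead of once as an average, so a coefficient tuned for the old form can end up roughly path-length times too large.

```python
# Hypothetical numbers only, ignoring discounting.
policy_ent_coeff = 1e-2
mean_step_entropy = 0.7   # assumed average per-step entropy
path_length = 100         # assumed number of valid steps in a trajectory

old_term = policy_ent_coeff * mean_step_entropy                # one averaged term
new_term = policy_ent_coeff * mean_step_entropy * path_length  # summed over the path

print(old_term, new_term)  # 0.007 vs 0.7
```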

@ryanjulian

Do any of the examples even use entropy?


zhanpenghe commented Oct 30, 2018

policy_ent_coeff was set to 1e-2 by default before, so any example that does not explicitly set it to 0 would use entropy.


CatherineSue commented Oct 30, 2018

The default value of use_logli_entropy is True, so every related example and benchmark will be using the negative log likelihood as the entropy term unless it sets policy_ent_coeff to 0.
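For reference, a small numerical check (standalone NumPy, not garage code) that the negative log likelihood of sampled actions does estimate the policy entropy, i.e. E[-log pi(a|s)] = H(pi(.|s)), here for a one-dimensional Gaussian policy:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5
actions = rng.normal(mu, sigma, size=100_000)

# -log N(a; mu, sigma), averaged over sampled actions.
neg_logli = (0.5 * ((actions - mu) / sigma) ** 2
             + np.log(sigma) + 0.5 * np.log(2.0 * np.pi))
analytic_entropy = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

print(neg_logli.mean())   # ~0.726
print(analytic_entropy)   # ~0.726
```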

@CatherineSue CatherineSue force-pushed the fix_pol_ent branch 3 times, most recently from 3d832b6 to c8c02e8 Compare October 30, 2018 21:55
@CatherineSue
Member Author

I changed the default policy_ent_coeff to 0.0.


@zhanpenghe zhanpenghe left a comment


Ok. Please update the comments about benchmarking in the examples.

This commit fixes policy entropy. It now supports the following three
options:
• maximum entropy (augmented reward)
• with a logli estimator (-logli(action|policy))
• with the gradient turned off (tf.stop_gradient(entropy))
@CatherineSue CatherineSue merged commit 26c14a3 into master Oct 30, 2018
@zhanpenghe zhanpenghe deleted the fix_pol_ent branch October 31, 2018 17:58