DDPG on-policy? #6

hashbangCoder · 2016-05-16T03:00:04Z

Hi,

In ddpg.py, I assume you're following this paper by Silver et al. If so, your algorithm doesn't seem to mirror theirs. Here, you are going on-policy for the actor. But DDPG is an off-policy approach because exploration noise is added to the actions (which I can't seem to find in your code). If it's completely deterministic, there is no scope for exploration. Unless you're following a stochastic policy, in which case doesn't that defeat the purpose of DDPG?

Am I missing something?

dementrock · 2016-05-16T04:52:56Z

Hi @hashbangCoder, in the DDPG implementation we compute the on-policy Q value, which provides the gradient signal to improve the policy, but trajectories are sampled off-policy. The DDPG class accepts a parameter called es, short for exploration strategy. It is used to generate an off-policy action in this line: https://github.com/rllab/rllab/blob/master/rllab/algos/ddpg.py#L224
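
To make the distinction concrete, here is a minimal toy sketch (not the rllab code). It assumes a scalar linear policy and a fixed, hand-written critic; `policy`, `q_value`, `dq_daction` and the replay list are all hypothetical names made up for illustration. In a real DDPG the critic is learned from the stored transitions, noisy actions included; here it is fixed so that only the actor side is shown. The point is that the action sent to the environment is the policy output plus noise, while the actor gradient is taken through Q evaluated at the policy's own action.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(theta, state):
    """Deterministic policy mu(s) = theta * s (scalar, illustration only)."""
    return theta * state

def q_value(state, action):
    """Toy fixed critic: largest when action == 2 * state."""
    return -(action - 2.0 * state) ** 2

def dq_daction(state, action):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)

theta = 0.0
replay = []

# Off-policy sampling: the behaviour action is mu(s) plus exploration noise.
for _ in range(200):
    state = rng.uniform(-1.0, 1.0)
    behaviour_action = policy(theta, state) + rng.normal(scale=0.3)
    replay.append((state, behaviour_action))

# Actor update: Q is evaluated at the current policy's action mu(s),
# not at the noisy action stored in the replay buffer.
for _ in range(500):
    state, _stored_action = replay[rng.integers(len(replay))]
    on_policy_action = policy(theta, state)
    # Chain rule: dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = state.
    grad = dq_daction(state, on_policy_action) * state
    theta += 0.1 * grad

print(theta)
```

With this toy critic the best policy is mu(s) = 2s, so `theta` should move from 0 toward 2 even though every sampled action was perturbed by noise.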

Currently we implement two exploration strategies: one is the Brownian motion noise mentioned in the paper, and the other is Gaussian noise. You can find them here: https://github.com/rllab/rllab/tree/master/rllab/exploration_strategies
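
As a rough illustration of the two kinds of noise, here is a sketch under stated assumptions. It takes "Brownian motion noise" to mean temporally correlated noise such as an Ornstein-Uhlenbeck process, which is an assumption since the thread does not name the process. The class names, the `get_action` signature and the default parameters are hypothetical and are not rllab's API.

```python
import numpy as np

class GaussianNoise:
    """Hypothetical strategy: independent Gaussian noise at every step."""

    def __init__(self, sigma=0.1, seed=0):
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def get_action(self, policy_action):
        noise = self.rng.normal(scale=self.sigma, size=np.shape(policy_action))
        return policy_action + noise

class OrnsteinUhlenbeckNoise:
    """Hypothetical strategy: temporally correlated, mean-reverting noise."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.3, seed=0):
        self.dim, self.mu, self.theta, self.sigma = dim, mu, theta, sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the noise process at its mean, e.g. at episode start.
        self.state = np.full(self.dim, self.mu, dtype=float)

    def get_action(self, policy_action):
        # Pull the noise state back toward mu, then add a random kick.
        drift = self.theta * (self.mu - self.state)
        kick = self.sigma * self.rng.normal(size=self.dim)
        self.state = self.state + drift + kick
        return policy_action + self.state

# Usage: the deterministic policy output is perturbed before acting.
mu_of_s = np.array([0.5, -0.2])
print(GaussianNoise().get_action(mu_of_s))
print(OrnsteinUhlenbeckNoise(dim=2).get_action(mu_of_s))
```

Either object plays the role of the behaviour policy: the deterministic actor stays noise-free, and only the action passed to the environment is perturbed.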

hashbangCoder · 2016-05-16T15:19:37Z

Sheesh, can't believe I missed that. My bad. Sorry.

hashbangCoder closed this as completed May 16, 2016
