Problem Description
Hi! It seems that the output of target_actor in DDPG/TD3 is clipped directly to the action range boundaries, without being multiplied by max_action. In Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is instead applied as a scale on the actor's final tanh layer, so no clipping is needed. Have you ever tried the second implementation?
```python
if global_step > args.learning_starts:
    data = rb.sample(args.batch_size)
    with torch.no_grad():
        next_state_actions = (target_actor(data.next_observations)).clamp(
            envs.single_action_space.low[0], envs.single_action_space.high[0]
        )
        qf1_next_target = qf1_target(data.next_observations, next_state_actions)
        next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (qf1_next_target).view(-1)
```
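For context, a minimal sketch of the alternative being described: an actor whose final tanh output is scaled by max_action, in the style of Fujimoto's reference code. The class name, layer sizes, and hidden dimensions below are illustrative assumptions, not taken from CleanRL.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Sketch of an actor that scales its tanh output by max_action.

    Hidden sizes and naming are illustrative, not CleanRL's actual code.
    """

    def __init__(self, state_dim: int, action_dim: int, max_action: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Tanh(),  # bounds the raw output to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Scaling the tanh output keeps actions inside
        # [-max_action, max_action] without a separate .clamp() call.
        return self.max_action * self.net(state)
```

With this design the target action is already bounded, so the `.clamp(...)` in the snippet above becomes unnecessary for plain DDPG (TD3 still clamps after adding target policy noise).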