
DDPG/TD3 target_actor output clip #196


Description

@huxiao09

Problem Description

Hi! It seems that the output of the target_actor in DDPG/TD3 is directly clipped to fit the action range boundaries, without being multiplied by max_action. But in Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is applied to the final tanh layer of the actor network, so no clipping is needed. Have you ever tried the second implementation?

if global_step > args.learning_starts:
    data = rb.sample(args.batch_size)
    with torch.no_grad():
        next_state_actions = (target_actor(data.next_observations)).clamp(
            envs.single_action_space.low[0], envs.single_action_space.high[0]
        )
        qf1_next_target = qf1_target(data.next_observations, next_state_actions)
        next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (qf1_next_target).view(-1)

[1] https://github.com/sfujim/TD3
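
For reference, here is a minimal sketch of the second implementation, in the style of Fujimoto's TD3 actor [1]. The class and layer sizes below are illustrative, not the exact code from either repository; the point is only that max_action scales the tanh output inside the network, so the target actor's output is already inside the action bounds.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    # Fujimoto-style actor: the action bound is baked into the network by
    # scaling the tanh output with max_action, so the (target) actor's output
    # never leaves the valid range and no clamp on its raw output is needed.
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        # tanh bounds the output to [-1, 1]; multiplying by max_action
        # rescales it to [-max_action, max_action].
        return self.max_action * torch.tanh(self.l3(a))

With an actor like this, the clamp in the snippet above would only still be needed when noise is added to the target action (as in TD3's target policy smoothing) or when the action space is not symmetric around zero.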
