# Twin delayed DDPG

Now, we will look at another actor-critic algorithm, twin delayed DDPG, known
as TD3. TD3 is an improvement on (and essentially a successor to) the DDPG
algorithm we just covered.

In the previous section, we learned how DDPG uses a deterministic policy to
work on the continuous action space. DDPG has several advantages and has been
successfully used in a variety of continuous action space environments.

We saw that DDPG is an actor-critic method in which the actor is a policy
network that searches for the optimal policy, while the critic evaluates the policy
produced by the actor by estimating the Q function using a DQN.

One of the problems with DDPG is that the critic tends to overestimate the target
Q value. This matters because the policy is improved based on the Q value given by
the critic: when that Q value carries an approximation error, training the policy
becomes unstable and the policy may converge to a local optimum.

To combat this, TD3 introduces three key features:
1. Clipped double Q learning
2. Delayed policy updates
3. Target policy smoothing

First, we will understand how TD3 works intuitively, and then we will look at the
algorithm in detail.


## Key features of TD3

TD3 is essentially the same as DDPG, except that it adds three features to mitigate
the problems in DDPG. In this section, we build a basic understanding of each of
them. The three key features of TD3 are:

__Clipped double Q learning:__ Instead of using one critic network, we use two
main critic networks to compute the Q value and also use two target critic
networks to compute the target value.
We compute two target Q values using the two target critic networks and use the
smaller of the two when computing the loss. This helps to prevent
overestimation of the target Q value. We will cover this in detail
in the next section.
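
As a rough illustration of taking the minimum of the two target Q values, here is a minimal NumPy sketch. The function name `clipped_double_q_target`, the discount factor `gamma`, the `done` flag, and the sample numbers are assumptions made for this example and are not taken from the text above; the two arrays stand in for the outputs of the two target critic networks.

```python
import numpy as np

def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Build the target value from the smaller of two target-critic estimates.

    q1_next and q2_next are placeholders for the outputs of the two target
    critic networks at the next state (hypothetical inputs for this sketch).
    """
    # Element-wise minimum of the two target Q values.
    min_q_next = np.minimum(q1_next, q2_next)
    # Standard one-step target; (1 - done) drops the bootstrap term
    # for transitions that ended the episode.
    return reward + gamma * (1.0 - done) * min_q_next

# Two sample transitions; the second one is terminal.
reward = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])
q1_next = np.array([10.0, 4.0])
q2_next = np.array([8.0, 6.0])

target = clipped_double_q_target(reward, done, q1_next, q2_next)
print(target)
```

In the first transition the two critics disagree (10.0 versus 8.0), and the target is built from the more pessimistic value, which is how the minimum limits overestimation.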

__Delayed policy updates:__ In DDPG, we learned that we update the parameters
of both the actor (policy network) and the critic (DQN) at every step
of the episode. Unlike DDPG, here we delay updating the parameters of the
actor network.
That is, the critic network parameters are updated at every step of the episode,
but the actor network (policy network) parameters are updated
only once every two steps of the episode.
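
The update schedule described above can be sketched with a simple counter. This is only an illustration of the timing: the function `run_updates` and the parameter name `policy_delay` are hypothetical, and the actual gradient steps are replaced by counters.

```python
def run_updates(num_steps, policy_delay=2):
    """Count how often the critic and the actor would be updated.

    The gradient steps themselves are omitted; each counter increment
    marks where an update would happen (hypothetical sketch).
    """
    critic_updates = 0
    actor_updates = 0
    for step in range(1, num_steps + 1):
        # The critic is updated at every step.
        critic_updates += 1
        # The actor is updated only once every `policy_delay` steps.
        if step % policy_delay == 0:
            actor_updates += 1
    return critic_updates, actor_updates

critic_updates, actor_updates = run_updates(10)
print(critic_updates, actor_updates)
```

With a delay of two, ten steps give ten critic updates and five actor updates, so the actor always learns from a critic that has had extra steps to settle.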

__Target policy smoothing:__ With a deterministic policy, as in DDPG, the target
value can differ sharply even for nearly identical actions, so the variance of
the target value is high. We reduce this variance by adding a small amount of
noise to the target action.
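
A minimal NumPy sketch of adding noise to the target action follows. The function name `smoothed_target_action`, the Gaussian noise, the clipping of the noise, the action bounds, and the default values `noise_std=0.2` and `noise_clip=0.5` are all assumptions chosen for illustration; the text above only states that some noise is added.

```python
import numpy as np

def smoothed_target_action(target_action, rng, noise_std=0.2, noise_clip=0.5,
                           action_low=-1.0, action_high=1.0):
    """Add small, bounded noise to the action chosen by the target actor.

    All default values here are illustrative assumptions.
    """
    # Sample Gaussian noise with the same shape as the action.
    noise = rng.normal(0.0, noise_std, size=target_action.shape)
    # Keep the noise small so the noisy action stays close to the original.
    noise = np.clip(noise, -noise_clip, noise_clip)
    # Keep the noisy action inside the valid action range.
    return np.clip(target_action + noise, action_low, action_high)

rng = np.random.default_rng(0)
target_action = np.array([0.9, -0.3, 0.0])
smoothed = smoothed_target_action(target_action, rng)
print(smoothed)
```

Because the noise is bounded, the target is effectively evaluated over a small neighborhood of the target action rather than at a single point, which is what smooths the target value.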


Now that we have a basic idea of the key features of TD3, in the next section we will get into
more detail and learn how exactly these three key features work and how
they solve the problems associated with DDPG.