# Report

This project benchmarks MADDPG ([OpenAI, 2018](https://arxiv.org/pdf/1706.02275.pdf)), the multi-agent extension of the DDPG algorithm ([Google Deepmind, 2016](https://arxiv.org/pdf/1509.02971.pdf)), against D4PG ([Google Deepmind, 2018](https://arxiv.org/pdf/1804.08617.pdf)).

This report describes the methods as well as the implementations.

## DDPG

Actor-Critic methods like the DDPG algorithm combine Q-learning with policy gradients. The name of the DDPG algorithm (Deep Deterministic Policy Gradient) suggests that it is a policy gradient algorithm. However, classic policy gradient algorithms such as REINFORCE are on-policy methods, while DDPG is an __off-policy__ method: the policy that is followed to collect experience (the behaviour policy, here the actor with added exploration noise) is not the policy that is evaluated and improved (the target policy).

Further distinctions from vanilla Actor-Critic methods are the __replay buffer__, from which old experiences are sampled at random to break the temporal correlations between consecutive transitions, and the __target networks__, slowly updated copies of the Actor and the Critic that are used to compute the TD (Temporal Difference) target, i.e. the reward plus the discounted Q-value at the next state-action pair. The goal of the Critic is to minimize the loss $L$ between the TD target $y$ and its estimate, simply the Mean Squared Error:


\begin{equation}
L = \frac{1}{N} \sum_{i} \big( y_i - Q(s_i, a_i | \theta^{Q}) \big)^2
\end{equation}

where the TD target $y_i$ is obtained from the target networks $Q'$ and $\mu'$:

\begin{equation}
y_i = r_i + \gamma Q'(s_{i + 1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})
\end{equation}
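The replay buffer and the TD target described above can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration: the class `ReplayBuffer`, the helpers `td_targets` and `mse_loss`, and the toy target networks passed in as plain functions are hypothetical names, not the project's actual code.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer; uniform sampling breaks the temporal order of transitions."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_targets(batch, target_actor, target_critic, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for every transition in the batch."""
    return [
        e.reward + gamma * target_critic(e.next_state, target_actor(e.next_state))
        for e in batch
    ]


def mse_loss(batch, targets, critic):
    """Mean squared error between the TD targets and the critic's estimates."""
    return sum((y - critic(e.state, e.action)) ** 2 for e, y in zip(batch, targets)) / len(batch)


# Toy usage with scalar states and hand-written "networks" (illustrative only).
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=float(t), action=0.0, reward=1.0, next_state=float(t + 1))

batch = buffer.sample(batch_size=2)
targets = td_targets(
    batch,
    target_actor=lambda s: 0.5 * s,
    target_critic=lambda s, a: s + a,
    gamma=0.9,
)
loss = mse_loss(batch, targets, critic=lambda s, a: s + a)
```

Because the buffer has a capacity of 3, only the three most recent transitions survive, and each target equals the reward plus the discounted value that the toy target critic assigns to the next state.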

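The deterministic policy's weights are updated to maximize the expected return $J$. Its gradient follows from the chain rule, averaged over the states $s_t$ visited by the behaviour policy (state distribution $\rho^{\beta}$):

\begin{equation}
\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s_t \sim \rho^{\beta}} \big [ \nabla_{a} Q(s, a|\theta^{Q})|_{s=s_t,a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})|_{s=s_t} \big ]
\end{equation}

The product inside the expectation is the gradient of $Q(s, \mu(s|\theta^{\mu})|\theta^{Q})$ with respect to $\theta^{\mu}$, so a gradient ascent step with learning rate $\alpha$ gives the following policy update:

\begin{equation}
\theta^{\mu}_{k + 1} = \theta^{\mu}_k + \alpha \mathbb{E}_{s \sim \rho^{\beta}} \big [ \nabla_{\theta^{\mu}} Q(s, \mu (s|\theta^{\mu}_k)|\theta^{Q}_k)  \big ].
\end{equation}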
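The update rule above can be checked on a toy problem where the gradients are known in closed form. The sketch below assumes a hypothetical scalar critic $Q(s, a) = -(a - 2s)^2$ and a linear actor $\mu(s|\theta) = \theta s$; neither is part of this project, they only make the chain rule visible. The best action is $a = 2s$, so gradient ascent should drive $\theta$ towards 2.

```python
def dq_da(s, a):
    """Gradient of the toy critic Q(s, a) = -(a - 2 s)^2 with respect to the action a."""
    return -2.0 * (a - 2.0 * s)


def dmu_dtheta(s):
    """Gradient of the toy actor mu(s | theta) = theta * s with respect to theta."""
    return s


def dpg_step(theta, states, alpha):
    """One deterministic policy gradient ascent step on a batch of states."""
    # Sample estimate of E[ grad_a Q(s, a)|_{a = mu(s)} * grad_theta mu(s) ]
    grad = sum(dq_da(s, theta * s) * dmu_dtheta(s) for s in states) / len(states)
    return theta + alpha * grad  # ascent: move theta in the direction that increases Q


theta = 0.0
states = [-1.0, 0.5, 1.0, 2.0]  # stand-in for states sampled from the replay buffer
for _ in range(200):
    theta = dpg_step(theta, states, alpha=0.05)

print(theta)
```

In an automatic differentiation framework the same step is usually expressed by minimizing the negative critic output for the actor's actions, which is what the policy loss in the implementation does.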


The implementation is fairly simple:

In [None]:
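import torch.nn.functional as F

# TD target y = r + gamma * Q'(s', mu'(s')); q_targets_next comes from the target networks
q_targets = rewards + (gamma * q_targets_next)
q_expected = critic(states, actions)
critic_loss = F.mse_loss(q_expected, q_targets)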

And for the policy loss we have:

In [None]:
# The minus sign turns gradient ascent on Q(s, mu(s)) into a loss to minimize
actor_loss = -self.critic_local(states, actions_pred).mean()

To improve exploration, exploratory actions are taken by adding noise $\mathcal{N}$ to the output of the policy, which in code looks like:

In [None]:
action += self.noise.sample() * noise_weight

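Here the noise is generated by an Ornstein-Uhlenbeck process, a mean-reverting stochastic process defined by a stochastic differential equation. Consecutive samples are temporally correlated, which gives smoother exploration than independent Gaussian noise.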
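A minimal sketch of such a noise process is given below, using a discretized Ornstein-Uhlenbeck update with a unit time step. The class name `OUNoise`, its scalar state and the default values `theta=0.15` and `sigma=0.2` are assumptions chosen for illustration; the `self.noise` object used in the project may differ.

```python
import random


class OUNoise:
    """Scalar Ornstein-Uhlenbeck process, discretized with a unit time step."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu          # long-run mean the process reverts to
        self.theta = theta    # speed of mean reversion
        self.sigma = sigma    # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1)
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.gauss(0.0, 1.0)
        self.state += dx
        return self.state


noise = OUNoise()
samples = [noise.sample() for _ in range(5)]  # each sample depends on the previous one
```

Setting `sigma=0` removes the randomness and leaves only the pull towards `mu`, which makes the mean-reverting part of the process easy to inspect.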

## D4PG 

The D4PG (Distributed Distributional Deterministic Policy Gradients) has several improvements upon the DDPG. First, the name __Distributed__ comes from the fact that it uses many actors in parallel to collect experience, which speeds up learning. Next, __Distributional__ comes from the fact that the Critic models the return as a random variable with distribution $Z_w$ instead of estimating only its expected value, the Q-value. The goal is now to minimize a distance $d$ between two distributions:

\begin{equation}
L(w) = \mathbb{E} \big[ d(\mathcal{T}_{\mu_\theta} Z_{w'}(s, a), Z_w(s, a)) \big]
\end{equation}

This is implemented in the code by first calculating the log-probabilities of the predicted distribution (the output of the critic), followed by the categorical target distribution. The loss is calculated as the cross-entropy between the two:

In [None]:
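# The critic outputs log-probabilities of the predicted return distribution
log_probs = critic(obs, actions)
target_probs = critic_target(next_obs, target_actions)
# Categorical target distribution built from rewards, target output and done flags
target_dist = categorical(rewards, target_probs, dones)
# Cross-entropy between target and predicted distribution, averaged over the batch
critic_loss = -(target_dist * log_probs).sum(-1).mean()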
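The `categorical` helper is not shown in this report. The sketch below illustrates what such a step commonly looks like for a single transition, assuming a projection onto a fixed, evenly spaced support of atoms between `v_min` and `v_max`. The function names, the support bounds and the single-sample interface are all assumptions, not the project's code.

```python
import math


def project_categorical(reward, done, target_probs, v_min, v_max, gamma):
    """Project the distribution of r + gamma * Z' back onto the fixed atoms."""
    n_atoms = len(target_probs)
    delta = (v_max - v_min) / (n_atoms - 1)  # spacing between neighbouring atoms
    projected = [0.0] * n_atoms
    for j, p in enumerate(target_probs):
        z_j = v_min + j * delta
        # Bellman-shifted atom; terminal transitions keep only the reward
        tz = reward + (0.0 if done else gamma * z_j)
        tz = min(v_max, max(v_min, tz))  # clip to the support
        # Fractional index of the shifted atom, kept inside [0, n_atoms - 1]
        b = min(max((tz - v_min) / delta, 0.0), float(n_atoms - 1))
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            projected[lower] += p  # lands exactly on an atom
        else:
            projected[lower] += p * (upper - b)  # split mass between neighbours
            projected[upper] += p * (b - lower)
    return projected


def cross_entropy(target_dist, log_probs):
    """Cross-entropy between a target distribution and predicted log-probabilities."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))


# Five atoms at 0, 1, 2, 3, 4; the target critic puts all mass on the atom z = 2.
target_dist = project_categorical(
    reward=0.5, done=False, target_probs=[0.0, 0.0, 1.0, 0.0, 0.0],
    v_min=0.0, v_max=4.0, gamma=0.5,
)
uniform_log_probs = [math.log(0.2)] * 5
loss = cross_entropy(target_dist, uniform_log_probs)
```

In this example the atom at 2 is shifted to 0.5 + 0.5 * 2 = 1.5, which lies halfway between the atoms 1 and 2, so its probability mass is split equally between them.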