# Project 2: Continuous control


## 1: Learning algorithm

### Description of algorithm

<style>
.pseudo-code ul {
  list-style-type: none;
}
.pseudo-code ol {
  list-style-type: none;
}
</style>

Deep learning and reinforcement learning have recently been combined in several ways, including variants of the "Deep Q-Network" (DQN) used in the Navigation project. While DQN works well with high-dimensional _observation spaces_, it can only handle discrete, low-dimensional _action spaces_.

Physical control tasks such as this one have continuous, high-dimensional action spaces, for which algorithms like _deep deterministic policy gradient_ (DDPG) are better suited. In brief, DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators.

The DDPG algorithm works as follows:

<nav class="pseudo-code">

* Randomly initialize critic network $Q(s,a|\theta^Q)$ and actor $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu.$
* Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q, \theta^{\mu'} \leftarrow \theta^\mu$
* Initialize replay buffer $R$
   
   **for** episode = 1, M **do**
   
   * Initialize a random process $\mathcal{N}$ for action exploration

   * Receive initial observation state $s_1$

   * **for** t = 1, T **do**
     * Select action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
   
     * Execute action $a_t$, then observe reward $r_t$ and new state $s_{t+1}$
   
     * Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
   
     * Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
   
     * Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
   
     * Update the critic by minimizing the loss: $L = \frac{1}{N} \sum_i(y_i - Q(s_i, a_i|\theta^Q))^2$
   
     * Update the actor policy using the sampled policy gradient:
   
     * \begin{equation} \nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s,a|\theta^Q)|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s_i}  \end{equation}
   
     * Update the target networks:
   
   \begin{align}
   \theta^{Q'} &\leftarrow \tau\theta^Q + (1-\tau) \theta^{Q'}\\
   \theta^{\mu'} &\leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}
   \end{align}

   * **end for**

  **end for**

</nav>
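The bookkeeping steps of the loop above (storing and sampling transitions, forming the target $y_i$, computing the critic loss, and soft-updating the target weights) can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the names `ReplayBuffer`, `td_target`, `critic_loss` and `soft_update` are hypothetical, weights are represented as flat lists of floats, and terminal states are ignored, exactly as in the pseudo-code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer R of (s, a, r, s_next) transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in the pseudo-code.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(r, q_next, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return r + gamma * q_next


def critic_loss(targets, q_values):
    """Mean squared error between the targets y_i and Q(s_i, a_i)."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target, local, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element by element."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(local, target)]
```

With a small `tau`, `soft_update` moves the target weights only a fraction of the way toward the learned weights on each call, which keeps the targets $y_i$ slowly varying.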


### Chosen Hyperparameters

```
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 2e-4         # learning rate of the actor 
LR_CRITIC = 3e-4        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
NUM_AGENTS = 20         # number of agents
UPDATE_RATE = 20        # number of time steps between updates
NUM_UPDATES = 10        # number of training passes per update
EPSILON = .5            # initial noise magnitude
EPSILON_DECAY = 0.01    # noise decay per episode
```
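To make the roles of `UPDATE_RATE`, `NUM_UPDATES`, `EPSILON` and `EPSILON_DECAY` concrete, here is a hedged sketch of how they might drive the training schedule. The helper names are hypothetical, and two details are assumptions not stated above: that the decay is subtractive (floored at zero), and that the exploration noise is zero-mean Gaussian scaled by epsilon and clipped to the Tanh output range of the actor.

```python
import random

UPDATE_RATE = 20      # number of time steps between updates
NUM_UPDATES = 10      # number of training passes per update
EPSILON = 0.5         # initial noise magnitude
EPSILON_DECAY = 0.01  # noise decay per episode


def should_learn(t, buffer_len, batch_size, update_rate=UPDATE_RATE):
    """True on every update_rate-th time step once a full minibatch is available."""
    return t % update_rate == 0 and buffer_len >= batch_size


def epsilon_for_episode(episode, start=EPSILON, decay=EPSILON_DECAY):
    """Noise scale for a 1-based episode index.

    Assumption: subtractive decay, floored at zero.
    """
    return max(0.0, start - decay * (episode - 1))


def noisy_action(action, epsilon, rng=random):
    """Add epsilon-scaled Gaussian noise (an assumed noise process), then clip to [-1, 1]."""
    return [max(-1.0, min(1.0, a + epsilon * rng.gauss(0.0, 1.0))) for a in action]
```

Under these settings, 100 time steps with a sufficiently full buffer trigger learning at steps 20, 40, 60, 80 and 100, for 5 × 10 = 50 training passes in total.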

### Neural Networks

Two similarly structured networks are used for the _actor_ and the _critic_. For this problem, ``state_size = 33`` and ``action_size = 4``.

#### Actor
A basic three-layered feed-forward network with fully connected layers is used for the actor:

* BatchNorm 1
* Layer 1: (state_size, 128)
* ReLU 1
* BatchNorm 2
* Layer 2: (128, 128)
* ReLU 2
* BatchNorm 3
* Layer 3: (128, action_size)
* Tanh
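
The layer list above maps directly onto a sequential model. The following is a minimal sketch assuming PyTorch; the class name `Actor` and the `hidden` argument are illustrative, and weight initialization is left at the library defaults, which may differ from the project's code.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a batch of states to actions in [-1, 1], following the layer list above."""

    def __init__(self, state_size=33, action_size=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_size),      # BatchNorm 1
            nn.Linear(state_size, hidden),   # Layer 1
            nn.ReLU(),                       # ReLU 1
            nn.BatchNorm1d(hidden),          # BatchNorm 2
            nn.Linear(hidden, hidden),       # Layer 2
            nn.ReLU(),                       # ReLU 2
            nn.BatchNorm1d(hidden),          # BatchNorm 3
            nn.Linear(hidden, action_size),  # Layer 3
            nn.Tanh(),                       # squash each action component to [-1, 1]
        )

    def forward(self, state):
        # state: float tensor of shape (batch, state_size)
        return self.net(state)
```

Because of the batch-normalization layers, the module should be switched to `eval()` mode when selecting an action for a single state and back to `train()` mode before learning.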

#### Critic
A slightly wider three-layered feed-forward network with fully connected layers is used for the critic:

* Layer 1: (state_size, 256)
* ReLU 1
* BatchNorm
* Layer 2: (cat(256, action_size), 128)
* ReLU 2
* Layer 3: (128, action_size)


## 2: Plot of Rewards

This setup solved the problem in 50 episodes. A plot of the score per episode is shown below:

<img src="score-plot.png" width="600">

## 3: Ideas for Future Work

While the problem is solved reasonably quickly with basic DDPG, the agent could possibly be improved in several ways:

* Search for optimal hyperparameters and neural network size/shape.
* Try extensions of the Q-learning algorithm, including "Double DQN" and "Dueling DQN", and apply prioritized experience replay rather than the current uniform implementation.
* Observe the raw pixels instead of (or in addition to) the current ray-based "sensor" inputs.
