[//]: # (Image References)

# Project 1: Continuous Control

[image1]: https://user-images.githubusercontent.com/10624937/43851024-320ba930-9aff-11e8-8493-ee547c6af349.gif "Trained Agent"

### Introduction

For this project, you will work with the [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) environment.

![Trained Agent][image1]
  
In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.

The environment is considered solved when the average score over 100 consecutive episodes is at least +30.

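As a minimal sketch of this criterion, the helper below scans a list of per-episode scores and returns the first episode at which the 100-episode moving average reaches +30. The function name `first_solved_episode` and its arguments are illustrative assumptions and are not part of this repository.

```python
from collections import deque
from typing import Iterable, Optional


def first_solved_episode(scores: Iterable[float],
                         target: float = 30.0,
                         window: int = 100) -> Optional[int]:
    """Return the 1-based episode at which the moving average over
    `window` episodes first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Only test the criterion once a full window is available.
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Hypothetical score history: 10 poor episodes followed by 100 good ones.
history = [0.0] * 10 + [31.0] * 100
print(first_solved_episode(history))
```
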
### Agent

#### DDPG Agent
The DDPG agent, [ddpg_agent.py](ddpg_agent.py), implements the DDPG algorithm from the [DDPG paper](https://arxiv.org/pdf/1509.02971).

1. It is a policy gradient algorithm that employs an actor-critic model.

    Actor network parameters
    ```
    ----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
    ================================================================
           BatchNorm1d-1                   [-1, 33]              66
                Linear-2                  [-1, 256]           8,704
                Linear-3                  [-1, 128]          32,896
                Linear-4                    [-1, 4]             516
    ================================================================
    Total params: 42,182
    Trainable params: 42,182
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.00
    Params size (MB): 0.16
    Estimated Total Size (MB): 0.16
    ----------------------------------------------------------------
    ```
    
    Critic network parameters
    ```
    ----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
    ================================================================
           BatchNorm1d-1                   [-1, 33]              66
                Linear-2                  [-1, 256]           8,704
                Linear-3                  [-1, 128]          33,408
                Linear-4                    [-1, 1]             129
    ================================================================
    Total params: 42,307
    Trainable params: 42,307
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.00
    Params size (MB): 0.16
    Estimated Total Size (MB): 0.17
    ----------------------------------------------------------------
    ```
2. DDPG is an off-policy algorithm, because the deterministic target policy differs from the noisy behavior policy used to collect experience.
3. Soft updates are applied to both the actor and critic target networks to stabilize learning. The parameter $\tau$ controls how much of the local network is mixed into the target network at each update.

    $$
    \theta_{target} = \tau*\theta_{local} + (1-\tau)*\theta_{target}
    $$
    
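4. The critic network is learned with the temporal difference (TD) approach. Here $\gamma$ is the discount factor, $\mu'$ and $Q'$ are the target actor and target critic, and $\theta$ and $\theta'$ are the local and target critic parameters.

    $$
    y_t = r_t + \gamma * Q'(s_{t+1}, \mu'(s_{t+1}), \theta')
    $$

    $$
    L^{critic} = \frac{1}{N}\sum(y_t - Q(s_t, a_t, \theta))^2
    $$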

5. The actor network is learned by maximizing the objective function $J(\theta)$, which encourages actions that produce larger Q values and discourages actions that produce smaller ones:
![Objective Function](./assets/objective_function.jpg)

6. The complete DDPG algorithm:
![Algorithm](./assets/algorithm.png)

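The soft update (step 3) and the TD target and critic loss (step 4) can be sketched in plain Python on lists of numbers. This is an illustration under stated assumptions, not the code in `ddpg_agent.py`: the function names are hypothetical, and the `dones` mask that stops bootstrapping at terminal steps is a common addition that does not appear in the equations above.

```python
from typing import List, Sequence

GAMMA = 0.99  # discount factor, as listed in the hyper-parameters
TAU = 0.001   # soft-update rate, as listed in the hyper-parameters


def soft_update(local: Sequence[float], target: Sequence[float],
                tau: float = TAU) -> List[float]:
    """Blend parameters: tau * local + (1 - tau) * target, element-wise."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local, target)]


def td_targets(rewards: Sequence[float], next_q_values: Sequence[float],
               dones: Sequence[float], gamma: float = GAMMA) -> List[float]:
    """Compute y = r + gamma * Q'(s', mu'(s')) for each transition.

    `next_q_values` stands in for the target critic evaluated at the
    target actor's action. `dones` (1.0 for terminal, else 0.0) is an
    assumption added here so terminal steps do not bootstrap.
    """
    return [r + gamma * q * (1.0 - d)
            for r, q, d in zip(rewards, next_q_values, dones)]


def critic_loss(targets: Sequence[float], q_values: Sequence[float]) -> float:
    """Mean squared TD error over the minibatch."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


# Toy values only; a real agent would use network outputs here.
target_params = soft_update(local=[0.5, -0.2], target=[0.0, 0.0])
y = td_targets(rewards=[0.1, 0.0], next_q_values=[2.0, 1.0], dones=[0.0, 1.0])
loss = critic_loss(y, q_values=[2.0, 0.5])
print(target_params, y, loss)
```

With a small `TAU`, the target parameters move only slightly toward the local ones on each call, which is what keeps the TD targets from shifting abruptly during training.
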
#### Hyper-Parameters
* Replay buffer size, BUFFER_SIZE, of **100,000**.
* Minibatch size, BATCH_SIZE, of **128**.
* Discount factor, GAMMA, of **0.99**.
* Soft update of target parameters, TAU, of **0.001**.
* Actor learning rate of **0.0001** and critic learning rate of **0.001**.
* Unmodified Ornstein-Uhlenbeck noise: mu of **0.0**, theta of **0.15** and sigma of **0.2**.

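To illustrate how these noise parameters shape exploration, here is a minimal Ornstein-Uhlenbeck process in plain Python, together with a helper that adds the noise to an action and clips the result to the [-1, 1] range the environment expects. The class and function names are assumptions, and the Gaussian increment used here is one common formulation; the implementation in this repository may differ.

```python
import random
from typing import List


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process (illustrative sketch)."""

    def __init__(self, size: int, mu: float = 0.0, theta: float = 0.15,
                 sigma: float = 0.2, seed: int = 0) -> None:
        self.mu = [mu] * size
        self.theta = theta            # pull strength toward the mean
        self.sigma = sigma            # scale of the random perturbation
        self.rng = random.Random(seed)
        self.state = list(self.mu)

    def reset(self) -> None:
        """Return the state to the mean, e.g. at the start of an episode."""
        self.state = list(self.mu)

    def sample(self) -> List[float]:
        """Advance the process one step and return the new state."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action: List[float], noise: OUNoise) -> List[float]:
    """Add exploration noise, then clip each entry to [-1, 1]."""
    return [max(-1.0, min(1.0, a + n)) for a, n in zip(action, noise.sample())]


noise = OUNoise(size=4)  # four action entries, as in the Reacher environment
action = noisy_action([0.0, 0.5, -0.5, 1.0], noise)
print(action)
```

Because each sample depends on the previous state, consecutive noise values are correlated, so the arm tends to explore in a consistent direction for several steps rather than jittering around the policy's action.
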
#### Result
The training result is shown below. The environment was solved in 17 episodes, with an average score of 30.26 over the last 100 episodes.

![Result](./assets/result.jpg)

### Future Improvements
I am interested in solving this problem with PPO. Unfortunately, despite spending a lot of time trying to train a PPO actor-critic on PongDeterministicV4, I did not get it working: the agent can learn, but training suffers from a memory leak. The issue is posted [here](https://discuss.pytorch.org/t/constant-memory-leak/28753) and the repo is [here](https://github.com/weicheng113/PongPPO). Hopefully I can get it working in the future and experiment more with the PPO approach.
