# Continuous control project
This notebook describes the implementation and illustrates the live performance of the Reacher agent in the actual environment.

# Background
#### Deep Deterministic Policy Gradients (DDPG)
DDPG is an actor-critic method that is applicable to continuous action spaces, in contrast to algorithms such as Deep Q-Networks (DQN), which require discrete actions. The original description of DDPG can be found [here](https://arxiv.org/pdf/1509.02971.pdf). DDPG trains two networks simultaneously: an actor, which deterministically maps the current state to an action, and a critic, which approximates the value function of the state-action pair.

Both the actor and the critic are cloned into target networks, much like in DQN. The outputs of these cloned networks are used to compute the targets when the networks are updated.
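For reference, the role of the cloned networks can be written in the notation of the DDPG paper, where $Q'$ and $\mu'$ denote the cloned (target) critic and actor. The critic is trained towards the target

$$y_i = r_i + \gamma \, Q'\big(s_{i+1}, \mu'(s_{i+1})\big)$$

and after each learning step the cloned weights $\theta'$ slowly track the learned weights $\theta$:

$$\theta' \leftarrow \tau \, \theta + (1 - \tau) \, \theta'$$

Here $\gamma$ is the discount rate and $\tau$ is the soft update rate. These are the formulas from the paper; whether this implementation additionally masks the bootstrap term on terminal transitions is not stated in this notebook.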

# Parameters
#### Actor
The actor network has three fully connected layers, which map the `33` inputs to `240`, `120` and `4` (output) units. Batch normalization is applied after the first layer. ReLUs are applied after the first and second layers. A tanh function is applied after the last layer to ensure outputs in the range `[-1, 1]`. The actor adds noise to its actions, generated by an Ornstein-Uhlenbeck process with $\mu=0$, $\theta=0.05$ and $\sigma=0.02$.
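The exploration noise can be sketched as follows. This is a minimal illustration and not the code in `ddpg_agent`: the class name `OUNoise`, the helper `clip`, the Gaussian increments, the unit time step and the clipping of the noisy action to `[-1, 1]` are all assumptions. Only the values of $\mu$, $\theta$ and $\sigma$ are taken from the text.

```python
import random


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process (hypothetical helper)."""

    def __init__(self, size, mu=0.0, theta=0.05, sigma=0.02, seed=0):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def clip(values, low=-1.0, high=1.0):
    """Keep every action component inside the valid range."""
    return [max(low, min(high, v)) for v in values]


noise = OUNoise(size=4)
action = [0.9, -0.9, 0.0, 0.5]  # stand-in for an actor output
noisy_action = clip([a + n for a, n in zip(action, noise.sample())])
```

Because each sample depends on the previous one, consecutive noise values are correlated, which is the usual motivation for choosing this process over independent noise in control tasks.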

#### Critic
The first layer of the critic network maps the `33` state inputs to `240` units, after which batch normalization and a ReLU are applied. The `4` action values (the size of the action space) are appended to this output to form the input to layer 2, which has an output size of `120` and is followed by a ReLU. The last layer has an output size of `1`.
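The two architectures described in this section could look roughly like this in PyTorch. This is a sketch, not the code in `ddpg_agent`: the class and attribute names are hypothetical, and applying batch normalization before the ReLU is an assumption. The layer sizes follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """State -> action in [-1, 1] (hypothetical sketch)."""

    def __init__(self, state_size=33, action_size=4, fc1=240, fc2=120):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.bn1 = nn.BatchNorm1d(fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.out = nn.Linear(fc2, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))  # squash actions into [-1, 1]


class Critic(nn.Module):
    """(State, action) -> scalar value estimate (hypothetical sketch)."""

    def __init__(self, state_size=33, action_size=4, fc1=240, fc2=120):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.bn1 = nn.BatchNorm1d(fc1)
        self.fc2 = nn.Linear(fc1 + action_size, fc2)  # actions join at layer 2
        self.out = nn.Linear(fc2, 1)

    def forward(self, state, action):
        x = F.relu(self.bn1(self.fc1(state)))
        x = torch.cat([x, action], dim=1)
        x = F.relu(self.fc2(x))
        return self.out(x)


actor, critic = Actor(), Critic()
states = torch.randn(8, 33)          # a batch of 8 random states
actions = actor(states)              # shape (8, 4)
q_values = critic(states, actions)   # shape (8, 1)
```

Note that `BatchNorm1d` in training mode needs more than one sample per batch; for single-state inference the module would have to be switched to `eval()` first.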

#### Learning
This implementation uses the following learning parameters:
* A replay buffer of size `10⁶` is used
* The mini-batch size is `256`
* Discount rate $\gamma=0.99$
* Soft update rate $\tau=10^{-3}$
* Learning rate $\eta=10^{-4}$ (for both networks)
* Weight decay `= 0`
* The critic's gradient is clipped to the range `[-1, 1]`
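How these parameters typically enter a critic update can be sketched as below. This is an illustration under assumptions, not the notebook's training loop: the tiny `nn.Linear` models and the random data are stand-ins, `soft_update` is a hypothetical helper, the soft update rate is assumed to be `1e-3`, and element-wise clipping with `clip_grad_value_` is one possible reading of "clipped to the range `[-1, 1]`" (norm clipping with `clip_grad_norm_` would be another).

```python
import torch
import torch.nn as nn

BATCH_SIZE = 256
TAU = 1e-3          # soft update rate (assumed value)
LR = 1e-4           # learning rate
WEIGHT_DECAY = 0.0


def soft_update(local_model, target_model, tau=TAU):
    """Blend local weights into the target: t <- tau * l + (1 - tau) * t."""
    with torch.no_grad():
        for target_p, local_p in zip(target_model.parameters(),
                                     local_model.parameters()):
            target_p.mul_(1.0 - tau).add_(tau * local_p)


# Stand-in models and data, only to make the sketch runnable
critic_local = nn.Linear(4, 1)
critic_target = nn.Linear(4, 1)
optimizer = torch.optim.Adam(critic_local.parameters(),
                             lr=LR, weight_decay=WEIGHT_DECAY)
inputs = torch.randn(BATCH_SIZE, 4)
targets = torch.randn(BATCH_SIZE, 1)

loss = nn.functional.mse_loss(critic_local(inputs), targets)
optimizer.zero_grad()
loss.backward()
# Clip every gradient element to [-1, 1] before the optimizer step
torch.nn.utils.clip_grad_value_(critic_local.parameters(), clip_value=1.0)
optimizer.step()
soft_update(critic_local, critic_target)
```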

# Results

![Results](media/agent-performance-v0.png)

The DDPG agent solves the environment after 287 episodes.

# Improvements

The following are suggestions on how to improve the performance.
* **Rewrite the replay buffer.** The replay buffer is implemented as a Python deque, which offers fast appends and pops at both ends, but is [inefficient (O(n))](https://docs.python.org/3/library/collections.html#collections.deque) for indexed access in the middle of the buffer, as required when sampling. A faster replay buffer means faster training and more time for experiments.
* **Perform a hyperparameter search.** A better set of hyperparameters almost certainly exists, which would also speed up learning.
* **Implement multiple agents.** Training multiple agents in parallel has been shown to reduce training time.
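As an illustration of the first suggestion, a replay buffer can be backed by a preallocated list used as a ring buffer, which gives O(1) indexed access when sampling. This is a minimal sketch, not a drop-in replacement for the existing buffer: the class name `RingReplayBuffer` and its interface are hypothetical.

```python
import random


class RingReplayBuffer:
    """Fixed-size replay buffer with O(1) random access (hypothetical sketch)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = [None] * capacity   # preallocated slots
        self.position = 0                  # next slot to overwrite
        self.size = 0                      # number of filled slots
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store a transition, overwriting the oldest one when full."""
        self.storage[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        """Draw distinct transitions uniformly at random.

        Returns (states, actions, rewards, next_states, dones) as tuples.
        """
        indices = self.rng.sample(range(self.size), batch_size)
        batch = [self.storage[i] for i in indices]
        return tuple(zip(*batch))

    def __len__(self):
        return self.size


buffer = RingReplayBuffer(capacity=3, seed=0)
for step in range(5):  # the two oldest transitions get overwritten
    buffer.add(state=step, action=0.0, reward=float(step),
               next_state=step + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

Storing the transitions in preallocated NumPy arrays instead of a list would additionally allow whole batches to be gathered with a single indexing operation.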

# Agent demonstration
Below we will see the trained agent following a greedy policy in the actual environment.

**Note:** Running the cell under "Initialization" will open a full-screen Unity window. Either switch windows afterwards or run the whole notebook at once.

In [1]:
from unityagents import UnityEnvironment
import numpy as np

# Invite our agent & import utils
from ddpg_agent import Agent
#from random import random as rnd
import torch

### Initialization

In [2]:
PATH_TO_ENV = "Reacher_Linux/Reacher.x86"
BRAIN = "ReacherBrain"
TRAINING = False

env = UnityEnvironment(file_name=PATH_TO_ENV, no_graphics=TRAINING)

ACTION_SIZE = env.brains[BRAIN].vector_action_space_size
STATE_SIZE = env.brains[BRAIN].vector_observation_space_size

def act(env, actions, brain_name=BRAIN) -> tuple:
    """Sends actions to the environment env and observes the results.
    Returns a tuple of rewards, next_states, dones (One per agent)"""
    action_result = env.step(actions)[brain_name] # Act on the environment and observe the result
    return (action_result.rewards,
            action_result.vector_observations, # next states
            action_result.local_done) # True if the episode ended
    
def reset(env, training=TRAINING, brain_name=BRAIN) -> np.ndarray:
    """Syntactic sugar for resetting the unity environment"""
    return env.reset(train_mode=training)[brain_name].vector_observations

def visualize(agent, env):
    """Runs one episode with the given agent and prints the first agent's score."""
    states = reset(env)
    score = 0
    done = False
    while not done:
        actions = agent.decide(states)                   # Choose actions based on the states
        rewards, next_states, dones = act(env, actions)  # Send the actions to the environment
        score += rewards[0]                              # Update the score (first agent only)
        states = next_states                             # Roll over the states to the next time step
        done = any(dones)                                # Stop as soon as any agent is done
    print("Score: {}".format(score))

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


### Run the simulation

In [3]:
agent = Agent(STATE_SIZE, ACTION_SIZE, 0)
state_dict = torch.load('models/30.0-289-27.12.19_18.05-checkpoint.pth')

In [4]:
agent.actor_local.load_state_dict(state_dict['actor_state_dict'])
agent.critic_local.load_state_dict(state_dict['critic_state_dict'])

<All keys matched successfully>

In [6]:
visualize(agent, env)

Score: 36.699999179691076


In [7]:
# Shuts down the Unity environment
env.close()