# Navigation project
This notebook describes the implementation and illustrates the live performance of the Bananator agents in the actual environment.

# Background
#### Deep Q Networks
The original paper on Deep Q Networks (short: DQN) can be found [here](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). An excellent in-depth introduction to Reinforcement Learning can be found in this [famous book](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) by Sutton & Barto.

This repository follows the ideas proposed in the DQN paper and recreates a Deep Q Network. The network takes a state (a 37-dimensional observation of the environment) as input and outputs the state-action values of the four actions that are available to the agent (move forward, move backward, turn left, turn right). Each state-action value is an estimate of the expected total reward that the agent will collect from that moment on if it takes that action.
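
To make "expected total reward" concrete, here is a minimal sketch of the discounted return that a state-action value estimates. It assumes that future rewards are discounted with the `gamma` of 0.99 listed among the agent parameters further down. The function name `discounted_return` and the reward sequence are made up for illustration and are not taken from the repository.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma**k for its delay k."""
    total = 0.0
    # Iterate backwards so that each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode fragment: a reward now, nothing, then another reward.
example_rewards = [1.0, 0.0, 1.0]
example_return = discounted_return(example_rewards)  # 1 + 0.99**2 = 1.9801
```

The network's output for an action is an estimate of the expectation of this quantity, not the return of any single episode.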

Once the DQN has produced the state-action values, the agent decides on an action following an epsilon-greedy policy, i.e. it chooses the action with the highest value with a probability of $1-\epsilon$. With a probability of $\epsilon$, the agent chooses an action at random. $\epsilon$ equals one in the first episode and slowly decays over time towards a small positive value.
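
The policy described above can be sketched in a few lines. This is an illustration under assumptions, not the repository's code: the function names, the multiplicative decay schedule and the values `eps_end=0.01` and `eps_decay=0.995` are all hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    # Greedy choice: index of the largest state-action value.
    return max(range(len(action_values)), key=lambda a: action_values[a])


def decayed_epsilon(episode, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Multiplicative decay per episode, floored at eps_end (assumed schedule)."""
    return max(eps_end, eps_start * eps_decay ** episode)


q_values = [0.1, 0.7, 0.2, 0.0]  # made-up values for the four actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # epsilon=0 is fully greedy
```

With `epsilon=0.0` the random branch is never taken, which corresponds to the purely greedy behaviour used for the demonstration at the end of this notebook.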

#### Prioritized experience replay
Many improvements on DQN have been proposed. Several of them have been combined into "Rainbow", an algorithm which systematically outperforms vanilla DQN. One of the most important contributions is prioritized experience replay (short: PER). In this variation, experiences are sampled from the replay buffer according to their TD error: the larger an experience's TD error, the higher its probability of being sampled. More details on PER can be found [here](https://arxiv.org/abs/1511.05952). [This blog post](https://danieltakeshi.github.io/2019/07/14/per/) was particularly useful for understanding the implementation details.
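
The sampling rule from the PER paper can be sketched as follows. The formulas are the proportional variant from the paper: $P(i) = p_i^\alpha / \sum_k p_k^\alpha$ and $w_i = (N \cdot P(i))^{-\beta}$, normalised by the largest weight. Everything else is an assumption made for illustration: the function names are hypothetical, the priority is taken to be $|\delta_i|$ plus a small constant, and the default values merely mirror `per_prio_coeff`, `per_w_bias_coeff` and `per_min_priority` from the PER agent's parameters, whose exact role in `dqn_agent.py` is not shown here.

```python
import random


def per_probabilities(td_errors, alpha=0.7, min_priority=0.05):
    """Sampling probabilities P(i) = p_i**alpha / sum_k p_k**alpha."""
    # The small constant keeps zero-error experiences sampleable (assumed form).
    scaled = [(abs(err) + min_priority) ** alpha for err in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=0.7):
    """Weights w_i = (N * P(i))**(-beta), normalised by the largest weight."""
    n = len(probabilities)
    raw = [(n * p) ** (-beta) for p in probabilities]
    largest = max(raw)
    return [w / largest for w in raw]


td_errors = [0.0, 0.5, 2.0]  # made-up TD errors for three stored experiences
probs = per_probabilities(td_errors)
weights = importance_weights(probs)
# Draw a batch of indices (with replacement) according to the priorities.
batch = random.choices(range(len(td_errors)), weights=probs, k=2)
```

Experiences with a large TD error are drawn more often but receive a smaller importance-sampling weight, which compensates for the bias that non-uniform sampling introduces.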

This repository augments DQN with PER and shows both results side by side. The vanilla agent `bananator_vanilla` implements DQN only, while `bananator_per` implements DQN and PER.

# Parameters
#### Network parameters
The DQN has three layers. The input size is 37 and the output size is 4. The middle layer has an input size of 90 and an output size of 45, so the layer sizes are 37 → 90 → 45 → 4.
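
As a sanity check on these sizes, the sketch below builds a network with the same shape and runs one forward pass. It is written in plain Python so that it runs without PyTorch, whereas the repository's network is a PyTorch module (its weights are loaded with `load_state_dict` below). The fully connected layers, the ReLU activations, the weight initialisation and all names are assumptions, not details taken from the repository.

```python
import random

LAYER_SIZES = [37, 90, 45, 4]  # state -> hidden -> hidden -> action values


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    bound = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, state):
    """Apply each layer; ReLU between layers, no activation on the output."""
    x = list(state)
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
action_values = forward(layers, [0.0] * 37)  # one value per action
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in layers)  # 7699
```

With bias terms included, a network of this shape has 3420 + 4095 + 184 = 7699 trainable parameters.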

#### Agent parameters
Agent parameters are defined as a named tuple in the same file that contains the agent. This ensures consistency when importing. The parameters used can be examined below, in the section "Load the agents".

# Results

![Results](media/results.png)

The vanilla agent reaches the "solved" threshold after about 500 episodes. Surprisingly, the PER agent does not perform as well as expected. This is not a serious problem, though, since the most likely cause is the choice of hyperparameters. Given enough time, a hyperparameter search would probably find values with which the PER agent outperforms vanilla DQN. Due to severe time constraints, I have to skip that search at this point.

# Improvements

The following are suggestions on how to improve the performance.
* **Rewrite the replay buffer**. The replay buffer is implemented as a Python deque, which has fast appends and pops at both ends but is [inefficient (O(n))](https://docs.python.org/3/library/collections.html#collections.deque) when accessing elements in the middle of the buffer, as random sampling does. A faster replay buffer means faster training and more time for experiments.
* **Implement dynamic PER parameters.** In this implementation, the PER parameters $\alpha$ and $\beta$ are implemented as constants. It should be easy to implement them as variables that can change as the training progresses, and according to the paper that is expected to improve performance.
* **Perform a hyperparameter search.** A better set of hyperparameters almost certainly exists: during development, average scores of up to 16 were seen.
* **Implement the rest of Rainbow.** There are many other known augmentations to the original DQN model.
* **Run the training several times.** Deep reinforcement learning is [known to depend critically on (random) initial conditions](https://www.alexirpan.com/2018/02/14/rl-hard.html). Re-train several times with the same parameters to find the best performing agent.
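
As a rough illustration of the first suggestion, the sketch below replaces the deque with a list-backed ring buffer, so that reading any stored experience is O(1). `RingReplayBuffer` is a hypothetical name, not a class from this repository, and the sketch only covers uniform sampling; the PER agent would additionally need to store and sample by priority.

```python
import random


class RingReplayBuffer:
    """Fixed-capacity buffer backed by a list, so indexing is O(1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # slot that add() overwrites once the buffer is full

    def add(self, experience):
        if len(self.storage) < self.capacity:
            self.storage.append(experience)
        else:
            self.storage[self.next_index] = experience  # overwrite the oldest
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        """Uniform sample without replacement."""
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)


buffer = RingReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step)  # integers stand in for (state, action, reward, ...) tuples
```

After five insertions into a buffer of capacity three, only the three most recent experiences remain, just as with a deque created with `maxlen=3`.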


# Agent demonstration
Below we will see the trained agents following a greedy policy in the actual environment.

**Note:** Running the cell under "Initialization" will open a full-screen Unity window. Switch back to this window afterwards, or run the whole notebook at once.

In [1]:
from unityagents import UnityEnvironment
import numpy as np

# Invite our agent & import utils
from dqn_agent import Agent, Parameters
from random import random as rnd
import torch

### Initialization

In [2]:
PATH_TO_ENV = "Banana_Linux/Banana.x86"
BRAIN = "BananaBrain"
TRAINING = False

env = UnityEnvironment(file_name=PATH_TO_ENV, no_graphics=TRAINING)
bananator = env.brains[BRAIN]

# number of actions
ACTION_SIZE = bananator.vector_action_space_size

# reset the environment
env_info = env.reset(train_mode=TRAINING)[BRAIN]

# examine the state space 
state = env_info.vector_observations[0]
STATE_SIZE = len(state)

def act(env, action, brain_name=BRAIN) -> tuple:
    """Lets brain_name perform action on the environment env.
    Returns a tuple of reward, next_state, done"""
    action_result = env.step(action)[brain_name] # Act on the environment and observe the result
    return (action_result.rewards[0], # reward from action
            action_result.vector_observations[0], # next state
            action_result.local_done[0]) # True if the episode ended
    
def reset(env, training=TRAINING, brain_name=BRAIN) -> np.ndarray:
    """Syntactic sugar for resetting the unity environment.
    Returns the initial state (observation vector) of the first agent"""
    return env.reset(train_mode=training)[brain_name].vector_observations[0]

def visualize(agent, env):
    """Lets agent play one full episode in env and prints the final score"""
    state = reset(env)
    score = 0
    done = False
    while not done:
        action, _ = agent.decide(state)                # Choose an action based on the state
        reward, next_state, done = act(env, action)    # Send the action to the environment
        score += reward                                # Update the score
        state = next_state                             # Roll over the state to next time step
    print("Score: {}".format(score))

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


### Load the agents

In [3]:
# Load the agents' parameters
vanilla = torch.load('models/vanilla-checkpoint.pth')
per = torch.load('models/per-checkpoint.pth')

In [4]:
# Prepare vanilla
bananator_vanilla = Agent(vanilla['agent_params'], cuda=False)
bananator_vanilla.qnetwork_local.load_state_dict(vanilla['state_dict'])

# Examine vanilla's parameters -- See dqn_agent.py for a description of the individual fields
for name, value in zip(vanilla['agent_params']._fields, vanilla['agent_params']):
    print(f'{name}: {value}')

buffer_size: 100000
batch_size: 64
gamma: 0.99
tau: 0.001
lr: 0.0005
update_every: 4
state_size: 37
action_size: 4
use_per: False
per_min_priority: 0.05
per_prio_coeff: 1
per_w_bias_coeff: 1
seed: 557


In [5]:
# Prepare per
bananator_per = Agent(per['agent_params'], cuda=False)
bananator_per.qnetwork_local.load_state_dict(per['state_dict'])

# Examine per's parameters -- See dqn_agent.py for a description of the individual fields
for name, value in zip(per['agent_params']._fields, per['agent_params']):
    print(f'{name}: {value}')

buffer_size: 100000
batch_size: 64
gamma: 0.99
tau: 0.001
lr: 0.0005
update_every: 4
state_size: 37
action_size: 4
use_per: True
per_min_priority: 0.05
per_prio_coeff: 0.7
per_w_bias_coeff: 0.7
seed: 961


### Run the simulation

In [6]:
visualize(bananator_vanilla, env)

Score: 17.0


In [7]:
visualize(bananator_per, env)

Score: 8.0


In [8]:
env.close()