# Banana Navigation Project

![Screenshot of banana environment](doc/bannerImage.png)

This is an implementation of Deep Reinforcement Q-Learning, applied to train an agent with four possible actions (move forward, move backward, turn left, or turn right) to pick up yellow bananas and avoid blue bananas.

## Table of Contents
+ Environment Setup
+ Description of Algorithm
+ Implementation of Algorithm
  - Hyperparameter Definitions
  - Multi-step Learning, Prioritized Replay Buffer
  - Action Value Distribution Function (Neural Network)
  - Bellman Update (Loss) Computation
  - Training Loop
+ Training
+ Results
+ References

## Environment Setup

+ Follow instructions [here](https://github.com/udacity/Value-based-methods#dependencies) to set up the environment, *with the following changes:*
  - Before running `pip install .`, edit `Value-based-methods/python/requirements.txt` and remove the `torch==0.4.0` line
  - After running `pip install .`, run the appropriate PyTorch installation command for your system indicated [here](https://pytorch.org/get-started/locally/)
  - Continue following the instructions [here](https://github.com/udacity/Value-based-methods#dependencies) to their conclusion.
+ Download the appropriate Unity Environment for your platform:
  - [Linux](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Linux.zip)
  - [Mac OSX](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana.app.zip)
  - [Windows (32-bit)](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Windows_x86.zip)
  - [Windows (64-bit)](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Windows_x86_64.zip)
+ Place the Unity Environment zip file in the `p1_navigation/` folder of the repository cloned in the first step, and unzip the file.
+ Clone this repository into the `p1_navigation/` folder.

### Supplemental Packages
Run the following code cell *once* to install the additional packages required by the implementation.

In [None]:
!pip install

### Imports
Run the following code cell at every kernel start-up to bring the implementation's dependencies into the notebook namespace.

In [8]:
from unityagents import UnityEnvironment
import numpy as np

## Description of Algorithm

Deep Reinforcement Q-Learning is a *value-based* class of reinforcement learning algorithms.  These algorithms aim to accurately approximate either the expected cumulative reward or its full probability distribution, for every possible (state, action) pair in the environment.  With either of these approximations, an agent may be controlled by selecting, in each state, the action with the highest expected reward.

### Value Distribution
This implementation, like the one in [1], aims to find the reward probability distribution:<br><br>
$$d_t^{(n)}\equiv(R_t^{(n)}+\gamma_t^{(n)}\textbf{z},\textbf{p}(S_{t+n},a^{*}_{t+n}))$$
<br>
This is an *n-step* value distribution.  The values taken by the random variable $d_t^{(n)}$ are the discounted sum $R_t^{(n)}$ of the rewards over the next *n* environment time steps, plus the support $\textbf{z}$ of the reward distribution discounted by the factor $\gamma_t^{(n)}$.  The probabilities of those values are the ones obtained when the agent, in the state $S_{t+n}$ reached *n* steps from the present, selects the optimal action $a^{*}_{t+n}$. <br><br>
In practice, the continuous distribution of values is approximated by histogram binning.  The bins are called *atoms* in the literature and typically form an evenly spaced grid between the minimum and maximum allowed values $v_{min}$ and $v_{max}$.
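
As a minimal sketch of the n-step target described above, the following NumPy snippet computes the discounted reward sum $R_t^{(n)}$ and the shifted, scaled atom values $R_t^{(n)}+\gamma_t^{(n)}\textbf{z}$. It assumes a constant per-step discount `gamma`, so that $\gamma_t^{(n)}=\gamma^n$, and no episode termination within the *n* steps. The function names, the reward sequence, and the three-atom grid are hypothetical illustrations and are not taken from this repository.

```python
import numpy as np

def n_step_return(rewards, gamma):
    """Discounted sum of the next n rewards: sum_k gamma**k * rewards[k]."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def n_step_target_support(rewards, gamma, z):
    """Atom values of the n-step target: R^(n) + gamma**n * z."""
    n = len(rewards)
    z = np.asarray(z, dtype=np.float64)
    return n_step_return(rewards, gamma) + (gamma ** n) * z

# Hypothetical 3-step window: yellow banana, nothing, blue banana.
rewards = [1.0, 0.0, -1.0]
gamma = 0.5                      # assumed constant discount
z = np.array([-1.0, 0.0, 1.0])   # tiny illustrative atom grid

print(n_step_return(rewards, gamma))             # 0.75
print(n_step_target_support(rewards, gamma, z))  # [0.625 0.75  0.875]
```

The shifted atoms generally do not fall on the original grid, which is why distributional methods such as [2] project the target back onto the fixed atoms before computing the loss.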

### Parameterized Model
As the product of the state and action spaces is very large (infinite, since the state variables are continuous), it is necessary to represent the reward distribution with a parameterized function.  The 'Deep' in Deep Reinforcement Q-Learning indicates that the parameterized function is a multi-layer neural network.

Using the notation in [1], let $p_{\theta}^i(s,a)$ denote this function, with parameter set $\theta$.  The parameters in $\theta$ are optimized so that, when the agent selects action $a$ while the environment is in state $s$, $p_{\theta}^i(s,a)$ approximates the probability that the *n-step* reward will be $z_i$.  As in [1], the available $z_i$ are defined by:<BR><BR>
$$z_i \equiv v_{min} + (i-1)\frac{v_{max}-v_{min}}{N_{atoms}-1}, \quad i \in \{1,\ldots,N_{atoms}\}$$
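
To make the atom grid concrete, here is a small NumPy sketch that builds the support $z_i$, turns per-action scores into probabilities $p^i(s,a)$ with a softmax, and picks the action with the highest expected value $\sum_i z_i\,p^i(s,a)$. The values `v_min=-10`, `v_max=10`, and `n_atoms=51`, the random logits standing in for a network output, and all function names are assumptions made for illustration only; they are not this project's hyperparameters or code.

```python
import numpy as np

def make_support(v_min, v_max, n_atoms):
    """Evenly spaced atoms: z_i = v_min + (i-1)*(v_max-v_min)/(n_atoms-1)."""
    return np.linspace(v_min, v_max, n_atoms)

def softmax(logits):
    """Numerically stable softmax over the last axis (the atom axis)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def greedy_action(probs, z):
    """Return the action with the largest expected value, and all expectations.

    probs has shape (n_actions, n_atoms); each row sums to 1.
    """
    q_values = probs @ z
    return int(np.argmax(q_values)), q_values

z = make_support(-10.0, 10.0, 51)      # hypothetical grid
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 51))      # stand-in for a network output, 4 actions
probs = softmax(logits)
action, q_values = greedy_action(probs, z)
print(action, q_values)
```

Keeping the support fixed and learning only the probabilities means the network output size is `n_actions * n_atoms`, with one softmax per action.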



## Implementation of Algorithm

### Hyperparameter Definitions

### Multi-step Learning, Prioritized Replay Buffer

### Action Value Distribution Function (Neural Network)

### Bellman Update (Loss) Computation

### Training Loop

## Results

## References
[1] Hessel et al., Rainbow: Combining Improvements in Deep Reinforcement Learning, arXiv:1710.02298 <br>
[2] Bellemare et al., A Distributional Perspective on Reinforcement Learning, arXiv:1707.06887 <br>
[3] Schaul et al., Prioritized Experience Replay, arXiv:1511.05952