[//]: # (Image References)

[image1]: learning_curves.png "Learning Curves for Proximal Policy Optimization with Generalized Advantage Estimation"

# Report: Continuous Control Project

This work trains a simulated robotic arm to track a moving target position with its end effector. The environment has a continuous state space, representing the orientations and rates of the arm joints, and a continuous action space, representing the torques applied to those joints. To accelerate training, the environment contains 20 independent copies of the robotic arm that all call the same policy to generate actions, which yields 20 times as much training data per episode.

The reinforcement learning algorithm used in this work is [proximal policy optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) in an actor-critic form, with advantages computed by [generalized advantage estimation (GAE)](https://arxiv.org/pdf/1506.02438.pdf). Two deep neural networks are trained: one for the stochastic policy that maps the robot's state to actions, and one for the value function that estimates the expected return from a given state of the robot.


## Learning Algorithm

As previously mentioned, this work uses PPO for policy training and GAE for advantage estimation. This is an actor-critic method with two neural networks: one for the policy and one for the value function.

The algorithm works by rolling out a fixed policy for a fixed number of timesteps to generate a batch of training data. Each batch is composed of $NT$ timesteps of data from the $N=20$ independent robotic arm agents being executed for $T$ timesteps. After a rollout, the empirical returns, value estimates, and advantage estimates are calculated for every timestep of the training batch. The advantage estimates are used to compute the policy loss and update the policy parameters $\theta$. The returns and value estimates are then used to compute the value loss and update the value function parameters $w$.

#### Policy Loss Function
Proximal policy optimization learns by minimizing a surrogate loss function, $L^{\text{CLIP}+\text{S}}$, over the policy parameters $\theta$ based on a batch of training data. The loss function is of the form

\begin{equation}
L \left( \theta \right) = \mathbb{E} \left[ \frac{\pi(a_t \mid s_t, \theta) }{\pi_{\text{old}}(a_t \mid s_t)} A^{\text{GAE}}(s_t,a_t)\right] = \mathbb{E} \left[ \rho_t (\theta) A_t\right]
\end{equation}

\begin{equation}
L^{\text{CLIP}+\text{S}} \left( \theta \right) = - \mathbb{E} \left[ \min{\left( \rho_t (\theta) A_t, \text{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t \right)} + \beta S \right]
\end{equation}

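where $s_t$ is the state of the system, $a_t$ is the action the agent takes, $\pi_{\text{old}}$ is the fixed policy that generated the batch, $\epsilon$ is the clipping parameter that bounds the probability ratio $\rho_t$, and $S$ is an entropy bonus, weighted by the hyperparameter $\beta$, that encourages exploration.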
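
The clipped loss can be illustrated with a minimal sketch in plain Python. This is not the project's `clipped_surrogate_objective` (which operates on PyTorch tensors); the function name `clip_loss_sketch`, its arguments, and the default values of `epsilon` and `beta` are assumptions chosen for illustration. The clipped ratio multiplies the advantage, as in the PPO paper.

```python
import math


def clip_loss_sketch(new_log_probs, old_log_probs, advantages,
                     epsilon=0.2, beta=0.01, entropy=0.0):
    """Negated mean clipped objective plus entropy bonus (hypothetical helper).

    new_log_probs / old_log_probs: log pi(a_t | s_t) under the current and
    the rollout policy; advantages: A_t for the same timesteps.
    epsilon and beta defaults are illustrative, not the project's settings.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # rho_t(theta)
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Pessimistic bound: take the smaller of the two objectives
        total += min(ratio * adv, clipped * adv)
    mean_objective = total / len(advantages)
    # Minimizing the loss maximizes the objective and the entropy bonus
    return -(mean_objective + beta * entropy)
```

With a positive advantage, a ratio above $1+\epsilon$ earns no extra objective, so the update has no incentive to push the policy further; with a negative advantage the unclipped term is kept, so a harmful move away from the old policy is still penalized in full.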

#### Value Loss Function
The advantage estimate $A_t$ is computed as 

\begin{align}
A_t &= \sum_{l=0}^{\infty} \left( \gamma \lambda \right)^l \delta_{t+l}^{V} \\
\delta_{t}^{V} &= r_t + \gamma V_w (s_{t+1}) - V_w (s_t)
\end{align}

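where $r_t$ is the reward, $\gamma$ is the discount factor, $\lambda$ is the GAE parameter, and the value estimates $V_w (s_t)$ are generated by the value network. The value network is trained by minimizing, over the value parameters $w$, the squared error with respect to the empirical returns $G_t$ from a training batch; i.e. minimizing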

\begin{equation}
L^V (w) = \left(V_w (s_t) - G_t \right)^{2}
\end{equation}
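
The advantage recursion and the value loss can be sketched as follows for a single agent's rollout. This is an illustrative sketch rather than the project's implementation: the function names, the default `gamma` and `lam`, the truncation of the infinite sum at the end of the rollout with a bootstrap value `last_value`, and the averaging of the squared error over the batch are all assumptions. Episode terminations inside a rollout are ignored here.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE for one rollout of T steps (hypothetical helper).

    values[t] is V_w(s_t) for t = 0..T-1 and last_value is V_w(s_T),
    used to bootstrap the final TD residual.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def value_loss(values, returns):
    """Mean squared error between value estimates and empirical returns."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)
```

Computing the sum backwards in time turns the discounted sum of residuals into a single pass over the rollout. Setting `lam=0` reduces each advantage to the one-step TD residual, while `lam=1` sums all discounted residuals to the end of the rollout.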

### Execution Notebook: `Continuous_Control.ipynb`

The Jupyter notebook `Continuous_Control.ipynb` is effectively the "main" function for the code. It is responsible for establishing the environment (`unityagents.UnityEnvironment`), creating the agents that interact with the environment (`continuous_control_brain_agent.ActorCriticMind`), and executing the training for various agents (`continuous_control_brain_agent.train_actor_critic_mind`).

Alternatively, one can run the training via command line using 
```bash
python continuous_control_training_script.py --exp-name my_experiment --gae-lambda 0.9
```

### Source Code and Model Architecture 

The Python module `continuous_control_brain_agent.py` contains most of the source code run by `Continuous_Control.ipynb`. It provides a class for defining trainable agents (`ActorCriticMind`), functions for stepping through the training process (`train_actor_critic_mind` and `collect_trajectories`), and a function for computing the policy loss (`clipped_surrogate_objective`).

The `ActorCriticMind` class defines objects for storing and training the agents that interact with the environment. The policy, which selects actions from a continuous action space given an observation of the environment, is stochastic. The output layer of the policy network generates the parameters of a normal distribution for each dimension of the action space. These distributions are then sampled to select the action that the agent applies to the environment. The policy is stored in `ActorCriticMind.policy_network`, an object of type `network_models.GaussianActorNetwork` defined in the `network_models.py` module; it is a deep neural network defined using PyTorch.
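
As an illustration of how an action is drawn from such a policy, the sketch below samples each action dimension from an independent normal distribution and returns the log-probability of the sample, which the probability ratio in the policy loss requires. The function name and the assumption that the means and standard deviations are already available as plain lists are hypothetical; how `GaussianActorNetwork` actually parameterizes the distribution is not described here.

```python
import math
import random


def sample_gaussian_action(means, stds, rng=random):
    """Sample an action and its log-probability (hypothetical helper).

    means / stds: one entry per action dimension. rng is the random module
    or any random.Random instance (useful for reproducible sampling).
    """
    action = [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]
    # Dimensions are independent, so the joint log-density is the sum
    # of the per-dimension normal log-densities.
    log_prob = sum(
        -0.5 * ((a - mu) / sigma) ** 2
        - math.log(sigma)
        - 0.5 * math.log(2.0 * math.pi)
        for a, mu, sigma in zip(action, means, stds)
    )
    return action, log_prob
```

For this environment, a call such as `sample_gaussian_action([0.0] * 4, [1.0] * 4)` would produce one four-dimensional torque vector for a single arm.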

Similarly, the `ActorCriticMind.value_network` object, of type `network_models.DeepNetwork`, stores a deep neural network for computing the estimated expected return given an environment state.

The policy network takes inputs of size 20x33 (i.e. the number of agents by the observation space size of the environment) and outputs vectors of size 4 (i.e. the action space size of the environment). The value network takes in the same inputs and outputs a single floating point value. Both networks are deep neural networks with three fully-connected hidden layers of size 256, 64, and 32. Each layer uses a ReLU activation function.
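
To give a sense of the model size, the short calculation below counts the trainable parameters of the value network. It assumes that every fully-connected layer has a bias term and that a final linear layer maps the last hidden layer of size 32 to the scalar output; neither detail is stated above, and the helper name is hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for a stack of fully-connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Observation size 33, hidden layers 256/64/32, scalar value output
value_net_sizes = [33, 256, 64, 32, 1]
print(count_dense_parameters(value_net_sizes))  # prints 27265
```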

The `train_actor_critic_mind` function calls the `collect_trajectories` function, which rolls out the policy in the environment for a predetermined number of steps or until the end of an episode. It then uses the resulting batch of data to train the agent's policy and value networks: the batch is broken into randomized minibatches, the policy and value losses are computed for each minibatch, and the process is repeated over multiple epochs per batch.
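
The minibatch scheme described above can be sketched as a generator of sample indices. This is a minimal sketch, not the code in `train_actor_critic_mind`: the function name and its parameters are hypothetical, and it assumes the batch is reshuffled at the start of every epoch.

```python
import random


def iterate_minibatches(batch_size, minibatch_size, epochs, rng=random):
    """Yield lists of sample indices, reshuffling the batch each epoch."""
    for _ in range(epochs):
        indices = list(range(batch_size))
        rng.shuffle(indices)
        # The last minibatch of an epoch may be smaller than the rest
        for start in range(0, batch_size, minibatch_size):
            yield indices[start:start + minibatch_size]
```

In a training loop, each yielded index list would select the states, actions, old log-probabilities, advantages, and returns used to compute the policy and value losses for one gradient step.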


### Data Files

### Hyperparameters

## Plot of Rewards

![PPO-GAE Learning Curves][image1]

## Ideas for Future Work