# An Investigation of Deep Reinforcement Learning Algorithms for Quadrotor Flight Control
### S. Morrison, A. Fisher, F. Zambetta

#### Abstract:
Deep reinforcement learning (RL) is an exciting branch of the wider AI discipline that has recently led to advances in computer games AI, and to robotics controllers that are trained end-to-end from scratch. To date, there have been numerous applications of this technology to learning quadrotor flight controllers, though many gaps in the body of knowledge remain. To our knowledge, no study has compared the effectiveness of state-of-the-art algorithms on continuous-action flight tasks in a way that includes both a broad sweep of current algorithms and an analysis of the difference between online and offline methods. Furthermore, no work that we are aware of has attempted to quantify the effect that including a traditional controller in the architecture has on learning and converged performance. Our experience working with deep RL algorithms has also shown that the action-selection frequency of the policy can alter the convergence horizon of discounted rewards, which can in turn affect a controller's ability to learn. We test this effect in order to determine if there is a "best" action selection frequency. Through the course of these experiments, we demonstrate flight controllers that learn tasks previously demonstrated in the literature (hover and waypoint navigation), as well as tasks that other works have had difficulty with (landing), which we show traditional algorithms are able to learn without modification. Finally, we conclude with a series of "best practice" recommendations for researchers looking to apply these algorithms to their own research, and describe some common pitfalls when trying to get certain algorithms working.

## 1. Introduction
Deep reinforcement learning (deep RL) has recently seen widespread acclaim for its ability to learn sophisticated control strategies in computer games []. In tandem with these breakthroughs in learning games-based AI, there has been a flurry of research in end-to-end learning of neural network robotics controllers using black-box optimization -- typically deep RL-based -- with notable developments in model-based [], model-free [], and hybrid methodologies. Additionally, exciting work has been done in meta-learning for few- (or even zero-) shot transfer of control policies [], and theoretical groundwork that links discrete-action algorithms that are an offshoot of tabular methods for games, with stochastic policy search methods used in continuous action control tasks (typically robotics) [].

Alongside this explosion in research, questions have been raised about the scholarship of many previous studies, including the difficulty of replication [], and the inclusion of ad hoc additions to algorithms that appear to improve their performance, but the effect of which is not tested []. A consistent and troubling conclusion of these previous studies on replication of RL algorithms is that implementation matters, and that different implementations of the same algorithm can often provide different results for the same task.

As researchers in the field of unmanned aerial systems, we find these developments in deep RL particularly exciting, because the pairing of reinforcement learning with deep neural nets for sensor fusion and control could potentially make headway in areas where traditional controllers typically struggle. Indeed, early research has already demonstrated notable successes in exploring and mapping unstructured environments [], learning robust controllers for aircraft recovery, and even quadrotor racing []. Further applications that we look forward to might be producing unmanned aircraft capable of intelligently fighting fires, or herding sheep and cattle as is typically done with manned helicopters today.

Despite these notable early applications, however, there seems to be a dearth of fundamental groundwork on how deep RL algorithms should be best applied to autonomous unmanned systems, particularly quadrotors. It's apparent that such algorithms *can* be applied to quadrotors rather successfully, but to date, no study that we are aware of looks quantitatively at the effects of architecture (e.g. including a traditional controller in the loop, versus outputting rotor commands directly), action selection frequency, or even broad algorithm type (e.g. online versus offline) on learning speed and performance of RL-based flight controllers.

Within this broader context, the goal of this study is to explore the effects of these different variables on the learning speed and final performance of RL-based quadrotor flight controllers. In particular, we wish to answer the research questions:

1. What types of algorithm perform best in direct RPM output control? Surveys such as [] indicate that some algorithms can outperform other, even theoretically "better", algorithms on some tasks. It warrants testing to determine whether the algorithms that have been applied to RL learning of quadrotor flight controllers are indeed the "best" algorithms for the domain.
2. Does algorithm performance significantly change when the agent is outputting a set of rates for a PD controller rather than a direct RPM command? Furthermore, is this change in performance uniform across online and offline algorithms?
3. What is the effect of action selection frequency on the performance of the algorithm? This is a general parameter that is glossed over in most RL papers, though our experience indicates that this is quite an important factor in training.
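
As a back-of-envelope illustration of the third question (a sketch under stated assumptions, not a result from our experiments): with a discount factor $\gamma$, a reward $k$ steps ahead is weighted by $\gamma^{k}$, so the effective horizon is roughly $1/(1-\gamma)$ steps. If the policy selects an action every $\Delta t$ seconds, the same $\gamma$ corresponds to a different horizon in physical time:

$H_{\text{steps}} \approx \frac{1}{1-\gamma}, \qquad H_{\text{seconds}} \approx \frac{\Delta t}{1-\gamma}$

For example, with an illustrative value of $\gamma = 0.99$, selecting actions at 100 Hz ($\Delta t = 0.01$ s) gives a horizon of roughly 1 second, while selecting actions at 10 Hz ($\Delta t = 0.1$ s) gives roughly 10 seconds. The symbols $H_{\text{steps}}$, $H_{\text{seconds}}$, and $\Delta t$ are introduced here for illustration only.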

We start with the working hypotheses that:

1. Offline algorithms learn quadrotor flight control tasks more effectively than online algorithms, even compared to current state-of-the-art online algorithms such as SAC.

2. Having the policy output errors for a traditional flight controller will result in faster learning for simple tasks.

3. There is an optimal action selection frequency for training quadrotor flight controllers, though controllers trained at this action selection frequency can be run at higher frequencies with no adverse effects.

Within the context of the broader body-of-knowledge on RL, we aim to explore these questions while being as consistent as possible in terms of algorithm and training conventions, such that we are comparing like-for-like. This study is broken down into the following sections: Section 2 covers related work in more depth, including previous works, their limitations, and gaps in the body of knowledge; Section 3 goes through background knowledge of both deep reinforcement learning and quadrotor flight control; Section 4 covers the methodology of our experiments, including the learning environment, the tasks we train controllers for, and how we vary conditions for learning; Section 5 goes through the implementation of our algorithms, PD flight controller, and basic unit tests for each; Section 6 is where we run our experiments and collect data that is presented in Section 7. We conclude with a few remarks on what we found to be the best practice for applying these algorithms to quadrotor flight control, and directions of further research.


## 2. Related Work
[] used MPC-Guided Policy Search to learn to fly a quadrotor in Gazebo through unstructured environments using LIDAR. This was done using a heavily modified fork of RotorS [], with a custom control node running the policy search algorithm, and using a model predictive flight controller to guide training for the policy. This research proved the efficacy of using RL to train continuous-action controllers for quadrotors, and was one of the first to use a neural network to directly output RPM commands. Testing whether or not this was the best configuration for an RL-trained flight controller was outside the scope of this study. Comparison was made to _, and it was shown that the MPC-guided policy search resulted in better performance, though no broader sweep of algorithms and architectures was done.

[] used a method similar to guided policy search to train a quadrotor controller to avoid collisions in obstacle-filled environments. Whereas GPS uses mirror-descent to optimize the policy and select actions in the MPC loop, [] directly trained the policy on targets from a PD controller, effectively turning the task into a supervised learning problem. In order to overcome the relative sparsity of catastrophic events, [] used a novel sampling strategy to overexpose the agent to collisions, resulting in a more risk averse policy. The authors successfully tested the learned policy on the CrazyFlie nanoquad using a native C-library for the network weights, and a Vicon camera system for state estimation. The learned policies were able to demonstrate the desired behaviour, despite the limitations of such a small platform.

[] used a custom natural policy gradient method to recover a tumbling aircraft and return it to a target point. To do this, the authors used a novel "junction" exploration strategy, and wrote a very simple flight dynamics model -- omitting the effects of drag -- to demonstrate that they could still learn robust control policies using deep RL. The policies were successfully tested on a real-world aircraft using a Vicon camera system for state estimation. Though this study tested the proposed algorithm against DDPG and TRPO, the results were presented according to wall-clock time, which is non-standard in the RL literature (though an important metric on its own). By the authors' own admission, the learning efficiency of the proposed algorithm was similar to TRPO when measured on the basis of timesteps. Furthermore, the proposed algorithm used a tuned PD controller to help stabilize learning, but this controller wasn't used during the final tests. It's not clear if the PD controller was also used to help TRPO and DDPG during training, which could have had a substantial impact on the outcome of the experiments. It's not immediately obvious, for example, that the method used to guide training of an offline algorithm (TRPO and the proposed algorithm) would translate to DDPG, which uses a memory buffer to train the agent.

Our work is distinct from these previous studies in that our explicit goal is to explore the combination of algorithm and architecture, for common state-of-the-art RL algorithms, and different combinations of controller-in-the-loop. Most of these previous works used some form of traditional flight controller to guide the training process, or to fly the aircraft outright; given that quadrotors are an inherently unstable platform, and that even the smallest error can result in catastrophic loss of the aircraft, this concession is entirely unsurprising.



## 3. Background
### 3.1 Deep Reinforcement Learning
Deep reinforcement learning combines neural network function approximators with reinforcement learning in order to learn controllers (or agents) that are capable of performing sophisticated control tasks, including playing computer games using vision as the state input, or exploring unknown environments. Rather than relying on human knowledge and heuristics for a task, the engineer provides a reward signal that the agent then tries to maximize within an environment. A basic diagram of this setup (from []) has been provided below.

![rl_diagram](https://docs.google.com/uc?export=download&id=1IVYva4U-J9C_XtfRngJNQ-XUTpgeOV1l)

Formally, we define a tuple $(s, a, r)$ for states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, and rewards $r \in \mathcal{R}$. Our agent takes a trajectory $\langle s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T \rangle$ by selecting actions using a policy $\pi$. The agent's goal is to maximize the expected return under the policy:

$\mathbb{E}_{\pi}\left[\sum_{k=t}^{T-1}\gamma^{k-t}r_{k}\right]$

in which we denote the state and state-action value functions for the policy $\pi$ as being:

$V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{T-1}\gamma^{k-t}r_{k}|s_t = s\right]$

$Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{T-1}\gamma^{k-t}r_{k}|s_t = s, a_t = a\right]$

Where the following useful relations are defined:

$Q^{\pi}(s_t,a_t) = r_t + \gamma V^{\pi}(s_{t+1})$

$V^{\pi}(s_t) = \sum_{a\in\mathcal{A}}\pi(a|s_t)Q^{\pi}(s_t,a)$

$A^{\pi}(s_t,a_t) = Q^{\pi}(s_t,a_t) - V^{\pi}(s_t) = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$
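
The definitions above translate directly into a few lines of code. The following is a minimal sketch in plain Python, assuming a single finite trajectory and a list of value estimates of length $T+1$; the function names are hypothetical and are not taken from our code base.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k=t}^{T-1} gamma^(k-t) * r_k for every t in the trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def one_step_advantages(rewards, values, gamma):
    """Return A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0), ..., V(s_T), so it is one element longer than `rewards`.
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]
```

The backwards recursion in `discounted_returns` avoids recomputing the discounted sum from scratch at every timestep.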

Within RL, there are several broad frameworks for learning how to take the best action in an environment, given some reward function, but for the purposes of this study, we focus on online and offline policy gradient methods for learning continuous actions. These methods share the common theme of parameterizing a policy $\pi_\theta : \mathcal{S} \rightarrow \mathcal{A}$, that maps a state input (given by the environment) to an action output, but differ in how they train the policy. Offline methods use a variant of the score function estimator known as REINFORCE:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\left[\sum_{t}^{T-1}\nabla_{\theta}\log{\pi_\theta(a_t|s_t)}\left(Q^{\pi}(s_t, a_t)-V_{\phi}^{\pi}(s_t)\right)\right]$

Where $Q^{\pi}(s_t, a_t)$ is the empirical value function that is obtained by rolling out one (or multiple) trajectories, and $V_{\phi}^{\pi}(s_t)$ is a neural network baseline parameterized by $\phi$, and trained on empirical rollouts. Since we are taking a Monte Carlo estimate of the gradient, $V_{\phi}^{\pi}(s_t)$ acts as a control variate to reduce the variance of the gradient estimate, improving the efficiency of learning. Modern innovations include the use of Generalized Advantage Estimation (GAE) to take an exponentially weighted average of multi-step advantage estimates [], the use of importance sampling in the gradient estimate to take multiple update steps while correcting for the new policy distribution [], and constraining the update to a trust region in order to prevent the policy from moving into a part of the space where performance collapses []. Common offline methods that see application are REINFORCE (with or without GAE), Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO).
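
To make the role of Generalized Advantage Estimation concrete, the sketch below computes it for a single trajectory using the standard backwards recursion over one-step TD residuals. It is an illustrative example in plain Python rather than our training code; the function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` holds V(s_0), ..., V(s_T), one element more than `rewards`.
    Uses delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and
    A_t = delta_t + gamma * lam * A_{t+1}.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step advantage, while `lam=1` with a zero baseline recovers the discounted Monte Carlo return, so `lam` trades bias against variance between these two extremes.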

In contrast, online methods typically take advantage of the fact that neural networks are differentiable to implement a gradient path using a learned state-action value function, and update the policy using: 

$\nabla_\theta J(\theta) = \mathbb{E}_{a\sim\pi, s\sim\mathcal{D}}\left[\nabla_{\theta}Q_{\phi}^{\mu}(s, \pi_\theta(s))\right]$

In general, these methods maintain a replay buffer of transitions, which they sample from randomly in order to train the value function, and train at every time step rather than every batch of trajectories. This is done by feeding the policy's action through the learned value function, and then pushing the gradient back into the policy to update parameter weights in the direction that maximizes this learned value function. These methods are required to bootstrap their value estimate, and as such, they can be unstable. To mitigate this, they require at least one (sometimes more) target network to smooth updates of the state-action value function. As they learn a value function based on transitions taken under previous policies, these methods are said to be off-policy, as opposed to REINFORCE-based methods which update based on the value of the current policy (and are thus on-policy).
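
The two supporting mechanisms described above, a replay buffer and a slowly updated target network, can be sketched as follows. This is a minimal illustration in plain Python: the class and function names are hypothetical, parameters are represented as flat lists of floats rather than network weights, and the use of Polyak averaging with `tau=0.005` is an assumption (it is a common choice in DDPG-style agents, but other target update schemes exist).

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]
```

A small `tau` makes the bootstrapped target change slowly, which is what smooths the updates of the state-action value function.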

Popular online algorithms include the Deep Deterministic Policy Gradient (DDPG), TD3, and the Soft Actor-Critic (SAC). We also include the Stochastic Value Gradient in our study, since it provides a conceptual bridge between DDPG and SAC.


### 3.2 Quadrotor Flight Control
Our quadrotor is a standard plus-configuration vehicle that is modeled after the Iris in a NED axis system as shown below: 

![quad](https://docs.google.com/uc?export=download&id=1_ym3WoM5DCPPq0484me-tdLHpm2LF8p_)

Our equations of motion are:

$\begin{bmatrix}
		0 \\
		\mathbf{\dot{v}}
	\end{bmatrix} = \frac{1}{m}
	\begin{bmatrix}
		0 \\
		\mathbf{F}_{b}
	\end{bmatrix}
	+\mathbf{q} \otimes 
	\begin{bmatrix}
		0 \\
		\mathbf{G}_{i}
	\end{bmatrix} \otimes
	\mathbf{q}^{-1}
	-\begin{bmatrix}
		0 \\
		\mathbf{\omega} \times \mathbf{v}
	\end{bmatrix}$


$\mathbf{\dot{\omega}} = \mathbf{J}^{-1}(\mathbf{Q}_{b}-\mathbf{\omega}\times \mathbf{J}\mathbf{\omega})$

$\begin{bmatrix}
		0 \\
		\mathbf{\dot{x}}
	\end{bmatrix} = \mathbf{q}^{-1} \otimes 
	\begin{bmatrix}
		0 \\
		\mathbf{v}
	\end{bmatrix} \otimes
	\mathbf{q}$

$\mathbf{\dot{q}} = -\frac{1}{2}\begin{bmatrix}
		0 \\
		\mathbf{\omega}
	\end{bmatrix} \otimes
	\mathbf{q}$
    
    
Where $\mathbf{\dot{v}}$, $\dot{\omega}$,  $\mathbf{v}$, and $\mathbf{\omega}$ are the linear and angular accelerations and velocities expressed in the body frame,  $\mathbf{F}_{b}$ and $\mathbf{Q}_{b}$ are the external (thrust and aerodynamic) forces and moments acting on the aircraft and expressed in the body frame, $\mathbf{G}_{i}$ is the gravity vector expressed in the inertial frame, and $\mathbf{J}$ is the aircraft's inertia tensor. The unit quaternion $\mathbf{q}$ encodes the aircraft's attitude, with $\mathbf{q}^{-1}$ being its inverse, and $\otimes$ denoting the Hamilton product.
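
The quaternion operations used in the equations of motion can be sketched in a few lines. The example below is illustrative only: it assumes a scalar-first $(w, x, y, z)$ ordering, which is a choice made for this sketch, and the function names are hypothetical. For a unit quaternion the inverse is equal to the conjugate, which is what the code relies on.

```python
def hamilton(p, q):
    """Hamilton product p ⊗ q for quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def conjugate(q):
    """Conjugate of q; equal to the inverse when q has unit norm."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def sandwich(q, v):
    """Vector part of q ⊗ [0, v] ⊗ q^{-1} for a unit quaternion q."""
    _, x, y, z = hamilton(hamilton(q, (0.0, v[0], v[1], v[2])), conjugate(q))
    return (x, y, z)
```

Swapping `q` for `conjugate(q)` in `sandwich` gives the reverse mapping, which corresponds to the $\mathbf{q}^{-1} \otimes [0, \mathbf{v}] \otimes \mathbf{q}$ form used in the position equation.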

Motor thrusts and torques are modeled as:

$\mathbf{F}_{T}
	= k_{T}
	\begin{bmatrix}
    	0 \\
		0 \\
		\sum_{i=1}^{4} \Omega_{i}^{2}
	\end{bmatrix}
$

$\mathbf{Q}_{T}
	= 
	\begin{bmatrix}
    		l k_{T}(\Omega_{2}^{2}-\Omega_{4}^{2}) \\
		l k_{T}(\Omega_{1}^{2}-\Omega_{3}^{2}) \\
		k_{Q}(-\Omega_{1}^{2}+\Omega_{2}^{2}-\Omega_{3}^{2}+\Omega_{4}^{2})
	\end{bmatrix}
$

Where $\Omega_{i}$ is the angular velocity of motor $i$, $l$ is the arm length between the centre-of-mass and the centre-of-thrust, and $k_{T}$ and $k_{Q}$ are the thrust and torque coefficients of the motor, respectively. Aerodynamic effects are modeled using two positive scalar coefficients -- a drag coefficient $k_{D}$ acting on the linear velocity, and an aerodynamic damping coefficient $k_{M}$:

$\mathbf{F}_{A}=-k_{D}\mathbf{v}|\mathbf{v}|$

$\mathbf{Q}_{A}=-k_{M}\mathbf{\omega}$

With $|\mathbf{v}|$ denoting the Euclidean norm of the linear velocity vector. Body forces and moments are found using $\mathbf{F}_{b}=\mathbf{F}_{T}+\mathbf{F}_{A}$, and $\mathbf{Q}_{b}=\mathbf{Q}_{T}+\mathbf{Q}_{A}$. Motor response is modeled as a first order linear differential equation of the form:

$\dot{\Omega} = -k_{\Omega}\left(\Omega-\Omega_{c}\right)$

Where $\Omega$ is the current angular velocity of the motors, $\Omega_c$ is the commanded angular velocity, and $k_{\Omega}$ is a damping coefficient. We integrate the equations of motion using a standard RK4 scheme. Our simulation assumes deterministic dynamics, and doesn't model more advanced phenomena such as ground effect, indoor aerodynamic effects, blade flapping, or vortex ring state.
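
As an illustration of the integration scheme, the sketch below applies a classical RK4 step to the first-order motor response and compares it against the closed-form solution $\Omega(t) = \Omega_c + (\Omega_0 - \Omega_c)e^{-k_{\Omega}t}$. It is a scalar toy example, not our simulation code; the function names and any numerical values used with it are assumptions.

```python
import math


def rk4_step(f, y, t, dt):
    """Advance dy/dt = f(t, y) by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, y + 0.5 * dt * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def simulate_motor(omega0, omega_cmd, k_omega, dt, steps):
    """Integrate d(omega)/dt = -k_omega * (omega - omega_cmd) for `steps` steps."""
    def motor_lag(t, omega):
        return -k_omega * (omega - omega_cmd)

    omega, t = omega0, 0.0
    for _ in range(steps):
        omega = rk4_step(motor_lag, omega, t, dt)
        t += dt
    return omega


def analytic_motor(omega0, omega_cmd, k_omega, t):
    """Closed-form solution of the same first-order lag, for comparison."""
    return omega_cmd + (omega0 - omega_cmd) * math.exp(-k_omega * t)
```

The same `rk4_step` pattern extends to the full state vector by replacing the scalar `y` with an array type that supports addition and scalar multiplication.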

The model itself is built in MATLAB Simulink, which is then used to generate C++ code. We compile this code to a Python module so that we can use our simulation in conjunction with standard neural network libraries (in this case, PyTorch) by extending the OpenAI Gym framework. All code for this is provided at https://github.com/seanny1986/gym-aero.
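
For readers who want to check the force and moment model without the Simulink toolchain, the thrust, torque, and aerodynamic terms above can be written compactly in Python. This is an illustrative sketch and is not the generated code in the repository; the function and argument names are hypothetical, and the rotor numbering follows the moment equation for $\mathbf{Q}_{T}$.

```python
import math


def motor_forces_and_moments(rotor_speeds, k_t, k_q, arm):
    """Thrust vector F_T and moment vector Q_T from the four rotor speeds."""
    w1, w2, w3, w4 = (w * w for w in rotor_speeds)
    thrust = (0.0, 0.0, k_t * (w1 + w2 + w3 + w4))
    moments = (
        arm * k_t * (w2 - w4),
        arm * k_t * (w1 - w3),
        k_q * (-w1 + w2 - w3 + w4),
    )
    return thrust, moments


def aero_forces_and_moments(v, omega, k_d, k_m):
    """Drag F_A = -k_D * v * |v| and damping moment Q_A = -k_M * omega."""
    speed = math.sqrt(sum(c * c for c in v))
    drag = tuple(-k_d * c * speed for c in v)
    damping = tuple(-k_m * c for c in omega)
    return drag, damping
```

With equal rotor speeds the moments cancel and only the thrust term remains, which is a convenient sanity check for the sign conventions.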

Since quadrotors are inherently unstable, they require a control system in order to maintain flight. Though a number of different control methods see application in quadrotors, by far the most common is PD control. In standard PID control, actions are taken using:

$u(t) = K_p e(t) + K_i \int_{t'}e(t')dt' + K_d \frac{d}{dt}e(t)$

where $e(t)$ is the error from a desired state at time $t$, and $K_p$, $K_i$ and $K_d$ are known as the proportional, integral, and derivative gains, respectively. Often, an alternative representation can be used that gives the $K_i$ and $K_d$ gains an intuitive physical meaning:

$u(t) = K_p \left(e(t) + \frac{1}{T_i} \int_{t'}e(t')dt' + T_d \frac{d}{dt}e(t)\right)$

Intuitively, this controller takes actions in proportion to the error, while also taking the rate of change and integral of the error into account, to either accelerate or smooth the response of the vehicle. Depending on the values of the gains, the controller can overshoot or undershoot the target, so it needs to be tuned such that the closed-loop response is close to critically damped. This can be done by hand using the Ziegler-Nichols method, or general rules of thumb, though recently, black-box optimization of the values has become more common [].
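
A discrete-time version of the control law above can be sketched as follows. This is a minimal illustration, assuming rectangular integration of the error and a backward-difference derivative; the class name `PID` and its interface are hypothetical and do not describe the controller used in our experiments.

```python
class PID:
    """Discrete PID controller: u = Kp * e + Ki * integral(e) + Kd * de/dt."""

    def __init__(self, kp, ki=0.0, kd=0.0):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        # Rectangular (Euler) approximation of the integral term.
        self.integral += error * dt
        # Backward difference; zero on the first call, when no history exists.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Leaving `ki` at its default of zero turns the same class into a PD controller.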

In quadrotors, it is common to use a cascading hierarchy of PID controllers in order to achieve the desired behavior. The highest level are waypoint-based controllers that take an error in position, and output a desired attitude setting. This is then fed into an attitude controller, which calculates an error using the output of the position controller, and returns a desired rate. Finally, the desired rate is given to a rate controller, which outputs a set of desired body moments (with rate-based controllers being the lowest level of flight control). We provide a basic block diagram of this process below:

![block_diagram](https://docs.google.com/uc?export=download&id=1nj4XfDXX5J1F33to2iOJVGX66oQQ_9Gi)

We can relate control forces and moments with the RPM of the vehicle using:

$\begin{bmatrix} F_z \\ \tau_{\phi} \\ \tau_{\theta} \\ \tau_{\psi} \end{bmatrix} = \begin{bmatrix} k_T & k_T & k_T & k_T \\ 0 & l k_T & 0 & -l k_T \\ l k_T & 0 & -l k_T & 0 \\ -k_Q & k_Q & -k_Q & k_Q \end{bmatrix}\begin{bmatrix} \Omega_1^2 \\ \Omega_2^2 \\ \Omega_3^2 \\ \Omega_4^2 \end{bmatrix}$

Once we have the desired rates, we can use PID control to obtain a set of moments and forces that are then translated to motor RPMs by inverting the coefficient matrix. Since the integral term can result in a phenomenon known as "wind-up" -- in which the accumulated error keeps growing while the actuators are saturated, until the controller overshoots or goes unstable -- it is common in quadrotor flight control to set this gain to zero, resulting in PD control.
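
For a plus-configuration vehicle the coefficient matrix can be inverted in closed form. The sketch below does this using the sign convention of the motor torque model $\mathbf{Q}_{T}$ given earlier, and includes the forward map so that the round trip can be checked. The function names are hypothetical, and clamping negative squared speeds to zero is an assumption added for this example rather than a description of our controller.

```python
import math


def forces_from_rotors(rotor_speeds, k_t, k_q, arm):
    """Forward map: rotor speeds -> (F_z, tau_phi, tau_theta, tau_psi)."""
    w1, w2, w3, w4 = (w * w for w in rotor_speeds)
    return (
        k_t * (w1 + w2 + w3 + w4),
        arm * k_t * (w2 - w4),
        arm * k_t * (w1 - w3),
        k_q * (-w1 + w2 - w3 + w4),
    )


def rotors_from_forces(f_z, tau_phi, tau_theta, tau_psi, k_t, k_q, arm):
    """Inverse map: desired force and moments -> four rotor speeds."""
    base = f_z / (4.0 * k_t)
    yaw = tau_psi / (4.0 * k_q)
    roll = tau_phi / (2.0 * arm * k_t)
    pitch = tau_theta / (2.0 * arm * k_t)
    squared = (
        base - yaw + pitch,  # rotor 1
        base + yaw + roll,   # rotor 2
        base - yaw - pitch,  # rotor 3
        base + yaw - roll,   # rotor 4
    )
    # Assumption for this sketch: infeasible (negative) demands are clamped.
    return tuple(math.sqrt(max(s, 0.0)) for s in squared)
```

Writing the inverse explicitly avoids a general matrix inversion at every control step, at the cost of being tied to this particular rotor layout.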

For the purposes of this study, we implemented a PD controller in Python, and tuned it for the aircraft using black-box optimization. Since we are primarily interested in using deep reinforcement learning for low-level flight control, we only test a PD vertical velocity and angular rate controller. Extending this to attitude or position-based control is fairly straightforward, and a direction of future research.


## 4. Methodology

### 4.1 Algorithms
We do a broad sweep of current popular algorithms, focusing primarily on the difference in performance between offline, on-policy REINFORCE-based algorithms (REINFORCE, PPO, and TRPO) and online, off-policy algorithms (DDPG, TD3, SVG(0), and SAC). Previous works have found that DDPG tends to perform poorly in this domain -- we would like to determine if this is because DDPG is unsuited to this task, or if it is an inherent property of online algorithms. To maintain the consistency of our comparison, we use our own implementations (see the below code), and benchmark them in a series of unit tests to ensure that they learn as intended.

For the sake of simplicity, we make some modifications to the algorithms as presented in their original form, which we list below:

1. Our implementation of REINFORCE uses importance sampling to correct the gradient estimate over multiple update steps. This is non-standard in most implementations of REINFORCE that we have seen online, but is necessary for multiple updates per iteration, as it corrects for the fact that the return of the trajectory is no longer on-policy after the first update.

2. We use a first order update for the critic in TRPO. A more faithful implementation would use L-BFGS to update the critic as was done in []. We haven't explicitly tested to determine what performance drop (if any) this causes, though our implementation performs roughly on-par with our implementation of PPO. Furthermore, this change allows us to more effectively utilize the GPU during training.

3. We don't normalize the input into the actor and critic networks. This is done in [] as it helps improve the speed of learning, and seems to be necessary for achieving state-of-the-art results in Mujoco locomotion tasks (at least for REINFORCE-based algorithms). However, our tasks are all bound within a fixed region based on a local reference frame. As a result, our inputs are all zero mean-centered by definition, if not necessarily unit Normal. For this reason, we maintain that normalizing the input is unnecessary in our case, though we recognize it may have an impact on performance in the unit tests.

4. We don't use target actors for our online agents. We haven't quantified what negative impact (if any) this change may have had on the algorithms. However, our unit tests show that the algorithms learn more or less as expected. Other research [] has also found that online algorithms still learn effectively even without target actors, and SAC doesn't use a target actor at all.

5. We don't use BatchNorm or LayerNorm in any of our online agent implementations. The reason for this is that neither method has been shown to improve the performance of DDPG-like agents on all tasks, and can even hinder performance in some cases []. We believe that the effect of these normalization schemes on learning quadrotor control tasks is somewhat tangential to our given research goal, and should be tested separately so as to minimize their effect.

6. We rely on PyTorch's default weight initialization scheme. This is in contrast to some other implementations, which use orthogonal initialization []. Previous works have found orthogonal initialization to have the lowest impact on RL performance, and so we opted not to include it for the sake of simplicity.

7. Where stochastic policies are used, we stick to independent Normal distributions over each control dimension (i.e. a diagonal covariance matrix). It is not clear to us that quadrotor motors should be asymmetrically correlated with one another. The exceptions we can think of are in situations with wind and other aerodynamic effects resulting from rotations, but we neglect wind in our experiments, and aerodynamic effects are assumed to be small.

8. Our stochastic policies output both $\mu(x)$ and $\log\sigma(x)$ of the distribution. In many domains, it is common for $\log\sigma$ to be a fixed parameter that is annealed, and not a function of the state input, as it can help improve learning performance. We counteract this by providing an entropy bonus to the agent.

9. Our version of SAC does not use a tanh squashing function as in []. This is done to ensure a fair comparison across different algorithms, by ensuring that the same stochastic policy is used. The alternative would be to test all stochastic policies with a tanh squashing function, using the change of variables formula to calculate the log probability.

10. SAC uses reward scaling as a temperature parameter to help train the network. In order to maintain consistency in our benchmarks, we apply the same reward scaling value to all algorithms -- in this case, a value of 5, as it has been shown empirically to be a good default value [].

11. Our version of SAC does not use an adaptive alpha update. We found the adaptive version of the algorithm to be sensitive to hyperparameter settings, and difficult to replicate.

Most of these changes are made in the interests of maintaining consistency as much as possible between the different algorithms, and keeping the code clean and simple to help minimize the chance of errors, and so that others are able to validate our work. Many of the omitted features result in improvements (or otherwise) that are domain-specific, or make fair comparisons difficult -- we believe that their effect should be separately measured on quadrotor control tasks to determine if they do in fact improve learning and performance in this domain.
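
To illustrate the stochastic policy parameterization described in the list above (independent Normal distributions with a state-dependent $\mu$ and $\log\sigma$, plus an entropy bonus), the following sketch computes the log probability and entropy of a diagonal Gaussian in plain Python. The function names are hypothetical and the code is not taken from our implementations, which operate on PyTorch tensors.

```python
import math


def gaussian_log_prob(action, mu, log_sigma):
    """Log density of `action` under independent Normals, summed over dimensions."""
    total = 0.0
    for a, m, ls in zip(action, mu, log_sigma):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def gaussian_entropy(log_sigma):
    """Entropy of a diagonal Gaussian; depends only on log_sigma."""
    return sum(ls + 0.5 * math.log(2.0 * math.pi * math.e) for ls in log_sigma)
```

Because the entropy grows linearly in $\log\sigma$, adding it to the objective as a bonus discourages the policy from collapsing its variance too early.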

Finally, since most of our tasks involve some form of goal, we make use of Universal Value Function Approximation [] in our value functions. That is, our value functions are also a function of a goal state $g$, such that our functions become $Q^{\pi}(s,g,a)$, $V^{\pi}(s,g)$. This allows us to modify the goal state and produce policies that are able to generalize to any goal within a local horizon.

The following section outlines each environment in more detail.

### 4.2 Environments
We test the above algorithms on the following suite of tasks. These tasks have been chosen as they represent distinct levels of increasing difficulty. Where possible, we have tried to keep the observation space and reward structure as consistent as possible. Common elements between all environments include:

1. All observations are provided in the local frame of reference. The reason for this is to ensure that the state representation is both translationally and rotationally invariant. This is not explicitly necessary (see [], for example), though it has the advantage of not requiring that we mean-center and normalize the input to the networks. Since normalized input has been shown to be beneficial for training neural networks using reinforcement learning [], we believe this should be beneficial -- furthermore, it standardizes the state input across all of the algorithms, allowing for a better comparison, and more meaningful conclusions.
2. We use quadratic reward functions. The policy gradient theorem shows that the learning process reduces the cross-entropy between the stochastic policy and $\exp{f(s,a)}$, where $f(s,a)$ is some value function (see Appendix A for proof). Since our control policies are independent Normal distributions, we use the same structure for the reward function.

#### 4.2.1 Hover
The hover environment uses basic aircraft dynamics as outlined above. The aircraft is initialized at a point $(0,0,0)$, at hover rpm; the objective of the task is to remain stationary until the episode terminates. Since the aircraft is unstable, this requires control actuation to keep it on the right point, though technically the best solution is for the aircraft to always output hover RPM.

The agent has target states for position, attitude, angular rates, and linear velocities. Though terms other than position aren't explicitly required, we've found that including additional reward components results in faster learning and better performance. Furthermore, these additional terms allow us to specify a greater range of complex goals for our agent to learn.

The reward function for this task is:

$r_t = -k_1 d(x_t, x_g)^2 - k_2 \left(d(\sin\zeta_t, \sin\zeta_g)^2 + d(\cos\zeta_t, \cos\zeta_g)^2\right) - k_3 d(v_t, v_g)^2 - k_4 d(p_t, p_g)^2 $

Where _ penalizes the agent for changes in state that move it away from the target position. This reward function was chosen because it can be interpreted as the Gaussian log probability of the state, with actions that take the agent closer to the goal state having higher probability mass associated with them.
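
The structure of this reward can be sketched as below. Several details are assumptions made for illustration: $d(\cdot,\cdot)$ is taken to be the Euclidean distance, $p$ is taken to be the vector of angular rates, the weights $k_1$ to $k_4$ are set to 1, and the function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hover_reward(x, zeta, v, p,
                 x_g=(0.0, 0.0, 0.0), zeta_g=(0.0, 0.0, 0.0),
                 v_g=(0.0, 0.0, 0.0), p_g=(0.0, 0.0, 0.0),
                 k=(1.0, 1.0, 1.0, 1.0)):
    """Quadratic penalty on position, attitude (via sin/cos), velocity, and rates."""
    k1, k2, k3, k4 = k
    sin_t = [math.sin(a) for a in zeta]
    cos_t = [math.cos(a) for a in zeta]
    sin_g = [math.sin(a) for a in zeta_g]
    cos_g = [math.cos(a) for a in zeta_g]
    attitude = sq_dist(sin_t, sin_g) + sq_dist(cos_t, cos_g)
    return (-k1 * sq_dist(x, x_g) - k2 * attitude
            - k3 * sq_dist(v, v_g) - k4 * sq_dist(p, p_g))
```

Comparing attitude through sines and cosines rather than raw angles avoids a discontinuity when an angle wraps around $\pm\pi$.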

The observation space for this task is:

$S_t = \left[X_t-X_g, \sin\zeta_t-\sin\zeta_g + \cos\zeta_t-\cos\zeta_g, V_b - V_g, P - P_g, \hat{\Omega_t}, t\right]$

In the case of hover, the goal states are specified as:

$X_g = \left[0,0,0\right]$

$\sin\zeta_g = \left[0,0,0\right]$

$\cos\zeta_g = \left[1,1,1\right]$

$V_g = \left[0,0,0\right]$

$P_g = \left[0,0,0\right]$

#### 4.2.2 Random Waypoint, Fixed Heading
This environment randomly generates a single waypoint that the aircraft should navigate to. As in the hover environment, the aircraft is spawned at the point $(0,0,0)$, with rotors at hover RPM. The waypoint is sampled uniformly from a sphere of radius 1.5m around the aircraft. The objective of the task is for the aircraft to navigate to the goal while maintaining a fixed heading angle. The reward function is:

$r_{nh} = r_{fh} - k_v d(v, 0)^2$

The _ term rewards the aircraft for moving closer to the goal, while the _ term incentivises the aircraft to maintain a fixed direction.

The observation space is:

$Eqn$

We include time in the observation space, as we found it not only improved performance, but was also necessary to help prevent brachistochrone-type behavior, where the aircraft utilizes gravity to reach the waypoint faster. For the same reason, we don't include a time-based reward in the reward function.

#### 3.2.3 Random Waypoint, New Heading
This environment randomly generates a single waypoint that the aircraft should navigate to. As in the previous two environments, the aircraft is spawned at the point $(0,0,0)$, with rotors at hover RPM. The waypoint is sampled uniformly from a sphere of radius 1.5 m around the aircraft, and a heading vector is sampled uniformly from a unit circle. The reward function is:

$r_{nh} = r_{fh} - k_v d(v, 0)^2$

The final term adds a nonholonomic constraint to the aircraft's motion -- that is, we want the aircraft to move in the same direction that it is pointing. This is substantially harder than allowing the aircraft to move in any direction, and as far as we are aware, has not been tested or incentivized in previous applications of RL to quadrotor flight control.
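
A minimal sketch of this penalty is shown below. It assumes that $v$ is the body y-axis (sideways) velocity, following the description of the trajectory task, and that $d$ is the Euclidean distance; the function name and the gain value are hypothetical.

```python
def nonholonomic_reward(r_fh, v_body_y, k_v=1.0):
    """r_nh = r_fh - k_v * d(v, 0)^2, reading v as the body y-axis velocity."""
    return r_fh - k_v * v_body_y ** 2


# With no sideways velocity, the fixed-heading reward is unchanged.
no_slip = nonholonomic_reward(r_fh=-0.2, v_body_y=0.0)
# Sideways slip of 0.5 m/s costs an extra k_v * 0.25.
side_slip = nonholonomic_reward(r_fh=-0.2, v_body_y=0.5)
```

Because the penalty is quadratic, small amounts of sideslip are tolerated while large lateral velocities dominate the reward.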

The observation space is:

$Eqn$

As with the previous tasks, we include time in the observation space, and don't include a time-based reward in the reward function.
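
The goal sampling described for this environment can be sketched as follows. This is an illustration under stated assumptions, not the environment's code: it treats "sphere" as the solid ball of radius 1.5 m, and `sample_waypoint` and `sample_heading` are hypothetical names.

```python
import math
import random


def sample_waypoint(radius=1.5, rng=random):
    """Uniform sample from the volume of a ball of the given radius."""
    # A normalised Gaussian vector gives a uniformly distributed direction.
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            break
    # The cube root keeps samples uniform in volume rather than
    # clustered near the centre.
    r = radius * rng.random() ** (1.0 / 3.0)
    return [r * c / norm for c in g]


def sample_heading(rng=random):
    """Unit heading vector with a uniformly distributed angle."""
    psi = rng.uniform(-math.pi, math.pi)
    return [math.cos(psi), math.sin(psi)]


rng = random.Random(0)  # seeded for repeatability
waypoint = sample_waypoint(rng=rng)
heading = sample_heading(rng=rng)
```

If the waypoint is instead meant to lie on the surface of the sphere, replacing `r` with `radius` gives a uniform sample on the surface.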

#### 3.2.4 Landing
This environment spawns the aircraft in a uniformly random state above a desired landing zone. The goal is for the aircraft to navigate safely to the landing zone within the desired time. If the aircraft's radius penetrates the ground, we assume that the vehicle has crashed. Furthermore, we assume that if the aircraft exceeds $-2\ \mathrm{m/s}$ in the body z-axis, it has entered a vortex ring state that also results in a crash (for further exposition, please see []). This assumption is based on prior experience working with small rotorcraft, rather than any given aerodynamic model.
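
The two crash conditions can be sketched as a simple check. The sign conventions here are assumptions: altitude is taken as positive upward from the ground to the vehicle centre, and descent is taken to appear as a negative body z-axis velocity. The function name and return labels are hypothetical.

```python
def landing_crash_reason(altitude, body_z_velocity, vehicle_radius,
                         max_descent_rate=2.0):
    """Return a crash label, or None if neither crash condition holds."""
    # The vehicle's radius penetrating the ground counts as a crash.
    if altitude - vehicle_radius < 0.0:
        return "ground_contact"
    # Descending faster than the limit is treated as vortex ring state.
    if body_z_velocity < -max_descent_rate:
        return "vortex_ring_state"
    return None
```

For example, `landing_crash_reason(1.0, -2.5, 0.2)` flags a vortex ring state, while a slow descent at the same altitude returns `None`.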

The reward function is:

$Eqn$

Where ...

The observation space is:

Where ...

#### 3.2.5 Trajectory Following
This environment spawns the aircraft at the point $(0,0,0)$, with rotors at hover RPM. We generate a 3D spline trajectory as a function $f(t)$, such that:

The objective of the task is for the aircraft to navigate its way along the trajectory, matching it as closely as possible, and taking the full time to reach the waypoint.

The reward function is:

$Eqn$

Where ... . This is a fairly standard reward structure for this type of task, and can be thought of as both minimizing the error between the aircraft's state and the trajectory, and providing a scalar reward for making progress along the path. As with the *Random Waypoint, New Heading* task, our goal is for the aircraft to fly nonholonomically, which we incentivize by driving velocity in the body y-direction to zero.

The observation space is:

$Eqn$

Where ...

### 3.3 PID Environment Action Selection
As mentioned previously, one of our goals is to determine what benefit, if any, including a PID controller in the learning process might have for learned flight controllers. To test this, we run two variations of each of the above environments -- one in which the RPM command is output directly by the neural network, and another in which the network outputs a setpoint that is then passed to a lower-level PID controller. In this case, we use a PD controller tracking $\dot{z}, p, q, r$. Our network $\theta$ outputs $e_{\dot{z}}, e_p, e_q, e_r$, which are then used to find the setpoints:

$\dot{z}_g = e_{\dot{z}} + \dot{z}$

$p_g = e_{p} + p$

$q_g = e_{q} + q$

$r_g = e_{r} + r$

The PD controller and simulation are then stepped forward $dt_\phi / dt_{PID}$ timesteps, and the process is repeated. To measure the effect of action selection frequency, we vary $dt_\phi$. Since we are only training the neural network controller, we calculate rewards only once per $dt_\phi$ interval, which corresponds to the action taken by the network.
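
The setpoint-and-hold scheme can be sketched for a single channel as follows. This is a toy illustration, not the controller used in the study: the plant simply integrates the PD command as an angular acceleration, and `PDController`, `apply_network_action`, and the gain values are all hypothetical.

```python
class PDController:
    """Scalar PD controller on the error between a setpoint and a measurement."""

    def __init__(self, kp, kd, dt):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.prev_error = None

    def reset(self):
        self.prev_error = None

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        # The derivative term is zero on the first call after a reset.
        if self.prev_error is None:
            d_error = 0.0
        else:
            d_error = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.kd * d_error


def apply_network_action(rate, offset, controller, dt_policy, dt_pid):
    """Hold one network output while the PD loop runs for dt_policy seconds."""
    setpoint = offset + rate                  # e.g. p_g = e_p + p
    n_inner = int(round(dt_policy / dt_pid))  # PD steps per network action
    controller.reset()
    for _ in range(n_inner):
        command = controller.step(setpoint, rate)
        rate += command * dt_pid  # toy plant: command acts as an acceleration
    return rate, setpoint, n_inner


pid = PDController(kp=20.0, kd=0.1, dt=0.001)
final_rate, setpoint, n_inner = apply_network_action(
    rate=0.0, offset=0.5, controller=pid, dt_policy=0.05, dt_pid=0.001)
```

In this sketch a reward would be computed once after `apply_network_action` returns, so a longer `dt_policy` means more inner PD steps per reward.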


## 4. Implementation

In the interests of clarity, we provide the full implementation of our algorithms here, and include further detail in the Appendix. One of the major impediments to deep RL research has been the lack of reproducibility, in part due to the presence of "hacks" and workarounds in the code that improve performance, but aren't tested. See [] for further details. In order to mitigate this, we provide bare-bones implementations of current state-of-the-art algorithms in continuous-action RL. We test each algorithm's performance on a simple benchmark task in order to verify that our implementations do indeed learn as expected, and then test them on the actual tasks of interest. For the sake of brevity, the unit tests are omitted from the final paper, but can be found in the appendices.

### 4.1 Installing Necessary Packages

In [0]:
!pip install gym
!apt-get install python-opengl -y
!apt install xvfb -y

# For rendering environment, you can use pyvirtualdisplay.
!pip install pyvirtualdisplay
!pip install pyglet

from google.colab import drive
drive.mount('/content/gdrive/')

import sys
sys.path.append('/content/gdrive/quadrotor_study')

!git clone https://github.com/seanny1986/gym-aero
%cd gym-aero/simulation
!make
%cd ..
!pip install .


### 4.2 Imports and Utility Functions

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import gym
import gym_aero as ga
import scipy

from torch.distributions import Normal
from collections import namedtuple

import utilities as utils

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 4.2 Angular Rates PID Controller

### 4.3 Control Policies and Value Functions

### 4.4 REINFORCE-Based Algorithms

#### 4.4.1 Architecture and Update Functions

#### 4.4.2 Training Loop

### 4.5 Pathwise Derivative Algorithms

#### 4.5.1 Architecture and Update Functions

#### 4.5.2 Training Loop

### 4.6 Unit Tests
The following blocks of code test the above implementations to ensure that they work as expected. As we are using continuous action controllers, we verify performance on Ant-v1 in MuJoCo.

#### 4.6.1 PID Rates Controller

#### 4.6.2 REINFORCE-Based Algorithms

![block_diagram](imgs/offline_unit_test.pdf)

As the graph shows, our REINFORCE-based algorithms have no trouble learning locomotion on the Ant-v1 task. Standard REINFORCE tends to collapse after reaching a good reward, as the variance of the gradient estimate tends to explode. TRPO and PPO avoid this by keeping each update bounded to a trust region, so that the policy either maintains its performance or improves. This collapse in performance appears to be the main factor holding REINFORCE back (though it may also suffer as the action dimension increases, due to the inherent variance of the score function estimator).
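
To illustrate how PPO bounds an update, the sketch below computes the standard clipped surrogate objective for a batch in plain Python. It is a minimal illustration rather than the implementation tested here; the function name and the `clip_eps` default of 0.2 are assumptions, and TRPO enforces its bound differently (through a KL constraint), which is not shown.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximised)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = ppo_clipped_objective([math.log(1.5)], [0.0], [2.0])
# Ratio 1.0: clipping is inactive and the objective equals the advantage.
unclipped = ppo_clipped_objective([0.0], [0.0], [2.0])
```

Once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction that would increase the objective, the gradient with respect to the policy vanishes, which removes the incentive for the large steps that destabilize plain REINFORCE.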

#### 4.6.3 Pathwise Derivative Algorithms

## 5. Experiments

We conduct the following experiments:
1. Flight control by outputting raw rotor speed commands
2. Flight control by outputting rates and then having the underlying PID controller fly the aircraft

We do this for the following tasks:
1. Hover.
2. Random waypoint navigation with a fixed heading.
3. Random, nonholonomic waypoint navigation with a new heading.
4. Landing.

We test implementations of the following algorithms:
1. REINFORCE-based algorithms (REINFORCE, PPO, TRPO)
2. Online algorithms (DDPG, TD3, SVG(0), SAC)

Finally, we test the effect of action frequency on each method's ability to learn. The following table summarizes the hyperparameter settings for each algorithm:



### 5.1 Direct Rotor Speed Commands

In [0]:
# Results are indexed as all_data[command_mode][task][algorithm].
# Every leaf starts as None and is filled in by the training cells below.
command_modes = ["direct_rotor_command", "rate_command"]
tasks = ["hover", "waypoint_fh", "waypoint_nh", "landing"]
algorithms = ["reinforce", "ppo", "trpo", "ddpg", "td3", "svg", "sac"]

all_data = {mode: {task: {alg: None for alg in algorithms}
                   for task in tasks}
            for mode in command_modes}

#### 5.1.1 Hover

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["hover"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = DDPG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TD3(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["hover"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SVG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SAC(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["hover"]["sac"] = sac_data
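
Each entry stored in `all_data` by the cell above is a list of five per-seed results. One way to summarize such a list is sketched below, assuming (hypothetically) that each run is a sequence of per-iteration rewards; the actual structure returned by `train_offline` and `train_online` may differ, and `summarize_runs` is not part of the original code.

```python
import math


def summarize_runs(runs):
    """Per-iteration mean and sample standard deviation across repeated runs."""
    n_runs = len(runs)
    n_iters = min(len(run) for run in runs)  # truncate to the shortest run
    means, stds = [], []
    for t in range(n_iters):
        values = [run[t] for run in runs]
        mean = sum(values) / n_runs
        if n_runs > 1:
            var = sum((v - mean) ** 2 for v in values) / (n_runs - 1)
        else:
            var = 0.0
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


# Five made-up reward curves standing in for one entry of all_data.
fake_runs = [[-10.0 + i, -5.0 + i, -1.0 + i] for i in range(5)]
means, stds = summarize_runs(fake_runs)
```

Reporting the spread across seeds alongside the mean matters here because single deep RL runs can vary widely between random seeds.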

#### 5.1.2 Random Waypoint, Fixed Heading

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = DDPG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TD3(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SVG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SAC(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["waypoint_fh"]["sac"] = sac_data

#### 5.1.3 Random Waypoint, New Heading

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = DDPG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TD3(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SVG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SAC(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["waypoint_nh"]["sac"] = sac_data

#### 5.1.4 Landing

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["landing"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["landing"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["landing"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = DDPG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["landing"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TD3(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["landing"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SVG(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["landing"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = SAC(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["landing"]["sac"] = sac_data

### 5.2 Rate Commands
#### 5.2.1 Hover

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["hover"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["hover"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["hover"]["sac"] = sac_data

#### 5.2.2 Random Waypoint, Fixed Heading

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["hover"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["hover"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["hover"]["sac"] = sac_data

#### 5.2.3 Random Waypoint, New Heading

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["hover"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["hover"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["hover"]["sac"] = sac_data

#### 5.2.4 Landing

In [0]:
env = gym.make("Hover-v0")
state_dim = env.observation_space.shape[0]
hidden_dim = 256
action_dim = env.action_space.shape[0]

print("Working directory: ", path)
print("Observation space dim: ", env.observation_space.shape)
print("Action space dim: ", env.action_space.shape)

action_frequencies = [0.025, 0.05, 0.075, 0.1, 0.125, 0.15]

for af in action_frequencies:
    
    env.set_action_frequency(af)
    
    # REINFORCE
    reinforce_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = REINFORCE(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        reinforce_data.append(data)
    all_data["direct_rotor_command"]["hover"]["reinforce"] = reinforce_data

    # PPO
    ppo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = PPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ppo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ppo"] = ppo_data

    # TRPO
    trpo_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_offline(env, agent, opt, batch_size, iterations, log_interval, render=ren
        trpo_data.append(data)
    all_data["direct_rotor_command"]["hover"]["trpo"] = trpo_data

    # DDPG
    ddpg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        ddpg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["ddpg"] = ddpg_data

    # TD3
    td3_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        td3_data.append(data)
    all_data["direct_rotor_command"]["hover"]["td3"] = td3_data

    # SVG(0)
    svg_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        svg_data.append(data)
    all_data["direct_rotor_command"]["hover"]["svg"] = svg_data

    # SAC
    sac_data = []
    for i in range(5):
        q_fn = ValueNet(state_dim + action_dim, hidden_dim, 1)
        v_fn = ValueNet(state_dim, hidden_dim, 1)
        beta = IndependentGaussianPolicy(state_dim, hidden_dim, action_dim)
        agent = TRPO(beta, q_fn, v_fn)
        opt = torch.optim.Adam(agent.parameters(), lr=3e-4)
        rname = path+"/" + args.env + "-" + args.alg + "-" + args.policy + "-score-agent-" + str(i)
        data = train_online(env, agent, opt, batch_size, iterations, log_interval, render=ren
        sac_data.append(data)
    all_data["direct_rotor_command"]["hover"]["sac"] = sac_data

## 6. Results

### 6.1 Data Processing



In [0]:
for arch in all_data:
  architecture = all_data[arch]
  for t in architecture:
    task_data = architecture[t]
    plt.figure(figsize=(10,10))
    for alg in task_data:
      # Each entry is assumed to be one run's score sequence, so axis 0 indexes runs.
      alg_data = np.array(task_data[alg])
      mu = np.mean(alg_data, axis=0)
      std = np.std(alg_data, axis=0)
      x = np.arange(len(mu))
      plt.plot(x, mu, label=alg)
      plt.fill_between(x, mu - std, mu + std, alpha=0.2)
    plt.legend()
    plt.show()

### 6.2 Performance by Algorithm

### 6.3 Online Versus Offline

### 6.4 Effect of PID Controller on Learning Speed and Maximal Reward

### 6.5 Effect of Action Selection Frequency

### 6.6 Results Table

Control Frequency: 0.01s

|Task               |REINFORCE    |PPO   |TRPO  |DDPG   |SVG(0)  |SAC   |
|-------------------|-------------|------|------|-------|--------|------|
|Hover              |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint FH |stuff        |stuff |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint NH |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Landing            |stuff        |stuff |stuff |stuff  |stuff   |stuff |

Control Frequency: 0.02s

|Task               |REINFORCE    |PPO   |TRPO  |DDPG   |SVG(0)  |SAC   |
|-------------------|-------------|------|------|-------|--------|------|
|Hover              |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint FH |stuff        |stuff |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint NH |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Landing            |stuff        |stuff |stuff |stuff  |stuff   |stuff |

Control Frequency: 0.03s

|Task               |REINFORCE    |PPO   |TRPO  |DDPG   |SVG(0)  |SAC   |
|-------------------|-------------|------|------|-------|--------|------|
|Hover              |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint FH |stuff        |stuff |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint NH |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Landing            |stuff        |stuff |stuff |stuff  |stuff   |stuff |

Control Frequency: 0.04s

|Task               |REINFORCE    |PPO   |TRPO  |DDPG   |SVG(0)  |SAC   |
|-------------------|-------------|------|------|-------|--------|------|
|Hover              |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint FH |stuff        |stuff |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint NH |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Landing            |stuff        |stuff |stuff |stuff  |stuff   |stuff |

Control Frequency: 0.05s

|Task               |REINFORCE    |PPO   |TRPO  |DDPG   |SVG(0)  |SAC   |
|-------------------|-------------|------|------|-------|--------|------|
|Hover              |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint FH |stuff        |stuff |TBT   |TBT    |TBT     |TBT   |
|Random Waypoint NH |stuff        |TBT   |TBT   |TBT    |TBT     |TBT   |
|Landing            |stuff        |stuff |stuff |stuff  |stuff   |stuff |

## 7. Discussion and Conclusion

## 8. References

1. Russell, S.J. and Norvig, P., 2016. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited.

2. Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning (Vol. 1). Cambridge: MIT press.

3. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), p.484.

4. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y., 2017. Mastering the game of Go without human knowledge. Nature, 550(7676), p.354.

5. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

6. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), p.529.

7. Heess, N., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M. and Silver, D., 2017. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286.

8. Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A., 2017. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.

9. Hwangbo, J., Sa, I., Siegwart, R. and Hutter, M., 2017. Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters, 2(4), pp.2096-2103.

10. Andersson, O., Wzorek, M. and Doherty, P., 2017, February. Deep Learning Quadcopter Control via Risk-Aware Active Learning. In AAAI (pp. 3812-3818).

11. Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., 2014. Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).

13. Graves, A., Wayne, G. and Danihelka, I., 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.

14. Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S.G., Grefenstette, E., Ramalho, T., Agapiou, J. and Badia, A.P., 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), p.471.

15. Zhang, J., Tai, L., Boedecker, J., Burgard, W. and Liu, M., 2017. Neural SLAM. arXiv preprint arXiv:1706.09520.

17. Ha, D. and Schmidhuber, J., 2018. World Models. arXiv preprint arXiv:1803.10122.

18. Sutton, R.S. and Barto, A.G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press.

19. Peters, J. and Schaal, S., 2006, October. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on (pp. 2219-2225). IEEE.

20. Peters, J. and Schaal, S., 2008. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4), pp.682-697.

21. Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

22. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

23. Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., 2015, June. Trust region policy optimization. In International Conference on Machine Learning (pp. 1889-1897).

## 9. Appendix

### 9.1 Appendix A

#### Proof of 1.1
We start with a parameterized distribution $p_{\theta}(x)$ and a fixed distribution $q(x)$. The cross-entropy is:

$H(p,q) = -\int_{x}p_\theta(x)\log{q(x)}dx = -\mathbb{E}_{x\sim p}\left[\log{q(x)}\right]$

Our goal is to reduce $H(p,q)$ by updating the parameters of $p_\theta(x)$:

$\nabla_{\theta}H(p,q) = -\nabla_{\theta}\int_{x}p_{\theta}(x)\log{q(x)}dx$

Using the log-derivative identity $\nabla_{\theta}p_{\theta}(x) = p_{\theta}(x)\nabla_{\theta}\log{p_{\theta}(x)}$ that underlies the policy gradient theorem:

$\nabla_{\theta}H(p,q) = -\mathbb{E}_{x\sim p}\left[\nabla_\theta\log{p_\theta(x)}\log{q(x)}\right]$
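The score-function form above can be sanity-checked numerically on a toy problem. The sketch below is an illustration only, not part of our experimental code: it assumes $p_\theta = \mathcal{N}(\theta, 1)$ and $q = \mathcal{N}(0, 1)$, for which $H(p,q) = \frac{1}{2}(\theta^2 + 1) + \frac{1}{2}\log{2\pi}$ and the exact gradient is simply $\theta$. The function names `log_q` and `score_function_grad` are hypothetical.

```python
import math
import random


def log_q(x):
    """Log-density of the fixed target q = N(0, 1) (toy assumption)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def score_function_grad(theta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of grad_theta H(p, q) for p_theta = N(theta, 1).

    Implements -E_{x~p}[grad_theta log p_theta(x) * log q(x)], where
    grad_theta log p_theta(x) = (x - theta) for a unit-variance Gaussian.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        total += (x - theta) * log_q(x)
    return -total / n_samples


theta = 1.5
estimate = score_function_grad(theta)
# For this toy case the exact gradient of the cross-entropy is theta itself.
print(f"estimate = {estimate:.3f}, exact = {theta:.3f}")
```

The estimator is unbiased but noisy, which is why a large sample count is used here; in the policy gradient setting the same variance is what baselines such as $V^\pi(s_t)$ are meant to reduce.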

We can recover the policy gradient theorem by letting $p_\theta(x) = \pi_{\theta}(a_t|s_t)$ and $q(x) = \frac{\exp{(Q^\pi(s_t,a_t)-V^{\pi}(s_t))}}{Z_t}$, and summing over timesteps $t$ up to the horizon $T$:

$\nabla_{\theta}J(\theta) = \sum_t^T \nabla_{\theta}H(p,q) = -\mathbb{E}_{s,a\sim\pi}\left[\sum_t^T \nabla_{\theta}\log{\pi_\theta (a_t|s_t)}\left(Q^\pi(s_t,a_t) - V^\pi(s_t) - \log{Z_t}\right)\right]$

This also implies a form of the policy gradient theorem that minimizes the KL-divergence instead. This can be recovered using:

$KL(p||q) = H(p,q) + H(p)$

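Where $H(p) = \mathbb{E}_{x \sim p}\left[\log{p_\theta (x)}\right]$ is the negative entropy of $p_\theta$. Applying the same log-derivative identity as above to this term gives $\nabla_\theta H(p) = \mathbb{E}_{x\sim p}\left[\nabla_\theta\log{p_\theta(x)}\left(1 + \log{p_\theta(x)}\right)\right]$. Thus:

$\nabla_\theta KL(p||q) =  \mathbb{E}_{x\sim p}\left[\nabla_\theta \log{p_\theta(x)}\left(1 + \log{p_\theta(x)}\right)\right] - \mathbb{E}_{x\sim p}\left[\nabla_\theta\log{p_\theta(x)}\log{q(x)}\right]$

Which reduces to:

$\nabla_\theta KL(p||q) =  \mathbb{E}_{x\sim p}\left[\nabla_\theta \log{p_\theta(x)}\left(1 + \log{p_\theta(x)} - \log{q(x)}\right)\right]$

The constant $1$ contributes nothing in expectation, since $\mathbb{E}_{x\sim p}\left[\nabla_\theta \log{p_\theta(x)}\right] = 0$. We can recover the target objective in [] by substituting $p_\theta(x) = \pi_{\theta}(a|s)$ and $\log q(x) = Q^\pi(s,a)$ (up to an additive constant, which also vanishes in expectation), so that the bracketed term becomes $1 + \log\pi_\theta(a|s) - Q^\pi(s,a)$. Alternatively, we could make the same substitutions as we previously made to recover the policy gradient theorem, and get a policy gradient that minimizes the KL-divergence rather than the cross entropy.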

It is also straightforward to show that the pathwise derivative as used in DDPG, TD3, SVG(0), and SAC follows the same pattern. Using a change of variables:

$\nabla_{\theta}H(p,q) = -\int_{x}p_{\theta}(x)\nabla_{\theta}\log{p_\theta(x)}\log{q(x)}dx = -\int_{\epsilon}p(\epsilon)\nabla_{\theta}\log{q(x(\theta; \epsilon))}d\epsilon$

We can recover the DDPG update by letting $x = a$ such that $a = \pi_\theta(s) + \epsilon$, where $\epsilon$ is some form of noise injection (Ornstein-Uhlenbeck in the case of DDPG, $\mathcal{N}(0, \sigma \mathbf{I})$ in the case of TD3) and $\log{q(x)} = Q^\mu(s,a)$ is the off-policy value function:

$\nabla_{\theta}H(p,q) = -\int_{s, \epsilon} p(s) p(\epsilon)\nabla_{\theta}Q^\mu(s,a(\theta; \epsilon))\, d\epsilon \, ds = -\mathbb{E}_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}\left[\nabla_{\theta}Q^\mu(s,a(\theta; \epsilon))\right] = -\mathbb{E}_{s \sim \mathcal{D}, a \sim \pi}\left[\nabla_{\theta}Q^\mu(s,a(\theta; \epsilon))\right]$
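The pathwise form can be checked on a toy problem in the same spirit. The sketch below is illustrative only and is not our DDPG implementation: it assumes a scalar parameter with $x = \theta + \epsilon$, $\epsilon \sim \mathcal{N}(0, 1)$, and a fixed target $q = \mathcal{N}(0, 1)$ standing in for the value function, so the exact gradient of $H(p,q)$ is $\theta$. The function name `pathwise_grad` is hypothetical.

```python
import random


def pathwise_grad(theta, n_samples=50_000, seed=0):
    """Pathwise (reparameterized) estimate of grad_theta H(p, q).

    With x = theta + eps and q = N(0, 1), log q(x) = -0.5 * x**2 + const,
    so grad_theta log q(x(theta; eps)) = -(theta + eps) because dx/dtheta = 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        total += -(theta + eps)  # gradient flows through the sample itself
    return -total / n_samples


theta = 1.5
# The exact gradient for this toy case is theta.
print(f"estimate = {pathwise_grad(theta):.3f}, exact = {theta:.3f}")
```

In this toy case each pathwise sample has unit variance, so far fewer samples are needed than a score-function estimator would typically require for the same accuracy. The price is that $\log{q}$ (here, the critic) must be differentiable with respect to the action.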

Again, this implies a form of the loss function that minimizes the KL-divergence instead, which, using the same identities as previously, gives us:

$\nabla_{\theta}KL(p||q) = \nabla_{\theta}\mathbb{E}_{\pi}\left[\log \pi_\theta (a|s)\right]-\int_{\epsilon}p(\epsilon)\nabla_{\theta}\log{q(x(\theta; \epsilon))}d\epsilon$

### 9.2 Appendix B
#### 9.2.1 Further Implementation Details
As noted above, our primary goal with our implementations was simplicity and consistency between the algorithms. These goals are somewhat confounded by differing conventions for online and offline algorithms; for example, online algorithms typically use a Tanh squashing function to constrain the action to the interval $[-1, 1]$. In the absence of this squashing activation, these algorithms typically struggle to function, which we show below on a unit test:

FIGURE

We found constrained actions to be a key component in getting DDPG-based methods (including DDPG and TD3) to work, so it's not surprising that a stochastic-policy method like SAC, which itself evolved from DDPG, would use the same technique. This necessitates the change-of-variables formula used in SAC for computing the log-probability of the squashed action, which would seem at odds with offline, REINFORCE-based methods. Our unit testing revealed that DDPG itself was surprisingly resilient to unconstrained actions if Tanh activations were used in the policy, but the same wasn't true of the other algorithms. To keep things consistent, we opted to stick with ReLU activations in the policy network, and a Tanh activation at the output.

This requirement is in stark contrast to REINFORCE-based algorithms, which typically have an unconstrained action space, and -- in our experience at least -- will generally work with *any* parameterized density. Nevertheless, most implementations and experiments that we have seen stick to independent (or multivariate) Gaussian distributions for the case of continuous actions. We felt that it was important to compare all of the algorithms on a like-for-like basis. That is, we wanted all algorithms to use the exact same policy parameterization, and relatively consistent hyperparameters (i.e. number of neurons, number of layers, learning rate, etc.). Given that the online algorithms were unable to perform with an unconstrained action space (and given that action space constraints are a reasonable assumption to make), we used the same change-of-variables formula that is used in SAC to squash the action for our policy gradient algorithms.
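For reference, the change-of-variables correction can be sketched for a single action dimension as follows. This is a minimal illustration under stated assumptions, not our implementation: it assumes a pre-squash sample $u \sim \mathcal{N}(\mu, \sigma^2)$ and an action $a = \tanh(u)$, so that $\log\pi(a) = \log\mathcal{N}(u; \mu, \sigma) - \log(1 - \tanh^2(u))$. The function names are hypothetical, and `total_probability` exists only to check that the squashed density integrates to one over $(-1, 1)$.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_gaussian_log_prob(u, mu, sigma):
    """Log-density of a = tanh(u) when u ~ N(mu, sigma^2).

    Change of variables: log p(a) = log N(u; mu, sigma) - log(1 - tanh(u)^2),
    with log(1 - tanh(u)^2) written as 2 * (log 2 - u - softplus(-2u))
    to avoid taking the log of a number very close to zero.
    """
    log_normal = (-0.5 * ((u - mu) / sigma) ** 2
                  - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return log_normal - log_det


def total_probability(mu, sigma, n=20_000):
    """Midpoint-rule integral of the squashed density over a in (-1, 1)."""
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        a = -1.0 + (i + 0.5) * h
        u = math.atanh(a)  # invert the squashing to get the pre-squash value
        total += math.exp(squashed_gaussian_log_prob(u, mu, sigma)) * h
    return total


print(f"integral of squashed density: {total_probability(0.3, 0.5):.4f}")
```

For a multi-dimensional independent Gaussian policy, the same correction would be applied per dimension and the log-probabilities summed.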