# Learning Quadrotor Flight Control Tasks Using Deep Reinforcement Learning

## 1. Introduction

## 2. Related Work
\cite{c7,c9} used MPC-Guided Policy Search to learn to fly a quadrotor in Gazebo through unstructured environments using LIDAR. This was done using a heavily modified fork of RotorS \cite{c8}, with a custom control node running the policy search algorithm. This research demonstrated the efficacy of using RL to train continuous-action controllers for quadrotors.

\cite{c10} used a method similar to guided policy search to train a quadrotor controller to avoid collisions in obstacle-filled environments. To overcome the relative sparsity of collisions, \cite{c10} used a novel sampling strategy to overexpose the agent to them, resulting in a more risk-averse policy. The authors successfully tested the learned policy on the CrazyFlie nanoquad using a native C library for the network weights.

\cite{c6} used a custom natural policy gradient method to recover a tumbling aircraft. To do this, the authors used a novel "junction" exploration strategy, and wrote a very simple flight dynamics model -- omitting the effects of drag -- to demonstrate that they could still learn robust control policies using deep RL. The policies were successfully tested on a real-world aircraft using a Vicon camera system for state estimation.

These previous studies are typically constrained to a single task, and -- in the case of \cite{c6,c7,c9,c10} -- make use of an expert to guide learning. In \cite{c7,c9,c10}, a trajectory planner was used to turn the task of policy search into a supervised learning task, whereas in \cite{c6}, a PD controller was used to help stabilize learning (though the final policy did not use the controller, and far exceeded its capabilities). Our work does not make use of such additional machinery, and learns the policy directly. Furthermore, we learn controllers for a greater number of tasks than these previous works.

Surprisingly, no work that we know of has explicitly attempted to train current state-of-the-art algorithms without any additional aids; [] comes the closest, but used a PD controller to bias the policy search algorithm in the direction of the solution, and makes no mention of whether this was also done for its implementations of TRPO and DDPG. Other works have reported that DDPG struggles to learn continuous flight control tasks \cite{c6}, but none have attempted to use REINFORCE with GAE, or PPO, for continuous-action flight control. Given that PPO and TRPO are state-of-the-art algorithms with notable success in continuous-action control tasks and, more recently, in DOTA2 and StarCraft II \cite{c11}, it stands to reason that such algorithms should be able to learn simple flight control tasks.

## 3. Background

### 3.1 Policy Search
We assume a Markov Decision Process with continuous states $s \in \mathcal{S}$ and continuous actions $a \in \mathcal{A}$. We denote a trajectory $\tau$ as an ordered sequence of events $\{s_{0}, a_{0}, r_{0}, \dots, s_{H-1}, a_{H-1}, r_{H-1}, s_{H}\}$ over time steps $0 \leq t \leq H$, for states and actions visited under a policy $\pi$, and rewards $r \in \mathcal{R}$ from the environment \cite{c12}.

The goal of our agent is to maximize a gamma-discounted sum of rewards over time:

\begin{equation}
	G_{t} = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots + \gamma^{H} r_{t+H+1} = \sum_{i=0}^{H} \gamma^{i}r_{t+i+1}
\end{equation}

where $H$ can be either a finite or infinite horizon, and $\gamma \in [0, 1]$ is a discount factor; choosing $\gamma < 1$ ensures that $G_{t}$ converges for $H=\infty$ when rewards are bounded.
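As a concrete illustration of the discounted return, the following minimal Python sketch computes the return from every step of a finite reward sequence by working backwards through the episode. The function name `discounted_returns`, and the convention that `rewards[k]` is the $k$-th reward collected along the trajectory, are assumptions made for this example rather than part of our implementation.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return from every step of a finite episode.

    Uses the backward recursion G_k = rewards[k] + gamma * G_{k+1},
    with the return after the final reward taken to be zero.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        returns[k] = running
    return returns
```

For example, three unit rewards with $\gamma = 0.5$ give returns of $1.75$, $1.5$ and $1.0$ from the first, second and third steps respectively.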

The state and state-action value functions of policy $\pi$ are defined as:

\begin{equation}
	V^{\pi}(s_{t}) = \mathbb{E}_{\pi}\left[\sum_{i=0}^{H} \gamma^{i}r_{t+i+1}|s_{t}=s\right]
\end{equation}

\begin{equation}
	Q^{\pi}(s_{t}, a_{t}) = \mathbb{E}_{\pi}\left[\sum_{i=0}^{H} \gamma^{i}r_{t+i+1}|s_{t}=s, a_{t}=a\right]
\end{equation}

With the advantage function of $\pi$ being:

\begin{equation}
    A^{\pi}(s_{t},a_{t})=Q^{\pi}(s_{t},a_{t})-V^{\pi}(s_{t})
\end{equation}

Our policy $\pi_{\theta}(a|s)$ is stochastic, mapping each state $s \in \mathcal{S}$ to a distribution over actions $a \in \mathcal{A}$, and our goal is to find a set of policy weights $\theta$ that maximizes the expected return over trajectories sampled under $\pi_{\theta}$.

We focus on the class of methods known as Monte-Carlo Policy Gradients, for which the policy selects actions, and a learned value function of the form $V:\mathcal{S} \rightarrow \mathbb{R}$ evaluates the resulting state. The value function is typically trained by minimizing the error:

\begin{equation}
\label{eqn:td_err}
    \mathcal{L}(\phi) = \mathbb{E}_{\pi}\left[\left(r_{t}+\gamma \hat{V}^{\pi}(s_{t+1})-V_{\phi}^{\pi}(s_{t})\right)^{2}\right]
\end{equation}

In which $\hat{V}^{\pi}$ refers to the Monte-Carlo estimate of $V^{\pi}$, and $V_{\phi}^{\pi}$ corresponds to the value function estimate parameterized by $\phi$. Gradient ascent is used to maximize the return of the policy by ascending the value function, using a stochastic gradient estimator:

\begin{equation}
\nabla_{\theta}\mathcal{L}(\theta)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{H-1}\nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t})f^{\pi}(s_{t},a_{t})\right]
\end{equation}

to learn on-policy \cite{c12,c13}, or using an importance sampling estimator:
 
\begin{equation}
\nabla_{\theta} \mathcal{L}(\theta) = \mathbb{E}_{\mu}\left[\sum_{t=0}^{H-1}\frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})}{\mu(a_{t}|s_{t})}f^{\mu}(s_{t},a_{t})\right]
\end{equation}

to learn off-policy \cite{c12,c14,c15}. $f^{\pi}(s_{t},a_{t})$ can take multiple forms (see \cite{c16}) but typically takes the form $r_{t}+\gamma \hat{V}^{\pi}(s_{t+1})-V_{\phi}^{\pi}(s_{t})$, or interpolates between the one-step and infinite horizon using an additional weighting (see TD($\lambda$) and generalized advantage estimation \cite{c16}). 
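To make the interpolation between the one-step and full-horizon estimates concrete, the sketch below computes generalized advantage estimates for a single episode in plain Python. The function name `gae_advantages`, the argument layout, and the default values of `gamma` and `lam` are assumptions chosen for illustration; they are not taken from our training code.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)], one more entry than rewards
             (use 0.0 for the last entry if s_T is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step residual $r_{t}+\gamma V(s_{t+1})-V(s_{t})$, while `lam=1` gives the discounted Monte-Carlo return minus the value baseline.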

State-of-the-art policy gradient methods such as PPO and TRPO \cite{c17,c18} use the importance sampling estimator for on-policy learning, by bounding the update to a trust region over multiple update steps. The reason for this is that when updating the policy, we aim to take small steps, but small steps in the parameter space may not correspond to small steps in the action space. TRPO and PPO restrict the KL-Divergence between the old and new policies, which in theory guarantees monotonic policy improvement.

PPO is a first-order method that approximates the constrained second-order update used in TRPO, by using either a clipped objective, or an adaptive penalty based on the KL-divergence. Such methods typically use generalized advantage estimation to update the policy. Per the definition of the value function, $V^{\pi}(s_{t}) \neq V^{\mu}(s_{t})$ for arbitrary policies $\pi$ and $\mu$ \cite{c14}, and so an off-policy correction should be made when taking multiple gradient steps using a single roll-out batch.
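The clipped variant of the PPO objective can be sketched as follows. This is a scalar, plain-Python illustration only: in practice the same expression would be evaluated on tensors so that it can be differentiated with respect to $\theta$. The function name `ppo_clip_objective` and the default clipping range of 0.2 are assumptions made for this example.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the roll-out batch.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio pi_theta(a|s) / pi_old(a|s).
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is one and the objective reduces to the mean advantage. Once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction that would increase the objective, the clipped term takes over and the sample stops contributing a gradient.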

We can produce more flexible controllers by using goal-conditioned policies, for which a goal is passed as contextual input to $\pi$ \cite{c19,c20}.  Similarly, a value function that takes an additional goal argument is a universal value function approximator (UVFA) \cite{c21}. Such value functions can estimate the value of states or state-action pairs given the context of the goal, making them useful for learning more general policies. Our research focuses on goal-conditioned policies as we aim to learn general flight control policies. 

### 3.2 Quadrotor Flight Dynamics
We briefly cover quadrotor flight dynamics here as it pertains to the learning problem.

Our aircraft is modeled in a NED axis system, and we use the full 6DOF nonlinear equations of motion:

\begin{equation}
    \begin{bmatrix}
		0 \\
		\mathbf{\dot{v}} \\
	\end{bmatrix} = 
	\begin{bmatrix}
		0 \\
		\mathbf{a}_{b}
	\end{bmatrix}
	+\mathbf{q}\otimes
	\begin{bmatrix}
		0 \\
		\mathbf{G}_{i}
	\end{bmatrix}\otimes
	\mathbf{q}^{-1}
	-\begin{bmatrix}
		0 \\
		\mathbf{\omega} \times \mathbf{v}
	\end{bmatrix}
\end{equation}

\begin{equation}
	\mathbf{\dot{\omega}} = \mathbf{J}^{-1}(\mathbf{M}_{b}-\mathbf{\omega}\times \mathbf{J}\mathbf{\omega})
\end{equation}

\begin{equation}	
	\begin{bmatrix}
		0 \\
		\mathbf{\dot{x}}
	\end{bmatrix} = \mathbf{q}^{-1} \otimes 
	\begin{bmatrix}
		0 \\
		\mathbf{v}
	\end{bmatrix} \otimes
	\mathbf{q}
\end{equation}

\begin{equation}
    \mathbf{\dot{q}} = -\frac{1}{2}
    \begin{bmatrix}
		0 \\
		\mathbf{\omega}
	\end{bmatrix}\otimes\mathbf{q}
\end{equation}
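The quaternion operations appearing in the equations of motion can be sketched in a few lines of Python. The sketch below assumes quaternions stored as `(w, x, y, z)` tuples with the Hamilton product convention, and unit quaternions so that the inverse equals the conjugate; the function names are hypothetical and the code simply evaluates the expressions as written above, independent of our C++ model.

```python
def quat_mul(p, q):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def quat_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def position_rate(q, v):
    """Evaluate [0, x_dot] = q^{-1} * [0, v] * q and return x_dot."""
    r = quat_mul(quat_mul(quat_conj(q), (0.0, *v)), q)
    return r[1:]


def quat_rate(q, omega):
    """Evaluate q_dot = -1/2 * [0, omega] * q."""
    r = quat_mul((0.0, *omega), q)
    return tuple(-0.5 * c for c in r)
```

Because multiplication by a pure quaternion is a skew-symmetric operation, `quat_rate` is always orthogonal to `q`, so the norm of the attitude quaternion is preserved in continuous time. A discrete integrator would still typically renormalize `q` after each step.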

We model the motor thrust and torque as proportional to $\Omega^2$, where $\Omega$ is the rotor speed, with the lag between $\Omega$ and the commanded speed $\Omega_{C}$ characterized by a linear first-order ODE:

\begin{equation}
\dot{\Omega} = -k_{\Omega}(\Omega-\Omega_{C})
\end{equation}
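For a command that is held constant over a time step, this first-order lag has a closed-form solution, which the sketch below uses to advance the rotor speed. The function names and the thrust coefficient `k_thrust` are hypothetical, and the sketch is independent of our Simulink-generated model.

```python
import math


def motor_step(omega, omega_cmd, k_omega, dt):
    """Advance rotor speed by dt under d(omega)/dt = -k_omega * (omega - omega_cmd).

    Assumes omega_cmd is constant over the step, for which the exact solution
    is an exponential decay of the speed error.
    """
    decay = math.exp(-k_omega * dt)
    return omega_cmd + (omega - omega_cmd) * decay


def motor_thrust(omega, k_thrust):
    """Thrust proportional to the square of rotor speed."""
    return k_thrust * omega ** 2
```

After one time constant, $\Delta t = 1/k_{\Omega}$, the rotor has closed roughly 63\% of the gap to the commanded speed. The delay between a policy's action and the resulting thrust is therefore governed by $k_{\Omega}$.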

We include a simple aerodynamic force and moment model:

\begin{equation}
F_A = -k_F \mathbf{v}^T \mathbf{v} \hat{\mathbf{v}}
\end{equation}

\begin{equation}
Q_A = -k_Q \mathbf{v}^T \mathbf{v} \hat{\mathbf{v}}
\end{equation}
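Since $\mathbf{v}^T \mathbf{v}\,\hat{\mathbf{v}} = \|\mathbf{v}\|\,\mathbf{v}$, the aerodynamic force can be evaluated without forming the unit vector explicitly, which also avoids a division by zero at rest. The sketch below illustrates this for the force term; the function name `aero_force` and the explicit zero-speed guard are assumptions made for the example.

```python
import math


def aero_force(v, k_f):
    """Quadratic drag F_A = -k_f * (v . v) * v_hat = -k_f * |v| * v."""
    speed = math.sqrt(sum(c * c for c in v))
    if speed == 0.0:
        # No drag at rest; the unit vector v_hat is undefined here.
        return tuple(0.0 for _ in v)
    return tuple(-k_f * speed * c for c in v)
```

The resulting force always opposes the velocity vector, and its magnitude grows with the square of the speed.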

Our model is implemented in C++, with code generated using MATLAB Simulink. We wrote Python wrappers for the code so that we could interface it with standard Python ML libraries such as PyTorch, and test our own agents on standard benchmarks. We did this as we found that existing packages such as AirSim and Gazebo were unsuited to training agents at speeds faster than real-time, and that their decentralized nature made controlling action-selection frequency difficult (and largely dependent on hardware). Furthermore, previous works such as [] have demonstrated the efficacy of using simple models to train aircraft-ready policies. All environments used in our experiments can be accessed online at https://github.com/seanny1986/gym-aero/. We provide an OpenGL-based visualization module along with the basic simulator.

Actions in quadrotors are additive: if, for example, we want to travel in a diagonal direction, we can add the actions for going forwards and going sideways. This is a common strategy in designing quadrotor controllers, since we can nest them and additively compose actions using controllers designed for different roles (e.g. combining a hover controller with a tracking controller). Furthermore, quadrotors do not need to face "forwards" into the direction of flight, and can translate in any direction without having to yaw into the direction of travel. Nevertheless, this is a constraint we would like to impose, as it reflects the way in which human pilots fly. This breaks many of the symmetries inherent in quadrotor flight, but provides us with a more interesting learning problem.

## 4. Methodology

## 5. Implementations
For our environments, we first create a base class, and then inherit from it for all others:
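A minimal sketch of this pattern is given below, assuming a gym-style `reset`/`step` interface. The class names `BaseAeroEnv` and `HoverEnv`, the placeholder point-mass dynamics, and the reward are hypothetical and are not taken from the gym-aero code; they only illustrate how task-specific environments might override a small set of hooks on a shared base class.

```python
import math


class BaseAeroEnv:
    """Hypothetical base class: owns the shared state and the step loop."""

    def __init__(self, dt=0.05, max_steps=200):
        self.dt = dt
        self.max_steps = max_steps
        self.t = 0
        self.pos = [0.0, 0.0, 0.0]
        self.vel = [0.0, 0.0, 0.0]

    def reset(self):
        self.t = 0
        self.pos = [0.0, 0.0, 0.0]
        self.vel = [0.0, 0.0, 0.0]
        return self.observe()

    def step(self, action):
        # Placeholder dynamics: the action is treated as an acceleration.
        # This stands in for the full 6DOF model to keep the sketch runnable.
        self.vel = [v + a * self.dt for v, a in zip(self.vel, action)]
        self.pos = [p + v * self.dt for p, v in zip(self.pos, self.vel)]
        self.t += 1
        done = self.t >= self.max_steps or self.terminal()
        return self.observe(), self.reward(), done, {}

    # Task-specific hooks that subclasses override.
    def observe(self):
        return self.pos + self.vel

    def reward(self):
        raise NotImplementedError

    def terminal(self):
        return False


class HoverEnv(BaseAeroEnv):
    """Hypothetical task: stay close to a fixed goal position."""

    def __init__(self, goal=(0.0, 0.0, -1.0), **kwargs):
        super().__init__(**kwargs)
        self.goal = list(goal)

    def observe(self):
        # Goal-conditioned observation: append the position error.
        error = [g - p for g, p in zip(self.goal, self.pos)]
        return self.pos + self.vel + error

    def reward(self):
        return -math.dist(self.pos, self.goal)
```

With this structure, a new task only needs to define its own observation, reward and termination condition, while the integration loop and bookkeeping stay in one place.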

## 6. Experiments

## 7. Results